Giving everyone a little data science, Juju and Zeppelin

Everyone has some data sitting somewhere that they're not putting to good use. They've got some script pulling web logs, or perhaps a database feeding some customer or employee data. We at Canonical also have data sitting around, and lately I was poking at some of it and figured there has to be a better way to enable us to collaborate and pull meaning out of that data sitting there. 

Fortunately we've got tools to help us do just that. I spoke to a member of our big data team and he says to me "Rick, what you want is to hook Zeppelin up to that data." He was right!

What is Zeppelin? Well it's a big data query and visualization tool that's part of the Apache project. Normally folks use it to sit in front of streaming back ends like Spark. It also supports SQL so just about anyone with a MySQL, Postgresql, or SQLite database can stick this in front of it and build out dashboards of useful info. The best part is that we can give others access and different parts of the company can draw their own insights from a shared collection of data. 

We can show this off really quickly using Juju and JAAS. To help demonstrate, I'll use the famous Northwind data set. Let's stick the data in Postgresql so that we can query it from our new fancy Zeppelin dashboard tool. First goal, deploy Postgresql and Zeppelin with Juju Charms that are available in the store. 

$ juju register jimm.jujucharms.com
$ juju add-model zeppelin-demo google
$ juju deploy postgresql
$ juju config postgresql admin_addresses="127.0.0.1"
$ juju deploy zeppelin --to 0
$ juju expose zeppelin

What we've got here is setting up Postgres on GCE along with our Zeppelin we'll wire up to the database in order to put our data in there. If there was already data available then it's not required here. Also note that we could put it in MySQL or we could copy a sqlite database up to the Zeppelin instance and wire it up to query that file instead. 

From here, let's clone the Northwind database and copy the files over to be able to load the data into Postgresql. We can then run the `create_db.sh` script to create the database, load it with data, and setup our user we'll use to connect.

git clone git@github.com:pthom/northwind_psql.git && cd northwind_psql
juju scp -v create_db.sh postgresql/0:/home/ubuntu/
juju scp -v northwind.sql postgresql/0:/home/ubuntu/
juju run --unit postgresql/0 -- \
    "sudo su postgres -c 'cd /home/ubuntu && sh create_db.sh'"

Now we can look up where our Zeppelin is sitting with `juju status zeppelin` and browse to the `Public address`.  Note that the ports exposed are also listed. In my case it's http://104.196.195.224:9080

Once there we get a nice little welcome screen. From here we need to wire Zeppelin up to our database. Zeppelin does this via `Interpreters` that need to be configured. Click on the navigation menu for that and we'll click the "+ Create" button to setup our SQL interpreter. 

The postgresql database, user, and password are all setup by the Northwind script we ran so we copy those values over. 

Now that the interpreter is setup we can create a new notebook and test out a sample query. Head to the "Notebook" and select "+ Create new note". We'll call it "HR Dashboard". What would an HR person be interested in? Let's see who has a birthday coming up. In the textbox we need to tell Zeppelin to use our `psql` interpreter. 

%psql

CREATE OR REPLACE FUNCTION indexable_month_day(date) RETURNS TEXT as $BODY$
  SELECT to_char($1, 'MM-DD');
$BODY$ language 'sql' IMMUTABLE STRICT;


SELECT lastname, firstname, title, birthdate 
FROM employees 
WHERE 
    indexable_month_day(birthdate) >= to_char(current_date, 'MM-DD')
AND  
    indexable_month_day(birthdate) < to_char(
        current_date + interval '120 days', 'MM-DD');

There's a slight trick here in that we need to add a function and then we need to run a query. Zeppelin will only show the result of the first thing run, so we first run the function, and then move it under the query and run it again. We can execute a query with the play button in the upper right or use shift-enter.

Look at that, Michael and Robert have birthdays coming up in the coming months. You can easily see how you might create a sales dashboard that would tie to the same data, but instead look at things such as our most popular products, best customers, or items that are on sale currently. 

Dashboards can be exported and imported. If you run a conference you might run a view of signups over time and track things like the number of sponsors that have committed. Then after the conference back up your dashboards and tear down your Zeppelin until the next conference goes live. 

Try to find your data hanging around and make it more productive by putting a dashboard and web ui in front of it. You'll be surprised by the ways that others put it to work. Thanks Zeppelin and Juju for making this easy to do. 

References:

Setting up Zeppelin and MySQL