Operations in Production

As we set up our important infrastructure we add monitoring and alerting and keep a close eye on it all. As the services we provide grow and need to expand, or hardware failures attempt to wreak havoc, we’re ready because of the due diligence that’s gone into monitoring the infrastructure and the applications deployed on it. We often bake this monitoring into our deployment and configuration management tooling. One thing I often see is that folks forget to monitor the tools that coordinate all of that deployment and configuration management. It’s a bit of a case of “who watches the watcher?”

While building and operating JAAS (Juju as a Service) we’ve had to make sure that we follow the same best practices for our Juju hosting infrastructure that you’d use for the production systems you run in JAAS. This means that our Juju Controllers (the state and event tracking back end) need to be watched so that we’re ready to prevent the issues we can see coming and know as soon as possible about the ones we cannot.

Fortunately, there are some great established tools for doing this. We use Prometheus, Grafana, and Telegraf to keep our JAAS infrastructure running smoothly. Better still, there are Charms available for each of them that let us quickly and easily deploy, configure, and operate the monitoring stack. This means you can replicate the operational knowledge we’ve gained by reusing the exact same Charms we use in production.

NOTE: for this setup we’re going to use extra hardware to monitor our production Juju Controllers. This is best practice because the measuring itself then adds no load to the thing being measured. It also means that when our Controllers run in HA we can wire each of them up to a single set of monitoring endpoints. If you’re hardware constrained or have softer production requirements, you can replicate this setup at greater density by placing these applications on the Controller nodes themselves.

Getting our Controller set up

Let’s walk through an example setup. First, we’ll create a new controller on GCE where we’re going to run our production applications. Note that this will work for any Controller on any cloud you choose to run Juju on.

$ juju bootstrap google production
...
Bootstrap complete, "production" controller now available.
Controller machines are in the "controller" model.
Initial model "default" added.

One interesting thing in the output at the end is that the “Controller machines” are in the controller model. Those are the machines we want to watch, so let’s switch to that model and work from there.

$ juju switch controller
production:admin/default -> production:admin/controller
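
To see what we’ll be monitoring, we can list the machines in this model. On a fresh bootstrap there’s just machine 0, the Controller itself (an optional sanity check):

$ juju machines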

First, we need to get the applications for our monitoring stack into the model. Let’s deploy them.

$ juju deploy cs:~prometheus-charmers/prometheus 
$ juju deploy cs:telegraf
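
Deploys run asynchronously, so it’s worth checking that both applications have come up before wiring them together (an optional check; re-run it until the units have settled):

$ juju status prometheus telegraf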

Telegraf needs to be told to work together with Prometheus and send metrics to it. Let’s wire that up.

$ juju relate telegraf:prometheus-client prometheus:target

Telegraf watches system metrics such as RAM, disk, and CPU, and lets Prometheus scrape that data and track it over time. We need to get Telegraf onto each Controller we want to watch. Telegraf is what we call a subordinate charm: it’s intended to sit alongside existing things that are running and watch them. We’ll “fake” a running charm on our Controller by deploying the Ubuntu charm to it using the “--to 0” option. Once Ubuntu is deployed we can relate it to Telegraf, which will set up the data output, and then we’ll sanity check that things are wired up properly.

$ juju deploy --to 0 cs:ubuntu
$ juju relate telegraf:juju-info ubuntu:juju-info
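
Since Telegraf is a subordinate, it won’t get a machine of its own; it should show up in status output nested under the ubuntu unit on machine 0 (optional check):

$ juju status ubuntu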

We can check that Telegraf is up and monitoring our Controller machine by visiting the URL it exposes the data on. Note that since we’re checking from our browser, we’ll need to temporarily expose the application and allow firewall access to the URL.

$ juju expose telegraf
$ chrome http://104.196.65.112:9103/metrics
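
If you’d rather stay on the command line, curl works just as well for this check. The IP address here is from my demo, so substitute your own Telegraf unit’s address:

$ curl -s http://104.196.65.112:9103/metrics | head -n 20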

That looks good. Now let’s secure that firewall port again.

$ juju unexpose telegraf

Setting up Prometheus

With Telegraf reporting to Prometheus, let’s check that Prometheus sees the data. Again, we’ll expose it so we can view it from our local browser.

$ juju expose prometheus
$ chrome http://104.196.46.127:9090/
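
Prometheus also has an HTTP API we can poke at from the command line. The up metric should report 1 for each healthy scrape target (again, substitute your Prometheus unit’s address):

$ curl -s 'http://104.196.46.127:9090/api/v1/query?query=up'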

From here we could start to build graphs of our data, but let’s go to the “Status→Targets” menu and make sure we see our Telegraf watching our Controller. 

Adding Juju-specific metrics

Telegraf is great for exporting metrics about system load and the like. However, we have also baked metrics into Juju itself that we can enable. To do so, we’ll need to set up a user that Prometheus can use to pull metrics from our Juju Controller.

# Note that you don't need to register with this 'bot' user
$ juju add-user prometheus
$ juju change-user-password prometheus
$ juju grant prometheus read controller
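
Before moving on, we can confirm the user exists and holds read access on the controller (optional check):

$ juju show-user prometheus
$ juju users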

For the moment, you need to pass Prometheus a config snippet that adds a new target for the Juju metrics. I’ve got a sample config file here that you can send with the config command below. Note that you need to add the IP address of the Controller to the configuration. I’ve filled in this file with my Controller’s IP from this demo, and I’ve set the password to match what I used in the commands above to set up the prometheus bot user.

Sample config

- job_name: juju
  metrics_path: /introspection/metrics
  scheme: https
  static_configs:
    - targets: ['104.196.65.112:17070']
  basic_auth:
    username: user-prometheus
    password: monitor_all_the_things
  tls_config:
    insecure_skip_verify: true

$ juju config prometheus scrape-jobs=@controller-prom.yaml
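
To confirm the charm picked up the new scrape job, you can read the setting back out; it should print the YAML you just sent:

$ juju config prometheus scrape-jobs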

Visualizing that data

With our data flowing in, it’s time to watch it. Let’s set up Grafana to talk to our Prometheus.

$ juju deploy cs:~prometheus-charmers/grafana
$ juju config grafana admin_password=monitor_all_the_things
$ juju add-relation prometheus:grafana-source grafana:grafana-source
$ juju expose grafana
$ juju status grafana

We need to configure Grafana through the web UI from here, so the last line above prints the IP address and port where it’s available. We log in with the username admin and the password we used in the config line above.

$ chrome http://35.185.96.89:3000
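
If you just want the address without reading the whole status table, you can pull it out of the YAML status output (a small convenience; the field name assumes the standard status format):

$ juju status grafana --format=yaml | grep public-address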

From here, we’ll need to add our dashboard. I’ve got a dashboard you can load to get started with. You can get it from here and import it using the “Dashboards → Import” menu item in Grafana. You’ll need to tell it to wire up to the juju-controller data source we set up above, and you should get something that looks like this:

From here we can see how many models we’re running on our production infrastructure and how many machines those models are taking up. We also get the Telegraf data for the Controller itself: how it’s doing on memory, CPU, load, and disk space.
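
If you want to build panels of your own, the queries behind graphs like these are plain PromQL typed into Grafana’s panel editor. The two expressions below use the standard metric names Telegraf’s Prometheus output produces; Juju’s own metric names vary by version, so browse for those in the Prometheus expression browser rather than taking any from here:

# one-minute system load on the Controller machine
system_load1
# percentage of memory in use
mem_used_percent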

Where to go from here

From here we’d obviously look to add more Controllers to reach HA. We’d set up alerting, because what good is measuring if we don’t get a giant hint that things are heading out of whack? We might add additional dashboards to monitor other vital infrastructure. What’s good to know is that the teams running JAAS are doing all of this for you. They’ve put together the practices to make sure that you can rely on the services provided and that they’re ahead of any potential problems.

https://twitter.com/mitechie