Designing for long term operations, upgrades, and rebalancing

When you're building a tool for a user there's a huge amount of design, decision making, and focus put on getting the user started. There's good reason for this. Users have an ocean full of tools at their disposal, and if you're building something for others to use you need to win the first five minute test.

You've got about a five minute window for a user to make useful headway in understanding your tool and visualizing how it will aid them in their work. Often this manifests in tools that demo well, but that then have to be left behind when the real work hits. I like to use text editors as an example of this. So many users start out with notepad, nano, or some other really light editor. In time, they learn more and the learning curve of something like VIM, Eclipse, or Emacs really pays off. There's a big gap in folks that make that leap though. If you want to get the most folks in the door it needs to have that hit the ground running feel to it.

When you're designing a tool to help users deploy and run software there's a lot of focus on the install process. Nearly every hosting provider has worked with some sort of "one button install" tool. They're great because it gets users started quickly in that five minute window. Over time though, users find that those one click tools end up being very shallow. They need to add users, create new databases, back up data, restart daemons when appropriate, the list goes on and on. Two operational tasks which are particularly interesting are upgrades and rebalancing.

Juju is an operations tool that can track the state for many different models, or state of operated applications,  potentially run by many different users. These models evolve over time and are running production workloads for years. We know of many models that are nearly a year and a half old. In that same time Juju has had five 1.2x minor versions and three 2.x minor versions. That means you’d want to keep up with improvements in performance, features, and security by upgrading nearly every other month. Aiding in this the Juju 2.1 release includes a new feature, known as model migrations, specifically to help operators managing their infrastructure over the long term.

The general idea is that the largest danger is in doing complex upgrade steps such as database migrations and making sure that everything running is able to communicate on the new version of the software. In Juju 2.1, the Juju infrastructure that manages the state of the models as well as API connections from clients (known as the “controller”) uses model migrations allowing an operator to bring up a new controller and migrate the model state over to a new version. In that process the data is shipped over, it’s gone through any migrations that need to occur, the data is sanity checked between old and new controller, and the system is put together in such a way that if anything fails to check out it can roll back and use the previously running controller. That's some reassurance to lean on when you're doing an important upgrade. Since one controller operates many models, the operator is in control of which models get migrated at what time and allows for a very controlled rollout to the new software in a way that permits safely checking that all remains in the green as the new version of the Juju software is adopted.

Another big use case for model migrations is to balance out the load on Juju controllers over time. As an organization we all grow and change in our needs and over time it's important to be able to move to new hardware, shift services that generate heavy load to dedicated hardware, etc. Juju is tested to perform at thousands of managed machines, however there are dependencies such as the size of the controller machines that track state and over time a normal part of a growing organization is to put into play machines with newer CPUs, more memory, or just flat out beefier hardware with more cores.

I'd love to hear about the tool you use and which ones have fallen short on aiding you in managing your long term operational needs and what other types of long term operations you want your tools to assist with. Hit me up on twitter @mitechie

In a follow up post I'll walk through exactly how to perform a migration to the new Juju 2.2 release using the built in model migration feature.