At Crunch, we’re creating a computing platform for our services, and we’re using AWS Auto Scaling groups to ensure they’re always available.

Auto Scaling groups

Auto Scaling groups have some really neat features, including the ability to perform rolling upgrades. This means that we can update our services without stopping them.

However, just occasionally, something can go wrong.

Sometimes a new instance just fails to start cleanly. That’s OK, it happens. And AWS is smart enough to fix it by rolling back to the old instance configuration. But this introduces a new problem: there’s no real way to know the order in which AWS will stop the service instances.

Say we have a cluster of three instances of a service, and we need to keep two of them running all the time to maintain a quorum. We then start a rolling upgrade, and the new instance fails to start, which is OK as we still have two running and the cluster is still available.

But AWS can terminate *any* of the three instances. If one of the remaining two ‘good’ instances is stopped, we no longer have a quorum and the service is no longer available.
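To make the arithmetic concrete, here is a minimal Python sketch of that scenario. The instance names, the `QUORUM` constant and the `survives_termination` helper are all made up for illustration; none of them come from AWS or from our own tooling.

```python
# Hypothetical cluster state after a failed rolling upgrade:
# two old instances are healthy, the new one never started cleanly.
instances = {"old-a": True, "old-b": True, "new-c": False}

QUORUM = 2  # assumed minimum number of healthy instances


def survives_termination(instances, victim, quorum):
    """Return True if the cluster keeps its quorum after `victim` is terminated."""
    healthy_left = sum(
        healthy for name, healthy in instances.items() if name != victim
    )
    return healthy_left >= quorum


# Check every instance that could be picked for termination.
outcomes = {
    victim: survives_termination(instances, victim, QUORUM)
    for victim in instances
}
print(outcomes)
```

Only one of the three possible choices, the failed instance, leaves the cluster with a quorum. Terminating either healthy instance drops it to a single working member.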

Our solution

We decided to fix this. We’ve written our own software to make sure that we keep a working cluster during a rolling upgrade, even in the unlikely event that something goes wrong.

Our “asg_rolling_upgrade” software iterates over the instances in the Auto Scaling group, stopping each instance with the old configuration and replacing it with a new instance with the new configuration. But we control the order in which the instances are stopped: our code works from the oldest running instance to the newest.

If an instance doesn’t start cleanly or fails to join the cluster, the upgrade stops and one of our SysAdmins can investigate. Meanwhile, we still have a safe working cluster.
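The approach described above can be sketched roughly as follows. This is not the actual asg_rolling_upgrade code and it makes no AWS calls: the `Instance` class, the `rolling_upgrade` function and the `launch`, `terminate` and `is_healthy` callables are hypothetical stand-ins, so the example runs as a simulation. Here `is_healthy` stands for whatever check confirms that a new instance started cleanly and joined the cluster.

```python
from dataclasses import dataclass
from itertools import count
from typing import Callable, List, Tuple


@dataclass
class Instance:
    instance_id: str
    launch_time: int  # larger means launched more recently
    config: str       # name of the configuration the instance runs


def rolling_upgrade(
    instances: List[Instance],
    new_config: str,
    launch: Callable[[str], Instance],
    terminate: Callable[[Instance], None],
    is_healthy: Callable[[Instance], bool],
) -> Tuple[List[Instance], bool]:
    """Replace old-configuration instances one at a time, oldest first.

    Returns the resulting instances and True if the upgrade completed,
    or False if it halted because a replacement was unhealthy.
    """
    current = list(instances)
    outdated = sorted(
        (i for i in instances if i.config != new_config),
        key=lambda i: i.launch_time,
    )
    for victim in outdated:
        terminate(victim)
        current.remove(victim)
        replacement = launch(new_config)
        current.append(replacement)
        if not is_healthy(replacement):
            # Halt here and leave the remaining old instances untouched.
            return current, False
    return current, True


# Simulated run: three old instances, every replacement comes up healthy.
terminated: List[str] = []
ids = count(1)


def fake_launch(config: str) -> Instance:
    n = next(ids)
    return Instance(f"new-{n}", 100 + n, config)


cluster = [
    Instance("i-b", 20, "v1"),
    Instance("i-a", 10, "v1"),
    Instance("i-c", 30, "v1"),
]
result, completed = rolling_upgrade(
    cluster,
    "v2",
    fake_launch,
    terminate=lambda inst: terminated.append(inst.instance_id),
    is_healthy=lambda inst: True,
)
print(terminated, completed)
```

In the simulated run, the instances are terminated in launch order (`i-a`, then `i-b`, then `i-c`), regardless of their order in the list. If `is_healthy` returned `False` for the first replacement, the function would return straight away with the two remaining old instances still running, which is the situation a SysAdmin would then investigate.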

Give it a whirl

We’re feeling pretty pleased with ourselves over this, so we decided to share what we’ve done. If you want to take a closer look, our asg_rolling_upgrade is available on GitHub.

Feel free to give it a try for yourself!

Trevor Marshall is Platform Lead at Crunch. His software experience ranges from Java development to real-time simulation systems. Away from the computer, Trevor is a Morris dancer and has just acquired a classic campervan.
