Last Friday, as many of you will no doubt have noticed, Crunch experienced a brief but not insignificant outage. Our technical team take incredible pride in our uptime and reliability, so we thought it would be prudent to detail exactly what went wrong, how we went about rectifying it, and how we’re going to avoid it happening again.

What happened?

The downtime began in the small hours of Friday morning. Rather than slow response times or intermittent service, this was a sudden, complete crash – the first time we’ve experienced such an immediate and pronounced outage.

Our technical team contacted our hosting provider and we quickly determined the outage was the result of a hardware failure. Our system runs on a series of servers which pass data back and forth between one another – the engineers on-site discovered the culprit was a faulty power supply on one of these intermediary servers.

Once we had identified the problem our technical team re-routed traffic to a temporary environment which replicated the downed server, allowing those users attempting to log in to gain access. They also worked closely with the on-site engineers to bring our server back online as quickly as possible.
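For the technically minded, the sketch below shows the general shape of that kind of failover in Python: a health check decides whether traffic should go to the primary app server or to a temporary stand-in. The hostnames, the /health endpoint and the timeout are illustrative assumptions, not our actual infrastructure.

```python
# Illustrative failover sketch (not our production setup): probe the primary
# app server's health endpoint and fall back to a temporary replica if the
# primary does not respond. Hostnames and the /health path are assumptions.
import urllib.request
import urllib.error

PRIMARY = "https://app-primary.example.com"
STANDBY = "https://app-standby.example.com"


def is_healthy(base_url: str, timeout: float = 3.0) -> bool:
    """Return True if the server answers its health check within the timeout."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def choose_backend() -> str:
    """Send traffic to the primary if it is up, otherwise to the standby."""
    return PRIMARY if is_healthy(PRIMARY) else STANDBY


if __name__ == "__main__":
    print("Routing traffic to:", choose_backend())
```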

Thankfully the problem turned out not to be as severe as we had anticipated, and a quick hardware fix enabled us to bring our full service back online. Once the user data that had been created on the temporary server was mirrored back to our central user data store, we restored our primary server and were back up and running at about 11am.

Shouldn’t redundancy mean this doesn’t happen?

Indeed it should. There are two elements to the Crunch system, each with a different level of redundancy. The most important is the user data – this is everything input into Crunch (invoices, addresses, expenses – everything). This is encrypted and backed up constantly in several locations, and nothing short of a nuclear holocaust could stop us being able to restore it. We have never lost any user data, nor was any put at risk during this outage.
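To give a flavour of the principle (rather than our actual backup pipeline), the Python sketch below encrypts a piece of data and writes identical copies to several independent locations, so losing any single copy loses nothing. The directories, the key handling and the choice of the cryptography library are assumptions made purely for the sake of the example.

```python
# Illustrative backup sketch (not our production pipeline): encrypt the data,
# then mirror the ciphertext to several independent locations. The paths and
# the key handling below are assumptions for the sake of the example.
from pathlib import Path

from cryptography.fernet import Fernet  # pip install cryptography

BACKUP_LOCATIONS = [
    Path("backups/site-a"),
    Path("backups/site-b"),
    Path("backups/site-c"),
]


def back_up(user_data: bytes, key: bytes) -> None:
    """Encrypt the user data and write a copy to every backup location."""
    ciphertext = Fernet(key).encrypt(user_data)
    for location in BACKUP_LOCATIONS:
        location.mkdir(parents=True, exist_ok=True)
        (location / "user-data.enc").write_bytes(ciphertext)


if __name__ == "__main__":
    key = Fernet.generate_key()  # in practice the key would live in a secure key store
    back_up(b"invoices, addresses, expenses...", key)
```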

The other element is the Crunch software, which is the back-end code and front-end user interface used to access the data. This software runs on the series of servers mentioned above, which we collectively refer to as the “app server”. Although the app server has enough capacity to handle heavy load and is constantly monitored to ensure tip-top performance, it currently does not have a redundant backup. This is why, when the power supply failed on one link in the app server chain, the entire system came crashing down.
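A rough back-of-the-envelope calculation shows why that chain arrangement is fragile. When servers sit in series, the whole system is only up when every one of them is up, so overall availability is the product of the individual figures; give each link a redundant standby and a single failure no longer takes everything down. The per-server availability and the number of servers in the sketch below are illustrative assumptions, not measured figures.

```python
# Illustrative availability arithmetic (the 99.9% per-server figure and the
# four-server chain are assumptions, not measured numbers).


def series_availability(per_server: float, servers: int) -> float:
    """Availability of a chain where every server must be up at once."""
    return per_server ** servers


def with_standby(per_server: float, servers: int) -> float:
    """Availability if each link in the chain has an independent standby."""
    per_link = 1 - (1 - per_server) ** 2  # a link fails only if both copies fail
    return per_link ** servers


if __name__ == "__main__":
    print(f"Chain of 4 servers, no redundancy:  {series_availability(0.999, 4):.4%}")
    print(f"Chain of 4 servers, with standbys:  {with_standby(0.999, 4):.4%}")
```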

What are you doing to stop this happening again?

Unfortunately this failure came at the worst possible time (as they tend to do), since we’re in the process of implementing a much more robust solution designed to prevent exactly this type of failure. This new infrastructure includes beefed-up hardware to ensure performance as we continue to grow, as well as several redundant backups for our app server.

In response to last week’s outage we’re accelerating the rollout of our app server upgrades and backups (the full details of which we’ll explain in a separate blog post closer to launch). We’re also expediting an already-underway review of our server infrastructure, with a view to increasing reliability and improving our worst-case-scenario disaster recovery plan.

In a perfect world Crunch would have been structured this way from the outset, but the sheer cost of banking-grade server infrastructure and security prevented us from implementing such a solution until recently. We have always been aware of this weakness, and began moving to correct it as soon as it became possible.

While we work towards our new solution we’re launching an interim backup service in the next few days, which will provide a mirror of our live service at another data centre. Should we suffer another hardware failure, a quick redirection will mean we can continue as normal while the issue is fixed.

We’re also going to be working with our hosting provider to make sure their hardware maintenance policy is up to scratch. As we saw with the recent Blackberry outage, even the largest networks can be knocked out by a single hardware failure.

We would never be so foolish as to promise our service will never go down – that, unfortunately, is the nature of computer hardware and software: sometimes they break. But that doesn’t mean we won’t work tirelessly to ensure our downtime is kept to an absolute minimum.

We have been quite fastidious (some would even say pedantic) in our quest to build continuity of service into Crunch. Our telephony system is one example – we have a secondary telephony system on “hot standby” should our primary PBX fail, and we have baked in the ability to redirect phone numbers at a moment’s notice. Should the very worst happen – for example a fire – it would simply be a question of redirecting our office lines to our mobiles and firing up our laptops.

We’d like to thank our clients for their patience during our (thankfully relatively short) outage, as well as apologise, cap-in-hand, to those who were inconvenienced – and assure everybody that we’re working as hard as we can to make sure our reliability is second-to-none.

If you have any questions about our service, as always, you can contact your account manager, ping us on Twitter, or leave a comment on this post.