As clients may have noticed, the Crunch app has had one or two hiccups since Friday 28th February, and delivered less than the service that you expect from us. We think we have now resolved the issues that were causing these problems, and we’re confident that the service is back to its best. But we thought we should explain what happened, what we’ve done to fix it, and what we’ll be doing to reduce the chance of it happening again.
Fridays are always very busy on the Crunch app and Friday 28th, being the last day of the month, was exceptionally busy. We had a very large number of users logged in (35% of our entire client base, in fact – our normal load is less than half that) and there were also payroll and End of Year processes running. You can see how pronounced the spike on Friday was in this graph:
The ever-growing number of Crunch clients means that these load spikes increase every month, and due to the perfect storm of events last Friday the load was off the scale – and I’m afraid we failed to anticipate it. We were monitoring the app throughout the day and could see that it was slow, but we thought it best not to interrupt service in order to increase the server horsepower. In hindsight we probably should have been more proactive.
Why was the app unstable last week?
As the old saying goes, it never rains but it pours. Friday’s issues had repercussions in other areas of our system which we were not able to rectify until last Tuesday. We were also in the middle of a lot of planned upgrade work, the impact of which has been exacerbated by these issues.
Many of you will have received an RTI Full Payment Submissions email from us last week – Wednesday evening’s maintenance period was to deploy these changes. We had two other brief maintenance periods on Thursday to deploy more fixes and to set some 2014/15 tax thresholds.
We know that three outages for maintenance within 24 hours is unacceptable, but the work was necessary to get the app back to normal and make sure our RTI system is ready to issue its first Full Payment Submissions.
What are you doing about it?
We’re taking a phased approach to solving the server load issue:
1) We have already significantly boosted the computing power available to our application and database servers (all of which run on Amazon’s Elastic Compute Cloud service, EC2, which makes such upgrades very straightforward – more details on that here). Hopefully you have already noticed that the app is more responsive.
2) As well as increasing capacity on our existing servers, we are investigating adding brand new servers to further alleviate localised problems. We will continue to monitor load and add more capacity as and when it is needed.
3) This process has helped us identify a few elements of our software which could be more efficient, and we will roll out improvements to these areas over the next few weeks.
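For the technically minded, the "monitor load and add capacity when needed" idea in point 2 can be pictured as a simple threshold check. This is only an illustrative sketch, not the actual Crunch monitoring code: the function name `needs_more_capacity`, the 17% baseline and the 1.5x headroom factor are all assumptions made up for the example. Only the 35% figure comes from Friday's peak described above.

```python
def needs_more_capacity(logged_in_fraction, baseline_fraction=0.17, headroom=1.5):
    """Return True when the share of clients logged in exceeds baseline * headroom.

    baseline_fraction and headroom are hypothetical values chosen for
    illustration; a real system would derive them from historical load.
    """
    if not 0.0 <= logged_in_fraction <= 1.0:
        raise ValueError("logged_in_fraction must be between 0 and 1")
    return logged_in_fraction > baseline_fraction * headroom


if __name__ == "__main__":
    # Friday's peak: 35% of the client base logged in at once.
    print(needs_more_capacity(0.35))
    # A quieter day, just under the assumed baseline.
    print(needs_more_capacity(0.15))
```

With these assumed numbers the alert level sits at roughly 25% of clients logged in, so a peak like Friday's would be flagged well before it reached 35%. A real check would also look at server-side measures such as CPU and database load, not just logins.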
Is everything back to normal?
Yes – in fact with our upgraded servers things should be better than ever! On Wednesday we plan to roll out the first of our software efficiency improvements, which will make the app faster still. In the meantime, you can be sure we are monitoring everything very carefully. As a result of last week, we have also put much better monitoring tools in place.
We cannot apologise enough for this period of instability, and you can rest assured we are doing everything in our power to keep Crunch running smoothly over the coming weeks, months and years. Our technical team did a fantastic job dealing with these issues, and all told we only experienced 4 hours and 20 minutes of downtime over the affected period, most of which was late at night.
We’re also in the process of almost doubling the size of our development team, which will enable us to roll out these updates with increasing frequency (incidentally, if you know any developers who might be interested, please send them our way).
If you have any questions about what happened or the fixes we deployed, do drop a comment below. Feel free to get as technical as you like!
Photo by Craig Sunter