Of all the things I wanted to do yesterday, testing TTLLP’s disaster recovery plan was not one of them. I often suggest customers make disaster recovery plans, but I don’t like having to use them.
One of our server farms suffered a serious hardware failure and slowly died during the day. (It didn’t catch fire – the title is a silly reference to lp0 on fire.) The current stinging electricity bills mean that there’s an emphasis on power-saving now that wasn’t there 4 or so years ago, when we still built servers ourselves, before WEEE and cheap big-brand systems. Today’s servers aren’t quite as resilient, because each extra piece of hardware in use takes extra power. There are some much cleverer tactics now, but they don’t catch everything before it goes “bang” and we can’t just bring the “spare” hardware into action if it’s not installed yet.
Replacement hardware will be installed as soon as possible, as well as some extra hardware that will help protect against a similar failure in future. The slow failure of the server also means that we lost some emails, voicemails and instant messages from 10am to 2pm yesterday. If one was yours, please resend.
Most services have been rerouted or temporarily replaced with help from friendly associates. Three important services that haven’t yet been restored are: 1. our financial system (which gets updated in batches anyway), 2. our task tracking (which is newer than our recovery plan, so I need reminding about its backups, but at worst all the current data exists in other places) and 3. our newsletter server. I’ve copied the customer newsletter subscriptions to an older listserver, to send out news of our recovery, but the member newsletters (with GPG integration and so on) will wait for the replacement hardware. I’m hoping we can recover some things from the old server filesystem; otherwise I’ll recreate it from the last release (that’ll teach me to hack it on the live server!). In the meantime, I’ve still most of yesterday’s work to do…