I love backups. In fact, one could say I’m a little obsessed with backups. At HostNexus we back up shared/reseller servers to big beefy backup servers across a private network at the data center. Back in the old days we used disk-to-disk backups, and this is still the default backup method for managed dedicated servers. But as shared/reseller servers fill up and get busy, the amount of data can get quite large, and disk-to-disk backups start to take a lot of time and cause very high loads on the server. Enter the remote backup system.
Last week Theta had a bit of a meltdown. Theta is one of our older servers and has RAID 1, compared to the Dual Quad Core RAID 10 servers we currently deploy. RAID 1 is two drives mirrored, and if one fails you simply take the server down, switch out the failed drive, boot the server back up and rebuild the array. Pretty simple stuff, and 9 times out of 10 it does go this way (yep, only 9/10). Theta got hit by that 1 out of 10 chance last week. It all started with a simple and standard RAID notification:
20090805200803 – Controller 0
ERROR – (0x0F:0x0002): Unit degraded: Unit #0
I scheduled a maintenance window 24 hours later. The server was taken down and the drive swapped out, but when the server came up the RAID controller didn’t recognise any drives. A faulty controller was suspected, so it was swapped out too, but the result was the same. No drives, nada, zip, zilch. Hmmm, okay, so back in went the original drive. This time the RAID card did show one drive and the server booted fully. Then the original drive started to make very nasty noises and the server crashed. The RAID array had died catastrophically and all data had been kissed good-bye.
It’s the server admin’s nightmare. A system designed to protect against 100% data loss has failed. But all is not lost: Bacula has got your back. Bacula is open source network backup software and we’ve been using it since January this year (2009). We found our old system was somewhat unreliable, but Bacula has performed really well for us. This was our first real bare-metal restore (meaning restoring a server in its entirety). Theta was rebuilt and reloaded with a new OS, and we started the restore. It’s not super-fast but it is solid. It took about 6 hours to restore the whole server, which is not bad across a network.
All in all it was a good experience. Clients did suffer extra downtime but everyone was VERY supportive. After a severe server meltdown with 100% data loss we were able to get everything back up in about 8 hours, and our backup system passed a hardcore test with flying colours. And if you’re on any of our cheap web hosting plans, you are covered by Bacula.