Emergency maintenance and why it's necessary

Posted By: cody

Last Updated: Thursday September 17, 2009

This is a short rant regarding emergency maintenance. Typically, unscheduled downtime (or very abruptly scheduled downtime) only happens because the issue is urgent for one reason or another. For the most part there are only three reasons why we may pull a machine offline abruptly:

Network maintenance (usually out of our control)
Security updates (a recently released exploit, fix, etc.)
Hardware issue

For the most part, the last one tends to be our Achilles' heel. As most of you know, we run a RAID-10 setup on all of our servers for redundancy in case of a drive failure. Normally a failed drive isn’t an issue: all of our drives are hot swappable, so we can simply replace the drive on the fly and have the array rebuild. It becomes an issue when the RAID card itself reports degraded drives across the board, or when a recently replaced drive still shows as degraded (or in some cases completely dead). This puts us in a pickle. Typically when this happens we have the RAID card firmware upgraded or replace the RAID card as a whole, both of which require about 10-15 minutes of downtime. We can either keep running with no redundancy while we post a proper maintenance schedule, or simply provide 30-60 minutes of notice and pull the machine offline for a very short period. We typically opt for the latter, since any data loss would balloon the issue into something catastrophic, and when dealing with faulty RAID cards we would much rather not play a game of Russian roulette with our customers’ data.
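For the technically curious, the decision described above can be sketched in a few lines of Python. This is only an illustration of the reasoning, not a tool we actually run: the names ArrayStatus, Action and choose_action are made up for this example, and "degraded across the board" is assumed to mean every drive in the array is reported as degraded.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "no action needed"
    HOT_SWAP = "replace the drive on the fly and let the array rebuild"
    EMERGENCY_MAINTENANCE = "give short notice, then take the machine offline to service the RAID card"


@dataclass
class ArrayStatus:
    # Hypothetical summary of what the RAID card reports.
    total_drives: int
    degraded_drives: int
    replaced_drive_still_degraded: bool = False


def choose_action(status: ArrayStatus) -> Action:
    # Symptoms that point at the RAID card rather than a single drive:
    # a freshly replaced drive that still shows as degraded, or every
    # drive in the array reported as degraded at once (our assumption
    # for "across the board").
    card_suspected = (
        status.replaced_drive_still_degraded
        or (status.total_drives > 0
            and status.degraded_drives == status.total_drives)
    )
    if card_suspected:
        return Action.EMERGENCY_MAINTENANCE
    if status.degraded_drives > 0:
        # An ordinary drive failure: hot swap, no downtime needed.
        return Action.HOT_SWAP
    return Action.NONE
```

For example, choose_action(ArrayStatus(total_drives=4, degraded_drives=1)) picks the hot swap, while the same array with replaced_drive_still_degraded=True ends up in emergency maintenance.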

Unfortunately, as a consequence of that belief, we occasionally have to pull a machine offline for a short period to ensure everything is running smoothly. Data loss is serious, and we believe in the age-old adage “An ounce of prevention is worth a pound of cure.” We’ve had a large number of drive issues here at HawkHost, up to several a month across various servers (the majority causing little or no downtime), and have had no data loss.

So if you find that a machine has been pulled offline abruptly, please take a moment to find out why. We’re very open about these things and willing to give you an explanation! Barring an issue at the data center, we never pull servers offline unless it’s absolutely necessary.