Emergency maintenance and why it’s necessary

This is a short rant regarding emergency maintenance. Typically, unscheduled downtime (or very abruptly scheduled downtime) happens only because the issue is urgent for one reason or another. For the most part there are only three reasons why we may pull a machine offline abruptly:

  • Network maintenance (usually out of our control)
  • Security updates (a recently released exploit, fix, etc.)
  • Hardware issue

For the most part, the last one tends to be our Achilles’ heel. As most of you know, we run a RAID-10 setup on all of our servers for redundancy in case of a drive failure. Normally a failed drive isn’t an issue: all of our drives are hot swappable, so we can simply replace the drive on the fly and let the array rebuild. It becomes an issue when the RAID card itself reports degraded drives across the board, or when a recently replaced drive still shows as degraded (or in some cases completely dead). This puts us in a pickle. Typically when this happens we have the RAID card firmware upgraded or replace the RAID card entirely, both of which require about 10-15 minutes of downtime. We can either keep running with no redundancy while we post a proper maintenance schedule, or give 30-60 minutes’ notice and pull the machine offline for a very short period. We typically opt for the latter, since any data loss would balloon the issue into something catastrophic, and when dealing with faulty RAID cards we would much rather not play Russian roulette with our customers’ data.
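To illustrate why running a degraded array is a gamble, here is a minimal sketch of RAID-10 survivability. It assumes a hypothetical four-drive array made of two mirrored pairs; the drive names and layout are made up for illustration and are not a description of our actual servers. The array keeps its data as long as every mirrored pair still has at least one healthy drive.

```python
from itertools import combinations

# Hypothetical 4-drive RAID-10 layout (an assumption for illustration):
# two mirrored pairs, with data striped across the pairs.
MIRROR_PAIRS = [("disk0", "disk1"), ("disk2", "disk3")]


def array_survives(failed, mirror_pairs=MIRROR_PAIRS):
    """Return True if every mirrored pair still has at least one healthy drive."""
    failed = set(failed)
    return all(any(drive not in failed for drive in pair) for pair in mirror_pairs)


def surviving_fraction(n_failures, mirror_pairs=MIRROR_PAIRS):
    """Fraction of all n-drive failure combinations that the array survives."""
    drives = [drive for pair in mirror_pairs for drive in pair]
    combos = list(combinations(drives, n_failures))
    survived = sum(array_survives(combo, mirror_pairs) for combo in combos)
    return survived / len(combos)


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} failed drive(s): survives {surviving_fraction(n):.0%} of cases")
```

In this toy model any single drive failure is survivable, but a second failure is fatal whenever it lands on the partner of the drive that is already down. That is the situation we are in when a card reports degraded drives and we cannot trust the rebuild.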

Unfortunately, as a consequence of our beliefs, we occasionally have to pull a machine offline for a short period to ensure everything is running smoothly. Data loss is serious, and we believe in the age-old adage “An ounce of prevention is worth a pound of cure.” We’ve had a large number of drive issues here at HawkHost, up to several a month across various servers (the majority cause little or no downtime), and have had no data loss.

So if you find that a machine has been pulled offline abruptly, please take a moment to find out why. We’re very verbose and willing to give you an explanation! Barring an issue at the data center, we never pull servers offline unless it’s absolutely necessary.

This entry was posted in General.

4 Responses to Emergency maintenance and why it’s necessary

  1. Chris says:

    I want to know: how does RAID work? Like, if a drive fails or the RAID card stuffs up, how do you not lose data?

  2. Tony says:

    We run RAID-10, which is described here: http://www.acnc.com/04_01_10.html

    It essentially means every set of data is mirrored once, so having one drive fail does not mean data will be lost. The RAID card removes the drive from its array of disks and flags it as bad and in need of replacement. When a new drive is inserted, the card copies the data over to it.

    As for the RAID card, if it fails the data is still there on the disks, so it’s just a matter of replacing the card.

  3. Roger says:

    It’s not just you guys; conventional hard drive mechanisms are getting less reliable every day. I work at a PC OEM business, and I can tell you that hard drive failure has become one of the most common reasons for RMA claims nowadays.

    I think it’s time for IT businesses to start looking into SSD technology.

    SSD drives are now quite reliable and not that much more expensive than business-class hard drives.

  4. Cody says:

    @Roger

    We plan on eventually using SSDs for more of the IO-bound things such as mail and MySQL, though we have to wait a bit longer until the technology matures and prices drop enough to justify it. It will indeed be a glorious day once we can use them 🙂
