The day started out like every other one, with me waking up and walking to the computer while half asleep. I used to shower and such like I was going to an office, but after a while I figured it would be best to just walk to the home office in my boxers and see what had happened while I was gone. I checked our support system: just a few tickets, nothing that required my attention. I checked my email and there was not a whole lot of mail, just a bunch of orders and the usual log of all the credit card charges from our batch run. It was looking like a great day: not too busy, some orders to check out, but other than that, great.
I always make sure to SSH into the machines, a habit of mine from back when we had only a few. Thankfully with SecureCRT I just select a few folders and I’m logged into all the machines. After doing my usual checks of everything else, I heard a beep come from SecureCRT, and the first thing that came to mind was, oh no, something must have broken. I started checking the servers and got to Pluto, the machine that has never really had anything happen to it. It has had hard drives fail and be replaced without any downtime, and it has never had any other issues with its hardware. About the only problem it ever had was when a UPS blew in the server room it was in, knocking it offline along with some of our other machines. The error was about the memory, and it claimed to be a fatal error of some kind. I did not like the sound of it, yet the machine was still functioning as it should, so that meant I needed to Google it and use the resources available to me.
The first thing I did was ask Cody what the heck the message meant. While I was waiting for him to respond, I loaded up Google and searched, and I found out that it usually suggests the memory is on its way out. Cody finally messaged me back: make a ticket with SoftLayer, they’ll know what the heck that means. Just as Google suggested, SoftLayer quickly determined that this almost always means the memory in the machine needs to be replaced to avoid a major headache later on. This meant a maintenance window needed to be scheduled for the server. As I mentioned earlier, I was in my half-asleep state, so I pasted Cody our options and he said 1-4 is good, we should do it then. Being the half-asleep person that I was, I updated our ticket with that time. In the past we’d do it ASAP or schedule something outside of the suggested window times, as they tend to be bad for us. A few hours later, after I had already posted notice on our forums, I had an oh-crap moment: did I really just schedule a window for 1:00 AM CDT to 4:00 AM CDT? At this point I was awake, which is how I noticed, and I realized there was no way out of this because we had already informed everyone necessary. Since I was the fool who scheduled it, I’d be the sucker stuck with a Friday night / Saturday morning maintenance window, because someone had to be there in case something bad happened.
My day went on as normal: doing some work, then calling it a day. I had pizza for dinner and watched some Family Guy, then Two and a Half Men, the usual suspects while I eat, and then tried to decide what I was going to do that night. I realized there wasn’t a whole lot I could do, but the maintenance window was hours away, so I decided to watch a movie to at least pass some time. The movie finished and I realized I was still a long way from the maintenance window starting, and I thought to myself, what in the world am I going to do? I went back to watching TV and ended up watching That ’70s Show and South Park, two shows I never really watch, except they were the only decent shows on. The window was still a ways away, so I pulled out the Xbox 360 and figured that since the window started soon, I’d better not play any game I wouldn’t want to give up on after a bit so I could get on the computer. I played a few games of Geometry Wars to pass the time, and finally it was 1:00 AM CDT, which meant the window might start.
Unfortunately for me the window was 3 hours long, so it had not actually started yet, as the machine was not offline. A few days earlier Cody had linked me to a post on The Daily WTF that he thought was funny. I had only really read posts on the site back when I was working in an office and got bored from time to time. I spent about an hour and a half reading all the posts on the site from the past three months, that’s how bored I was. I also made sure to post classics like “The Source Control Shingle” on our Twitter while I was bored. I also made sure to send Cody an email about how I hate maintenance windows, especially ones this late, and of course linked him to a post on The Daily WTF he might like, “NPR Is Reading My Email, Just Fix It!, & More Support Stories”, mentioning the “just fix it” one, which reminded me of some of the support tickets we get from time to time. After the hour and a half I got notice that the Pluto server was finally going offline, and I thought to myself, finally! I jumped on live chat because I was still bored and figured I’d talk to the customers who come on complaining about the server being offline, as someone always does. Usually when I try to man the live chat there is someone who comes on and sits there for several hours asking questions, or frankly being an annoyance to me, asking how their web site looks, or what I’m up to, or something totally out of the scope of anything we do. To my surprise this was not the case; maybe it had something to do with the fact that it was now 2:30 AM CDT and everyone was sleeping. About the only customers who would be complaining or coming on would be from Asia. Sure enough there were a few, with their broken English and whatnot. They asked what was going on, I linked them to the post, and they posted some more rambling English about how Hawk Host rocks, or at least I hope that’s what they meant.
As the window went on I went back to reading and doing basically nothing, as I must have already talked to anyone who cared about the server being down at 2:30 AM CDT. I watched “Jon Stewart Grills ‘Death Panel’ Originator” (Canadian link), which provided me a few minutes of amusement. A few tickets and chats later the machine finally came back up, and everyone was happy the server was back online. The best part: there were no longer memory error messages being printed to the console, so it was a victory. So in the end I waited around for nothing bad to happen, but we always have someone there in case something bad does happen.
What I’ve learned about maintenance windows is: don’t be a fool and do a late-night one when you have nothing to do at such a late hour. If I were a midnight shift guy I might have found something to do, but I’m not. So I basically had nothing planned, waited around, and then waited some more for the window to complete. In the future I’ll make sure to have someone who is actually around during the late hours be the guy who sits around waiting for nothing bad to happen. Or at least I’ll schedule a more reasonable time, one that is late but not so late that it’s agony waiting for it to happen. Also, for any window, have something to do, because we’re not the ones messing with the hardware, so it becomes pretty boring pretty fast.
Well, that’s the post. I hope everyone enjoys the story. Looking at it now, I realize it seemed much better when I was half asleep and thinking maybe it was a good idea to talk about my maintenance window adventure. In hindsight it’s not as exciting now that I’m awake and not going insane from sitting around waiting.
wow, that was a read and a half lol. :), It’s good to see you will spend good sleep time sitting around waiting for a server to come back online 😉