Why We Upgrade Servers

Posted By: Tony Baird

Last Updated: Friday September 11, 2009

I’m sure lots of our customers wonder why we do hardware upgrades once a year, or at least every two years. With the migration of our Skyline server to new hardware, I figured it was a good time to explain why we do it and how we do it with minimal service impact.

History Lesson

We’ve done several upgrades over the years, so I’d like to quickly run through the specifications we’ve had:

Dual Xeon 2.8GHz (2 CPU cores total), 2GB RAM, two 80GB PATA drives (one backup)
Dual Opteron 246 (2 CPU cores total), 2GB RAM, two 250GB SATA drives (one backup)
Xeon 3220 (4 CPU cores total), 4GB RAM, four 250GB SATA drives in RAID-10
Xeon 5430 (4 CPU cores total), 4GB RAM, four 250GB SATA drives in RAID-10
Dual Xeon 5430 (8 CPU cores total), 6GB RAM, four 300GB Raptor drives in RAID-10
Dual Xeon 5450 (8 CPU cores total), 12GB RAM, four 300GB 15K SAS drives in RAID-10
Dual Xeon 5520 (8 CPU cores total), 12GB RAM, four 300GB 15K SAS drives in RAID-10

For the most part, specifications jumped quite a bit each time. The only exception was the move from the Xeon 3220 to the Xeon 5430. The CPUs were a newer Xeon model, but overall it was not a huge jump. A big reason for that move at the time was to get rid of most of our 32-bit machines and leave room to add extra CPUs and more memory later on. In the end, though, we just moved to newer machines instead. Some of the older moves also happened when we used different datacenters, so those migrations were not as seamless as they are these days.

I’d like to point out some key moves in this progression. We used to run backups on the servers’ own drives. This was pretty much how hosting worked at any company: RAID was really expensive, and it was not common even if you owned the equipment. Now RAID is standard, in a lot of cases RAID-10 for reliability and performance. Moving to RAID meant we added a dedicated backup server as well, which in itself was an upgrade. Eventually the backup machine became an R1Soft system rather than rsync backups, but that story has been told before.

As for the Dual 5450 and Dual 5520, they are mostly the same. We’ll be using both, depending on the VLAN the server is on. We ran into an issue with Nehalems on our main Dallas VLAN, which houses all our web servers. They require an extra network port to run the IPMI system we use, and unfortunately the racks available on that VLAN do not have this extra port. So some of our upgrades will get 5450s, others will get 5520s, and most likely all new machines will use 5520s. This is probably the first time we’ve ever had a mix of different CPUs, which for web hosting is not a huge deal to begin with.

Why do we do it?

We’re obviously making a profit on each server, so why exactly do we do it? The obvious reason is that the requirements of web sites grow. Even small sites use more PHP and MySQL than ever before. Each feature added to WordPress, or whatever script it is, takes its toll over time depending on what it adds. The other reason is simply that a newer machine gives us greater capacity as well as improved performance. Fewer servers means less work for us: the old Dual Opterons hold 1/8 of what we can put on one of the newer machines we have now, and that might even be an understatement. The increased capacity does not mean we cram the machines full by any means. We had room left over before and we will again, only now with more general-use capacity and more burst capacity available. So in the end the users on the server do win: they get a better machine, not just a machine with more users on it.

We also do it because our buying power increases as we grow. We do not own our servers; we rent them, so as we rent more machines our cost per machine goes down. The pricing we can negotiate is much better than it was even a year ago. We can say to our provider: look, we have x machines, we’d like a good deal, and we’ll use this configuration for a while. This is a big reason we do it: we can leverage our buying power to get higher-end machines.

How do we do it?

If you’ve been on a server we migrated in the past two years or so, you’ve probably noticed that we can do it exceptionally well. When we moved to SoftLayer, a big advantage was that we no longer had to deal with routing IPs to each server. We now route them to our VLAN, which means that as long as we have space on it we can route IPs as we please. We do not do this just for migrations either: two machines can share the same IP block. So in quite a few cases now, the IP next to the one your site is on may not be on the same server. With new IP ranges that may not be the case yet, but as customers on dedicated IPs leave older servers, their IPs are thrown back into the pool for the entire VLAN.
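
To make the shared-pool idea concrete, here is a minimal sketch of a VLAN-wide IP pool in Python. It is an illustration only: the VlanIpPool class, the server names, and the 192.0.2.0/29 block (a documentation range) are all hypothetical and not taken from the real setup.

```python
import ipaddress


class VlanIpPool:
    """Tracks which server each IP in a VLAN-routed block currently lives on."""

    def __init__(self, cidr):
        # hosts() skips the network and broadcast addresses of the block.
        self.free = [str(ip) for ip in ipaddress.ip_network(cidr).hosts()]
        self.assigned = {}  # ip -> server name

    def assign(self, server):
        """Hand the next free IP to any server on the VLAN."""
        ip = self.free.pop(0)
        self.assigned[ip] = server
        return ip

    def release(self, ip):
        """Return an IP to the shared pool, e.g. when a customer leaves."""
        del self.assigned[ip]
        self.free.insert(0, ip)

    def move(self, ip, new_server):
        """Point an already-assigned IP at a different server."""
        if ip not in self.assigned:
            raise KeyError(f"{ip} is not assigned")
        self.assigned[ip] = new_server


pool = VlanIpPool("192.0.2.0/29")  # hypothetical block, documentation range
a = pool.assign("web1")
b = pool.assign("web2")  # neighbouring IP, different server
pool.release(a)          # customer leaves, IP goes back to the VLAN pool
c = pool.assign("web3")  # the released IP is reused by another server
```

Because the pool belongs to the VLAN rather than to any one server, neighbouring addresses can end up on different machines, which is the situation described above.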

Using this routing advantage, we can find out which sites are on a specific IP and transfer their data to the new server. Once all the sites on the IP have been moved to the new machine, we use ARP to tell the network instantly that the IP has switched locations. This means no DNS-related downtime at all. We spread our users across various IPs on each server to reduce other issues, and this helps make the migration seamless as well. The fact is that the majority of users do not know we’ve even switched their site to a new server. We inform people, but most do not read the notice or care as long as the site works. That’s how effective it is: a user does not even realize they’ve been switched over.
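
The two steps above, grouping sites by IP and then announcing the IP from the new machine, can be sketched in Python. This assumes the ARP step is a gratuitous ARP broadcast, which the post does not spell out; the function names, site names, MAC address, and IPs are all made up for illustration. The sketch only builds the frame; actually sending it needs a raw socket and root privileges.

```python
import socket
import struct
from collections import defaultdict


def group_sites_by_ip(site_ips):
    """Group site names by the IP they are served from."""
    groups = defaultdict(list)
    for site, ip in site_ips.items():
        groups[ip].append(site)
    return {ip: sorted(sites) for ip, sites in groups.items()}


def gratuitous_arp_frame(mac, ip):
    """Build a broadcast ARP request announcing that `ip` lives at `mac`."""
    mac_bytes = bytes.fromhex(mac.replace(":", ""))
    ip_bytes = socket.inet_aton(ip)
    broadcast = b"\xff" * 6
    # Ethernet header: destination, source, EtherType 0x0806 (ARP).
    ethernet = broadcast + mac_bytes + struct.pack("!H", 0x0806)
    # ARP header: Ethernet hardware, IPv4 protocol, sizes 6 and 4, request.
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    arp += mac_bytes + ip_bytes    # sender: the new server
    arp += b"\x00" * 6 + ip_bytes  # target IP equals sender IP
    return ethernet + arp


# Hypothetical inventory: two sites share one IP, a third has its own.
sites = {
    "example.com": "192.0.2.10",
    "example.org": "192.0.2.10",
    "example.net": "192.0.2.11",
}
plan = group_sites_by_ip(sites)
frame = gratuitous_arp_frame("02:00:00:00:00:01", "192.0.2.10")
```

Working one IP at a time keeps each switch small: every site in plan["192.0.2.10"] is copied first, and only then is the frame for that IP announced from the new machine.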

It’s not a perfect system, though, and we do run into issues. The big one is that a site can have changes made to it while its data is moving across. This is mostly a problem with discussion forums, which are constantly writing data to MySQL. Once a site is switched over, migrating the data again is not really an option, because re-syncing the MySQL data could itself lose data written in the meantime. The other thing that has come up the odd time is that we cause a routing issue for an IP. It’s basically human error: we re-route the IP and forget to check that it worked. It’s a rare occurrence, but it has happened. We’ve learned a lot since we started doing this, and it’s no longer the huge worry it once was, both because of our experience and because we know the issue can come up, so we double-check now.
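
One way to spot the "changed while moving" problem for plain files is to fingerprint a site before and after the copy and compare the two. The Python sketch below shows that idea under stated assumptions: it is not the process described in this post, the function names are hypothetical, and it only covers files on disk. Live MySQL data would need a database-level check instead.

```python
import hashlib
import os


def snapshot(root):
    """Map each file under `root` (relative path) to a SHA-256 of its contents."""
    result = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as handle:
                digest = hashlib.sha256(handle.read()).hexdigest()
            result[os.path.relpath(path, root)] = digest
    return result


def changed_during_copy(before, after):
    """Return paths added, removed, or modified between two snapshots."""
    paths = set(before) | set(after)
    return sorted(p for p in paths if before.get(p) != after.get(p))
```

Taking one snapshot just before the transfer starts and another just before the IP is switched gives a list of files that need a second look. An empty list means nothing on disk changed in between.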

Conclusion

Well, that summarizes why we do what we do and how we do it so well. It helps us by adding capacity without having to manage more servers. It puts users on newer hardware, on a server with a lot more burst room than before. We do all this with the majority of users not having any issue whatsoever.