I’m sure lots of our customers wonder why we do hardware upgrades once a year, or at least every two years. With the migration of our Skyline server to new hardware, I figured it was a good time to explain why we do it and how we do it with minimal service impact.
History Lesson
We’ve done several upgrades over the years. I’d like to quickly run through the specifications we’ve had:
- Dual Xeon 2.8GHz (2 CPU cores total), 2GB RAM, two 80GB PATA drives (one for backups)
- Dual Opteron 246 (2 CPU cores total), 2GB RAM, two 250GB SATA drives (one for backups)
- Xeon 3220 (4 CPU cores total), 4GB RAM, four 250GB SATA drives in RAID-10
- Xeon 5430 (4 CPU cores total), 4GB RAM, four 250GB SATA drives in RAID-10
- Dual Xeon 5430 (8 CPU cores total), 6GB RAM, four 300GB Raptor drives in RAID-10
- Dual Xeon 5450 (8 CPU cores total), 12GB RAM, four 300GB 15K SAS drives in RAID-10
- Dual Xeon 5520 (8 CPU cores total), 12GB RAM, four 300GB 15K SAS drives in RAID-10
For the most part, the specifications jumped quite a bit each time. The one exception was the move from the Xeon 3220 to the Xeon 5430: the CPUs were a newer Xeon model, but overall it was not a huge jump. A big reason for that move at the time was to get rid of most of our 32-bit machines and leave room to add extra CPUs and more memory later on. In the end, though, we just ended up using newer machines instead. Some of the older moves also happened when we used different datacenters, so those migrations were not as seamless as they are these days.
I’d like to point out some key moves in this progression. We used to run backups on the drives of the servers themselves. That was pretty much how hosting worked at any company: RAID was really expensive, and it was not common even if you owned the equipment. Now RAID is standard, in a lot of cases RAID-10 for reliability and performance. Moving to RAID meant we also added a dedicated backup server, which was an upgrade in itself. Eventually the backup machine became an R1Soft system rather than rsync backups, but that story has been told before.
As for the Dual 5450 and Dual 5520, they are mostly the same. We’ll be using both, depending on the VLAN the server is on. We ran into an issue with Nehalem machines (the 5520s) on our main Dallas VLAN, which houses all our web servers. They require an extra network port to run the IPMI system we use, and unfortunately the racks available on that VLAN do not have this extra port. So some of our upgrades will get 5450s, others will get 5520s, and most likely all new machines will use 5520s. This is probably the first time we’ve ever had a mix of different CPUs, which for web hosting is not a huge deal to begin with.
Why do we do it?
We’re obviously making a profit on each server, so why exactly do we do it? The obvious reason is that the requirements of web sites grow. Even small sites use more PHP and MySQL than ever before. Every feature added to WordPress, or whatever script it is, takes its toll over time. The other reason is simply that a newer machine gives us greater capacity as well as improved performance. Fewer servers means less work for us: the old Dual Opterons hold 1/8 of what we can put on one of the newer machines we have now, and that might even be an understatement. The increased capacity does not mean we cram the servers full by any means. It just means that we had room left over before and we will again, only with more general-use capacity and more burst capacity available. So in the end the users on the server do win: they get a better machine, not just a machine with more users on it.
We also do it because our buying power increases as we grow. We do not own our servers, we rent them, so as we rent more machines our cost per machine goes down. The pricing we can negotiate is much better than it was even a year ago. We can say to our provider, “Look, we have x machines, we’d like a good deal, and we’ll use this configuration for a while.” This is a big reason we do it: we can leverage our buying power to get higher-end machines.
How do we do it?
If you’ve been on a server we migrated in the past two years or so, you’ve probably noticed that we can do it exceptionally well. When we moved to SoftLayer, a big advantage was that we no longer had to deal with routing IPs to each individual server. We now route them to our VLAN, which means that as long as we have space on it we can route IPs as we please. We do not do this just for migration purposes either: two machines can share the same IP block. So in quite a few cases the IP next to the one your site is on may not be on the same server. That may not be the case for new IP ranges, but as customers on dedicated IPs leave older servers, their IPs are thrown back into the pool for the entire VLAN.
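As a rough illustration of what a VLAN-wide pool means in practice, here is a minimal Python sketch. It is not the actual tooling: the class name, the server names, and the documentation-range addresses are all made up for the example. The point is only that an IP belongs to the VLAN rather than to one machine, so it can be moved or handed back without renumbering anything.

```python
class VlanIpPool:
    """Tracks which server currently answers for each IP routed to one VLAN.

    Hypothetical bookkeeping only; it does not touch any real network.
    """

    def __init__(self, addresses):
        # None means the IP is free and sitting in the VLAN-wide pool.
        self.owner = {ip: None for ip in addresses}

    def assign(self, ip, server):
        if self.owner[ip] is not None:
            raise ValueError(f"{ip} is already in use on {self.owner[ip]}")
        self.owner[ip] = server

    def release(self, ip):
        # A customer on a dedicated IP left: the IP goes back to the pool.
        self.owner[ip] = None

    def move(self, ip, new_server):
        # Migration: same IP, different machine, no renumbering needed.
        self.owner[ip] = new_server

    def free(self):
        return sorted(ip for ip, server in self.owner.items() if server is None)


pool = VlanIpPool(["192.0.2.10", "192.0.2.11", "192.0.2.12"])
pool.assign("192.0.2.10", "old-server")
pool.assign("192.0.2.11", "other-server")  # neighbouring IP, different machine
pool.move("192.0.2.10", "new-server")      # the site keeps its IP after the move
pool.release("192.0.2.11")                 # dedicated-IP customer left
print(pool.free())
```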
Using this routing advantage, we can find out which sites are on a specific IP and transfer their data to the new server. Once all the sites on that IP have been copied to the new machine, we use ARP to tell the network instantly that the IP has switched locations. This means no downtime from waiting on DNS changes at all. We spread our users across various IPs on each server to reduce other issues, and this helps make the migration seamless as well. The fact is that the majority of users do not know we’ve even switched their site to a new server. We inform people, but most do not read the notice or care as long as the site works. That’s how effective it is: a user does not even realize they’ve been switched over.
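The per-IP procedure can be sketched as a small planning function in Python. This is a sketch under assumptions, not the real migration scripts: the site names and addresses are invented, and the post only says ARP is used, so the `arping -U` command (unsolicited ARP from the iputils `arping` tool) and the `eth0` interface name are guesses at one way to do the announcement. The function only builds the list of steps; it does not copy data or send anything.

```python
from collections import defaultdict


def plan_migration(site_ips, interface="eth0"):
    """Return ordered steps: per IP, copy all of its sites, then announce the IP."""
    by_ip = defaultdict(list)
    for site, ip in site_ips.items():
        by_ip[ip].append(site)

    steps = []
    for ip in sorted(by_ip):
        for site in sorted(by_ip[ip]):
            steps.append(("copy", site))
        # Assumed command: unsolicited ARP sent from the new server so that
        # neighbours update their ARP caches for this IP straight away.
        steps.append(("announce", ["arping", "-U", "-I", interface, "-c", "3", ip]))
    return steps


# Hypothetical sites and documentation-range IPs.
sites = {
    "forum.example": "192.0.2.10",
    "shop.example": "192.0.2.11",
    "blog.example": "192.0.2.10",
}
for step in plan_migration(sites):
    print(step)
```

Grouping by IP is what keeps the cutover window short: nothing is announced for an IP until every site on it has been copied, and each IP is switched on its own. A real run would also need to bring the IP up on the new machine and remove it from the old one before announcing, which this sketch leaves out.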
It’s not a perfect system, though, and we do run into issues. The big one is that a site can change while its data is being moved across. This is mostly a problem with discussion forums, which are constantly writing data to MySQL. Once a site is switched over, it’s not really an option to migrate the data again, because re-syncing the MySQL data would lose whatever was written on the new server in the meantime. The other thing that has come up the odd time is that we cause a routing issue for an IP. It’s basically human error: we re-route the IP and forget to check that it worked. It’s a rare occurrence, but it has happened. We’ve learned a lot since we started doing this, so it’s no longer the huge worry it once was: we have the experience, we know the issue can come up, and we double check for it now.
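One way to see how much a site changed while it was being copied is to compare two snapshots of modification times, one taken when the copy starts and one taken just before the IP is re-routed. The Python sketch below assumes such snapshots already exist as plain dictionaries; the function name, paths, and timestamps are hypothetical, and this is an illustration of the problem rather than a description of how the migrations are actually checked.

```python
def changed_since_copy(before, after):
    """Compare two {path: mtime} snapshots and list what needs a re-sync.

    Returns (changed_or_new, deleted), both sorted.
    """
    changed = [p for p in after if p not in before or after[p] > before[p]]
    deleted = [p for p in before if p not in after]
    return sorted(changed), sorted(deleted)


# Hypothetical snapshots: at copy start, and just before the cutover.
at_copy_start = {"index.php": 100, "forum/posts.MYD": 100, "tmp/cache.dat": 100}
before_cutover = {"index.php": 100, "forum/posts.MYD": 150, "uploads/new.jpg": 160}

changed, deleted = changed_since_copy(at_copy_start, before_cutover)
print("re-sync:", changed)
print("remove on new server:", deleted)
```

A short list here means a quick final sync before the switch; a long one points to the kind of busy forum or very large account that is hard to move cleanly.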
Conclusion
Well, that summarizes why we do what we do and how we do it so well. It helps us by adding capacity without having to manage more servers. It puts users on newer hardware, on a server with a lot more burst room than before. We do all this with the majority of users not having any issue whatsoever.
Very pleased with the speed improvement in the latest hardware upgrade. Great for WordPress, Joomla, and Magento sites. This is clearly a win-win situation for both parties. I like the fact that Hawk Host continually improves on their services and hardware on a regular basis…which makes them quite an exceptional host!
For future migrations: How about displaying a maintenance notice at the site level in order to block public access until the migration is completed? This will help to prevent data loss or data integrity problems.
Some ideas to think about but not sure how feasible:
1) Write a script which copies the maintenance notice to the root directory of each domain and then remove it when done
2) Temporary redirection to a standard maintenance notice that’s centrally located
3) Integrate your maintenance scripts with cPanel or write a cPanel plugin/addon
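Idea 1 could look roughly like the Python sketch below. Everything in it is an assumption made for illustration: the `maintenance.html` file name, the function names, and the idea that each domain’s document root is passed in as a path. It only places and removes the file. Actually showing the notice to visitors would still need a web server rule or redirect, which is not covered here.

```python
from pathlib import Path

NOTICE_NAME = "maintenance.html"  # hypothetical file name


def place_notices(docroots, notice_html):
    """Write the notice into each document root; return the files created."""
    created = []
    for root in docroots:
        target = Path(root) / NOTICE_NAME
        if target.exists():
            continue  # never overwrite a file the site owner already has
        target.write_text(notice_html, encoding="utf-8")
        created.append(target)
    return created


def remove_notices(created):
    """Delete only the notice files that place_notices created."""
    for target in created:
        target.unlink(missing_ok=True)  # missing_ok needs Python 3.8+
```

Keeping the list returned by `place_notices` and passing it to `remove_notices` afterwards means the cleanup step never deletes a file that a site owner put there themselves.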
1) Users who have static data will complain about the notice.
That basically covers #2 and #3.
We can typically migrate sites without issue. It’s the exception for people to have data sync issues. Those users typically have a very large forum where data is changing every few minutes, say. The other case is a user who has, say, 15GB of small files plus some dynamic portions like a forum, so moving the data takes so long that it’s difficult to prevent.
It’s not just forums but also store websites with transaction processing. It’s not really a big deal for a forum to lose some data, but when it comes to transaction processing, especially orders and payments, it’s really important to have data integrity and prevent any kind of data loss.
Alternatively, if redirection or an automated script is not viable, then you need to work with the owners of dynamic data sites when it comes to migration. The onus will be on the owners to set their sites offline (if they want to) during the migration. I know Joomla and Magento can easily be set to display a maintenance notice.
We give notice that it is happening, and we advise anyone with concerns to contact us. If a user wanted to schedule a time for it to be done, we could do that.
As far as working with each user goes, if that were required it would never actually get done. A lot of users do not even respond to emails sent to them, so we basically could never upgrade hardware because we could never reach all the users. A migration that takes us two days would take 6 months to complete, because we’d have to coordinate with and get responses from an entire server. It would be very costly, and in the end we’d probably have to keep both servers because some users would never respond.
Unfortunately, we’ll just have to deal with the fact that there may be 1 user on each server who has an issue with the way we do migrations. Our policy is still better than that of a large majority of hosts. Those hosts would change the user’s IP, so we’re talking about multiple days during which data could get out of sync. With the majority of our sites we’re talking about a window of a few minutes. There are a few exceptions, but that’s still much better than days. We can migrate an entire server in the time it takes DNS propagation to happen at another host, and we do ours in batches, not everyone all at once.
Migrating accounts by batches is a better idea.
I don’t mean for you to wait forever for users to respond. There should be a deadline for any response. Those who care about their site’s data integrity will definitely get back to you.
I still think there is a way to automate but then I’m not the one doing the migration. 🙂
We already do it in batches. We look at each IP on the server and move the IPs one by one, so each IP is a batch. We post notices and encourage users to subscribe, but most do not, as they do not want to know about maintenance windows. Mass emails to users have caused complaints and cancellations in the past. So we take the approach that if you care you’ll subscribe; otherwise you can continue to have your head in the sand about it.
As far as automation goes, we simply select the sites on that IP, move them, then re-route the IP. It’s as automated as we’d want it to be, as we want some control over it.