In the seven-year history of NameHero, these have been by far the most challenging and stressful 48 hours, and perhaps the most stressful of my career.
What Happened? Full Timeline Of Events
* On Friday, November 10th at 5:31 a.m. Central Time our monitoring system alerted us to an outage on Node 209, a Cloud Node that provides service to our Web Hosting customers. This node was hosted on a dedicated hypervisor that contained dual AMD Epyc CPUs with 2x6TB NVMe drives in Raid 1. Immediately, engineers attempted to access the node and get it back online.
* Unfortunately SSH and direct console logins were unresponsive, therefore a graceful reboot (meaning all pending processes are safely stopped) through the hypervisor was attempted.
* Upon restarting the node, it immediately went into a kernel panic and would not boot.
* Data center technicians went in through the hypervisor and immediately began a file system check (fsck) to examine the storage partitions for corruption and attempt to repair them. Unfortunately this was unsuccessful.
* Data center technicians then began to inspect the Raid array for a failed drive. This specific hypervisor was using 2x6TB NVMe drives in a Raid 1 (mirror) configuration. This means that one drive mirrors the other, and in the event of a drive failure, the healthy one automatically takes over.
* It was discovered that the virtual file system created by the Raid array was corrupted; therefore a utility to rebuild the array was run. Unfortunately this process failed.
* At this point it was decided that further attempts to fix the file system could completely compromise the data, rendering it useless. Data center technicians therefore decided to clone the Cloud VM so that aggressive repairs could be attempted with an exact replica to fall back on.
* Given the size of the virtual file system, we knew this process was going to take some time, so we decided to begin preparing to restore from offsite backups. This included deploying a brand new hypervisor, matching the exact same specs, to rule out any potential faulty hardware.
* After running the VM clone process for several hours, it was determined the fastest way to restore service to accounts was to revert to offsite backups that were taken four hours prior to the crash. A new Cloud VM was then deployed on the new Hypervisor and configured via our standard Web Hosting image.
* At 6:00 p.m. Central Time, Friday, offsite backups began restoring, bringing services back online.
* By early Saturday morning, November 11th, more than 65% of accounts had been successfully restored and were back online.
* Unfortunately we became aware that accounts whose usernames began with the letter "a" or "b" were failing to restore due to the way our third-party backup vendor, JetBackup, had them indexed. We confirmed backups for these accounts were available and consulted with JetBackup on how to handle this. They advised we would need to reindex our backups but could not safely do so until the disaster restore process had completely finished.
* By late Saturday evening, we were still waiting on three large accounts to complete before we could begin the re-index process.
* Early Sunday morning, November 12th, the disaster restore process had completed and the re-index process through JetBackup was safely completed. All further restores were immediately started using the highest possible CPU limits.
* As of writing this statement all but 3 accounts have been safely restored from offsite backups and are now back online. The final 3 are still running through the restore process but should be back online soon (depending on their size).
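For readers who want a picture of what a Raid 1 mirror does and does not protect against, here is a toy sketch. It is only an illustration under simplifying assumptions: the class and method names are hypothetical, the "drives" are in-memory dictionaries, and it does not describe the actual controller on the hypervisor or claim to explain what caused the corruption on Node 209.

```python
class MirroredArray:
    """Toy Raid 1 model (hypothetical): every write is copied to each healthy drive."""

    def __init__(self):
        self.drives = [{}, {}]        # two in-memory stand-ins for the NVMe drives
        self.healthy = [True, True]

    def write(self, block, data):
        # The mirror copies whatever it is given, good or bad, to every healthy drive.
        for drive, ok in zip(self.drives, self.healthy):
            if ok:
                drive[block] = data

    def fail_drive(self, index):
        self.healthy[index] = False

    def read(self, block):
        # Reads are served by the first healthy drive.
        for drive, ok in zip(self.drives, self.healthy):
            if ok:
                return drive[block]
        raise IOError("no healthy drive left")


# Scenario 1: one drive fails, the other takes over.
a = MirroredArray()
a.write(0, "site data")
a.fail_drive(0)
survives_drive_failure = a.read(0) == "site data"

# Scenario 2: bad data is written; the mirror faithfully keeps two bad copies.
b = MirroredArray()
b.write(0, "site data")
b.write(0, "garbage")
both_copies_bad = all(drive[0] == "garbage" for drive in b.drives)
```

In this simplified model the mirror survives a drive failure, but a corrupted write lands on both drives, which is why a mirror is not a substitute for backups.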
Who Was At Fault? Why Did It Take So Long?
I hate downtime; I think 5 minutes of it is unacceptable. An hour of downtime is outrageous; but anything longer, especially 24+ hours, requires some change and reevaluation of our disaster policy.
All of our Web Hosting nodes deployed at NameHero run on top-of-the-line hypervisors, mostly with dual socket AMD Epyc CPUs and 2xNVMe drives in a Raid 1 (mirror) configuration. Each node is backed up nightly as well as weekly offsite.
If a storage drive fails, given the Raid 1 configuration, the other automatically takes over while we schedule a time to take the physical hypervisor offline and replace the failed drive. This doesn't happen often, but isn't unheard of. Over the last 7 years we've successfully handled such issues with very little downtime.
Additionally, if we are alerted to possible faulty hardware, the Cloud VM is either immediately relocated to another hypervisor, in real-time without any downtime, or the hypervisor is briefly pulled offline and all hardware, besides the storage drives, is replaced within 30 minutes. While this is extremely rare, it has happened a couple of times over the last 7 years with very little impact to our customers.
In the event of a disaster, such as a data center fire or a completely corrupted file system (as was the case here), we have 2 copies of remote offsite backups (one nightly, one weekly), stored in completely different data centers (one for nightly, one for weekly). This gives us two alternate remote locations to restore data from. Because these backups are remote, they take longer to restore, as the data must first be downloaded before it can be restored. This is considered a "worst case" scenario and is our "last resort" method as it does take the longest. Over the last 7 years, we have never had to resort to this until now.
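As a rough illustration of how a restore source might be chosen between the nightly and weekly copies, the sketch below picks the newest backup taken before the crash and reports how much recent data falls outside it. This is an assumption-laden example, not our internal tooling: the function name, the year, and the age of the weekly copy are all hypothetical. Only the 5:31 a.m. crash time and the four-hour gap come from the timeline above.

```python
from datetime import datetime, timedelta


def pick_backup(backups, crash_time):
    """Return (name, taken_at) for the newest backup taken at or before the crash."""
    usable = {name: taken for name, taken in backups.items() if taken <= crash_time}
    if not usable:
        raise ValueError("no backup predates the crash")
    name = max(usable, key=usable.get)
    return name, usable[name]


# The year is an assumption; the date and time come from the timeline.
crash = datetime(2023, 11, 10, 5, 31)

backups = {
    "nightly": crash - timedelta(hours=4),  # "four hours prior to the crash"
    "weekly": crash - timedelta(days=5),    # hypothetical age, for illustration only
}

name, taken_at = pick_backup(backups, crash)
data_loss_window = crash - taken_at  # changes made in this window are not in the backup
```

With these inputs the nightly copy is selected and the window of unprotected changes is four hours.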
With all of this outlined, our disaster policies were followed exactly as we have been trained to do. We attempted every method possible before resorting to our last effort, but when it comes to the integrity of the data (making sure data is preserved) vs. downtime, we have to choose the data. We knew that 24 - 48 hours of downtime was going to be bad, but we also knew that data loss would be catastrophic.
While ultimately the customer is responsible for their own data, NameHero does make a "best effort" to generate and store complimentary daily and weekly backups for each customer on our Shared nodes. We have a full-time team that manually inspects backups daily and constantly completes checks. There is however no substitute for having a copy of your own backup on your local computer/Dropbox/cloud storage. If you do not have one, I do suggest generating one at least once a month by going into cPanel -> JetBackup and downloading a "Full Account Backup."
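If you keep your downloaded backups in a folder on your own computer, a small script can remind you when the newest one is more than a month old. This is a minimal sketch under stated assumptions: the function names and the folder path in the comment are hypothetical, and it only looks at file modification times. It does not talk to cPanel or JetBackup.

```python
from datetime import datetime, timedelta
from pathlib import Path


def newest_backup_time(folder):
    """Return the modification time of the newest file in `folder`, or None if it has no files."""
    times = [p.stat().st_mtime for p in Path(folder).iterdir() if p.is_file()]
    return datetime.fromtimestamp(max(times)) if times else None


def backup_is_stale(newest, now, max_age_days=30):
    """True when there is no backup at all, or the newest one is older than the limit."""
    return newest is None or now - newest > timedelta(days=max_age_days)


# Example usage (the path is hypothetical):
# if backup_is_stale(newest_backup_time("/home/me/site-backups"), datetime.now()):
#     print("Time to download a fresh Full Account Backup.")
```

The 30-day default mirrors the "at least once a month" suggestion above; pass a smaller `max_age_days` if your site changes often.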
So who is to blame? I'm Pissed!
Trust me, I completely get it if you're angry; I'm pissed too! Downtime sucks, waiting for someone else to restore your website is a powerless feeling, and emotions run very high. However, I cannot point the finger at anyone; our nighttime supervisor called me one hour after the node failed to come back online, having exhausted all other methods. Our data center team took all necessary measures and utilized all the tools at their disposal.
Since coming online Friday at 6:30 a.m. I have not been off the computer. I understood from the very beginning this was going to be a challenging situation so I personally conducted the restoration process, going through account-by-account and verifying. I had three important engagements I could not cancel, but attended each with my laptop in front of me, including taking my kids to see Santa, and even briefly fell asleep on my laptop earlier this morning :).
My point being, I did everything physically and mentally possible, but do take complete and full responsibility; so if you need to be angry at someone you can be angry at me. Please don't take things out on our wonderful Superhero Team; they have all worked extremely hard to relay the information I passed down to them. I understand some of my timeline ended up being incorrect; we had to deal with the whole re-indexing process, but in these disaster type situations, there is way too much at stake to try to shortcut any of this process. Everything must be handled in an orderly fashion to ensure data is transferred and restored correctly.
Unfortunately when dealing with servers/computer hardware in general there is always a risk of some type of failure. While we have many preventative measures in place, things can still happen, and this is one of those things that haunts my dreams and keeps me up at night.
Where Do We Go From Here?
I live by the acronym A.D.P - Adapt, Develop, and Progress. Though this was an extremely challenging situation, there was a lot of learning that went on, and there are things we will all do in the future:
* While the entire situation was not ideal, one very big positive is that we were able to restore our offsite backups successfully. While the timeline was not the greatest, we were still able to utilize our disaster policies and get accounts back online and matched within our systems.
* If you don't already, PLEASE make sure to retain at least one of your backups. It's like an insurance policy: you never need it until you need it, and then you're thanking your lucky stars you paid for that policy. As noted above, these can easily be generated/downloaded right inside of cPanel -> JetBackup. As long as you have a copy, our team can restore these very quickly.
* This year we opened our privately owned data center in Kansas City, Missouri. I wanted to be able to offer more in terms of high availability and even additional redundancy to the discounted web hosting market that other cloud platforms could not provide (at least where it would be affordable to the end user). Therefore we've been testing a number of different configurations including real-time replication with instant high availability fail-over (i.e. one physical hypervisor goes offline, another is already synced and can fail-over automatically), as well as a new snapshot system that takes snapshots throughout the day and stores them on an external high-powered hypervisor. These things are relatively simple with a single-customer VPS, but a lot more complicated with large-storage shared servers.
* I am going to go through our disaster policy and reevaluate the time it takes for us to begin restoring backups. With this being the final effort to get services restored, we have to be very careful with the data, only initiating this process when completely necessary, but I am going to look at which specific situations call for this and what we can do to turn these around faster.
* We still plan on conducting a full investigation into the failed node to pinpoint exactly what caused the file system corruption. I will release further details as I have them.
In summary, I hope this timeline helps explain the situation we were up against. This was an extremely frustrating and challenging situation and I know it is going to take some time to earn back everyone's trust. But I'm going to continue to roll up my sleeves and ensure I'm doing everything humanly possible to continue moving forward offering high-speed, reliable, secure, and affordable cloud web hosting solutions to individuals and small businesses.
I wish you all a very happy, safe, and healthy holiday season!
- Ryan, Founder/CEO