I was speaking with a potential client last week and they brought up a great question: How Do You Monitor For Uptime?
This individual had seen our 99.9% uptime claims, but wanted to know what the “real” uptime was.
I completely understand the concern; for years before starting NameHero, I battled this with other web hosts.
They would claim 99.9% uptime, but the figure was really based on their “network,” not the actual servers. It meant they had Internet connectivity 99.9% of the time, not that the servers themselves stayed online.
When I started NameHero, offering an extremely reliable infrastructure for our customers was a huge priority to me. Having worked online since 1998, I know firsthand how much just one hour of downtime can affect an online business.
Achieving this required investing in top-of-the-line hardware and expensive software, and choosing a datacenter partner that shares the same goal.
This was a tall order, and it is something that has to be constantly tweaked and monitored.
But once all of this is in place, we have to carefully monitor it so we can see the actual results.
To do this, we have several levels of monitoring:
- NOC (Network Operations Center) Monitoring – To start, we have a dedicated 24x7x365 NOC team that watches our servers AND network constantly. Because monitoring is their only job, this also includes proactive monitoring, where they watch inbound network traffic to try to avoid an outage (such as when a DDoS attack may be starting). Once an issue is identified, they’re able to begin working toward a resolution immediately. Most often, problems are identified long before they become visible to our customers. If an issue does create a service disruption, every possible effort is made immediately to resolve it. This even includes completely replacing server hardware right away (especially if the situation is severe).
- Third-Party Server Uptime Monitoring – We also use Pingdom as a third-party monitoring service: we set up a specific location on each server that is constantly pinged to see if it is available. When the ping fails to get a response, we immediately get a notification (including on my personal cell phone). Some prefer to wait a few minutes to see if it’s just a small network blip, as it sometimes is, but we feel it’s best to begin an investigation immediately to minimize downtime in case it is a real issue.
- Server Response Monitoring – Another key metric many web hosts fail to monitor (or would rather not) is server response time. For us, it’s very important to monitor how long a server takes to respond to a request. If a server’s response time begins to trend upward, meaning it’s taking longer and longer to respond, an outage could be coming. We closely monitor this to see whether an underlying issue may be approaching. Because our Web Hosting and Reseller Hosting packages are on a “shared” infrastructure, we also have to constantly evaluate server resources to see if more need to be made available.
By closely analyzing these three monitoring layers we are able to achieve both proactive and reactive responses to issues. The majority of the time, we are also able to mitigate issues well ahead of any downtime to our customers.
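To make the response-time idea a bit more concrete, here is a minimal Python sketch of how an upward trend could be flagged. It is only an illustration, not our actual tooling or Pingdom’s API: the function names, the five-sample window, and the 1.5x threshold are all assumptions chosen for the example.

```python
import time
import urllib.request
from statistics import mean


def measure_response_ms(url, timeout=10.0):
    """Time one HTTP GET in milliseconds; return None if the request fails."""
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return None
    return (time.perf_counter() - start) * 1000.0


def slowdown_ratio(samples_ms, window=5):
    """Compare the average of the newest `window` samples to the earlier ones.

    A ratio of 1.0 means no change; 2.0 means recent responses take twice
    as long as the baseline. The window size is an illustrative assumption.
    """
    if window < 1 or len(samples_ms) < 2 * window:
        raise ValueError("need at least 2 * window samples")
    baseline = mean(samples_ms[:-window])
    recent = mean(samples_ms[-window:])
    return recent / baseline


def is_trending_up(samples_ms, window=5, threshold=1.5):
    """Flag a server whose recent responses are `threshold` times slower.

    The 1.5x threshold is a made-up value for this sketch, not a standard.
    """
    return slowdown_ratio(samples_ms, window) >= threshold
```

In practice, a scheduler would call `measure_response_ms` against a check URL on each server every minute or so, keep the successful timings in a list, and pass that list to `is_trending_up`. A failed request (a `None` result) would be treated as a possible outage and alerted on right away rather than fed into the trend.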
The other part of uptime, beyond simply “monitoring” for it, is incident handling.
It would be foolish of me to say we never have any issues because our infrastructure is “perfect.” Unfortunately, when dealing with servers and computers, there is no such thing.
A server with the best hardware, coupled with the most expensive software, in the world’s best datacenter is not exempt from having issues.
It’s just like the Mac Pro I use on a daily basis: it still needs to be updated, which requires a reboot, and I’ve also had its motherboard fail and had to replace it.
In short, stuff is still going to come up. But the important part to continuing to achieve high uptime is how we deal with such issues.
In just the last two weeks, we’ve had to fend off a massive DDoS attack, replace every piece of hardware in one server, and add resources to another. Multiple updates have also taken place.
With our three main levels of monitoring in place, when incidents arise, we promptly take action to begin resolving them.
As mentioned above, our entire network is monitored constantly, meaning we’re aware of many issues before they ever cause an outage, and when we do have an outage, we’re made aware instantly.
Work on incidents that aren’t immediately causing an outage is scheduled for non-peak hours, but it is still handled promptly.
Incidents that cause a complete outage mean “all hands on deck”; they take priority over everything else. Even I, as NameHero’s CEO, jump right into action when such a case arises. We do whatever it takes to bring a service back online.
Regardless of the incident size, we also believe in transparency and a steady flow of information to our customers. The Network Status page inside our customer interface lets you see exactly what is going on and how it is being dealt with.
I don’t like to make claims that I cannot back up. Because of this, we believe in complete transparency with our customers, which is why we release Uptime Reports each year that detail the majority of our network.
I also personally analyze monthly reports to see what happened, where we can improve, and what we should do moving forward.
For the month of March 2019, here’s what the overall results looked like:
Now, it’s important to note that 71 outages doesn’t mean 71 complete outages. Some of these are network blips, as I mentioned above (since our monitoring alerts us immediately after a ping goes unanswered), and we also monitor some non-production servers here where we are testing things, which means we’re constantly poking, rebooting, and modifying them.
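To show why a raw outage count says little on its own, here is a rough Python sketch of the arithmetic behind an uptime percentage. The per-outage durations are purely hypothetical (the report above gives a count, not durations), and the function names are just for illustration.

```python
# Seconds in a 31-day month such as March.
MARCH_SECONDS = 31 * 24 * 60 * 60


def downtime_budget_seconds(target_percent, period_seconds):
    """How much downtime a given uptime target allows over a period."""
    return period_seconds * (1 - target_percent / 100.0)


def uptime_percent(outage_seconds, period_seconds):
    """Uptime percentage given a list of outage durations in seconds."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    downtime = sum(outage_seconds)
    return 100.0 * (1 - downtime / period_seconds)


# HYPOTHETICAL example: 71 short blips of 30 seconds each.
# These durations are invented for illustration, not taken from any report.
hypothetical_outages = [30] * 71

budget = downtime_budget_seconds(99.9, MARCH_SECONDS)
achieved = uptime_percent(hypothetical_outages, MARCH_SECONDS)
print(f"99.9% allows {budget / 60:.1f} minutes of downtime in March")
print(f"71 blips of 30s each would give {achieved:.2f}% uptime")
```

With these made-up numbers, 71 blips add up to 2,130 seconds, which still fits inside the roughly 44.6 minutes of downtime that a 99.9% target allows in a 31-day month. What matters is the total duration, not the number of alerts.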
At NameHero we hate downtime and we try everything within our power to avoid it. We’re not exempt from it, but when it comes our way, we have plans in place to take care of it!
Ryan Gray is the founder and CEO of NameHero, one of the fastest-growing independent web hosts in the United States. Ryan has been working online since 1998 and has over two decades of experience in Internet entrepreneurship.