Server Uptime - Current Average 90 Day Server Uptime: 99.973709%

Data Last updated: Wed Jul 23 18:40:01 2008

What is Uptime?

Many providers will mention uptime when describing their service, but few will actually describe what they mean by it and how it is measured. There are two simple reasons for this. First, the murkier the definition, the easier it is to get away with failure. Second, monitoring uptime is not by any means a trivial technical task.

In layman's terms, downtime is when a client cannot connect to a site, or can connect but the connection is so much slower than usual that it is not useful. The first problem in measuring uptime is identifying downtime. The second, and much more complex, problem is identifying its root cause, which could be an actual service fault, an issue with any intermediate device on the Internet, or even the client workstation.

Rather than try to solve this enormously complex problem (or pretend it does not exist), we prefer to take the simple step of measuring one smaller aspect of uptime and doing it properly.

Currently we only consider physical server uptime in our calculation. This means that an individual VPS being down is not counted as downtime. We do not monitor individual VPS uptime because we give clients the control to shut their VPSs down at any time, and we have no way of distinguishing intentional downtime from downtime caused by an accidental action.

How it is done.

Each of our servers runs an OpenVPS monitoring agent (openvps-mon). This agent collects various server metrics such as CPU load, file system utilization, memory utilization, I/O load, etc. (a total of about 25 data points) and sends these data to a receiving agent every minute or so. We call these data packets "heartbeats". For uptime tracking, only the presence of the heartbeat is considered; the server metrics play no role in the up/down determination.
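To make the idea of a heartbeat concrete, here is a minimal sketch of what an agent could send. The wire format is not described above, so the JSON encoding, the UDP transport, the field names and the functions build_heartbeat and send_heartbeat are all assumptions made for illustration, not the actual openvps-mon protocol.

```python
import json
import socket
import time


def build_heartbeat(hostname, metrics, now=None):
    """Serialize one heartbeat as UTF-8 JSON (hypothetical format)."""
    payload = {
        "host": hostname,
        "sent_at": time.time() if now is None else now,
        "metrics": metrics,  # the real agent reports about 25 data points
    }
    return json.dumps(payload, sort_keys=True).encode("utf-8")


def send_heartbeat(packet, receiver_addr):
    """Fire-and-forget UDP send (assumed transport).

    A lost packet simply looks like a missed heartbeat to the receiver.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(packet, receiver_addr)


# Example packet with two made-up metrics and a fixed timestamp.
packet = build_heartbeat(
    "node1.example.com",
    {"load1": 0.42, "mem_used_pct": 63.0},
    now=1216838401.0,
)
```

An agent loop would call send_heartbeat(packet, (receiver_host, receiver_port)) roughly once a minute; the receiver only needs to note that something arrived in order to decide up or down.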

The receiving daemon (openvps-recv) applies an adaptive algorithm to prior heartbeat intervals to learn when to expect future heartbeats. When two expectations pass (~2 minutes) without a heartbeat, openvps-recv marks the server as down.
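The adaptive algorithm itself is not specified here. As one plausible sketch, the expected interval can be tracked with an exponentially weighted moving average of observed intervals, and a server is treated as down once two expected intervals have elapsed without a heartbeat. The class name HeartbeatTracker, the smoothing factor alpha and the 60-second starting interval are assumptions for illustration.

```python
class HeartbeatTracker:
    """Learns the heartbeat interval and flags a server as down (sketch)."""

    def __init__(self, initial_interval=60.0, alpha=0.2, missed_allowed=2):
        self.expected_interval = initial_interval  # seconds, assumed start value
        self.alpha = alpha                         # weight given to the newest interval
        self.missed_allowed = missed_allowed       # "two expectations" in the text
        self.last_seen = None

    def record(self, timestamp):
        """Register a heartbeat and adapt the expected interval."""
        if self.last_seen is not None:
            observed = timestamp - self.last_seen
            self.expected_interval = (
                self.alpha * observed
                + (1 - self.alpha) * self.expected_interval
            )
        self.last_seen = timestamp

    def is_down(self, now):
        """True once more than `missed_allowed` expected intervals have passed."""
        if self.last_seen is None:
            return False  # no heartbeat seen yet, so nothing can be concluded
        silence = now - self.last_seen
        return silence > self.missed_allowed * self.expected_interval


tracker = HeartbeatTracker()
for ts in (0.0, 60.0, 120.0):
    tracker.record(ts)
```

With heartbeats at 0, 60 and 120 seconds, the expected interval stays at about 60 seconds, so the server is still considered up at 200 seconds and down at 300 seconds.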

Why not just use ping to monitor servers?

It's possible for a server's resources to be maxed out and the system to be completely unusable, yet for it to still respond to pings. This is because ICMP responses are generated entirely within the kernel. Therefore the ability to respond to ICMP pings is hardly an accurate indication of uptime. Many providers use ping in their "official" uptime measurement for this very reason.

What if openvps-recv goes down?

The data is recorded using Tobi Oetiker's excellent time-series storage and visualization tool RRDTool. RRDTool supports unknown data. When the receiving daemon is down, the data is assumed to be unknown (as opposed to "down") and these datapoints do not affect the calculation result.
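The effect of unknown data on the average can be illustrated with a short sketch. This is plain Python rather than RRDTool itself: the function name uptime_percent and the encoding of samples as 1 (up), 0 (down) and None (unknown) are assumptions chosen for the example.

```python
def uptime_percent(samples):
    """Average uptime over known samples only.

    Each sample is 1 (up), 0 (down) or None (unknown, e.g. the receiving
    daemon was not running). Unknown samples are left out of both the
    numerator and the denominator.
    """
    known = [s for s in samples if s is not None]
    if not known:
        return None  # nothing was observed, so no figure can be reported
    return 100.0 * sum(known) / len(known)


# Five one-minute samples: one unknown, one down, three up.
result = uptime_percent([1, 1, None, 0, 1])  # 75.0
```

Had the unknown minute been counted as "down", the same data would give 60% instead of 75%, which is why the distinction matters.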

Of course when the receiving daemon is down no up/down information is recorded, and if an actual down event were to occur, it would not be noticed. To work around this, we deploy multiple receiving daemons residing on distinct physical servers.

How do you treat Administrative Downtime?

Sometimes a server has to be taken down administratively, e.g. for a scheduled kernel upgrade. This is referred to as "administrative downtime". Administrative downtime usually happens within a predefined maintenance window and is preceded by an advance notification. The OpenVPS monitoring system distinguishes between administrative and actual downtime and keeps track of each separately. Administrative downtime is not considered in our average uptime number.
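A small sketch of how the two kinds of downtime might be tallied separately follows. The state labels and the summarize function are hypothetical, and the sketch assumes that "not considered" means administrative minutes are excluded from both sides of the ratio; the text above does not spell out that detail.

```python
from collections import Counter

# Hypothetical per-minute state labels.
UP, DOWN, ADMIN = "up", "down", "admin"


def summarize(states):
    """Tally states and compute uptime with administrative downtime excluded."""
    counts = Counter(states)
    considered = counts[UP] + counts[DOWN]  # ADMIN minutes are left out entirely
    uptime = 100.0 * counts[UP] / considered if considered else None
    return {
        "uptime_percent": uptime,
        "actual_down_minutes": counts[DOWN],
        "admin_down_minutes": counts[ADMIN],
    }


# Ten minutes: seven up, two in a maintenance window, one actual outage.
report = summarize([UP] * 7 + [ADMIN] * 2 + [DOWN] * 1)
```

Here the report shows 87.5% uptime (7 of the 8 considered minutes), with one minute of actual downtime and two minutes of administrative downtime recorded separately.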

Future Plans

Eventually we plan on expanding this system to use the actual server metrics in combination with RRDTool's Aberrant Behavior Detection capabilities. Such a system could apply the Holt-Winters time series forecasting algorithm to prior records of system metrics (e.g. the CPU utilization pattern throughout the day and/or week) and automatically detect behavior that is inconsistent with prior patterns.
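To give a feel for the approach, the following is a simplified additive Holt-Winters forecaster that flags values falling outside a band around the forecast. It is a sketch only and not RRDTool's implementation: the function name holt_winters_flags, the smoothing parameters, the band width and the min_band floor are all assumed values for illustration.

```python
def holt_winters_flags(series, period, alpha=0.5, beta=0.1, gamma=0.1,
                       width=3.0, min_band=1.0):
    """Return one boolean per value: True where it deviates from the forecast.

    Additive Holt-Winters: forecast = level + trend + seasonal term.
    The first `period` values only initialize the model and are never flagged.
    """
    level = sum(series[:period]) / period
    trend = 0.0
    season = [x - level for x in series[:period]]
    deviation = [0.0] * period  # smoothed absolute error per seasonal slot
    flags = [False] * period

    for t in range(period, len(series)):
        i = t % period
        forecast = level + trend + season[i]
        error = abs(series[t] - forecast)
        # min_band keeps the band from collapsing to zero on very regular data.
        band = max(width * deviation[i], min_band)
        flags.append(error > band)

        # Standard additive Holt-Winters updates.
        prev_level = level
        level = alpha * (series[t] - season[i]) + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        season[i] = gamma * (series[t] - level) + (1 - gamma) * season[i]
        deviation[i] = gamma * error + (1 - gamma) * deviation[i]

    return flags


# A repeating daily-style pattern with one abnormal spike at the end.
cpu = [10.0, 20.0, 30.0, 20.0] * 5
cpu[-1] += 50.0
flags = holt_winters_flags(cpu, period=4)
```

In this example only the final value is flagged, because it is the only one that departs from the learned pattern. Note that after a large spike the level shifts, so the points that follow may also be flagged until the model settles again.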

We also plan on expanding this system so that individual VPSs can be monitored as well.