With so many interdependent systems, the adage “If you can’t measure it, you can’t manage it” is more relevant than ever. Unfortunately, many administrators manage by gut feeling rather than by measured metrics.
Often, when you actually measure, you’ll find that reality is counter to your beliefs. It is up to administrators and senior management to establish reliability metrics and thresholds in policy. When a threshold is crossed for a given period of time, a course of action must be taken to bring metrics back to acceptable levels. One oft-cited metric is uptime.
Uptime measures the time that a computer system or service has been “up and running.” It is typically seen as a measure of reliability. When used carefully, and within a framework of metrics, uptime can be a valuable measurement.
However, like statistics, metrics can be bent to almost any meaning. A system could boast a long continuous uptime, for example, yet, as a result of poor maintenance, be serving bad data from its database. Most senior managers do not understand how demanded uptime drives costs, and ask for “five-nines” (99.999%) uptime. Each additional “nine” increases the vigilance that an IT department must give to its systems.
It also increases a system’s cost, as money must be spent on higher-quality components, redundant systems, or both. The higher the target, the more difficult, and the more expensive, it is to meet. For the calculations that follow, note that the true length of a year on Earth is 365.2422 days, or about 365.25 days.
The following table gives rounded values for each level of “nines”: how much uptime per year each specifies and, perhaps more to the point, how much downtime per year each allows.

    Nines   Uptime     Downtime allowed per year
    Two     99%        ~3.65 days
    Three   99.9%      ~8.77 hours
    Four    99.99%     ~52.6 minutes
    Five    99.999%    ~5.26 minutes
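The downtime allowances above follow directly from the arithmetic: each extra “nine” cuts the allowed downtime by a factor of ten. A minimal sketch of that calculation, assuming the 365.25-day year noted above:

```python
# Sketch: downtime allowed per year for each level of "nines".
# Assumes a 365.25-day year, as discussed above.

MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes

def downtime_minutes(nines: int) -> float:
    """Minutes of downtime per year allowed by an uptime of N nines.

    An uptime of N nines is the fraction 1 - 10**-N
    (e.g. three nines -> 0.999), so the allowed downtime
    fraction is simply 10**-N.
    """
    return MINUTES_PER_YEAR * (10 ** -nines)

for n in range(2, 6):
    uptime_pct = 100 * (1 - 10 ** -n)
    print(f"{n} nines ({uptime_pct:.3f}%): "
          f"{downtime_minutes(n):,.2f} min/year downtime")
```

Running this reproduces the table: five nines allows only about 5.26 minutes of downtime in an entire year, which is why each additional nine demands so much more vigilance and money.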
IT departments must also respond to other metrics, such as providing return on investment (ROI) and service level agreements (SLAs). These topics are beyond the scope of this post. However, all members in an IT department should familiarize themselves with these and other metrics.
Maintaining High Availability
One factor that affects the availability of software and hardware is age. Software ages only relative to other software components: when an OS is upgraded, for example, many components that depend on it suddenly become outdated (“legacy”). Hardware ages physically, and its aging often ends in physical failure.
To create highly available systems, a system administrator must account for each subsystem, its dependencies, and its possible failure modes, as shown in the following list of vulnerable components:
– Power supply
– Logic board
– Communication cards (SCSI, Fibre Channel, and so on)
– Service (software)

Each of these components can be protected against failure in a number of ways, the most common being redundancy. A redundant system provides a spare that can take over when a component fails, so that no single component can take the service offline. Remember Murphy’s Law: “Whatever can go wrong, will go wrong.” These components form a chain of dependencies, and that chain is only as strong as its weakest link.
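The “weakest link” point can be made quantitative: a service that needs every component in the chain is only as available as the product of its components’ availabilities, while a redundant pair fails only if every copy fails. A minimal sketch, using illustrative availability figures that are assumptions, not measurements:

```python
# Sketch: why dependency chains erode availability, and why redundancy helps.
# The per-component availability numbers below are illustrative assumptions.

def series(*availabilities: float) -> float:
    """Availability of a service that needs ALL components:
    the individual availabilities multiply."""
    total = 1.0
    for a in availabilities:
        total *= a
    return total

def redundant(a: float, spares: int = 1) -> float:
    """Availability of a component with N spares:
    it is down only when every copy is down at once
    (assuming independent failures)."""
    return 1 - (1 - a) ** (1 + spares)

# Chain: power supply, logic board, communication card, service software
chain = series(0.999, 0.9995, 0.999, 0.995)
print(f"chain availability: {chain:.4%}")  # worse than any single link

# Add a redundant power supply and the whole chain improves
chain_r = series(redundant(0.999), 0.9995, 0.999, 0.995)
print(f"with redundant PSU: {chain_r:.4%}")
```

Note that the chain’s availability is lower than that of its worst component, and that redundancy only strengthens the link it is applied to; the weakest remaining link still bounds the whole service.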