Definition and Measurements of System Availability
Intuitively, system availability means the system is operational when you have work to do: the system is not down because of problems or other unplanned interruptions. In measurement terms, system availability is the percentage of scheduled uptime during which the system is actually available for use. The key elements of this definition include:
- The frequency of system outages within the time frame for the calculation
- The duration of outages
- Scheduled uptime
The frequency of outages is a direct reliability statement. The duration of outages reflects the severity of the outages; it is also related to the recovery strategies, service responsiveness, and maintainability of the system. Scheduled uptime is a statement of the customer's business requirements for system availability. It could range from 5 x 8 (5 days a week, 8 hours a day) to 7 x 24 (7 days a week, 24 hours a day) or 365 x 24 (365 days a year, 24 hours a day). Excluding scheduled maintenance, the 7 x 24 shops require continuous system availability. In today's business computing environments, many businesses are 7 x 24 shops as far as system availability is concerned.
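These schedules translate into very different annual uptime bases, which matters when converting availability percentages into downtime hours. A quick sketch (approximating a year as 52 weeks for the 5 x 8 case; the names are ours, for illustration):

```python
# Scheduled uptime per year for common coverage schedules.
# A year is approximated as 52 weeks for the 5 x 8 schedule.

schedules = {
    "5 x 8":    52 * 5 * 8,   # business hours only: 2,080 hours
    "7 x 24":   365 * 24,     # continuous: 8,760 hours
    "365 x 24": 365 * 24,     # same annual total as 7 x 24: 8,760 hours
}
for name, hours in schedules.items():
    print(f"{name:>8}: {hours:,} scheduled hours per year")
```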
The inverse measurement of system availability is the amount of downtime per system per time period (for example, per year). If scheduled uptime is known or is a constant (for example, for the 7 x 24 businesses), then given the value of one measurement, the other can be derived. Table 13.1 shows some examples of system availability and hours of downtime per system per year; the sketch following the table shows the underlying calculation.
The 99.999% availability, also referred to as "five 9s" availability, is the ultimate industry goal and is often cited in marketing materials by server vendors. With regard to measurement data, a study of customer installations by the consulting firm Gartner Group (1998) reported that one server platform actually achieved 99.998% availability (10 minutes of downtime per year) via clustering solutions. For a single system, availability of that same server platform was 99.90%. Other servers in the study achieved 99.98% and 99.94% availability. At the low end, there was a PC server platform with availability below 97.5%, a poor level by availability standards. These are all well-known server platforms in the industry.
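The improvement from clustering can be understood with a simple redundancy model. The sketch below is a minimal illustration, assuming a two-node cluster with independent failures and instantaneous failover; real clusters fall short of this idealization, which is one reason the reported figure is 99.998% rather than the model's prediction:

```python
# Simplified redundancy model: a cluster is up as long as at least one
# node is up. Assuming independent node failures and perfect failover,
# cluster availability = 1 - (probability that all nodes are down).

def cluster_availability(node_availability: float, nodes: int = 2) -> float:
    """Idealized availability of an n-node cluster of identical nodes."""
    return 1.0 - (1.0 - node_availability) ** nodes

single = 0.9990                               # single-system availability (99.90%)
print(f"{cluster_availability(single):.6%}")  # ~99.9999% -- an upper bound; the
                                              # measured 99.998% reflects failover
                                              # time and shared failure modes
```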
Table 13.1. Examples of System Availability and Downtime per System per Year
| System Availability (%) (24 x 365 basis) | Downtime per System per Year |
|---|---|
| 99.999 | 5.3 minutes |
| 99.99 | 52.6 minutes |
| 99.95 | 4.4 hours |
| 99.90 | 8.8 hours |
| 99.8 | 17.5 hours |
| 99.7 | 26.3 hours |
| 99.5 | 43.8 hours |
| 99.0 | 87.6 hours |
| 98.5 | 131.4 hours |
| 98.0 | 175.2 hours |
| 97.5 | 219.0 hours |
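The downtime figures follow directly from the definition: on a 24 x 365 basis, scheduled uptime is 8,760 hours per year, and annual downtime is (1 - availability) x 8,760. A minimal sketch of the calculation (the function and variable names are ours, for illustration):

```python
HOURS_PER_YEAR = 24 * 365  # scheduled uptime on a 24 x 365 basis: 8,760 hours

def downtime_per_year(availability_pct: float) -> float:
    """Annual downtime in hours for a given availability percentage."""
    return (1.0 - availability_pct / 100.0) * HOURS_PER_YEAR

for pct in (99.999, 99.99, 99.95, 99.90, 99.8, 99.0, 97.5):
    hours = downtime_per_year(pct)
    if hours < 1:
        print(f"{pct:>7}%  {hours * 60:6.1f} minutes")  # e.g., 99.999% -> 5.3 minutes
    else:
        print(f"{pct:>7}%  {hours:6.1f} hours")         # e.g., 99.0% -> 87.6 hours
```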
Business applications at major corporations require high levels of software quality and overall system availability. Servers with system availability below 99.9%, the threshold value for high availability, may not be adequate to support critical operations. As reported in Business Week ("Software Hell," 1999), at the New York Clearing House (NYCH), about $1.2 trillion in electronic interbank payments are cleared each day by two Unisys Corporation mainframe computer systems. The software was developed for operations that must not fail, and the code is virtually bug-free. For the seven years prior to the Business Week report, NYCH had clocked just 0.01% downtime; in other words, its system availability was 99.99%. This level of availability is a necessity because if one of these systems were down for a day, the ramifications would be enormous; banks would consider it a major international incident. The same report indicated that NYCH also has some PC servers, used mostly for simple communications programs. These systems were another story with regard to reliability and availability: they crashed regularly, and there was a paucity of tools for diagnosing and fixing problems.
In a study of the cost of server ownership at enterprise resource management (ERM) customer sites, the consulting firm IDC (2001) compared the availability of three server platforms, which we relabeled as platforms A, B, and C for this discussion. The availability of these three categories of servers for ERM solutions was 99.98%, 99.67%, and 99.90%, respectively. Because system availability has a direct impact on user productivity, IDC called the availability-related metrics productivity metrics. Table 13.2 shows a summary of these metrics; for details, see the original IDC report.
Table 13.2. Availability-Related Productivity Metrics for Three Server Platforms for ERM Solutions
| User Productivity | Platform A Solution | Platform B Solution | Platform C Solution |
|---|---|---|---|
| Unplanned Downtime Hours per Month | 0.24 | 2.7 | 1.0 |
| Percent of Internal Users Affected | 42 | 63 | 53 |
| Unplanned User Downtime (Hours per Year per 100 Users) | 1,235 | 20,250 | 6,344 |
| Availability (%) | 99.98 | 99.67 | 99.90 |

From IDC white paper, "Server Cost of Ownership in ERM Customer Sites: A Total Cost of Ownership Study," by Jean S. Bozman and Randy Perry. Copyright 2001, IDC, a market intelligence research firm. Reprinted with permission.
System availability, or platform availability, is a combination of hardware and software availability. The relationship between system availability and component availability is an "AND" relationship, not an "OR" relationship: the system is available only when all of its components are available, so (assuming independent failures) the component availabilities multiply. To achieve a certain level of system availability, the availability of each component has to be higher. For example, if the availability of a system's software is 99.95% and that of the hardware is 99.99%, then the system availability is 99.94% (99.99% x 99.95%).
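A minimal sketch of this multiplication, and of working backward to the component availability implied by a system-level target (the function and variable names are ours, for illustration):

```python
# "AND" relationship: the system is up only when every component is up,
# so (assuming independent failures) component availabilities multiply.

def system_availability(*components: float) -> float:
    """System availability as the product of component availabilities."""
    result = 1.0
    for a in components:
        result *= a
    return result

software, hardware = 0.9995, 0.9999
print(f"{system_availability(software, hardware):.4%}")  # 99.9400% -> 99.94%

# Working backward: to hit a 99.95% system target with hardware at 99.99%,
# the software alone must reach at least target / hardware:
target = 0.9995
print(f"{target / hardware:.4%}")  # ~99.96% -- higher than the system target itself
```

Note that the required software availability (about 99.96%) exceeds the system-level target of 99.95%, illustrating why each component must be more available than the system as a whole.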