In-Process Metrics for Outage and Availability
Sound architecture and good design are key to product reliability and availability. Root causes and lessons learned from customer outages in the field can be used to improve the design points for the next release of the product. For in-process metrics while the product is under development, however, we do not recommend tracking outages and availability during the early phases of testing. Such tracking should be done during product-level testing or during the final system test phase in a customer-like environment. During the early phases of testing, the defect arrival volume is high and the objective is to flush out functional defects before the system stabilizes. Tracking and focus at these phases should be on testing progress, defect arrivals, and defect backlog. When the system is achieving good stability, normally during the final phase of testing, metrics for tracking system availability become meaningful. In Chapter 10, we discuss and recommend several metrics that measure outages and availability: the number and trend of system crashes and hangs, CPU utilization, and mean time to unplanned IPL (initial program load, or reboot). While some metrics may require tools, resources, and a well-established tracking system, tracking system crashes and hangs can be done with paper and pencil and can be implemented easily by small teams.
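As a rough illustration of how little tooling the crash-and-hang tally needs, the following Python sketch counts events per week and computes a mean time to unplanned IPL. The event log, the function names, and the run-hour figure are all hypothetical, and the sketch assumes the metric is total system run hours divided by the number of unplanned IPLs; the passage above does not spell out the formula.

```python
from collections import Counter
from datetime import datetime

# Hypothetical log transcribed from a paper tally: (timestamp, event kind).
EVENTS = [
    ("2024-03-04 09:15", "crash"),
    ("2024-03-06 14:40", "hang"),
    ("2024-03-12 11:05", "crash"),
    ("2024-03-20 16:30", "crash"),
]


def weekly_counts(events):
    """Count crashes and hangs per ISO (year, week) to expose the trend."""
    counts = Counter()
    for stamp, _kind in events:
        year, week, _weekday = datetime.strptime(
            stamp, "%Y-%m-%d %H:%M"
        ).isocalendar()
        counts[(year, week)] += 1
    return dict(sorted(counts.items()))


def mean_time_to_unplanned_ipl(total_run_hours, unplanned_ipls):
    """Average run hours per unplanned IPL (assumed definition)."""
    if unplanned_ipls == 0:
        return None  # no unplanned IPL observed, so the mean is undefined
    return total_run_hours / unplanned_ipls


if __name__ == "__main__":
    for (year, week), count in weekly_counts(EVENTS).items():
        print(f"{year}-W{week:02d}: {count} crash/hang event(s)")
    # Assumes, for illustration only, 400 run hours and one IPL per event.
    print(mean_time_to_unplanned_ipl(400, len(EVENTS)))
```

A falling weekly count together with a rising mean time to unplanned IPL would be consistent with the system stabilizing during the final test phase.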
For projects that have a beta program, we recommend tracking customer outages in beta, especially for those customers who migrated their production runs to the new release. These outages deserve the same focus as field outages. Outages during the beta program can also be used as a predictive indicator of system outages and availability in the field after the product is shipped. The difference is that during beta there is still a chance to take improvement actions before the product is made available to the entire customer population. We have tracked system crashes during customer beta for several years. Because the numbers are small, we have not yet established a parametric correlation between beta outages and field outages. But using nonparametric (rank-order) correlation methods and comparing releases, we did see a positive correlation between the two: the more crashes during beta, the more outages and the lower the system availability in the field.
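One common rank-order method is Spearman's rank correlation, which suits small samples because it uses only the ordering of the values. The Python sketch below is an illustration under stated assumptions: the text does not name the specific nonparametric method used, and the per-release figures are invented, not data from the projects described.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples of size >= 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one sample is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical figures for five releases (illustrative only).
beta_crashes = [3, 7, 5, 12, 9]
field_outages = [10, 18, 20, 41, 30]

if __name__ == "__main__":
    print(round(spearman_rho(beta_crashes, field_outages), 3))
```

A value near +1 means that releases with more beta crashes also tend to rank higher in field outages. With only a handful of releases, the coefficient is best read as a directional signal rather than a precise estimate.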