Statistical Process Control in Software Development
For manufacturing production, the use of control charts is synonymous with statistical process control (SPC) or statistical quality control (SQC). In Chapter 5 we discussed the differences between software development and manufacturing processes. The use of control charts in software development, while helpful, is far from achieving SPC for the many reasons discussed. The most important reason is that there is no established correlation between the mean and control limits of a control chart (of some in-process parameter) and the level of end-product quality, so the meaning of process capability is not established. The quality of the delivered software is also affected by many factors. Nor is it meaningful to use a control chart for final product quality: by the time those data are available, the software development is complete, and it is no longer possible to effect changes. In manufacturing production the unit of analysis is parts, and the production process is an ongoing, real-time process. A certain process mean and control limits are associated with a specific quality level (e.g., defective parts per million). When the process is out of control, the samples or the parts involved are excluded from the final deliverables.
It appears that control charts in software are a tool for enhancing consistency and stability with regard to process implementation. With ingenuity and meaningful application, control charting can eventually be used as a high-level process control for product quality at the organizational level (see the discussion on control-charting defect removal effectiveness in Chapter 5). The use of control charts is more appropriate at the process and organizational levels, and for the maintenance phase, rather than at the project level during the development process.
At the project level, software development appears to be far more complex and dynamic than the scenarios that control charts describe: the dynamics of design and development (versus production), the phases of a development process, the life-cycle concepts, the S-curve patterns, the complexities involving predictive validity, the effort and outcome nature of in-process metrics, and so forth. One needs to employ a variety of tools, methods, and techniques, together with effective project management, in order to achieve the state of statistical quality control, in the general meaning of the term. We contend that the metrics, models, methods, and techniques discussed in this book, of which the control chart is one, are what is needed to achieve SPC or SQC in software development. We need an overall defect model (or models) or defect removal strategy for the overall plan. We need phase-based parameters and specific models to relate the implementation status at various development phases to the end-product quality outlook. We need models with theoretical backing and, at the same time, relevance to the actual experience of the organization. We need various metrics and measurements to support the models. We need to employ the effort/outcome model to make sure that we are reading the in-process quality status of the project correctly. We need comparisons to baselines, or projection models, to derive the quality outlook of the product when it is delivered.
The metrics and models ought to be able to relate in-process characteristics to end-product quality. Control charts and other quality control tools can then be used to enhance the chance that the model parameter values will be achieved. For example, if the model sets a certain in-process-escape-rate target for the product (e.g., Figure 9.19 in Chapter 9), control charts can be used to identify the components or the inspections that fall outside the control limits, as the sketch below illustrates. Without a good overall model, the use of control charts may become a piecemeal approach. When a project meets the in-process targets according to the metrics and models used, and at the end achieves the end-product quality goal, then we can say it is under statistical process control, in the relaxed sense of the term.
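To make the inspection example concrete, the following sketch computes 3-sigma control limits from a historical baseline of inspection escape rates and flags current inspections that fall outside them. All of the data, the component names, and the use of an individuals-style chart are illustrative assumptions, not values or procedures from the text.

```python
# Sketch: flag inspections whose in-process escape rates fall outside
# 3-sigma control limits derived from a historical baseline.
# All data and names here are hypothetical.
import statistics

# Hypothetical escape rates (%) from past inspections (the baseline).
baseline = [4.1, 5.0, 3.8, 4.6, 5.4, 4.9, 5.2, 4.4]

center = statistics.mean(baseline)
sigma = statistics.stdev(baseline)      # sample standard deviation
ucl = center + 3 * sigma                # upper control limit
lcl = max(center - 3 * sigma, 0.0)      # lower limit; a rate cannot be negative

# Hypothetical escape rates from the current project's inspections.
current = {"component_A": 4.7, "component_B": 12.3, "component_C": 5.1}

flagged = {name: rate for name, rate in current.items()
           if not (lcl <= rate <= ucl)}

print(f"center = {center:.2f}%, LCL = {lcl:.2f}%, UCL = {ucl:.2f}%")
print("outside control limits:", flagged)   # component_B in this example
```

As the surrounding discussion argues, such flagging is useful only in the context of an overall model that ties the escape-rate target to end-product quality; by itself it remains a piecemeal signal.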
The preceding statements are confined to the narrow meaning of quality control. In a broader sense, other parameters such as schedule, resources, and cost should be included. To achieve the broader meaning of quality control, good planning and project management, an effective development process, effective use of metrics and models, and all the key factors related to process, people, and technology have to function well together.
Finally, with regard to the use of statistical methods in software development, I offer my observations on tests of significance and the forming of confidence intervals. First, we found that testing for differences between two patterns (e.g., between a defect arrival curve during system test and a target curve, or between two code integration patterns) is useful, and I recommend the Kolmogorov-Smirnov test (Rohatgi, 1984). Second, the traditional practices of forming 95% confidence intervals and using the 5% probability (α = 0.05) for rejecting the null hypothesis in tests of significance deserve a closer look. Our experience indicates that the 95% confidence intervals are often too wide to be useful (even when other factors, such as the multiple common causes in control charting, are accounted for). Using the 5% probability as the basis for significance tests, we seldom found statistically significant differences, even when the situations warranted additional actions and would cause measurable differences in the end product's quality level.
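As one way to apply the two-sample Kolmogorov-Smirnov test to such pattern comparisons, the sketch below compares a hypothetical weekly defect arrival curve against a target curve. Encoding each curve as a sample of arrival weeks is an assumption of this sketch, as are all the counts; the text does not prescribe this exact procedure.

```python
# Sketch: two-sample Kolmogorov-Smirnov test comparing a defect arrival
# pattern against a target pattern. All data here are hypothetical.
from scipy import stats

# Hypothetical defect arrivals per week of system test.
actual_counts = [3, 8, 15, 22, 18, 12, 7, 4, 2, 1]
target_counts = [5, 12, 20, 19, 14, 9, 6, 3, 2, 1]

# Expand the weekly counts into samples of arrival weeks so the K-S test
# compares the two empirical distributions of when defects arrive.
actual = [wk for wk, n in enumerate(actual_counts, start=1) for _ in range(n)]
target = [wk for wk, n in enumerate(target_counts, start=1) for _ in range(n)]

result = stats.ks_2samp(actual, target)
print(f"D = {result.statistic:.3f}, p = {result.pvalue:.3f}")
# A small p-value suggests the arrival pattern deviates from the target;
# per the discussion that follows, the alpha used to judge it deserves scrutiny.
```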
In tests of statistical significance, a null hypothesis and an alternative hypothesis are involved. The alternative hypothesis is normally the research hypothesis, and the null hypothesis usually assumes there is no difference (between the two items being tested). The rationale is to use some probability level as the criterion for accepting or rejecting the null hypothesis, and therefore supporting or refuting the alternative (research) hypothesis. If the null hypothesis is true and it is rejected via the test, then we commit an error. The rejection of a true null hypothesis is called a type I error, α. If the null hypothesis is false but we fail to reject it via the test, then we also commit an error. The acceptance of a false null hypothesis is called a type II error, β. When α is set at the 0.05 level, we are saying that we are accepting a 5% type I error: if there is no difference in the items being compared, 5 out of 100 times we will say there is a difference. The calculation of the type II error and the subject of the power of the test are more complicated. But in general, the smaller the type I error, the larger the type II error, and the weaker the power of the test in detecting meaningful differences. In other words, using a smaller α value in significance tests will reduce false alarms, but at the same time the probability of failing to detect real differences (false negatives) will increase.
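A short simulation can make this trade-off visible. The sketch below uses a two-sample t-test on invented normal data (the distributions, effect size, and sample size are all assumptions of this illustration) and estimates, for several α levels, the type I error rate when there is no real difference and the power when a real difference exists.

```python
# Sketch: simulate the trade-off between type I error (alpha) and
# type II error (beta = 1 - power). All parameters are illustrative.
import random
from scipy import stats

random.seed(42)
N_TRIALS = 2000      # simulated experiments per condition
SAMPLE_SIZE = 30     # observations per group
EFFECT = 0.5         # true shift, in standard deviations, when H0 is false

def rejection_rate(shift: float, alpha: float) -> float:
    """Fraction of trials in which a two-sample t-test rejects H0 at alpha."""
    rejections = 0
    for _ in range(N_TRIALS):
        a = [random.gauss(0.0, 1.0) for _ in range(SAMPLE_SIZE)]
        b = [random.gauss(shift, 1.0) for _ in range(SAMPLE_SIZE)]
        if stats.ttest_ind(a, b).pvalue < alpha:
            rejections += 1
    return rejections / N_TRIALS

for alpha in (0.05, 0.10, 0.20):
    type1 = rejection_rate(0.0, alpha)     # H0 true: rejections are false alarms
    power = rejection_rate(EFFECT, alpha)  # H0 false: rejections detect the shift
    print(f"alpha = {alpha:.2f}: type I ~ {type1:.3f}, "
          f"power ~ {power:.3f}, beta ~ {1 - power:.3f}")
```

In this example, raising α from 0.05 to 0.20 increases the false-alarm rate but noticeably shrinks β, which is the direction of the trade made in the next paragraph.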
Therefore, the use of a certain α criterion in tests for significance is relative. When the null hypothesis (of no difference) is accepted, the situation merits careful thought. A nonsignificant result does not prove that the null hypothesis is correct; it merely means that the data do not give adequate grounds for rejecting it. In setting the α level for significance tests, the consequences of wrongly assuming the null hypothesis to be correct should also be weighed. For example, an assumption that two medications have the same frequency of side effects may not be critical if the side effects are minor temporary annoyances. If the side effects can be fatal, it is a different matter. In software development, a false alarm probably means that additional improvement actions will be taken. Failing to detect a meaningful difference in the in-process indicators, on the other hand, may result in less than desirable quality in the field. In our experience, many situations warrant alerts even when a nonsignificant result is obtained using α = 0.05. As a result, we have used α levels of 0.10, 0.15, and even 0.20 in our applications, as warranted by the specific situations. Our recommendation is not to rely solely on the traditional criteria for conducting significance tests, forming confidence intervals, or setting control limits. It is always beneficial to use empirical experience (e.g., what level of deterioration in the in-process indicators will cause a measurable difference in field quality) to substantiate or to adjust those criteria. For those who challenge the probability assumptions of software reliability models, such as the argument discussed in the last section, control limits, confidence intervals, and significance tests that are based on probability distributions would not even be meaningful.