Data Quality Control
Few would dispute that software measurement is critical and necessary to provide a scientific basis for software engineering. What is measured is improved. For software development to be a true engineering discipline, measurement must be an integral part of the practice. For measurements, metrics, and models to be useful, the data quality must be good. Unfortunately, software data are often error-prone (as discussed in Chapter 4). In our view, the data quality issue is a major obstacle to wide acceptance of software measurement in practice; it is perhaps even more important than the techniques of metrics and models per se, because it is the most basic element on which those techniques build. Garbage in, garbage out: without adequate accuracy and reliability in the raw data, the value added by metrics, models, and analysis diminishes. Therefore, strong focus should be placed on data quality in the collection and analysis process. Data collection and project tracking must include validation as an integral element, and any analysis or modeling work should assess the validity and reliability of the data and their potential impact on the findings (as discussed in Chapter 3).
Note that the data quality problem goes far beyond software development; it appears to permeate the entire information technology and data processing industry. The accuracy of data in many databases is surprisingly low; error rates of roughly 10% are not uncommon (Huh et al., 1992). In addition to accuracy, the most pertinent issues in data quality appear to be completeness, consistency, and currency. Furthermore, the magnitude of the problem often multiplies when databases are combined and when organizations update or replace applications. These problems usually result in unhappy customers, useless reports, and financial loss. In a survey conducted by Information Week (Wilson, 1992), 70% of the responding information system (IS) managers said their business processes had been interrupted at least once by bad data. The most common causes were inaccurate entry, 32%; incomplete entry, 24%; error in data collection, 21%; and system design error, 15%. Information technology has permeated every facet of the institutions of modern society, so the impact of poor-quality data is enormous. In recent years, the data quality in many business and public databases does not seem to have improved. However, because data mining as a way of improving business has been receiving attention, this could be a starting point for data quality improvement in the business world. Business requirements could become a driving force for the improvement.
In the fields of quality and software engineering, experts have noticed the implications of poor data quality and have started making efforts toward improvement. For instance, back in 1992, at the International Software Quality Exchange conference (ISQE, 1992), organized by the Juran Institute, members of the panel on prospective methods for improving data quality discussed their experiences with several data quality improvement methods. The proposed approaches included:
- Engineering (or reengineering) the data collection process and entry processes for data quality
- Human factors in data collection and manipulation for data quality
- Establishing joint information systems and process designs for data quality
- Data editing, error localization, and imputation techniques
- Sampling and inspection methods
- Data tracking: follow a random sample of records through the process to trace the root sources of error in the data collection and reporting process
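A few of these approaches (data editing, error localization, simple imputation, and sampling records for tracking) can be sketched in code. The sketch below is illustrative only: the defect-record fields, validity rules, and imputation-by-mean strategy are assumptions, not methods prescribed by the sources cited above.

```python
import random

# Hypothetical defect records; field names and validity rules are illustrative.
records = [
    {"id": 1, "severity": 2, "phase": "test", "hours": 3.5},
    {"id": 2, "severity": 9, "phase": "test", "hours": 1.0},  # severity out of range
    {"id": 3, "severity": 3, "phase": "", "hours": None},     # incomplete record
]

def edit_checks(rec):
    """Data editing: return a list of validity errors found in one record."""
    errors = []
    if not 1 <= (rec.get("severity") or 0) <= 4:
        errors.append("severity out of range 1-4")
    if not rec.get("phase"):
        errors.append("missing phase")
    if rec.get("hours") is None:
        errors.append("missing hours")
    return errors

# Error localization: flag each record that fails any edit check.
flagged = {rec["id"]: edit_checks(rec) for rec in records if edit_checks(rec)}

# Simple imputation: fill a missing numeric field with the mean of valid values.
valid_hours = [r["hours"] for r in records if r["hours"] is not None]
mean_hours = sum(valid_hours) / len(valid_hours)
for rec in records:
    if rec["hours"] is None:
        rec["hours"] = mean_hours

# Data tracking: draw a random sample of records to trace back through the
# collection and reporting process to the root sources of error.
sample = random.sample(records, k=2)
```

In practice the edit rules would come from the measurement specifications for the project, and flagged records would be routed back to their source for correction rather than silently imputed.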
Emerging trends are encouraging for those who are concerned with data quality and the practice of metrics and measurements in software engineering. First, more experts have started addressing data quality issues and providing advice on the data collection process, measurement specifications and procedures, and data validation methods at conferences and in publications. Second, the practice of metrics and measurements appears to have been gaining wide acceptance by development teams and organizations in their software engineering efforts. More usage and analysis of data will certainly drive improvements in data quality and enhance the focus on data validation as a key element of the data collection and analysis process.
Of course, in the practice of software metrics and measurement, data quality control is just a starting point. The process hinges on translating raw data into information and then into knowledge that can lead to effective actions and results.
Raw data → Information → Knowledge → Actions → Results
To translate raw data into meaningful information, we need metrics and models. To translate information into knowledge, we need analysis of the metrics and models in the context of the team's experience. To formulate effective actions, we further need analysis of cause-and-effect relationships and sound decision making. To support action implementation and to evaluate the results, we again need data, measurements, metrics, and models.