Debugging by Thinking: A Multidisciplinary Approach (HP Technologies)

11.4 Software-defect root causes

This section documents an approach to categorizing software-defect root causes. It is a synthesis of the best of several approaches used in different software-development organizations.

11.4.1 General information

If you’re going to keep records on the root causes of individual defects, you need to keep some general information for identification purposes.

The first kind of general information that it’s valuable to track is who reported the defect. Keeping individual names isn’t necessary. Just track them by the following categories:

How can you use this information? Consider the situation where you have many memory-allocation problems. If they aren’t being caught until they reach external customers, the cost of assisting those customers and the potential for lost business are considerable. A wise manager could easily make the case for investing a significant sum in licensing a tool that finds such problems automatically.

The second kind of general information that is valuable to track is the method or tool that was used to identify the defect. Here is a list of categories of methods and tools to track:

How can you use this information? Consider the situation where you’re detecting most of your defects by periodically rotating testing through parts of a very large test suite. This may consume most or all of your machine resources at night or on the weekends. If it’s effective, but the rotation has a long period, defects remain in the product until the rotation completes. A wise manager could easily make the case for investing in additional hardware resources to reduce the period of the rotation.

The third kind of general information that is valuable to track is the scope of the defect. The particular names used for these levels depend on the programming language you’re using. You can categorize the scope of a defect using these categories for a large number of popular languages:

How can you use this information? Consider the situation where you’re asked to assess the risk of a change to fix a given defect. The wider the scope, the more risk there is in accidentally breaking something else while making a correction.

The fourth kind of general information that is valuable to track is how long it took to diagnose and correct the defect. You can categorize these times into the following orders of magnitude:

How can you use this information? Consider the situation where you’re assigned a very high-priority defect submitted by a customer. Your supervisor wants to know when you will have it fixed. If you have a reasonably large database of analyses, you should be able to give an estimate based on the symptom: the average time taken for past defects with that symptom, with two standard deviations added for safety. Once you have diagnosed the root cause, you can update your estimate for the time remaining with a high degree of precision.
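The estimate described above can be sketched in a few lines of Python. This is only an illustration: the function name padded_estimate and the sample repair times are hypothetical, and the sketch assumes you record actual hours rather than only orders of magnitude.

```python
from statistics import mean, stdev


def padded_estimate(hours):
    """Return the mean of past repair times plus two sample standard deviations."""
    if len(hours) < 2:
        raise ValueError("need at least two past defects to estimate the spread")
    return mean(hours) + 2 * stdev(hours)


# Hypothetical repair times, in hours, for past defects with the same symptom.
past_hours = [2.0, 4.0, 4.0, 6.0]
estimate = padded_estimate(past_hours)
print(f"Estimated time to fix: {estimate:.1f} hours")
```

With more history for a symptom, the spread is better characterized and the padding becomes more trustworthy.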

11.4.2 Symptom description

The first kind of symptom description that is valuable to track is the character of runtime problems. Most defects show up when the application is executed. Here is a list of categories of symptoms that show up at run time:

These symptom categories are valuable in estimating the time it’s likely to take to diagnose and correct a defect. Problems with display cosmetics and input validation tend to be highly localized and more easily fixed. Problems where an application silently generates wrong results, or never terminates, or terminates prematurely, tend to be much more difficult to track down.

The second kind of symptom description that is valuable to track is the character of link-time problems. Some defects show up when an application is linked. These defects are more likely to be in other people’s software, particularly libraries and middleware. If these problems aren’t with other people’s software, they’re often a result of a faulty build process. Here is a list of categories of symptoms that show up at link time:

The third kind of symptom description that is valuable to track is the character of compile-time problems. Some defects show up when an application is compiled. These defects are more likely to be in other people’s software, particularly compilers and programming tools. Here is a list of categories of symptoms that show up at compile time:

11.4.3 Design errors

Now we get into the actual root-cause descriptions. We divide these into two major categories: design and coding errors. The first section describes errors that occur during the design phase.

We don’t consider problem reports that originated in erroneous requirement specifications. Our view of software defects is that a defect occurs when there is a difference between what the user, the customer, or their representative asked for and what the software actually does. The requirements are what was asked for; by definition, there can be no difference between the requirements and themselves. If user requirements don’t match what is communicated to the people developing the software, the methods described in this book won’t help with diagnosing the resulting problems.

11.4.3.1 Data structures

The first category of design defects includes those caused by a deficient data-structure design. Here is a list of possible data-structure design defects:

In problems with data definitions, these categories refer both to individual data elements and to complex data aggregates. The following are examples of data-structure errors.

A data definition is missing when it’s referred to in an algorithm but isn’t defined. A data definition is incorrect when references to the entity in an algorithm are meaningful only if it has a different data type, domain, or structure than the one specified. A data definition is unclear when it can be implemented in more than one way and at least one of the ways will cause an algorithm to work incorrectly. A data definition is contradictory when it’s contained in more than one other data structure in inconsistent ways. Data definition items are out of order when two or more algorithms depend on different orderings of the items.

In problems with shared-data access, these categories refer to synchronization required by multithreading, multiprocessing, or specialized parallel execution. A shared-data access control is missing when two independent streams of instructions can update a single data item, and there is no access control on the item. A shared-data access control is incorrect when two independent streams of instructions can update a single data item, and the access control can cause deadlock. A shared-data access control is out of order when two independent streams of instructions can update a single data item, and the access control allows updates to occur in invalid order.
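As a minimal sketch of the first case, the following Python code shows a shared counter that several threads update. The Counter class and the run helper are hypothetical names chosen for illustration. The lock is the access control; if you remove it, the read-modify-write in increment is no longer protected, and updates from different threads can be lost.

```python
import threading


class Counter:
    """A shared data item with an access control (a lock) around updates."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Without the lock, two threads could read the same old value
        # and one of the two updates would be lost.
        with self._lock:
            self.value += 1


def run(counter, threads=4, per_thread=10_000):
    """Update the counter from several independent streams of instructions."""

    def work():
        for _ in range(per_thread):
            counter.increment()

    workers = [threading.Thread(target=work) for _ in range(threads)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return counter.value


total = run(Counter())
print(total)
```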

11.4.3.2 Algorithms

The second category of design defects includes those caused by a deficient algorithm design. Here is a list of possible algorithm design defects:

The following are examples of algorithm errors. A logic sequence is missing if a possible case in an algorithm isn’t handled. A logic sequence is superfluous if it’s logically impossible for a case in an algorithm to be executed. A logic sequence is incorrect if it doesn’t compute values as described in the requirements specification. A logic sequence is out of order if it attempts to process information before that information is ready.

An input check is missing if invalid data can be received from a user or a device and it isn’t rejected by the algorithm. An input check is superfluous if it’s logically impossible for input data to fail the test. An input check is incorrect if it detects some, but not all, invalid data. An input check is out of order if invalid data can be processed by an algorithm, prior to being checked for validity.
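To illustrate an incorrect input check, here is a small Python sketch. Both function names and the 0 to 100 range are assumptions made for the example. The first version detects some invalid data (negative values) but not all of it (values above 100); the second checks the whole domain.

```python
def parse_percentage_incomplete(text):
    """Incorrect check: rejects negative values but accepts values over 100."""
    value = int(text)
    if value < 0:
        raise ValueError(f"percentage out of range: {value}")
    return value


def parse_percentage(text):
    """Checks both bounds before the value is handed to the algorithm."""
    value = int(text)
    if not 0 <= value <= 100:
        raise ValueError(f"percentage out of range: {value}")
    return value


# The incomplete check lets invalid input through unnoticed.
print(parse_percentage_incomplete("150"))
```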

An output definition is missing if information specified in the requirements document isn’t displayed. An output definition is superfluous if it’s logically impossible for the output to ever be generated. An output definition is incorrect if values displayed to the user aren’t in the correct format. An output definition is out of order if information specified in the requirements document isn’t displayed in a sequence that is useful to the user.

A special-condition handler is missing if it’s possible for an exception to be generated and there is no handler, implicit or explicit, to receive it. A special-condition handler is superfluous if it’s logically impossible for an exception to be generated that would activate the handler. A special-condition handler is incorrect if it doesn’t resolve the cause of an exception and continues processing. A special-condition handler is out of order if a handler for a more general exception is executed before a handler for a more specific exception.
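The out-of-order case is easy to show in Python, which runs the first except clause that matches the raised exception. In this sketch, with hypothetical function names, the general handler placed first makes the specific handler unreachable.

```python
def lookup_general_first(table, key):
    """Handlers out of order: the KeyError clause can never run."""
    try:
        return table[key]
    except Exception:
        return "general handler"
    except KeyError:
        return "specific handler"


def lookup_specific_first(table, key):
    """Handlers in order: the most specific exception is tested first."""
    try:
        return table[key]
    except KeyError:
        return "specific handler"
    except Exception:
        return "general handler"


print(lookup_general_first({}, "missing"))
print(lookup_specific_first({}, "missing"))
```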

11.4.3.3 User-interface specification

Here is a list of possible user-interface specification defects:

The following are examples of user-interface specification errors. A user-interface specification item is missing if there is no way for a user to select or enter a required value. A user-interface specification item is superfluous if it allows the user to select or enter a value that is ignored. A user-interface specification item is incorrect if it specifies the display of information in a format that is incompatible with the data type.

A user-interface specification item is unclear if it’s possible to implement in two different ways, such that one will cause the user to enter or select the wrong information. User-interface specification items are out of order if the target users speak a language that is read left to right, and the user must work right to left or bottom to top.

11.4.3.4 Software-interface specification

Here is a list of possible software-interface specification defects:

The following are examples of software-interface specification errors. A software-interface specification item is missing if the name, the data type, the domain, or the structure of a parameter to a procedure or system call has been omitted. A software-interface specification item is superfluous if the parameter is never used in one of the algorithms of the design. A software-interface specification item is incorrect if the name, the data type, the domain, or the structure of a parameter to a procedure or system call is inconsistent with the usage of that parameter in another part of the design.

A software-interface specification item is unclear if a parameter to a procedure or system call can be implemented in more than one way and at least one of the ways will cause an algorithm to work incorrectly. Software-interface specification items are out of order if the parameters to a procedure or system call are permuted from their usage in one or more algorithms.

11.4.3.5 Hardware-interface specification

Here is a list of possible hardware-interface specification defects:

Hardware-interface specification errors are analogous to their software counterparts.

11.4.4 Coding errors

In this section, we describe errors that occur during the coding phase. We have used general terms to describe programming constructs wherever possible. We have also collected the union of the sets of coding errors possible in a number of programming languages. This means that if you’re only familiar with one language, such as Java, some of the descriptions may not be familiar to you.

11.4.4.1 Initialization errors

Here is a list of possible initialization errors:

In these cases, a simple variable refers to a variable that can be stored in a hardware data type, to an element of an array, or to a member of an inhomogeneous data structure, such as a structure or class. An aggregate variable refers to composite data structures, whose members are referred to by ordinal number, such as arrays, or by name, such as a C struct or C++ class. When either a simple or aggregate variable is sometimes uninitialized, there is some control-flow path along which the variable isn’t initialized.
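A sometimes-uninitialized variable can be sketched in Python as follows; the function names are hypothetical. In the first function, the path where count is zero never assigns label. Python reports this at run time with an UnboundLocalError, whereas a language such as C would silently use whatever value happened to be in memory.

```python
def label_sometimes_unset(count):
    """One control-flow path (count == 0) leaves label unassigned."""
    if count > 0:
        label = "positive"
    elif count < 0:
        label = "negative"
    return label


def label_always_set(count):
    """Every control-flow path assigns label before it is used."""
    if count > 0:
        label = "positive"
    elif count < 0:
        label = "negative"
    else:
        label = "zero"
    return label


print(label_always_set(0))
```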

11.4.4.2 Finalization errors

Here is a list of possible finalization errors:

11.4.4.3 Binding errors

Here is a list of possible binding errors:

The following are examples of binding errors. A variable is declared in the wrong scope when a variable of local scope hides a variable of more global scope.
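The hiding described above looks like this in Python. All names are hypothetical. An assignment inside a function creates a new local variable unless the name is declared global, so the first function leaves the module-level setting untouched.

```python
timeout_seconds = 30  # module-level (more global) variable


def current_timeout():
    """Report the module-level setting."""
    return timeout_seconds


def configure_shadowing(requested):
    # This assignment creates a local variable that hides the global one.
    timeout_seconds = requested
    return timeout_seconds


def configure_global(requested):
    # Declaring the name global binds the assignment to the intended variable.
    global timeout_seconds
    timeout_seconds = requested
    return timeout_seconds


configure_shadowing(5)
print(current_timeout())  # still the original setting
```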

Using procedure calls with the wrong number of arguments is possible in older languages such as Fortran 77 or C. Fortran 95 and C++ provide the means to prevent this error.

The two most common argument-passing mechanisms are call-by-reference and call-by-value. In languages that use call-by-reference, such as Fortran, it’s possible to pass a constant as an actual argument and then assign a new value to the corresponding formal parameter. This can result in constants having varying values!

It is possible to define functions that don’t return values in Fortran 77 or C. Fortran 95 and C++ provide the means to prevent this error as well.

11.4.4.4 Reference errors

Here is a list of possible reference errors:

In the first four cases, the statement was syntactically correct, but the programmer referred to a different procedure or variable than he or she meant. In the last case, a reference did not occur.

11.4.4.5 Static data-structure problems

Here is a list of possible errors with static data structures:

The following are examples of static data-structure errors. An item has the wrong data type when it doesn’t have sufficient precision to accommodate all legitimate values. An aggregate has the wrong size when it doesn’t have sufficient capacity to accommodate all generated values.

11.4.4.6 Dynamic data-structure problems

Here is a list of possible errors with dynamic data structures:

The following are examples of dynamic data-structure problems. An array subscript is out of bounds when it’s nonintegral, is less than the lower bound on a given dimension, or is greater than the upper bound on a given dimension. An array subscript is incorrect when it’s a valid subscript but doesn’t select the element that should have been referenced.
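Both subscript errors can be sketched with a Python list. The names below are hypothetical. An out-of-bounds subscript raises IndexError, but a negative subscript is valid in Python and counts from the end, so an off-by-one computation can silently select the wrong element.

```python
readings = [10, 20, 30]


def previous_reading_unchecked(values, i):
    """Incorrect subscript: for i == 0 this returns values[-1], the last element."""
    return values[i - 1]


def previous_reading(values, i):
    """Rejects positions that have no previous element."""
    if not 1 <= i < len(values):
        raise IndexError(f"no previous reading for position {i}")
    return values[i - 1]


# A valid subscript, but not the element that was meant.
print(previous_reading_unchecked(readings, 0))
```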

Dereferencing or freeing uninitialized pointers, null pointers, and pointers to free memory are all errors because only pointers to allocated blocks of memory can be dereferenced or freed. Freeing pointers to static or automatic memory is an error because only dynamic memory, allocated on the heap, can be explicitly freed.

11.4.4.7 Object-oriented problems

Here is a list of possible errors that occur in object-oriented code:

The following are examples of object-oriented problems. In C++, all classes that contain references or pointers need an ordinary constructor, a copy constructor, an assignment operator, and a destructor. It is a common error to omit the copy constructor or assignment operator. In Java, classes that contain references need an ordinary constructor and a copy constructor. In C++, a base class should provide destructors that are virtual.

In Java, it’s possible to provide all the method signatures required to implement an interface without actually providing the semantic content they must have. Similar problems can occur in Java and C++ with abstract and derived classes.

11.4.4.8 Memory problems

Here is a list of possible memory problems:

Memory, stack, and heap corruption are normally symptoms of problems from the dynamic data-structure section (incorrect use of pointers or array subscripts). Occasionally these are problems in their own right. In these cases, there is a related problem in the underlying hardware or operating system.

Some systems have a segmented address space, and in these systems, it’s possible to create a pointer that is invalid for a particular address space. Similarly, some systems have special requirements for the alignment of items addressed by a pointer, and it’s possible to create a pointer to a piece of data that isn’t addressable because of alignment.

11.4.4.9 Missing operations

Here is a list of possible missing operations:

Return codes or flags are normally associated with operating system calls and C functions. Most operating system calls are defined as if they were written in C, even if they were actually compiled with a C++ compiler. Because C doesn’t have exception handling, it’s necessary to return status codes. A return code of 0 usually means the operation was successful. Statements or procedure calls can be missing because they weren’t included in the algorithm design or because they were simply omitted during coding.
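Python normally reports failures with exceptions, but a few calls still return a status value. As a hedged analogue of a missing return-code check, the sketch below uses str.find, which returns -1 when the substring isn’t found. The function names are hypothetical.

```python
def text_after_marker_unchecked(line, marker="="):
    """Missing check: if the marker is absent, find() returns -1
    and the slice silently returns the whole line."""
    pos = line.find(marker)
    return line[pos + 1:]


def text_after_marker(line, marker="="):
    """Tests the status value before using it."""
    pos = line.find(marker)
    if pos == -1:
        raise ValueError(f"marker {marker!r} not found in {line!r}")
    return line[pos + 1:]


print(text_after_marker("timeout=30"))
print(text_after_marker_unchecked("timeout"))  # wrong result, no error
```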

11.4.4.10 Extra operations

Here is a list of possible extra operations:

Statements or procedure calls can be extraneous because they were included in the algorithm design under assumptions that are no longer valid or because they were duplicated during coding.

11.4.4.11 Control-flow problems

Here is a list of possible control-flow problems:

The following are examples of control-flow problems. Statements are controlled by the wrong condition when the sense of an if test is reversed. Loop iterations are off by one when either too few or too many iterations are executed. A loop terminates prematurely when the condition that controls it changes its truth value at the wrong time. A loop never terminates when the condition that controls it never changes its truth value. A case is missing in a multiway branch when the default case is applied to a situation that should have special handling. A wrong case is taken in a multiway branch when flow of control falls through from one case to another because a break statement is missing.
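Here is a short Python sketch of the off-by-one case; the function names are hypothetical. The first loop executes one iteration too few, so the last requested element is never added.

```python
def sum_first_n_off_by_one(values, n):
    """Off by one: range(n - 1) executes only n - 1 iterations."""
    total = 0
    for i in range(n - 1):
        total += values[i]
    return total


def sum_first_n(values, n):
    """Executes exactly n iterations, for indices 0 through n - 1."""
    total = 0
    for i in range(n):
        total += values[i]
    return total


data = [1, 2, 3, 4]
print(sum_first_n_off_by_one(data, 3), sum_first_n(data, 3))
```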

11.4.4.12 Value-corruption problems

Here is a list of possible value-corruption problems:

Arithmetic overflow occurs either when an integer result exceeds the largest representable integer or when the exponent of a floating-point result is positive and the value of the exponent is too large to fit in the exponent field of the floating-point representation being used. Arithmetic underflow occurs when the exponent of a floating-point result is negative and the absolute value of the exponent is too large to fit in the exponent field of the floating-point representation being used.
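These conditions can be observed from Python with some caveats. Python integers have arbitrary precision and never overflow, so the sketch simulates a 32-bit two’s-complement integer with a hypothetical helper, wrap_int32. Python floats are IEEE 754 double-precision values on the platforms in common use; the sketch assumes that representation.

```python
import math


def wrap_int32(value):
    """Simulate storing value in a 32-bit two's-complement integer."""
    value &= 0xFFFFFFFF
    return value - 0x1_0000_0000 if value >= 0x8000_0000 else value


INT32_MAX = 2**31 - 1

# Integer overflow: one past the largest value wraps to the most negative value.
wrapped = wrap_int32(INT32_MAX + 1)

# Floating-point overflow: the exponent is too large, so the result is infinity.
overflowed = 1e308 * 10

# Floating-point underflow: the exponent is too negative, so the result is zero.
underflowed = 1e-200 * 1e-200

print(wrapped, overflowed, underflowed)
```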

Precision loss occurs in floating-point computations because floating-point numbers are a finite precision approximation of real numbers. When a program performs millions of floating-point operations, if it doesn’t manage the precision loss due to this approximation carefully, the resulting values can be meaningless. An invalid floating-point comparison occurs when a floating-point number is compared for equality without a tolerance factor.
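A minimal Python sketch of an invalid floating-point comparison follows. Adding 0.1 ten times does not produce exactly 1.0, because 0.1 has no exact binary representation. The tolerance of 1e-9 passed to math.isclose is an arbitrary choice for the example; a real program should pick one that suits its data.

```python
import math

total = 0.0
for _ in range(10):
    total += 0.1  # each addition carries a small rounding error

# Invalid comparison: exact equality without a tolerance.
exact_match = (total == 1.0)

# Valid comparison: equality within a relative tolerance.
close_match = math.isclose(total, 1.0, rel_tol=1e-9)

print(exact_match, close_match)
```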

11.4.4.13 Invalid expressions

Here is a list of possible invalid expression problems:

The following are examples of invalid expression problems. A wrong variable is used when the expression is syntactically correct, but you referred to a different variable than you meant. Operator input is invalid when you provide an argument that isn’t in the domain of the operator, such as a zero divisor. An arithmetic operator is wrong when you use addition but should have used subtraction. A relational operator is wrong when you reverse the sense of a comparison.

Arithmetic expression order is wrong when you use arithmetic operator precedence, associativity, or parentheses incorrectly. Relational expression order is wrong when you use relational operator precedence, associativity, or parentheses incorrectly. A Boolean operator is wrong when you use logical OR when you should have used logical AND. Boolean expression order is wrong when you use Boolean operator precedence, associativity, or parentheses incorrectly.
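Two of these cases are sketched below in Python, with hypothetical function names. Division binds more tightly than addition, and the and operator binds more tightly than or, so omitting the parentheses changes the meaning of each expression.

```python
def average_wrong(a, b):
    """Arithmetic expression order is wrong: parsed as a + (b / 2)."""
    return a + b / 2


def average(a, b):
    """Parentheses force the addition to happen first."""
    return (a + b) / 2


def may_edit_wrong(is_admin, is_owner, is_locked):
    """Boolean expression order is wrong: parsed as
    is_admin or (is_owner and not is_locked)."""
    return is_admin or is_owner and not is_locked


def may_edit(is_admin, is_owner, is_locked):
    """Nobody may edit a locked item."""
    return (is_admin or is_owner) and not is_locked


print(average_wrong(4, 6), average(4, 6))
print(may_edit_wrong(True, False, True), may_edit(True, False, True))
```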

A term is an operator and its corresponding inputs. Inputs can be constants, variables, or function calls.

11.4.4.14 Typographical errors

Here is a list of possible typographical errors:

These are low-level mistakes that usually result in one of the other root causes listed previously. They are the direct cause, however, when they occur in data that is embedded in a program. If the data is used to control the logic of the program and isn’t there merely for printing so a human can read it, these typographical errors can cause significant problems in program execution.

11.4.4.15 External software problems

Two classes of problems arise from using software written by other people. The first class occurs when we use their software incorrectly. The second class occurs when the problem is actually in the other software.

Here is a list of possible problems with using other people’s software:

The following are examples of using other software incorrectly. A compilation is wrong when the code generation options selected are a mismatch for the target hardware architecture. A linker can be used incorrectly by binding the wrong versions of libraries to an application. The build of a program using a tool like the UNIX™ make or an Integrated Development Environment (IDE) can be wrong if not all of the dependencies have been acknowledged. In this case, some parts of the application may not be rebuilt when they should be. A library can be used incorrectly by selecting the wrong Application Programming Interface (API) call or by structuring that call incorrectly.

Here is a list of possible problems within other people’s software:

11.4.4.16 Root-cause analysis tools

You can download an interactive tool that implements the root-cause classification scheme described in this section from the supporting Web site for this book at www.debuggingbythinking.com.

It allows the programmer to enter new items from any of the categories, which are added to the current classification scheme. Every time the program terminates, it sorts the items in each category by frequency or by recency and writes the counts and dates to the file. The detailed records are stored in a standard comma-delimited ASCII file, which can be imported into any spreadsheet program for analysis.
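If you prefer a script to a spreadsheet, a comma-delimited file of this kind can also be analyzed in a few lines of Python. The sketch below is not the tool’s actual file format: the column names (date, category, root_cause) and the sample records are assumptions made purely for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical records; the real tool's column layout may differ.
SAMPLE_RECORDS = """date,category,root_cause
2024-01-05,Control flow,Loop iterations off by one
2024-01-09,Initialization,Simple variable sometimes uninitialized
2024-02-11,Control flow,Loop iterations off by one
2024-02-20,Control flow,Wrong case taken in multiway branch
"""


def count_root_causes(csv_file):
    """Count how often each (category, root cause) pair appears."""
    reader = csv.DictReader(csv_file)
    return Counter((row["category"], row["root_cause"]) for row in reader)


counts = count_root_causes(io.StringIO(SAMPLE_RECORDS))
most_common = counts.most_common(1)[0]

# List root causes from most to least frequent.
for (category, cause), n in counts.most_common():
    print(f"{n:3d}  {category}: {cause}")
```

To read a real file, pass an open file object instead of the io.StringIO wrapper.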
