Debugging by Thinking: A Multidisciplinary Approach (HP Technologies)

2017-07-07 02:10:07

11.4 Software-defect root causes

This section documents an approach to categorizing software-defect root causes. It is a synthesis of the best of several approaches used in different software-development organizations.

11.4.1 General information

If you’re going to keep records on the root causes of individual defects, you need to keep some general information for identification purposes.

The first kind of general information that it’s valuable to track is who reported the defect. Keeping individual names isn’t necessary. Just track them by the following categories:

Original developer

Project colleague

Internal partner

External customer

How can you use this information? Consider the situation where you have many memory-allocation problems. If they aren’t being caught until they reach external customers, the cost of assisting the customer and the potential for lost business are considerable. A wise manager could easily make the case for investing a significant sum in licensing a tool that finds such problems automatically.

The second kind of general information that is valuable to track is the method or tool that was used to identify the defect. Here is a list of categories of methods and tools to track:

User report

Code review

Unit testing

System testing

Periodic testing

Release testing

Static analysis tools

Dynamic analysis tools

How can you use this information? Consider the situation where you’re detecting most of your defects by periodically rotating testing through parts of a very large test suite. This may consume most or all of your machine resources at night or on the weekends. If it’s effective, but the rotation has a long period, defects remain in the product until the rotation completes. A wise manager could easily make the case for investing in additional hardware resources to reduce the period of the rotation.

The third kind of general information that is valuable to track is the scope of the defect. The particular names used for these levels depend on the programming language you’re using. You can categorize the scope of a defect using these categories for a large number of popular languages:

Methods/functions/procedures

Classes/modules/files

Packages/components/relocatables

Archives/libraries/executables

How can you use this information? Consider the situation where you’re asked to assess the risk of a change to fix a given defect. The wider the scope, the more risk there is in accidentally breaking something else while making a correction.

The fourth kind of general information that is valuable to track is how long it took to diagnose and correct the defect. You can categorize these times into the following orders of magnitude:

One hour or less

Several hours

One day

Several days

One week

Several weeks

How can you use this information? Consider the situation where you’re assigned a very high-priority defect submitted by a customer. Your supervisor wants to know when you will have it fixed. If you have a reasonably large database of analyses, you should be able to give him or her an estimate based on the symptom: the average time for defects with that symptom, with two standard deviations added for safety. Once you have diagnosed the root cause, you can update your estimate for the time remaining with a high degree of precision.
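The estimate described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the sample hours, and the function name estimate_hours are hypothetical, and the sample standard deviation is used.

```python
from statistics import mean, stdev

# Hypothetical records: (symptom category, hours to diagnose and correct).
RECORDS = [
    ("terminates prematurely", 16.0),
    ("terminates prematurely", 24.0),
    ("terminates prematurely", 8.0),
    ("terminates prematurely", 40.0),
    ("cosmetic display problem", 1.0),
    ("cosmetic display problem", 2.0),
    ("cosmetic display problem", 3.0),
]


def estimate_hours(records, symptom):
    """Return the mean time for a symptom plus two sample standard deviations."""
    hours = [h for s, h in records if s == symptom]
    if len(hours) < 2:
        # A standard deviation needs at least two observations.
        raise ValueError("need at least two records for this symptom")
    return mean(hours) + 2 * stdev(hours)
```

With only a handful of records per symptom the safety margin is very rough, which is why a reasonably large database matters.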

11.4.2 Symptom description

The first kind of symptom description that is valuable to track is the character of runtime problems. Most defects show up when the application is executed. Here is a list of categories of symptoms that show up at run time:

The application generates all correct values but in incorrect order.

The application generates only some correct values.

The application generates some incorrect values.

The application generates all correct values but with cosmetic display problems.

The application accepts bad input.

The application rejects good input.

The application runs indefinitely.

The application terminates prematurely.

These symptom categories are valuable in estimating the time it’s likely to take to diagnose and correct a defect. Problems with display cosmetics and input validation tend to be highly localized and more easily fixed. Problems where an application silently generates wrong results, or never terminates, or terminates prematurely, tend to be much more difficult to track down.

The second kind of symptom description that is valuable to track is the character of link-time problems. Some defects show up when an application is linked. These defects are more likely to be in other people’s software, particularly libraries and middleware. If these problems aren’t with other people’s software, they’re often a result of a faulty build process. Here is a list of categories of symptoms that show up at link time:

Procedure calls across components don’t match procedure interfaces.

System calls don’t match system call interfaces.

Generated code was targeted at incompatible hardware architectures.

A symbol doesn’t have a definition.

A symbol has more than one definition.

The third kind of symptom description that is valuable to track is the character of compile-time problems. Some defects show up when an application is compiled. These defects are more likely to be in other people’s software, particularly compilers and programming tools. Here is a list of categories of symptoms that show up at compile time:

The compiler accepts this invalid program.

The compiler rejects this valid program.

The compiler runs indefinitely.

The compiler terminates prematurely.

The compiler generates bad information for downstream tools.

11.4.3 Design errors

Now we get into the actual root-cause descriptions. We divide these into two major categories: design and coding errors. The first section describes errors that occur during the design phase.

We don’t consider problem reports that originated in erroneous requirement specifications. Our view of software defects is that a defect occurs when there is a difference between what the user, the customer, or a representative of either asked for and what the software actually does. The requirements are what was asked for; by definition, there can be no difference between requirements and themselves. If user requirements don’t match what is communicated to the people developing the software, the methods described in this book won’t help with diagnosing the resulting problems.

11.4.3.1 Data structures

The first category of design defects includes those caused by a deficient data-structure design. Here is a list of possible data-structure design defects:

A data definition is missing.

A data definition is incorrect.

A data definition is unclear.

A data definition is contradictory.

A data definition is out of order.

A shared-data access control is missing.

A shared-data access control is incorrect.

A shared-data access control is out of order.

In problems with data definitions, these categories refer both to individual data elements and to complex data aggregates. The following are examples of data-structure errors.

A data definition is missing when it’s referred to in an algorithm but isn’t defined. A data definition is incorrect when references to the entity in an algorithm are meaningful only if it has a different data type, domain, or structure than the one specified. A data definition is unclear when it can be implemented in more than one way and at least one of the ways will cause an algorithm to work incorrectly. A data definition is contradictory when it’s contained in more than one other data structure in inconsistent ways. Data definition items are out of order when two or more algorithms depend on different orderings of the items.

In problems with shared-data access, these categories refer to synchronization required by multithreading, multiprocessing, or specialized parallel execution. A shared-data access control is missing when two independent streams of instructions can update a single data item, and there is no access control on the item. A shared-data access control is incorrect when two independent streams of instructions can update a single data item, and the access control can cause deadlock. A shared-data access control is out of order when two independent streams of instructions can update a single data item, and the access control allows updates to occur in an invalid order.
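As a minimal sketch of the first of these cases, the Python fragment below guards a shared counter with a lock. The class and function names are hypothetical. If you remove the lock, you have the "missing access control" category: several threads perform the read-modify-write on value with nothing to keep their updates from interleaving, and updates can be lost.

```python
import threading


class Counter:
    """A shared data item with an access control (a lock) on every update."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # The lock makes the read-modify-write of self.value a single unit.
        with self._lock:
            self.value += 1


def run_workers(counter, n_threads=4, n_increments=10_000):
    """Update one counter from several independent streams of instructions."""

    def work():
        for _ in range(n_increments):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value
```

With the lock in place, run_workers(Counter()) returns exactly n_threads times n_increments.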

11.4.3.2 Algorithms

The second category of design defects includes those caused by a deficient algorithm design. Here is a list of possible algorithm design defects:

A logic sequence is missing.

A logic sequence is superfluous.

A logic sequence is incorrect.

A logic sequence is out of order.

An input check is missing.

An input check is superfluous.

An input check is incorrect.

An input check is out of order.

An output definition is missing.

An output definition is superfluous.

An output definition is incorrect.

An output definition is out of order.

A special-condition handler is missing.

A special-condition handler is superfluous.

A special-condition handler is incorrect.

A special-condition handler is out of order.

The following are examples of algorithm errors. A logic sequence is missing if a possible case in an algorithm isn’t handled. A logic sequence is superfluous if it’s logically impossible for a case in an algorithm to be executed. A logic sequence is incorrect if it doesn’t compute values as described in the requirements specification. A logic sequence is out of order if it attempts to process information before that information is ready.

An input check is missing if invalid data can be received from a user or a device and it isn’t rejected by the algorithm. An input check is superfluous if it’s logically impossible for input data to fail the test. An input check is incorrect if it detects some, but not all, invalid data. An input check is out of order if invalid data can be processed by an algorithm, prior to being checked for validity.

An output definition is missing if information specified in the requirements document isn’t displayed. An output definition is superfluous if it’s logically impossible for the output to ever be generated. An output definition is incorrect if values displayed to the user aren’t in the correct format. An output definition is out of order if information specified in the requirements document isn’t displayed in a sequence that is useful to the user.

A special-condition handler is missing if it’s possible for an exception to be generated and there is no handler, implicit or explicit, to receive it. A special-condition handler is superfluous if it’s logically impossible for an exception to be generated that would activate the handler. A special-condition handler is incorrect if it doesn’t resolve the cause of an exception and continues processing. A special-condition handler is out of order if a handler for a more general exception is executed before a handler for a more specific exception.
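The out-of-order handler case can be shown in Python, where handlers are tried in the order in which they’re written. This is a sketch only; the function names are hypothetical. FileNotFoundError is a subclass of OSError, so a handler for OSError listed first hides the more specific one.

```python
def classify_out_of_order(error):
    """The general handler comes first, so the specific one is never reached."""
    try:
        raise error
    except OSError:
        return "general"
    except FileNotFoundError:  # unreachable: FileNotFoundError is an OSError
        return "specific"


def classify_in_order(error):
    """The specific handler comes first; the general one catches the rest."""
    try:
        raise error
    except FileNotFoundError:
        return "specific"
    except OSError:
        return "general"
```

Note that the second handler in classify_out_of_order is also an instance of a superfluous handler, since it’s logically impossible for it to be activated.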

11.4.3.3 User-interface specification

Here is a list of possible user-interface specification defects:

An assumption about the user is invalid.

A specification item is missing.

A specification item is superfluous.

A specification item is incorrect.

A specification item is unclear.

Specification items are out of order.

The following are examples of user-interface specification errors. A user-interface specification item is missing if there is no way for a user to select or enter a required value. A user-interface specification item is superfluous if it allows the user to select or enter a value that is ignored. A user-interface specification item is incorrect if it specifies the display of information in a format that is incompatible with the data type.

A user-interface specification item is unclear if it’s possible to implement in two different ways, such that one will cause the user to enter or select the wrong information. User-interface specification items are out of order if the target users speak a language that is read left to right, and the user must work right to left or bottom to top.

11.4.3.4 Software-interface specification

Here is a list of possible software-interface specification defects:

An assumption about collateral software is invalid.

A specification item is missing.

A specification item is superfluous.

A specification item is incorrect.

A specification item is unclear.

Specification items are out of order.

The following are examples of software-interface specification errors. A software-interface specification item is missing if the name, the data type, the domain, or the structure of a parameter to a procedure or system call has been omitted. A software-interface specification item is superfluous if the parameter is never used in one of the algorithms of the design. A software-interface specification item is incorrect if the name, the data type, the domain, or the structure of a parameter to a procedure or system call is inconsistent with the usage of that parameter in another part of the design.

A software-interface specification item is unclear if a parameter to a procedure or system call can be implemented in more than one way and at least one of the ways will cause an algorithm to work incorrectly. Software-interface specification items are out of order if the parameters to a procedure or system call are permuted from their usage in one or more algorithms.

11.4.3.5 Hardware-interface specification

Here is a list of possible hardware-interface specification defects:

An assumption about the hardware is invalid.

A specification item is missing.

A specification item is superfluous.

A specification item is incorrect.

A specification item is unclear.

Specification items are out of order.

Hardware-interface specification errors are analogous to their software counterparts.

11.4.4 Coding errors

In this section, we describe errors that occur during the coding phase. We have used general terms to describe programming constructs wherever possible. We have also collected the union of the sets of coding errors possible in a number of programming languages. This means that if you’re familiar with only one language, such as Java, some of the descriptions may not be familiar to you.

11.4.4.1 Initialization errors

Here is a list of possible initialization errors:

A simple variable is always uninitialized.

A simple variable is sometimes uninitialized.

A simple variable is initialized with the wrong value.

An aggregate variable is always uninitialized.

An aggregate variable is sometimes uninitialized.

An aggregate variable is initialized with the wrong value.

An aggregate variable is partially uninitialized.

An aggregate variable isn’t allocated.

An aggregate variable is allocated the wrong size.

A resource isn’t allocated.

In these cases, a simple variable refers to a variable that can be stored in a hardware data type, to an element of an array, or to a member of an inhomogeneous data structure, such as a structure or class. An aggregate variable refers to a composite data structure whose members are referred to by ordinal number, as in an array, or by name, as in a C struct or C++ class. When either a simple or aggregate variable is sometimes uninitialized, there is some control-flow path along which the variable isn’t initialized.
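The "sometimes uninitialized" case is easy to reproduce. In the hypothetical Python function below, one control-flow path never assigns cost. Python detects the use at run time and raises UnboundLocalError; in a language such as C, the same mistake would silently read whatever value happened to be in memory.

```python
def shipping_cost(weight_kg):
    """Return a shipping cost; one path leaves `cost` uninitialized."""
    if weight_kg <= 1:
        cost = 5.0
    elif weight_kg <= 10:
        cost = 12.0
    # No branch assigns `cost` when weight_kg > 10, so the return below
    # fails on that path only.
    return cost
```

The defect shows up only for inputs that take the unhandled path, which is why it can survive testing that exercises the other two branches.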

11.4.4.2 Finalization errors

Here is a list of possible finalization errors:

An aggregate variable isn’t freed.

A resource isn’t freed.

11.4.4.3 Binding errors

Here is a list of possible binding errors:

A variable is declared in the wrong scope.

A procedure call is missing arguments.

A procedure call has the wrong argument order.

A procedure call has extra arguments.

The actual argument-passing mechanism doesn’t match the usage of the formal argument.

A procedure returns no value.

A procedure returns the wrong value.

The following are examples of binding errors. A variable is declared in the wrong scope when a variable of local scope hides a variable of more global scope.
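Here is a small Python sketch of a local variable hiding a more global one. All names are hypothetical. The assignment inside configure_shadowed creates a new local variable, so the module-level setting it was meant to change is untouched.

```python
TIMEOUT_SECONDS = 30  # module-level (more global) setting


def configure_shadowed(slow_network):
    """Intended to change the global setting, but binds a local instead."""
    if slow_network:
        TIMEOUT_SECONDS = 120  # new local variable; hides the global
    return None


def configure(slow_network):
    """Declares the intended scope, so the assignment reaches the global."""
    global TIMEOUT_SECONDS
    if slow_network:
        TIMEOUT_SECONDS = 120


def current_timeout():
    """Read the module-level setting."""
    return TIMEOUT_SECONDS
```

After configure_shadowed(True), current_timeout() still returns 30. After configure(True), it returns 120.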

Using procedure calls with the wrong number of arguments is possible in older languages such as Fortran 77 or C. Fortran 95 and C++ provide the means to prevent this error.

The two most common argument-passing mechanisms are call-by-reference and call-by-value. In languages that use call-by-reference, such as Fortran, it’s possible to pass a constant as an actual argument and then assign a new value to the corresponding formal parameter inside the procedure. This can result in constants having varying values!

It is possible to define functions that don’t return values in Fortran 77 or C. Fortran 95 and C++ provide the means to prevent this error as well.

11.4.4.4 Reference errors

Here is a list of possible reference errors:

The wrong procedure is called.

The wrong variable is referenced.

The wrong constant is referenced.

The wrong variable is assigned.

A variable isn’t assigned.

In the first four cases, the statement was syntactically correct, but the programmer referred to a different procedure, variable, or constant than he or she meant. In the last case, the intended assignment did not occur.

11.4.4.5 Static data-structure problems

Here is a list of possible errors with static data structures:

A simple variable has the wrong data type.

An element of an aggregate variable has the wrong data type.

An aggregate variable has the wrong aggregate size.

The following are examples of static data-structure errors. An item has the wrong data type when it doesn’t have sufficient precision to accommodate all legitimate values. An aggregate has the wrong size when it doesn’t have sufficient capacity to accommodate all generated values.
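To see what insufficient capacity in a data type looks like, the sketch below uses Python’s ctypes module to imitate a signed 16-bit field. The function names are hypothetical. ctypes does no overflow checking, so a legitimate value that doesn’t fit is silently changed.

```python
import ctypes


def store_in_16_bits(value):
    """Store an integer in a signed 16-bit field and read it back."""
    return ctypes.c_int16(value).value


def fits_in_16_bits(value):
    """Check whether the value is within the signed 16-bit range."""
    return -32768 <= value <= 32767
```

Here store_in_16_bits(1200) returns 1200, but store_in_16_bits(40000) returns -25536, because 40000 is outside the range that fits_in_16_bits accepts.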

11.4.4.6 Dynamic data-structure problems

Here is a list of possible errors with dynamic data structures:

An array subscript is out of bounds.

An array subscript is incorrect.

An uninitialized pointer has been dereferenced.

A null pointer has been dereferenced.

A pointer to freed memory has been dereferenced.

An uninitialized pointer has been freed.

A pointer stored in freed memory has been dereferenced.

A null pointer has been freed.

A pointer to freed memory has been freed.

A pointer to static memory has been freed.

A pointer to automatic (stack) memory has been freed.

The following are examples of dynamic data-structure problems. An array subscript is out of bounds when it’s nonintegral, is less than the lower bound on a given dimension, or is greater than the upper bound on a given dimension. An array subscript is incorrect when it’s a valid subscript but doesn’t select the element that should have been referenced.

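Dereferencing or freeing uninitialized pointers, null pointers, and pointers to freed memory are all errors because only pointers to allocated blocks of memory can be dereferenced or freed. (Standard C and C++ define freeing a null pointer as a no-op, so in those languages that case usually points to a logic error rather than causing a failure by itself.) Freeing pointers to static or automatic memory is an error because only dynamic memory, allocated on the heap, can be explicitly freed.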
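The difference between an out-of-bounds subscript and an incorrect one can be sketched in Python, which has no explicit pointers but does check subscripts. The names below are hypothetical. Python accepts negative subscripts, so a subscript of -1 is valid yet selects the last element rather than the one the programmer meant.

```python
READINGS = [10, 20, 30, 40]


def previous_reading(values, i):
    """Return the reading before position i.

    Incorrect subscript: when i == 0, i - 1 is -1, which is a valid
    subscript in Python and silently selects the last element.
    """
    return values[i - 1]


def reading_at(values, i):
    """Out-of-bounds subscript: Python raises IndexError when i is too large."""
    return values[i]
```

The out-of-bounds case fails loudly with reading_at(READINGS, 4). The incorrect case, previous_reading(READINGS, 0), quietly returns 40, which is usually the harder defect to find.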

11.4.4.7 Object-oriented problems

Here is a list of possible errors that occur in object-oriented code:

A class containing dynamically allocated memory doesn’t have required methods.

A base class has methods declared incorrectly for a derived class to override.

The wrong method signature has been used to invoke an overloaded method.

A method from the wrong class in the inheritance hierarchy has been used.

A derived class doesn’t completely implement the required functionality.

The following are examples of object-oriented problems. In C++, all classes that contain references or pointers need an ordinary constructor, a copy constructor, an assignment operator, and a destructor. It is a common error to omit the copy constructor or assignment operator. In Java, classes that contain references need an ordinary constructor and a copy constructor. In C++, a base class should provide a virtual destructor.

In Java, it’s possible to provide all the method signatures required to implement an interface without actually providing the semantic content they must have. Similar problems can occur in Java and C++ with abstract and derived classes.
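The same problem can be sketched in Python using an abstract base class in place of a Java interface. The class names are hypothetical. BrokenStack supplies every required method signature, so it can be instantiated, but pop doesn’t have the last-in, first-out behavior that a stack must have.

```python
from abc import ABC, abstractmethod


class Stack(ABC):
    """Interface: pop must return the most recently pushed item."""

    @abstractmethod
    def push(self, item):
        ...

    @abstractmethod
    def pop(self):
        ...


class BrokenStack(Stack):
    """Provides the signatures but not the semantic content."""

    def __init__(self):
        self._items = []

    def push(self, item):
        self._items.append(item)

    def pop(self):
        # Wrong semantics: removes the oldest item (a queue), not the newest.
        return self._items.pop(0)
```

No compiler or interpreter check catches this. After pushing 1 and then 2, pop returns 1 instead of 2.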

11.4.4.8 Memory problems

Here is a list of possible memory problems:

Memory is corrupted.

The stack is corrupted.

The stack overflows.

The heap is corrupted.

A pointer is invalid for the address space.

An address has an invalid alignment.

Memory, stack, and heap corruption are normally symptoms of problems from the dynamic data-structure section (incorrect use of pointers or array subscripts). Occasionally these are problems in their own right. In these cases, there is a related problem in the underlying hardware or operating system.

Some systems have a segmented address space, and in these systems, it’s possible to create a pointer that is invalid for a particular address space. Similarly, some systems have special requirements for the alignment of items addressed by a pointer, and it’s possible to create a pointer to a piece of data that isn’t addressable because of alignment.

11.4.4.9 Missing operations

Here is a list of possible missing operations:

The return code or flag hasn’t been set.

The return code or flag hasn’t been checked.

An exception hasn’t been thrown.

The exception thrown hasn’t been handled.

The event sequence hasn’t been anticipated.

The program state hasn’t been anticipated.

Statements are missing.

Procedure calls are missing.

Return codes or flags are normally associated with operating system calls and C functions. Most operating system calls are defined as if they were written in C, even if they were actually compiled with a C++ compiler. Because C doesn’t have exception handling, it’s necessary to return status codes. A return code of 0 usually means the operation was successful. Statements or procedure calls can be missing because they weren’t included in the algorithm design or because they were simply omitted during coding.
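Unchecked return codes aren’t limited to C. As a hedged illustration in Python, the string method find returns -1 as a flag when the substring is absent. The function names below are hypothetical. If the flag isn’t checked, the code continues with a meaningless position.

```python
def domain_unchecked(address):
    """Return the part after '@' without checking the flag from find."""
    at = address.find("@")  # returns -1 when "@" is absent
    # Flag not checked: -1 + 1 == 0, so the whole string is returned.
    return address[at + 1:]


def domain_checked(address):
    """Return the part after '@', rejecting addresses that lack one."""
    at = address.find("@")
    if at == -1:
        raise ValueError("address has no '@': " + address)
    return address[at + 1:]
```

For the input "no-at-sign", domain_unchecked returns the input unchanged as if it were a domain, while domain_checked reports the problem.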

11.4.4.10 Extra operations

Here is a list of possible extra operations:

A return code or flag is set when not needed.

An exception is thrown when not valid.

Extraneous statements are executed.

Extraneous procedure calls are executed.

Statements or procedure calls can be extraneous because they were included in the algorithm design under assumptions that are no longer valid or because they were duplicated during coding.

11.4.4.11 Control-flow problems

Here is a list of possible control-flow problems:

Statements are controlled by the wrong control-flow condition.

Loop iterations are off by one.

A loop terminates prematurely.

A loop runs indefinitely.

A case in a multiway branch is missing.

A multiway branch takes the wrong case.

A statement is executed too many times.

A statement is executed too few times.

The following are examples of control-flow problems. Statements are controlled by the wrong condition when the sense of an if test is reversed. Loop iterations are off by one when either too few or too many iterations are executed. A loop terminates prematurely when the condition that controls it changes its truth value at the wrong time. A loop never terminates when the condition that controls it never changes its truth value. A case is missing in a multiway branch when the default case is applied to a situation that should have special handling. A wrong case is taken in a multiway branch when flow of control falls through from one case to another because a break statement is missing.
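An off-by-one loop is shown in the following Python sketch. The function names are hypothetical. The first version executes one iteration too few, so the last requested element is never added.

```python
def sum_first_n_off_by_one(values, n):
    """Intended to sum the first n values, but stops one short."""
    total = 0
    for i in range(n - 1):  # executes n - 1 iterations: one too few
        total += values[i]
    return total


def sum_first_n(values, n):
    """Sum the first n values."""
    total = 0
    for i in range(n):  # executes exactly n iterations
        total += values[i]
    return total
```

For the list [1, 2, 3, 4] and n equal to 3, the first version returns 3 and the second returns 6.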

11.4.4.12 Value-corruption problems

Here is a list of possible value-corruption problems:

An arithmetic operation has underflow or overflow.

Precision is lost.

Signed and unsigned integers are mixed.

Floating-point numbers are compared incorrectly.

Arithmetic overflow occurs either when an integer result exceeds the largest representable integer or when the exponent of a floating-point result is positive and the value of the exponent is too large to fit in the exponent field of the floating-point representation being used. Arithmetic underflow occurs when the exponent of a floating-point result is negative and the absolute value of the exponent is too large to fit in the exponent field of the floating-point representation being used.

Precision loss occurs in floating-point computations because floating-point numbers are a finite-precision approximation of real numbers. When a program performs millions of floating-point operations, if it doesn’t manage the precision loss due to this approximation carefully, the resulting values can be meaningless. An invalid floating-point comparison occurs when a floating-point number is compared for equality without a tolerance factor.
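These effects can be observed with Python’s double-precision floats. The sketch below is illustrative only; the function name and the tolerance of 1e-9 are assumptions, and a suitable tolerance depends on the computation.

```python
import math


def accumulate(step, count):
    """Add `step` to a running total `count` times."""
    total = 0.0
    for _ in range(count):
        total += step
    return total


total = accumulate(0.1, 10)           # 0.9999999999999999, not 1.0
naive_equal = (total == 1.0)          # False: equality without a tolerance
tolerant_equal = math.isclose(total, 1.0, rel_tol=1e-9)  # True

# math.fsum tracks the lost low-order bits while summing.
careful_total = math.fsum([0.1] * 10)  # 1.0

overflowed = 1e308 * 10.0      # exponent too large: result is infinity
underflowed = 1e-308 * 1e-308  # exponent too small: result is 0.0
```

Ten additions are enough to make the naive comparison fail, so a program that performs millions of operations needs a deliberate strategy for both accumulation and comparison.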

11.4.4.13 Invalid expressions

Here is a list of possible invalid expression problems:

The wrong variable is used.

The operator input is invalid.

The wrong arithmetic operator is used.

The wrong arithmetic expression order is used.

The wrong relational operator is used.

The wrong relational expression order is used.

The wrong Boolean operator is used.

The wrong Boolean expression order is used.

There is a missing term.

There is an extra term.

The following are examples of invalid expression problems. When a wrong variable is used, the expression is syntactically correct, but you referred to a different variable than you meant. Operator input is invalid when you provide an argument that isn’t in the domain of the operator, such as dividing by zero. An arithmetic operator is wrong when you use addition but should have used subtraction. A relational operator is wrong when you reverse the sense of a comparison.

Arithmetic expression order is wrong when you use arithmetic operator precedence, associativity, or parentheses incorrectly. Relational expression order is wrong when you use relational operator precedence, associativity, or parentheses incorrectly. A Boolean operator is wrong when you use logical OR when you should have used logical AND. Boolean expression order is wrong when you use Boolean operator precedence, associativity, or parentheses incorrectly.
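The expression-order cases can be sketched in Python, where division binds more tightly than addition and the and operator binds more tightly than or. The functions and the access rule they encode are hypothetical.

```python
def mean_wrong(a, b):
    """Arithmetic order is wrong: parsed as a + (b / 2)."""
    return a + b / 2


def mean(a, b):
    """Parentheses force the addition to happen first."""
    return (a + b) / 2


def may_edit_wrong(is_admin, is_owner, is_locked):
    """Boolean order is wrong: parsed as is_admin or (is_owner and not is_locked)."""
    return is_admin or is_owner and not is_locked


def may_edit(is_admin, is_owner, is_locked):
    """Intended rule: admins and owners may edit, but nobody edits a locked item."""
    return (is_admin or is_owner) and not is_locked
```

For the values 2 and 4, mean_wrong returns 4.0 and mean returns 3.0. For an administrator and a locked item, may_edit_wrong returns True and may_edit returns False.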

A term is an operator and its corresponding inputs. Inputs can be constants, variables, or function calls.

11.4.4.14 Typographical errors

Here is a list of possible typographical errors:

There are missing characters.

There are extra characters.

Characters are out of order.

These are low-level mistakes that usually result in one of the other root causes listed previously. They are the direct cause, however, when they occur in data that is embedded in a program. If the data is used to control the logic of the program and isn’t there merely for printing so a human can read it, these typographical errors can cause significant problems in program execution.

11.4.4.15 External software problems

Two classes of problems arise from using software written by other people. The first class occurs when we use their software incorrectly. The second class occurs when the problem is actually in the other software.

Here is a list of possible problems with using other people’s software:

The compiler has been used incorrectly to build the application.

A software tool has been used incorrectly to build the application.

A system library bound with the application has been used incorrectly.

A third-party library bound with the application has been used incorrectly.

The following are examples of using other software incorrectly. A compilation is wrong when the code generation options selected are a mismatch for the target hardware architecture. A linker can be used incorrectly by binding the wrong versions of libraries to an application. The build of a program using a tool like the UNIX™ make or an Integrated Development Environment (IDE) can be wrong if not all of the dependencies have been declared. In this case, some parts of the application may not be rebuilt when they should be. A library can be used incorrectly by selecting the wrong Application Programming Interface (API) call or by structuring that call incorrectly.

Here is a list of possible problems within other people’s software:

The compiler used to build the application has a defect.

A software tool used to build the application has a defect.

The operating system used to execute the application has a defect.

A system library bound with the application has a defect.

A third-party library bound with the application has a defect.

11.4.4.16 Root-cause analysis tools

You can download an interactive tool that implements the root-cause classification scheme described in this section from the supporting Web site for this book at www.debuggingbythinking.com.

The tool allows the programmer to enter new items from any of the categories, which are added to the current classification scheme. Every time the program terminates, it sorts the items in each category by frequency or by recency and writes the counts and dates to a file. The detailed records are stored in a standard comma-delimited ASCII file, which can be imported into any spreadsheet program for analysis.
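If you prefer to analyze such records yourself, the following Python sketch shows one way to sort comma-delimited records by frequency or by recency. It is not the downloadable tool. The column names, the sample rows, and the function names are all hypothetical, because the tool’s file layout isn’t described here.

```python
import csv
import io
from collections import Counter

# Hypothetical comma-delimited records: date, category, root cause.
SAMPLE_CSV = """date,category,root_cause
2003-01-14,Initialization errors,A simple variable is sometimes uninitialized.
2003-02-02,Control-flow problems,Loop iterations are off by one.
2003-02-20,Control-flow problems,Loop iterations are off by one.
2003-03-05,Control-flow problems,A loop runs indefinitely.
"""


def load_records(text):
    """Parse comma-delimited text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def by_frequency(records):
    """Return ((category, root cause), count) pairs, most frequent first."""
    counts = Counter((r["category"], r["root_cause"]) for r in records)
    return counts.most_common()


def by_recency(records):
    """Return ((category, root cause), latest date) pairs, newest first."""
    latest = {}
    for r in records:
        key = (r["category"], r["root_cause"])
        # Dates in year-month-day form compare correctly as strings.
        latest[key] = max(latest.get(key, ""), r["date"])
    return sorted(latest.items(), key=lambda item: item[1], reverse=True)
```

The same file can still be opened in a spreadsheet program; the code only automates the two orderings mentioned above.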