Succeeding with Use Cases: Working Smart to Deliver Quality
What Is "Reliability"?
Use case-driven development and SRE are a natural match; both are usage-driven styles of product development. What SRE brings to the party is a discipline for focusing time, effort, and resources on use cases in proportion to their estimated frequency of use, or criticality, to maximize reliability while minimizing development costs. But what is reliability? We spent a lot of time in the last chapter talking about the operational profile as a tool for working smart to achieve reliability, but we have never actually said what reliability is. Software reliability is defined as: the probability of failure-free operation for a specified length of time in a specified environment. Though short, this definition has a lot packed into it that doesn't necessarily jump out at you on first read. There are two main themes to take notice of.

Software Reliability Is User-Centric and Dynamic
First, software reliability is defined from the perspective of a user using a system in operation. It is a user-centric, dynamic definition of reliability, as opposed to, say, faults per line of code, which is a developer-centric, static measure. Consider the phrase "specified environment." This includes the hardware, its configuration, and the user profile (e.g., whether the user is an expert, a novice, and so on). User profile is important because a system designed for use by an expert could well be unreliable in the hands of a novice. The whole "specified environment" idea is pertinent only because you are qualifying expectations of reliability in terms of a system being operated by a user. How about the phrase "failure-free operation"; what does that imply? A failure occurs when a system in operation encounters a defect or fault, causing incorrect results or behavior. A defect or fault is a static concept (it's just there, in the code), but a failure is something that can only happen when the system is in operation. So again, this is a dynamic concept implying a system in operation. A dynamic, user-centric definition of reliability is more than an academic issue; this part of the definition is at the heart of SRE's ability to deliver high reliability per development and test dollar spent. A use case with lots of defects or faults in its underlying code can seem reliable if the user spends so little time running it that none of the many bugs are found. Conversely, a use case with few defects or faults in its underlying code can seem unreliable if the user spends so much time running it that they find all those few bugs in operation. This is the concept of perceived reliability: the reliability the user experiences, as opposed to a reliability measure in terms of, say, defect density.

Software Reliability Is Quantifiable
The second key theme in the definition of reliability is that it is quantifiable; the key phrase here is "…probability of failure-free operation for a specified length of time…" In the last chapter, we saw an example of calculating the risk exposure that two hypothetical hardware widgets posed to a manufacturing machine of which they were a part: when either failed, production was shut down until it was replaced. As part of the calculation of risk, we said that one widget was of a type expected to fail once in 5,000 hours of operation and the other once in 10,000 hours of operation. These were statements about the expected failure intensity of the widgets. Failure intensity is the number of failures per some unit of time and is probably the most common way of specifying and tracking software reliability. But technically speaking, to answer the question "What is the probability of failure-free operation for a specified length of time?" we need the formula shown in Equation 4.1 (typically given with Greek letters, which unfortunately makes it "look" more complicated than it actually is), called the exponential failure law:

R(t) = e^(-λt)

Equation 4.1 Reliability is the probability of failure-free operation for a specified length of time t, which is given by this formula, called the exponential failure law.
Don't worry about committing this formula to memory: you'll see how to use it as part of a simple spreadsheet formula later in the chapter. Equation 4.1 reads like this: R(t), the reliability through a length of operating time t, equals e (the base of the natural logarithm, about 2.718) raised to the power of negative λ times t, where λ (lambda) is a constant failure rate. Constant failure rate just means that you aren't fixing bugs or adding new features, both of which would affect the failure rate. A constant failure rate is basically what you have once a system is released for commercial use: the failure rate isn't going to get better or worse; it's constant because you are done working on the system. It is, therefore, a good indicator of what life with the system will be like for your customer.[2]

[2] If you add functionality or fix bugs once a system is released, it is, technically speaking, a new version of the system with a new failure rate.

What does this equation really mean? Well, it captures an intuitive aspect of reliability that failure intensity alone doesn't describe. When you read that a hardware widget is expected to fail once in 5,000 hours of operation (that's the failure intensity), a question that may well pop into your mind is: "Right, but doesn't the likelihood of failure increase the closer to 5,000 hours of operation you get?" The answer is yes, and that is exactly what the exponential failure law of Equation 4.1 is describing. Equation 4.2 provides an alternate way to write this equation that will help us see this; we simply re-write the equation without a negative exponent:

R(t) = 1 / e^(λt)

Equation 4.2 Alternate version of Equation 4.1 without a negative exponent.
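The chapter later shows this calculation as a spreadsheet formula; in the meantime, here is a minimal Python sketch (my illustration, not from the book) of the exponential failure law:

```python
import math

def reliability(failure_intensity: float, t: float) -> float:
    """Exponential failure law (Equation 4.1): the probability of
    failure-free operation for a length of time t, given a constant
    failure rate (failure intensity) lambda."""
    return math.exp(-failure_intensity * t)

# The widget expected to fail once in 5,000 hours: lambda = 1/5,000.
lam = 1 / 5000
print(round(reliability(lam, 100), 3))  # 0.98: near-certain to run 100 hours failure-free
```

In a spreadsheet, the equivalent formula is simply =EXP(-lambda*t).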
Now let's reconsider that question: given that a widget is expected to fail once in 5,000 hours (that's the failure intensity, i.e., λ = 1/5,000), isn't it more likely to fail later (say, as t nears 5,000 hours) than earlier (say, in the first few hours of operation)? Equation 4.2 says yes: as t grows, the denominator e^(λt) grows, so the reliability R(t), the probability of failure-free operation through time t, shrinks. So, technically speaking, Equation 4.1 is the "official" definition of reliability per se, as it is what is needed to calculate the probability of failure-free operation for a specified length of time. But the main component of Equation 4.1 is failure intensity, and failure intensity is the bit that is most commonly used in setting and tracking software reliability goals.

Reliability: Software Versus Hardware
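To see that intuition numerically, here is a short sketch (again my illustration, not from the book) that evaluates Equation 4.2 at increasing operating times for λ = 1/5,000:

```python
import math

lam = 1 / 5000  # failure intensity: one expected failure per 5,000 hours

# R(t) = 1 / e^(lambda*t): reliability falls as operating time t grows.
for t in (100, 1000, 2500, 5000):
    r = 1 / math.exp(lam * t)
    print(f"R({t}) = {r:.3f}")
# R(100) = 0.980
# R(1000) = 0.819
# R(2500) = 0.607
# R(5000) = 0.368
```

Note that even at t = 5,000 hours, the expected time to one failure, there is still about a 37% chance of failure-free operation; the exponential failure law describes a steadily shrinking probability, not a hard deadline.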
Finally, if it's not already obvious, it's important to point out that there are distinctions between hardware reliability and software reliability. When we talk about hardware widgets with expected lives of 5,000 or 10,000 hours, the source of failure is assumed to be some part of the widget wearing out from physical use. But software does not wear out per se, leading some to question whether it makes sense to apply statistical models, such as the exponential failure law that originated with hardware reliability, to software (Davis 1993). The software reliability engineering community counters that though the source of failures is different for software (it doesn't wear out), statistical models are nevertheless valid for describing what we experience with software: the longer you run software, the higher the probability you'll eventually use it in an untested manner and find a latent defect that results in a failure. Resolution of these (sometimes theoretical) views notwithstanding, the discipline of software reliability engineering has plenty of ideas I think you will find have practical application to use case development: a user-centric, dynamic, and quantifiable view of reliability. In the next section, you will get a closer look at what is virtually the heart of this view of reliability, failure intensity, and learn how to apply it to your projects.