Succeeding with Use Cases: Working Smart to Deliver Quality
As noted in the previous section, your car's air bag is one of those items in life you actually hope never to use. If used at all, it's used just once, and you hope it's reliable. Air bags are an example of functionality that doesn't fit well into the frequency-of-use model of reliability: the level of reliability we require from an air bag is not in proportion to its frequency of use. In the next section, you will see how to extend an operational profile to address critical use cases that, although used infrequently, nevertheless need an appropriate amount of development and test effort and resources dedicated to them to ensure they are reliable.

What Does "Critical" Mean?
Project teams often say things like "That use case is critical," "That's a mission-critical use case," or "Do we have a list of critical use cases for the next release?" The word "critical" has at least two very common meanings. One common meaning of "critical use case" is that the success of a project from a business standpoint is dependent on delivery of that use case. Another meaning of "critical use case" is that the severity of use case failure is high. An air bag is a critical feature in a car because the consequence of it failing is high. Sometimes it's really not clear which of these two meanings is intended, or if perhaps both are. Wiegers, in describing scales for prioritizing requirements, uses as an example the scale High, Medium, and Low, where High is defined as "a mission-critical requirement; required for next release," which can probably be interpreted as being critical in both ways. Of these two meanings, this section will address "critical" as meaning that the consequence of failure can be severe. But rather than use the word "critical," we'll talk about severity of failure. In critical systems (safety-critical, mission-critical, and business-critical), this is a common way to quantify criticality: the higher the severity of failure, the more critical it is.[6]

[6] The other type of "critical" (really important from a business perspective) was addressed in Chapter 1, "An Introduction to QFD: Driving Vision Vertically Through the Project," on QFD, where critical equates to critical business drivers.

It's a Calculated Risk
Just as the term "critical" can have different meanings, the term "risk" gets used in a lot of different ways. The type of risk we'll be talking about is quantified and used, for example, in talking about the reliability of safety-critical systems. It's also the type of "risk" that actuarial scientists use in thinking about financial risks: for example, the financial risk an insurance company runs when it issues flood insurance in a given geographical area. In this type of risk, the risk of an event is defined as the likelihood that the event will actually happen multiplied by the expected severity of the event. Likelihood is expressed in terms of a frequency or a probability. If you have an event that is catastrophic in impact but rarely happens, it could be low risk (dying from shock as a result of winning the lottery, for example). This product (likelihood times severity) is called risk exposure.

The terms "frequency" and "probability" can be a bit confusing to those of us who are not statisticians. And even the statisticians like to have debates about their meanings and relationship (since about the eighteenth century, in fact). The US Office of Hazardous Materials Safety talks about "frequency" in terms of a measure of the rate at which events occur over time (e.g., "crashes per year"), and in the examples we'll see, that's how our likelihood will be expressed. Just be aware that frequency and probability can be, and are, used interchangeably.

So what's all this about "events"…what does that have to do with use cases? Well, each time your customer runs a use case, there is some chance that they will encounter a defect in the product. That's an event. What you'd like to have is a way to quantify the relative risk of such events from use case to use case so you can work smarter, planning to spend time making the riskier use cases more reliable. But before we get to that, let's take an example of calculating the risk of an event.

Hardware Widget Example
A manufacturing plant has a machine with two hardware widgets, A and B. When either one fails, production is shut down until it is replaced. Hardware widget A is rated at about 1 failure in 5,000 hours of operation; widget B is more reliable, rated at 1 failure in about 10,000 hours of operation. Which widget is the bigger risk to shutting down production? Your first reaction might well be that it's the less reliable of the two, widget A. But, in fact, we can't tell yet because we don't know what the cost of a failure is. Let's say the cost of the widgets themselves is negligible; the bulk of the cost of a failure is in having production shut down. Widget A can be replaced pretty quickly, and a failure is estimated to affect production by about $6,000. Widget B, on the other hand, requires significantly more time to replace due to its location in the hardware, requiring the manufacturing machine to be partly disassembled. The estimated impact to production when it fails is about four times as long, or about $24,000. Figure 3.24 illustrates the calculation of the risk of these two widgets failing. The following formula is used to calculate risk exposure:

risk exposure = frequency of failure × cost of failure

Figure 3.24. Calculating the risk exposure of hardware widgets A and B. In this example, widget B, the more reliable one, is actually the bigger risk.
The risk exposure in both cases is given in terms of dollars per hour, and, as it turns out, the more reliable component (hardware widget B) is actually a bigger risk than hardware widget A because of the cost of failure. Widget A, at one failure in 5,000 hours of operation, has a risk exposure of $1.20 per hour, compared with a risk exposure of $2.40 per hour for the operation of widget B, based on one failure in 10,000 hours of operation. Finally, each widget's risk exposure represents some percent of the total risk exposure ($1.20 + $2.40 = $3.60); that percentage is calculated for each widget in the last column of Figure 3.24, labeled probability. This is the relative probability of loss from failure that each widget represents.
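If it helps to see the arithmetic end to end, here is a minimal sketch of the Figure 3.24 calculation in Python; the failure rates and costs are the ones from the example above, and the variable names are mine.

```python
# Failure rates and costs from the widget example above.
widgets = {
    "A": {"failures_per_hour": 1 / 5_000, "cost_per_failure": 6_000},
    "B": {"failures_per_hour": 1 / 10_000, "cost_per_failure": 24_000},
}

# Risk exposure = frequency of failure x cost of failure (dollars per hour).
exposure = {
    name: w["failures_per_hour"] * w["cost_per_failure"]
    for name, w in widgets.items()
}
total = sum(exposure.values())  # $1.20 + $2.40 = $3.60 per hour

for name, e in exposure.items():
    # Each widget's share of the total is its relative probability of loss.
    print(f"Widget {name}: ${e:.2f}/hour, {e / total:.0%} of total risk exposure")
```

Running this prints $1.20 per hour (33%) for widget A and $2.40 per hour (67%) for widget B, matching the figure.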
Profiling Risk in Use Cases

Let's take what we've learned from our hardware widgets and re-apply it to use cases. Using our sales order example of Figure 3.11 and starting with the frequency of use information from its operational profile (see the Times per Day column in Figure 3.14), we build an operational profile that takes into account the risk of each use case (see Figure 3.25).

Figure 3.25. Sales order operational profile extended to cover risk.
The following sections will walk you through the key points of the operational profile of Figure 3.25, which has been extended to address use case risk.

Frequency of Failure
Recall that in our hardware widget example we specified the frequency of failure in terms of expected failures per hours of operation (e.g., one failure in 10,000 hours of operation), something one would learn, for example, through empirical tests run on batches of widgets. For use cases, we don't have an absolute number like that to work with. We are building a risk profile that can be used very early in the release to help plan activities for development and test. But what we can do is talk about the relative frequency of failure from one use case to another. The first column of Figure 3.25 records the estimated number of times a day that each use case is run; it is the same information as in the operational profile illustrated in Figure 3.14. But rather than saying that use case Arrange Payment is run 355 times a day, think of it like this: use case Arrange Payment has 355 opportunities a day to fail. Request Catalog has 129 opportunities a day to fail. So Arrange Payment has over twice as many opportunities for failure as Request Catalog. In the extreme, a use case that is never run will have no chance to fail. So while we can't estimate frequency of failure for a use case in absolute terms as we did with the hardware widgets, we can say how many failures we expect in relation to other use cases, and we do this by simply estimating the times per day we expect each use case to be used. This is essentially the same concept we've seen in the operational profiles previous to this, but restated in such a way as to illustrate how it fits into the overall calculation of risk.
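As a quick sanity check of that "over twice as many" claim, using the daily frequencies from Figure 3.14:

```python
# Daily run counts double as daily opportunities for failure.
arrange_payment = 355  # runs (and thus failure opportunities) per day
request_catalog = 129

print(arrange_payment / request_catalog)  # ~2.75, i.e., over twice as many
```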
Severity

The severity of a use case failure may be hard to pin down quantitatively. There are three factors that need to be considered. First, there is the matter of the unit of measure for severity. Common units of measure for severity are cost, lost time (e.g., system downtime), and, for safety-critical systems, deaths and/or injuries. Given any one of these units of measure (cost, lost time, and so on), you have to also decide what it is that needs to be measured. For example, for cost, is it the cost to repair a failure, the cost of lost revenue due to a failure, or perhaps both? What makes sense for one package of use cases may not make sense for another. Use caution in arbitrarily adopting a scale of, say, 1=low severity, 2=medium severity, 3=high severity. Remember, the resulting profile will be used to allocate time, effort, and resources. If you plan on giving one use case three times as much effort as another, make sure that it truly is three times more severe in some absolute sense. Next, as Musa et al. (1990) point out, the severity of failure depends a lot on whose perspective you choose to measure it from. A defect that is relatively inexpensive to correct from a development standpoint can be catastrophic to a customer, and vice versa. Finally, there is the issue of actually attributing a number to the severity. Keep in mind that the goal is to get relative values for a profile. So rather than agonizing over whether the severity of the defect represents five versus six hours of down time, consider the suggestion of Musa et al. (1990) to round off severity estimates to the nearest order of magnitude. Order of magnitude estimates are ones that are given in terms of factors of 10: 1, 10, 100, 1,000, and so on.
By their nature of separation, it's often easier for a team to categorize things into order of magnitude buckets. For example, what's the average life of a person? Rather than getting into debates about the numerous factors that affect people's longevity (health, lifestyle, country, even what century), it's pretty clear that it's on an order of magnitude of 100 years; 10 is way too small and 1,000 is way too big. My inclination is to use orders of magnitude in the same way I use the Pareto Principle: as a heuristic to get a first cut at estimates, and then step back and let gut-level intuition tweak things a bit.
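If you want to mechanize that rounding, here is a small sketch of one way to snap an estimate to the nearest order of magnitude (rounding on a log scale); the helper name is mine, not something from the text.

```python
import math

def nearest_order_of_magnitude(x: float) -> float:
    """Round a positive estimate to the nearest power of ten (1, 10, 100, ...)."""
    return 10 ** round(math.log10(x))

print(nearest_order_of_magnitude(75))    # 100 -- e.g., average human life in years
print(nearest_order_of_magnitude(153))   # 100
print(nearest_order_of_magnitude(6))     # 10
```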
Example of Estimating Severity

The third column of Figure 3.25 specifies the severity of failures for each use case of our sales order example. The unit of measure selected is an order of magnitude estimate of the dollars to correct problems when discovered. Your reasoning for coming up with estimates of severity for this example might go something like the following.[7]

[7] Again, keep in mind that this is an example. The cost to correct a problem may or may not be the right unit of measure for your application.

Use cases such as Arrange Payment, Request Catalog, and Check Order Status can typically be fixed online by a customer representative and are estimated at an order of magnitude of ten dollars to correct. For example, a customer orders a catalog, the system fails to issue the request for a catalog, and the customer never gets it. The customer phones and complains, and the problem is fixed with the re-issue of a catalog. You decide, however, that use cases that involve the shipment of goods within the country (Place Local Order and Place National Order) are more on the order of magnitude of one hundred dollars to fix. For example, a customer orders widget A, but the system issues a request for widget B. The customer gets the wrong widget and phones to complain. Fixing this involves the cost of labor, shipping, and insurance to have the wrong widget picked up from the customer, shipped back to the warehouse, and re-stocked. Problems with international orders, you decide, are even more expensive to fix. They incur the same types of costs to fix as local and national orders (only more expensive), plus tariffs going and coming, and so on. You estimate use case Place International Order at an order of magnitude of a thousand dollars to fix. Finally, you note that several of the use cases (Enter Customer Data and Order Product) are included in the generalization use case, Place Order, of which the more critical Place Local, National, and International Order use cases are children. As either of these included use cases could very well play a role in bungling orders, you decide their severity is in line with the place order use cases. Because the predominant use of these included use cases is for local and national orders (400 times daily) versus international orders (25 times daily), you conclude an order of magnitude estimate of their average severity is one hundred dollars. You reach a similar conclusion about Cancel Order, which, while not included in Place Order, could, if it failed, result in orders being shipped even though actually cancelled. Cancel Order is also pegged at average severity on an order of magnitude of one hundred dollars.
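To see why one hundred dollars is a defensible order of magnitude for the included use cases, here is a rough frequency-weighted average; the weighting scheme is my reading of the reasoning above, not a calculation spelled out in the text.

```python
# Daily volumes and severities from the example above.
local_national_runs, local_national_cost = 400, 100    # $100 per failure
international_runs, international_cost = 25, 1_000     # $1,000 per failure

weighted_avg = (
    local_national_runs * local_national_cost
    + international_runs * international_cost
) / (local_national_runs + international_runs)

print(weighted_avg)  # ~153 dollars; nearest order of magnitude is 100
```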
Risk Exposure and Probability

The next-to-last column of Figure 3.25 calculates the risk exposure for each use case by taking the frequency of failure, stated in opportunities for failure per day, times the severity, stated in dollars per failure. So risk exposure will be measured in dollars per day. Risk exposure represents the risk in dollars of running a use case for the day. It doesn't mean that is how much money you are necessarily losing; it is the potential loss you are exposed to (hence the term risk exposure) from running a use case. It's just a way to compare the risk of one use case to another. The last column of Figure 3.25 calculates each use case's percent of total risk exposure; in a nutshell, it's the relative probability of loss due to a use case failure. This profile can be used just like the use case package operational profiles we've looked at earlier in the chapter; for example, it can be used for top-down or bottom-up planning. In this case it is based on use case risk rather than just frequency of use.
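Here is a sketch of the Figure 3.25 computation as a whole. The frequencies and severities for Arrange Payment, Request Catalog, and Place International Order come from the text; the split of the 400 daily local/national orders is an illustrative placeholder, not the book's actual figure.

```python
# use case: (opportunities to fail per day, severity in dollars per failure)
profile = {
    "Arrange Payment":           (355, 10),
    "Request Catalog":           (129, 10),
    "Place Local Order":         (250, 100),    # placeholder split of 400/day
    "Place National Order":      (150, 100),    # placeholder split of 400/day
    "Place International Order": (25, 1_000),
}

# Risk exposure = frequency of failure x severity (dollars per day).
exposure = {uc: freq * sev for uc, (freq, sev) in profile.items()}
total = sum(exposure.values())

# Percent of total exposure: the relative probability of loss per use case.
for uc, e in sorted(exposure.items(), key=lambda kv: -kv[1]):
    print(f"{uc:27} ${e:>6,.0f}/day  {e / total:5.1%}")
```

Even with placeholder splits, the pattern matches the discussion below: the infrequent but severe Place International Order dominates the exposure, while the frequent but cheap-to-fix Arrange Payment drops far down the list.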
Plotting the Results

Let's conclude by comparing the operational profile of the sales order example with and without the criticality of the use cases being taken into consideration (see Figure 3.26).

Figure 3.26. Bar chart of sales order example of Figure 3.11 comparing the operational profile of Figure 3.14, without use case criticality, and Figure 3.25, with criticality.
As Figure 3.26 illustrates, taking the criticality of the use cases into consideration in the operational profile can change the lay of the land. Use cases that, while frequently used, are low in criticality can fall dramatically in relative ranking (e.g., Arrange Payment), and use cases that are less frequently used but are critical can rise in ranking (e.g., Place International Order).
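A chart along the lines of Figure 3.26 is straightforward to produce once you have both profiles; here is a sketch using matplotlib, with illustrative percentages rather than the book's exact values.

```python
import matplotlib.pyplot as plt
import numpy as np

use_cases = ["Arrange Payment", "Request Catalog", "Place Int'l Order"]
by_frequency = [35, 13, 2]   # percent of total, frequency-of-use profile
by_risk = [5, 2, 36]         # percent of total, risk-extended profile

x = np.arange(len(use_cases))
width = 0.35
plt.bar(x - width / 2, by_frequency, width, label="Frequency of use only")
plt.bar(x + width / 2, by_risk, width, label="Frequency x severity (risk)")
plt.xticks(x, use_cases, rotation=15)
plt.ylabel("Percent of total")
plt.legend()
plt.tight_layout()
plt.show()
```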
What Have You Got to Lose?

Adding information about the criticality of use cases to your operational profile will, of course, require more effort. Deciding whether or not it's worth it depends on a number of things. If your business has elements that are safety-critical, mission-critical, business-critical, and so on, you probably already spend time thinking about how things fail and the cost of failures, so extending the operational profile is probably not that big a jump for you. On the other hand, even if you deal in critical systems, if the cost of all your failures is astronomical, or if the cost of all your failures is on the same order of magnitude, including severity in the profile might not buy you that much; profiling by frequency of use might be all you need. And, for some businesses, the connection between things that fail and their associated cost may be hard to establish with much certainty. In the end, it really comes down to asking, "What do I have to lose when a use case fails? Can I quantify it, and will that help me in planning?"