The Art and Business of Speech Recognition: Creating the Noble Voice

Pilot deployment is usually the longest phase of the deployment process; almost all of the data analysis work must be performed manually, and more changes need to be made than in later phases.

Here's how it works. We listen to the first 100 calls from 100 different people. We don't want to analyze repeat callers (the data wouldn't be valid if it consisted of, say, 20 calls from only 5 different people). We then listen to see how these 100 people use the system and determine if the system is doing the right things at the right times.

We have two primary concerns in this phase: accuracy and transaction completion. We measure accuracy by monitoring the recognizer's ability to correctly identify what callers said. We measure transaction completion by tracking how many callers were successful in completing their tasks.

For example, let's say we're in pilot deployment for a telephone banking system. A user calls in to the system, identifies herself, and says, "Transfer funds." The system recognizes the command and prompts the caller for the two accounts involved, but can't understand when the caller says how much money to transfer. It sends her back to the main menu or to a "live" operator for help.

Assuming that the caller says a valid amount and calls from an acoustical environment where we would expect the recognizer to work (rather than, say, the baggage claim area of an airport), this call would be considered a failure in both recognition and transaction completion. However, if the caller eventually learned to enter the amount using touchtones, and she did indeed accomplish her task, we would consider the call a success in transaction completion but a failure in recognition.
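
To make the bookkeeping concrete, here is a minimal sketch of how the two measurements might be tallied once each pilot call has been reviewed; the ReviewedCall fields are hypothetical stand-ins for whatever the call log and the human reviewers actually capture, not part of any particular product.

```python
from dataclasses import dataclass

@dataclass
class ReviewedCall:
    # Hypothetical fields a reviewer fills in while listening to a pilot call.
    utterances_heard: int      # valid utterances the caller spoke
    utterances_correct: int    # utterances the recognizer identified correctly
    task_completed: bool       # did the caller accomplish what she called to do?

def accuracy(calls):
    """Share of caller utterances the recognizer got right."""
    heard = sum(c.utterances_heard for c in calls)
    return sum(c.utterances_correct for c in calls) / heard if heard else 0.0

def completion_rate(calls):
    """Share of calls in which the caller finished her task."""
    return sum(c.task_completed for c in calls) / len(calls) if calls else 0.0

# The banking call above: the amount was never recognized, so accuracy suffers,
# but completion depends on whether the caller finished via touchtones.
calls = [ReviewedCall(utterances_heard=5, utterances_correct=3, task_completed=True)]
print(f"accuracy {accuracy(calls):.0%}, completion {completion_rate(calls):.0%}")
```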

Our goal, of course, is to have a system that is highly accurate and that has a high rate of transaction completion. But we don't necessarily weigh these two measurements equally. Ultimately, as shown in the example above, transaction completion is the more important metric, because enabling callers to accomplish their tasks is more important than understanding everything they say. After all, some callers may have heavy accents or may be calling from noisy locations. Conversely, a high accuracy rate and a low transaction completion rate would indicate serious design problems, since the low transaction completion number could not be attributable to recognition problems.

What do we consider "acceptable" rates of accuracy and transaction completion? While there are no hard and fast rules, and all systems are different, the recognition should be good enough that callers can get through their calls without getting frustrated. The overall transaction completion rate should be between 80 and 85% before moving on to partial deployment. Of course, the most desirable outcome is both a high accuracy rate and a high transaction completion rate.

Identifying Pilot Testers

The first and most important step in pilot deployment is to find the right population to use the system. Just as in usability testing, it's important to find people who are representative of the calling population to ensure that the results will be accurate. This process won't be the same as in usability testing, where we might have used a professional recruiter. It is essential to get real callers who need to use the system. That means designers and their clients should not send a company-wide e-mail to employees telling them to "call in to the following number to try out our new speech-activated customer service system." We don't want any ringers. (Proud family members of the designer are also disqualified.) However, it may be a good idea to alert the customer service representatives that there is a new system in place so they will know what customers are talking about and can take notes when customers complain about or compliment the new system.

Depending on the system, there are a variety of ways to get the right people to call in. For example, if we were creating a flight information system to replace an existing touchtone system, we could simply take one-tenth of 1% of all calls and route them to the new system. If, however, we were testing a credit card information system, we might send out newsletters to several hundred randomly selected customers asking them to update their records and call a new number for account information. This isn't a perfect strategy, because some people may not call the new number or even read the mail sent to them by the bank (I know I don't read anything included in my credit card statement except how much I owe).

Another approach is to select calls as they come in, using caller ID information to aid in segmenting the population of callers. This method enables us to roll out a system geographically, for example, by testing callers in selected area codes before rolling it out nationwide. This strategy is especially useful when we know the demographic attributes within each area code, such as the urban/rural mix, ethnic composition, education levels, and income levels. We can then test using a diverse set of callers.
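
As a rough sketch, both routing strategies, random sampling of a tiny fraction of traffic and caller-ID segmentation by area code, might look something like the following; the destination names, sampling rate, and area codes are illustrative and not tied to any specific telephony platform.

```python
import random

PILOT_FRACTION = 0.001               # one-tenth of 1% of all traffic
PILOT_AREA_CODES = {"415", "503"}    # hypothetical area codes chosen for the pilot

def route_by_sampling() -> str:
    """Send a small random slice of all incoming calls to the new system."""
    return "new_speech_system" if random.random() < PILOT_FRACTION else "legacy_touchtone"

def route_by_area_code(caller_id: str) -> str:
    """Send callers from selected area codes to the new system."""
    digits = caller_id.lstrip("+1 ")          # drop a leading +1 country code, if any
    area_code = digits[:3]
    return "new_speech_system" if area_code in PILOT_AREA_CODES else "legacy_touchtone"

print(route_by_area_code("+1 415 555 0123"))  # -> new_speech_system
```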

Whichever method is chosen, clients should avoid testing the system during times that are not representative of the entire year. That means an airline shouldn't turn on a new system the day before Thanksgiving, nor, if at all possible, should an online brokerage firm test its new system when the stock market is especially volatile.

Configuring the System for Testing

Before starting the test, the designer may consider placing a temporary prompt in the system alerting callers that they are about to try a new system. This not only prepares callers for any "hiccups" in the system, but also prevents them from being surprised if the system isn't available to them the next time they call.

Some companies keep these temporary prompts simple: "Welcome to our new flight information system." Others go to greater lengths to ensure the message is totally clear: "Welcome to our new speech-activated system. Currently it's only available during business hours, and in order to ensure your satisfaction, there may be times when it's being improved and you'll be unable to use it to access your account information." This kind of prompt is useful if the client is trying to expose the system to valuable customers and wants to avoid jeopardizing their satisfaction and loyalty. However, such a long statement will quickly get boring, so it's a good idea, if possible, to play it only once for each caller. This can be accomplished through programming. For example, the system can be set up to play the long prompt only the first time any given account number is entered. Callers hear it only once and will more likely be patient with any problems.
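
A minimal sketch of that "play it only once" logic appears below; the prompt wording is abridged from the examples above, and the in-memory set is a stand-in for whatever account database a real deployment would use.

```python
LONG_INTRO = ("Welcome to our new speech-activated system. Currently it's only "
              "available during business hours, and in order to ensure your "
              "satisfaction, there may be times when it's being improved and "
              "you'll be unable to use it to access your account information.")
SHORT_INTRO = "Welcome to our new system."   # illustrative short greeting

accounts_already_greeted = set()   # illustrative; a real system would persist this

def intro_prompt(account_number: str) -> str:
    """Play the long explanation only the first time a given account calls in."""
    if account_number in accounts_already_greeted:
        return SHORT_INTRO
    accounts_already_greeted.add(account_number)
    return LONG_INTRO
```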

Being Prepared to Stop the Test

Being prepared to start the test is essential. So is being prepared to stop the test at a moment's notice. When a system is being launched for the first time, almost anything can go wrong. In most cases, the system uses a new type of computer, connected to a new database, using new hardware to connect to the phone lines. There are multiple potential points of failure, and they may have nothing at all to do with the people designing and deploying the system. For example, the phone company may not be sending calls to the system correctly, or perhaps there's a previously undiscovered bug in the computer operating system. Of course, we can't rule out the possibility of a problem with the design, either.

Whatever the cause, if something occurs to compromise, invalidate, or incapacitate the test, we always need a plan and a procedure that will enable us to turn off the system and have all the calls go back to where they used to go. If it's an entirely new system that isn't replacing a prior call center or touchtone system, calls should be routed to a dedicated computer that simply announces that the system is down for service and then hangs up gracefully. Perhaps it should also provide information about how the caller can contact the company via other channels.

Clearly, if something goes awry, we have to be prepared to respond immediately, and that means we have to watch the system like hawks. This is why it makes sense to limit the pilot test to normal business hours. This ensures that someone is there to monitor the system and that any third-party hardware or software vendors can be contacted for customer support in the event of an emergency. (Even companies with 24-hour customer support don't always have the most knowledgeable people standing by in the middle of the night.)

Once the test is running and people are monitoring its operation, you can start to examine and analyze the calls coming into the system.

Analyzing and Categorizing Calls

The system logs calls as they come in and stores the data, including actual audio files of calls, in a database. In addition to some manual analyses, we also analyze calls using a software tool that provides a graphical view of the logged data.

The capabilities and functions of analysis tools vary by vendor, but the best ones enable the team to view statistics such as the average call duration and the average number of transactions. They pinpoint problem areas by providing a view into the individual dialog states where the recognizer is having trouble. The best tools also allow us to listen to each utterance of any call so we can gain a clearer understanding of what happened in that call, and whether or not the system is working for the callers.
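
As a small illustration of the kind of statistics such a tool surfaces, the snippet below averages two hypothetical logged fields, call duration and transactions per call; real tools vary by vendor and capture far more detail.

```python
from statistics import mean

# Hypothetical logged fields: (call duration in seconds, transactions completed).
call_log = [(95, 1), (210, 3), (48, 0), (132, 2)]

avg_duration = mean(duration for duration, _ in call_log)
avg_transactions = mean(transactions for _, transactions in call_log)
print(f"average call duration: {avg_duration:.0f} s, "
      f"average transactions per call: {avg_transactions:.1f}")
```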

We examine the calls from the callers' perspective to determine the objectives of their calls and their impressions of the system. Based on all the data we collect and analyze, we put each call into one of three categories: success, failure, or unknown.

Successes

If we hear a caller effortlessly getting information and accomplishing her tasks promptly, we can consider that call a success, particularly if she sounds nonchalant and closes the conversation with a friendly sign-off, such as "Good-bye."

But it's harder to determine the success of some other calls. For example, when working with United Airlines to deploy a system to handle employee travel reservations, we received several puzzling calls. Employees would call in, search for flights, and get right to the point where they could make a reservation, and then they would hang up. Were they successes? What defines a success? The callers didn't complete the expected transaction or say "Good-bye," but they didn't sound upset, either. So what was happening?

After more research we found out that many employees were calling to see how many available seats there were on the flights they wanted. Often, if there was limited space, they wouldn't make the reservation. We discovered that this was related to how the airline determines who can fly for free. Employees with the most seniority or higher-level jobs can get on flights with only one available seat just by checking in at the gate. Employees who were hired more recently are considered lower-priority fliers and, as such, are more likely to get bumped off a crowded flight. That was why they would call, check seating availability, and then hang up. Since they were successful in getting the information they wanted, even though they didn't make reservations, we categorized these calls as successes.

It's not always possible to know what all the success criteria will be before deploying the application, so it is critical to examine the calls later to see whether any success criteria were overlooked. During the testing process, it's important to look for unusual patterns such as this one to ensure accurate test results.

Failures

Some failures are obvious. For example, if we listen to a call and hear someone not being recognized, yelling at the system, and using profanity, that's a definite failure. Sometimes a failure occurs when the system simply does one thing wrong: for example, correctly understanding that a caller wants to talk to "Zach Westelman" and transferring the call correctly to Zach Westelman, but incorrectly telling the caller that he's being transferred to "Ben McCormack." This is still considered a failure, because if the problem were left uncorrected, it would inevitably result in lots of hang-ups and confused or frustrated callers.

There are only three mutually exclusive categories of calls: success, failure, and unknown. So, what if a call starts off and proceeds successfully, but ends as a failure? Since we have no "semi-success" category and our goal is to improve the system (not whitewash the results), I'd recommend treating such calls as failures.

Unknowns

We call the third category of calls "unknowns." Sometimes after reviewing the data we collect, and even listening to the audio files, we are simply unable to determine whether a call was successful or not. For example, sometimes a call starts and then there is an immediate hang-up. Or perhaps a caller uses the system successfully for a while, and then abruptly hangs up for no apparent reason.

There are many explanations for why this happens, and it's often no fault of the system or its design. If a hang-up is the first and only sound on the call, it could be a wrong number. If a caller is using the system without problems and then abruptly hangs up, it could mean that the caller was interrupted or had a bad mobile phone connection. Calls are categorized as unknown when callers' behavior doesn't tell us enough about their intent to place them in either of the other categories.

There are bound to be a few unknown calls during any test, and if the percentage is small, there's no need for further investigation. However, as we learned in the United Airlines example above, if unknown calls represent a significant percentage of all calls, it's definitely worth checking out.
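
Pulling the three categories together, a review pass might tag each call roughly as follows; the CallReview fields are hypothetical stand-ins for the notes a human reviewer records after listening to a call.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CallReview:
    # Hypothetical notes a reviewer records after listening to a call.
    goal_accomplished: Optional[bool]   # None when the caller's intent is unclear
    ended_in_failure: bool              # e.g., misrecognition, yelling, wrong transfer

def categorize(review: CallReview) -> str:
    """Assign one of the three mutually exclusive categories."""
    if review.goal_accomplished is None:
        return "unknown"     # immediate hang-ups, unexplained drop-offs
    if review.ended_in_failure or not review.goal_accomplished:
        return "failure"     # includes calls that start well but end badly
    return "success"

print(categorize(CallReview(goal_accomplished=True, ended_in_failure=False)))  # success
```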

Fixing the Right Problems

The whole idea of listening to test calls is to identify and understand any patterns that emerge as the system is used. By actually hearing the calls, a development team can quickly understand what callers are experiencing. By reviewing the data and running statistics on them, the team can spot trends and act quickly to fix the "right problems," rather than risk a misdiagnosis and fix something that's actually working well. To paraphrase a surgeon I once heard, we don't want to take out the appendix when it's the tonsils that are causing the problem.

Sometimes, major problems emerge in testing, problems that necessitate an equally major redesign of all or some of the system. However, having a large number of failed calls in a pilot test doesn't necessarily dictate a wholesale redesign. For example, if we find that 80 out of 100 calls are failing, but we understand the problem behind 76 of them, and we know how to correct it, we can move ahead knowing we have only four calls that need further investigation.

Let's say we designed a system for an entertainment ticket sales company. It includes a prompt that says, "You can say 'Movies,' 'Music,' 'Theater,' or 'Talk to an operator.'" What if, during the pilot test, we found that 58% of the callers asked to talk to an operator? We couldn't automatically assume that the system was a failure. People may have been opting to talk to an operator because they needed help with something the system wasn't designed to handle. In that case, it may be OK to leave the system as is, or, if it made sense, we might consider adding functionality to the system to reduce the volume of calls going to call center representatives. But if further investigation revealed that callers simply didn't know that the system could handle their requests, we would definitely want to modify the design to ensure that callers know what the system can and cannot do.

"What We've Got Here Is Failure to Communicate" [1]

[1] Spoken by Captain, Road Prison 36 in the movie Cool Hand Luke (1967).

It's often important to look beyond the speech-recognition system for the source of problems when they crop up. Take the United Airlines Bag Desk application, for example. The system had been up and running successfully for several months when, all of a sudden, a significant number of calls were being categorized as failures by the automated analysis tools. When the people on the team analyzed the problem calls, they discovered that all of the callers were saying they had lost their bags in "Jackson, Mississippi."

It turned out that "Jackson, Mississippi" wasn't in the recognition vocabulary because, as far as the team knew, United didn't fly in or out of the Jackson airport. So, the system had worked properly, correctly rejecting what people were saying and transferring them to United agents. But the question remained: Why were people calling United to report lost baggage at an airport without United service?

The development team called the United technical team. Neither group could figure it out until a few calls were made to United officials. It turned out that United had just recently inaugurated service at the Jackson airport, but not all of the airline's databases had yet been updated, and the people in charge of the system hadn't been informed. Once the source of the problem was identified, the team was able to update the system in a matter of hours, and its failure rate dropped once again.

A sudden jump in the failure rate is not the only way to spot a problem in a deployed system. A good analysis tool will monitor the system and, if anything out of the normal range occurs, send an alarm by either pager or e-mail to the people in charge of system administration.
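
Here is a minimal sketch of that kind of out-of-range alarm, assuming the failure rate is recomputed periodically from the call log; the threshold, the addresses, and the local mail relay are all hypothetical.

```python
import smtplib
from email.message import EmailMessage

FAILURE_RATE_CEILING = 0.20   # hypothetical edge of the "normal range"

def check_failure_rate(failures: int, total_calls: int) -> None:
    """E-mail the administrators if the failure rate leaves the normal range."""
    if total_calls == 0:
        return
    rate = failures / total_calls
    if rate > FAILURE_RATE_CEILING:
        alarm = EmailMessage()
        alarm["Subject"] = f"Speech system alarm: failure rate at {rate:.0%}"
        alarm["From"] = "monitor@example.com"       # hypothetical addresses
        alarm["To"] = "sysadmin@example.com"
        alarm.set_content(f"{failures} of the last {total_calls} calls failed.")
        with smtplib.SMTP("localhost") as server:   # assumes a local mail relay
            server.send_message(alarm)
```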

Knowing When to End Pilot Deployment

Every development team has to establish a grading system to signal when it's appropriate to end the pilot deployment phase and move on to partial deployment. The following are some rules of thumb.

  • If fewer than 5% of calls are failures, the system is ready for partial deployment.

  • If between 5% and 20% of calls are failures, we need to figure out what we can do to fix the application promptly.

  • If more than 20% of calls are failures, we should seriously consider turning off the application until problems are identified and solved.

  • If more than 33% of calls are categorized as unknown, we need to gather more data to help us analyze the application.
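
Taken together, these rules of thumb amount to a simple triage routine over one batch of categorized calls; the sketch below just encodes the thresholds listed above.

```python
def pilot_verdict(successes: int, failures: int, unknowns: int) -> str:
    """Apply the rule-of-thumb thresholds to one batch of categorized calls."""
    total = successes + failures + unknowns
    if total == 0:
        return "no calls to judge yet"
    if unknowns / total > 1 / 3:
        return "gather more data before judging the application"
    failure_rate = failures / total
    if failure_rate < 0.05:
        return "ready for partial deployment"
    if failure_rate <= 0.20:
        return "fix the application promptly"
    return "consider turning the application off until problems are solved"

print(pilot_verdict(successes=90, failures=4, unknowns=6))
# -> ready for partial deployment
```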

Making Changes

Once the calls have been categorized, the problem calls should be divided into subcategories. This helps the development team pinpoint the areas of the application causing problems for callers. Once they identify the sources of any problems, the development team proposes potential fixes, making sure to consider what effect, if any, the fix will have on the recognizer and the usability of the rest of the application.

Once the appropriate changes have been made, we turn the system on again to collect another set of calls. If the failure rate drops to 5% or lower, it's time to turn up the volume through partial deployment.