The Art and Business of Speech Recognition: Creating the Noble Voice
"Genius begins great works; labor alone finishes them."
JOSEPH JOUBERT, FRENCH ESSAYIST

"Our success derives from sustained intensity."
ANONYMOUS

The successful conclusion of usability testing, and the completion of any redesigns necessitated by it, are important milestones in the speech-recognition production life cycle, but they by no means indicate that your work is finished. Sometimes when clients have all their phone lines set up and their system ready to take calls, they ask, "Why can't I just turn it on? I did that with my touchtone system." It's an understandable question; after all, many technologies do come ready to work right out of the box. But speech-recognition systems do not always work perfectly from Day One.

Why not? Because verifying the successful operation of the intricacies of a speech-recognition system is much more difficult than verifying a touchtone system. When people are asked to respond to a touchtone system, they do so by pressing one or more of the 12 buttons on a telephone keypad. So, for example, if it's a system that provides movie show times, it's very easy to verify that when callers press "1" for the first movie, the correct information is played. It's also easy to verify that the system is properly rejecting incorrect responses when callers press invalid buttons. (These rejections usually come in the form of a rude, often untrue, and painfully time-consuming prompt, such as "That input was not valid. Please listen to the options and select one of the choices.") As a result, by the time a touchtone system is deployed, the people who designed and tested it should be highly confident that it works.

Not so with a speech-recognition system. When testing a speech-recognition system in a laboratory environment, it's hard to replicate "real world" conditions. And since there is often no single acceptable answer to a prompt, we can't even be sure that callers will say the right things in the first place.
Yes, usability testing gives some insight into these issues, but there is simply no substitute for real data from real callers making real calls under real-world conditions. For example, how many ways do you think people might say a four-digit number like "1687"? If it's the part number of a product you're ordering, you might say "one, six, eight, seven." If it's a flight number, you might say "sixteen eighty-seven." And if it's the amount of a cash transfer, you might say "one thousand six hundred eighty-seven." But we can't be sure that everyone else calling in to a speech-recognition system will say those numbers the same way. When speech scientists tune the vocabulary of a system, they need to know what actual people are actually saying. They need to listen to thousands of utterances of people saying these numbers, build statistical models to analyze the responses, and then make any necessary changes to improve the recognition.

In many ways, deploying a speech-recognition application is like deploying any other computer software application. No software company simply tests its code internally and then ships it out the door to customers. Instead, companies often deploy software in at least two phases (alpha and beta) to small numbers of select customers. These customers agree to try the software and report any problems or bugs. This process enables the software company to see how its code works in the real world, and to remedy any problems that arise, before releasing it on a large scale. This is similar to the process used in deploying a speech-recognition application.

But there's an important difference between commercial software and speech-recognition systems. When commercial, shrink-wrapped software is deployed, customers have a way to report bugs to the company so that it can still make changes to future versions. Speech-recognition systems don't allow callers to do this easily, since the software doesn't reside in their hands.
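The variability in spoken numbers described above can be made concrete with a short sketch. The code below is purely illustrative (the function names and wording rules are assumptions, not part of any real speech-recognition toolkit): it enumerates three plausible spoken forms of a four-digit number that a recognition grammar might need to cover.

```python
# Illustrative sketch only: enumerate plausible ways a caller might speak a
# four-digit number. These rules are assumptions for demonstration, not a
# real grammar-tuning tool.

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine"]
TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
         "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def two_digit(n):
    """Spell a number 0-99 in words."""
    if n < 10:
        return ONES[n]
    if n < 20:
        return TEENS[n - 10]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")

def verbalizations(number):
    """Return three plausible spoken forms of a four-digit number."""
    digits = [int(d) for d in f"{number:04d}"]
    forms = []
    # Digit by digit: "one, six, eight, seven" (e.g., a part number)
    forms.append(", ".join(ONES[d] for d in digits))
    # Paired: "sixteen eighty-seven" (e.g., a flight number)
    forms.append(two_digit(number // 100) + " " + two_digit(number % 100))
    # Full cardinal: "one thousand six hundred eighty-seven" (e.g., a dollar amount)
    thousands, rest = divmod(number, 1000)
    hundreds, tail = divmod(rest, 100)
    cardinal = ONES[thousands] + " thousand"
    if hundreds:
        cardinal += " " + ONES[hundreds] + " hundred"
    if tail:
        cardinal += " " + two_digit(tail)
    forms.append(cardinal)
    return forms

for form in verbalizations(1687):
    print(form)
```

Even this toy version omits common real-world variants ("one six, eighty-seven," "sixteen oh seven" for numbers with a zero, and so on), which is exactly why tuning relies on listening to thousands of actual utterances rather than on enumeration from the armchair.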
And callers don't have a manual to tell them what to do if they find problems, or whom to contact with technical-support questions. And given the size of most speech applications, testing with a relatively small number of subjects won't uncover all the potential problems. For this reason, it's critical that companies deploy speech-recognition applications in a deliberate, phased approach.

A similar approach is used in Hollywood, where movie studios almost always conduct advance screenings of films for "real" audiences to gauge their reactions and, if necessary, make changes. These changes may occur before the movie is put into general release, but also after the movie has been released to particular markets. Sometimes these changes are major. For example, director Frank Capra shot and tested four different endings to his 1941 film classic, Meet John Doe, starring Gary Cooper and Barbara Stanwyck. In one, Cooper's character commits suicide. In another, Stanwyck's character persuades him not to leap from the top of City Hall. Unsatisfied with the response to these and two other endings, Capra filmed a fifth (and final) ending, in which Stanwyck talks some sense into Cooper and then faints into his arms.

A determined artist may not care what the public thinks about a work of art (in fact, some even view widespread popularity as a sign of artistic failure), but, as with film, speech-recognition designers don't have that choice. To be considered a success, their work has to be understood, widely used, and enjoyed by its intended audience. And the only sure way to achieve these objectives is through a careful, controlled, and monitored multi-phase deployment.