A Montreal commuter is stuck in traffic at dawn, stricken with a need for speed.
“Où est le Star-Bucks?” he shouts, in
“I’m sorry, would you please repeat that?”
“Starbucks, alors. Star-bucks! Zut!” He swerves. “High test, café extrème, get it?”
“Je suis désolé, monsieur. Sorry, I don’t recognize that name. Please hold for an operator.”
A simple request for the nearest Starbucks location—spoken to a cell phone equipped with a hands-free, speech-enabled locator system—should be a breeze. But this morning, the caller happens to be a Portuguese-Rumanian speaking French with a hangover. The traffic noise is a steady blare of honk honk honks in the background. What’s a speech recognition algorithm to do?
Plenty, of course. According to analysts, voice-enabled technologies will change the communications interface between humans and machines forever, tearing down barriers of accent, language, dialects, nasal speech, weird colloquialisms and monosyllabic grunts. A speech-enabled wireless phone will be able to dole out directions to a Portuguese-Rumanian Francophone with relative ease. New algorithms will suppress the interference of traffic noise and honking horns. In fact, speech systems—both recognition systems and text-to-speech—are so clever they can absorb huge vocabularies, parse mutilated phonemes and cordially accept a word or words enabling them to provide a bona fide and useful voice response. For example, “Where is Starbucks?”
“On 21st and Vine.”
“How do I get there?”
“Easy. Turn right at the next corner.”
“Can you direct me around traffic?”
“Of course. Hang a left and go down that alleyway. Watch out for gangs.”
Although speech recognition and text-to-speech technologies have already enjoyed increasing success in such computer-telephony applications as interactive voice response (IVR) and automated attendants, particularly for call centers, speech solutions distinctly branded for wireless systems are just emerging. According to Chris Biber, a director of marketing for Pronexus, a speech recognition company in Ottawa, Ontario, “People are still using wireless phones and PDAs today as just one of the access methods into a server-based speech application located centrally. For example, you can phone into an auto-attendant, and the auto-attendant doesn’t care whether you’re calling from a wireless device or a landline phone.” Tailor-made wireless speech solutions are still comparatively rare, Biber adds. “People are speech-enabling existing processes, but speech recognition is not happening on [most] wireless handheld devices.”
Yakov Shulman, the acting CEO of Advanced Recognition Technologies of Tel Aviv, Israel, wouldn’t agree. His company produces SmARTSpeak XGT and SmARTCar speech recognition systems for mobile phones and embedded computers in cars. “In cars, the applications are obvious. You can access your phone by speech recognition and text-to-speech response while having your hands free and keeping your eyes on the road. You can use your voice to browse an address book or dial from your phone and say the number you want. Once the name is in the address book, the system can recognize it and dial it directly.”
Shulman says his technology is already in 14 million wireless phones on the market today. His customers range from NEC to Mitsubishi to Motorola (for fixed car phones), Quanta, Sierra Wireless and others. He also belongs to the camp of wireless vendors that believes voice processing and text-to-speech systems will become a standard part of the mobile handset. “There are definitely many other markets—and that doesn’t mean there’s no place for [centralized] voice server technology. But in the long run, we see distributed speech recognition systems, in which some part of the recognition algorithm is embedded in the hand device; and that device, once connected to a speech-enabled server, delivers information from larger databases [and apps] with more capability.”
The interesting thing about speech recognition is that analysts don’t routinely segment the wireless speech systems market from speech recognition in general. “I’m not sure anyone focuses on mobile [exclusively],” observes Steve Cramoysan, a principal analyst at Gartner. Gartner’s calculation for speech recognition telephony software (that’s software only, not hardware or integration) for 2003 was $200 million. However, “the value of the solutions can be between 5 to 10 times that value all together, so that the total market today for speech-enabled solutions is about $1 billion,” says Cramoysan. That figure comprises the market value of all solutions: wireless, wireline and central server technologies.
According to Ronald Gruia, senior strategic analyst for Frost & Sullivan, speech engine components alone will grow from $27.5 million to about $252 million by 2010, a 38.5 percent compound annual growth rate, and that doesn’t include applications or embedded systems. Industry mergers and acquisitions have fueled growth and competition (e.g., ScanSoft, a leading voice vendor, bought SpeechWorks in recent years), as have improvements in speech technologies that affect the whole value chain. “AT&T pushed the envelope and came up with a natural sounding text-to-speech engine of very high quality,” Gruia observes.
One DARPA-funded project is developing prosodic processing, in which speech recognition engines discern intonations in human speech patterns and text-to-speech engines add intonations depending on context. “This research finds ways to detect the person’s emotional state while talking,” Gruia explains. “The knowledge can be leveraged in the battlefield or for commercial applications such as call center offload”—in other words, to detect alarm calls of high priority and caller stress. Technologies such as these continue to fuel speech recognition growth.
In wireless environments, which tend to be noisy, vendors are making progress with speech engines that filter out noise to make recognition more accurate. Consequently, the prospects for speech recognition are excellent, according to Gruia, especially given improvements in natural language understanding. “Vendors are able now to leverage speech services for directory automation and assistance services and reduce the time operators take retrieving records,” he said. “On average, just saving 1 to 2 seconds in operator intervention time saves telephone companies a huge amount of cost.”
Microsoft Stacks the Deck
In addition to rapid technological advances, the most important influence on speech recognition today is the entry of Microsoft into the marketplace. The company is touting speech as the next ubiquitous form of access to the Internet. Its Speech Server platform and emphasis on multimodality—access to Web pages through several modes of communication, including natural speech commands, graphics and textual input—will transform speech services into a major enterprise force.
The most obvious impact of Microsoft’s vision is that mobile phones will become the world’s most popular speech-enabled access tool to the Web, especially among millions of international users who have no other computer or portable PC readily available. “There are about 1.1 billion mobile phones and 1.1 billion landline phones. Compared to PCs, the phone is the most pervasive device in the world,” says James Mastan, a director of marketing for Microsoft’s speech product group. “We believe the bulk of access to applications will be the telephone. It’s a huge opportunity to allow access to applications and data, and the most natural way to open up these applications is speech.”
Ultimately, Microsoft believes multimodality will be incorporated as a simultaneous and/or sequential capability in mobile devices, meaning that users will have either one or several interfaces available to them at any one time. “For example, when you speak into a PDA and say you want to change your flight reservations, a speech application on the back-end will send both a voice response and a graphical confirmation of the changes you want to make,” says Pronexus’ Biber. “The level of interaction is complex—quite a bit higher, say, than typing in a request and receiving an e-mail confirmation or speaking a request to an automated attendant and receiving a voice response alone.” But the user experience, he argues, could be richer. More importantly, multimodal access will encourage mobile subscribers to buy bigger buckets of minutes and use their cell phones more. This is exactly the kind of strategy mobile carriers and IT companies relish.
The wrench in multimodality is that so far, it doesn’t work. “We’ve yet to see a commercial application in this scenario,” affirms Biber. “While it sounds excellent in theory, the development effort to put these applications together is not straightforward.” The challenge of creating next-generation speech recognition and text-to-speech solutions à la Microsoft may well fall to partner companies and individual developers such as Kirusa, an enterprise devoted to implementing multimodal applications on mobile devices. According to Inderpal Mumick, the founder and CEO of Kirusa, simultaneous multimodality—the ability to use more than one type of interface at the same time—will in fact work on advanced devices using Microsoft Mobile OS, Palm or Symbian operating systems. However, the larger set of devices out there, ordinary cell phones, can manage only sequential multimodality because of their limited processing and storage power.
Kirusa is now working on both types of technology, targeting carriers and enterprises directly with solutions such as voice and graphics-enabled car manuals (shown recently at the Automechanika show in Frankfurt, Germany) and multimodal yellow pages. By 2008, Mumick believes 300 million wireless devices will be fully enabled with multimodal capability.
On the horizon, speech systems will emulate human operators to provide unified messaging—what some analysts are now calling unified communications. Enterprise systems will incorporate artificial intelligence to listen in on real human conversations (e.g., customer call center operators) and implement business rules and grammars to make their speech systems more reliable and natural sounding. “That’s the technology break we’re looking for,” says Matt Keowan, a director of marketing at Nuance Communications, a speech recognition supplier. “The goal is to allow the speech system to adjust itself to what callers are doing [and saying] over the course of time.”
Arielle Emmett is a freelance writer based in Pennsylvania.