Speech, context and seams

Nicolas Nova recently asked “what about voice interfaces” and quoted a few smart people on where speech recognition is going.

Apparently the future is speech recognition combined with contextual interfaces. I have some reservations about contextual interfaces as they are often imagined, especially the “you’re in a strange city and need to have dinner somewhere” scenario.

But first, let’s talk about speech recognition interfaces. To be specific, let’s talk about speaker-dependent large vocabulary continuous speech recognition (LVCSR) interfaces. LVCSR systems are quite amazing. They use statistical modelling at the level of parts of sounds to “listen” to human speech, and then reconstruct whole sounds, then syllables, then whole words. LVCSR systems are things like Dragon NaturallySpeaking and the Samsung P207 phone. They aren’t the sorts of things described in this article. I also think that they’re fundamentally broken, because when they work they’re magic and when they break they fail in a completely opaque way. Put another way: speech recognition systems have hidden seams.

You can’t interrogate an LVCSR system to understand what it doesn’t understand. You can only go merrily along enjoying the magic until it fails. And even when it fails, you have to ask whether it failed because of some fault outside the system (you misspoke, say, or a noise affected the recognition) or because what you said just isn’t in the system. LVCSR systems are completely seamless until you fall off the edge of the world. Admittedly the capabilities of such systems are increasing and are already very large. Most people who end up using speech recognition systems will probably never encounter the seams. But when they blunder into a space that the system can’t cope with, things don’t fail in a particularly elegant manner. And they never can, because a system can’t know what it doesn’t know.
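
To make the hidden seam concrete, here is a toy sketch in Python. It is nothing like a real LVCSR decoder: the five-word vocabulary, the recognize function and the use of string similarity in place of acoustic modelling are all my own assumptions, chosen only to show the shape of the problem. The recognizer always hands back its best guess from the words it knows, so a slightly garbled word and a word it has never heard of look exactly the same from the outside.

```python
from difflib import SequenceMatcher

# Hypothetical closed vocabulary; a real system would know tens of
# thousands of words, but the seam is the same.
VOCABULARY = ["dinner", "restaurant", "table", "taxi", "hotel"]


def recognize(heard: str) -> str:
    """Return the vocabulary word most similar to what was heard.

    String similarity stands in for acoustic scoring here. The caller
    only ever gets a word back, never a reason.
    """
    return max(
        VOCABULARY,
        key=lambda word: SequenceMatcher(None, heard, word).ratio(),
    )


# Extra-system fault: the word is in the vocabulary but arrived garbled.
noisy_result = recognize("dinnar")

# In-system gap: the word simply is not in the vocabulary.
unknown_result = recognize("tapas")

# Both calls return an ordinary vocabulary word, with nothing to tell
# the user which kind of failure (if any) just happened.
print(noisy_result, unknown_result)
```

The garbled word comes back repaired and the unknown word comes back as something else entirely, and the interface for both is identical. That is the seam you can’t see until you have already fallen through it.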

