On the move
For years, voice recognition failed to live up to its promise. But now the time is ripe for breaking the barriers, says Raj Kaushik
“Open the pod bay doors, Hal.” “I’m sorry, Dave, I’m afraid I can’t do that.”
Thus goes the conversation between Hal, a highly advanced computer aboard a Jupiter-bound spacecraft, and Dave in Stanley Kubrick’s visionary 1968 movie, 2001: A Space Odyssey, based on the novel by Arthur C. Clarke. More than three decades ago, Clarke was not far off the mark in visualising how far voice recognition would go. In 2001... Hal was remarkably receptive to human voices. It was another matter that it sometimes heard “I scream” in place of “ice cream”, or vice versa. The voice
interface between humans and machines sounds simple, but it’s
actually the complex product of 30 years of research in mathematics,
physics, linguistics, and computer science. A revolution in speech recognition is now under way. Nine visionaries from the world’s leading companies will reveal the future of the industry on 5 and 6 May at Voice World Europe 2004, at the Olympia Conference Centre in London. An effective voice
recognition system must imitate human listening: it must, in effect, understand a complete vocabulary and be able to use words in the right context. Audible sound spans many frequencies (roughly 20 Hz to 22,000 Hz) at different volumes or levels, all occurring at once. But the higher frequencies can safely be discarded: the human voice is well described by sounds between 100 Hz and 8,000 Hz. A sound
wave is analog in nature because it varies continuously in
amplitude. Amplitude indicates the relative loudness or level of a
sound. The first step in voice recognition is to capture and digitise the actual sound. The conversion from analog signal to digital data is done by the computer’s sound card or circuitry: the voltages (amplitudes) of the sound are sampled at regular intervals, turning a continuous signal into numbers stored as 8-bit or 16-bit values. The number of bits per sample is called the “bit depth”. Sampling rate and bit depth together have a major impact on the amount of data the computer must process. The sampling rate measures the number of samples (or “snapshots”) of the sound taken per second; at a sampling rate of 44 kHz, 44,000 samples are taken every second. The bit depth determines how many distinct amplitude levels each sample can represent, and in general a higher bit depth translates to higher fidelity. A sample with 8-bit depth can take only 256 different amplitude levels; CD-quality sound is usually digitised at 16 bits, which allows 65,536 different amplitude levels.
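
To make the sampling step concrete, here is a minimal Python sketch (the digitise function and the sine-wave stand-in are ours, purely for illustration, not any real sound card’s interface): it takes amplitude “snapshots” at a fixed rate and rounds each one to the nearest of the available levels.

    import math

    def digitise(signal, duration_s, sample_rate_hz, bit_depth):
        """Sample a continuous signal at regular intervals and quantise
        each amplitude to one of 2**bit_depth discrete levels."""
        levels = 2 ** bit_depth                # 8-bit -> 256, 16-bit -> 65,536
        samples = []
        for n in range(int(duration_s * sample_rate_hz)):
            t = n / sample_rate_hz             # time of this "snapshot"
            amplitude = signal(t)              # continuous value in [-1.0, 1.0]
            # Map the analog range [-1.0, 1.0] onto the integer levels.
            samples.append(round((amplitude + 1.0) / 2.0 * (levels - 1)))
        return samples

    # A 440 Hz sine wave stands in for a voice signal.
    tone = lambda t: math.sin(2 * math.pi * 440 * t)

    # Telephone-style capture: 8,000 samples per second, 8 bits per sample.
    data = digitise(tone, duration_s=0.01, sample_rate_hz=8000, bit_depth=8)
    print(len(data), "samples; first few values (0-255):", data[:6])
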
Next, the digitised sound is manipulated mathematically to reduce “noise” and strip out other stray signals. The cleaned-up stream of digital sound is then
broken into phonemes, or the specific sounds of words. A “phone” is
the basic unit of speech sound. A phone that distinguishes one
word from another is called a “phoneme”. For example, the phonemes
/r/ and /l/ serve to distinguish the word rip from the word lip.
Conceptually, phonemes are not letters: the phoneme /w/ occurs in “we”, “quite” and “once”; /OU/ in “no”, “boat” and “low”; and /CH/ in “much”, “nature” and “match”. The extracted phonemes are then matched against a built-in dictionary.
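
As a toy illustration of that dictionary matching (the entries and ARPAbet-style symbols below are invented for the example, not taken from any real engine):

    # A miniature "built-in dictionary" mapping phoneme sequences to words.
    # A real recogniser's dictionary holds tens of thousands of entries.
    PRONUNCIATIONS = {
        ("R", "IH", "P"): "rip",
        ("L", "IH", "P"): "lip",
        ("W", "IY"):      "we",
    }

    def lookup(phonemes):
        """Return the word matching an extracted phoneme sequence, if any."""
        return PRONUNCIATIONS.get(tuple(phonemes), "<unknown>")

    print(lookup(["R", "IH", "P"]))   # -> rip
    print(lookup(["L", "IH", "P"]))   # -> lip
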
With 43 phonemes in the English language, there are millions of ways these sounds can be combined. The voice recognition system settles on the correct choice through a series of algorithms that rely on the grammar of the language and how it is spoken. For example, if the voice command is “I am going to be there”, the algorithm will select the word “there”, not “their”. Most programmes rely on a statistical process that
tries to predict the most likely next phoneme, based on the words it has determined so far. The Hidden Markov Model is one of the most common models used for comparing and matching sounds. A typical recogniser carries over 70,000 words in its dictionary and uses libraries of grammar rules to decide which word is most likely to come next.
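
A single step of that scoring might look like the following sketch, with invented probabilities: both candidate words sound identical, so the transition probabilities, standing in for grammar knowledge, break the tie.

    # One step of Hidden-Markov-style scoring for the homophones in the
    # example above. Both words emit the same phoneme string, so the
    # acoustic (emission) probabilities tie and the transition
    # probabilities decide. All numbers are invented for illustration.
    prev_word = "be"
    candidates = ["there", "their"]

    emit_p  = {"there": 0.5, "their": 0.5}     # identical sound: a dead heat
    trans_p = {("be", "there"): 0.9,           # "to be there" is common
               ("be", "their"): 0.1}           # "to be their" is rare

    scores = {w: trans_p[(prev_word, w)] * emit_p[w] for w in candidates}
    print(max(scores, key=scores.get))         # -> there
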
In the last couple of years there has been renewed interest in the software industry in giving machines a voice interface. Speech technologies are increasingly being adopted by mainstream industries, including gaming, automotive, healthcare and telecommunications. “In the last six months, speech has
moved faster than ever before,” claims Steve Chambers, president of the SpeechWorks Division of ScanSoft Inc. Bill Gates, Microsoft’s chairman
and chief software architect, unveiled Speech Server 2004 during his
keynote presentation at the AVIOS SpeechTEK Spring 2004, VSLive! San
Francisco 2004 and Microsoft Mobile Developers’ Conference in March
2004. Speech Server 2004 includes Microsoft’s own
speech-recognition engine, ScanSoft Inc.’s Speechify text-to-speech
engine and a development kit for building speech applications with
Visual Studio .NET. IBM is pushing forward with its research on
advanced technology to make speech applications and Interactive
Voice Response systems more powerful than ever. IBM’s
speech-technology research targets four areas: superhuman speech,
expressive output, advanced speech tooling and conversational
biometrics. IBM has already accumulated a portfolio of more than 250
patents related to speech and voice technology. OnStar Corporation,
a subsidiary of General Motors, has chosen IBM to provide
interactive speech technologies for its next-generation
voice-recognition applications. IBM’s Embedded ViaVoice is an engine
that provides both voice-recognition and digit-dialling capability. ViaVoice takes the verbal string of numbers, digitises it and packages the information for the dialling application, which turns the input into digits and dials the phone.
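
A hypothetical sketch of that last hand-off, with names of our own invention rather than IBM’s actual interfaces:

    # Glue between a recogniser's output and a dialler: the engine
    # returns digit words, and the application maps them to a dialable
    # number. Table and function names here are illustrative only.
    DIGIT_WORDS = {"zero": "0", "oh": "0", "one": "1", "two": "2",
                   "three": "3", "four": "4", "five": "5", "six": "6",
                   "seven": "7", "eight": "8", "nine": "9"}

    def to_dial_string(recognised):
        """Turn a recognised string of digit words into phone digits."""
        return "".join(DIGIT_WORDS[word] for word in recognised.split())

    print(to_dial_string("four one six five five five one two one two"))
    # -> 4165551212
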
As the IBM voice engine also supports a much larger vocabulary, GM may use the technology in its cars for other functions, such as opening and closing windows by voice command. Norwegian company Opera is
adding IBM speech-recognition technology to its free browser
software. Postbank AG, a German company, has selected
VoiceObjects to deploy its automated telephone banking system. The
contract includes integration of speech processing systems with the
necessary back-end software and databases. Sports Loyalty Card has
chosen SRC ContactCapture™, a packaged speech recognition service
that captures callers’ name and address details over the telephone.
The service is designed to collect football fans’ contact details as
the loyalty card programme is rolled out to clubs across the
country. Richland County, South Carolina, USA, is deploying an
IVR system to facilitate tax payments for businesses and residents.
The iVoice IVR for Richland County is designed to allow clients to
dial in to make tax payments to their accounts using credit cards.
LifeLine, an action-adventure game for the PlayStation 2, is billed as
the first completely voice-activated game. The game unfolds on a
space station that’s attacked by aliens. Trapped in a control room, the player’s only contact is Rio, a female survivor on a search-and-rescue mission, who acts on the player’s voice commands. Rio recognises more than 5,000 words and 100,000 phrases. During firefights, the player must issue voice commands on where, when and whom to shoot. For years, voice recognition has
failed to live up to its promise. Until 2001, the products were
expensive, inaccurate, and hard to use. That’s changing. AT&T’s chief technology officer, Hossein Eslambolchi, recently predicted that innovation in speech technology is running five times faster than Moore’s Law.
(The author, a former project coordinator with the National
Council of Science Museums, India, and exhibits manager at the
Discovery Centre, Halifax, now works as senior server developer in
Toronto and writes for various Canadian and international
journals.)