The Statesman, 19 April 2004
On the move

For years, voice recognition failed to live up to its promise. But now the time is ripe for breaking the barriers, says Raj Kaushik

“Open the pod bay doors, Hal.”
“I’m sorry, Dave, I’m afraid I can’t do that.”

Thus goes the conversation between Hal, a highly advanced computer aboard a Jupiter-bound spacecraft, and Dave in Stanley Kubrick’s visionary 1968 film 2001: A Space Odyssey, based on the novel by Arthur C. Clarke. More than three decades ago, Clarke was not far off the mark in visualising how far voice recognition would eventually go.
In the film, Hal was remarkably receptive to human voices, even if it sometimes heard “I scream” in place of “ice-cream”, or vice versa. The voice interface between humans and machines sounds simple, but it is actually the complex product of 30 years of research in mathematics, physics, linguistics and computer science. That revolution in speech recognition is now happening. Nine visionaries from the world’s leading companies will reveal the future of the industry on 5 and 6 May at Voice World Europe 2004, to be held at the Olympia Conference Centre in London.
An effective voice recognition system must imitate humans: it must, in effect, understand the complete vocabulary and be able to use words in the right context. The sound of human speech is composed of many frequencies (20 Hz to 22,000 Hz) at different volumes or levels, all occurring at once. But the higher frequencies can safely be discarded, and the voice can be adequately described as sound between 100 Hz and 8,000 Hz. A sound wave is analog in nature because it varies continuously in amplitude, and amplitude indicates the relative loudness or level of a sound.
The first step in voice recognition is to capture and digitise the actual sound. The conversion from analog signal to digital data is done by the computer’s sound card or equivalent circuitry. It requires that the voltages (amplitudes) of the sound be sampled at regular intervals, converting the continuous signal into numbers stored as 8-bit or 16-bit digital data; the number of bits per sample is called the “bit depth”.
Sound sampling rate and bit depth have a major impact on the amount of data that the computer must process. The sampling rate measures the number of samples (or “snapshots”) of the sound taken per second. For example, if the sampling rate is 44 kHz, then 44,000 samples are taken per second.
The bit depth measures the number of amplitude levels available per sample; generally, a higher bit depth translates to higher fidelity. A sample with 8-bit depth can take only 256 different amplitude levels, whereas CD-quality sound is usually digitised at 16 bits and can therefore take 65,536 different amplitude levels.
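To make the arithmetic concrete, here is a minimal Python sketch of the sampling and quantisation step described above. The 440 Hz test tone, the function name digitise and the particular rates are illustrative assumptions, not a description of any specific product.

import math

def digitise(signal, duration_s, sample_rate_hz, bit_depth):
    # Sample a continuous signal (a function of time, returning -1.0..1.0)
    # at regular intervals and quantise each sample to the given bit depth.
    levels = 2 ** bit_depth              # 256 levels at 8 bits, 65,536 at 16 bits
    n_samples = int(duration_s * sample_rate_hz)
    samples = []
    for n in range(n_samples):
        t = n / sample_rate_hz           # the instant of this "snapshot"
        amplitude = signal(t)            # continuous (analog-style) value
        # Map the -1.0..1.0 range onto the available integer levels.
        samples.append(int((amplitude + 1.0) / 2.0 * (levels - 1)))
    return samples

# A 440 Hz test tone digitised the way a sound card might: 44,100 samples
# per second, 16 bits (two bytes) per sample.
tone = lambda t: math.sin(2 * math.pi * 440 * t)
data = digitise(tone, duration_s=1.0, sample_rate_hz=44_100, bit_depth=16)
print(len(data), "samples,", len(data) * 2, "bytes for one second of sound")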
Next, the digitised sound is manipulated mathematically to reduce “noise” and to filter out stray signals. The cleaned-up stream of digital sound is then broken into phonemes, the specific sounds that make up words. A “phone” is the basic unit of speech sound.
A phone that distinguishes one word from another is called a “phoneme”. For example, the phonemes /r/ and /l/ serve to distinguish the word rip from the word lip. Conceptually, phonemes are not letters.
Phoneme sounds occur as in /w/ (“we”, “quite”, “once”), /OU/ (“no”, “boat”, “low”) and /CH/ (“much”, “nature”, “match”). The extracted phonemes are then matched against a built-in dictionary. With about 43 phonemes in English, there are millions of possibilities for how these sounds could be combined, and the voice recognition system figures out the correct choice through a series of algorithms that rely on the grammar of the language and on how it is spoken. For example, if the voice command is “I am going to be there”, the algorithm will select the word “there”, not “their”.
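As a rough illustration of the dictionary-matching step, the following Python sketch maps a recognised phoneme sequence to candidate words. The pronunciation entries and phoneme symbols are toy assumptions; a real recogniser uses a lexicon of tens of thousands of entries.

# Toy pronunciation dictionary: phoneme sequence -> candidate words.
# Homophones such as "there"/"their" share a single entry, so sound
# alone cannot separate them.
PRONUNCIATIONS = {
    ("R", "IH", "P"): ["rip"],
    ("L", "IH", "P"): ["lip"],
    ("DH", "EH", "R"): ["there", "their"],
}

def words_for(phonemes):
    # Return the words whose pronunciation matches the phoneme sequence.
    return PRONUNCIATIONS.get(tuple(phonemes), [])

print(words_for(["R", "IH", "P"]))    # ['rip']
print(words_for(["DH", "EH", "R"]))   # ['there', 'their'] - grammar must decide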
Most programs rely on a statistical process that tries to predict the most likely next phoneme, based on the words the system has already determined. The hidden Markov model is one of the most common models used for comparing and matching sounds. A typical recogniser keeps a vocabulary of over 70,000 words in storage and also uses libraries of grammar rules to decide which word is most likely to come next.
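A minimal sketch of that statistical step, assuming invented probabilities, might look like this in Python: the recogniser picks between homophones by asking which candidate is more likely to follow the words it has already decoded. In practice the probabilities are estimated from large text corpora, and a hidden Markov model scores the acoustics as well.

# Invented bigram probabilities, for illustration only.
BIGRAM_PROB = {
    ("be", "there"): 0.020,
    ("be", "their"): 0.001,
    ("their", "house"): 0.030,
    ("there", "house"): 0.002,
}

def pick(previous_word, candidates):
    # Choose the candidate most likely to follow the previous word,
    # falling back to a tiny probability for unseen pairs.
    return max(candidates, key=lambda w: BIGRAM_PROB.get((previous_word, w), 1e-6))

# "I am going to be ..." -> the model prefers "there" over "their".
print(pick("be", ["there", "their"]))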
In the last couple of years, there has been renewed interest in the software industry in giving machines a voice interface. Speech technologies are increasingly being incorporated by all the mainstream industries, including gaming, automobiles, healthcare and telecommunications. “In the last six months, speech has moved faster than ever before,” claims Steve Chambers, president of the SpeechWorks division of ScanSoft Inc. Bill Gates, Microsoft’s chairman and chief software architect, unveiled Speech Server 2004 during his keynote presentations at AVIOS SpeechTEK Spring 2004, VSLive! San Francisco 2004 and the Microsoft Mobile Developers’ Conference in March 2004.
Speech Server 2004 includes Microsoft’s own speech-recognition engine, ScanSoft Inc.’s Speechify text-to-speech engine and a development kit for building speech applications with Visual Studio .NET. IBM is pushing forward with its research on advanced technology to make speech applications and Interactive Voice Response (IVR) systems more powerful than ever. IBM’s speech-technology research targets four areas: superhuman speech, expressive output, advanced speech tooling and conversational biometrics. IBM has already accumulated a portfolio of more than 250 patents related to speech and voice technology. OnStar Corporation, a subsidiary of General Motors, has chosen IBM to provide interactive speech technologies for its next-generation voice-recognition applications. IBM’s Embedded ViaVoice is an engine that provides both voice-recognition and digit-dialling capability.
ViaVoice takes the verbal string of numbers, digitises that string and prepares it into bits of information ready to feed to the dialling application. The application software turns the input data into digits and dials the phone. As the IBM voice engine also supports a much larger vocabulary, GM may use the technology in its cars to perform other functions, including opening and closing windows with voice commands. Norwegian company Opera is adding IBM speech-recognition technology to its free browser software.
Postbank AG, a German company, has selected VoiceObjects to deploy its automated telephone banking system. The contract includes integration of speech processing systems with the necessary back-end software and databases. Sports Loyalty Card has chosen SRC ContactCapture™, a packaged speech recognition service that captures callers’ name and address details over the telephone. The service is designed to collect football fans’ contact details as the loyalty card programme is rolled out to clubs across the country.
Richland County, South Carolina, USA, is deploying an IVR system to facilitate tax payments for businesses and residents. The iVoice IVR for Richland County is designed to allow clients to dial in and make tax payments to their accounts using credit cards. LifeLine, an action-adventure game for the PlayStation 2, is billed as the first completely voice-activated game. The game unfolds on a space station that has been attacked by aliens.
Trapped in a control room, the player’s only contact is with Rio, a female survivor. The player issues voice commands to Rio, who is on a search-and-rescue mission. Rio recognises more than 5,000 words and 100,000 phrases. During firefights, the player needs to issue voice commands as to where, when and whom to shoot.
For years, voice recognition has failed to live up to its promise. Until 2001, the products were expensive, inaccurate and hard to use. That is changing: Hossein Eslambolchi, AT&T’s chief technology officer, recently predicted that the rate of innovation in speech technology will be five times faster than Moore’s Law.

(The author, a former project coordinator with the National Council of Science Museums, India, and exhibits manager at the Discovery Centre, Halifax, now works as senior server developer in Toronto and writes for various Canadian and international journals.)

