What is ASR?
ASR stands for Automatic Speech Recognition: a technology that converts speech into text. With its help, we can talk to our machines in a natural way, much like the way we talk to other humans.
Examples – YouTube captions, Alexa, smart TVs, etc.
Components in ASR
There are three major components in ASR:
- Lexicon
- Acoustic Model
- Language Model
Lexicon – The lexicon is the first step in decoding speech: a comprehensive lexical inventory for the ASR system that includes all the fundamental elements of spoken language and written vocabulary. The lexicon is the building block the acoustic model relies on for every vocal input.
Acoustic Model – The acoustic model is the second step in ASR. Its job is to separate an audio signal into minute time frames, then analyze each frame and provide the probability that different phonemes (the basic building-block sounds of language and words) occur in that section of audio. Simply put, the acoustic model aims to determine which sound is spoken in each frame.
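The frame-splitting step described above can be sketched in a few lines of Python. This is a minimal illustration, not a real feature extractor; the 25 ms frame length and 10 ms hop are assumed typical defaults, not values from the article:

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=25, hop_ms=10):
    """Split a 1-D audio signal into short overlapping frames.
    25 ms frames with a 10 ms hop are common defaults (an assumption here)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    return np.stack([signal[i * hop_len : i * hop_len + frame_len]
                     for i in range(n_frames)])

# One second of audio at 16 kHz becomes 98 frames of 400 samples each
frames = frame_signal(np.zeros(16000), 16000)
print(frames.shape)  # (98, 400)
```

A real acoustic model would then convert each frame into features (such as spectrogram or MFCC values) before predicting phoneme probabilities.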
The acoustic model is very important because different people pronounce the same phrase in multiple ways: background noise, accents, and speaker characteristics can all make the same sentence sound different.
Acoustic Models use deep learning algorithms to determine the relationship between audio frames and phonemes.
A very commonly used acoustic model in ASR is the Hidden Markov Model, which is based on the Markov chain model. This model predicts the probability of an event based on the current state of a situation, and the acoustic model works in the same way.
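The Markov chain idea can be shown with a toy example: the probability of a phoneme sequence is the product of transition probabilities from each state to the next. The phoneme labels and probabilities below are purely illustrative, not taken from any real acoustic model:

```python
# Toy Markov chain over three phoneme states ("k", "ae", "t" as in "cat").
# Transition probabilities are invented for illustration only.
transitions = {
    "k":  {"k": 0.1, "ae": 0.8, "t": 0.1},
    "ae": {"k": 0.1, "ae": 0.2, "t": 0.7},
    "t":  {"k": 0.3, "ae": 0.3, "t": 0.4},
}

def sequence_probability(path):
    """P(path), assuming the chain starts in the first state with probability 1."""
    prob = 1.0
    for prev, cur in zip(path, path[1:]):
        prob *= transitions[prev][cur]
    return prob

print(sequence_probability(["k", "ae", "t"]))  # 0.8 * 0.7
```

A full Hidden Markov Model additionally attaches emission probabilities to each state, linking hidden phoneme states to the observed audio frames.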
Language Model – The language model is the third step in the ASR system. It is used to recognize the intent of spoken phrases and to compose word sequences. It operates in a similar way to the acoustic model, using deep learning algorithms trained on text data to estimate the probability of which word comes next in a phrase.
It is common for speech recognition software to use N-gram probability to translate spoken words into text.
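An N-gram probability is simply how often a word follows the previous word(s) in training text. A minimal bigram (2-gram) sketch, using an invented eight-word corpus as the training data:

```python
from collections import Counter

# Tiny invented training corpus (illustrative only)
corpus = "the weather is nice the weather is cold".split()

bigrams = Counter(zip(corpus, corpus[1:]))   # counts of word pairs
unigrams = Counter(corpus[:-1])              # counts of first words

def bigram_prob(prev, word):
    """P(word | prev) by maximum-likelihood estimate."""
    return bigrams[(prev, word)] / unigrams[prev]

print(bigram_prob("the", "weather"))  # 1.0 – "the" is always followed by "weather"
print(bigram_prob("is", "nice"))      # 0.5 – "is" is followed by "nice" half the time
```

Real language models smooth these counts and use much longer contexts, but the underlying idea of "which word is most likely next?" is the same.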
So, with the help of these three components, an ASR system is able to make highly accurate predictions of the words or sentences in the audio input.
How does ASR work?
You ask your device, "What is the weather forecast?" Your device creates a wave file of your words, then background noise is reduced and the volume is normalized. The filtered waveform is broken into phonemes (the sounds used to build words). Each phoneme is like a link in a chain: starting from the first phoneme, statistical analysis is used to find the most likely words.
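The noise-reduction and volume-normalization steps above can be sketched as follows. This is a toy illustration, not production signal processing; the simple amplitude threshold used as a "noise gate" is an assumption for demonstration:

```python
import numpy as np

def reduce_noise(signal, threshold=0.05):
    """Crude noise gate: zero out samples quieter than the threshold
    (illustrative only; real systems use spectral methods)."""
    return np.where(np.abs(signal) < threshold, 0.0, signal)

def normalize_volume(signal):
    """Scale the signal so its peak amplitude is 1.0."""
    peak = np.max(np.abs(signal))
    return signal / peak if peak > 0 else signal

# A quiet background hum followed by a louder "speech" burst
signal = np.concatenate([np.full(100, 0.01), np.full(100, 0.5)])
cleaned = normalize_volume(reduce_noise(signal))
print(cleaned[0], cleaned[-1])  # hum removed, speech scaled to full volume
```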
ASR focuses on tagged words. The vocabulary of an ASR system consists of 60 thousand or more words, which means there are over 216 trillion possible word combinations if you speak just three words in a sequence. It would be impractical for an ASR system to scan its entire vocabulary for each word and process every combination individually, so it reacts to certain "tagged" words and phrases like "weather forecast", "check my balance", etc.
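The combination count above is simple arithmetic: with a 60,000-word vocabulary, a three-word sequence has 60,000 × 60,000 × 60,000 possibilities.

```python
vocab_size = 60_000
combinations = vocab_size ** 3  # every ordered three-word sequence
print(f"{combinations:,}")      # 216,000,000,000,000
```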
How does ASR learn from Humans?
ASR learning is based on two mechanisms.
- Human Tuning
- Active learning
Human Tuning – In this approach, the ASR system learns from the conversation logs of its software interface. Commonly used words that appear in the logs but are missing from the pre-programmed vocabulary are added to it, so the software becomes able to understand speech better.
Active Learning – Active learning is more advanced than human tuning. The ASR system constantly expands its vocabulary by learning autonomously, adapting to new words during this learning process.
Challenges in ASR
- Background Noise
- Difficult accents
- Lack of trust and privacy issues
- Touchless screens
- Poor recording equipment, which makes it hard to identify what the user said