Computational Intelligence, SS08
2 VO 442.070 + 1 RU 708.070 last updated:
Course Notes (Skriptum)
Automatic Speech Recognition

In automatic speech recognition (ASR) systems, acoustic information is sampled as a signal suitable for processing by computers and fed into a recognition process. The output of the system is a hypothesis for a transcription of the utterance. Speech recognition is a complicated task, and state-of-the-art recognition systems are very complex; there is a large number of different approaches for implementing the individual components. For further information the reader is referred to [1,2,3]. Here we only want to give an overview of ASR, some of its main difficulties, the basic components, their functionality, and their interaction.

Components of ASR

Figure 1 shows the main components of an ASR system.
Figure 1: Principle components of an ASR system
In the first step, the Feature Extraction, the sampled speech signal is parameterized. The goal is to extract a number of parameters (`features') from the signal that carry a maximum of information relevant for the subsequent classification. That means features are extracted that are robust to acoustic variation but sensitive to linguistic content. In other words, features are required that are discriminant and make it possible to distinguish between different linguistic units (e.g., phones). On the other hand, the features should also be robust against noise and against factors that are irrelevant for the recognition process (e.g., the fundamental frequency of the speech signal). The number of features extracted from the waveform is commonly much lower than the number of signal samples, thus reducing the amount of data. The choice of suitable features varies depending on the classification technique.
Figure 2: Feature extraction from a speech signal. Every `hop-size' seconds a vector of features is computed from the speech samples in a window of length `window-size'.

Figure 2 indicates how features (or feature vectors) are derived from the speech signal. Typically, a frequency-domain parametrization is performed to extract the features. Spectral analysis is performed, e.g., every 10 ms on the speech samples in a window of, e.g., 32 ms length. The speech signal is regarded as stationary on this time scale. Although this is not strictly true, it is a reasonable approximation. For each frame a vector of parameters, the feature vector, is determined and handed to the next stage, the classification.
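The windowing scheme of Figure 2 can be sketched in a few lines of Python. Note that this is only an illustration of the data flow (frames of `window-size' samples taken every `hop-size' seconds, one feature vector per frame); the log-spectral features below are a deliberately simplified stand-in for what real systems compute (e.g., MFCCs), and all function names are our own.

```python
import numpy as np

def frame_signal(signal, fs, window_size=0.032, hop_size=0.010):
    """Split a sampled signal into overlapping frames (window/hop in seconds)."""
    win = int(window_size * fs)
    hop = int(hop_size * fs)
    n_frames = 1 + max(0, (len(signal) - win) // hop)
    frames = np.stack([signal[i * hop : i * hop + win] for i in range(n_frames)])
    return frames * np.hamming(win)   # taper each frame to reduce spectral leakage

def log_spectral_features(frames, n_features=13):
    """Toy spectral features: log magnitude of the first FFT bins.
    (Real systems use MFCCs or similar; this only shows the principle.)"""
    spectrum = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(spectrum[:, :n_features] + 1e-10)

fs = 16000                      # 16 kHz sampling rate
t = np.arange(fs) / fs          # 1 second of a synthetic test signal
signal = np.sin(2 * np.pi * 440 * t)

frames = frame_signal(signal, fs)         # 97 frames of 512 samples each
features = log_spectral_features(frames)  # one 13-dim feature vector per frame
print(frames.shape, features.shape)       # → (97, 512) (97, 13)
```

With a 32 ms window and a 10 ms hop at 16 kHz, one second of speech yields 97 frames of 512 samples, each reduced to a 13-dimensional feature vector, illustrating the data reduction mentioned above.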

In the classification module the feature vectors are matched with reference patterns, which are called acoustic models. The reference patterns are usually Hidden Markov Models (HMMs) trained for whole words or, more often, for phones as linguistic units. HMMs cope with temporal variation, which is important since the duration of individual phones may differ between the reference speech signal and the speech signal to be recognized. A linear normalization of the time axis is not sufficient here, since not all phones are expanded or compressed over time in the same way. For instance, stop consonants (``d'', ``t'', ``g'', ``k'', ``b'', and ``p'') do not change their length much, whereas the length of vowels strongly depends on the overall speaking rate.
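How an HMM absorbs temporal variation can be illustrated with a minimal Viterbi alignment of feature frames to a left-to-right model: the self-loop on each state lets that state span a variable number of frames, so slow and fast realizations of a phone align to the same model. The transition and emission scores below are invented toy numbers, not trained values.

```python
import numpy as np

def viterbi(log_A, log_B):
    """Most likely state sequence for a left-to-right HMM starting in state 0.
    log_A: (S, S) log transition matrix; log_B: (T, S) log emission scores."""
    T, S = log_B.shape
    delta = np.full((T, S), -np.inf)
    psi = np.zeros((T, S), dtype=int)
    delta[0, 0] = log_B[0, 0]
    for t in range(1, T):
        for s in range(S):
            scores = delta[t - 1] + log_A[:, s]
            psi[t, s] = np.argmax(scores)
            delta[t, s] = scores[psi[t, s]] + log_B[t, s]
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

# 3-state left-to-right model: self-loops allow each state to cover a
# variable number of frames (non-linear time normalization).
A = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.6, 0.4],
              [0.0, 0.0, 1.0]])
log_A = np.log(A + 1e-300)
# Toy emission scores: frames 0-1 fit state 0, frames 2-4 state 1, frame 5 state 2.
log_B = np.log(np.array([[0.8, 0.1, 0.1],
                         [0.8, 0.1, 0.1],
                         [0.1, 0.8, 0.1],
                         [0.1, 0.8, 0.1],
                         [0.1, 0.8, 0.1],
                         [0.1, 0.1, 0.8]]))
print(viterbi(log_A, log_B))  # → [0, 0, 1, 1, 1, 2]
```

Here the middle state stretches over three frames while the others cover fewer, which a linear rescaling of the time axis could not achieve.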

The pronunciation dictionary defines which combinations of phones form valid words for the recognition. It can contain information about different pronunciation variants of the same word. Table 1 shows an extract of such a dictionary. The words (graphemes) in the left column are related to their pronunciations (phones) in the right column (phone symbols like those in the table are commonly used for English).

Table 1: Extract from a dictionary

  word          pronunciation
  INCREASE      ih n k r iy s
  INCREASED     ih n k r iy s t
  INCREASES     ih n k r iy s ah z
  INCREASING    ih n k r iy s ih ng
  INCREASINGLY  ih n k r iy s ih ng l iy
  INCREDIBLE    ih n k r eh d ah b ah l
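In its simplest form, such a dictionary is just a mapping from graphemes to phone sequences; the recognizer concatenates the entries to build the phone sequence of a word string. A minimal sketch (the mini-dictionary and function name are our own, with entries taken from Table 1):

```python
# Mini-dictionary mapping graphemes to phone sequences (cf. Table 1).
dictionary = {
    "INCREASE":  ["ih", "n", "k", "r", "iy", "s"],
    "INCREASED": ["ih", "n", "k", "r", "iy", "s", "t"],
    "INCREASES": ["ih", "n", "k", "r", "iy", "s", "ah", "z"],
}

def phones_for_utterance(words, dictionary):
    """Concatenate the phone sequences of the words in an utterance.
    Out-of-vocabulary words simply raise an error in this sketch."""
    phones = []
    for word in words:
        if word not in dictionary:
            raise KeyError(f"out-of-vocabulary word: {word}")
        phones.extend(dictionary[word])
    return phones

print(phones_for_utterance(["INCREASE", "INCREASED"], dictionary))
```

A real dictionary would store several pronunciation variants per word (e.g., a list of phone sequences rather than a single one), but the lookup principle is the same.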

The language model contains rudimentary syntactic information. Its aim is to predict the likelihood of specific words occurring one after another in a certain language. More formally, the probability of the $k$-th word following the $(k-1)$ previous words is $P(w_k \vert w_{k-1}, w_{k-2}, \ldots, w_1)$. In practice the context (the number of previous words considered in the model) is restricted to $(n-1)$ words, $P(w_k \vert w_{k-1}, w_{k-2}, \ldots, w_1) \approx P(w_k \vert w_{k-1}, w_{k-2}, \ldots, w_{k-n+1})$, and the resulting language model is called an $n$-gram model.
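For $n = 2$ (a bigram model), the conditional probabilities $P(w_k \vert w_{k-1})$ can be estimated from a text corpus simply by counting. A maximum-likelihood sketch (the three-sentence corpus and the function name are our own; real language models additionally need smoothing for unseen word pairs):

```python
from collections import Counter

def train_bigram_model(corpus):
    """Maximum-likelihood bigram probabilities P(w_k | w_{k-1}) from counts."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.lower().split()   # <s> marks sentence start
        unigrams.update(tokens[:-1])                  # count each history word
        bigrams.update(zip(tokens[:-1], tokens[1:]))  # count word pairs
    return {(h, w): c / unigrams[h] for (h, w), c in bigrams.items()}

corpus = [
    "the cat sat",
    "the cat ran",
    "the dog sat",
]
model = train_bigram_model(corpus)
print(model[("the", "cat")])   # 2 of the 3 occurrences of "the" are followed by "cat"
```

This yields, e.g., $P(\mathrm{cat} \vert \mathrm{the}) = 2/3$; unseen bigrams get probability zero here, which is why smoothing techniques are indispensable in practice.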

Sub-word modeling with HMMs

In large vocabulary ASR systems, HMMs are used to represent sub-units of words (such as phones). For English it is typical to have around 40 models (one per phone). The exact phone set depends on the dictionary that is used. Word models can then be constructed as combinations of the sub-word models.

In practice, the realization of one and the same phone differs considerably depending on its neighboring phones (the phone `context'). Therefore, context-dependent phone models are most widely used. Biphone models consider either the left (preceding) or the right (succeeding) phone; in triphone models both neighboring phones are taken into account, and for each phone different models are used in different contexts. In Figure 3, the English word ``bat'' [b ae t] is shown in a monophone, biphone, and triphone representation. The underlying sub-models for the phones or their combinations (in the bi- and triphone case) are in most cases HMMs. Since the training data rarely contains enough occurrences of all triphone combinations (a phonetic alphabet of 40 phones results in $40^3 = 64\,000$ possible triphones), clustering techniques, e.g., based on binary regression trees, are often used to obtain reliable models for rarely occurring phone combinations.

Figure 3: Monophone, biphone, and triphone HMMs for the English word ``bat'' [b ae t]. `sil' stands for silence at the beginning and end of the utterance, which is modeled as a `phone', too.
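The expansion of a monophone string into context-dependent model names, as shown in Figure 3, is a purely mechanical step. A sketch using the common HTK-style notation `left-phone+right' (the function name is our own; the silence `phone' at the utterance boundaries provides the missing outer context):

```python
def expand(phones, context="triphone"):
    """Expand a monophone sequence into context-dependent model names,
    using HTK-style notation, e.g. 'b-ae+t' (left context 'b', right 't')."""
    out = []
    for i, p in enumerate(phones):
        left = phones[i - 1] if i > 0 else None
        right = phones[i + 1] if i < len(phones) - 1 else None
        if context == "monophone":
            out.append(p)
        elif context == "biphone":          # left-context biphone
            out.append(f"{left}-{p}" if left else p)
        else:                               # triphone
            name = p
            if left:
                name = f"{left}-{name}"
            if right:
                name = f"{name}+{right}"
            out.append(name)
    return out

phones = ["sil", "b", "ae", "t", "sil"]    # "bat" with surrounding silence
print(expand(phones, "triphone"))
# → ['sil+b', 'sil-b+ae', 'b-ae+t', 'ae-t+sil', 't-sil']
```

Each of the five segments now selects its own context-dependent HMM; with 40 phones this is exactly what inflates the model inventory toward the $40^3$ triphones discussed above.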
The information coming from the language model and the acoustic models, as well as the information from the pronunciation dictionary, has to be balanced during speech recognition.