Computational Intelligence, SS08
2 VO 442.070 + 1 RU 708.070


Hidden Markov Model Toolkit (HTK)

The Hidden Markov Model Toolkit (HTK) [5] is a portable toolkit for building and manipulating hidden Markov models. HTK is primarily used for research in speech recognition. HTK is in use at hundreds of sites worldwide.

Experiment with HTK

We will demonstrate HTK on a rather simple recognition experiment. The lexicon contains two entries, the German words `ja' (yes) and `nein' (no). We will first train the acoustic models (HMMs) on a set of training data, and then evaluate the HMMs on a distinct set of test utterances.

HTK tools are designed to run with a traditional command-line interface. Each tool has a number of required arguments plus optional arguments. The speech data used for the following experiment is part of SpeechDat-AT, a database for Austrian German [4].

Starting point

We need speech data to train the HMMs. These data are stored in the Q/ directory. Furthermore, a dictionary is needed to define the valid words and their pronunciations (cf. table 1) for recognition. The dictionary for our task is in the file `dict'. The file ``phones0.mlf'' contains the phone-level transcriptions of all training utterances. We will use two different sets of training data, one with about 40 utterances and the other with about 1000. The sets are defined in the files ``train1.scp'' (smaller set) and ``train2.scp'' (larger set).

Step 0 - Feature Extraction

In the first stage of the ASR system, the raw speech data (signal waveforms) are parametrized into sequences of feature vectors (cf. fig. 2). Since this is a somewhat time-consuming task (and since the speech signal files are very large), this feature extraction has been done in advance for the training data set. The feature vector sequences for the training and test data sets can be found in the directory Q/ in files named SNNNN.mfc.
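To get a feeling for what such a feature file contains, the following minimal sketch reads the header of a parameter file. It assumes that the .mfc files use the standard HTK parameter file layout (a 12-byte big-endian header followed by uncompressed 4-byte floats), which is not stated in this tutorial. The function name `read_htk_header` and the synthetic header values are purely illustrative.

```python
import struct

# Assumed standard HTK header: nSamples (int32), sampPeriod (int32, in 100 ns units),
# sampSize (int16, bytes per feature vector), parmKind (int16), all big-endian.
HEADER = struct.Struct(">iihh")


def read_htk_header(data: bytes) -> dict:
    """Decode the first 12 bytes of an (assumed) HTK parameter file."""
    n_samples, samp_period, samp_size, parm_kind = HEADER.unpack_from(data, 0)
    return {
        "n_frames": n_samples,                   # number of feature vectors
        "frame_shift_ms": samp_period / 10_000,  # 100 ns units -> milliseconds
        "dim": samp_size // 4,                   # assumes uncompressed 4-byte floats
        "base_kind": parm_kind & 0o77,           # low 6 bits: basic kind (6 = MFCC)
    }


# Synthetic header for illustration only (not taken from the course data).
example = HEADER.pack(250, 100000, 156, 6 | 0o400 | 0o1000 | 0o20000)
print(read_htk_header(example))
```

To inspect a real file, one would pass the first 12 bytes read in binary mode, e.g. `open(path, "rb").read(12)`.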

Step 1 - Training the Acoustic Models (HMMs)

The parameters of the HMMs are trained using the EM algorithm. The emission pdfs are mixtures of Gaussians (cf. tutorial `Mixtures of Gaussians'). The EM algorithm is iterative; in HTK, each iteration is invoked with a call of the tool HERest. To find initial parameters, the HTK tool HCompV can be used. It scans the set of training feature files, computes the global mean and variance, and sets the parameters of all pdfs in a given HMM to these mean and variance values.
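The initialization idea (often called a flat start) can be sketched in a few lines. This is an illustration of the principle only, not HCompV itself: it assumes one Gaussian with diagonal covariance per state, and the names `global_mean_variance` and `flat_start` as well as the toy feature vectors are hypothetical.

```python
def global_mean_variance(frames):
    """Per-dimension mean and variance over all feature vectors."""
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    var = [sum((f[d] - mean[d]) ** 2 for f in frames) / n for d in range(dim)]
    return mean, var


def flat_start(frames, n_states):
    """Give every state the same Gaussian: the global mean and variance."""
    mean, var = global_mean_variance(frames)
    # Each state gets its own copy so that later re-estimation can change them separately.
    return [{"mean": list(mean), "var": list(var)} for _ in range(n_states)]


# Toy 2-dimensional "feature vectors" (made up for illustration).
toy_frames = [[1.0, 10.0], [3.0, 14.0]]
for state in flat_start(toy_frames, n_states=3):
    print(state)
```

Because all states start out identical, it is the subsequent EM iterations that make the state pdfs differ from each other.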

The invocation of these two HTK tools is implemented in a Perl program. Its first argument specifies the base name of the new directories that will be created to store the parameters of the trained HMMs. The second argument is the training file (use either train1.scp or train2.scp), and the third argument specifies the number of training iterations. For example, if you call the program with the arguments ABC train1.scp 2, you will get three new directories: ABC_hmm0 (the initial HMM parameters), ABC_hmm1 (the HMM parameters after the first iteration) and ABC_hmm2 (the HMM parameters after the second and final iteration). The commands that the Perl program uses to invoke the HTK tools are echoed in the command-line window.
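The relation between the number of iterations and the created directories can be sketched as follows. This is a hedged illustration of the naming scheme described above, not the Perl program itself; the function `training_plan` is hypothetical, and the actual HTK command-line options are deliberately left out.

```python
def training_plan(base, n_iter):
    """List (tool, input directory, output directory) for each training stage."""
    # Stage 0: HCompV writes the initial parameters to <base>_hmm0.
    plan = [("HCompV", None, f"{base}_hmm0")]
    # Stages 1..n_iter: each HERest call reads the previous directory
    # and writes the re-estimated parameters to the next one.
    for i in range(1, n_iter + 1):
        plan.append(("HERest", f"{base}_hmm{i - 1}", f"{base}_hmm{i}"))
    return plan


for tool, src, dst in training_plan("ABC", 2):
    print(tool, src, "->", dst)
```

With n iterations there are therefore n + 1 directories, and the last one holds the models to be used for recognition.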

Step 2 - Recognizing Test Data and Evaluation of the Recognition Result

The HTK tool HVite is a general-purpose Viterbi word recognizer. It matches speech signals against a network of HMMs and returns a transcription for each speech signal. HResults is the HTK performance analysis tool. It reads in a set of label files (typically the output of a recognition tool such as HVite) and compares them with the corresponding reference transcriptions.
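The core of Viterbi decoding can be illustrated on a toy HMM with discrete observations. This sketch is far simpler than HVite (which decodes continuous feature vectors against a word network); the states, symbols and probabilities below are invented for illustration.

```python
import math


def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return the most likely state sequence and its log probability."""
    # delta[s]: best log score of any state path that ends in s at the current time.
    delta = {s: log_start[s] + log_emit[s][obs[0]] for s in states}
    backptrs = []
    for o in obs[1:]:
        new_delta, ptr = {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: delta[p] + log_trans[p][s])
            new_delta[s] = delta[best_prev] + log_trans[best_prev][s] + log_emit[s][o]
            ptr[s] = best_prev
        delta = new_delta
        backptrs.append(ptr)
    # Trace back from the best final state.
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, delta[last]


def logs(table):
    """Convert a nested dict of probabilities to log probabilities."""
    return {k: {j: math.log(p) for j, p in row.items()} for k, row in table.items()}


# Toy model (all numbers are made up).
states = ["A", "B"]
log_start = {"A": math.log(0.9), "B": math.log(0.1)}
log_trans = logs({"A": {"A": 0.5, "B": 0.5}, "B": {"A": 0.1, "B": 0.9}})
log_emit = logs({"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.1, "y": 0.9}})

path, score = viterbi(["x", "x", "y", "y"], states, log_start, log_trans, log_emit)
print(path, math.exp(score))
```

Working with log probabilities avoids numerical underflow on long observation sequences.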

The Perl script first calls HVite to perform speech recognition and obtain a transcription of the test speech signals (this may take a while!), and then HResults to compute recognition statistics, such as the percentage of correctly recognized words. Its first and only argument is the name of the directory where the HMM specifications are saved. For example, to use the trained models from the previous step for recognition, call the script with the argument ABC_hmm2. Again, the commands for calling the HTK tools are echoed in the command-line window. Finally, the recognition statistics are displayed.

In the recognition statistics, the first line gives the sentence-level accuracy, based on the number of transcriptions generated by the recognizer that are identical to the corresponding reference transcriptions. The second line contains numbers concerning the word accuracy of the transcriptions generated by the recognizer. Here, $H$ is the number of correct words, $D$ is the number of deletions (words that are present in the reference transcription but are `deleted' by the recognizer and do not occur in the recognizer's transcription), $S$ is the number of substitutions (words in the reference transcription that are `substituted' by other words in the recognizer's transcription), $I$ is the number of insertions (words that are present in the recognizer's transcription but not in the reference), and $N$ is the total number of words in the reference transcription.
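How such counts arise from comparing two word sequences can be sketched with a minimum-edit-distance alignment. The sketch assumes unit costs for deletions, substitutions and insertions; HResults may weight the error types differently, so its alignment can differ in ambiguous cases. The function name `align_counts` is hypothetical.

```python
def align_counts(ref, hyp):
    """Return (H, D, S, I) from a minimum-edit-distance alignment of hyp against ref."""
    n, m = len(ref), len(hyp)
    # cost[i][j]: edit distance between ref[:i] and hyp[:j].
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Trace back through the table and count each type of alignment step.
    H = D = S = I = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            if ref[i - 1] == hyp[j - 1]:
                H += 1  # correct word
            else:
                S += 1  # substitution
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            D += 1      # reference word missing in the hypothesis
            i -= 1
        else:
            I += 1      # extra word in the hypothesis
            j -= 1
    return H, D, S, I


print(align_counts(["ja", "nein", "ja"], ["ja", "ja"]))
```

Note that $H + D + S$ always equals the number of reference words $N$, whereas insertions are counted on top of that.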

The percentage of correctly recognized words is given by

$$\text{Correct} = \frac{H}{N} \times 100\%,$$

and the word recognition accuracy is computed by

$$\text{Accuracy} = \frac{H-I}{N} \times 100\%.$$
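The two measures can be evaluated directly from the counts. The following small sketch uses invented counts (they are not results of this experiment) to show how insertions lower the accuracy but not the percentage of correct words.

```python
def percent_correct(H, N):
    """Percentage of reference words that were recognized correctly."""
    return 100.0 * H / N


def accuracy(H, I, N):
    """Word accuracy: like percent_correct, but insertions are penalized."""
    return 100.0 * (H - I) / N


# Made-up counts for illustration: N = H + D + S reference words.
H, D, S, I = 95, 2, 3, 4
N = H + D + S
print(percent_correct(H, N), accuracy(H, I, N))
```

Since $I$ is not bounded by $N$, the accuracy can even become negative for a recognizer that inserts many spurious words.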


  • Train HMMs using the Perl training program from Step 1. Try it with the small (train1.scp) and the large (train2.scp) data set. Try out different numbers of iterations.
  • Test the trained models using the Perl recognition script from Step 2.
  • Write a report about all experiments you performed, state the chosen settings, the results, and your interpretations.
  • Do you get an improvement using the larger training set? How many iterations of the EM algorithm are necessary, in the case of the small/large training set?
  • If you wanted to build a yes/no recognizer, would you use monophone models, as we did, or do you think whole-word models are more suitable? Give reasons!