Computational Intelligence, SS08
2 VO 442.070 + 1 RU 708.070

Homework 23: Automatic Speech Recognition with HTK

[Points: 8; Issued: 2004/06/17; Deadline: 2004/06/28; Tutors: Bernhard Tittelbach, Thomas Zilaji; Info hour: 2004/06/21, 12:00-13:00, Seminarraum IGI; Review (Einsichtnahme): 2004/07/05, 12:00-13:00, Seminarraum IGI; Download: pdf; ps.gz]

In this homework you will build a simple automatic speech recognition (ASR) system for the German words ``ja'' and ``nein''. The system is realized with the hidden Markov model toolkit HTK. Recordings of the words ``ja'' and ``nein'' from many different speakers are available. For training, these recordings have been pre-processed for you (to reduce the zip file size and the computation time) into speech feature vectors in the form of mel frequency cepstral coefficients (MFCC, files *.mfc), which are commonly used in ASR systems. Only for testing, a limited number of ``ja''/``nein'' examples are included as waveform signals (files Q/*.wav).
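As background on the MFCC features: their filterbank is spaced on the mel scale, which maps linear frequency to perceptual pitch. A minimal sketch of the standard mapping (the function names are ours, not part of HTK):

```python
import math

def hz_to_mel(f_hz):
    # Standard mel-scale mapping used when placing MFCC filterbank channels
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    # Inverse mapping, e.g. to convert filterbank centres back to Hz
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# 1000 Hz corresponds to roughly 1000 mel by construction of the scale
print(round(hz_to_mel(1000.0)))
```

The actual feature extraction (framing, windowing, filterbank, DCT) was already done for you in the supplied *.mfc files.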

Our ASR system models the two words as sequences of monophone models (one HMM with three states per phone). The emission probabilities of the HMM states are Gaussian pdfs.
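To make the emission model concrete: each state scores a feature vector with a Gaussian pdf. A small sketch of this computation (diagonal covariance assumed here for illustration; the function name is ours):

```python
import math

def log_gauss_diag(x, mean, var):
    # log N(x; mean, diag(var)): log-likelihood of feature vector x under
    # the diagonal-covariance Gaussian emission pdf of one HMM state
    ll = 0.0
    for xi, mi, vi in zip(x, mean, var):
        ll -= 0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return ll
```

During decoding these per-state log-likelihoods are combined with log transition probabilities, which is why all scores are conveniently kept in the log domain.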

To carry out this homework you need some programs from HTK and some script files (perl scripts for Linux, batch files for Windows). Download the appropriate zip files from the homework assignment page. Note that the files have not been extensively tested yet! If you encounter problems running the scripts, contact us. If you want to use HTK beyond this homework, it is (freely) available at

Note that HTK commands and script files are called from the command line (DOS window).

As always: write down the results and your observations for all experiments you performed, including the chosen settings and your interpretations.

  • The commands for training the HMMs are implemented in the training script (perl script resp. train.bat). You have to specify three command line arguments: the first argument is the basename for the directories where the HMMs of each training iteration will be stored, the second argument is the name of a file that lists the feature files used for training, and the third argument is the number of training iterations.

    Train your system using a small data set (listed in train1.scp) and using a large data set (listed in train2.scp). Try out different numbers of iterations.

  • Test the trained models using test.*. This script calls a Viterbi decoder for the HMM networks and a function that evaluates the decoding results and prints recognition statistics.
  • If you have a soundcard and a microphone available, you can test your speech recognition system in real time with the script test_realtime.*. What is your estimate of the recognition rate? What kind of problems do you encounter when testing the system online (you probably sit close to a PC with a noisy fan, maybe in a room with other people talking, ...)?
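The decoder invoked by test.* implements the Viterbi algorithm. A self-contained log-domain sketch of the core recursion (variable names are ours; the real decoder additionally handles the ``ja''/``nein'' word network):

```python
import math

NEG_INF = float("-inf")

def viterbi(log_pi, log_a, log_b):
    """Most likely state sequence for an HMM, computed in the log domain.
    log_pi[i]: log initial prob, log_a[i][j]: log transition prob,
    log_b[t][i]: log emission score of state i for frame t."""
    n_states = len(log_pi)
    # Initialization with the first frame
    delta = [log_pi[i] + log_b[0][i] for i in range(n_states)]
    backptr = []
    for t in range(1, len(log_b)):
        new_delta, back = [], []
        for j in range(n_states):
            # Best predecessor state for state j at frame t
            best_i = max(range(n_states), key=lambda i: delta[i] + log_a[i][j])
            back.append(best_i)
            new_delta.append(delta[best_i] + log_a[best_i][j] + log_b[t][j])
        delta = new_delta
        backptr.append(back)
    # Backtrack from the best final state
    path = [max(range(n_states), key=lambda i: delta[i])]
    for back in reversed(backptr):
        path.append(back[path[-1]])
    path.reverse()
    return path, max(delta)
```

For a left-to-right model, forbidden transitions simply get log probability -inf, so the recursion never selects them.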


  • Do you get an improvement with the larger training set compared to the result with the small training set? How many training iterations seem reasonable for the small and for the large training set?
  • If you wanted to build a YES/NO recognizer, would you use monophone models, as we did, or do you think whole-word models are more suitable? Give reasons!
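When comparing the two training sets, the figure reported by the evaluation step is essentially the percentage of correctly recognized test utterances. A minimal sketch for this isolated-word case (function name is ours, not from the supplied scripts):

```python
def recognition_rate(reference, recognized):
    # Percentage of test utterances whose recognized word matches the
    # reference transcription (isolated-word "percent correct" figure)
    assert len(reference) == len(recognized)
    hits = sum(ref == hyp for ref, hyp in zip(reference, recognized))
    return 100.0 * hits / len(reference)

# Hypothetical example: one of four utterances misrecognized -> 75%
print(recognition_rate(["ja", "nein", "ja", "nein"],
                       ["ja", "nein", "nein", "nein"]))
```

Such a per-word count is enough here; for connected-word tasks the evaluation would additionally have to account for insertions and deletions.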