Computational Intelligence, SS08
2 VO 442.070 + 1 RU 708.070

Homework 22: Mixtures of Gaussians

[Points: 8; Issued: 2004/06/10; Deadline: 2004/06/28; Tutor: Thomas Zilaji; Info hour: 2004/06/21, 12:00-13:00, IGI seminar room; Review of graded homework: 2004/07/05, 12:00-13:00, IGI seminar room]

HMMs with Gaussian mixture emission pdfs should be trained and used for recognition of utterances of the English digits 'one' to 'five'.

  • Load the signals into MATLAB using load digits, and play them with the MATLAB functions sound or wavplay, e.g., sound(three15).

    Process the speech signals to get parameters (features) suitable for speech recognition. For the signals loaded from digits.mat this is done with the function preproc() (called without arguments). For each digit N, this function produces a cell array data_N{} holding the sequences of parameter vectors (12-dimensional mel-frequency cepstral coefficients) of the training signals (e.g., data_1{2} holds the sequence of parameter vectors for the second example of digit 'one'), and a cell array testdata_N{} holding the corresponding sequences of the test signals. preproc() uses functions that are part of VOICEBOX, a MATLAB toolbox for speech processing.

    The sequences of parameter vectors differ in length (because the speech signals themselves differ in length), which is why the sequences for training or testing cannot all be stored in one ordinary array and cell arrays are used instead.

  • Train one HMM for each digit. Training of HMM parameters (emission pdfs using the EM algorithm, as well as prior and transition probabilities) is done using the function train:

    » [HMM] = train(data,K,Ns,nr_iter)

    where K denotes the number of Gaussian components in each emission mixture pdf, Ns the number of HMM states, and nr_iter the number of training iterations. The function trains left-to-right HMMs in which the covariance matrices of the Gaussian components are diagonal.
  • Determine the recognition rate on the test signals. To test the models, use the function recognize as described in the tutorial, e.g., for digit 1:

    » [r1,L1,p1] = recognize(testdata_1,HMM_1,HMM_2,HMM_3,HMM_4,HMM_5)
  • In your report, note down your chosen settings, intermediate results and considerations, and the recognition results (recognition rate for each digit and for the whole test set). Which digits seem easier to recognize? Which digits are easily confused with each other during recognition? Try different values for the number of states (Ns) and the number of Gaussian mixture components (K). How do these numbers affect training and recognition? Which values of Ns and K do you think are optimal?
  • (optional) Record some test examples of digits yourself, and try to recognize them! (Consult preproc.m to see how to produce the cell array of feature vectors from a speech signal.)
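
The per-digit and overall recognition rates asked for above can be tallied in a confusion matrix. The following is only a minimal sketch. It assumes (this is not stated on this page; check the tutorial) that the first output of recognize is a vector holding, for each test sequence, the index of the best-scoring model. The cell array r and its values are made-up placeholders for illustration, not measured results.

```matlab
% Sketch: confusion matrix and recognition rates for the 5 digit models.
% ASSUMPTION: r{n} holds, for every test sequence of digit n, the index
% (1..5) of the model picked by recognize(), e.g. r = {r1,r2,r3,r4,r5}.
% The values below are invented placeholders, NOT real results.
r = { [1 1 1 4], [2 2 3 2], [3 3 3 3], [4 1 4 4], [5 5 5 5] };

nDigits = numel(r);
C = zeros(nDigits);                 % C(n,m): digit n recognized as digit m
for n = 1:nDigits
    labels = r{n}(:)';              % force a row vector so the loop visits scalars
    for m = labels
        C(n,m) = C(n,m) + 1;
    end
end

ratePerDigit = diag(C) ./ sum(C,2);       % fraction correct for each digit
rateOverall  = sum(diag(C)) / sum(C(:));  % fraction correct over all test signals

disp(C);
disp(ratePerDigit');
disp(rateOverall);
```

The diagonal of C counts correct recognitions; a large off-diagonal entry C(n,m) indicates that digit n is often mistaken for digit m.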


  • Why does it seem reasonable to use left-to-right HMMs for this task, and for speech in general? What are the advantages/disadvantages? (You can modify the function train to train ergodic models.)
  • Why do we use diagonal covariance matrices for the Gaussian mixture components? What assumption do we make by doing so? (You can also modify the function train to train models with full covariance matrices.)
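
To make the two questions above more concrete, the following sketch contrasts the transition structure of a left-to-right and an ergodic model, and counts the free covariance parameters per Gaussian component for the 12-dimensional feature vectors. It is an illustration only and does not reproduce the internals of train. The value Ns = 4 and the particular left-to-right variant (stay in a state or move to the next one, no skips) are assumptions.

```matlab
% Sketch only; this does not show what train() does internally.
Ns = 4;      % number of states (arbitrary illustration value)
D  = 12;     % feature dimension (12 MFCCs, as in the assignment)

% Allowed transitions (1 = allowed, 0 = probability fixed to zero).
% ASSUMPTION: left-to-right means "stay, or move to the next state".
maskLR  = eye(Ns) + diag(ones(1,Ns-1), 1);
maskErg = ones(Ns);                 % ergodic: every transition allowed

% Uniform initialization over the allowed transitions; each row sums to 1.
A_lr  = maskLR  ./ repmat(sum(maskLR, 2), 1, Ns);
A_erg = maskErg ./ repmat(sum(maskErg,2), 1, Ns);

% Free covariance parameters per Gaussian component.
nDiag = D;             % variances only
nFull = D*(D+1)/2;     % symmetric matrix: variances plus covariances

fprintf('allowed transitions: left-to-right %d, ergodic %d\n', ...
        nnz(maskLR), nnz(maskErg));
fprintf('covariance parameters per component: diagonal %d, full %d\n', ...
        nDiag, nFull);
```

Transition probabilities that start at zero stay at zero under Baum-Welch re-estimation, so a mask of this kind is enough to fix the model topology. The parameter counts matter because every state has K components, each with its own covariance matrix to be estimated from a limited number of training examples.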