Computational Intelligence, SS08
2 VO 442.070 + 1 RU 708.070

Homework 54: Mixtures of Gaussians and HMMs

[Points: 12.5; Issued: 2007/06/15; Deadline: 2007/07/03; Tutor: Qiang Chen; Info hour: 2007/06/29, 16:15-17:15, HS i11; Review of graded homework (Einsichtnahme): 2007/07/09, 16:15-17:15, HS i11]

In this homework, HMMs with Gaussian mixture emission pdfs are trained and used to recognize utterances of the English digits "one" to "five". Go through the tutorial "Mixtures of Gaussians" and download the accompanying MATLAB programs and data.

Append the MATLAB script you wrote to produce all results to the end of the report.

Please state the name and student ID number (Matrikelnummer) of each team member in the report.

  1. Load the signals into MATLAB using load digits, and play them with the MATLAB functions sound or wavplay, e.g., sound(three15).

    Process the speech signals to obtain parameters (features) suitable for speech recognition. For the signals loaded from digits.mat this is done with the function preproc() (called without arguments). For each digit N, this function produces a cell array data_N{} holding the parameter vectors (12-dimensional mel-frequency cepstral coefficients) of each training signal (e.g., data_1{2} holds the sequence of parameter vectors for the second example of digit "one"), as well as a cell array testdata_N{} holding the parameter vectors of each test signal. preproc() uses functions that are part of VOICEBOX, a MATLAB toolbox for speech processing.

    The sequences of parameter vectors differ in length (because the speech signals differ in length), which is why all sequences for training or testing cannot be stored in a single array.

  2. Train one HMM for each digit. Training of HMM parameters (emission pdfs using the EM algorithm, as well as prior and transition probabilities) is done using the function train:

    » [HMM] = train(data,K,Ns,nr_iter)

    where K denotes the number of Gaussian mixture components in each emission pdf, Ns the number of HMM states, and nr_iter the number of training iterations. The function trains left-to-right HMMs whose Gaussian mixture components have diagonal covariance matrices.
  3. Determine the recognition rate on the test signals. To test the model use the function recognize as described in the tutorial, e.g., for digit 1:

    [r1,L1,p1] = recognize(testdata_1,HMM_1,HMM_2,HMM_3,HMM_4,HMM_5)
  4. In your report, note down your chosen settings, intermediate results and considerations, and the recognition results (recognition rate for each digit and for the whole test set). Which digits seem easier to recognize? Which digits are easily confused during recognition? Use different values for the number of states (Ns=2...5) and the number of Gaussian mixture components (K=1...3). How do these numbers affect the computational effort for training and the recognition rate? Which values of Ns and K do you think are optimal? Please present your recognition results (for each digit and for the whole test set) as well as the training time of the models as tables or figures, as a function of Ns and K. To measure the training time, use the MATLAB commands tic and toc.
  5. Why does it seem reasonable to use left-to-right HMMs for this task, and for speech in general? What are the advantages and disadvantages? Please modify the function train so that it trains ergodic models, and run experiments similar to those in task 4 for Ns=2...5 and K=1...3. Again, present tables of the recognition rate and the computational effort for training.
  6. Why do we use diagonal covariance matrices for the Gaussian mixture components? What assumption do we make by doing so? Please modify the function train so that it trains left-to-right models with full covariance matrices. Again, run experiments similar to those in task 4 for Ns=2...5 and K=1...3, and present tables of the recognition rate and the computational effort for training. Also give a table of the number of parameters that have to be trained in this case, compared to the left-to-right HMM with diagonal covariance matrices (as in task 4). Is there a connection between the number of parameters to be trained, the number of available training samples, and the recognition performance? Please include a short discussion of this in your report.
  7. What kind of model do we obtain if the HMM has just Ns=1 state? Please give the equation for $P\left(X\vert\mathbf{\Theta}\right)$ for this model in terms of $\mathbf{\Theta}=\left\{\mathbf{\pi},\mathbf{A},\mathbf{B}\right\}$.
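
The bookkeeping for the tables requested in tasks 4 to 6 could be organized roughly as in the following MATLAB sketch. It assumes that the recognized digit of every test signal has been collected into a vector of labels 1...5; whether recognize returns its results in exactly this form should be checked in the tutorial. The names true_lab, rec_lab, conf_mat, train_time and train_fn are illustrative and not part of the tutorial code, the label vectors are made-up example data, and train_fn is a placeholder to be replaced by a real call such as train(data_1,K,Ns,nr_iter).

```matlab
% Sketch only: made-up labels stand in for real recognition results.
n_digits = 5;
true_lab = [1 1 1 2 2 2 3 3 3 4 4 4 5 5 5];  % true digit of each test signal
rec_lab  = [1 1 4 2 2 2 3 3 3 4 1 4 5 5 5];  % recognized digit (assumed form)

% Confusion matrix: row = true digit, column = recognized digit.
conf_mat = accumarray([true_lab(:) rec_lab(:)], 1, [n_digits n_digits]);

% Recognition rate for each digit and for the whole test set.
rate_per_digit = diag(conf_mat) ./ sum(conf_mat, 2);
rate_total     = sum(diag(conf_mat)) / sum(conf_mat(:));

% Training time for every combination of Ns and K, measured with tic/toc.
Ns_values  = 2:5;
K_values   = 1:3;
train_time = zeros(numel(Ns_values), numel(K_values));

% Hypothetical placeholder for the training call; replace it with e.g.
%   train_fn = @(K, Ns) train(data_1, K, Ns, nr_iter);
train_fn = @(K, Ns) zeros(Ns, K);

for i = 1:numel(Ns_values)
    for j = 1:numel(K_values)
        t0 = tic;
        model = train_fn(K_values(j), Ns_values(i));
        train_time(i, j) = toc(t0);   % seconds; row = Ns, column = K
    end
end

disp(conf_mat);
fprintf('Recognition rate, digit %d: %.3f\n', [1:n_digits; rate_per_digit']);
fprintf('Recognition rate, all digits: %.3f\n', rate_total);
disp(train_time);
```

Row d of conf_mat shows which digits the test signals of digit d were recognized as, so off-diagonal entries point to confusions. With the real training call in place, train_time can be reported directly as the table of training times over Ns and K.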