Computational Intelligence, SS08
2 VO 442.070 + 1 RU 708.070

Homework 11: Mixtures of Gaussian

[Points: 10; Issued: 2003/06/06; Deadline: 2003/06/20; Tutor: Emir Serdarevic; Info hour: 2003/06/18, 14:00-15:00, Seminarraum IGI; Review of graded work (Einsichtnahme): 2003/07/02, 14:00-15:00, Seminarraum IGI; Download: pdf; ps.gz]

  • HMMs for utterances of the English digits one to five are to be trained and then used to recognize spoken digits. You can load the data in MATLAB with load digits, and play an utterance with the MATLAB functions sound or wavplay, e.g. sound(three15).

    The data from these wave files must be preprocessed to obtain features suitable for speech recognition. This is done with the function preproc, which takes no arguments. For each digit it produces a cell array (e.g. data_1 for digit 1) containing one sequence per utterance of that digit. preproc uses functions from VOICEBOX, a MATLAB toolbox for speech processing. The output for each utterance is a sequence of 12-dimensional vectors.

    These sequences differ in length (because the spoken utterances themselves differ in length), so the sequences for training or testing cannot be stored in one ordinary array in MATLAB. Following the convention of the BNT toolkit, the data is stored in a cell array: data_s{a} is the matrix of observation vectors for sequence a of digit s.
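The cell-array storage described above can be sketched as follows. This is only an illustration with placeholder data: the name data_demo, the frame counts, and the 12-by-T orientation (one column per time frame) are assumptions, not the actual output of preproc.

```matlab
% Placeholder stand-in for one digit's data: three "utterances" of
% different length. Assumed layout: 12 features x T frames per sequence.
lengths = [40 55 47];                      % hypothetical frame counts
data_demo = cell(1, numel(lengths));
for a = 1:numel(lengths)
    data_demo{a} = zeros(12, lengths(a));  % zeros instead of real features
end

% A cell array can hold matrices of different sizes, which an ordinary
% 3-D array cannot. Query the number of frames of each sequence:
T = cellfun(@(x) size(x, 2), data_demo);
disp(T);
```

With the real data, the same cellfun call on data_1 would list the length of every utterance of digit 1, provided the sequences are stored with time along the second dimension.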

  • Train one HMM for each digit. To train a model, use the function train, e.g. [HMM]=train(data_1,M,Q,nr_iter) for digit 1,

    where Q denotes the number of states, M the number of Gaussian mixture components and nr_iter the number of iterations. The function trains left-to-right HMMs with diagonal covariance matrices.
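To make the term left-to-right concrete, here is a minimal sketch of a transition matrix with that structure. It assumes the simplest variant, in which a state may only stay or advance to the next state; the value of p_stay is hypothetical, and the provided train function may initialise its models differently.

```matlab
Q = 4;                       % number of states (example value)
p_stay = 0.6;                % hypothetical self-transition probability

% Left-to-right structure: state i may only stay or move on to state i+1.
A = zeros(Q, Q);
for i = 1:Q-1
    A(i, i)     = p_stay;
    A(i, i + 1) = 1 - p_stay;
end
A(Q, Q) = 1;                 % the last state can only be kept

prior = [1, zeros(1, Q - 1)];   % the model always starts in state 1

disp(A);
```

Transitions that start at zero stay at zero under EM re-estimation, so the left-to-right structure survives training. An ergodic model would instead start from a matrix in which every entry is positive.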

  • Determine the recognition rate: To test the model you can use the function recognize as described in the tutorial,

    e.g. [r1 L1 p1]=recognize(testdata_1,HMM_1,HMM_2,HMM_3,HMM_4,HMM_5) for digit 1.
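A common decision rule for this kind of recognizer is to score each test utterance under every digit model and pick the model with the highest log-likelihood. The sketch below illustrates that rule and the resulting recognition rate. The log-likelihood values are made up, and the variable names are hypothetical; they are not the documented outputs of recognize.

```matlab
% Made-up log-likelihoods: one row per test utterance of digit 1,
% one column per digit model (HMM_1 ... HMM_5).
loglik = [ -310 -342 -355 -360 -349;
           -298 -301 -340 -352 -333;
           -325 -319 -347 -350 -341;
           -305 -330 -338 -344 -329 ];

true_digit = 1;

% For each utterance, choose the model with the highest log-likelihood.
[~, predicted] = max(loglik, [], 2);

% Recognition rate: fraction of utterances assigned to the correct digit.
rate = mean(predicted == true_digit);
fprintf('Recognition rate for digit %d: %.2f\n', true_digit, rate);
```

In this toy example the third utterance scores higher under the model for digit 2, so it counts as an error.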

  • Why does it seem reasonable to use left-to-right models in this task, and for speech in general? What are the advantages and disadvantages? (You can easily change the function train to train ergodic models.)
  • Why do we use a diagonal covariance matrix? What assumption do we make by doing so? (You can easily change the function train to train models with a full covariance matrix.)
  • Write a report about your chosen settings, interesting intermediate results and considerations, and the recognition results (recognition rate for each digit and for the whole set). Which digits seem easier to recognize? Which digits are easily confused in recognition? Use different numbers of states (Q=2,3,4,5) and Gaussian mixture components (M=1,2,3,4,5). What can you observe during training and recognition? Which values for Q and M would you choose?
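For the report, a confusion matrix is one convenient way to see which digits are mistaken for which. The following sketch builds one from true and recognized labels and derives the per-digit and overall recognition rates. The label vectors are invented for illustration; in practice they would be collected from the recognition results for each setting of Q and M.

```matlab
% Invented true and recognized digit labels for ten test utterances.
true_lbl = [1 1 2 2 3 3 4 4 5 5];
pred_lbl = [1 1 2 3 3 3 4 5 5 5];

K = 5;                       % number of digit classes
C = zeros(K, K);             % C(i,j): digit i was recognized as digit j
for n = 1:numel(true_lbl)
    C(true_lbl(n), pred_lbl(n)) = C(true_lbl(n), pred_lbl(n)) + 1;
end

per_digit = diag(C)' ./ sum(C, 2)';     % recognition rate for each digit
overall   = sum(diag(C)) / sum(C(:));   % recognition rate on the whole set

disp(C);
disp(per_digit);
disp(overall);
```

Off-diagonal entries show the confusions directly: here digit 2 is once recognized as 3, and digit 4 once as 5.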