Computational Intelligence, SS08
2 VO 442.070 + 1 RU 708.070
HMMs with Gaussian mixture emission pdfs

Mixtures of Gaussians can be used to model the emission pdfs in Hidden Markov Models (HMMs). In this way, speech signal features with complex probability density functions can be modeled.

To train these models, the Expectation Maximization (EM) algorithm [2] is used. In this case, not only the parameters of the Gaussian mixture of each state of the HMM (the emission/observation parameters) have to be estimated; the remaining HMM parameters, i.e. the transition matrix and the prior probabilities, also have to be re-estimated in each iteration of the EM algorithm.

As already observed in the experiments of the HMM tutorial, it is crucial to have `good' initial parameters for the EM algorithm to arrive at good final parameters. Finding such initial parameters is not trivial!

Task

We want to train one Hidden Markov Model for each of several data sets and afterwards use these models to classify data.

Load the data set symbols into MATLAB. The arrays Symbol_A.data, Symbol_B.data, and Symbol_C.data contain sample sequences belonging to three different symbols. Each sequence has a length of 12 samples, and 100 sequences of each symbol are available. (The data are stored in the format data(dimension(=1:2),sample(=1:12),example(=1:100)).)
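The indexing convention above can be illustrated with a short NumPy sketch (Python here, since the course itself uses MATLAB; the random array is a stand-in for the actual symbol data):

```python
import numpy as np

# Stand-in for Symbol_A.data with the layout data(dimension, sample, example):
# 2 feature dimensions, 12 samples per sequence, 100 example sequences.
rng = np.random.default_rng(0)
data = rng.normal(size=(2, 12, 100))

# Use the first 60 example sequences for training, the remaining 40 for testing,
# mirroring the MATLAB slices data(:,:,1:60) and data(:,:,61:100).
train = data[:, :, :60]
test = data[:, :, 60:]

print(train.shape)  # (2, 12, 60)
print(test.shape)   # (2, 12, 40)
```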

We will use the EM-algorithm to train one model for each symbol. We can do that with the HMM-EM-Explorer, e.g., using

» BW_hmm(Symbol_A.data(:,:,1:60),K,Ns)

for Symbol_A, where Ns is the number of states assumed for the HMM, K is the number of Gaussian mixture components, and the first 60 sequences are used as the training set.

It is assumed that the data can be modeled with a left-to-right HMM: the prior probability equals 1 for the first state and zero for all other states, and transitions from a given state are only possible to the same state (self-transitions) and to the next state on the `right-hand side'.
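The left-to-right structure described above can be sketched as follows (a Python/NumPy illustration, not the course's MATLAB code; the uniform 0.5/0.5 split of each row is just one plausible initialization):

```python
import numpy as np

def left_to_right_init(Ns):
    """Build the prior vector and transition matrix of a left-to-right HMM:
    the chain starts in state 1 with probability 1, and each state allows
    only a self-transition and a transition to its right-hand neighbour."""
    prior = np.zeros(Ns)
    prior[0] = 1.0                 # prior probability 1 in the first state
    A = np.zeros((Ns, Ns))
    for i in range(Ns - 1):
        A[i, i] = 0.5              # self-transition
        A[i, i + 1] = 0.5          # transition to the next state on the right
    A[-1, -1] = 1.0                # the last state can only loop on itself
    return prior, A

prior, A = left_to_right_init(4)
print(prior)   # [1. 0. 0. 0.]
print(A)       # each row sums to 1; only diagonal and first upper diagonal are non-zero
```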

We will test the HMMs on a distinct set of test data (one test set for each symbol), so be careful not to use all available data for training. (Remark: the more data you use for training, the longer the training with the HMM-EM Explorer takes.)

Try different settings for the number of Gaussian mixture components (K) and for the number of states (Ns). When you close the window of the HMM-EM Explorer (with the button close), the currently trained parameters (the specification of your trained HMM) are saved in your workspace in the array mg_hmm. Copy this array to another variable so you can use it later for testing the models on the test data.

Recognition of the test data is done using the function recognize. Its inputs are the (test) data and a list of HMM parameter arrays, e.g.,

» [r L p] = recognize(Symbol_A.data(:,:,61:100),HMM1,HMM2,HMM3)

recognizes the `test data' for Symbol_A (the last 40 examples in Symbol_A.data) by matching it against the models HMM1, HMM2, and HMM3. The output variable r is a vector of recognized symbols (1 stands for the first model specified when invoking the function (here HMM1), 2 for the second model (HMM2), and so on).

The output variable L is the likelihood matrix whose entries are the likelihoods of the test data under each of the models (HMM1, HMM2, HMM3), and p gives the `best path' according to Viterbi decoding (cf. tutorial Hidden Markov Models) for each sequence of the test data and each model.
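Conceptually, the classification step reduces to picking, for each test sequence, the model with the highest likelihood. A minimal Python sketch (the function name and the likelihood values are made up for illustration; this is not the actual recognize implementation):

```python
import numpy as np

def classify(L):
    """Given a (log-)likelihood matrix L with one row per test sequence and
    one column per model, return the index of the best model per sequence
    (1-based, matching the MATLAB convention of the recognize output r)."""
    return np.argmax(L, axis=1) + 1

# Hypothetical log-likelihoods for 2 test sequences under 3 models:
L = np.array([[-120.3, -150.1, -145.7],   # sequence 1: best under model 1
              [-130.2, -110.9, -140.0]])  # sequence 2: best under model 2
r = classify(L)
print(r)  # [1 2]
```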

Do

  1. Dividing the data:

    Divide the data sets into distinct parts for training and testing.
  2. Training:

    Train one HMM for each training data set of Symbol_A, Symbol_B, and Symbol_C. Use different numbers of states (Ns) and Gaussian mixture components (K). What can you observe during training? Which values for Ns and K would you choose, according to your observations during training?
  3. Evaluation:
    • Evaluate the trained HMMs. Determine the recognition rate, which is defined as:
      Recognition rate = True recognitions / All recognized words    (4)


      (You have to use distinct data sets for training and testing the model, i.e., the test set must not contain any data used for training. Note down the sizes of your training and test data sets.)
    • Use the function comp_hist_pdf to depict histograms and the parameters determined for the models.
  4. Try out different ratios between the size of the training set and the size of the test set. Additionally, vary the parameters Ns and K, and compare the recognition performance achieved. Note down the results. Which values for Ns and K seem most suitable?
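The evaluation step above can be sketched in a few lines of Python (the recognizer output below is made up for illustration; in the exercise, r comes from the recognize function). Per Eq. (4), the recognition rate is the fraction of true recognitions among all recognized words:

```python
import numpy as np

def recognition_rate(r, true_labels):
    """Recognition rate = true recognitions / all recognized words."""
    r = np.asarray(r)
    true_labels = np.asarray(true_labels)
    return np.mean(r == true_labels)

# All 40 test sequences come from Symbol_A, so the true label is 1 everywhere.
r = np.array([1] * 36 + [2, 2, 3, 1])        # hypothetical recognizer output
print(recognition_rate(r, np.ones(40)))      # 0.925  (37 of 40 correct)
```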

Digit recognition task

We will now train HMMs for utterances of English digits from `one' to `five'. We will then use the HMMs to recognize utterances of digits.

Load the signals into MATLAB using load digits, and play them with the MATLAB functions sound or wavplay, e.g., sound(three15).

Process the speech signals to get parameters (features) suitable for speech recognition purposes. For the signals loaded from digits.mat this is done using the function preproc() (without arguments). This function produces a cell array data_N{} for each digit N holding the parameter vectors (mel-frequency cepstral coefficients, 12-dimensional) for each training signal (e.g., data_1{2} holds the sequence of parameter vectors for the second example of digit `one'), as well as a cell array testdata_N{} for each digit N holding the parameter vectors for each test signal. preproc() uses functions that are part of VOICEBOX, a MATLAB toolbox for speech processing.

The sequences of parameter vectors have different lengths (as the speech signals themselves differ in length!), which is why we cannot store all training or test sequences in a single array.
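The cell-array idea translates naturally to a list of arrays in Python (the utterance lengths below are made up for illustration; only the 12-dimensional feature vectors match the exercise):

```python
import numpy as np

# Each utterance yields a different number of 12-dimensional MFCC vectors,
# so we keep a list of 2-D arrays (one per utterance), the Python analogue
# of the MATLAB cell array data_1{}.
rng = np.random.default_rng(0)
lengths = [34, 51, 42]                            # frames per utterance (hypothetical)
data_1 = [rng.normal(size=(n, 12)) for n in lengths]

print(len(data_1))       # 3 utterances
print(data_1[1].shape)   # (51, 12): second example of digit `one'
```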

To train a hidden Markov model we use the function train, as follows:

» [HMM] = train(data_1,K,Ns,nr_iter)

for digit 1, where Ns denotes the number of states, K the number of Gaussian mixture components, and nr_iter the number of iterations. The function trains left-to-right HMMs with diagonal covariance matrices.

To test the model you can use the function recognize as described in the previous task, e.g., for digit 1:

» [r1 L1 p1] = recognize(testdata_1,HMM_1,HMM_2,HMM_3,HMM_4,HMM_5)

Questions

  • Why does it seem reasonable to use left-to-right HMMs in this task, and for speech modeling in general? What are the advantages/disadvantages? (You can modify the function train to train ergodic HMMs.)
  • Why do we use diagonal covariance matrices for the Gaussian mixtures? What assumption do we make by doing so? (You can also modify the function train to train models with full covariance matrices.)
  • Write a report about your chosen settings, interesting intermediate results and considerations, and the recognition results (recognition rate for each digit and for the whole set). Which digits seem easier to recognize? Which digits are easily confused in the recognition?
  • (optional) Record some test examples of digits yourself, and try to recognize them! How well does it work?
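A back-of-the-envelope count relevant to the covariance question above (a simple sketch, not part of the course code): for d-dimensional features and K mixture components per state, a diagonal covariance needs d values per component, while a full symmetric covariance needs d(d+1)/2.

```python
def gmm_cov_params(d, K, full):
    """Number of covariance parameters for K Gaussian components in d dimensions."""
    per_component = d * (d + 1) // 2 if full else d
    return K * per_component

d, K = 12, 3  # 12-dimensional MFCCs, e.g. 3 mixture components
print(gmm_cov_params(d, K, full=False))  # 36 parameters (diagonal)
print(gmm_cov_params(d, K, full=True))   # 234 parameters (full)
```

The large gap explains why, with limited training data per digit, diagonal covariances are usually far easier to estimate reliably.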