
Mixtures of Gaussians can be used to model the emission pdfs in
Hidden Markov Models (HMMs). In this way, speech signal features
with complex probability density functions can be modeled.
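In the notation of a K-component mixture (the symbols c_{jk}, \mu_{jk}, \Sigma_{jk} are our own labels, not fixed by this tutorial), the emission pdf of state j then reads

\[ b_j(\mathbf{x}) = \sum_{k=1}^{K} c_{jk}\,\mathcal{N}(\mathbf{x};\,\boldsymbol{\mu}_{jk},\boldsymbol{\Sigma}_{jk}), \qquad \sum_{k=1}^{K} c_{jk} = 1. \]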
To train these models the Expectation Maximization (EM)
algorithm [2] is used. In this case,
not only the parameters of the Gaussian mixture of each state of
the HMM (the emission/observation parameters) have to be estimated, but
also the remaining parameters of the HMM, i.e., the transition
matrix and the prior probabilities, have to be re-estimated in each
iteration step of the EM algorithm.
As already found in the experiments in the tutorial
about HMMs, it is crucial to have `good' initial parameters to obtain
good parameters with the EM algorithm. Finding these initial
parameters is not trivial!
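One common way to obtain reasonable initial values for the Gaussian means, shown here only as a hedged sketch (it assumes the kmeans function of the MATLAB Statistics Toolbox and is not prescribed by this tutorial), is to cluster the pooled training vectors:

X = randn(720, 2);         % stand-in for pooled 2-D training vectors (rows = observations)
K = 3;                     % example number of Gaussians per mixture
[idx, mu] = kmeans(X, K);  % mu(k,:) can serve as the initial mean of Gaussian k
sigma = cov(X);            % a shared sample covariance as a crude initial choice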
We want to train Hidden Markov Models for different sets of
data (one model for each set) and afterwards use these models to
classify data.
Load the data set symbols into MATLAB.
Symbol_A.data, Symbol_B.data, and Symbol_C.data contain
sample data sequences belonging to 3 different symbols.
Each sequence has a length of 12 samples, and there
are 100 sequences of each symbol available. (The format in which
the data are stored is:
data(dimension(=1:2),sample(=1:12),example(=1:100)).)
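For example (assuming the set is provided as symbols.mat so that load symbols works, analogous to load digits later on), the first sequence of Symbol_A can be inspected as follows:

load symbols                       % provides Symbol_A, Symbol_B, Symbol_C
seq = Symbol_A.data(:,:,1);        % 2 x 12: one sequence of 12 two-dimensional samples
plot(seq(1,:), seq(2,:), 'o-')     % trace the sequence in the 2-D plane
xlabel('dimension 1'), ylabel('dimension 2')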
We will use the EM algorithm to train one model for each symbol.
We can do that with the HMM-EM Explorer, e.g., using
» BW_hmm(Symbol_A.data(:,:,1:60),K,Ns)
for Symbol_A, where Ns is the assumed number of states of
the HMM, K is the number of Gaussian mixture components, and we
use the first 60 sequences as the training set.
It is assumed that the data may be modeled with a left-to-right
HMM: the prior probability equals 1 for the first state and zero
for the rest of the states, and transitions from a given state in the
model are only possible to the same state (self-transitions) and
to the next state on the `right-hand side'.
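As an illustration of this structure (a sketch of our own; the HMM-EM Explorer sets this up internally), the prior vector and transition matrix of such a left-to-right HMM with Ns states could be initialized as:

Ns = 3;                            % example number of states
prior = [1; zeros(Ns-1,1)];        % start in the first state with probability 1
A = diag(0.5*ones(Ns,1)) ...       % self-transitions
  + diag(0.5*ones(Ns-1,1), 1);     % transitions to the next state to the right
A(Ns,Ns) = 1;                      % the last state can only loop on itself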
We will test the HMMs on a distinct set of test data (one
(test) set for each symbol), so be careful not to use all available
data for training. (Remark: the more data you use for training, the
longer the training with the HMM-EM Explorer takes.)
Try different settings for the number of Gaussians (K)
and for the number of states (Ns). When you close the
window of the HMM-EM Explorer (with the button: close), the currently
trained parameters (the specification of your trained HMM) are saved in
your workspace in the array mg_hmm (copy this array to
another variable to use it later for testing the models by
recognizing the test data).
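A possible training session for all three symbols could then look as follows (the variable names HMM_A, HMM_B, HMM_C are our own choice; K and Ns must be set beforehand):

» BW_hmm(Symbol_A.data(:,:,1:60),K,Ns)    % train, then close the explorer window
» HMM_A = mg_hmm;                         % keep the trained parameters
» BW_hmm(Symbol_B.data(:,:,1:60),K,Ns)
» HMM_B = mg_hmm;
» BW_hmm(Symbol_C.data(:,:,1:60),K,Ns)
» HMM_C = mg_hmm;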
Recognition of the test data is done using the function
recognize. Its inputs are the (test) data and a list of
HMM parameter arrays, e.g.,
» [r L p] = recognize(Symbol_A.data(:,:,61:100),HMM1,HMM2,HMM3)
recognizes the `test data' (the last 40 examples in
Symbol_A.data) for Symbol_A by matching it to the models
HMM1, HMM2, and HMM3. The output variable r is a vector
of recognized symbols (1 stands for the first model you specified
when invoking the function (in this case HMM1), 2 for the second model
(HMM2), and so on).
The output variable L is the likelihood matrix whose
entries are the likelihoods of matching the test data to each of
the models (HMM1, HMM2, HMM3), and p gives the `best path'
according to a Viterbi decoding (cf. tutorial Hidden Markov Models)
for each sequence of the test data and each model.
- Dividing the data: Divide the data sets into distinct parts for
training and testing.
- Training: Train one HMM for each training data set of Symbol_A,
Symbol_B, and Symbol_C. Use different numbers of
states (Ns) and of Gaussian mixture components (K).
What can you observe during training? Which values for Ns
and K would you choose, according to your observations
during training?
- Evaluation:
  - Evaluate the trained HMMs. Determine the recognition rate,
    which is defined as
    \text{Recognition rate} = \frac{\text{number of correctly recognized test sequences}}{\text{total number of test sequences}} \qquad (4)
    (You have to use distinct data for training and testing the
    model, i.e., the test set must not contain any data used for
    training. Note down the sizes of your training and test data
    sets. A small sketch for computing this rate follows after this list.)
  - Use the function comp_hist_pdf to depict histograms of the data
    and the parameters determined for the models.
  - Try out different ratios between the size of the training set and
    the size of the test set. Additionally, vary the parameters Ns
    and K, and compare the evaluation performance achieved.
    Note down the results. Which values for Ns and K
    seem most suitable?
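As referred to above, here is a minimal sketch for computing the recognition rate of Eq. (4) from the output of recognize (assuming the trained models were stored as HMM_A, HMM_B, HMM_C as in the session above):

[r L p] = recognize(Symbol_A.data(:,:,61:100),HMM_A,HMM_B,HMM_C);
rate_A = sum(r == 1) / numel(r)    % fraction of test sequences recognized as model 1 (Symbol_A)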
We will now train HMMs for utterances of English digits from
`one' to `five'. We will then use the HMMs to recognize utterances
of digits.
Load the signals into MATLAB using load
digits, and play them with the MATLAB functions
sound or wavplay, e.g.,
sound(three15).
Process the speech signals to get parameters (features) suitable
for speech recognition purposes. For the signals loaded from
digits.mat this is done using the function
preproc() (without arguments). This function produces a
cell array data_N{} for each digit N holding the
parameter vectors (mel-frequency cepstral coefficients,
12-dimensional) for each training signal (e.g., data_1{2}
holds the sequence of parameter vectors for the second example of
digit `one'), as well as a cell array testdata_N{} for
each digit N holding the parameter vectors for each test signal.
preproc() uses functions that are part of VOICEBOX, a MATLAB
toolbox for speech processing.
The sequences of parameter vectors have different lengths (as
the speech signals themselves differ in length!), which is why we cannot
store all sequences for training or testing in one array.
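You can verify the differing lengths directly (data_1 is created by preproc() as described above; which dimension of each cell entry is 12 and which varies with the utterance length is best checked on your own installation):

preproc                 % builds data_N{} and testdata_N{} in the workspace
size(data_1{1})         % parameter vectors of the first example of digit `one'
size(data_1{2})         % the second example generally has a different number of frames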
To train a hidden Markov model we use the function
train, as follows:
» [HMM] = train(data_1,K,Ns,nr_iter)
for digit 1, where Ns denotes the number of states,
K the number of Gaussian mixture components, and nr_iter the
number of iterations. The function trains left-to-right HMMs with
covariance matrices in diagonal form.
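For instance (the concrete values for K, Ns, and nr_iter are placeholders to be tuned, not recommendations):

» HMM_1 = train(data_1,2,5,20);    % digit `one': K=2 Gaussians, Ns=5 states, 20 iterations
» HMM_2 = train(data_2,2,5,20);    % digit `two'
» HMM_3 = train(data_3,2,5,20);
» HMM_4 = train(data_4,2,5,20);
» HMM_5 = train(data_5,2,5,20);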
To test the model you can use the function recognize as
described in the previous task, e.g., for digit 1:
» [r1 L1 p1] = recognize(testdata_1,HMM_1,HMM_2,HMM_3,HMM_4,HMM_5)
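A sketch for evaluating all five digit models in one go (the variable names tests, correct, and total are our own):

tests = {testdata_1, testdata_2, testdata_3, testdata_4, testdata_5};
correct = 0; total = 0;
for d = 1:5
    r = recognize(tests{d},HMM_1,HMM_2,HMM_3,HMM_4,HMM_5);
    fprintf('digit %d: recognition rate %.2f\n', d, mean(r == d));
    correct = correct + sum(r == d);
    total   = total + numel(r);
end
fprintf('overall recognition rate: %.2f\n', correct/total);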
- Why does it seem reasonable to use left-to-right HMMs in this
task, and for speech modeling in general? What are the
advantages/disadvantages? (You can modify the function
train to train ergodic HMMs.)
- Why do we use diagonal covariance matrices for the Gaussian
mixtures? What assumption do we make if we do so? (You can also
modify the function train to train models with a full
covariance matrix.)
- Write a report about your chosen settings, interesting
intermediate results and considerations, and the recognition
results (recognition rate for each digit and for the whole set). Which
digits seem easier to recognize? Which digits get easily
confused in the recognition?
- (optional) Record some test examples of digits yourself, and
try to recognize them! How well does it work? (A possible approach
is sketched below.)
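One possible approach for this optional task, as a hedged sketch: wavrecord was the recording counterpart of wavplay in older MATLAB versions, melcepst is the VOICEBOX routine for mel-frequency cepstral coefficients, and the sampling rate and the orientation of the parameter vectors are assumptions you should check against data_1:

fs = 8000;                     % assumed sampling rate; match the one used in digits.mat
s  = wavrecord(2*fs, fs);      % record 2 seconds from the microphone
c  = melcepst(s, fs);          % 12 MFCCs per frame (VOICEBOX default)
myutterance = {c'};            % wrap like testdata_N{}; check the vector orientation!
[r L p] = recognize(myutterance,HMM_1,HMM_2,HMM_3,HMM_4,HMM_5)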
