
[Points: 12.5; Issued: 2007/06/15; Deadline: 2007/07/03; Tutor:
Qiang Chen; Info hour: 2007/06/29, 16:15-17:15, HS i11;
Review of graded work (Einsichtnahme): 2007/07/09, 16:15-17:15,
HS i11]
Train HMMs with Gaussian mixture emission pdfs and use them to
recognize utterances of the English digits 'one' to 'five'. Go
through the tutorial "Mixtures of Gaussians" and download the
accompanying MATLAB programs and data.
Append the MATLAB script you wrote to produce all the results to
the end of the report.
Please state the name and Matrikelnummer (student ID number) of
each team member in the report.
1. Load the signals into MATLAB using load digits, and play them
with the MATLAB functions sound or wavplay, e.g.,
sound(three15).
Process the speech signals to obtain parameters (features) suitable
for speech recognition. For the signals loaded from digits.mat,
this is done with the function preproc() (without arguments). This
function produces a cell array data_N{} for each digit N, holding
the parameter vectors (mel-frequency cepstral coefficients,
12-dimensional) for each training signal (e.g., data_1{2} holds
the sequence of parameter vectors for the second example of digit
'one'), as well as a cell array testdata_N{} for each digit N,
holding the parameter vectors for each test signal. preproc() uses
functions that are part of VOICEBOX, a MATLAB toolbox for speech
processing.
The sequences of parameter vectors differ in length (because the
speech signals differ in length), which is why we cannot store all
sequences for training or testing in a single array.
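As a quick check of the preprocessing step, the following minimal sketch prints the size of each training sequence for digit 'one'. It assumes that digits.mat and preproc are on the MATLAB path and that preproc places data_1 in the workspace. Whether the frames run along rows or columns is not stated here, so the sketch only prints the sizes.

```matlab
load digits     % speech signals from digits.mat
preproc         % creates data_N{} and testdata_N{} for N = 1..5

% Print the size of every training sequence of digit 'one'.
% One dimension should be 12 (the cepstral coefficients); the other
% is the number of frames and differs from example to example.
for i = 1:numel(data_1)
    disp(size(data_1{i}));
end
```

The varying frame count is the reason a cell array is used instead of a single numeric array.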
2. Train one HMM for each digit. Training of the HMM parameters
(emission pdfs using the EM algorithm, as well as prior and
transition probabilities) is done using the function
train:
» [HMM] = train(data,K,Ns,nr_iter)
where Ns denotes the number of HMM states, K the
number of Gaussian mixture components in the emission pdfs, and
nr_iter the number of iterations. The function trains
left-to-right HMMs whose Gaussian mixture components have
diagonal covariance matrices.
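The training call could be applied to all five digits as in the following minimal sketch. The values of Ns, K and nr_iter are illustrative assumptions, not recommended settings, and data_1 to data_5 are assumed to exist in the workspace after preprocessing.

```matlab
% Illustrative settings only (assumptions, not values from this sheet).
Ns      = 3;    % number of HMM states
K       = 2;    % number of Gaussian mixture components per state
nr_iter = 10;   % number of training iterations

% One model per digit, using the argument order train(data,K,Ns,nr_iter).
HMM_1 = train(data_1, K, Ns, nr_iter);
HMM_2 = train(data_2, K, Ns, nr_iter);
HMM_3 = train(data_3, K, Ns, nr_iter);
HMM_4 = train(data_4, K, Ns, nr_iter);
HMM_5 = train(data_5, K, Ns, nr_iter);
```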
3. Determine the recognition rate on the test signals. To test the
models, use the function recognize as described in the
tutorial, e.g., for digit 1:
» [r1,L1,p1] = recognize(testdata_1,HMM_1,HMM_2,HMM_3,HMM_4,HMM_5)
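One way to turn the output of recognize into recognition rates is sketched below. It rests on an assumption that this sheet does not confirm: that the first output is a vector holding, for each test sequence, the index (1 to 5) of the best-scoring model. Please check the tutorial for the actual meaning of the outputs. The five models HMM_1 to HMM_5 are assumed to have been trained already with train.

```matlab
% ASSUMPTION: the first output of recognize is a vector with one entry
% per test sequence, giving the index (1..5) of the winning model.
testsets  = {testdata_1, testdata_2, testdata_3, testdata_4, testdata_5};
confusion = zeros(5, 5);   % rows: spoken digit, columns: recognized digit

for n = 1:5
    r = recognize(testsets{n}, HMM_1, HMM_2, HMM_3, HMM_4, HMM_5);
    for m = 1:5
        confusion(n, m) = sum(r(:) == m);   % count decisions for model m
    end
end

% Recognition rate per digit and over the whole test set.
rate_per_digit = diag(confusion) ./ sum(confusion, 2);
rate_total     = sum(diag(confusion)) / sum(confusion(:));
disp(confusion); disp(rate_per_digit'); disp(rate_total);
```

The off-diagonal entries of confusion show which digits are mistaken for which.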
4. In your report, note down your chosen settings, intermediate
results and considerations, and the recognition results
(recognition rate for each digit and for the whole set). Which
digits seem easier to recognize? Which digits are easily
confused during recognition? Use different values for the number of
states (Ns=2...5) and the number of Gaussian mixture
components (K=1...3). How do these numbers affect the
computational effort for training and the recognition rate? Which
values of Ns and K do you think are optimal?
Please present your recognition results (for each digit and for the
whole data set) as well as the training time for the models as
tables/figures (depending on the choice of Ns and
K). To measure the training time, use the MATLAB
commands tic and toc.
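The sweep over Ns and K with timing could be organized as in the following minimal sketch. The variable names t_train, HMMs and train_sets, as well as the value of nr_iter, are assumptions made for illustration. data_1 to data_5 are assumed to be in the workspace.

```matlab
train_sets = {data_1, data_2, data_3, data_4, data_5};
Ns_values  = 2:5;     % numbers of states to try
K_values   = 1:3;     % numbers of mixture components to try
nr_iter    = 10;      % illustrative assumption

t_train = zeros(numel(Ns_values), numel(K_values));   % seconds per setting
HMMs    = cell(numel(Ns_values), numel(K_values), numel(train_sets));

for i = 1:numel(Ns_values)
    for j = 1:numel(K_values)
        tic;                                   % start timer for this setting
        for n = 1:numel(train_sets)
            HMMs{i, j, n} = train(train_sets{n}, K_values(j), ...
                                  Ns_values(i), nr_iter);
        end
        t_train(i, j) = toc;                   % time to train all five models
    end
end

disp(t_train);   % rows: Ns = 2..5, columns: K = 1..3
```

Storing the trained models in a cell array keeps them available for the recognition experiments without retraining.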
5. Why does it seem reasonable to use left-to-right HMMs for this
task, and for speech in general? What are the
advantages/disadvantages? Please modify the function train
to train ergodic models and run experiments similar to those in
task 4 for Ns=2...5 and K=1...3. Again, present tables
of the recognition rate and the computational effort for
training.
6. Why do we use diagonal covariance matrices for the Gaussian
mixtures? What assumption do we make if we do so? Please modify
the function train to train left-to-right models with
full covariance matrices. Again, run experiments similar to those
in task 4 for Ns=2...5 and K=1...3 and present tables
of the recognition rate and the computational cost of training.
Also give a table of the number of parameters we have to train in
this case, compared to the left-to-right HMM using diagonal
covariance matrices for the Gaussian mixtures (as in task 4). Is
there a connection between the number of parameters we have to
train, the number of training samples we have available, and the
recognition performance? Please include a short discussion of this
in your report.
7. What kind of model do we have if the HMM has just Ns=1
state? Please give the equation for this model in terms of its
parameters.
