
[Points: 8; Issued: 2004/06/10; Deadline: 2004/06/28; Tutor:
Thomas Zilaji; Info hour: 2004/06/21, 12:00-13:00,
Seminarraum IGI; Review of graded work: 2004/07/05, 12:00-13:00,
Seminarraum IGI; Download: pdf; ps.gz]
Train HMMs with Gaussian mixture emission pdfs and use them to
recognize utterances of the English digits `one' to `five'.
 Load the signals into MATLAB using load
digits, and play them with the MATLAB functions
sound or wavplay, e.g., sound(three15).
Process the speech signals to get parameters (features) suitable
for speech recognition purposes. For the signals loaded from
digits.mat this is done using the function
preproc() (without arguments). This function produces a
cell array data_N{} for each digit N holding the
parameter vectors (mel-frequency cepstral coefficients,
12-dimensional) for each training signal (e.g., data_1{2}
holds the sequence of parameter vectors for the second example of
digit `one'), as well as a cell array testdata_N{} for
each digit N for each test signal. preproc() uses
functions that are part of VOICEBOX, a MATLAB
toolbox for speech processing.
The sequences of parameter vectors differ in length (just as the
speech signals do), which is why the sequences for training or
testing cannot all be stored in a single array.
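As a minimal illustration of why cell arrays are used here, the sketch below stores feature sequences of different lengths in one cell array. The data are random placeholders, the variable names demo, lengths and T are hypothetical, and the layout (one 12-dimensional vector per column) is an assumption; the orientation actually produced by preproc() may differ.

```matlab
% Hypothetical stand-in for data_N: random values, not real MFCCs.
% Assumption: each cell holds a 12-by-T matrix (one vector per column).
lengths = [40 55 47];                 % number of frames per utterance
demo = cell(1, numel(lengths));
for n = 1:numel(lengths)
    demo{n} = randn(12, lengths(n));  % sequence n has its own length
end

% Query the length of every sequence without assuming a common size.
T = cellfun(@(x) size(x, 2), demo);
disp(T)
```

A plain numeric array would force all sequences to a common length, whereas each cell can hold a matrix of its own size.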
 Train one HMM for each digit. Training of HMM parameters
(emission pdfs using the EM algorithm, as well as prior and
transition probabilities) is done using the function
train:
» [HMM] = train(data,K,Ns,nr_iter)
where data is the cell array of training sequences for one digit
(e.g., data_1), Ns denotes the number of HMM states, K the
number of Gaussian mixture components in each emission pdf, and
nr_iter the number of training iterations. The function trains
left-to-right HMMs whose Gaussian mixture components have diagonal
covariance matrices.
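To illustrate what a left-to-right structure means for the transition probabilities, the following sketch builds a transition matrix in which each state can only be kept or left for the next state. The probability values, the absence of skip transitions, and the variable names A and prior are illustrative assumptions; this is not necessarily the initialization used inside train.

```matlab
Ns = 4;                       % number of HMM states (example value)
A = zeros(Ns);                % A(i,j) = P(next state j | current state i)
for i = 1:Ns-1
    A(i, i)   = 0.5;          % stay in state i
    A(i, i+1) = 0.5;          % advance to state i+1
end
A(Ns, Ns) = 1;                % last state: only a self-transition in this sketch
prior = [1, zeros(1, Ns-1)];  % assumption: always start in the first state

disp(A)
```

All entries below the main diagonal are zero, so the model can never return to an earlier state. An ergodic model would allow nonzero entries everywhere in A.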
 Determine the recognition rate on the test signals. To test the
model use the function recognize as described in the
tutorial, e.g., for digit 1:
» [r1,L1,p1] = recognize(testdata_1,HMM_1,HMM_2,HMM_3,HMM_4,HMM_5)
 In your report note down your chosen settings, intermediate
results and considerations, and the recognition results
(recognition rate for each digit, and for the whole set). Which
digits seem easier to recognize? Which digits are easily
confused during recognition? Try different values for the number of
states (Ns) and the number of Gaussian mixture components
(K). How do these numbers affect training and recognition?
Which values for Ns and K do you think are
optimal?
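The per-digit and overall recognition rates, and the confusions between digits, can be summarized in a confusion matrix. The sketch below assumes that the recognized digit for every test utterance is available as a vector of indices 1 to 5; how this maps onto the outputs of recognize is described in the tutorial and is not assumed here. The label vectors are made-up placeholders, not results, and all variable names are hypothetical.

```matlab
% Made-up placeholder labels, NOT real recognition results.
truth      = [1 1 1 2 2 2 3 3 3 4 4 4 5 5 5];  % true digit of each test signal
recognized = [1 1 5 2 2 2 3 3 3 4 4 1 5 5 5];  % hypothetical recognizer output

nDigits = 5;
C = zeros(nDigits);                    % C(i,j): digit i recognized as digit j
for n = 1:numel(truth)
    C(truth(n), recognized(n)) = C(truth(n), recognized(n)) + 1;
end

ratePerDigit = diag(C)' ./ sum(C, 2)';    % recognition rate for each digit
rateOverall  = sum(diag(C)) / sum(C(:));  % recognition rate for the whole set

disp(C)
```

Off-diagonal entries of C show which digit pairs get confused, while the diagonal holds the correctly recognized utterances.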
 (optional) Record some test examples of digits yourself, and
try to recognize them! (Consult preproc.m to find out how to
produce the cell array with feature vectors from the speech
signal.)
Questions:
 Why does it seem reasonable to use left-to-right HMMs for this
task, and for speech in general? What are the
advantages/disadvantages? (You can modify the function
train to train ergodic models.)
 Why do we use diagonal covariance matrices for the Gaussian
mixtures? What assumption do we make in doing so? (You can also
modify the function train to train models with a full
covariance matrix.)
