
[Points: 10; Issued: 2003/06/06; Deadline: 2003/06/20; Tutor:
Emir Serdarevic; Info hour: 2003/06/18, 14:00-15:00,
Seminarraum IGI; Review of graded work: 2003/07/02, 14:00-15:00,
Seminarraum IGI; Download: pdf; ps.gz]
 HMMs for utterances of the English digits one to five have to
be trained. Afterwards these models are used to recognize
utterances of digits. You can load the data in MATLAB with load
digits and play it with the MATLAB functions
sound or wavplay, e.g. sound(three15).
We need to process the data from these wave files to obtain
features suitable for speech recognition. This is done
with the function preproc, which does not take any
arguments. For each digit it produces a cell array, e.g.
data_1 for digit 1, containing one feature sequence per
utterance of that digit. preproc uses functions
from VOICEBOX, a MATLAB toolbox for speech
processing. The output for each utterance is a sequence of
12-dimensional feature vectors.
These sequences differ in length (because the spoken
utterances differ in length), so the sequences for training
or testing cannot be stored in a single array in
MATLAB. Following the convention of the BNT
toolkit, the data is stored in a cell array:
data_s{a} is the matrix of observation vectors for
sequence a of digit s.
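
The cell-array layout described above can be tried out without the real
data. The following is a minimal sketch only: data_demo and the frame
counts are made up, and it assumes that each sequence is stored as a
12 x T matrix with one feature vector per column. Check the actual
output of preproc before relying on that orientation.

```matlab
% Hypothetical stand-in for one digit's output of preproc: three
% utterances, each a 12 x T matrix of feature vectors.
% ASSUMPTION: one frame per column; verify against the real data_1.
lengths = [40 55 32];                    % made-up frame counts
data_demo = cell(1, numel(lengths));
for a = 1:numel(lengths)
    data_demo{a} = randn(12, lengths(a));   % random placeholder features
end

% A cell array can hold matrices of different sizes, a plain array cannot.
T = cellfun(@(x) size(x, 2), data_demo);    % frames per sequence
fprintf('sequence %d has %d frames\n', [1:numel(T); T]);
```

With the real data, the same cellfun call applied to data_1 shows how
much the utterance lengths vary.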
 Train one HMM for each digit. To train a model you can use
the function train, e.g.
[HMM]=train(data_1,M,Q,nr_iter) for digit 1,
where Q denotes the number of states, M the
number of Gaussian mixture components and nr_iter the number of
iterations. The function trains left-to-right HMMs with diagonal
covariance matrices.
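
To make the term left-to-right concrete, here is a small sketch of how
such a model could be initialized. It is not taken from the function
train: the variable names and the initial self-transition probability
stay are assumptions chosen for illustration.

```matlab
Q = 4;            % number of states (example value)
stay = 0.5;       % hypothetical initial self-transition probability

% Left-to-right structure: from state i the chain may stay in state i
% or move on to state i+1; all other transitions are zero.
A = stay * eye(Q) + (1 - stay) * diag(ones(Q - 1, 1), 1);
A(Q, Q) = 1;      % the last state can only loop on itself

prior = zeros(Q, 1);
prior(1) = 1;     % every sequence starts in the first state

disp(A);
```

Transition probabilities that are initialized to zero remain zero
during EM re-estimation, so the left-to-right structure is preserved
throughout training.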
 Determine the recognition rate: To test the models you can use
the function recognize as described in the tutorial,
e.g. [r1 L1 p1]=recognize(testdata_1,HMM_1,HMM_2,HMM_3,HMM_4,HMM_5)
for digit 1.
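
The usual decision rule for this kind of recognizer is to score each
test utterance with every model and pick the model with the highest
log-likelihood. The sketch below illustrates that rule with made-up
numbers. It is an assumption that recognize works this way; see the
tutorial for the actual meaning of its outputs.

```matlab
% Made-up log-likelihoods, for illustration only: one row per test
% utterance of digit 1, one column per model HMM_1 ... HMM_5.
loglik = [ -310 -342 -365 -350 -371;
           -298 -305 -340 -333 -352;
           -325 -319 -347 -360 -358;
           -301 -330 -344 -349 -366 ];

[~, recognized] = max(loglik, [], 2);     % index of the best-scoring model
true_digit = 1;
rate = mean(recognized == true_digit);    % fraction recognized correctly

fprintf('recognized digits: %s\n', mat2str(recognized'));
fprintf('recognition rate for digit %d: %.2f\n', true_digit, rate);
```

In this toy example the third utterance is scored highest by the model
for digit 2, so three of four utterances are recognized correctly.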
 Why does it seem reasonable to use left-to-right models in this
task, and for speech in general? What are the
advantages and disadvantages? (You can easily change the function
train to train ergodic models.)
 Why do we use a diagonal covariance matrix? What assumption do
we have to make if we do so? (You can easily change the function
train to train models with a full covariance matrix.)
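
As a starting point for this question, it can help to count how many
covariance parameters have to be estimated in each case. The sketch
below uses the 12-dimensional features from preproc. The values of Q
and M are example values only, and it assumes one covariance matrix
per state and mixture component.

```matlab
d = 12;                        % feature dimension from preproc
Q = 3;  M = 2;                 % example values only

cov_diag = d;                  % variances on the diagonal only
cov_full = d * (d + 1) / 2;    % unique entries of a symmetric matrix

n_gauss = Q * M;               % ASSUMPTION: one Gaussian per state and component
fprintf('diagonal: %d per Gaussian, %d in total\n', cov_diag, n_gauss * cov_diag);
fprintf('full:     %d per Gaussian, %d in total\n', cov_full, n_gauss * cov_full);
```

Comparing these counts with the number of training frames available
per digit gives a rough idea of how well each variant can be estimated.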
 Write a report about your chosen settings, interesting
intermediate results and considerations, and the recognition
results (recognition rate for each digit and for the whole set). Which
digits seem easier to recognize? Which digits are easily
confused in the recognition? Use different numbers of
states (Q=2,3,4,5) and Gaussian mixture components
(M=1,2,3,4,5). What can you observe during training and
recognition? Which values for Q and M would you
choose?
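
For the report, the per-digit and overall recognition rates and the
confusions between digits can be read off a confusion matrix. The
following is a sketch with made-up labels: true_lab and rec_lab are
hypothetical and would have to be filled with the true digits and the
recognition results of your own experiments.

```matlab
% Made-up labels for illustration: three test utterances per digit.
true_lab = [1 1 1 2 2 2 3 3 3 4 4 4 5 5 5];
rec_lab  = [1 1 2 2 2 2 3 5 3 4 4 4 5 3 5];

% C(i, j) counts utterances of digit i that were recognized as digit j.
C = accumarray([true_lab' rec_lab'], 1, [5 5]);

rate_per_digit = diag(C)' ./ sum(C, 2)';   % correct / total, per digit
rate_total = sum(diag(C)) / sum(C(:));     % correct / total, whole set

disp(C);
fprintf('rate for digit %d: %.2f\n', [1:5; rate_per_digit]);
fprintf('overall rate: %.2f\n', rate_total);
```

Off-diagonal entries of C show which digits are confused with each
other. Computing one such matrix per (Q, M) setting makes the settings
easy to compare.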

