The Hidden Markov Model Toolkit (HTK) is a portable toolkit for building and
manipulating hidden Markov models. It is primarily used for research in
speech recognition and is in use at hundreds of sites worldwide.
We will demonstrate HTK on a rather simple recognition experiment. The
lexicon contains two entries, the German words `ja' (yes) and `nein' (no).
We will first train the acoustic models (HMMs) on a set of training data,
and then evaluate the HMMs on a distinct set of test utterances.
HTK tools are designed to run with a traditional command-line interface.
Each tool takes a number of required arguments plus optional arguments. The
speech data used for the following experiment is part of SpeechDat-AT, a
speech database for Austrian German.
We need speech data to train the HMMs; it is saved in the Q/ directory.
Furthermore, a dictionary is needed that defines the valid words and their
pronunciations (cf. table 1) for the recognition. The dictionary for our
task is in the file `dict'. The file ``phones0.mlf'' contains the
phone-level transcriptions of all training utterances. We will use two
different sets of training data, one of about 40 and one of about 1000
utterances. The sets are defined in the files ``train1.scp'' (smaller set)
and ``train2.scp'' (bigger set).
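To illustrate the dictionary format (the actual entries are in `dict' and
may differ), an HTK dictionary lists one word per line, followed by its
pronunciation as a sequence of phone symbols, e.g.

    JA    j a:
    NEIN  n aI n

The phone symbols shown here are SAMPA-style placeholders; the labels in
``phones0.mlf'' must be drawn from the same phone inventory as the
dictionary for the training to work.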
In the first stage of the ASR system the raw speech data (signal waveforms)
are parametrized into sequences of feature vectors (cf. fig. 2). Since this
is a somewhat time-consuming task (and since the speech signal files are
very large), this feature extraction has been done in advance. The feature
vector sequences for the training and test data sets can be found in the
directory Q/.
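For reference, this parametrization is typically done with the HTK tool
HCopy; a hypothetical invocation (the names config and code.scp are
placeholders here; the configuration file specifies the target feature
kind, e.g. MFCCs) would look like

    HCopy -C config -S code.scp

where code.scp lists pairs of source waveform files and target feature
files.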
The parameters of the HMMs are trained using the EM algorithm. The emission
pdfs are mixtures of Gaussians (cf. tutorial `Mixtures of Gaussians'). The
EM algorithm is iterative; each iteration is performed in HTK by a call to
the tool HERest. To find initial parameters, the HTK tool HCompV can be
used: it scans the set of training feature files, computes the global mean
and variance, and sets the parameters of all pdfs in a given HMM to this
mean and variance.
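For illustration, typical invocations of these two tools could look as
follows (proto, a prototype HMM definition, config, and monophones, a list
of the phone models, are assumed placeholder names; the actual commands are
assembled by train.pl, see below):

    HCompV -C config -m -S train1.scp -M ABC_hmm0 proto
    HERest -C config -I phones0.mlf -S train1.scp \
           -H ABC_hmm0/macros -H ABC_hmm0/hmmdefs -M ABC_hmm1 monophones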
The invocation of these two HTK tools is implemented in the Perl program
train.pl. Its first argument specifies the base name of the new directories
that will be created to store the parameters of the trained HMMs. The
second argument is the training file list (use either train1.scp or
train2.scp), and the third argument specifies the number of EM iterations
to perform. For example, the command perl train.pl ABC train1.scp 2 creates
three new directories: ABC_hmm0 (with the initial HMM parameters), ABC_hmm1
(the HMM parameters after the first iteration), and ABC_hmm2 (the HMM
parameters after the second and final iteration). The commands used by the
Perl program to invoke the HTK tools are echoed in the command-line window.
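The script itself is roughly structured as follows; this is a minimal
sketch under the assumptions above (placeholder file names, assembly of the
initial HMM definition files omitted), not the actual code:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # usage: perl train.pl <basename> <trainlist> <iterations>
    my ($base, $scp, $niter) = @ARGV;

    # initialization: global mean/variance via HCompV into <basename>_hmm0
    mkdir "${base}_hmm0";
    run("HCompV -C config -m -S $scp -M ${base}_hmm0 proto");
    # (building macros/hmmdefs from the HCompV output is omitted here)

    # one HERest call per EM iteration
    for my $i (1 .. $niter) {
        my $prev = $i - 1;
        mkdir "${base}_hmm$i";
        run("HERest -C config -I phones0.mlf -S $scp "
          . "-H ${base}_hmm$prev/macros -H ${base}_hmm$prev/hmmdefs "
          . "-M ${base}_hmm$i monophones");
    }

    # echo each command before executing it, as train.pl does
    sub run {
        my ($cmd) = @_;
        print "$cmd\n";
        system($cmd) == 0 or die "command failed: $cmd\n";
    }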
The HTK tool HVite is a general-purpose Viterbi word recognizer. It matches
speech signals against a network of HMMs and returns a transcription for
each speech signal. HResults is the HTK performance analysis tool. It reads
in a set of label files (typically output from a recognition tool such as
HVite) and compares them with the corresponding reference transcriptions.
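Hypothetical invocations of these two tools (wdnet, a word network,
monophones, and testref.mlf, the reference transcriptions, are assumed
names; test.scp lists the test feature files) could be

    HVite -H ABC_hmm2/macros -H ABC_hmm2/hmmdefs -S test.scp \
          -i recout.mlf -w wdnet dict monophones
    HResults -I testref.mlf monophones recout.mlf

HVite writes its transcriptions to recout.mlf, which HResults then compares
against the reference labels in testref.mlf.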
The Perl script test.pl first calls HVite to perform speech recognition and
obtain a transcription of the test speech signals (this may take a while!),
and then HResults to compute recognition statistics, such as the percentage
of correctly recognized words. Its first and only argument is the name of
the directory where the HMM specifications are saved. For example, to use
the trained models from the last subsection for recognition, type perl
test.pl ABC_hmm2. Again, the commands for calling the HTK tools are echoed
in the command-line window. Eventually, the recognition statistics are
displayed.
In the recognition statistics, the first line gives the sentence-level
accuracy, based on the number of transcriptions generated by the recognizer
that are identical to the corresponding reference transcriptions. The
second line contains numbers concerning the word accuracy of the
transcriptions generated by the recognizer.
Here, $H$ is the number of correct words, $D$ is the number of deletions
(words that are present in the reference transcription, but are `deleted'
by the recognizer and do not occur in the recognizer's transcription), $S$
is the number of substitutions (words in the reference transcription that
are `substituted' by other words in the recognizer's transcription), $I$ is
the number of insertions (words that are present in the recognizer's
transcription but not in the reference), and $N$ is the total number of
words in the reference transcription. The percentage of correctly
recognized words is given by
$$\%\mathit{Corr} = \frac{H}{N} \times 100\% = \frac{N - D - S}{N} \times 100\%,$$
and the word recognition accuracy is
$$\%\mathit{Acc} = \frac{H - I}{N} \times 100\% = \frac{N - D - S - I}{N} \times 100\%.$$
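As an illustration (the numbers are made up), the relevant part of the
HResults output has the following form:

    SENT: %Correct=94.00 [H=188, S=12, N=200]
    WORD: %Corr=95.00, Acc=94.00 [H=190, D=4, S=6, I=2, N=200]

Note that $H = N - D - S$ (here $190 = 200 - 4 - 6$), so $\%\mathit{Corr}$
and $\%\mathit{Acc}$ can be read off directly from the bracketed counts.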
- Train HMMs using train.pl. Try it with the small (train1.scp) and the
large (train2.scp) training set. Try out different numbers of iterations.
- Test the trained models using test.pl.
- Write a report about all experiments you performed: state the chosen
settings, the results, and your interpretations.
- Do you get an improvement using the larger training set? How many
iterations of the EM algorithm are necessary in the case of the small/large
training set?
- If you wanted to build a yes/no recognizer, would you use monophone
models, as we did, or do you think whole-word models are more suitable?
Give reasons!