Statistical pattern recognition



A-priori class probabilities

Experiment:

Load the data from the file vowels.mat. This file contains a database of 2-dimensional samples of speech features in the form of formant frequencies (the first and second spectral formants, $[F_1,F_2]$). The formant frequency samples represent features that would be extracted from the speech signal for several occurrences of the vowels /a/, /e/, /i/, /o/, and /y/. They are grouped in matrices of size $N \times 2$, where each of the $N$ rows contains the two formant frequencies for one occurrence of a vowel.

Supposing that the whole database adequately covers an imaginary language made up only of /a/'s, /e/'s, /i/'s, /o/'s, and /y/'s, compute the probability $P(q_k)$ of each class $q_k$, $k \in \{\text{/a/},\text{/e/},\text{/i/},\text{/o/},\text{/y/}\}$. Which is the most common and which the least common phoneme in our imaginary language?

Example:

» clear all; load vowels.mat; whos

» Na = size(a,1); Ne = size(e,1); Ni = size(i,1); No = size(o,1); Ny = size(y,1);

» N = Na + Ne + Ni + No + Ny;

» Pa = Na/N

» Pi = Ni/N

etc. (compute Pe, Po, and Py in the same way)
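A compact alternative is to collect all five class counts in one vector and read off the most and least common phoneme directly. This is only a sketch, assuming the matrices a, e, i, o, y from vowels.mat are in the workspace (the variable names counts, priors and vowels are introduced here for illustration):

» counts = [size(a,1), size(e,1), size(i,1), size(o,1), size(y,1)]; % occurrences per vowel

» priors = counts / sum(counts); % P(q_k) for /a/, /e/, /i/, /o/, /y/

» vowels = {'/a/','/e/','/i/','/o/','/y/'};

» [pmax, kmax] = max(priors); [pmin, kmin] = min(priors);

» fprintf('most common: %s (P=%.3f), least common: %s (P=%.3f)\n', vowels{kmax}, pmax, vowels{kmin}, pmin);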



Gaussian modeling of classes

Experiment:

Plot each vowel's data as a cloud of points in the 2D plane. Train the Gaussian model for each class (use the mean and cov commands directly). Plot the model contours (use the function plotgaus(mu,sigma,color), where color = [R,G,B]).

Example:

» plotvow; % Plot the clouds of simulated vowel features

(Do not close the resulting figure; it will be used again later.)

Then compute and plot the Gaussian models:

» mu_a = mean(a);

» sigma_a = cov(a);

» plotgaus(mu_a,sigma_a,[0 1 1]);

» mu_e = mean(e);

» sigma_e = cov(e);

» plotgaus(mu_e,sigma_e,[0 1 1]);

etc.
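The same five fits can also be done in a loop, which keeps the code short and makes it easy to give each vowel its own contour color. This is a sketch rather than part of the original exercise; it assumes the data matrices a, e, i, o, y are loaded, and it stores the parameters in the cell arrays mus and sigmas (names chosen here for illustration) so they can be reused later:

» data = {a, e, i, o, y}; % class data, same order as the priors above

» colors = [0 1 1; 1 0 1; 1 1 0; 0 1 0; 1 0 0]; % one (arbitrary) color per vowel

» for k = 1:5
»     mus{k} = mean(data{k}); % sample mean of the class
»     sigmas{k} = cov(data{k}); % sample covariance of the class
»     plotgaus(mus{k}, sigmas{k}, colors(k,:));
» end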



Bayesian classification

We will now determine how to classify a feature vector $\mathbf{x}_i$ from a data sample (or a set of several feature vectors $X$) as belonging to a certain class $q_k$.

Useful formulas and definitions:

  • Bayes' decision rule:
    $\displaystyle X \in q_k \quad\text{if}\quad P(q_k\vert X,\boldsymbol{\Theta}) \geq P(q_j\vert X,\boldsymbol{\Theta}), \quad \forall j \neq k $
    This formula means: given a set of classes $q_k$, characterized by a set of known parameters in the model $\boldsymbol{\Theta}$, a sample $X$ of one or more speech feature vectors (also called observations) belongs to the class that has the highest probability once we actually know (or "see", or "measure") the sample $X$. $P(q_k\vert X,\boldsymbol{\Theta})$ is therefore called the a posteriori probability, because it depends on having seen the observations, as opposed to the a priori probability $P(q_k\vert\boldsymbol{\Theta})$, which does not depend on any observation (but does of course depend on knowing how to characterize all the classes $q_k$, i.e., on knowing the parameter set $\boldsymbol{\Theta}$).
  • For some classification tasks (e.g., speech recognition), it is practical to resort to Bayes' law, which makes use of likelihoods (see sec. 1.3), rather than trying to estimate the posterior probability $P(q_k\vert X,\boldsymbol{\Theta})$ directly. Bayes' law says:
    $\displaystyle P(q_k\vert X,\boldsymbol{\Theta}) = \frac{p(X\vert q_k,\boldsymbol{\Theta})\; P(q_k\vert\boldsymbol{\Theta})}{p(X\vert\boldsymbol{\Theta})}$ (4)

    where $q_k$ is a class, $X$ is a sample containing one or more feature vectors, and $\boldsymbol{\Theta}$ is the parameter set of all the class models.
  • The speech features are usually considered equiprobable: $p(X\vert\boldsymbol{\Theta}) = \text{const.}$ (uniform prior distribution for $X$). Hence, $P(q_k\vert X,\boldsymbol{\Theta})$ is proportional to $p(X\vert q_k,\boldsymbol{\Theta})\; P(q_k\vert\boldsymbol{\Theta})$ for all classes:
    $\displaystyle P(q_k\vert X,\boldsymbol{\Theta}) \propto p(X\vert q_k,\boldsymbol{\Theta})\; P(q_k\vert\boldsymbol{\Theta}), \quad \forall k $
  • Once again, it is more convenient to do the computation in the $\log$ domain:
    $\displaystyle \log P(q_k\vert X,\boldsymbol{\Theta}) \propto \log p(X\vert q_k,\boldsymbol{\Theta}) + \log P(q_k\vert\boldsymbol{\Theta})$ (5)


In our case, $\boldsymbol{\Theta}$ represents the set of all the means $\boldsymbol{\mu}_k$ and covariance matrices $\boldsymbol{\Sigma}_k$, $k \in \{\text{/a/},\text{/e/},\text{/i/},\text{/o/},\text{/y/}\}$, of our data generation model. $p(X\vert q_k,\boldsymbol{\Theta})$ and $\log p(X\vert q_k,\boldsymbol{\Theta})$ are the joint likelihood and joint log-likelihood (eq. 2 in section 1.3) of the sample $X$ with respect to the model for class $q_k$ (i.e., the model with parameter set $(\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)$).

The probability $P(q_k\vert\boldsymbol{\Theta})$ is the a-priori class probability for the class $q_k$. It defines an absolute probability of occurrence for the class $q_k$. The a-priori class probabilities for our phoneme classes have been computed in section 2.1.

Experiment:

Now that we have modeled each vowel class with a Gaussian pdf (by computing its mean and covariance), we know the probability $P(q_k)$ of each class in the imaginary language (sec. 2.1), which we assume to be the correct a priori probability $P(q_k\vert\boldsymbol{\Theta})$ for each class given our model $\boldsymbol{\Theta}$. Furthermore, we assume that the speech features $\mathbf{x}_i$ (as opposed to the speech classes $q_k$) are equiprobable.

What is the most probable class $q_k$ for each of the formant pairs (features) $\mathbf{x}_i = [F_1,F_2]^{\mathsf T}$ given in the table below? Compute the values of the functions $f_k(\mathbf{x}_i)$ for our model $\boldsymbol{\Theta}$ as the right-hand side of eq. 5, $f_k(\mathbf{x}_i) = \log p(\mathbf{x}_i\vert q_k,\boldsymbol{\Theta}) + \log P(q_k\vert\boldsymbol{\Theta})$, which is proportional to the log of the posterior probability that $\mathbf{x}_i$ belongs to class $q_k$.



| $i$ | $\mathbf{x}_i=[F_1,F_2]^{\mathsf T}$ | $f_{\text{/a/}}(\mathbf{x}_i)$ | $f_{\text{/e/}}(\mathbf{x}_i)$ | $f_{\text{/i/}}(\mathbf{x}_i)$ | $f_{\text{/o/}}(\mathbf{x}_i)$ | $f_{\text{/y/}}(\mathbf{x}_i)$ | Most prob. class $q_k$ |
| 1 | $[400,1800]^{\mathsf T}$ |  |  |  |  |  |  |
| 2 | $[400,1000]^{\mathsf T}$ |  |  |  |  |  |  |
| 3 | $[530,1000]^{\mathsf T}$ |  |  |  |  |  |  |
| 4 | $[600,1300]^{\mathsf T}$ |  |  |  |  |  |  |
| 5 | $[670,1300]^{\mathsf T}$ |  |  |  |  |  |  |
| 6 | $[420,2500]^{\mathsf T}$ |  |  |  |  |  |  |

Example:

Use the function gloglike(point,mu,sigma) to compute the log-likelihoods $\log p(\mathbf{x}_i\vert q_k,\boldsymbol{\Theta})$. Don't forget to add the log of the prior probability $P(q_k\vert\boldsymbol{\Theta})$! E.g., for the feature vector $\mathbf{x}_1$ and class /a/ use

» gloglike([400,1800],mu_a,sigma_a) + log(Pa)
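To fill in the whole table at once, the same call can be wrapped in two loops. This sketch assumes the cell arrays mus and sigmas and the vector priors from the earlier sketches (or, equivalently, mu_a, sigma_a, Pa, etc. computed as in the examples above); the matrix X and the array f are introduced here for illustration:

» X = [400 1800; 400 1000; 530 1000; 600 1300; 670 1300; 420 2500]; % the six feature vectors of the table

» vowels = {'/a/','/e/','/i/','/o/','/y/'};

» f = zeros(size(X,1), 5);

» for n = 1:size(X,1)
»     for k = 1:5
»         f(n,k) = gloglike(X(n,:), mus{k}, sigmas{k}) + log(priors(k)); % log-likelihood + log-prior
»     end
»     [fbest, kbest] = max(f(n,:));
»     fprintf('x_%d -> most probable class %s\n', n, vowels{kbest});
» end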





Discriminant surfaces

For the Bayesian classification in the last section we made use of the discriminant functions $f_k(\mathbf{x}_i) = \log p(\mathbf{x}_i\vert q_k,\boldsymbol{\Theta}) + \log P(q_k\vert\boldsymbol{\Theta})$ to classify data points $\mathbf{x}_i$. This corresponds to establishing discriminant surfaces of dimension $d-1$ in the vector space of $\mathbf{x}$ (dimension $d$) that separate the regions belonging to the different classes.

Useful formulas and definitions:

  • Discriminant function: a set of functions $f_k(\mathbf{x})$ can be used to classify a sample $\mathbf{x}$ into one of $k$ classes $q_k$ if:
    $\displaystyle \mathbf{x} \in q_k \quad \Leftrightarrow \quad f_k(\mathbf{x},\boldsymbol{\Theta}_k) \geq f_l(\mathbf{x},\boldsymbol{\Theta}_l), \quad \forall l \neq k $
    In this case, the $k$ functions $f_k(\mathbf{x})$ are called discriminant functions.

The a-posteriori probability $P(q_k\vert\mathbf{x}_i)$ that a sample $\mathbf{x}_i$ belongs to class $q_k$ is itself a discriminant function:

$\displaystyle \mathbf{x}_i \in q_k \quad\Leftrightarrow\quad P(q_k\vert\mathbf{x}_i) \geq P(q_l\vert\mathbf{x}_i), \quad \forall l \neq k$
$\displaystyle \phantom{\mathbf{x}_i \in q_k} \quad\Leftrightarrow\quad p(\mathbf{x}_i\vert q_k)\; P(q_k) \geq p(\mathbf{x}_i\vert q_l)\; P(q_l), \quad \forall l \neq k$
$\displaystyle \phantom{\mathbf{x}_i \in q_k} \quad\Leftrightarrow\quad \log p(\mathbf{x}_i\vert q_k) + \log P(q_k) \geq \log p(\mathbf{x}_i\vert q_l) + \log P(q_l), \quad \forall l \neq k$


Since in our case the samples $\mathbf{x}$ are two-dimensional vectors, the discriminant surfaces are one-dimensional, i.e., lines along which the discriminant functions of two distinct classes take equal values.

Experiment:

Figure 1: Iso-likelihood lines for the Gaussian pdfs ${\cal N}(\boldsymbol{\mu}_{\text{/i/}},\boldsymbol{\Sigma}_{\text{/i/}})$ and ${\cal N}(\boldsymbol{\mu}_{\text{/e/}},\boldsymbol{\Sigma}_{\text{/e/}})$ (top), and ${\cal N}(\boldsymbol{\mu}_{\text{/i/}},\boldsymbol{\Sigma}_{\text{/e/}})$ and ${\cal N}(\boldsymbol{\mu}_{\text{/e/}},\boldsymbol{\Sigma}_{\text{/e/}})$ (bottom).
The iso-likelihood lines for the Gaussian pdfs ${\cal N}(\boldsymbol{\mu}_{\text{/i/}},\boldsymbol{\Sigma}_{\text{/i/}})$ and ${\cal N}(\boldsymbol{\mu}_{\text{/e/}},\boldsymbol{\Sigma}_{\text{/e/}})$, which we used before to model the classes /i/ and /e/, are plotted in the first graph of figure 1. The second graph of figure 1 shows the iso-likelihood lines for ${\cal N}(\boldsymbol{\mu}_{\text{/i/}},\boldsymbol{\Sigma}_{\text{/e/}})$ and ${\cal N}(\boldsymbol{\mu}_{\text{/e/}},\boldsymbol{\Sigma}_{\text{/e/}})$ (two pdfs sharing the same covariance matrix $\boldsymbol{\Sigma}_{\text{/e/}}$).

On these figures, use a colored pen to join the intersections of the level lines that correspond to equal likelihoods. Assume that the highest iso-likelihood lines (smallest ellipses) are of the same height. (You can also use isosurf in MATLAB to create a color plot.)
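If you prefer to check your drawing numerically, the discriminant line between /i/ and /e/ can also be traced with standard MATLAB functions. This is only a sketch: it assumes mu_i, sigma_i, mu_e, sigma_e and the priors Pi, Pe have been computed as in the examples above, and the grid ranges are rough guesses for plausible formant values:

» [F1, F2] = meshgrid(200:10:1000, 500:25:3000); % rough formant grid (guessed ranges)

» d = zeros(size(F1));

» for n = 1:numel(F1)
»     d(n) = gloglike([F1(n) F2(n)], mu_i, sigma_i) + log(Pi) ...
»          - gloglike([F1(n) F2(n)], mu_e, sigma_e) - log(Pe);
» end

» hold on; contour(F1, F2, d, [0 0], 'k'); % the zero level of d is the discriminant line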

Question:

What is the nature of the surface that separates class /i/ from class /e/ when the two models have different covariance matrices? Can you explain why it takes this form?

What is the nature of the surface that separates class /i/ from class /e/ when the two models have the same covariance matrix? Why is it different from the previous discriminant surface?

Show that in the case of two Gaussian pdfs with equal covariance matrices, the separation between class 1 and class 2 no longer depends on the covariance $\boldsymbol{\Sigma}$.

In summary, we have seen that Bayesian classifiers with Gaussian data models separate the classes with combinations of quadratic (second-order) surfaces. If the covariance matrices of the models are equal, these quadratic separation surfaces reduce to simple hyperplanes.
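A minimal sketch of why the equal-covariance case yields a hyperplane: assume two classes $q_1$ and $q_2$ with Gaussian likelihoods sharing the covariance matrix $\boldsymbol{\Sigma}$, and compare the discriminant functions $f_k(\mathbf{x}) = \log p(\mathbf{x}\vert q_k) + \log P(q_k)$ (the common normalization terms $-\tfrac{d}{2}\log 2\pi - \tfrac{1}{2}\log\vert\boldsymbol{\Sigma}\vert$ cancel):

$\displaystyle f_1(\mathbf{x}) - f_2(\mathbf{x}) = -\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_1)^{\mathsf T}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}_1) + \tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_2)^{\mathsf T}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}_2) + \log\frac{P(q_1)}{P(q_2)}$
$\displaystyle \phantom{f_1(\mathbf{x}) - f_2(\mathbf{x})} = (\boldsymbol{\mu}_1-\boldsymbol{\mu}_2)^{\mathsf T}\boldsymbol{\Sigma}^{-1}\mathbf{x} - \tfrac{1}{2}\left(\boldsymbol{\mu}_1^{\mathsf T}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2^{\mathsf T}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_2\right) + \log\frac{P(q_1)}{P(q_2)}$

The quadratic term $\mathbf{x}^{\mathsf T}\boldsymbol{\Sigma}^{-1}\mathbf{x}$ cancels, so the decision boundary $f_1(\mathbf{x}) = f_2(\mathbf{x})$ is linear in $\mathbf{x}$: a hyperplane, i.e., a straight line in our two-dimensional formant space.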