
Unsupervised training

In the previous section, we computed the models for the classes /a/, /e/, /i/, /o/, and /y/ knowing a priori which training sample belongs to which class (a labeling of the training data was at our disposal). Hence, we performed a supervised training of the Gaussian models. Now suppose that we only have unlabeled training data which we want to separate into several classes (e.g., 5 classes) without knowing a priori which point belongs to which class. This is called unsupervised training. Several algorithms are available for this purpose, among them the $ K$-means, the EM (Expectation-Maximization), and the Viterbi-EM algorithms.

All these algorithms are characterized by the following components:

  • a set of models for the classes $ q_k$ (not necessarily Gaussian), defined by a parameter set $ \ensuremath\boldsymbol{\Theta}$ (means, variances, priors,...);
  • a measure of membership, indicating to what extent a data point ``belongs'' to a model;
  • a ``recipe'' to update the model parameters as a function of the membership information.
The measure of membership usually takes the form of a distance measure or of a probability measure. It replaces the missing labeling information and permits the application of standard parameter estimation techniques. It also implicitly defines a global criterion of ``goodness of fit'' of the models to the data, e.g.:
  • in the case of a distance measure, the models that are closer to the data characterize it better;
  • in the case of a probability measure, the models with a larger likelihood for the data explain it better.

$ K$-means algorithm

Synopsis of the algorithm:

  • Start with $ K$ initial prototypes $ \ensuremath\boldsymbol{\mu}_k$, $ k=1,\ldots,K$.
  • Do:
    1. For each data-point $ \ensuremath\mathbf{x}_n$, $ n=1,\ldots,N$, compute the squared Euclidean distance from the $ k^{\text{th}}$ prototype:

      $\displaystyle d_k(\mathbf{x}_n) \;=\; \left\Vert \mathbf{x}_n - \boldsymbol{\mu}_k \right\Vert^2 \;=\; (\mathbf{x}_n-\boldsymbol{\mu}_k)^{\mathsf T}(\mathbf{x}_n-\boldsymbol{\mu}_k)$


    2. Assign each data-point $ \ensuremath\mathbf{x}_n$ to its closest prototype $ \ensuremath\boldsymbol{\mu}_k$, i.e., assign $ \ensuremath\mathbf{x}_n$ to the class $ q_k$ if
      $\displaystyle d_k(\ensuremath\mathbf{x}_n) \, < \, d_l(\ensuremath\mathbf{x}_n), \qquad \forall l \neq k $
      Note: using the squared Euclidean distance for the classification gives the same result as using the true Euclidean distance, since the square root is a monotonically increasing function; dropping the square root simply reduces the computational load.
    3. Replace each prototype with the mean of the data-points assigned to the corresponding class;
    4. Go to 1.
  • Until: no further change occurs.

The global criterion for the present case is

$\displaystyle J = \sum_{k=1}^{K} \sum_{\ensuremath\mathbf{x}_n \in q_k} d_k(\ensuremath\mathbf{x}_n) $
and represents the total squared distance between the data and the models they belong to. This criterion is locally minimized by the algorithm.
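
To make the recipe concrete, here is a minimal MATLAB sketch of the above loop (this is not the KMEANS explorer used below; the function name kmeans_sketch is hypothetical, data is assumed to be an N-by-D matrix with one observation per row, and the prototypes are simply initialized with K randomly chosen data points):

   function [mu, labels, J] = kmeans_sketch(data, K)
   % Minimal K-means sketch: data is N-by-D (one observation per row).
   [N, D] = size(data);
   idx = randperm(N);
   mu = data(idx(1:K), :);                      % K random points as initial prototypes
   labels = zeros(N, 1);
   while true
     % steps 1+2: squared Euclidean distances and nearest-prototype assignment
     d = zeros(N, K);
     for k = 1:K
       xc = data - repmat(mu(k,:), N, 1);
       d(:,k) = sum(xc.^2, 2);
     end
     [dmin, new_labels] = min(d, [], 2);
     if isequal(new_labels, labels), break; end % until: no further change
     labels = new_labels;
     % step 3: replace each prototype by the mean of its assigned points
     for k = 1:K
       if any(labels == k)
         mu(k,:) = mean(data(labels == k, :), 1);
       end
     end
   end
   J = sum(dmin);                               % total squared distance (criterion J)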

Experiment:

Use the $ K$-means explorer utility:
 KMEANS K-means algorithm exploration tool

   Launch it with KMEANS(DATA,NCLUST) where DATA is the matrix
   of observations (one observation per row) and NCLUST is the
   desired number of clusters.

   The clusters are initialized with a heuristic that spreads
   them randomly around mean(DATA) with standard deviation
   sqrtm(cov(DATA)).

   If you want to set your own initial clusters, use
   KMEANS(DATA,MEANS) where MEANS is a cell array containing
   NCLUST initial mean vectors.

   Example: for two clusters
     means{1} = [1 2]; means{2} = [3 4];
     kmeans(data,means);

Launch kmeans with the data sample allvow, which was part of the file vowels.mat and gathers all the simulated vowel data. Do several runs with different initializations of the algorithm:

  1. 5 initial clusters determined according to the default heuristic;
  2. initial MEANS values equal to some data points (see the sketch after this list);
  3. initial MEANS values equal to $ \{\boldsymbol{\mu}_{\text{/a/}}, \boldsymbol{\mu}_{\text{/e/}}, \boldsymbol{\mu}_{\text{/i/}}, \boldsymbol{\mu}_{\text{/o/}}, \boldsymbol{\mu}_{\text{/y/}}\}$.
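
For initialization case 2, the MEANS cell array can be built directly from randomly chosen observations, e.g. (a sketch; randperm and the row-vector orientation of the observations are assumptions based on the KMEANS help above):

» idx = randperm(size(allvow,1));

» means = cell(1,5);

» for k = 1:5, means{k} = allvow(idx(k),:); end

» kmeans(allvow,means);
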
Iterate the algorithm until it converges. Observe the evolution of the cluster centers, of the data-point attribution chart, and of the total squared Euclidean distance. (It is possible to zoom these plots: left-click inside the axes to zoom $ 2\times$ centered on the point under the mouse; right-click to zoom out; click and drag to zoom into an area; double-click to reset the figure to its original state.) Observe the mean values found after convergence of the algorithm.

Example:

» kmeans(allvow,5);

- or -

» means = { mu_a, mu_e, mu_i, mu_o, mu_y };

» kmeans(allvow,means);

Enlarge the window, then push the buttons, zoom, etc. After convergence, use:

» for k=1:5, disp(kmeans_result_means{k}); end

to see the resulting means.

Think about the following questions:

  1. Does the final solution depend on the initialization of the algorithm?
  2. Describe the evolution of the total squared Euclidean distance.
  3. What is the nature of the discriminant surfaces corresponding to a minimum Euclidean distance classification scheme?
  4. Is the algorithm suitable for fitting Gaussian clusters?

Viterbi-EM algorithm for Gaussian clustering

Synopsis of the algorithm:

  • Start from $ K$ initial Gaussian models $ {\cal N}(\ensuremath\boldsymbol{\mu}_{k},\ensuremath\boldsymbol{\Sigma}_{k}), \; k=1\ldots K$, characterized by the set of parameters $ \ensuremath\boldsymbol{\Theta}$ (i.e., the set of all means and variances $ \ensuremath\boldsymbol{\mu}_k$ and $ \ensuremath\boldsymbol{\Sigma}_k$, $ k=1\ldots K$). Set the initial prior probabilities $ P(q_k)$ to $ 1/K$.
  • Do:
    1. Classify each data-point using Bayes' rule.

      This step is equivalent to having a set $ Q$ of boolean hidden variables that give a labeling of the data by taking the value 1 (belongs) or 0 (does not belong) for each class $ q_k$ and each point $ \ensuremath\mathbf{x}_n$. The value of $ Q$ that maximizes $ p(X,Q\vert\ensuremath\boldsymbol{\Theta})$ precisely tells which is the most probable model for each point of the whole set $ X$ of training data.

      Hence, each data point $ \ensuremath\mathbf{x}_n$ is assigned to its most probable cluster $ q_k$.

    2. Update the parameters ($ i$ is the iteration index):
      • update the means:
        $\displaystyle \ensuremath\boldsymbol{\mu}_{k}^{(i+1)} =$   mean of the points belonging to $\displaystyle q_k^{(i)} $
      • update the variances:
        $\displaystyle \ensuremath\boldsymbol{\Sigma}_{k}^{(i+1)} =$   variance of the points belonging to $\displaystyle q_k^{(i)} $
      • update the priors:
        $\displaystyle P(q_k^{(i+1)}\vert\boldsymbol{\Theta}^{(i+1)}) = \frac{\mbox{number of training points belonging to } q_k^{(i)}}{\mbox{total number of training points}} $
    3. Go to 1.
  • Until: no further change occurs.
The global criterion in the present case is

$\displaystyle {\cal L}(\boldsymbol{\Theta}) \;=\; \log p(X,Q\vert\boldsymbol{\Theta}) \;=\; \sum_{k=1}^{K} \sum_{\mathbf{x}_n \in q_k} \log p(\mathbf{x}_n\vert\boldsymbol{\Theta}_k),$


and represents the joint likelihood of the data with respect to the models they belong to. This criterion is locally maximized by the algorithm.
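
As an illustration, one iteration of this procedure could be written as the following MATLAB sketch (illustrative variable names, not part of the VITERB tool below: data is N-by-D, mu{k} a 1-by-D mean, Sigma{k} a D-by-D covariance, prior(k) a scalar; empty clusters and other numerical safeguards are ignored):

   % one Viterbi-EM iteration (sketch)
   [N, D] = size(data);
   K = length(mu);
   loglik = zeros(N, K);                % log( P(q_k) * p(x_n | mu_k, Sigma_k) )
   for k = 1:K
     xc = data - repmat(mu{k}, N, 1);
     loglik(:,k) = log(prior(k)) - 0.5*( D*log(2*pi) + log(det(Sigma{k})) ...
                                         + sum((xc / Sigma{k}) .* xc, 2) );
   end
   % step 1: hard Bayesian classification (most probable cluster per point)
   [lmax, labels] = max(loglik, [], 2);
   % step 2: re-estimate the parameters from the points assigned to each cluster
   for k = 1:K
     pts = data(labels == k, :);
     mu{k}    = mean(pts, 1);
     Sigma{k} = cov(pts);
     prior(k) = size(pts, 1) / N;
   end
   L = sum(lmax);                       % joint log-likelihood of data and labels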

Experiment:

Use the Viterbi-EM explorer utility:
 VITERB Viterbi version of the EM algorithm

   Launch it with VITERB(DATA,NCLUST) where DATA is the matrix
   of observations (one observation per row) and NCLUST is the
   desired number of clusters.

   The clusters are initialized with a heuristic that spreads
   them randomly around mean(DATA) with standard deviation
   sqrtm(cov(DATA)). Their initial covariance is set to cov(DATA).

   If you want to set your own initial clusters, use
   VITERB(DATA,MEANS,VARS) where MEANS and VARS are cell arrays
   containing respectively NCLUST initial mean vectors and NCLUST
   initial covariance matrices. In this case, the initial a-priori
   probabilities are set equal to 1/NCLUST.

   To set your own initial priors, use VITERB(DATA,MEANS,VARS,PRIORS)
   where PRIORS is a vector containing NCLUST a priori probabilities.

   Example: for two clusters
     means{1} = [1 2]; means{2} = [3 4];
     vars{1} = [2 0;0 2]; vars{2} = [1 0;0 1];
     viterb(data,means,vars);
Launch viterb with the dataset allvow. Do several runs with different initializations of the algorithm:
  1. 5 initial clusters determined according to the default heuristic;
  2. initial MEANS values equal to some data points, and some random VARS values (try for instance cov(allvow) for all the classes);
  3. initial MEANS, VARS and PRIORS values derived from the result of the $ K$-means algorithm (see the sketch after this list);
  4. initial MEANS values equal to $ \{\boldsymbol{\mu}_{\text{/a/}}, \boldsymbol{\mu}_{\text{/e/}}, \boldsymbol{\mu}_{\text{/i/}}, \boldsymbol{\mu}_{\text{/o/}}, \boldsymbol{\mu}_{\text{/y/}}\}$, VARS values equal to $ \{\boldsymbol{\Sigma}_{\text{/a/}}, \boldsymbol{\Sigma}_{\text{/e/}}, \boldsymbol{\Sigma}_{\text{/i/}}, \boldsymbol{\Sigma}_{\text{/o/}}, \boldsymbol{\Sigma}_{\text{/y/}}\}$, and PRIORS values equal to $ [P_{\text{/a/}},P_{\text{/e/}},P_{\text{/i/}},P_{\text{/o/}},P_{\text{/y/}}]$;
  5. initial MEANS and VARS values chosen by yourself.
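
For initialization case 3, the VARS and PRIORS values are not returned by the K-means tool directly; one way to derive them from kmeans_result_means is sketched below (the nearest-prototype labelling and the row-vector orientation of the means are assumptions):

   % derive VARS and PRIORS from the K-means prototypes (initialization case 3)
   N = size(allvow,1);  K = 5;
   d = zeros(N,K);
   for k = 1:K
     xc = allvow - repmat(kmeans_result_means{k}(:)', N, 1);
     d(:,k) = sum(xc.^2, 2);              % squared Euclidean distances
   end
   [dmin, labels] = min(d, [], 2);        % nearest-prototype assignment
   means = cell(1,K);  vars = cell(1,K);  priors = zeros(1,K);
   for k = 1:K
     means{k}  = kmeans_result_means{k};
     vars{k}   = cov(allvow(labels==k, :));
     priors(k) = sum(labels==k) / N;
   end
   viterb(allvow, means, vars, priors);
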
Iterate the algorithm until it converges. Observe the evolution of the clusters, of the data-point attribution chart, and of the total likelihood curve. Observe the mean, variance, and prior values found after convergence of the algorithm. Compare them with the values computed in section 2.2 (using supervised training).

Example:

» viterb(allvow,5);

- or -

» means = { mu_a, mu_e, mu_i, mu_o, mu_y };

» vars = { sigma_a, sigma_e, sigma_i, sigma_o, sigma_y };

» viterb(allvow,means,vars);

Enlarge the window, then push the buttons, zoom, etc. After convergence, use:

» for k=1:5, disp(viterb_result_means{k}); end

» for k=1:5, disp(viterb_result_vars{k}); end

» for k=1:5, disp(viterb_result_priors(k)); end

to see the resulting means, variances and priors.

Questions:

  1. Does the final solution depend on the initialization of the algorithm?
  2. Describe the evolution of the total likelihood. Is it monotonic?
  3. In terms of optimization of the likelihood, what does the final solution correspond to?
  4. What is the nature of the discriminant surfaces corresponding to the Gaussian classification?
  5. Is the algorithm suitable for fitting Gaussian clusters?

EM algorithm for Gaussian clustering

Synopsis of the algorithm:

  • Start from $ K$ initial Gaussian models $ {\cal N}(\boldsymbol{\mu}_{k},\boldsymbol{\Sigma}_{k}), \; k=1\ldots K$, with equal priors set to $ P(q_k) = 1/K$.
  • Do:
    1. Estimation step: compute the probability $ P(q_k^{(i)}\vert\ensuremath\mathbf{x}_n,\ensuremath\boldsymbol{\Theta}^{(i)})$ for each data point $ \ensuremath\mathbf{x}_n$ to belong to the class $ q_k^{(i)}$:

      $\displaystyle P(q_k^{(i)}\vert\mathbf{x}_n,\boldsymbol{\Theta}^{(i)}) \;=\; \frac{P(q_k^{(i)}\vert\boldsymbol{\Theta}^{(i)}) \cdot p(\mathbf{x}_n\vert q_k^{(i)},\boldsymbol{\Theta}^{(i)})}{p(\mathbf{x}_n\vert\boldsymbol{\Theta}^{(i)})} \;=\; \frac{P(q_k^{(i)}\vert\boldsymbol{\Theta}^{(i)}) \cdot p(\mathbf{x}_n\vert\boldsymbol{\mu}_k^{(i)},\boldsymbol{\Sigma}_k^{(i)})}{\sum_{j=1}^{K} P(q_j^{(i)}\vert\boldsymbol{\Theta}^{(i)}) \cdot p(\mathbf{x}_n\vert\boldsymbol{\mu}_j^{(i)},\boldsymbol{\Sigma}_j^{(i)})}$


      This step is equivalent to having a set $ Q$ of continuous hidden variables, taking values in the interval $ [0,1]$, which give a labeling of the data by indicating to what extent a point $ \mathbf{x}_n$ belongs to the class $ q_k$. This represents a soft classification, since a point can belong, e.g., 60% to class 1 and 40% to class 2 (think of Schrödinger's cat, which is 60% alive and 40% dead as long as nobody opens the box or performs a Bayesian classification).

    2. Maximization step:
      • update the means:
        $\displaystyle \boldsymbol{\mu}_{k}^{(i+1)} = \frac{\sum_{n=1}^{N} \mathbf{x}_n \, P(q_k^{(i)}\vert\mathbf{x}_n,\boldsymbol{\Theta}^{(i)})}{\sum_{n=1}^{N} P(q_k^{(i)}\vert\mathbf{x}_n,\boldsymbol{\Theta}^{(i)})} $
      • update the variances:
        $\displaystyle \boldsymbol{\Sigma}_{k}^{(i+1)} = \frac{\sum_{n=1}^{N} (\mathbf{x}_n-\boldsymbol{\mu}_k^{(i+1)})(\mathbf{x}_n-\boldsymbol{\mu}_k^{(i+1)})^{\mathsf T} \, P(q_k^{(i)}\vert\mathbf{x}_n,\boldsymbol{\Theta}^{(i)})}{\sum_{n=1}^{N} P(q_k^{(i)}\vert\mathbf{x}_n,\boldsymbol{\Theta}^{(i)})} $
      • update the priors:
        $\displaystyle P(q_k^{(i+1)}\vert\boldsymbol{\Theta}^{(i+1)}) = \frac{1}{N} \sum_{n=1}^{N} P(q_k^{(i)}\vert\mathbf{x}_n,\boldsymbol{\Theta}^{(i)}) $
      In the present case, all the data points participate in the update of all the models, but their contribution is weighted by the value of $ P(q_k^{(i)}\vert\mathbf{x}_n,\boldsymbol{\Theta}^{(i)})$.
    3. Go to 1.
  • Until: the increase of the total likelihood of the training data falls below some desired threshold.
The global criterion in the present case is the joint likelihood of all data with respect to all the models:
$\displaystyle {\cal L}(\boldsymbol{\Theta}) \;=\; \log p(X\vert\boldsymbol{\Theta}) \;=\; \log \sum_Q p(X,Q\vert\boldsymbol{\Theta}) \;=\; \log \sum_Q P(Q\vert X,\boldsymbol{\Theta})\, p(X\vert\boldsymbol{\Theta}) \qquad \mbox{(Bayes)}$
$\displaystyle \hphantom{{\cal L}(\boldsymbol{\Theta})} \;=\; \log \sum_{k=1}^{K} P(q_k\vert X,\boldsymbol{\Theta})\, p(X\vert\boldsymbol{\Theta})$


Applying Jensen's inequality $ \left( \log \sum_j \lambda_j y_j \geq \sum_j \lambda_j \log y_j \mbox{ if } \sum_j \lambda_j = 1 \right)$, we obtain:
$\displaystyle {\cal L}(\boldsymbol{\Theta}) \;\ge\; J(\boldsymbol{\Theta}) \;=\; \sum_{k=1}^{K} P(q_k\vert X,\boldsymbol{\Theta}) \log p(X\vert\boldsymbol{\Theta}) \;=\; \sum_{k=1}^{K} \sum_{n=1}^{N} P(q_k\vert\mathbf{x}_n,\boldsymbol{\Theta}) \log p(\mathbf{x}_n\vert\boldsymbol{\Theta})$


Hence, the criterion $ J(\boldsymbol{\Theta})$ is a lower bound for $ {\cal L}(\boldsymbol{\Theta})$. This criterion is locally maximized by the algorithm.
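
For comparison with the Viterbi-EM sketch above, one EM iteration could look as follows (again a sketch with illustrative names: data is N-by-D, mu{k} 1-by-D, Sigma{k} D-by-D, prior(k) scalar; underflow protection such as the log-sum-exp trick is omitted):

   % one EM iteration (sketch)
   [N, D] = size(data);
   K = length(mu);
   logp = zeros(N, K);                  % log( P(q_k) * p(x_n | mu_k, Sigma_k) )
   for k = 1:K
     xc = data - repmat(mu{k}, N, 1);
     logp(:,k) = log(prior(k)) - 0.5*( D*log(2*pi) + log(det(Sigma{k})) ...
                                       + sum((xc / Sigma{k}) .* xc, 2) );
   end
   % E-step: posteriors P(q_k | x_n, Theta), normalized over the clusters
   p = exp(logp);
   gamma = p ./ repmat(sum(p,2), 1, K);
   % M-step: weighted re-estimation of means, covariances and priors
   for k = 1:K
     w  = gamma(:,k);
     Nk = sum(w);
     mu{k}    = (w' * data) / Nk;
     xc       = data - repmat(mu{k}, N, 1);
     Sigma{k} = (xc' * (xc .* repmat(w,1,D))) / Nk;
     prior(k) = Nk / N;
   end
   L = sum(log(sum(p,2)));              % total log-likelihood of the data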

Experiment:

Use the EM explorer utility:
 EMALGO EM algorithm explorer

   Launch it with EMALGO(DATA,NCLUST) where DATA is the matrix
   of observations (one observation per row) and NCLUST is the
   desired number of clusters.

   The clusters are initialized with a heuristic that spreads
   them randomly around mean(DATA) with standard deviation
   sqrtm(cov(DATA)*10). Their initial covariance is set to cov(DATA).

   If you want to set your own initial clusters, use
   EMALGO(DATA,MEANS,VARS) where MEANS and VARS are cell arrays
   containing respectively NCLUST initial mean vectors and NCLUST
   initial covariance matrices. In this case, the initial a-priori
   probabilities are set equal to 1/NCLUST.

   To set your own initial priors, use EMALGO(DATA,MEANS,VARS,PRIORS)
   where PRIORS is a vector containing NCLUST a priori probabilities.

   Example: for two clusters
     means{1} = [1 2]; means{2} = [3 4];
     vars{1} = [2 0;0 2]; vars{2} = [1 0;0 1];
     emalgo(data,means,vars);
Launch emalgo again with the same dataset allvow. Do several runs with different initializations of the algorithm:
  1. 5 clusters determined according to the default heuristic;
  2. initial MEANS values equal to some data points, and some random VARS values (e.g., cov(allvow) for all the classes);
  3. the initial MEANS and VARS values found by the $ K$-means algorithm.
  4. initial MEANS values equal to $ \{\boldsymbol{\mu}_{\text{/a/}}, \boldsymbol{\mu}_{\text{/e/}}, \boldsymbol{\mu}_{\text{/i/}}, \boldsymbol{\mu}_{\text{/o/}}, \boldsymbol{\mu}_{\text{/y/}}\}$, VARS values equal to $ \{\boldsymbol{\Sigma}_{\text{/a/}}, \boldsymbol{\Sigma}_{\text{/e/}}, \boldsymbol{\Sigma}_{\text{/i/}}, \boldsymbol{\Sigma}_{\text{/o/}}, \boldsymbol{\Sigma}_{\text{/y/}}\}$, and PRIORS values equal to $ [P_{\text{/a/}},P_{\text{/e/}},P_{\text{/i/}},P_{\text{/o/}},P_{\text{/y/}}]$;
  5. initial MEANS and VARS values chosen by yourself.
(If you have time, also increase the number of clusters and play again with the algorithm.)

Iterate the algorithm until the total likelihood converges asymptotically. Observe the evolution of the clusters and of the total likelihood curve. (In the EM case, the data-point attribution chart is not given because each data point participates in the update of every cluster.) Observe the mean, variance, and prior values found after convergence of the algorithm. Compare them with the values found in section 2.2.

Example:

» emalgo(allvow,5);

- or -

» means = { mu_a, mu_e, mu_i, mu_o, mu_y };

» vars = { sigma_a, sigma_e, sigma_i, sigma_o, sigma_y };

» emalgo(allvow,means,vars);

Enlarge the window, then push the buttons, zoom etc. After convergence, use:

» for k=1:5; disp(emalgo_result_means{k}); end

» for k=1:5; disp(emalgo_result_vars{k}); end

» for k=1:5; disp(emalgo_result_priors(k)); end

to see the resulting means, variances and priors.

Questions:

  1. Does the final solution depend on the initialization of the algorithm?
  2. Describe the evolution of the total likelihood. Is it monotonic?
  3. In terms of optimization of the likelihood, what does the final solution correspond to?
  4. Is the algorithm suitable for fitting Gaussian clusters?

Questions on the $ K$-means, Viterbi-EM, and EM algorithms:

Mark the correct statements:

$ \Box$
With the $ K$-means algorithm, the solution is independent of the initialization.
$ \Box$
With the $ K$-means algorithm the discriminant surfaces are linear.
$ \Box$
The $ K$-means algorithm is well suited for fitting Gaussian clusters.
$ \Box$
In the $ K$-means algorithm, the global criterion that is minimized is the maximum likelihood.
$ \Box$
In all three algorithms, the measure used as the global criterion decreases monotonically.
$ \Box$
In the Viterbi-EM and EM algorithms, the solution is highly dependent on the initialization.
$ \Box$
The EM algorithm is best suited for fitting Gaussian clusters.
$ \Box$
It is an easy task to guess the parameters for initialization.
$ \Box$
With the Viterbi-EM algorithm the discriminant surfaces are linear.
$ \Box$
With the EM algorithm, the discriminant surfaces have the form of (hyper)parabolas.
$ \Box$
The EM algorithm needs less computational effort than the Viterbi-EM algorithm.
$ \Box$
In the EM algorithm and the Viterbi-EM algorithm, the same global criterion is used.
$ \Box$
The EM algorithm finds a global optimum.