Gaussian statistics



Samples from a Gaussian density



Useful formulas and definitions:

  • The Gaussian probability density function (pdf) for the $d$-dimensional random variable $\mathbf{x}\sim {\cal N}(\boldsymbol{\mu},\boldsymbol{\Sigma})$ (i.e., the variable $\mathbf{x}\in \mathbb{R}^d$ follows the Gaussian, or Normal, probability law) is given by:
    $\displaystyle g_{(\boldsymbol{\mu},\boldsymbol{\Sigma})}(\mathbf{x}) = \frac{1}{\sqrt{2\pi}^{\,d}\,\sqrt{\det\left(\boldsymbol{\Sigma}\right)}} \; e^{-\frac{1}{2}\,(\mathbf{x}-\boldsymbol{\mu})^{\mathsf T}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})}$ (1)


    where $\boldsymbol{\mu}$ is the mean vector and $\boldsymbol{\Sigma}$ is the covariance matrix. $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ are the parameters of the Gaussian distribution. (A small MATLAB sketch of this formula is given right after this list.)
  • The mean vector $\boldsymbol{\mu}$ contains the mean values of each dimension, $\mu_i = E(x_i)$, with $E(x)$ being the expected value of $x$.
  • All of the variances $c_{ii}$ and covariances $c_{ij}$ are collected together into the covariance matrix $\boldsymbol{\Sigma}$ of dimension $d\times d$:
    $\displaystyle \boldsymbol{\Sigma}= \left[ \begin{array}{cccc} c_{11} & c_{12} & \cdots & c_{1d} \\ c_{21} & c_{22} & \cdots & c_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ c_{d1} & c_{d2} & \cdots & c_{dd} \end{array} \right]$


    The covariance $c_{ij}$ of two components $x_i$ and $x_j$ of $\mathbf{x}$ measures their tendency to vary together, i.e., to co-vary,

    $\displaystyle c_{ij} = E\left((x_i-\mu_i)\,(x_j-\mu_j)\right).$
    If two components $x_i$ and $x_j$, $i\ne j$, have zero covariance $c_{ij} = 0$, they are orthogonal in the statistical sense, which carries over to a geometric sense (the expectation acts as a scalar product of random variables, and a null scalar product means orthogonality). If all components of $\mathbf{x}$ are mutually orthogonal, the covariance matrix is diagonal.
  • $\sqrt{\boldsymbol{\Sigma}}$ defines the standard deviation of the random variable $\mathbf{x}$. Beware: this square root is meant in the matrix sense (cf. MATLAB's sqrtm).
  • If $\mathbf{x}\sim {\cal N}(\mathbf{0},\mathbf{I})$ ($\mathbf{x}$ follows a normal law with zero mean and unit variance; $\mathbf{I}$ denotes the identity matrix), and if $\mathbf{y} = \boldsymbol{\mu}+ \sqrt{\boldsymbol{\Sigma}}\,\mathbf{x}$, then $\mathbf{y} \sim {\cal N}(\boldsymbol{\mu},\boldsymbol{\Sigma})$.
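
A minimal MATLAB sketch of eq. (1), not part of the course toolbox (the helper name gauss_pdf and the test point are our own choices; x and mu are row vectors here):

» gauss_pdf = @(x,mu,sigma) exp(-0.5*(x-mu)*inv(sigma)*(x-mu)') / (sqrt(2*pi)^length(mu) * sqrt(det(sigma)));

» % density of the point [750 1100] under N([730 1090], diag(8000, 8000))

» gauss_pdf([750 1100], [730 1090], [8000 0; 0 8000])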

Experiment:

Generate samples $ X$ of $ N$ points, $ X=\{\ensuremath\mathbf{x}_1, \ensuremath\mathbf{x}_2,\ldots,\ensuremath\mathbf{x}_N\}$, with $ N=10000$, coming from a 2-dimensional Gaussian process that has mean
$\displaystyle \ensuremath\boldsymbol{\mu}= \left[ \begin{array}{c} 730 \\ 1090 \end{array} \right] $
and variance
  • 8000 for both dimensions (spherical process) (sample $X_1$):
    $\displaystyle \boldsymbol{\Sigma}_1 = \left[ \begin{array}{cc} 8000 & 0 \\ 0 & 8000 \end{array} \right] $
  • expressed as a diagonal covariance matrix (sample $X_2$):
    $\displaystyle \boldsymbol{\Sigma}_2 = \left[ \begin{array}{cc} 8000 & 0 \\ 0 & 18500 \end{array} \right] $
  • expressed as a full covariance matrix (sample $X_3$):
    $\displaystyle \boldsymbol{\Sigma}_3 = \left[ \begin{array}{cc} 8000 & 8400 \\ 8400 & 18500 \end{array} \right] $
Use the function gausview (» help gausview) to plot the results as clouds of points in the 2-dimensional plane, and to view the corresponding 2-dimensional probability density functions (pdfs) in 2D and 3D.

Example:

» N = 10000;

» mu = [730 1090]; sigma_1 = [8000 0; 0 8000];

» % draw N standard-normal points and map them to N(mu, sigma_1) using the matrix square root

» X1 = randn(N,2) * sqrtm(sigma_1) + repmat(mu,N,1);

» gausview(X1,mu,sigma_1,'Sample X1');

Repeat for the two other covariance matrices $\boldsymbol{\Sigma}_2$ and $\boldsymbol{\Sigma}_3$ (a short sketch is given below). Use the radio buttons to switch the plots on/off, and the ``view'' buttons to switch between 2D and 3D. Use the mouse to rotate the plot (rotation must be enabled in the Tools menu: Rotate 3D, or by the $\circlearrowleft$ button).
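
The same pattern generates the other two samples; a short sketch reusing N and mu from above ($X_2$ and $X_3$ are also needed in the later experiments):

» sigma_2 = [8000 0; 0 18500]; sigma_3 = [8000 8400; 8400 18500];

» X2 = randn(N,2) * sqrtm(sigma_2) + repmat(mu,N,1);

» X3 = randn(N,2) * sqrtm(sigma_3) + repmat(mu,N,1);

» gausview(X2,mu,sigma_2,'Sample X2');

» gausview(X3,mu,sigma_3,'Sample X3');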

Questions:

By simple inspection of the 2D views of the data and of the corresponding pdf contours, how can you tell which sample comes from a spherical process (such as sample $X_1$), which from a process with a diagonal covariance matrix (such as $X_2$), and which from a process with a full covariance matrix (such as $X_3$)?

Find the right statements:

$ \Box$
In process 1 the first and the second component of the vectors $ \ensuremath\mathbf{x}_i$ are independent.
$ \Box$
In process 2 the first and the second component of the vectors $ \ensuremath\mathbf{x}_i$ are independent.
$ \Box$
In process 3 the first and the second component of the vectors $ \ensuremath\mathbf{x}_i$ are independent.
$ \Box$
If the first and second components of the vectors $\mathbf{x}_i$ are independent, the cloud of points and the pdf contour have the shape of a circle.
$ \Box$
If the first and second components of the vectors $\mathbf{x}_i$ are independent, the cloud of points and the pdf contour have to be elliptic, with the principal axes of the ellipse aligned with the abscissa and ordinate axes.
$ \Box$
For the covariance matrix $ \ensuremath\boldsymbol{\Sigma}$ the elements have to satisfy $ c_{ij} = c_{ji}$.
$ \Box$
The covariance matrix has to be positive definite ($\mathbf{x}^{\mathsf T}\boldsymbol{\Sigma}\,\mathbf{x} > 0$ for all $\mathbf{x}\ne\mathbf{0}$). (If yes, what happens if it is not? Try it out in MATLAB; a possible sketch is given below.)
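
One possible way to try this out in MATLAB (sigma_bad is our own example of a symmetric matrix that is not positive definite; it is not part of the exercise data):

» sigma_bad = [8000 10000; 10000 8000];   % symmetric, but not positive definite

» eig(sigma_bad)                          % eigenvalues 18000 and -2000: one is negative

» det(sigma_bad)                          % negative, so log(det(sigma_bad)) is not a real number

» sqrtm(sigma_bad)                        % the matrix square root becomes complex

» X_bad = randn(N,2) * sqrtm(sigma_bad) + repmat(mu,N,1);   % complex "samples": not a valid Gaussian process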

Gaussian modeling: Mean and variance of a sample

We will now estimate the parameters $ \ensuremath\boldsymbol{\mu}$ and $ \ensuremath\boldsymbol{\Sigma}$ of the Gaussian models from the data samples.

Useful formulas and definitions:

  • Mean estimator: $ \displaystyle \hat{\ensuremath\boldsymbol{\mu}} = \frac{1}{N} \sum_{i=1}^{N} \ensuremath\mathbf{x}_i$
  • Unbiased covariance estimator: $ \displaystyle \hat{\boldsymbol{\Sigma}} = \frac{1}{N-1} \; \sum_{i=1}^{N} (\mathbf{x}_i-\hat{\boldsymbol{\mu}})^{\mathsf T}(\mathbf{x}_i-\hat{\boldsymbol{\mu}}) $ (here $\mathbf{x}_i$ and $\hat{\boldsymbol{\mu}}$ are taken as row vectors, as in the MATLAB code below)

Experiment:

Take the sample $ X_3$ of 10000 points generated from $ {\cal N}(\ensuremath\boldsymbol{\mu},\ensuremath\boldsymbol{\Sigma}_3)$. Compute an estimate $ \hat{\ensuremath\boldsymbol{\mu}}$ of its mean and an estimate $ \hat{\ensuremath\boldsymbol{\Sigma}}$ of its variance:
  1. with all the available points $ \hat{\ensuremath\boldsymbol{\mu}}_{(10000)} =$ $ \hat{\ensuremath\boldsymbol{\Sigma}}_{(10000)} =$





  2. with only 1000 points $ \hat{\ensuremath\boldsymbol{\mu}}_{(1000)} =$ $ \hat{\ensuremath\boldsymbol{\Sigma}}_{(1000)} =$





  3. with only 100 points $ \hat{\ensuremath\boldsymbol{\mu}}_{(100)} =$ $ \hat{\ensuremath\boldsymbol{\Sigma}}_{(100)} =$





Compare the estimated mean vector $\hat{\boldsymbol{\mu}}$ to the original mean vector $\boldsymbol{\mu}$ by measuring the Euclidean distance between them. Compare the estimated covariance matrix $\hat{\boldsymbol{\Sigma}}$ to the original covariance matrix $\boldsymbol{\Sigma}_3$ by measuring the matrix 2-norm of their difference (the norm $\Vert\mathbf{A}-\mathbf{B}\Vert _2$ measures how much two matrices $\mathbf{A}$ and $\mathbf{B}$ differ; use MATLAB's norm command).

Example:

In the case of 1000 points (case 2.):

» X = X3(1:1000,:);

» N = size(X,1)

» mu_1000 = sum(X)/N

-or-

» mu_1000 = mean(X)

» sigma_1000 = (X - repmat(mu_1000,N,1))' * (X - repmat(mu_1000,N,1)) / (N-1)

-or-

» sigma_1000 = cov(X)

» % Comparison of means and covariances:

» e_mu = sqrt((mu_1000 - mu) * (mu_1000 - mu)')

» % (This is the Euclidean distance between mu_1000 and mu)

» sigma_3 = [8000 8400; 8400 18500];   % Sigma_3 from above, in case it is not yet defined

» e_sigma = norm(sigma_1000 - sigma_3)

» % (This is the 2-norm of the difference between sigma_1000 and sigma_3)
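
The same comparison can be run for all three cases in one loop; a short sketch, assuming mu, X3 and sigma_3 from above are still in the workspace:

» for n = [10000 1000 100]

X = X3(1:n,:);

fprintf('n = %5d:  |mu_hat - mu| = %g   ||sigma_hat - Sigma_3||_2 = %g\n', ...
        n, norm(mean(X) - mu), norm(cov(X) - sigma_3));

end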

Question:

When comparing the estimated values $ \hat{\ensuremath\boldsymbol{\mu}}$ and $ \hat{\ensuremath\boldsymbol{\Sigma}}$ to the original values of $ \ensuremath\boldsymbol{\mu}$ and $ \ensuremath\boldsymbol{\Sigma}_3$ (using the Euclidean distance and the matrix 2-norm), what can you observe?

Find the right statements:

$ \Box$
An accurate mean estimate requires more points than an accurate variance estimate.
$ \Box$
It is very important to have enough training examples to estimate the parameters of the data generation process accurately.



Likelihood of a sample with respect to a Gaussian model

In the following we compute the likelihood of a sample point $ \ensuremath\mathbf{x}$, and the joint likelihood of a series of samples $ X$ for a given model $ \ensuremath\boldsymbol{\Theta}$ with one Gaussian. The likelihood will be used in the formula for classification later on (sec. 2.3).

Useful formulas and definitions:

  • Likelihood: the likelihood of a sample point $ \ensuremath\mathbf{x}_i$ given a data generation model (i.e., given a set of parameters $ \ensuremath\boldsymbol{\Theta}$ for the model pdf) is the value of the pdf $ p(\ensuremath\mathbf{x}_i\vert\ensuremath\boldsymbol{\Theta})$ for that point. In the case of Gaussian models $ \ensuremath\boldsymbol{\Theta}= (\ensuremath\boldsymbol{\mu},\ensuremath\boldsymbol{\Sigma})$, this amounts to the evaluation of equation 1.
  • Joint likelihood: for a set of independent identically distributed (i.i.d.) samples, say $ X=\{\ensuremath\mathbf{x}_1, \ensuremath\mathbf{x}_2,\ldots,\ensuremath\mathbf{x}_N\}$, the joint (or total) likelihood is the product of the likelihoods for each point. For instance, in the Gaussian case:
    $\displaystyle p(X\vert\boldsymbol{\Theta}) = \prod_{i=1}^{N} p(\mathbf{x}_i\vert\boldsymbol{\Theta}) = \prod_{i=1}^{N} g_{(\boldsymbol{\mu},\boldsymbol{\Sigma})}(\mathbf{x}_i)$ (2)


Question:

Why might we want to compute the log-likelihood rather than the likelihood itself?

Computing the log-likelihood turns the product into a sum:

$\displaystyle \log p(X\vert\boldsymbol{\Theta}) = \log \prod_{i=1}^{N} p(\mathbf{x}_i\vert\boldsymbol{\Theta}) = \sum_{i=1}^{N} \log p(\mathbf{x}_i\vert\boldsymbol{\Theta}) $

In the Gaussian case, it also avoids the computation of the exponential:

$\displaystyle p(\mathbf{x}\vert\boldsymbol{\Theta}) = \frac{1}{\sqrt{2\pi}^{\,d}\,\sqrt{\det\left(\boldsymbol{\Sigma}\right)}} \; e^{-\frac{1}{2}\,(\mathbf{x}-\boldsymbol{\mu})^{\mathsf T}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})}$
$\displaystyle \log p(\mathbf{x}\vert\boldsymbol{\Theta}) = \frac{1}{2} \left[ -d \log \left( 2\pi \right) - \log \left( \det\left(\boldsymbol{\Sigma}\right) \right) - (\mathbf{x}-\boldsymbol{\mu})^{\mathsf T}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}) \right]$ (3)


Furthermore, since $\log(x)$ is a monotonically increasing function, the log-likelihoods preserve the ordering of the likelihoods,
$\displaystyle p(x\vert\boldsymbol{\Theta}_1) > p(x\vert\boldsymbol{\Theta}_2) \;\Longleftrightarrow\; \log p(x\vert\boldsymbol{\Theta}_1) > \log p(x\vert\boldsymbol{\Theta}_2), $
so they can be used directly for classification.
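
A quick numerical illustration (our own, using an arbitrary density value of 1e-5) of why the plain product in eq. (2) is impractical for large $N$: the product of many small likelihood values underflows to zero in double precision, whereas the sum of their logarithms stays perfectly representable:

» prod(1e-5 * ones(1,10000))       % underflows to 0

» sum(log(1e-5 * ones(1,10000)))   % approximately -1.15e+05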

Find the right statements:

We can further simplify the computation of the log-likelihood in eq. 3 for classification by
$ \Box$
dropping the division by two: $ \frac{1}{2} \left[\ldots\right]$,
$ \Box$
dropping term $ d\log \left( 2\pi \right)$,
$ \Box$
dropping term $ \log \left( \det\left(\ensuremath\boldsymbol{\Sigma}\right) \right)$,
$ \Box$
dropping term $ (\ensuremath\mathbf{x}-\ensuremath\boldsymbol{\mu})^{\mathsf T}\ensuremath\boldsymbol{\Sigma}^{-1} (\ensuremath\mathbf{x}-\ensuremath\boldsymbol{\mu})$,
$ \Box$
calculating the term $ \log \left( \det\left(\ensuremath\boldsymbol{\Sigma}\right) \right)$ in advance.



We can drop term(s) because:

$ \Box$
The term(s) are independent of $ \ensuremath\boldsymbol{\mu}$.
$ \Box$
The terms are negligibly small.
$ \Box$
The term(s) are independent of the classes.

In summary, log-likelihoods are simpler to compute than plain likelihoods and can be used directly for classification tasks.
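
As a sketch, eq. (3) summed over all points of a sample can be wrapped into a small reusable helper (the name gauss_loglik is our own choice; X holds one sample point per row, and the result is the joint log-likelihood of all rows):

» gauss_loglik = @(X,mu,sigma) -0.5 * ( ...
      sum(sum(((X - repmat(mu,size(X,1),1)) * inv(sigma)) .* (X - repmat(mu,size(X,1),1)), 2)) ...
      + size(X,1) * (log(det(sigma)) + size(X,2)*log(2*pi)) );

This computes the same value as the explicit loop in the example below.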

Experiment:

Given the following 4 Gaussian models $ \ensuremath\boldsymbol{\Theta}_i = (\ensuremath\boldsymbol{\mu}_i,\ensuremath\boldsymbol{\Sigma}_i)$
$ {\cal N}_1: \; \boldsymbol{\Theta}_1 = \left( \left[\begin{array}{c} 730 \\ 1090 \end{array}\right], \left[\begin{array}{cc} 8000 & 0 \\ 0 & 8000 \end{array}\right] \right)$   $ {\cal N}_2: \; \boldsymbol{\Theta}_2 = \left( \boldsymbol{\mu}_2, \left[\begin{array}{cc} 8000 & 0 \\ 0 & 18500 \end{array}\right] \right)$
$ {\cal N}_3: \; \boldsymbol{\Theta}_3 = \left( \boldsymbol{\mu}_3, \left[\begin{array}{cc} 8000 & 8400 \\ 8400 & 18500 \end{array}\right] \right)$   $ {\cal N}_4: \; \boldsymbol{\Theta}_4 = \left( \boldsymbol{\mu}_4, \left[\begin{array}{cc} 8000 & 8400 \\ 8400 & 18500 \end{array}\right] \right)$




compute the following log-likelihoods for the whole sample $ X_3$ (10000 points):

$\displaystyle \log p(X_3\vert\boldsymbol{\Theta}_1),\; \log p(X_3\vert\boldsymbol{\Theta}_2),\; \log p(X_3\vert\boldsymbol{\Theta}_3),\;$   and$\displaystyle \; \log p(X_3\vert\boldsymbol{\Theta}_4). $

Example:

» N = size(X3,1)

» mu_1 = [730 1090]; sigma_1 = [8000 0; 0 8000];

» logLike1 = 0;

» % accumulate the quadratic form (x_i - mu)' * inv(Sigma) * (x_i - mu) over all points

» for i = 1:N;

logLike1 = logLike1 + (X3(i,:) - mu_1) * inv(sigma_1) * (X3(i,:) - mu_1)';

end;

» % add the constant terms of eq. (3): N*log(det(Sigma)) and N*d*log(2*pi), with d = 2

» logLike1 = - 0.5 * (logLike1 + N*log(det(sigma_1)) + 2*N*log(2*pi))
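
With the sketch function gauss_loglik defined earlier (or by repeating the loop above), the remaining log-likelihoods follow the same pattern; here mu_2, mu_3 and mu_4 are assumed to hold the mean vectors of $ {\cal N}_2$, $ {\cal N}_3$ and $ {\cal N}_4$ as given in the exercise, and sigma_2, sigma_3 the covariance matrices from above:

» logLike2 = gauss_loglik(X3, mu_2, sigma_2)

» logLike3 = gauss_loglik(X3, mu_3, sigma_3)

» logLike4 = gauss_loglik(X3, mu_4, sigma_3)   % N4 has the same covariance matrix as N3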

Note: Use the function gausview to compare the relative positions of the models $ {\cal N}_1$, $ {\cal N}_2$, $ {\cal N}_3$ and $ {\cal N}_4$ with respect to the data set $ X_3$, e.g.:

» mu_1 = [730 1090]; sigma_1 = [8000 0; 0 8000];

» gausview(X3,mu_1,sigma_1,'Comparison of X3 and N1');



Question:

Of $ {\cal N}_1$, $ {\cal N}_2$, $ {\cal N}_3$ and $ {\cal N}_4$, which model ``explains'' the data $X_3$ best? Which model has the largest number of (non-zero) parameters? Which model would you choose as a good compromise between the number of parameters and the ability to represent the data accurately?