Next: Polytrees [3 P] Up: MLA_Exercises_2009 Previous: Parameter Learning in Bayesian

Naive Bayes Classifier [4+2* P]

Implement an algorithm for learning a naive Bayes classifier and apply it to a spam email data set. You are required to use MATLAB for this assignment. The spam data set is available for download on the course homepage.

a)
[1 P]

Write a function called nbayes_learn.m that takes a training dataset for a binary classification task with binary attributes, together with a prior Beta distribution for each model parameter of a naive Bayes classifier (specified by variables $ a_i$ and $ b_i$ for the $ i$ th model parameter), and returns the posterior Beta distributions of all model parameters (specified by variables $ a'_i$ and $ b'_i$ for the $ i$ th model parameter).
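
For a binary attribute, each model parameter is a Bernoulli probability with a Beta prior, so the posterior update amounts to adding the observed counts: $ a'_i = a_i + n_1$ and $ b'_i = b_i + n_0$, where $ n_1$ and $ n_0$ count attribute values 1 and 0 among the training samples of the corresponding class. A minimal sketch of this update (in Python/NumPy for illustration only; the assignment itself must be implemented in MATLAB, and the array layout and function signature here are our assumptions):

```python
import numpy as np

def nbayes_learn(X, y, a, b):
    """Beta posterior update for a naive Bayes model with binary attributes.
    X: (n, d) 0/1 attribute matrix, y: (n,) 0/1 class labels,
    a, b: (2, d) prior Beta parameters per class and attribute.
    Returns posterior parameters a', b' of the same shape."""
    a_post = a.astype(float).copy()
    b_post = b.astype(float).copy()
    for c in (0, 1):
        Xc = X[y == c]                     # training samples of class c
        ones = Xc.sum(axis=0)              # n1: count of attribute value 1
        a_post[c] += ones                  # a' = a + n1
        b_post[c] += Xc.shape[0] - ones    # b' = b + n0
    return a_post, b_post
```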

b)
[1 P]

Write a function called nbayes_predict.m that takes a set of test data vectors and returns the most likely class label predictions for each input vector based on the posterior parameter distributions obtained in a).
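
One common choice is to predict with the posterior-mean parameter estimates $ \theta_i = a'_i/(a'_i+b'_i)$ and pick the class with the highest log-posterior. An illustrative Python/NumPy sketch (the assignment requires MATLAB; the signature and the way the class prior is passed in are our assumptions):

```python
import numpy as np

def nbayes_predict(Xtest, a_post, b_post, logprior):
    """Predict the most likely class label for each row of Xtest.
    a_post, b_post: (2, d) posterior Beta parameters from nbayes_learn,
    logprior: (2,) log class prior probabilities."""
    theta = a_post / (a_post + b_post)          # posterior mean P(x_j=1 | class)
    # log-likelihood of each test vector under each class (naive Bayes factorization)
    ll = (Xtest @ np.log(theta).T
          + (1 - Xtest) @ np.log(1 - theta).T)  # shape (n, 2)
    return np.argmax(ll + logprior, axis=1)
```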

c)
[2 P]

Use both functions to conduct the following experiments. You will be working with a data set that was created a few years ago at the Hewlett-Packard Research Labs as a testbed for comparing spam email classification algorithms.

  1. Check whether the naive Bayes assumption (conditional independence of the attributes given the class) holds for all pairs of input attributes.

  2. Train a naive Bayes model on the first 2500 samples (using Laplace uniform prior distributions, i.e. $ a_i = b_i = 1$) and report the classification error of the trained model on a test data set consisting of the remaining examples that were not used for training.

  3. Repeat the previous step, now training on the first {10, 50, 100, 200, ... , 500} samples, and again testing on the same test data as used in point 2 (samples 2501 through 4601). Report the classification error on the test dataset as a function of the number of training examples. Hand in a plot of this function.

  4. Comment on how accurate the classifier would be if it randomly guessed a class label or always picked the most common label in the training data. Compare these baseline performance values to the results obtained for the naive Bayes model.
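
As an illustration of steps 2-4, the experiment loop might look as follows. This is a Python/NumPy sketch on synthetic stand-in data; the actual assignment uses MATLAB and the HP spam data set, and all names and the synthetic data generation here are our own:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the spam data (4601 samples, binary attributes);
# in the actual assignment, load the HP data set from the course homepage.
n, d = 4601, 10
y = (rng.random(n) < 0.4).astype(int)
p = np.where(y[:, None] == 1, 0.7, 0.3)        # class-dependent P(x_j = 1)
X = (rng.random((n, d)) < p).astype(int)
X_test, y_test = X[2500:], y[2500:]            # fixed test set (samples 2501-4601)

def nb_error(m):
    """Train on the first m samples with Laplace Beta(1,1) priors and
    return the classification error on the fixed test set."""
    X_tr, y_tr = X[:m], y[:m]
    theta = np.stack([(X_tr[y_tr == c].sum(0) + 1.0) /
                      ((y_tr == c).sum() + 2.0) for c in (0, 1)])
    prior = np.array([((y_tr == c).sum() + 1.0) / (m + 2.0) for c in (0, 1)])
    ll = (X_test @ np.log(theta).T + (1 - X_test) @ np.log(1 - theta).T
          + np.log(prior))
    return (np.argmax(ll, axis=1) != y_test).mean()

sizes = [10, 50, 100, 200, 500, 2500]          # extend per the assignment
errors = [nb_error(m) for m in sizes]          # plot errors vs. sizes

# Baselines from step 4: random guessing and the majority training label
rand_err = (rng.integers(0, 2, y_test.size) != y_test).mean()
maj_err = (y_test != np.bincount(y[:2500]).argmax()).mean()
```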

d)
[2* P] Train a feedforward neural network with one sigmoidal output unit and no hidden units with backpropagation (use the algorithm traingdx and initialize the network with small but nonzero weights) on the first {10, 50, 100, 200, ... , 500} samples and test on the same test data as used in point 2 (samples 2501 through 4601). Report the classification error on the test dataset as a function of the number of training examples and compare the results to those obtained for the naive Bayes classifier. Hand in a plot of this function.
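
A feedforward network with one sigmoidal output unit and no hidden units is equivalent to logistic regression. As a language-neutral illustration of what the training rule computes (the assignment itself uses MATLAB's traingdx, which additionally applies momentum and an adaptive learning rate; the plain batch gradient descent and all names below are our simplification):

```python
import numpy as np

def train_sigmoid_unit(X, y, epochs=500, lr=0.1, seed=0):
    """Single sigmoidal output unit with no hidden layer (= logistic
    regression), trained by plain batch gradient descent."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.01, size=X.shape[1])  # small but nonzero init
    b = 0.0
    for _ in range(epochs):
        z = X @ w + b
        p = 1.0 / (1.0 + np.exp(-z))             # sigmoid activation
        grad = p - y                             # dE/dz for cross-entropy loss
        w -= lr * X.T @ grad / len(y)
        b -= lr * grad.mean()
    return w, b

def predict(X, w, b):
    """Threshold the unit's output at 0.5 (i.e. its input at 0)."""
    return (X @ w + b > 0).astype(int)
```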
Present your results clearly, in a structured and legible way. Document them in such a way that anybody can reproduce them effortlessly.


Haeusler Stefan 2010-01-26