
# Naive Bayes Classifier [4 P]

Implement an algorithm for learning a naive Bayes classifier and apply it to a spam email data set. You are required to use MATLAB for this assignment. The spam data set and the templates for the MATLAB functions are available for download on the course homepage.
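For orientation, the decision rule of a naive Bayes classifier can be written as follows. This is the standard textbook formulation rather than anything specific to the course templates; the symbols $C$ (class variable), $x_d$ (the $d$-th feature of an input vector) and $D$ (number of features) are notation chosen here for illustration.

$$
\hat{c} \;=\; \arg\max_{c} \; P(C = c) \prod_{d=1}^{D} P(x_d \mid C = c)
$$

The product reflects the "naive" assumption that the features are conditionally independent given the class. In practice the logarithm of this expression is maximized instead, which turns the product into a sum and avoids numerical underflow.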

a) [1 P]

Write a function called nbayes_learn.m that takes a training dataset and returns the probabilities for a naive Bayes classifier. (A template for this function is included in the supplementary material.)
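As a rough illustration of what the learning step computes, the sketch below estimates class priors and per-class feature probabilities on a tiny toy data set. It makes several assumptions that are not taken from the assignment: the features are binary (a Bernoulli model), add-one (Laplace) smoothing is used, and the variable names `X`, `y`, `prior` and `theta` are hypothetical. The actual interface is defined by the template, which may differ.

```matlab
% Toy data (hypothetical): 6 emails, 3 binary features; label 1 = spam, 0 = non-spam.
X = [1 0 1; 1 1 0; 1 0 0; 0 1 0; 0 1 1; 0 0 0];
y = [1; 1; 1; 0; 0; 0];

classes = unique(y);          % sorted class labels, here [0; 1]
K = numel(classes);
D = size(X, 2);
prior = zeros(K, 1);          % prior(k)    estimates P(C = classes(k))
theta = zeros(K, D);          % theta(k, d) estimates P(x_d = 1 | C = classes(k))

for k = 1:K
    idx = (y == classes(k));  % rows belonging to class k
    Nk = sum(idx);
    prior(k) = Nk / numel(y);
    % Add-one (Laplace) smoothing, an assumption of this sketch:
    % keeps every estimate strictly between 0 and 1.
    theta(k, :) = (sum(X(idx, :), 1) + 1) / (Nk + 2);
end
```

Without smoothing, a feature value never seen together with a class in the training data would get probability zero and veto that class outright, which matters most for the small training sets used later in the exercise.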

b) [1 P]

Write a function called nbayes_predict.m that takes a set of test data vectors and returns a predicted class label for each vector. (A template for this function is included in the supplementary material.)
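The prediction step can be sketched in the same spirit. The example below scores each test vector under each class in log space and picks the class with the highest score. It assumes binary features and a Bernoulli model; the parameter values and the names `prior`, `theta`, `Xtest` and `ypred` are hypothetical and not taken from the template.

```matlab
% Hypothetical learned parameters for 2 classes and 3 binary features.
classes = [0; 1];
prior = [0.5; 0.5];                      % P(C = k)
theta = [0.2 0.6 0.4; 0.8 0.4 0.4];      % theta(k, d) = P(x_d = 1 | C = k)

% Two test vectors, one per row.
Xtest = [1 0 1; 0 1 0];

% Log-score of each test vector (rows) under each class (columns):
% log P(C = k) + sum_d [ x_d * log(theta_kd) + (1 - x_d) * log(1 - theta_kd) ]
logpost = Xtest * log(theta') + (1 - Xtest) * log(1 - theta');
logpost = logpost + repmat(log(prior'), size(Xtest, 1), 1);

% Pick the class with the highest log-score for each test vector.
[~, best] = max(logpost, [], 2);
ypred = classes(best);                   % here: [1; 0]
```

Working with sums of logarithms instead of products of probabilities gives the same predictions, since the logarithm is monotonic, and stays numerically stable when the number of features is large.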

c) [2 P]

Use both functions to conduct the following experiment. The data set for this assignment was created a few years ago at the Hewlett Packard Research Labs as a testbed for comparing spam email classification algorithms.

1. Train a naive Bayes model on the first 2500 samples and report its classification error on a test set consisting of the remaining samples, which were not used for training.

2. Repeat the previous step, now training on the first {10, 50, 100, 200, ... , 500} samples, and again testing on the same test data as in step 1 (samples 2501 through 4601). Report the classification error on the test set as a function of the number of training samples, and hand in a plot of this function.

3. Comment on how accurate a classifier would be if it guessed the class label at random or always picked the most common label in the training data. Compare these baseline values to the results obtained with the naive Bayes model.
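The structure of this experiment (a fixed test set, growing training sets, and baseline error rates) can be sketched as below. The sketch uses toy labels and deliberately small, illustrative sizes rather than the real data or the sizes from the task, and the majority-class baseline stands in for the classifier. All variable names are hypothetical.

```matlab
% Toy labels (hypothetical stand-in for the real data): 1 = spam, 0 = non-spam.
y = repmat([1; 0; 0; 1; 0], 40, 1);      % 200 labels, 40% spam

nSplit = 100;                            % illustrative split point, not the one from the task
ytest = y(nSplit+1:end);                 % fixed test set, reused for every training size

sizes = [10 50 100];                     % illustrative training set sizes
errMajority = zeros(size(sizes));
for i = 1:numel(sizes)
    ytrain = y(1:sizes(i));              % always the first sizes(i) samples
    classes = unique(ytrain);
    counts = arrayfun(@(c) sum(ytrain == c), classes);
    [~, iMax] = max(counts);             % index of the most common training label
    % Test error when always predicting the majority training label.
    errMajority(i) = mean(ytest ~= classes(iMax));
end

% Uniform random guessing among K classes is correct with probability 1/K,
% so its expected error is 1 - 1/K regardless of the class proportions.
errRandom = 1 - 1 / numel(unique(y));
```

Replacing the body of the loop with calls to the functions from parts a) and b) yields the naive Bayes error curve, which can then be drawn with `plot(sizes, err)` and compared against the two baseline values.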

Present your results in a clear, structured, and legible form. Document them in such a way that anybody can reproduce them effortlessly.

Haeusler Stefan 2007-12-03