Computational Intelligence, SS08
2 VO 442.070 + 1 RU 708.070

Homework 4: Decision Trees and Support Vector Machines



[Points: 12.5; Issued: 2008/03/20; Deadline: 2008/05/26; Tutor: Martin Stadler; Info hour: 2008/05/16, 15:30-16:30, HS i11; Review session: 2008/06/06, 15:30-16:30, HS i11; Download: pdf; ps.gz]





Decision Trees [5 points]

In this example you are asked to implement the entropy impurity and the gain ratio impurity calculations in order to build a decision tree. Download and unzip the decision tree framework for Matlab, decisiontrees.zip. This framework consists of several functions for building a decision tree from a given data set, classifying new data points with a decision tree, and various other evaluation functions. Please refer to the trees_template.m file to see how these functions work.

The function getScore calculates a score for splitting on a certain attribute. As input, it gets the attribute values of dimension $ j$ of all training examples in $ L_v$ and the class labels of all examples in $ L_v$. The attribute values can be any integer value; the class labels are either $ 1$ or $ 2$. At each new node, the function getScore is called for each dimension $ j$. The dimension with the highest score is used for splitting. Currently, getScore only returns random values; the correct implementation of this function is your task.
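For orientation, the calling convention described above corresponds to a stub like the following (the argument names are only illustrative; the file in decisiontrees.zip is the reference). It mirrors the random placeholder that ships with the framework:

    function score = getScore(attributeValues, classLabels)
    % getScore - score a candidate split on one attribute (dimension j)
    %   attributeValues : integer values of dimension j for all examples in L_v
    %   classLabels     : corresponding class labels (1 or 2)
    %   score           : higher is better; the dimension with the highest
    %                     score is chosen for splitting at the current node
    % Placeholder behaviour of the framework: return a random score.
    score = rand(1);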
  • Use the dataset treedataset1.csv. Build the tree with the existing random getScore function and determine the depth, the number of leaves and the error of the tree by performing a 10-fold cross-validation.
  • Implement the information gain (based on the entropy impurity) in the getScore function; a possible structure is sketched after this list. Again build the tree with treedataset1.csv and evaluate the depth and the number of leaves of the tree. Estimate the true error of the tree by performing a 10-fold cross-validation. Compare it to the previous results and explain the differences in the cross-validation error.
  • Use the dataset treedataset2.csv. It is essentially the same dataset with one additional attribute. The new attribute has 50 random nominal values. Again build your tree with the entropy impurity score function and determine the depth, the number of leaves and the cross-validation error. Can you observe a change in the cross-validation error? If yes, why?
  • Implement the gain ratio impurity in the getScore function. Build the tree with the dataset treedataset2.csv and evaluate the depth, the number of leaves of the tree and the cross-validation error. Compare it to the previous results.
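The following is a possible structure for the entropy-based score (a sketch only, assuming the stub interface shown above; adapt it to the interface actually used in decisiontrees.zip). It uses unique and avoids the $ p = 0$ pitfall, both mentioned in the hints below; the gain ratio variant is indicated in the comments.

    function score = getScore(attributeValues, classLabels)
    % Information gain of splitting on this attribute (entropy impurity).
    n = numel(classLabels);
    score = entropyOf(classLabels);              % impurity before the split
    for v = unique(attributeValues(:))'          % all occurring attribute values
        branch = (attributeValues == v);         % examples falling into branch v
        score = score - sum(branch)/n * entropyOf(classLabels(branch));
    end
    % Gain ratio variant: divide the information gain by the split
    % information, i.e. the entropy of the attribute-value distribution:
    %   score = score / entropyOf(attributeValues);   % guard against division by zero

    function h = entropyOf(x)
    % Entropy of the empirical distribution of x, with 0*log2(0) treated as 0.
    p = histc(x(:), unique(x(:))) / numel(x);    % relative frequencies
    p = p(p > 0);                                % avoid 0*log2(0) = NaN in Matlab
    h = -sum(p .* log2(p));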

Hints

  • Be careful when calculating the product $ p \log(p)$ in Matlab when $ p = 0$: Matlab returns NaN in this case.
  • The function unique might be useful to determine all possible attribute values.
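A short illustration of the first hint:

    p = 0;
    p * log(p)                           % 0 * (-Inf) evaluates to NaN in Matlab
    p = [0.5, 0.5, 0];
    -sum(p(p > 0) .* log2(p(p > 0)))     % dropping zero entries gives the expected entropy of 1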

Support Vector Machines [7.5 points]

In this homework you are supposed to evaluate different parameter settings for the Support Vector Machine (SVM). Download the SVM library for Matlab and follow the installation instructions (install.txt). Consult the files demsvm1.m or svm_template.m to get an overview of the basic methods of the SVM library.
  • The file homework4.mat contains a training set (x_train, t_train) and a test set (x_test, t_test). The file is already included in the svm directory.
  • Train several SVMs with different polynomial kernels. Use $ p = [1, 2, 3, 4, 5]$ for the degree of the kernel. Plot the training points, the decision boundary and the support vectors for all settings of $ p$ (see template). Compare the plots. Also plot the error on the test set and on the training set for increasing values of $ p$. Which setting of $ p$ is best?
  • Train several SVMs with different rbf kernels (bandwidth set to $ [0.1, 10, 100]$) and $ C$ set to $ 10$. Plot the training points, the decision boundary and the support vectors (see template) for all three settings. Interpret the plots.
  • Again train SVMs with different rbf kernels. This time set the bandwidths to logspace(-1, 2, 25) and compute the error on the test set for each bandwidth. Repeat the experiment for $ C = [0.1, 1.0, 10.0, 50.0]$. Plot the test set error versus the rbf kernel bandwidth for each setting of $ C$ (i.e. for each $ C$ setting you get an individual curve); use a semilogx plot to get a good scaling of the curves. A sketch of this sweep is given after this list. Interpret your results: can you find any relation between good $ C$ values and good rbf bandwidths? Can you find an explanation for this relationship?
  • Train a feed-forward neural network with 20 hidden neurons using weight decay. Use a learning algorithm of your choice and try different regularization constants (net.performParam.ratio). Can you achieve a performance comparable to that of the SVMs? A minimal setup is sketched after the hints below.
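As an illustration of the bandwidth sweep, a sketch along the following lines could be used. The svm, svmtrain and svmfwd calls are assumptions based on demsvm1.m; check svm_template.m for the exact constructor arguments and for the orientation of x_train and t_train.

    load homework4                         % x_train, t_train, x_test, t_test
    bandwidths = logspace(-1, 2, 25);
    Cs = [0.1, 1.0, 10.0, 50.0];
    testError = zeros(numel(Cs), numel(bandwidths));

    for i = 1:numel(Cs)
        for j = 1:numel(bandwidths)
            % assumed constructor: input dimension, kernel type, kernel parameter, C
            net = svm(size(x_train, 2), 'rbf', bandwidths(j), Cs(i));
            net = svmtrain(net, x_train, t_train);
            y   = svmfwd(net, x_test);     % predicted labels (+1 / -1)
            testError(i, j) = mean(y ~= t_test);
        end
    end

    semilogx(bandwidths, testError');      % one curve per C value
    xlabel('rbf kernel bandwidth'); ylabel('test set error');
    legend('C = 0.1', 'C = 1', 'C = 10', 'C = 50');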

Hints

  • The class labels used for the SVM are $ 1$ and $ -1$. Keep that in mind when training the neural network.
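For the neural network part, the labels therefore have to be remapped before training. A minimal sketch, assuming the pre-2010 Neural Network Toolbox interface (newff with the msereg performance function) and examples stored as rows of x_train:

    % Map the SVM labels {-1, 1} to {0, 1} for a logsig output layer
    t_train_nn = (t_train + 1) / 2;
    t_test_nn  = (t_test  + 1) / 2;

    % 20 hidden neurons; weight decay via the msereg performance function
    net = newff(minmax(x_train'), [20, 1], {'tansig', 'logsig'}, 'trainscg');
    net.performFcn = 'msereg';
    net.performParam.ratio = 0.9;          % regularization constant to vary
    net = train(net, x_train', t_train_nn');

    y = sim(net, x_test') > 0.5;           % threshold the network output
    testError = mean(y' ~= t_test_nn)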