
[Points: 12.5; Issued: 2008/03/20; Deadline: 2008/05/26; Tutor:
Martin Stadler; Info hour: 2008/05/16, 15:30-16:30,
HS i11; Review session (Einsichtnahme): 2008/06/06, 15:30-16:30, HS i11; Download:
pdf; ps.gz]
In this exercise you are asked to implement the entropy impurity and
the gain ratio impurity calculations in order to build a
decision tree. Download and unzip the decision tree framework for
Matlab, decisiontrees.zip. This framework consists of
several functions for building a decision tree from a given data
set, classifying new data points with a decision tree, and several other
evaluation functions. Please refer to the trees_template.m
file to see how these functions work.
The function getScore calculates a score for splitting on a
certain attribute. As input, it gets the attribute values of one
dimension for all training examples, together with the class labels
of all examples. The attribute values can be any integer value; the
class labels take one of two values. At each new node, the function getScore
is called for each dimension, and the dimension with the highest score is used for
splitting. For now, the getScore function only returns
random values; the correct implementation of this function is
your task.
- Use the dataset treedataset1.csv. Build the tree with
the already existing random getScore function and
determine the depth, the number of leaves, and the error of the tree
by performing a 10-fold cross-validation.
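The framework's own cross-validation routine should be used for the exercise; purely as a reference for what a 10-fold estimate computes, here is a minimal Python/NumPy sketch (the synthetic data, the majority-vote stand-in classifier, and all names are hypothetical, not part of the Matlab framework):

```python
import numpy as np

def cross_val_error(X, y, fit, predict, k=10, seed=0):
    """Estimate the true error: average test error over k held-out folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        errors.append(float(np.mean(predict(model, X[test]) != y[test])))
    return float(np.mean(errors))

# Stand-in "classifier" (predicts the majority class) on synthetic data:
fit = lambda X, y: 1.0 if (y == 1).sum() >= (y == -1).sum() else -1.0
predict = lambda model, X: np.full(len(X), model)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] > 0, 1.0, -1.0)
cv_error = cross_val_error(X, y, fit, predict)
```

Each example is held out exactly once, so the averaged fold errors approximate the error on unseen data.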
- Implement the information gain (with the help of the entropy
impurity) in the getScore function. Again build the tree
with treedataset1.csv and evaluate the depth and the
number of leaves of the tree. Estimate the true error of the tree
by performing a 10-fold cross-validation. Compare it to the
previous results and explain the differences in the cross-validation
error.
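A sketch of the entropy impurity and the resulting information-gain score, in Python/NumPy rather than the Matlab framework (the function and variable names are mine, not the framework's):

```python
import numpy as np

def entropy(labels):
    """Entropy impurity of a label vector, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    # counts from np.unique are always > 0, so p * log2(p) never hits 0 * -inf
    return float(-(p * np.log2(p)).sum())

def information_gain(attr, labels):
    """Entropy before the split minus the weighted entropy of the children."""
    n = len(labels)
    gain = entropy(labels)
    for v in np.unique(attr):
        mask = attr == v
        gain -= mask.sum() / n * entropy(labels[mask])
    return gain

# A perfectly separating binary attribute yields a gain of one full bit:
g = information_gain(np.array([0, 0, 1, 1]), np.array([1, 1, -1, -1]))
```

Iterating over the unique values (and their counts) also sidesteps the 0 * log(0) = NaN pitfall mentioned in the hints.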
- Use the dataset treedataset2.csv. It is basically the
same dataset with one additional attribute. The new attribute
has 50 random nominal values. Again build your tree with the
entropy impurity score function and determine the depth, number of
leaves, and cross-validation error. Can you observe a change in
the cross-validation error? If yes, why?
- Implement the gain ratio impurity in the getScore
function. Build the tree with the dataset treedataset2.csv
and evaluate the depth, the number of leaves of the tree, and the
cross-validation error. Compare it to the previous result.
- Be careful when calculating the product p * log(p) in Matlab when
p = 0: Matlab returns NaN in this case (0 * log(0) evaluates to 0 * -Inf = NaN).
- The Matlab function unique might be useful to determine all
possible attribute values.
In this homework you are supposed to evaluate different parameter
settings for the support vector machine (SVM). Download the
SVM library for Matlab and follow the install
instructions (install.txt). Consult the files demsvm1.m or
svm_template.m to get an overview of the basic methods of
the SVM library.
- The file homework4.mat contains a training set
(x_train, t_train) and a test set (x_test,
t_test). The file is already included in the SVM
directory.
- Train several SVMs with different polynomial kernels, varying
the degree of the kernel. Plot the training points, the
decision boundary, and the support vectors for all degree settings
(see template).
Compare the plots. Also plot the error on the test set and on the
training set for increasing degrees. Which is the best setting?
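The exercise uses the course's Matlab SVM library; as a rough illustration of the degree sweep, here is a hypothetical Python sketch with scikit-learn's SVC on synthetic data (the degree grid and the data are made up, since the sheet's concrete values did not survive extraction):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] > 0.5, 1, -1)   # curved class boundary
X_train, y_train, X_test, y_test = X[:150], y[:150], X[150:], y[150:]

errors = {}
for d in [1, 2, 3, 5, 8]:                            # hypothetical degree grid
    clf = SVC(kernel="poly", degree=d, coef0=1.0, C=1.0).fit(X_train, y_train)
    errors[d] = (1.0 - clf.score(X_train, y_train),  # training error
                 1.0 - clf.score(X_test, y_test))    # test error
# Low degrees underfit the curved boundary; very high degrees tend to overfit,
# so training error keeps falling while test error eventually rises.
```

Plotting the two error curves over the degree (as the task asks) typically shows the familiar U-shaped test error.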
- Train several SVMs with different RBF kernels (three settings of
the bandwidth) with a fixed cost parameter. Plot the training points, the
decision boundary, and the support vectors (see template) for all
three settings. Interpret the plots.
- Again train SVMs with different RBF kernels. This time set the
bandwidths to logspace(1, 2, 25) and compute the error
on the test set for each bandwidth. Repeat the experiment for
several settings of the cost parameter. Plot the test set error
versus the RBF kernel bandwidth for each setting (i.e. for each setting you get an individual
curve); use a semilogx plot to get a good scaling of
the curves. Interpret your results: can you find any relation
between good cost values and
good RBF bandwidths? Can you find any explanation for this
relationship?
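A hypothetical Python counterpart of the bandwidth sweep (scikit-learn parameterises the RBF kernel by gamma = 1/(2*sigma^2) rather than by the bandwidth sigma directly; the C grid and the data are made up, and the plot call is only indicated in a comment):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] > 0.5, 1, -1)
X_train, y_train, X_test, y_test = X[:150], y[:150], X[150:], y[150:]

bandwidths = np.logspace(1, 2, 25)        # same grid as Matlab's logspace(1, 2, 25)
test_error = {}
for C in [0.1, 1.0, 100.0]:               # hypothetical cost values
    test_error[C] = [
        1.0 - SVC(kernel="rbf", gamma=1.0 / (2.0 * s ** 2), C=C)
                  .fit(X_train, y_train).score(X_test, y_test)
        for s in bandwidths
    ]
# One curve per C, e.g. with matplotlib:
#   plt.semilogx(bandwidths, test_error[C], label=f"C={C}")
```

Large bandwidths make the kernel nearly constant, so a larger C is then needed to keep the model flexible; comparing the curves per C makes that trade-off visible.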
- Train a feed-forward neural network with 20 hidden neurons
using weight decay. Use a learning algorithm of your choice and try
different regularization constants
(net.performParam.ratio). Can you achieve a performance
comparable to that of the SVMs?
- The class labels used for the SVM are +1 and -1. Keep that in mind when training the neural
network.
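For comparison, a hypothetical Python sketch of the weight-decay experiment with scikit-learn's MLPClassifier, whose L2 penalty `alpha` plays a role similar to Matlab's net.performParam.ratio (the alpha grid and the data are made up; MLPClassifier accepts the +1/-1 labels directly, unlike Matlab's toolbox):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] > 0.5, 1, -1)
X_train, y_train, X_test, y_test = X[:150], y[:150], X[150:], y[150:]

test_error = {}
for alpha in [1e-4, 1e-2, 1.0]:           # hypothetical regularization constants
    net = MLPClassifier(hidden_layer_sizes=(20,), alpha=alpha,
                        max_iter=2000, random_state=0).fit(X_train, y_train)
    test_error[alpha] = 1.0 - net.score(X_test, y_test)
# Larger alpha = stronger weight decay = smoother decision boundary.
```

As with the SVM's C and bandwidth, the regularization constant trades training-set fit against smoothness, which is what makes the two models comparable here.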

