Next: About this document ... Up: MLA_Projects Previous: Predicting Text Relevance from

Data Collection

For this project you collect your own dataset and analyze it. Data sources are plentiful: the web, books, newspapers, or your own experiments and surveys. Some ideas are: geographical or census data, biological data, economic data, astronomy, images, speech, experiments for your studies, ... Before collecting the data, define the task that you want to solve: Is it a classification, regression or clustering task? What is the target concept, and which input variables are necessary? Is there an intuitive solution to the problem?

The dataset should contain at least 50 instances, each with at least 4 input features and at least 1 target value. If you have a classification problem, try to balance the number of instances in each class. If you think that you have an interesting dataset which does not meet these criteria, contact me before you start.
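
The size and balance requirements can be checked automatically before any analysis starts. The following is only a minimal sketch: the function name `check_dataset`, the list-of-rows data layout, and the `max_imbalance` threshold are assumptions made for illustration, since the project description does not say how balanced the classes have to be.

```python
from collections import Counter

MIN_INSTANCES = 50  # minimum stated in the project description
MIN_FEATURES = 4    # minimum stated in the project description


def check_dataset(features, targets, classification=False, max_imbalance=2.0):
    """Return a list of problems; an empty list means all checks passed.

    features: list of rows, each a list of input feature values
    targets: list with one target value per row
    max_imbalance: hypothetical limit on (largest class / smallest class)
    """
    problems = []
    if len(features) != len(targets):
        problems.append("features and targets differ in length")
    if len(features) < MIN_INSTANCES:
        problems.append(
            f"only {len(features)} instances, need at least {MIN_INSTANCES}"
        )
    widths = {len(row) for row in features}
    if len(widths) > 1:
        problems.append("rows have differing numbers of features")
    if widths and min(widths) < MIN_FEATURES:
        problems.append(
            f"only {min(widths)} input features, need at least {MIN_FEATURES}"
        )
    if classification and targets:
        counts = Counter(targets)
        ratio = max(counts.values()) / min(counts.values())
        if ratio > max_imbalance:
            problems.append(f"class sizes are unbalanced: {dict(counts)}")
    return problems
```

Calling `check_dataset(rows, labels, classification=True)` on your collected data returns an empty list if the criteria are met. A non-empty list is a prompt to collect more data or to ask before you start, not an automatic rejection.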

Analyze the data after you have collected it. Create visualizations and calculate correlations between the features, or between features and target concepts. Select the best features and perform any preprocessing steps that you think are necessary. Apply one or more of the algorithms that were presented in the lecture to this dataset and evaluate their performance. Explain how and why you chose the particular algorithms, architectures and parameters. Try to extract rules or other interesting relationships from your data.
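
As one possible starting point for the correlation analysis, the sketch below ranks numeric input features by the absolute value of their Pearson correlation with a numeric target. The helper names `pearson` and `rank_features` and the list-of-rows layout are assumptions for illustration. Pearson correlation captures only linear relationships, so a low score does not prove that a feature is useless.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Convention chosen for this sketch: a constant column scores 0.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_features(features, targets, names):
    """Return (name, correlation) pairs, strongest absolute correlation first."""
    columns = list(zip(*features))  # transpose rows into columns
    scores = [(name, pearson(col, targets)) for name, col in zip(names, columns)]
    return sorted(scores, key=lambda item: abs(item[1]), reverse=True)
```

The same `pearson` function can be applied to pairs of feature columns to spot redundant inputs. For categorical features or targets, a different measure would be needed.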

There are some rules for collecting the dataset. If you violate any of them, your project does not count.


Pfeiffer Michael 2005-12-20