KDD Competition 2010
Revision as of 23:14, 19 May 2010
We're interested in working on the KDD Competition as a way to focus our machine learning exploration -- and maybe even find some interesting aspects of the data! If you're interested, drop us a note or show up at a weekly Machine Learning meeting. We'll use this space to keep track of our ideas.
Resources
- KDD Rules and Data Format
- R
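- libsvm (http://www.csie.ntu.edu.tw/~cjlin/libsvm/)
- Weka (http://www.cs.waikato.ac.nz/ml/weka/)
- List of other competitions in which we could engage (http://www.kdnuggets.com/datasets/competitions.html)
- Hadoop
- Mahout -- machine learning libraries for Hadoop (http://lucene.apache.org/mahout/)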
TODOs
- Vikram -- will help set up Hadoop for the rest of us and create a guide for Mahout setup
- Thomas -- will get libsvm working on the data and put together a "how to" guide for doing so
  - put together a perl script which will take random samples from the data, for working on smaller instances
  - put together a simple R script for loading the data
- Andy -- will get Weka working on the data and put together a "how to" guide for doing so
- Erin -- will work on data transformations and ways to create better representations of the data; will provide the orthogonalized data sets
- We will need to make sure we don't get disqualified for people belonging to multiple teams! Do not sign up anybody else for the competition without asking first.
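The random-sampling item above could also be sketched in Python. This is not the perl script itself, only a minimal illustration using reservoir sampling; it assumes the data file has one header line followed by one record per line, and the function name `sample_rows` and the file name in the usage comment are hypothetical.

```python
import random
from typing import Iterable, List, Optional, Tuple


def sample_rows(lines: Iterable[str], k: int,
                seed: Optional[int] = None) -> Tuple[str, List[str]]:
    """Return (header, sample): the first line plus up to k data rows
    chosen uniformly at random in a single pass (reservoir sampling)."""
    rng = random.Random(seed)
    it = iter(lines)
    header = next(it, "")  # assumption: the first line is a column header
    reservoir: List[str] = []
    for i, line in enumerate(it):
        if i < k:
            # Fill the reservoir with the first k data rows.
            reservoir.append(line)
        else:
            # Replace an existing entry with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = line
    return header, reservoir


# Usage (hypothetical file name):
# with open("train.txt") as f:
#     header, rows = sample_rows(f, 10000, seed=42)
```

Because the file is read line by line and only k rows are held at once, this should work on files too large to load whole.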
Notes
- To zip the file on OS X, use the command line; otherwise the archive triggers a complaint about a __MACOSX file. E.g.: zip asdf.zip algebra_2008_2009_submission.txt
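If a scripted alternative to the zip command is useful, the following is a minimal sketch using Python's standard zipfile module, which writes only the entries it is given. The function name `zip_submission` is hypothetical; the file names are taken from the note above.

```python
import zipfile
from pathlib import Path


def zip_submission(submission: str, archive: str) -> None:
    """Write a zip archive containing only the submission file."""
    path = Path(submission)
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # arcname drops any directory prefix so the file sits at the archive root.
        zf.write(path, arcname=path.name)


# Usage:
# zip_submission("algebra_2008_2009_submission.txt", "asdf.zip")
```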
Ideas
- Add new features by computing their values from existing columns -- e.g. correlation between skills based on their co-occurrence within problems. Could use a decision tree to define the boundaries of a new feature, e.g. "good student, medium student, bad student"
- Dimensionality reduction -- transform into numerical values appropriate for consumption by SVM
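As a starting point for the two ideas above, here is a rough sketch under stated assumptions. The first function counts raw skill co-occurrences (not a correlation yet, just the counts one could be computed from), taking each problem as a list of skill names. The second one-hot encodes categorical columns into libsvm's sparse "label index:value" text format; note that this is a plain encoding step, not dimensionality reduction. Both function names and the sample values are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, List, Sequence, Tuple


def skill_cooccurrence(problems: Iterable[Sequence[str]]) -> Counter:
    """Count how often each unordered pair of skills appears in the same problem."""
    counts: Counter = Counter()
    for skills in problems:
        # Sort so each pair is always stored in the same order.
        for a, b in combinations(sorted(set(skills)), 2):
            counts[(a, b)] += 1
    return counts


def to_libsvm_lines(rows: Iterable[Sequence[str]],
                    labels: Iterable[int]) -> Tuple[List[str], Dict[Tuple[int, str], int]]:
    """One-hot encode categorical rows as libsvm lines: 'label idx:1 idx:1 ...'."""
    index: Dict[Tuple[int, str], int] = {}
    lines: List[str] = []
    for row, label in zip(rows, labels):
        feats = []
        for col, value in enumerate(row):
            key = (col, value)
            if key not in index:
                index[key] = len(index) + 1  # libsvm feature indices start at 1
            feats.append(index[key])
        feats.sort()  # libsvm expects indices in ascending order
        lines.append(f"{label} " + " ".join(f"{i}:1" for i in feats))
    return lines, index


# Usage with made-up values:
# pairs = skill_cooccurrence([["add", "sub"], ["add", "sub", "mul"]])
# lines, index = to_libsvm_lines([["s1", "p1"], ["s2", "p1"]], [1, 0])
```

The returned index maps each (column, value) pair to its feature number, so the same mapping can be reused when encoding the test set.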