KDD Competition 2010
Revision as of 00:50, 23 May 2010
We're interested in working on the KDD Competition as a way to focus our machine learning exploration -- and maybe even to find some interesting aspects of the data! If you're interested, drop us a note or show up at a weekly Machine Learning meeting. We'll use this space to keep track of our ideas.
Resources
- KDD Rules and Data Format
- R language
- libsvm
- Weka
- List of other competitions in which we could engage
- Hadoop
- Mahout -- machine learning libraries for Hadoop
- So-so intro to Pig Video
- An AWESOME intro to Pig on Elastic Map Reduce!
- Pig language
- Pig Latin Manual
- Cloudera -- see videos for Hadoop intro
- Vikram's awesome Hadoop/EC2 scripts
- Our mailing list
- S3Fox
TODOs
- Vikram -- will create a guide for Mahout setup
- Thomas -- will get libsvm working on the data and put together a "how to" guide for doing so
- put together a perl script which will take random samples from the data, for working on smaller instances
- put together a simple R script for loading the data
- Andy --
- Erin -- will put the meeting notes from 5/19 on https://www.noisebridge.net/wiki/Machine_Learning; will work on data transformations and ways to create better representations of the data; will provide the orthogonalized data sets
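For the random-sampling item above, here is a minimal sketch in shell/awk rather than the perl script the TODO calls for. It assumes the data file starts with a header row that should always be kept; that assumption, the function name `sample_rows`, and the default seed are illustrative only and not something the team has agreed on. Each data row is kept independently with probability p, so the sample size is approximate rather than exact.

```shell
# Hypothetical helper: print the header plus a random subset of the data rows.
# $1 = input file, $2 = probability of keeping each row (0..1), $3 = seed (default 42)
sample_rows() {
    awk -v p="$2" -v seed="${3:-42}" '
        BEGIN { srand(seed) }        # fixed seed makes the sample repeatable
        NR == 1 { print; next }      # assumption: line 1 is a header, always keep it
        rand() < p { print }         # keep each data row with probability p
    ' "$1"
}

# Usage (roughly 1% of the rows):
# sample_rows algebra_2006_2007_train.txt 0.01 > algebra_2006_2007_train_sample.txt
```

Passing a different seed gives a different sample; passing the same seed reproduces the same one, which helps when several people want to compare results on an identical subset.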
Notes
- For KDD submission: on OS X, zip the submission file from the command line; otherwise you will get a complaint about a __MACOSX file. E.g.: zip asdf.zip algebra_2008_2009_submission.txt
- We will need to make sure we don't get disqualified for people belonging to multiple teams! Do not sign up anybody else for the competition without asking first.
Ideas
- Add new features by computing their values from existing columns -- e.g. correlation between skills based on their co-occurrence within problems. We could use a decision tree to define boundaries for, e.g., a new "good student, medium student, bad student" feature
- Dimensionality reduction -- transform into numerical values appropriate for consumption by SVM
Who we are
- Andy; Machine Learning
- Thomas; Statistics
- Erin; Maths
- Vikram; Hadoop
(insert your name/contact info/expertise here)
How to run Weka (quick 'n dirty tutorial)
- Download and install Weka
- Get your KDD data and preprocess it:
This command takes the first 1000 lines of the given training data set and converts them into a .csv file. Attention: in the last sed command you need to replace the long whitespace with a tab. In the OS X terminal, you do that by pressing CONTROL+V and then TAB. (Copying and pasting the command below won't work, since the pasted whitespace is interpreted as spaces.)
head -n 1000 algebra_2006_2007_train.txt | sed -e 's/[",]/ /g' | sed 's/ /,/g' > algebra_2006_2007_train_1kFormatted.csv
- The following screencasts show you how to do these steps:
- In Weka's Explorer, remove any unwanted attributes (I leave this up to your judgment) and inspect the dataset.
- Then you can run an ML algorithm over it, e.g. a neural network, to predict student performance.
- Screencast1
- Screencast2
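The preprocessing step above depends on typing a literal tab into the last sed command. One possible way around that is to let tr do the tab-to-comma conversion, since tr understands the `\t` escape. The sketch below wraps the same pipeline in a function; the name `kdd_to_csv` is made up for illustration and is not part of Weka or the KDD tooling.

```shell
# Hypothetical helper: first N lines of a tab-separated file, converted to CSV.
# $1 = input file, $2 = number of lines to keep (default 1000)
kdd_to_csv() {
    head -n "${2:-1000}" "$1" \
        | sed -e 's/[",]/ /g' \
        | tr '\t' ','
    # sed blanks out quotes and commas already in the data (as in the original command);
    # tr then turns every tab into a comma, so no tab has to be typed by hand
}

# Usage:
# kdd_to_csv algebra_2006_2007_train.txt 1000 > algebra_2006_2007_train_1kFormatted.csv
```

Because the tab is written as an escape rather than as a raw character, this version should survive copying and pasting.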
How to run libSVM
- See the notes at Machine Learning/SVM