KDD Competition 2010

We're interested in working on the KDD Competition as a way to focus our machine learning exploration -- and maybe even find some interesting aspects of the data!  If you're interested, drop us a note or show up at a weekly Machine Learning meeting. We'll use this space to keep track of our ideas.

==Resources==
* [https://pslcdatashop.web.cmu.edu/KDDCup/rules_data_format.jsp KDD Rules and Data Format]
* [http://cran.r-project.org/ R language]
* [http://www.csie.ntu.edu.tw/~cjlin/libsvm/ libsvm]
* [http://www.cs.waikato.ac.nz/ml/weka/ Weka]
* [http://www.kdnuggets.com/datasets/competitions.html List of other competitions in which we could engage]
* [[Machine Learning/Hadoop | Hadoop]]
* [http://lucene.apache.org/mahout/ Mahout -- machine learning libraries for Hadoop]
* [http://www.cloudera.com/videos/introduction_to_pig So-so intro to Pig Video]
* [http://s3.amazonaws.com/awsVideos/AmazonElasticMapReduce/ElasticMapReduce-PigTutorial.html An AWESOME intro to Pig on Elastic Map Reduce!]
* [http://hadoop.apache.org/pig/ Pig language]
* [http://hadoop.apache.org/pig/docs/r0.3.0/piglatin.html Pig Latin Manual]
* [http://www.cloudera.com/ Cloudera -- see videos for Hadoop intro]
* [http://github.com/voberoi/hadoop-mrutils Vikram's awesome Hadoop/EC2 scripts]
* [https://www.noisebridge.net/mailman/listinfo/ml Our mailing list]
* [http://www.s3fox.net/ S3Fox]
* [https://www.noisebridge.net/wiki/Machine_Learning/SVM Thomas' great libSVM writeup]

==TODOs==

* Vikram -- will create a guide for Mahout setup 
* Thomas -- will get libsvm working on the data and put together a "how to" guide for doing so
** put together a [[Machine_Learning/kdd_sample | perl script]] which will take random samples from the data, for working on smaller instances
** put together a [[Machine_Learning/kdd_r | simple R script]] for loading the data
* Andy -- 
* Erin -- will put the meeting notes from 5/19 on https://www.noisebridge.net/wiki/Machine_Learning; will work on data transformations and ways to create better representations of the data; will provide the orthogonalized data sets
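
For anyone who wants a smaller instance of the data without Perl, here is a minimal Python sketch of the sampling idea. It is not the linked perl script: it uses reservoir sampling, keeps the first line as a header, and the function name <code>sample_lines</code> is made up for this example.

```python
import random


def sample_lines(lines, k, seed=None):
    """Return the header line plus up to k data lines chosen uniformly at random.

    `lines` can be any iterable of strings (e.g. an open file), so the
    full training set never has to fit in memory.
    """
    rng = random.Random(seed)
    it = iter(lines)
    header = next(it, None)
    if header is None:
        return []  # empty input

    reservoir = []
    for i, line in enumerate(it):
        if i < k:
            # Fill the reservoir with the first k data lines.
            reservoir.append(line)
        else:
            # Keep the new line with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = line
    return [header] + reservoir


# Usage with in-memory data; for the real file, pass an open file handle.
demo = ["header\n"] + ["row %d\n" % n for n in range(20)]
print("".join(sample_lines(demo, 5, seed=1)))
```

The sampled lines are not returned in their original file order, which should not matter for training but may matter if you want to eyeball the data.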



== Notes ==
* For the KDD submission: on OSX, zip the submission file from the command line; otherwise the upload will complain about a __MACOSX file. E.g.:  zip asdf.zip algebra_2008_2009_submission.txt
* We will need to make sure we don't get disqualified for people belonging to multiple teams! Do not sign up anybody else for the competition without asking first.
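
If you would rather build the zip from a script, Python's standard <code>zipfile</code> module only writes the files you explicitly add, so no __MACOSX entry ends up in the archive. A minimal sketch follows; the function name <code>zip_submission</code> is made up for this example, and the file names are the ones from the note above.

```python
import zipfile
from pathlib import Path


def zip_submission(submission_path, zip_path):
    """Write a zip containing only the submission file; return its entry names."""
    submission = Path(submission_path)
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # arcname stores just the file name, without any directory prefix.
        zf.write(submission, arcname=submission.name)
    # Re-open the archive to double-check what is inside before uploading.
    with zipfile.ZipFile(zip_path) as zf:
        return zf.namelist()


if __name__ == "__main__":
    src = Path("algebra_2008_2009_submission.txt")
    if src.exists():
        print(zip_submission(src, "asdf.zip"))
```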

== Ideas == 
* Add new features by computing their values from existing columns -- e.g. correlation between skills based on their co-occurrence within problems. We could use a decision tree to define the boundaries of, e.g., a new "good student, medium student, bad student" feature.
* Dimensionality reduction -- transform into numerical values appropriate for consumption by SVM
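
To make the co-occurrence idea concrete, here is a rough Python sketch that counts how often two skills show up in the same problem. Raw pair counts are only a stand-in for a proper correlation measure. The input shape (pairs of problem name and skills field) and the <code>~~</code> separator between skills are assumptions; check them against the KDD Rules and Data Format page before using this on the real columns.

```python
from collections import Counter
from itertools import combinations


def skill_cooccurrence(rows, separator="~~"):
    """Count, for each pair of skills, the number of problems containing both.

    rows: iterable of (problem_name, skills_field) pairs, where
    skills_field is a string of skill names joined by `separator`
    (assumed format, see the note above).
    """
    # Collect the set of skills seen in each problem across all its rows.
    skills_by_problem = {}
    for problem, skills_field in rows:
        skills = {s.strip() for s in skills_field.split(separator) if s.strip()}
        skills_by_problem.setdefault(problem, set()).update(skills)

    # Count each unordered pair once per problem.
    pair_counts = Counter()
    for skills in skills_by_problem.values():
        for a, b in combinations(sorted(skills), 2):
            pair_counts[(a, b)] += 1
    return pair_counts


# Toy rows with made-up problem and skill names.
rows = [
    ("P1", "add~~subtract"),
    ("P1", "multiply"),
    ("P2", "add~~multiply"),
    ("P3", ""),
]
for pair, count in sorted(skill_cooccurrence(rows).items()):
    print(pair, count)
```

The resulting counts could be normalized (e.g. by how often each skill appears at all) before being turned into features.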


== Who we are ==
* Andy; Machine Learning
* Thomas; Statistics
* Erin; Maths
* Vikram; Hadoop
(insert your name/contact info/expertise here)


== How to run Weka (quick 'n dirty tutorial) == 
* Download and install Weka
* Get your KDD data
* Preprocess your data: the command below takes the first 1000 lines of the given training data set and converts them into a .csv file.
* Attention: in the last sed command you need to replace the long whitespace with a literal tab character.  In the OSX terminal, you type it by pressing CONTROL+V and then TAB. (Copying and pasting the command below won't work, because the pasted whitespace is plain spaces rather than a tab.)
 head -n 1000 algebra_2006_2007_train.txt | sed -e 's/[",]/ /g' | sed 's/       /,/g' > algebra_2006_2007_train_1kFormatted.csv
* In Weka's Explorer, inspect the dataset and remove any unwanted attributes (which ones is up to your judgment).
* Then you can run an ML algorithm over it, e.g. a neural network, to predict student performance.
* The following screencasts show how to do these steps:
** [http://swarmfinancial.com/screencasts/nb/kddWekaUsage1.swf Screencast1]
** [http://swarmfinancial.com/screencasts/nb/kddWekaUsage2.swf Screencast2]
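
If typing a literal tab into sed is a hassle, the same preprocessing can be sketched in Python, where the tab is written as <code>\t</code>. This mirrors the head/sed pipeline above (first N lines, quotes and commas replaced by spaces, tabs turned into commas); the function names are made up for this example.

```python
import re
from itertools import islice


def to_csv_lines(lines, limit=1000):
    """Convert the first `limit` tab-separated lines into comma-separated ones."""
    out = []
    for line in islice(lines, limit):
        line = line.rstrip("\n")
        # Same as sed 's/[",]/ /g': existing quotes and commas become spaces.
        line = re.sub(r'[",]', " ", line)
        # Same as the second sed command: every tab becomes a comma.
        line = line.replace("\t", ",")
        out.append(line)
    return out


def convert_file(src, dst, limit=1000):
    """Read `src`, write the converted first `limit` lines to `dst`."""
    with open(src) as fin, open(dst, "w") as fout:
        for line in to_csv_lines(fin, limit):
            fout.write(line + "\n")


# Usage on the file from the command above:
# convert_file("algebra_2006_2007_train.txt",
#              "algebra_2006_2007_train_1kFormatted.csv")
```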

== How to run SVM ==
* See the notes at [[Machine Learning/SVM]]