KDD Competition 2010

We're interested in working on the KDD Competition as a way to focus our machine learning exploration -- and maybe even to find some interesting aspects of the data! If you're interested, drop us a note or show up at a weekly Machine Learning meeting. We'll use this space to keep track of our ideas.

Resources

KDD Rules and Data Format
R language
libsvm
Weka
List of other competitions in which we could engage
Hadoop
Mahout -- machine learning libraries for Hadoop
So-so intro to Pig Video
An AWESOME intro to Pig on Elastic Map Reduce!
Pig language
Pig Latin Manual
Cloudera -- see videos for Hadoop intro
Vikram's awesome Hadoop/EC2 scripts
Our mailing list
S3Fox
Importing data into Sqlite for SQL'ing the data
Visualizing Sqlite data in Omniscope for understanding the data

TODOs

Following our decisions:

For the orthogonalized bridge and algebra sets:
- replace the step name with a unique step name
- remove the given features
- add features: step success chance, student IQ, complexity, and perhaps frequency of skills (least important)

These should be fairly straightforward computations, but on big datasets. We will call the resulting datasets "master raw 1 bridge/algebra".

In parallel we can start clustering superskills: given the normalized skills, cluster groups of skills (= superskills) to replace the too-detailed skills. The resulting datasets will be called "master clustered 1 bridge/algebra". These will be our base datasets for the machine learning algorithms.
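As a rough illustration of the feature step, here is a minimal Python sketch. It is not the agreed implementation. It assumes that "step success chance" means the fraction of correct first attempts per unique step, that "student IQ" can be approximated by the same fraction per student, and that a unique step name can be built by joining the problem name and the step name. The column names are assumptions about the training-file header and should be checked against the KDD data format page.

```python
import csv
from collections import defaultdict

# Assumed column names; verify against the actual header line.
STUDENT = "Anon Student Id"
PROBLEM = "Problem Name"
STEP = "Step Name"
CORRECT = "Correct First Attempt"


def unique_step(row):
    """Build a step identifier that is unique across problems (assumed scheme)."""
    return row[PROBLEM] + "|" + row[STEP]


def success_rates(rows):
    """Return (per-step, per-student) fractions of correct first attempts."""
    step_counts = defaultdict(lambda: [0, 0])      # key -> [correct, total]
    student_counts = defaultdict(lambda: [0, 0])
    for row in rows:
        correct = int(row[CORRECT])
        for counts, key in ((step_counts, unique_step(row)),
                            (student_counts, row[STUDENT])):
            counts[key][0] += correct
            counts[key][1] += 1
    step_rate = {k: c / n for k, (c, n) in step_counts.items()}
    student_rate = {k: c / n for k, (c, n) in student_counts.items()}
    return step_rate, student_rate


def read_rows(path):
    """Stream rows from a tab-separated file that starts with a header line."""
    with open(path, newline="") as handle:
        yield from csv.DictReader(handle, delimiter="\t",
                                  quoting=csv.QUOTE_NONE)
```

Usage would look like success_rates(read_rows("algebra_2006_2007_train.txt")). The rows are streamed, so only the per-step and per-student counters are held in memory, which matters on the big datasets.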

Vikram -- will create a guide for Mahout setup
Thomas --
- put together a Perl script which will take random samples from the data, for working on smaller instances
- put together a simple R script for loading the data
Andy -- think about clustering superskills; define features for sub-problems (student IQ, step difficulty)
Erin -- will provide the orthogonalized data sets at next meetup
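
Until the sampling script from the list above exists, something like the following Python sketch could stand in for it. It is not Thomas's script. It assumes the data file is plain text with one header line, keeps that header, and keeps each remaining line with a fixed probability. The function name and parameters are made up for illustration.

```python
import random


def sample_lines(in_path, out_path, fraction, seed=0):
    """Copy the header plus a random fraction of data lines; return lines kept."""
    rng = random.Random(seed)  # fixed seed so a sample can be reproduced
    kept = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.write(src.readline())  # always keep the header line
        for line in src:
            if rng.random() < fraction:
                dst.write(line)
                kept += 1
    return kept
```

For example, sample_lines("algebra_2006_2007_train.txt", "sample.txt", 0.01) would keep roughly 1% of the rows. The file is read line by line, so the full data set never has to fit in memory.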

Notes

For the KDD submission: on OS X, zip the submission file from the command line. Otherwise the archive picks up a __MACOSX entry and the upload will complain about it. For example: zip asdf.zip algebra_2008_2009_submission.txt
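If you would rather build the archive from a script, Python's standard zipfile module writes only the files you name, so no __MACOSX entry is added. This is a sketch, and the function name is made up for illustration.

```python
import os
import zipfile


def zip_submission(txt_path, zip_path):
    """Write a zip archive holding only the submission file, stored by base name."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.write(txt_path, arcname=os.path.basename(txt_path))
    return zip_path
```

For example: zip_submission("algebra_2008_2009_submission.txt", "asdf.zip").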
We will need to make sure we don't get disqualified for people belonging to multiple teams! Do not sign up anybody else for the competition without asking first.

Ideas

Add new features by computing their values from existing columns -- e.g. correlation between skills based on their co-occurrence within problems. A decision tree could be used to define the boundaries of a new feature such as "good student, medium student, bad student".
Dimensionality reduction -- transform into numerical values appropriate for consumption by SVM
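
As a starting point for the skill co-occurrence idea, the Python sketch below counts, for each pair of skills, how many problems contain both. These are raw counts, not yet a correlation. The column names "Problem Name" and "KC(SubSkills)" and the "~~" separator between skills are assumptions about the data format and should be checked against the actual files.

```python
from collections import Counter
from itertools import combinations


def skill_cooccurrence(rows, problem_col="Problem Name",
                       skill_col="KC(SubSkills)", sep="~~"):
    """Count, for each pair of skills, the number of problems containing both."""
    skills_by_problem = {}
    for row in rows:
        # Rows without any skill label contribute an empty set.
        skills = {s for s in (row.get(skill_col) or "").split(sep) if s}
        skills_by_problem.setdefault(row[problem_col], set()).update(skills)
    pairs = Counter()
    for skills in skills_by_problem.values():
        # Sorting makes each pair appear in one canonical order.
        pairs.update(combinations(sorted(skills), 2))
    return pairs
```

The resulting counts could be normalized by how often each skill occurs on its own, or used directly as a similarity measure when clustering skills into superskills.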

Who we are

Andy; Machine Learning
Thomas; Statistics
Erin; Maths
Vikram; Hadoop

(insert your name/contact info/expertise here)

How to run Weka (quick 'n very dirty tutorial)

Download and install Weka
Get your KDD data & preprocess your data:

This command takes the first 1000 lines of the given training data set and converts them into a .csv file. Attention: in the last sed command you need to replace the long whitespace with a tab. In the OS X terminal, you do that by pressing CONTROL+V and then TAB. (Copying and pasting the command below won't work, since the pasted whitespace is interpreted as spaces.)

head -n 1000 algebra_2006_2007_train.txt | sed -e 's/[",]/ /g' | sed 's/       /,/g' > algebra_2006_2007_train_1kFormatted.csv
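
If typing the literal tab is a nuisance, the same conversion can be sketched in Python. This is an alternative to the command above, not part of the original tutorial. It assumes the input is tab-separated and, like the first sed call, replaces quotes and commas inside fields with spaces. The function name is made up for illustration.

```python
import csv
from itertools import islice


def head_to_csv(in_path, out_path, n_lines=1000):
    """Convert the first n_lines of a tab-separated file to CSV; return lines written."""
    written = 0
    with open(in_path, newline="") as src, \
            open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for line in islice(src, n_lines):
            fields = line.rstrip("\r\n").split("\t")
            # Like the first sed call: blank out quotes and commas in fields.
            writer.writerow([f.replace('"', " ").replace(",", " ")
                             for f in fields])
            written += 1
    return written
```

For example: head_to_csv("algebra_2006_2007_train.txt", "algebra_2006_2007_train_1kFormatted.csv").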

The following screencasts show you how to do these steps:
In Weka's Explorer, remove some unwanted attributes (I leave this up to your judgment), inspect the dataset.
Then you can run a ML algorithm over it, e.g. Neural Networks to predict the student performance.
Screencast1
Screencast2

How to run libSVM

See the notes at Machine Learning/SVM
