Revision as of 20:29, 7 January 2009
Hands-on Machine Learning
The goal of this presentation is to give a very applied, hands-on introduction to a range of Machine Learning techniques. There will be a quick discussion of what kinds of data you can have, what kinds of tasks machine learning techniques allow you to perform, and, finally, a survey of common techniques. By the end, if you know what kinds of data you have and what your goal is, you should be able to narrow things down to a short list of techniques appropriate for your task.
Ideally, each technique will include a hands-on section for just that technique. This should cover any tools that implement the algorithm, how to get your data into those tools, and how to extract the model from those tools and incorporate it into your own code.
The General Machine Learning Process
In general, applying machine learning techniques goes something like this:
- Collect your data
- Import that data into some training tool
- Train a "model" on that data
- Tweak your data, tweak model parameters, etc., and repeat training
- Eventually you get an output model
- Take that model and integrate it with your codebase
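To make the loop above concrete, here is a toy end-to-end run in plain Python, using a trivial 1-nearest-neighbour "model" so no external training tool is needed. The data and function names are made up purely for illustration:

```python
# A toy end-to-end pass through the process above, using a trivial
# 1-nearest-neighbour "model" so no external tools are needed.

def train(rows):
    """'Training' here just memorises the labelled rows."""
    return list(rows)  # the model is the data itself

def predict(model, x):
    """Label a new value with the label of the closest training value."""
    nearest = min(model, key=lambda row: abs(row[0] - x))
    return nearest[1]

# 1-2. Collect and import the data: (weight in grams, fruit label)
data = [(120, "apple"), (150, "apple"), (900, "melon"), (1100, "melon")]

# 3-5. Train the model (the tweak-and-repeat step is skipped in this toy case)
model = train(data)

# 6. Integrate: call the model from your own code
print(predict(model, 140))   # -> apple
print(predict(model, 1000))  # -> melon
```

With a real technique the "model" would be something learned (a tree, weights, probabilities) rather than the raw data, but the surrounding workflow is the same.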
Input Data
There are basically just two sorts of input data: nominal and numeric.
Nominal values are things like "Red", "Orange", "Poltergeist", etc. They're a closed set of discrete values, typically strings instead of numbers. However, you can use numbers as a nominal value, so long as they're from a closed set. These are usually from human judgment: there's not necessarily a good line between red and orange, but a human makes the call and records the value. Alternately, a human decides where the arbitrary line is, and writes a bit of code that makes the decision based on that line.
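The "arbitrary line" approach can be sketched in a couple of lines of Python; the hue scale and the 0.55 cutoff here are invented for illustration:

```python
# Turn a numeric hue reading into a nominal Red/Orange value with a
# hand-picked threshold. The 0.55 cutoff is arbitrary: a human chose it.
RED_ORANGE_CUTOFF = 0.55

def hue_to_colour(hue):
    return "Red" if hue < RED_ORANGE_CUTOFF else "Orange"

print(hue_to_colour(0.3))  # -> Red
print(hue_to_colour(0.7))  # -> Orange
```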
Numeric values are exactly what they sound like: numbers! They can take many, many values, and are generally based on direct measurements. For example, the weight of a fruit a robot is holding, or the number of occurrences of a given word in a document.
Machine Learning Tasks
Generally speaking, you can ask a Machine Learning algorithm to do one of three things:
Numeric prediction
Description: "Given the input you've seen in the past, and this set of current values, what value should I expect?"
A trivial example: if you have a dataset that's pairs of (Yesterday's high temperature, today's high temperature), you could train a numeric predictor that would give you an estimate of today's high temperature given yesterday's. Or, given the highs for the last week, it could predict the highs for the next 3 days.
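The temperature example can be sketched as a simple least-squares line fit in plain Python; the temperatures below are made up:

```python
# Fit a straight line to (yesterday's high, today's high) pairs, then
# use it to predict today's high from yesterday's. Data is invented.

pairs = [(20.0, 21.0), (21.0, 22.5), (25.0, 24.0), (18.0, 19.0)]

n = len(pairs)
mean_x = sum(x for x, _ in pairs) / n
mean_y = sum(y for _, y in pairs) / n

# Standard least-squares slope and intercept
slope = (sum((x - mean_x) * (y - mean_y) for x, y in pairs)
         / sum((x - mean_x) ** 2 for x, _ in pairs))
intercept = mean_y - slope * mean_x

def predict_high(yesterday):
    return slope * yesterday + intercept

print(round(predict_high(22.0), 1))  # -> 22.3
```

Predicting several days ahead, as in the second example, just means widening the input from one value to a window of recent values.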
Labelling/Classification
Description: "Given the input you've seen in the past, and this set of current values, what would you label this?"
The best-known example of this is the spam filter. After labelling a bunch of email as spam or not-spam, you train a classifier. Then, you can use that classifier on new email to decide whether the machine believes it to be spam or not. Of course, this generalizes, and you can use exactly the same technique to separate personal, work, and hobby email.
Clustering
Description: "Given the input you've seen in the past, and this set of current values, what previous inputs is it most like?"
This task is very similar to classification, with one big difference: you don't have labels. For example, if you have a bunch of measurements of flowers, you can use clustering to discover if there are underlying patterns you've missed out on, perhaps representing growing conditions or a difference in (sub)species.
Specific Techniques
Decision Trees
Task: Labelling
Input data types: nominal (or numeric, with conditionals)
Description: A decision tree is something like a flow chart. It's a tree of decision boxes; you start at the root and, based on your data, follow decisions down to leaf nodes. At the leaf nodes, you'll typically have a label.
Training
Get your data into Weka Explorer by hook or crook, then, on the Classify tab, choose trees -> J48 as the classifier. Select the nominal attribute you want to use as your label in the dropdown. Make sure you've got cross-validation selected, ideally 10-fold or so.
Hit "Start" and stand back. You'll get output like:
(using the iris.arff sample data)

=== Run information ===

Scheme:       weka.classifiers.trees.J48 -C 0.25 -M 2
Relation:     iris
Instances:    150
Attributes:   5
              sepallength
              sepalwidth
              petallength
              petalwidth
              class
Test mode:    10-fold cross-validation

=== Classifier model (full training set) ===

J48 pruned tree
------------------

petalwidth <= 0.6: Iris-setosa (50.0)
petalwidth > 0.6
|   petalwidth <= 1.7
|   |   petallength <= 4.9: Iris-versicolor (48.0/1.0)
|   |   petallength > 4.9
|   |   |   petalwidth <= 1.5: Iris-virginica (3.0)
|   |   |   petalwidth > 1.5: Iris-versicolor (3.0/1.0)
|   petalwidth > 1.7: Iris-virginica (46.0/1.0)

Number of Leaves  : 5

Size of the tree  : 9

Time taken to build model: 0.03 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances         144               96      %
Incorrectly Classified Instances         6                4      %
Kappa statistic                          0.94
Mean absolute error                      0.035
Root mean squared error                  0.1586
Relative absolute error                  7.8705 %
Root relative squared error             33.6353 %
Total Number of Instances              150

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall   F-Measure   Class
 0.98      0         1           0.98     0.99       Iris-setosa
 0.94      0.03      0.94        0.94     0.94       Iris-versicolor
 0.96      0.03      0.941       0.96     0.95       Iris-virginica

=== Confusion Matrix ===

  a  b  c   <-- classified as
 49  1  0 |  a = Iris-setosa
  0 47  3 |  b = Iris-versicolor
  0  2 48 |  c = Iris-virginica
Evaluation
If you train as above, you're using 10-fold cross-validation, which is a reasonably good evaluation of your training set. Otherwise, the normal evaluation of labelling algorithms can be used.
Application
To apply the values you've got out of the above, you want to turn this section
petalwidth <= 0.6: Iris-setosa (50.0)
petalwidth > 0.6
|   petalwidth <= 1.7
|   |   petallength <= 4.9: Iris-versicolor (48.0/1.0)
|   |   petallength > 4.9
|   |   |   petalwidth <= 1.5: Iris-virginica (3.0)
|   |   |   petalwidth > 1.5: Iris-versicolor (3.0/1.0)
|   petalwidth > 1.7: Iris-virginica (46.0/1.0)
into code in whatever language you use. This is, unfortunately, a manual process. However, the result is very concise and compact in terms of code, so you can run these decision trees on any computer, no matter how big or small.
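As one possible hand translation, the J48 tree above becomes a nested conditional; here it is sketched in Python:

```python
# Hand translation of the J48 pruned tree from the Weka output above.
# Each branch of the tree becomes an if/return.
def classify_iris(petallength, petalwidth):
    if petalwidth <= 0.6:
        return "Iris-setosa"
    if petalwidth <= 1.7:
        if petallength <= 4.9:
            return "Iris-versicolor"
        if petalwidth <= 1.5:
            return "Iris-virginica"
        return "Iris-versicolor"
    return "Iris-virginica"

print(classify_iris(1.4, 0.2))  # -> Iris-setosa
print(classify_iris(4.5, 1.4))  # -> Iris-versicolor
print(classify_iris(5.5, 2.1))  # -> Iris-virginica
```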
Naive Bayes Classifier
Task: Labelling
Input data types: nominal
Description: Naive Bayes is a statistical technique for predicting the probability of every label given a set of inputs. For instance, let's assume we've trained a naive Bayes system on (color, kind of fruit) pairs. Then, we can ask it for the probability distribution of "kind of fruit" given the color "yellow." This will tell us that it's almost certainly a banana or lemon, but it could be an apple, and might occasionally be an orange, etc. That is, it returns a list of labels, each with an associated probability.
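The fruit-colour example can be sketched from raw counts using Bayes' rule. With only one feature the "naive" independence assumption doesn't yet come into play (it matters once you combine several features); the pairs below are invented:

```python
# Toy Bayes classifier over (color, fruit) pairs:
# P(fruit | color) is proportional to P(fruit) * P(color | fruit),
# with both factors estimated from simple counts.
from collections import Counter

pairs = [("yellow", "banana"), ("yellow", "banana"), ("yellow", "lemon"),
         ("yellow", "apple"), ("red", "apple"), ("green", "apple"),
         ("orange", "orange")]

fruit_counts = Counter(fruit for _, fruit in pairs)
pair_counts = Counter(pairs)
total = len(pairs)

def fruit_distribution(color):
    """Return {fruit: P(fruit | color)}, normalised to sum to 1."""
    scores = {}
    for fruit, count in fruit_counts.items():
        prior = count / total                             # P(fruit)
        likelihood = pair_counts[(color, fruit)] / count  # P(color | fruit)
        scores[fruit] = prior * likelihood
    norm = sum(scores.values()) or 1.0
    return {f: s / norm for f, s in scores.items()}

print(fruit_distribution("yellow"))  # banana most likely
```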
Training:
Evaluation:
Application:
Support Vector Machines
Task: Labelling
Input data types: numeric
Description: Support Vector Machines work by finding hyperplanes ("lines" in high dimensions) that best separate data points by label. Their input is labelled points in a high-dimensional space.
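Whatever tool does the training, applying a trained linear SVM is just a dot product: the separating hyperplane is a weight vector w and an offset b, and the predicted label is the sign of w·x + b. The numbers below are made up to illustrate the shape of the computation:

```python
# Applying an already-trained linear SVM: label = sign(w.x + b).
# The weights here are invented; a real w and b come out of training.

w = [1.5, -2.0]   # hyperplane normal (from training)
b = 0.25          # offset (from training)

def svm_label(x):
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return "positive" if score >= 0 else "negative"

print(svm_label([2.0, 1.0]))   # -> positive
print(svm_label([0.0, 1.0]))   # -> negative
```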
Training: Common implementations include libsvm and svmlight.
Evaluation:
Application:
Polynomial Regression
Task: Numeric Prediction
Input data types: numeric
Description: This isn't technically machine learning; it's a classical statistical fitting technique. But it's often a good first thing to try as a baseline.
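A minimal sketch, assuming NumPy is available (the data below is invented to follow y = 2x² + 1):

```python
# Polynomial regression as a baseline: least-squares fit of a quadratic.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 9.0, 19.0, 33.0])  # exactly 2x^2 + 1

coeffs = np.polyfit(x, y, deg=2)  # least-squares polynomial coefficients
model = np.poly1d(coeffs)         # callable polynomial

print(model(5.0))  # prediction for an unseen x; ~51 for this data
```

Because the fake data is exactly quadratic, the fit recovers the generating polynomial; on real, noisy data you would compare residuals across a few degrees and keep the smallest degree that fits acceptably.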
Training:
Evaluation:
Application:
Neural Networks
Task: Numeric Prediction
Input data types: numeric
Description: A neural network allows you to predict a number of continuous numeric values based on other continuous values. "A Neural Network is the second best way to solve any problem."
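To show the core idea in miniature, here is a single "neuron" (one weight and one bias) fitted by gradient descent; real networks stack many such units with nonlinearities between them. The data is invented to follow y = 2x + 1:

```python
# One-neuron sketch of neural-network training: adjust a weight and
# bias by gradient descent on squared error. Data follows y = 2x + 1.

data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = 0.0, 0.0
rate = 0.05  # learning rate

for _ in range(2000):          # many small passes over the data
    for x, y in data:
        err = (w * x + b) - y  # prediction error
        w -= rate * err * x    # gradient of squared error w.r.t. w
        b -= rate * err        # gradient w.r.t. b

print(round(w, 2), round(b, 2))  # close to 2 and 1
```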
Training:
Evaluation:
Application:
k-Means Clustering
Task: Clustering
Input data types: numeric or nominal
Description: k-Means clustering takes a set of feature vectors and partitions them into k groups, assigning each vector to the group whose centre it is closest to. In a fruit-market universe, this will cluster all the "round, red, dense" things together, separate from the "orange, round, dense" things.
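The algorithm itself is short enough to sketch in plain Python: alternate between assigning each point to its nearest mean and moving each mean to the centre of its assigned points. The (hue, diameter) data and the starting means below are invented:

```python
# Pure-Python k-means (k = 2) on made-up (hue, diameter) fruit features.

def kmeans(points, means, rounds=10):
    clusters = [[] for _ in means]
    for _ in range(rounds):
        # Assignment step: attach each point to its nearest mean
        clusters = [[] for _ in means]
        for p in points:
            i = min(range(len(means)),
                    key=lambda j: sum((a - b) ** 2
                                      for a, b in zip(p, means[j])))
            clusters[i].append(p)
        # Update step: move each mean to the centre of its cluster
        means = [tuple(sum(c) / len(pts) for c in zip(*pts)) if pts else m
                 for m, pts in zip(means, clusters)]
    return means, clusters

# Two "red, small" fruit and two "orange, larger" fruit
points = [(0.0, 7.0), (0.1, 8.0), (0.9, 9.5), (1.0, 10.0)]
means, clusters = kmeans(points, means=[(0.0, 7.0), (1.0, 10.0)])
print(clusters)  # the two small-red points end up separate from the rest
```

Real implementations add smarter initialisation and a convergence check instead of a fixed round count, but the two alternating steps are the whole algorithm.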
Training:
Evaluation:
Application:
Technique
Task:
Input data types:
Description:
Training:
Evaluation:
Application: