Machine Learning/weka

From Noisebridge

Here are the Weka commands I used for discretization, obfuscation (to reduce file size), and classification of the KDD set. Note: I'm running on a latest-gen MacBook that I've overclocked, with 8GB RAM; that much memory was needed (-Xms4096m -Xmx8192m) during processing, even for the obfuscated files.

Discretize: the class needs to be unset temporarily so that the class attribute is treated the same as all other attributes. Not all filters support this, and the ones that don't are painful to apply. It's a small detail, but it makes Weka much less usable in many cases.

java -Xms4096m -Xmx8192m -cp weka.jar \
  weka.filters.unsupervised.attribute.Discretize \
  -unset-class-temporarily -F -B 10 -i aUnified.csv -o aUnifiedDiscretized.csv
 (To see the command-line options, run: java -cp weka.jar weka.filters.unsupervised.attribute.Discretize -h)
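
To make the -F -B 10 options a bit more concrete: -F asks for equal-frequency rather than equal-width binning, and -B sets the number of bins. The sketch below is plain Java and is not Weka's implementation; the class EqualFrequencyBins, its method names, and the sample values are made up for illustration, and ties between equal values are not handled specially. It only shows the idea of choosing cut points so that each bin receives roughly the same number of values.

```java
import java.util.Arrays;

// Hypothetical illustration of equal-frequency binning; not Weka's code.
final class EqualFrequencyBins {

    // Returns numBins - 1 cut points chosen so that each bin holds roughly
    // the same number of the given values.
    static double[] cutPoints(double[] values, int numBins) {
        if (numBins < 2 || values.length < numBins) {
            throw new IllegalArgumentException("need numBins >= 2 and at least numBins values");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        double[] cuts = new double[numBins - 1];
        for (int i = 1; i < numBins; i++) {
            // Index of the first sorted value that belongs to bin i.
            int idx = (int) ((long) i * sorted.length / numBins);
            // Place the cut halfway between the two neighbouring values.
            cuts[i - 1] = (sorted[idx - 1] + sorted[idx]) / 2.0;
        }
        return cuts;
    }

    // Maps a value to its bin index in the range 0..cuts.length.
    static int binOf(double value, double[] cuts) {
        int bin = 0;
        while (bin < cuts.length && value > cuts[bin]) {
            bin++;
        }
        return bin;
    }

    public static void main(String[] args) {
        // Made-up, heavily skewed sample values.
        double[] values = {1, 2, 3, 4, 100, 200, 300, 4000};
        double[] cuts = cutPoints(values, 4);
        System.out.println("cut points: " + Arrays.toString(cuts));
        for (double v : values) {
            System.out.println(v + " -> bin " + binOf(v, cuts));
        }
    }
}
```

With skewed data like the sample above, equal-width bins would put almost everything into the first bin, whereas equal-frequency bins spread the values evenly, which is presumably why -F was chosen here.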

Obfuscating:

java -Xms4096m -Xmx8192m -cp weka.jar \
  weka.filters.unsupervised.attribute.Obfuscate \
  -i atest.arff -o atestObf.arff
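
As far as I understand it, the Obfuscate filter renames the relation, the attribute names, and the nominal values, which is why the files shrink when the original labels are long. The following is a rough plain-Java sketch of that renaming idea, not Weka's code; the class LabelObfuscator, the "V" prefix, and the sample labels are all made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of label renaming; not Weka's Obfuscate code.
final class LabelObfuscator {
    // Remembers which short code was assigned to which original label.
    private final Map<String, String> codes = new LinkedHashMap<>();
    private final String prefix;

    LabelObfuscator(String prefix) {
        this.prefix = prefix;
    }

    // Returns the short code for a label, assigning the next free code
    // the first time the label is seen.
    String encode(String label) {
        String code = codes.get(label);
        if (code == null) {
            code = prefix + (codes.size() + 1);
            codes.put(label, code);
        }
        return code;
    }

    public static void main(String[] args) {
        LabelObfuscator obfuscator = new LabelObfuscator("V");
        // Made-up labels; repeated labels map to the same short code.
        String[] labels = {"some_long_label", "another_long_label", "some_long_label"};
        for (String label : labels) {
            System.out.println(label + " -> " + obfuscator.encode(label));
        }
    }
}
```

Because the same label always maps to the same code, the obfuscated data carries the same information for the classifier, just with shorter strings.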

Classification: this is the command I tried for producing predictions, but I wasn't able to get the labels for the test data:

java -Xms4096m -Xmx8192m -cp weka.jar \
  weka.classifiers.bayes.NaiveBayesUpdateable \
  -t atrainObf.arff -T atestObf.arff -p last > aOut1.txt

So instead I wrote a small Java class to do this. It uses an updateable classifier and loads the file one instance at a time, so the data fits into memory.


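       import java.io.File;
       import java.io.FileWriter;
       import java.util.logging.Logger;

       import weka.classifiers.bayes.NaiveBayesUpdateable;
       import weka.core.Instance;
       import weka.core.Instances;
       import weka.core.converters.ArffLoader;

       // The class name and the java.util.logging logger are placeholders;
       // the original snippet only showed the method body.
       public class KddNaiveBayes {
           private static final Logger log = Logger.getLogger(KddNaiveBayes.class.getName());

           public static void main(String[] args) throws Exception {
               log.info("Loading data...");
               NaiveBayesUpdateable nb;
               {
                   ArffLoader loader = new ArffLoader();
                   loader.setFile(new File("atrain.arff"));
                   Instances structure = loader.getStructure();
                   // the class attribute is the last one
                   structure.setClassIndex(structure.numAttributes() - 1);

                   // train NaiveBayes incrementally, one instance at a time
                   nb = new NaiveBayesUpdateable();
                   nb.buildClassifier(structure);
                   Instance current;
                   while ((current = loader.getNextInstance(structure)) != null) {
                       nb.updateClassifier(current);
                   }
               }

               log.info("Now classifying...");
               // opened in append mode, so predictions from earlier runs are kept
               try (FileWriter fw = new FileWriter("aPredictions.txt", true)) {
                   ArffLoader loader = new ArffLoader();
                   loader.setFile(new File("atest.arff"));
                   Instances structure = loader.getStructure();
                   structure.setClassIndex(structure.numAttributes() - 1);

                   // classify using NaiveBayes
                   Instance current;
                   while ((current = loader.getNextInstance(structure)) != null) {
                       double clsLabel = nb.classifyInstance(current);
                       double[] distribution = nb.distributionForInstance(current);
                       // Here I tried to cap the probability predictions at mean +- one
                       // standard deviation of the iq (iq mean: 0.86, standard dev 0.06,
                       // which gives the 0.80 and 0.92 bounds); could instead also just
                       // predict the distribution[1] value.
                       double estimate = Math.max(Math.min(distribution[1], 0.92d), 0.80d);
                       log.info("ClassLabel: " + clsLabel + ", estimate: " + estimate);
                       fw.write("" + estimate + "\r\n");
                   }
               }
           }
       }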
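
The capping step in the code above can be looked at in isolation. This is a minimal plain-Java sketch, assuming the "standard dev 6" note means 0.06 (which matches the 0.80 and 0.92 bounds used in the code); the class ProbabilityClamp and the method clamp are hypothetical names, not part of Weka.

```java
// Hypothetical helper illustrating the capping step; not part of Weka.
final class ProbabilityClamp {

    // Restricts p to the closed interval [mean - sd, mean + sd].
    static double clamp(double p, double mean, double sd) {
        double lower = mean - sd;
        double upper = mean + sd;
        return Math.max(lower, Math.min(upper, p));
    }

    public static void main(String[] args) {
        double mean = 0.86; // iq mean from the note above
        double sd = 0.06;   // assumed reading of "standard dev 6"
        // One value above the band, one inside it, one below it.
        for (double p : new double[] {0.99, 0.85, 0.10}) {
            System.out.println(p + " -> " + clamp(p, mean, sd));
        }
    }
}
```

Values inside the band pass through unchanged, while values outside it are pulled to the nearest bound. Whether this beats writing out distribution[1] directly is something the original note leaves open.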