Machine Learning/Kaggle Social Network Contest: Difference between revisions

From Noisebridge
Jump to navigation Jump to search
 
(6 intermediate revisions by 2 users not shown)
Line 1: Line 1:
== Official Contest Links ==
* Overview: http://kaggle.com/socialNetwork
* Data Details: http://kaggle.com/socialNetwork?viewtype=data (login required)
== Official Data Downloads ==
* Official Training Data File: http://dl.dropbox.com/u/14895843/social-network-kaggle/social_train.csv
* Official Test Data File: http://dl.dropbox.com/u/14895843/social-network-kaggle/social_test.txt
* Official Sample Submission File: http://dl.dropbox.com/u/14895843/social-network-kaggle/sample_submission.csv
== Status ==
== Status ==
{| border="1" cellspacing="0" cellpadding="2"
{| border="1" cellspacing="0" cellpadding="2"
Line 15: Line 6:
!| target date
!| target date
!|subpage
!|subpage
|-
!|Lit review
| started
| -
| [[/lit review| Lit Review]]
|-
|-
|-
!|Load data
!|Load data
Line 20: Line 17:
| 11/24
| 11/24
| [[/load data| Load data]]
| [[/load data| Load data]]
|-
!|Describe network
| started
| 11/24
| [[/Network Description| Network Description]]
|-
|-
!|Choose problem representation
!|Choose problem representation
Line 41: Line 43:
| [[/what to do with all the money | Prize Plan]]
| [[/what to do with all the money | Prize Plan]]
|}
|}
== Official Contest Links ==
* Overview: http://kaggle.com/socialNetwork
* Data Details: http://kaggle.com/socialNetwork?viewtype=data (login required)
== Official Data Downloads ==
* Official Training Data File: http://dl.dropbox.com/u/14895843/social-network-kaggle/social_train.csv
* Official Test Data File: http://dl.dropbox.com/u/14895843/social-network-kaggle/social_test.txt
* Official Sample Submission File: http://dl.dropbox.com/u/14895843/social-network-kaggle/sample_submission.csv


== Key Contest Info ==
== Key Contest Info ==
Line 57: Line 68:
Don’t despair if your first couple of solutions score low, this is an explorative process.
Don’t despair if your first couple of solutions score low, this is an explorative process.


== Brainstorming on Process ==
== Our Working Data Dumps ==
* We shouldn't have a single approach to solving the problem.  If people have ideas they should run with them and report back their success/failure to the group.  The collaboration between our diverse ideas/approaches/experiences will be our strength in working together.
* Since this is throw away code for this competition only, we need not get hung up on efficiency or elegant implementations.  That said, if we hit a point where our code is not able to perform fast enough then we can address it at that point, instead of overengineering from the get-go.
* Theo suggested that we start by using things like python/ruby scripts to massage the starting data set into something more useful (with more features), then analyse and visualize that using things like R.
* Jared was wondering if people think it's legit to use the mailing list for discussion or if we should create a discussion list for the competition to prevent from spamming the main list with competition collboration? (Update: Maybe we can use wiki instead?)
* Also, as we transform the dataset into different views, we are going to end up with some large files that we will be passing around to each other.  Any suggestions on how to best do that? Jared has been using Dropbox (see dumps below).
 
== Brainstorming on Strategy ==
* The dataset forms a graph of directed edges between vertices.  At the core of this problem will performing analysis on that graph.  The first intuitive approach we had come to mind was that the shorter the distance between two vertices using existing edges, the more likely it would be that an edge could/should exist between those vertices.
* After the talk, Erin, Theo, and jared stumbled on the idea that some vertices might be uber-followers (meaning more outbound edges than the average vertex) and that some vertices might be uber-followees (meaning more inbound edges than average).  This reminded us of PageRank for link graphs, so perhaps we can draw from techniques in that vein. The application of this in our problem, might be in weighting. For example, people who follow lots of people might be more likely to follow someone further out in their "network", where someone who doesn't follow many people might less likely to follow someone outside their "network".
* Since the edges are directional, we know that it's possible for people to "follow" someone with out that person "following back".  At first glance it might make sense that the reverse edges would be likely in cases like this.  However consider a "hub" user with lots of followers who doesn't reciprocate with edges back to his followers, then the information of who follows him is less important in determining who he would follow.  Conversely, for a user who commonly reciprocates with followbacks, then the information on who follows her might be useful in suggesting who she follow.
 
== Useful Links ==
* http://en.wikipedia.org/wiki/Graph_theory#Graph-theoretic_data_structures
* http://en.wikipedia.org/wiki/Glossary_of_graph_theory
* On calculating in & outdegrees: [[http://books.google.com/books?id=CAm2DpIqRUIC&lpg=PA163&ots=HtNuxg3DOf&dq=calculate%20degree%20in%20%22directed%20graph%22%20OR%20digraph&pg=PA163#v=onepage&q=calculate%20degree%20in%20%22directed%20graph%22%20OR%20digraph&f=false|Google Book on Social Network Analysis]]
* [[http://books.google.com/books?id=Ww3_bKcz6kgC&lpg=PA67&ots=aFSGYEjA_g&dq=calculate%20degree%20in%20%22directed%20graph%22%20OR%20digraph&pg=PP7#v=onepage&q&f=false | Another Google Book on Social Network Analysis]]
* Matrix Digraph Algs: http://www.personal.kent.edu/~rmuhamma/Algorithms/MyAlgorithms/GraphAlgor/graphIntro.htm
* "Strongly Connected Components": http://www.personal.kent.edu/~rmuhamma/Algorithms/MyAlgorithms/GraphAlgor/strongComponent.htm
 
== Working Data Dumps ==
* Adjacency list based from the training data:
* Adjacency list based from the training data:
   http://dl.dropbox.com/u/14895843/social-network-kaggle/adj_list.out.csv  
   http://dl.dropbox.com/u/14895843/social-network-kaggle/adj_list.out.csv  
Line 87: Line 78:
   First column: inbound vertex
   First column: inbound vertex
   Remaining columns: list of vertices which point to it
   Remaining columns: list of vertices which point to it
   Note: This is useful if interested in following the edges backwards quickly. This is useful to load as a hashtable keyed on inbound vertex returning the list.
   Note: This is useful if interested in following the edges backwards quickly.
        This is useful to load as a hashtable keyed on inbound vertex returning the list.
* Degree Features for all Nodes:
  http://dl.dropbox.com/u/14895843/social-network-kaggle/node_degree_features.csv
  First column: Node Id
  Second column: Outbound Degree (count of the number of outbound edges from node)
  Third column: Inbound Degree (count of the number of inbound edges to node)
  Note: You can think of these as number of followees and followers (respectively).
        Additionally, note that only the first 32.7k rows have 'followees'


== Possible Features ==
== Useful Links ==
*nodeid
* [[http://arxiv.org/abs/1011.4071 | Supervised Random Walks: Predicting and Recommending Links in Social Networks]]
*nodetofollowid
* Matrix Digraph Algs: http://www.personal.kent.edu/~rmuhamma/Algorithms/MyAlgorithms/GraphAlgor/graphIntro.htm
*median path length
* "Strongly Connected Components": http://www.personal.kent.edu/~rmuhamma/Algorithms/MyAlgorithms/GraphAlgor/strongComponent.htm
*shortest distance from nodeid to nodetofollowid
* http://en.wikipedia.org/wiki/Graph_theory#Graph-theoretic_data_structures
*inbound edges
* http://en.wikipedia.org/wiki/Glossary_of_graph_theory
*outbound edges
* [[http://books.google.com/books?id=Ww3_bKcz6kgC&lpg=PA67&ots=aFSGYEjA_g&dq=calculate%20degree%20in%20%22directed%20graph%22%20OR%20digraph&pg=PP7#v=onepage&q&f=false | Another Google Book on Social Network Analysis]]
*clustering coefficient
*reciprocation probability (num of edges returned / num of outbound edges)
 
The response variable is the probability that the nodeid to nodetofollowid edge will be created in the future

Latest revision as of 19:55, 22 November 2010

Status[edit]

Tasks Status target date subpage
Lit review started - Lit Review
Load data started 11/24 Load data
Describe network started 11/24 Network Description
Choose problem representation started 11/24 Problem Representation
Generate candidate features 0% 11/24 Features
fit to model 0% Model
Win competition 0% Prize Plan

Official Contest Links[edit]

Official Data Downloads[edit]

Key Contest Info[edit]

The data has been downloaded using the API of a social network. There are 7.2m contacts/edges of 38k users/nodes. These have been drawn randomly ensuring a certain level of closedness.

You are given 7,237,983 contacts/edges from a social network (social_train.zip). The first column is the outbound node and the second column is the inbound node. The ids have been encoded so that the users are anonymous. Ids reach from 1 to 1,133,547.

There are 37,689 outbound nodes and 1,133,518 inbound nodes. Most outbound nodes are also inbound nodes so that the total number of unique nodes is 1,133,547.

The way the contacts were sampled makes sure that the universe is roughly closed. Note that not every relationship is mutual.

The test dataset contains 8,960 edges from 8,960 unique outbound nodes (social_test.csv). Of those 4,480 are true and 4,480 are false edges. You are tasked to predict which are true (1) and which are false (0). You need to supply back a file with outbound node id,inbound node id,[0,1] in each row. This means you can assign a probability of being true to an edge. You are being scored on the AUC. A random model will have an AUC of 0.5, so you need to try to do better than that (ie have a higher AUC). Your entry should conform to the format in sample_submission.csv.

You are encouraged to explore techniques which explain the social network/graph. The best entrant should try to explain his approach/method to other users.

Don’t despair if your first couple of solutions score low, this is an explorative process.

Our Working Data Dumps[edit]

  • Adjacency list based from the training data:
  http://dl.dropbox.com/u/14895843/social-network-kaggle/adj_list.out.csv 
  First column: outbound vertex
  Remaining columns: list of vertices to which it points
  Note: Useful when loaded up as a hashtable keyed on outbound vertex returning the list.
  • Adjacency list of the reversed Graph:
  http://dl.dropbox.com/u/14895843/social-network-kaggle/reverse_adj_list.out.csv
  First column: inbound vertex
  Remaining columns: list of vertices which point to it
  Note: This is useful if interested in following the edges backwards quickly.
        This is useful to load as a hashtable keyed on inbound vertex returning the list.
  • Degree Features for all Nodes:
  http://dl.dropbox.com/u/14895843/social-network-kaggle/node_degree_features.csv
  First column: Node Id
  Second column: Outbound Degree (count of the number of outbound edges from node)
  Third column: Inbound Degree (count of the number of inbound edges to node)
  Note: You can think of these as number of followees and followers (respectively).
        Additionally, note that only the first 32.7k rows have 'followees'

Useful Links[edit]