Machine Learning/Kaggle Social Network Contest/Network Description: Difference between revisions

From Noisebridge
Jump to navigation Jump to search
(→‎Conectivity: effect of adding test data on clusters)
Line 83: Line 83:


Is it more likely that clusters were created by removing nodes or that they merged due to randomly adding nodes?
Is it more likely that clusters were created by removing nodes or that they merged due to randomly adding nodes?
* TODO figure out probs of adding and removing nodes under different sampling hypotheses.
* TODO: figure out probs of adding and removing nodes under different sampling hypotheses.
* TODO: identify the edges which are merging the clusters
* I'm guessing that the chances of a randomly generated edge joins the small clusters is very low.
* I'm guessing that the chances of a randomly generated edge joins the small clusters is very low.



Revision as of 01:07, 24 November 2010

Here we can put the descriptive statistics of the network:

  • Number of fully sampled nodes: 37,689
    • ie the unique "outnodes" in the edge list
  • Total number of nodes: 1,133,547
  • number of edges: 7,237,983

Conectivity

"A digraph is strongly connected if every vertex is reachable from every other following the directions of the arcs. On the contrary, a digraph is weakly connected if its underlying undirected graph is connected. A weakly connected graph can be thought of as a digraph in which every vertex is "reachable" from every other but not necessarily following the directions of the arcs. A strong orientation is an orientation that produces a strongly connected digraph." wikipedia

  • The Training Graph is not weakly connected
  • It contains 27 subgraphs This means that it can be broken down into at least two discrete subgraphs.
    • c.f. igraph clustering
    • There is one very large cluster containing all but 154 verticies, then 4 with size 10 - 37, 8 sized 3 - 7 and 13 size 2
    • note that igraph seems to create a vertex labelled 0 but the labels in the traindata file range from 1 to 1133547
  • I also grabbed the number of strongly connected subgraphs
Cluster Size 1 2 3 4 5 9 10 32464
freq 1100647 162 18 5 4 1 1 1

When I added all of the test data to the graph and then re-ran the cluster analysis it found 22 clusters instead of 27. The largest cluster grew by 72 vertices.

Cluster Size 1 2 3 4 5 7 10 23 37 1133394 1133466
Train 1 13 3 2 2 1 1 2 1 1 0
Train + Test 1 13 2 1 1 1 1 1 0 0 1

Is it more likely that clusters were created by removing nodes or that they merged due to randomly adding nodes?

  • TODO: figure out probs of adding and removing nodes under different sampling hypotheses.
  • TODO: identify the edges which are merging the clusters
  • I'm guessing that the chances of a randomly generated edge joins the small clusters is very low.
  • Diameter of the directed graph is 14
    • This is the longest of the shortest directed paths between two nodes
    • R igraph
      • diameter (dg, directed = TRUE, unconnected = TRUE)
      • Was taking forever so I aborted (after 34 minutes...)
  • Total number of direct neighbours out: 7 275 672, in: 508 688, all: 7 473 273
    • For each of our 38k I calculated the number of outbound neighbours and summed it
    • R igraph: