Machine Learning/Kaggle Social Network Contest/Features: Difference between revisions

Revision as of 16:44, 25 November 2010

TODO

Precisely define the listed features

Possible Features

Node Features
- nodeid
- outdegree
- indegree
- local clustering coefficient
- reciprocation of inbound probability (num of edges returned / num of inbound edges)
- reciprocation of outbound probability (num of edges returned / num of outbound edges)
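As a rough sketch of how the degree and reciprocation features above could be computed from a plain edge list (the function name `node_features`, the feature keys, and the choice of 0.0 when a node has no inbound or outbound edges are all assumptions of this sketch, not part of the contest data format):

```python
from collections import defaultdict

def node_features(edges):
    """Per-node degree and reciprocation features.

    `edges` is an iterable of (follower, followed) pairs.
    Hypothetical helper; names and conventions are assumptions.
    """
    out_nbrs = defaultdict(set)
    in_nbrs = defaultdict(set)
    for u, v in set(edges):
        out_nbrs[u].add(v)
        in_nbrs[v].add(u)

    features = {}
    for n in set(out_nbrs) | set(in_nbrs):
        outdeg = len(out_nbrs[n])
        indeg = len(in_nbrs[n])
        # An edge is "returned" when it exists in both directions.
        returned = len(out_nbrs[n] & in_nbrs[n])
        features[n] = {
            "outdegree": outdeg,
            "indegree": indeg,
            # Assumption: report 0.0 when the denominator is zero.
            "recip_inbound": returned / indeg if indeg else 0.0,
            "recip_outbound": returned / outdeg if outdeg else 0.0,
        }
    return features
```

For example, `node_features([(1, 2), (2, 1), (1, 3), (4, 1)])[1]` gives node 1 an outdegree and indegree of 2, with one returned edge in each direction.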

Edge Features
- nodetofollowid
- shortest distance nodeid to nodetofollowid
- density? (~~median path length~~)
- does reverse edge exist? (aka is nodetofollowid following nodeid?)
- number of common friends
- indegrees & outdegrees of nodetofollowid
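A minimal sketch of the edge features listed above, using a breadth-first search for the shortest distance. Assumptions of this sketch: distance is measured along follow edges only, "friends" means neighbours in either direction, and the helper names (`build_adjacency`, `shortest_distance`, `edge_features`) are hypothetical.

```python
from collections import defaultdict, deque

def build_adjacency(edges):
    """Return (out_nbrs, in_nbrs) dictionaries of neighbour sets."""
    out_nbrs, in_nbrs = defaultdict(set), defaultdict(set)
    for u, v in edges:
        out_nbrs[u].add(v)
        in_nbrs[v].add(u)
    return out_nbrs, in_nbrs

def shortest_distance(out_nbrs, source, target):
    """Hop count from source to target along follow edges (BFS).

    Returns None when target is unreachable.
    """
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in out_nbrs.get(node, ()):
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

def edge_features(edges, nodeid, nodetofollowid):
    out_nbrs, in_nbrs = build_adjacency(edges)

    def friends(n):
        # Assumption: a friend is any neighbour, regardless of direction.
        return out_nbrs.get(n, set()) | in_nbrs.get(n, set())

    common = (friends(nodeid) & friends(nodetofollowid)) - {nodeid, nodetofollowid}
    return {
        "shortest_distance": shortest_distance(out_nbrs, nodeid, nodetofollowid),
        "reverse_edge_exists": nodeid in out_nbrs.get(nodetofollowid, set()),
        "common_friends": len(common),
        "target_indegree": len(in_nbrs.get(nodetofollowid, set())),
        "target_outdegree": len(out_nbrs.get(nodetofollowid, set())),
    }
```

The adjacency is rebuilt on every call to keep the sketch short; for a real graph it would be built once and reused.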

Network features
- unweighted random walk score
- global clustering coefficient
- Adamic-Adar score
  - see original paper
  - R igraph: similarity.invlogweighted

Clustering
- membership of the same strongly connected cluster
  - using igraph clusters
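The list above points at igraph for this; purely as an illustration of what "same strongly connected cluster" means, here is a dependency-free sketch using Kosaraju's two-pass algorithm. The function name `strongly_connected_labels` and the choice of labelling each cluster by one of its member nodes are assumptions of this sketch.

```python
from collections import defaultdict

def strongly_connected_labels(edges):
    """Map each node to a label shared by its strongly connected cluster."""
    out_nbrs, in_nbrs = defaultdict(list), defaultdict(list)
    nodes = set()
    for u, v in edges:
        out_nbrs[u].append(v)
        in_nbrs[v].append(u)
        nodes.update((u, v))

    # Pass 1: record nodes in order of DFS completion on the original graph.
    order, seen = [], set()
    for root in nodes:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(out_nbrs[root]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(out_nbrs[nxt])))
                    break
            else:
                # All neighbours explored: this node is finished.
                order.append(node)
                stack.pop()

    # Pass 2: flood-fill the reversed graph in reverse finishing order.
    label = {}
    for root in reversed(order):
        if root in label:
            continue
        label[root] = root
        stack = [root]
        while stack:
            node = stack.pop()
            for prev in in_nbrs[node]:
                if prev not in label:
                    label[prev] = root
                    stack.append(prev)
    return label
```

The membership feature for a candidate edge is then simply whether `label[nodeid] == label[nodetofollowid]`.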

The response variable is the probability that the nodeid to nodetofollowid edge will be created in the future

Joe's attempt

I'm planning to compute features per edge, then sample those features over both existing edges and randomly created edges, and fit a logistic regression model to the sample.
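A hedged sketch of the sampling step, covering only the construction of labelled edges (the model fit itself is left to whatever logistic regression implementation is at hand). Assumptions: "randomly created edges" are drawn uniformly over ordered node pairs and must not already exist in the graph, self-loops are skipped, and `sample_training_edges` is a hypothetical name.

```python
import random

def sample_training_edges(edges, n_samples, seed=0):
    """Return a list of ((s, t), label) pairs.

    label 1: an edge sampled from the existing graph.
    label 0: a randomly created edge that is absent from the graph.
    """
    rng = random.Random(seed)
    edge_set = set(edges)
    nodes = sorted({n for e in edge_set for n in e})

    positives = rng.sample(sorted(edge_set), min(n_samples, len(edge_set)))

    # Cap the number of negatives at the number of non-edges that exist,
    # so the rejection loop below always terminates.
    non_loop_edges = sum(1 for s, t in edge_set if s != t)
    max_non_edges = len(nodes) * (len(nodes) - 1) - non_loop_edges
    target = min(n_samples, max_non_edges)

    negatives = set()
    while len(negatives) < target:
        s, t = rng.choice(nodes), rng.choice(nodes)
        if s != t and (s, t) not in edge_set:
            negatives.add((s, t))

    return [(e, 1) for e in positives] + [(e, 0) for e in sorted(negatives)]
```

Each sampled pair would then be turned into a feature row, with the label as the response.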

For an edge from node s to node t I will calculate:

the in-degree of s
the out-degree of s
the in-degree of t
the out-degree of t
RLD_-1(s)
RLD₁(s)
RLD₀(s)
RLD_-1(t)
RLD₁(t)
RLD₀(t)
AA₀¹(s,t)
AA₀^1.5(s,t)
AA₀²(s,t)
AA_-1¹(s,t)
AA_-1^1.5(s,t)
AA_-1²(s,t)
AA₁¹(s,t)
AA₁^1.5(s,t)
AA₁²(s,t)

where

RLD_x(n) is 1 / log(0.1 + the x-degree of node n), where -1 = in, 1 = out and 0 = any. (RLD = reciprocal log of degree )
- note that I add 0.1 so that nodes with degree 1 have a score of 1/log(1.1) = 10.49 rather than 1/log(1), which is a division by zero
- logs are taken to base e
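The RLD definition above is small enough to write out directly. In this sketch the function name `rld` and the `offset` parameter are my own labels for the 0.1 term.

```python
import math

def rld(degree, offset=0.1):
    """Reciprocal log of degree: 1 / ln(offset + degree)."""
    return 1.0 / math.log(offset + degree)

# Degree 1 gives 1 / ln(1.1), roughly 10.49, instead of a division by zero.
score_for_degree_one = rld(1)
```

One consequence worth checking when using this as a feature: for a node of degree 0 the logarithm is negative, so `rld(0)` is negative (about -0.43).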

I define N_x^h(n) to be the nodes reachable from n in h hops along either any edge (x = 0), in-edges, i.e. edges pointing from t towards s (x = -1), or out-edges, i.e. edges pointing from s towards t (x = 1).

I define C_x^h(s,t) as the set of common neighbours of s and t at a distance of h hops from s and t, excluding nodes in a closer common neighbourhood, i.e.

C_x^h(s,t) = (N_x^h(s) ∩ N_-x^h(t)) \ ∪_{h' < h}(N_x^h'(s) ∩ N_-x^h'(t))
- h = 1.5 corresponds to nodes which are one hop from one of s and t and two hops from the other
The sets C_x^h(s,t) are disjoint for different h.
The definition is directional, i.e. sometimes C_x^h(s,t) ≠ C_x^h(t,s).
AA is the Adamic-Adar score calculated over different common neighbourhoods.
- the subscripts 0, -1 and 1 refer to neighbours reachable by following any, in or out edges respectively
- the superscripts 1, 1.5 and 2 refer to the number of hops from a focal node to the neighbour.
AA_x^h(s,t) = sum_{n ∈ C_x^h(s,t)} RLD₀(n)
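To make the definitions of N, C and AA above concrete, here is a sketch under several assumptions of mine that the text does not settle: N_x^h(n) is read as the nodes reached by walks of exactly h edges; h = 1.5 is read as the union of (one hop from s, two from t) and (two hops from s, one from t); s and t themselves are dropped from the common neighbourhood; and the "any" degree in RLD₀ is in-degree plus out-degree. All function names are hypothetical.

```python
import math
from collections import defaultdict

def build_adjacency(edges):
    out_nbrs, in_nbrs = defaultdict(set), defaultdict(set)
    for u, v in set(edges):
        out_nbrs[u].add(v)
        in_nbrs[v].add(u)
    return out_nbrs, in_nbrs

def step(adj, nodes, x):
    """One hop from `nodes`: out-edges (x = 1), in-edges (x = -1), both (x = 0)."""
    out_nbrs, in_nbrs = adj
    result = set()
    for n in nodes:
        if x >= 0:
            result |= out_nbrs.get(n, set())
        if x <= 0:
            result |= in_nbrs.get(n, set())
    return result

def hood(adj, n, x, hops):
    """N_x^h(n), read here as nodes reached by walks of exactly `hops` edges."""
    frontier = {n}
    for _ in range(hops):
        frontier = step(adj, frontier, x)
    return frontier

def raw_common(adj, s, t, x, h):
    """N_x^h(s) ∩ N_-x^h(t), with h = 1.5 handled as a mixed 1-hop/2-hop case."""
    if h == 1.5:
        return ((hood(adj, s, x, 1) & hood(adj, t, -x, 2))
                | (hood(adj, s, x, 2) & hood(adj, t, -x, 1)))
    return hood(adj, s, x, int(h)) & hood(adj, t, -x, int(h))

def common(adj, s, t, x, h):
    """C_x^h(s,t): the raw set minus every closer common neighbourhood."""
    closer = set()
    for h2 in (1, 1.5, 2):
        if h2 < h:
            closer |= raw_common(adj, s, t, x, h2)
    # Assumption: s and t are not counted as their own common neighbours.
    return raw_common(adj, s, t, x, h) - closer - {s, t}

def adamic_adar(edges, s, t, x, h):
    """AA_x^h(s,t) = sum of RLD_0(n) over n in C_x^h(s,t)."""
    adj = build_adjacency(edges)
    out_nbrs, in_nbrs = adj

    def rld0(n):
        degree = len(out_nbrs.get(n, ())) + len(in_nbrs.get(n, ()))
        return 1.0 / math.log(0.1 + degree)

    return sum(rld0(n) for n in common(adj, s, t, x, h))
```

On the toy graph s→a→t, s→b→c→t with x = 1, node a is the only member of the h = 1 neighbourhood, while b and c fall into h = 1.5.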

