Machine Learning Meetup Notes:2010-11-17

From Noisebridge

1) Values between 0 and 1 are OK.

2) Each pair/row has 2 nodes: the first (outbound) points to the second (inbound). Any node can have inbound as well as outbound edges.


3) The creation of the test data is indeed not trivial. It is random but with a lot of selection up front to ensure that the remaining network does not get destroyed and that guesses are not easy. You will notice that every outbound node only appears once in the test data. One test was run to make sure test and train look similar.


4) QUESTION: If a node pair does not exist in either the training set or the test set, can we assume there's no edge connecting them in the complete dataset (from which the competition dataset was built)? That is, for all the nodes in the competition dataset, is there any edge that's in the complete dataset but was not picked for the competition dataset?

4) ANSWER: There might be confusion here. A node pair can only ever be in either train or test, never both. When a pair is in test, it can be true or false. All nodes should be in both.

5) For every node that has at least one outbound edge, all of its outbound edges are in the training set plus half of the test set. So it is complete.

6) All nodes were chosen at random, after some pre-selection. For the false edges, 2 nodes were chosen at random and if they had an edge it was rejected and another random pair was chosen. They were sampled from the FROM and TO set.

7) The order within a pair does matter; the IDs don't.
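
The false-edge procedure in point 6 can be sketched as simple rejection sampling. This is only an illustration under assumptions, not the organizers' actual code: the function name `sample_false_edges`, the seed, and the toy data are hypothetical. It also assumes duplicate false pairs are dropped and that a node paired with itself is not treated specially, neither of which the notes specify. Pairs are stored as (outbound, inbound) tuples, so order matters, as in points 2 and 7.

```python
import random


def sample_false_edges(true_edges, from_nodes, to_nodes, k, seed=0):
    """Rejection-sample k directed (outbound, inbound) pairs that are not true edges.

    Hypothetical helper: the names and the de-duplication are assumptions,
    not something stated in the notes.
    """
    rng = random.Random(seed)
    true_set = set(true_edges)
    from_list = sorted(set(from_nodes))
    to_list = sorted(set(to_nodes))
    from_set, to_set = set(from_list), set(to_list)

    # Guard against an endless loop: count how many non-edges exist at all.
    blocked = sum(1 for u, v in true_set if u in from_set and v in to_set)
    available = len(from_list) * len(to_list) - blocked
    if k > available:
        raise ValueError(f"only {available} false pairs are available, asked for {k}")

    false_edges = set()
    while len(false_edges) < k:
        # Draw the outbound node from the FROM set and the inbound node from the TO set.
        pair = (rng.choice(from_list), rng.choice(to_list))
        if pair in true_set:
            continue  # the pair is a real edge: reject it and draw again
        false_edges.add(pair)  # a repeated draw is silently de-duplicated (assumption)
    return sorted(false_edges)


# Toy directed graph: (1, 2) is an edge, but (2, 1) is not.
toy_edges = [(1, 2), (2, 3), (3, 1), (1, 3)]
toy_false = sample_false_edges(toy_edges, from_nodes={1, 2, 3}, to_nodes={1, 2, 3}, k=3)
```

Because the pairs are directed, (2, 1) is a valid false pair here even though (1, 2) is a true edge.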