Machine Learning/Kaggle Social Network Contest/load data: Difference between revisions
Revision as of 18:35, 19 November 2010
How to load the network into networkx
There is a network analysis package for Python called networkx. This package can be installed using easy_install.
The network can be loaded either with the read_edgelist function in networkx or by adding edges manually. Both methods are shown below.
NOTE: John found that loading the entire network took about 5.5GB of memory. We may need to process it in chunks, or decompose it into smaller subnetworks.
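One way the chunked processing mentioned in the note could look is sketched below. This is only an illustration using the standard library, not something we have run on the full data: the helper names (iter_edges, iter_edge_chunks, count_out_degrees) and the default chunk size are made up for this example, and it assumes each row of the file is a "source,target" pair of integers, as in the methods on this page.

```python
import csv
from collections import Counter
from itertools import islice


def iter_edges(path):
    """Yield (source, target) integer pairs one at a time from a CSV edge list."""
    with open(path, newline='') as f:
        for row in csv.reader(f, delimiter=','):
            if row:  # skip blank lines
                yield int(row[0]), int(row[1])


def iter_edge_chunks(path, chunk_size=100000):
    """Yield lists of at most chunk_size edges (hypothetical helper)."""
    edges = iter_edges(path)
    while True:
        chunk = list(islice(edges, chunk_size))
        if not chunk:
            return
        yield chunk


def count_out_degrees(path, chunk_size=100000):
    """Example of per-chunk work: count outgoing edges per node
    without ever holding the whole edge list in memory."""
    out_degree = Counter()
    for chunk in iter_edge_chunks(path, chunk_size):
        out_degree.update(source for source, _ in chunk)
    return out_degree
```

Each chunk is a plain list of tuples, so it could also be passed to DG.add_edges_from(chunk) to build a graph for just that part of the file.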
Method 1
    import networkx as nx

    DG = nx.read_edgelist('social_train.csv', create_using=nx.DiGraph(),
                          nodetype=int, delimiter=',')
Method 2
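    import csv
    import time

    import networkx as nx

    t0 = time.perf_counter()  # time.clock() was removed in Python 3.8
    DG = nx.DiGraph()
    # newline='' is what the csv module expects for text-mode files in Python 3
    with open('social_train.csv', newline='') as f:
        for row in csv.reader(f, delimiter=','):
            # each row is "source,target"; node ids are integers
            DG.add_edge(int(row[0]), int(row[1]))
    print("Loaded in", time.perf_counter() - t0, "s")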
Below are the times taken to load different numbers of rows using the two methods on a 2.8 GHz quad-core machine with 3 GB RAM. The second method seems quicker. Note that these figures are each based on a single load and are intended as a guide rather than a rigorous analysis of the methods!
Rows | 1M | 2M | 3M |
---|---|---|---|
Method 1 | 20s | 53s | 103s |
Method 2 | 15s | 41s | 86s |
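Since the figures above come from single loads, a slightly more careful comparison could repeat each load a few times and look at the spread. The sketch below shows one way to do that with the standard library. The function names time_loader and summarize are invented for this example, and nothing here has been run against the contest data.

```python
import statistics
import time


def time_loader(loader, repeats=3):
    """Call loader() repeats times and return the wall-clock time of each call."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        loader()
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Reduce a list of timings (in seconds) to min, median and max."""
    return {
        "min": min(timings),
        "median": statistics.median(timings),
        "max": max(timings),
    }
```

For example, Method 1 could be wrapped as lambda: nx.read_edgelist('social_train.csv', create_using=nx.DiGraph(), nodetype=int, delimiter=',') and passed to time_loader, with Method 2 wrapped in a function of its own. Note that repeated loads of the same file may benefit from the operating system's file cache, so the first run can be slower than the rest.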