Machine Learning/Kaggle Social Network Contest/load data
Latest revision as of 13:28, 23 November 2010
R
igraph
The full dataset loaded quickly using the R package igraph. With the full dataset loaded, R uses less than 900 MB of RAM.
Grab the package with:
install.packages("igraph")
Load the data using:
data <- as.matrix(read.csv("social_train.csv", header = FALSE))
dg <- graph.edgelist(data, directed = TRUE)
Note that the resulting graph contains an additional vertex with id zero. Deleting this vertex would renumber the remaining vertex ids, so it is best to just leave it in; vertex zero has no edges anyway.
Python
How to load the network into networkx
There is a network analysis package for Python called networkx. This package can be installed using easy_install.
The network can be loaded using the read_edgelist function in networkx or by manually adding edges.
NOTE: John found that it took about 5.5 GB of memory to load the entire network. We may need to process it in chunks, or perhaps decompose it into smaller sub-networks.
Method 1
import networkx as nx
DG = nx.read_edgelist('social_train.csv', create_using=nx.DiGraph(), nodetype=int, delimiter=',')
Method 2
import networkx as nx
import csv
import time

t0 = time.clock()
DG = nx.DiGraph()
netcsv = csv.reader(open('social_train.csv', 'rb'), delimiter=',')
for row in netcsv:
    tmp1 = int(row[0])
    tmp2 = int(row[1])
    DG.add_edge(tmp1, tmp2)
print "Loaded in ", str(time.clock() - t0), "s"
Below is the time to load different numbers of rows using the two methods on a 2.8 GHz quad-core machine with 3 GB RAM. The second method seems quicker. Note that these are based on single loads and are intended as a guide rather than a rigorous comparison of the methods!
Rows     | 1M  | 2M  | 3M
---------|-----|-----|-----
Method 1 | 20s | 53s | 103s
Method 2 | 15s | 41s | 86s
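Given John's note above about the ~5.5 GB memory footprint, one option is to process the edge list in fixed-size chunks rather than holding a full networkx graph. Below is a minimal stdlib-only sketch (Python 3, no networkx) that reads a chunk of rows at a time into a plain dict-of-lists adjacency structure; the file name, chunk size, and helper name are illustrative assumptions, not part of the original write-up.

```python
import csv
import io
from itertools import islice

def load_edges_in_chunks(fileobj, chunk_size=1000000):
    """Read a (source,target) edge-list CSV in fixed-size chunks,
    accumulating a plain dict-of-lists adjacency structure.

    Processing chunk by chunk keeps only `chunk_size` parsed rows
    in flight at once; each chunk could also be handed to a worker
    or written out as a smaller sub-network instead.
    """
    adj = {}
    reader = csv.reader(fileobj)
    while True:
        chunk = list(islice(reader, chunk_size))  # next batch of rows
        if not chunk:
            break
        for src, dst in chunk:
            adj.setdefault(int(src), []).append(int(dst))
    return adj

# Tiny in-memory stand-in for social_train.csv
sample = io.StringIO("1,2\n1,3\n2,3\n")
adj = load_edges_in_chunks(sample, chunk_size=2)
print(adj)  # {1: [2, 3], 2: [3]}
```

A dict of int lists is far lighter than a full DiGraph, so this is also a reasonable fallback if the 3 GB machine above cannot hold the networkx representation at all.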
Ruby
Note on CSV Libraries
If you happen to be using Ruby (like Jared) for loading data in and out of CSV files, you should definitely try FasterCSV (http://fastercsv.rubyforge.org/, require 'faster_csv') instead of the stock CSV library (require 'csv'). For example, loading the adjacency list was literally ten times faster with FasterCSV than with the standard CSV library.
Loading Adjacency Lists
require 'rubygems'
require 'faster_csv'

def load_adj_list_faster(filename)
  adj_list_hash = {}
  FasterCSV.foreach(filename, :quote_char => '"', :col_sep => ',', :row_sep => :auto) do |row|
    node_id = row.shift
    list_of_adj = row
    adj_list_hash[node_id] = list_of_adj
  end
  return adj_list_hash
end

adj_list_lookup = load_adj_list_faster('adj_list.out.csv')
rev_adj_list_lookup = load_adj_list_faster('reverse_adj_list.out.csv')