Machine Learning/Hadoop

From Noisebridge

About

  • Google had so much data that merely reading it from disk took a long time, let alone processing it
    • So they needed to parallelize everything, even disk access
    • Make the processing local to where the data is, to avoid network issues
  • Parallelization is hard/error-prone
    • Want to have a "shared-nothing" architecture
    • Functional programming
  • Map

Applies the function to each item in the list and returns the list of results

def map(func, items):
  return [func(item) for item in items]

Example:

def twice(num):
  return num*2

map(twice, [1, 2, 3])  # returns [2, 4, 6]
  • Reduce

Takes a function of two arguments and a list, and iterates through the list, accumulating a single result

def reduce(func, items):
  acc = func(items[0], items[1])
  for item in items[2:]:
    acc = func(acc, item)
  return acc
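Exercising the definition above locally (a sketch; `my_reduce` and `add` are illustrative names, not library functions):

```python
# Minimal sketch of the reduce described above (not Python's builtin).
def my_reduce(func, items):
    acc = func(items[0], items[1])   # seed the accumulator with the first pair
    for item in items[2:]:           # fold in the remaining items one at a time
        acc = func(acc, item)
    return acc

def add(a, b):
    return a + b

total = my_reduce(add, [1, 2, 3, 4])  # ((1+2)+3)+4 = 10
```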

Examples/Actual

def map(key,value):
  # process
  emit(another_key, another_value)
def reduce(key, values):
  # process the key and all values associated with it
  emit(something)
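The key/value pattern above can be simulated in one process to see the shape of a job. This is a toy sketch, not the Hadoop API: `run_mapreduce` groups mapper output by key (the "shuffle") before handing each group to the reducer.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Toy local MapReduce: map each record, shuffle by key, reduce each group."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)      # the "shuffle" step
    return {key: reducer(key, values) for key, values in groups.items()}

# Word count: the mapper emits (word, 1); the reducer sums the 1s.
def wc_mapper(key, value):
    return [(word, 1) for word in value.split()]

def wc_reducer(key, values):
    return sum(values)

counts = run_mapreduce([(1, "to be or not"), (2, "to be")], wc_mapper, wc_reducer)
# counts == {"to": 2, "be": 2, "or": 1, "not": 1}
```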
  • Average
    • keys are line numbers, values are the line contents
    • file:
      • 1 (1,2)
      • 4 (2,4)
      • 5 (3,5)
      • 6 (4,6)
def map(key,value):
  emit("exist",1)   # one per record, so the count can be summed
  emit("x",value)   # the value itself, so the total can be summed
def reduce(key, values):
  # "exist" receives a 1 per record; "x" receives every value
  emit(key, sum(values))
# the average is then sum of "x" divided by sum of "exist"
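A local sketch of the average job above. Here `emit` just collects into a dict, and the final division happens outside the reducer (in a real job it would be a follow-up step); the record values are taken from the file listing above.

```python
from collections import defaultdict

groups = defaultdict(list)

def emit(key, value):
    groups[key].append(value)

def avg_mapper(key, value):
    emit("exist", 1)   # one per record, so the reducer can count
    emit("x", value)   # the raw value, so the reducer can sum

# records as (line number, value), matching the file above
records = [(1, 2), (4, 4), (5, 5), (6, 6)]
for k, v in records:
    avg_mapper(k, v)

totals = {key: sum(values) for key, values in groups.items()}  # the reduce step
average = totals["x"] / totals["exist"]   # 17 / 4 = 4.25
```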



Tutorials

How to Debug

  • To debug a streaming Hadoop process, cat your source file, pipe it to the mapper, then to sort, then to the reducer
    • Ex: cat princess_bride.txt | scripts/word-count/mapper.py | sort | scripts/word-count/reducer.py
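A minimal word-count pair in that streaming style (a sketch; the script paths above are the wiki's, these function names are illustrative). Hadoop streaming feeds the mapper raw lines on stdin and the reducer key-tab-value lines already sorted by key, which is why `sort` sits in the middle of the debug pipeline.

```python
def mapper(lines):
    """Emit "word\t1" for every word, one pair per line (the streaming contract)."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    """Sum counts per word; input must be sorted by key, as `sort` guarantees."""
    current, count = None, 0
    for line in lines:
        word, n = line.split("\t")
        if word != current:
            if current is not None:
                yield f"{current}\t{count}"
            current, count = word, 0
        count += int(n)
    if current is not None:
        yield f"{current}\t{count}"

# Local equivalent of: cat file | mapper.py | sort | reducer.py
text = ["as you wish", "as you wish as"]
result = list(reducer(sorted(mapper(text))))
# result == ["as\t3", "wish\t2", "you\t2"]
```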

Tools

  • Hadoop
  • Hive
  • Pig: a high-level language that compiles down to MapReduce programs
  • MapReduce on Amazon (?)