Machine Learning/Hadoop

From Noisebridge

About

  • Google had so much data that merely reading it from disk took a long time, let alone processing it
    • So they needed to parallelize everything, even disk access
    • Make the processing local to where the data is, to avoid network issues
  • Parallelization is hard/error-prone
    • Want to have a "shared-nothing" architecture
    • Functional programming
  • Map

Applies the function to each item in the list and returns the list of results

def map(func, items):
  return [func(item) for item in items]

Example:

def twice(num):
  return num*2

map(twice, [1, 2, 3])  # returns [2, 4, 6]
  • Reduce

Takes a function of two arguments and a list, and iterates through the list, accumulating a single result

def reduce(func, items):
  acc = func(items[0], items[1])
  for item in items[2:]:
    acc = func(acc, item)
  return acc
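Exercising the definition above locally (a sketch; `my_reduce` and `add` are illustrative names, not library functions):

```python
# Minimal sketch of the reduce described above (not Python's builtin).
def my_reduce(func, items):
    acc = func(items[0], items[1])   # seed the accumulator with the first pair
    for item in items[2:]:           # fold in the remaining items one at a time
        acc = func(acc, item)
    return acc

def add(a, b):
    return a + b

total = my_reduce(add, [1, 2, 3, 4])  # ((1+2)+3)+4 = 10
```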

Examples/Actual

def map(key,value):
  # process
  emit(another_key, another_value)
def reduce(key, values):
  # process the key and all values associated with it
  emit(something)
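The key/value pattern above can be simulated in one process to see the shape of a job. This is a toy sketch, not the Hadoop API: `run_mapreduce` groups mapper output by key (the "shuffle") before handing each group to the reducer.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Toy local MapReduce: map each record, shuffle by key, reduce each group."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)      # the "shuffle" step
    return {key: reducer(key, values) for key, values in groups.items()}

# Word count: the mapper emits (word, 1); the reducer sums the 1s.
def wc_mapper(key, value):
    return [(word, 1) for word in value.split()]

def wc_reducer(key, values):
    return sum(values)

counts = run_mapreduce([(1, "to be or not"), (2, "to be")], wc_mapper, wc_reducer)
# counts == {"to": 2, "be": 2, "or": 1, "not": 1}
```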
  • Average
    • keys are line numbers, values are the line contents
    • file:
      • 1 (1,2)
      • 4 (2,4)
      • 5 (3,5)
      • 6 (4,6)
def map(key,value):
  emit("exist",1)   # one per record, so the count can be summed
  emit("x",value)   # the value itself, so the total can be summed
def reduce(key, values):
  # "exist" receives a 1 per record; "x" receives every value
  emit(key, sum(values))
# the average is then sum of "x" divided by sum of "exist"
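A local sketch of the average job above. Here `emit` just collects into a dict, and the final division happens outside the reducer (in a real job it would be a follow-up step); the record values are taken from the file listing above.

```python
from collections import defaultdict

groups = defaultdict(list)

def emit(key, value):
    groups[key].append(value)

def avg_mapper(key, value):
    emit("exist", 1)   # one per record, so the reducer can count
    emit("x", value)   # the raw value, so the reducer can sum

# records as (line number, value), matching the file above
records = [(1, 2), (4, 4), (5, 5), (6, 6)]
for k, v in records:
    avg_mapper(k, v)

totals = {key: sum(values) for key, values in groups.items()}  # the reduce step
average = totals["x"] / totals["exist"]   # 17 / 4 = 4.25
```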



Tutorials

How to Debug

  • To debug a streaming Hadoop process, cat your source file, pipe it to the mapper, then to sort, then to the reducer
    • Ex: cat princess_bride.txt | scripts/word-count/mapper.py | sort | scripts/word-count/reducer.py
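A minimal word-count pair in that streaming style (a sketch; the script paths above are the wiki's, these function names are illustrative). Hadoop streaming feeds the mapper raw lines on stdin and the reducer key-tab-value lines already sorted by key, which is why `sort` sits in the middle of the debug pipeline.

```python
def mapper(lines):
    """Emit "word\t1" for every word, one pair per line (the streaming contract)."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    """Sum counts per word; input must be sorted by key, as `sort` guarantees."""
    current, count = None, 0
    for line in lines:
        word, n = line.split("\t")
        if word != current:
            if current is not None:
                yield f"{current}\t{count}"
            current, count = word, 0
        count += int(n)
    if current is not None:
        yield f"{current}\t{count}"

# Local equivalent of: cat file | mapper.py | sort | reducer.py
text = ["as you wish", "as you wish as"]
result = list(reducer(sorted(mapper(text))))
# result == ["as\t3", "wish\t2", "you\t2"]
```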

Tools

  • Hadoop
  • Hive
  • Pig: a high-level language that compiles down to MapReduce programs
  • MapReduce on Amazon (?)