Machine Learning/Hadoop
===About===
* Google had so much data that simply ''reading'' it from disk took a long time, let alone processing it
** So they needed to parallelize everything, even disk access
** Make the processing local to where the data is, to avoid network issues
* Parallelization is hard/error-prone
** Want a "shared-nothing" architecture
** Functional programming (map and reduce) fits this model, since each piece of work is independent
* Map
Runs a function on each item in a list and returns the list of results, one output per input item
<pre>
def map(func, list):
    # apply func to every item and collect the results
    return [func(item) for item in list]
</pre>
Example:
<pre>
def twice(num):
    return num*2
</pre>
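For instance, mapping <code>twice</code> over a small list doubles every element (a quick sketch using the toy definitions above):
<pre>
# uses the map and twice defined above
map(twice, [1, 2, 3])   # returns [2, 4, 6]
</pre>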
* Reduce
Takes a two-argument function and a list, and steps through the list, accumulating a single result by repeatedly applying the function to the running result and the next item
<pre>
def reduce(func, list):
    # combine the first two items, then fold in the rest one at a time
    a = func(list[0], list[1])
    for item in list[2:]:
        a = func(a, item)
    return a
</pre>
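For example, reducing a list with an addition function sums it; <code>add</code> here is just a helper for illustration:
<pre>
def add(a, b):
    return a + b

# uses the toy reduce defined above
reduce(add, [1, 2, 3, 4])   # ((1+2)+3)+4 = 10
</pre>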
===Examples/Actual===
In actual MapReduce, map takes a (key, value) pair and emits zero or more intermediate (key, value) pairs; reduce is then called once per intermediate key with all the values emitted for that key
<pre>
def map(key, value):
    # process one input record, emit intermediate (key, value) pairs
    emit(another_key, another_value)

def reduce(key, values):
    # process the key and all values associated with it
    emit(something)
</pre>
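To make the emit-based skeleton concrete, here is a minimal sketch of what the framework does behind the scenes: run map over every input pair, group the emitted pairs by key (the shuffle), then call reduce once per key. This is a toy single-machine simulation, not Hadoop's actual API; the names <code>run_mapreduce</code>, <code>map_fn</code>, and <code>reduce_fn</code> are made up for illustration, and mappers here return a list of pairs instead of calling emit.
<pre>
from collections import defaultdict

def run_mapreduce(input_pairs, map_fn, reduce_fn):
    # map phase: collect every (key, value) the mapper produces
    intermediate = []
    for key, value in input_pairs:
        intermediate.extend(map_fn(key, value))

    # shuffle phase: group values by key
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # reduce phase: one reduce call per key
    return [reduce_fn(key, values) for key, values in groups.items()]
</pre>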
* Average
** keys are line numbers, values are what's on that line
** file:
*** 1 (1,2)
*** 4 (2,4)
*** 5 (3,5)
*** 6 (4,6)
<pre>
def map(key, value):
    emit("exist", 1)    # one per record, to count the records
    emit("x", value)    # the value itself, to total the values

def reduce(key, values):
    # for both keys the right combination is a sum:
    # the "exist" values sum to the count, the "x" values sum to the total
    emit(key, sum(values))
</pre>
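The average itself is then the "x" total divided by the "exist" count. As a sanity check, running the toy <code>run_mapreduce</code> sketch from above on the file's (line number, value) pairs, assuming the values really are 2, 4, 5, and 6, would look something like this (<code>avg_map</code> and <code>avg_reduce</code> are hypothetical names):
<pre>
def avg_map(key, value):
    # emit one counter record and one sum record per input line
    return [("exist", 1), ("x", value)]

def avg_reduce(key, values):
    return (key, sum(values))

pairs = [(1, 2), (2, 4), (3, 5), (4, 6)]
totals = dict(run_mapreduce(pairs, avg_map, avg_reduce))
average = totals["x"] / float(totals["exist"])   # 17 / 4 = 4.25
</pre>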
===Tutorials===
===Tools===
* Hadoop
* Hive
* Pig: A high-level language for compiling down to MapReduce programs
* MapReduce on Amazon (?)