8/2/2019 Cloud MapReduce Zaharia
UC Berkeley
Introduction to MapReduce
and Hadoop
Matei Zaharia
UC Berkeley RAD Lab
matei@eecs.berkeley.edu
What is MapReduce?
Data-parallel programming model for clusters of commodity machines
Pioneered by Google: processes 20 PB of data per day
Popularized by the open-source Hadoop project
Used by Yahoo!, Facebook, Amazon, and others
What is MapReduce used for?
At Google: Index building for Google Search
Article clustering for Google News
Statistical machine translation
At Yahoo!: Index building for Yahoo! Search
Spam detection for Yahoo! Mail
At Facebook: Data mining
Ad optimization
Spam detection
Example: Facebook Lexicon
www.facebook.com/lexicon
What is MapReduce used for?
In research: Analyzing Wikipedia conflicts (PARC)
Natural language processing (CMU)
Bioinformatics (Maryland)
Astronomical image analysis (Washington)
Ocean climate simulation (Washington)
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop: Pig and Hive
MapReduce Design Goals
1. Scalability to large data volumes:
Scan 100 TB on 1 node @ 50 MB/s = 23 days
Scan on 1000-node cluster = 33 minutes

2. Cost-efficiency:
Commodity nodes (cheap, but unreliable)
Commodity network
Automatic fault tolerance (fewer admins)
Easy to use (fewer programmers)
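The arithmetic behind the scan-time figures above can be checked directly:

```python
# Back-of-envelope check of the scan-time figures quoted above
data_bytes = 100e12          # 100 TB
per_node_rate = 50e6         # 50 MB/s per node

one_node_days = data_bytes / per_node_rate / 86400
cluster_minutes = data_bytes / (per_node_rate * 1000) / 60

print(round(one_node_days, 1))    # 23.1 days on one node
print(round(cluster_minutes, 1))  # 33.3 minutes on 1000 nodes
```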
Typical Hadoop Cluster
Aggregation switch
Rack switch
40 nodes/rack, 1000-4000 nodes in cluster
1 Gbps bandwidth within rack, 8 Gbps out of rack
Node specs (Yahoo terasort): 8 x 2.0 GHz cores, 8 GB RAM, 4 disks (= 4 TB?)
Typical Hadoop Cluster
Image from http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/aw-apachecon-eu-2009.pdf
Challenges
Cheap nodes fail, especially if you have many
Mean time between failures for 1 node = 3 years
MTBF for 1000 nodes = 1 day
Solution: build fault tolerance into the system

Commodity network = low bandwidth
Solution: push computation to the data

Programming distributed systems is hard
Solution: data-parallel programming model: users write map and reduce functions, the system handles work distribution and fault tolerance
Hadoop Components
Distributed file system (HDFS)
Single namespace for entire cluster
Replicates data 3x for fault tolerance

MapReduce implementation
Executes user jobs specified as map and reduce functions
Manages work distribution & fault tolerance
Hadoop Distributed File System
Files split into 128 MB blocks
Blocks replicated across several datanodes (usually 3)
Single namenode stores metadata (file names, block locations, etc.)
Optimized for large files, sequential reads
Files are append-only
[Diagram: a namenode holding the metadata for File1, whose blocks 1-4 are each replicated on three of the datanodes]
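An illustrative sketch of 3x block replication. This is not HDFS's actual rack-aware placement policy; the node names and the hash-based choice of a starting node are assumptions for the example:

```python
import hashlib

def place_block(block_id, datanodes, replicas=3):
    # Pick `replicas` distinct datanodes for one block, starting from a
    # hash of the block id and wrapping around the node list
    start = int(hashlib.md5(block_id.encode()).hexdigest(), 16) % len(datanodes)
    return [datanodes[(start + i) % len(datanodes)] for i in range(replicas)]

nodes = ["dn1", "dn2", "dn3", "dn4"]
placement = {f"File1-block{i}": place_block(f"File1-block{i}", nodes)
             for i in range(1, 5)}
for block, where in sorted(placement.items()):
    print(block, where)
```

Each block lands on three distinct datanodes, so losing any single node leaves two copies of every block.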
MapReduce Programming Model
Data type: key-value records
Map function:
(Kin, Vin) → list(Kinter, Vinter)
Reduce function:
(Kinter, list(Vinter)) → list(Kout, Vout)
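A toy, single-process sketch of this dataflow; the in-memory engine below is illustrative, not the Hadoop implementation:

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, mapper, reducer):
    # Map phase: apply the mapper to every (K_in, V_in) record
    intermediate = []
    for k, v in records:
        intermediate.extend(mapper(k, v))
    # Shuffle & sort: group all intermediate values by key
    intermediate.sort(key=itemgetter(0))
    output = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        output.extend(reducer(key, [v for _, v in group]))
    return output

# Word Count expressed in this model
def wc_map(_, line):
    return [(word, 1) for word in line.split()]

def wc_reduce(word, counts):
    return [(word, sum(counts))]

lines = [(0, "the quick brown fox"), (1, "the fox ate the mouse")]
result = run_mapreduce(lines, wc_map, wc_reduce)
print(result)  # [('ate', 1), ('brown', 1), ('fox', 2), ('mouse', 1), ('quick', 1), ('the', 3)]
```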
Example: Word Count
def mapper(line):
    foreach word in line.split():
        output(word, 1)

def reducer(key, values):
    output(key, sum(values))
Word Count Execution
Input → Map → Shuffle & Sort → Reduce → Output

Input lines (one per mapper):
  "the quick brown fox"
  "the fox ate the mouse"
  "how now brown cow"

Map outputs:
  (the, 1) (quick, 1) (brown, 1) (fox, 1)
  (the, 1) (fox, 1) (ate, 1) (the, 1) (mouse, 1)
  (how, 1) (now, 1) (brown, 1) (cow, 1)

Reduce outputs (two reducers):
  Reducer 1: (brown, 2) (fox, 2) (how, 1) (now, 1) (the, 3)
  Reducer 2: (ate, 1) (cow, 1) (mouse, 1) (quick, 1)
MapReduce Execution Details
A single master controls job execution on multiple slaves, as well as user scheduling

Mappers preferentially placed on same node or same rack as their input block
  Push computation to data, minimize network use

Mappers save outputs to local disk rather than pushing directly to reducers
  Allows having more reducers than nodes
  Allows recovery if a reducer crashes
An Optimization: The Combiner
A combiner is a local aggregation function for repeated keys produced by the same map
For associative operations like sum, count, max
Decreases size of intermediate data

Example: local counting for Word Count:

def combiner(key, values):
    output(key, sum(values))
Word Count with Combiner
Input → Map & Combine → Shuffle & Sort → Reduce → Output

Input lines (one per mapper):
  "the quick brown fox"
  "the fox ate the mouse"
  "how now brown cow"

Map & combine outputs (note (the, 2) from the second mapper):
  (the, 1) (quick, 1) (brown, 1) (fox, 1)
  (the, 2) (fox, 1) (ate, 1) (mouse, 1)
  (how, 1) (now, 1) (brown, 1) (cow, 1)

Reduce outputs (two reducers):
  Reducer 1: (brown, 2) (fox, 2) (how, 1) (now, 1) (the, 3)
  Reducer 2: (ate, 1) (cow, 1) (mouse, 1) (quick, 1)
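The combiner's effect can be checked locally; the `combine` helper below is an illustrative stand-in for Hadoop's combiner stage:

```python
def map_words(line):
    return [(word, 1) for word in line.split()]

def combine(pairs):
    # Local aggregation of repeated keys produced by the same map task;
    # valid here because addition is associative
    agg = {}
    for key, value in pairs:
        agg[key] = agg.get(key, 0) + value
    return list(agg.items())

raw = map_words("the fox ate the mouse")   # 5 pairs; 'the' appears twice
combined = combine(raw)                    # 4 pairs; ('the', 2) shipped once
print(len(raw), len(combined))  # 5 4
```

The reducer's result is unchanged, but fewer intermediate pairs cross the network.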
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop: Pig and Hive
Fault Tolerance in MapReduce
1. If a task crashes: retry on another node
  Okay for a map because it had no dependencies
  Okay for a reduce because map outputs are on disk
  If the same task repeatedly fails, fail the job or ignore that input block (user-controlled)

Note: for this and the other fault tolerance features to work, your map and reduce tasks must be side-effect-free
Fault Tolerance in MapReduce
2. If a node crashes: relaunch its current tasks on other nodes
  Relaunch any maps the node previously ran
  Necessary because their output files were lost along with the crashed node
Fault Tolerance in MapReduce
3. If a task is going slowly (straggler): launch a second copy of the task on another node
  Take the output of whichever copy finishes first, and kill the other one

Critical for performance in large clusters: stragglers occur frequently due to failing hardware, bugs, misconfiguration, etc.
Takeaways
By providing a data-parallel programming model, MapReduce can control job execution in useful ways:
  Automatic division of job into tasks
  Automatic placement of computation near data
  Automatic load balancing
  Recovery from failures & stragglers

User focuses on the application, not on the complexities of distributed computing
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop: Pig and Hive
1. Search
Input: (lineNumber, line) records
Output: lines matching a given pattern
Map:
    if (line matches pattern):
        output(line)

Reduce: identity function
Alternative: no reducer (map-only job)
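A minimal local sketch of this map-only search job; the pattern and input lines are illustrative:

```python
import re

PATTERN = re.compile(r"fox")  # illustrative search pattern

def search_mapper(line):
    # Map-only job: emit matching lines; no reduce phase is needed
    if PATTERN.search(line):
        yield line

lines = ["the quick brown fox", "how now brown cow"]
matches = [m for line in lines for m in search_mapper(line)]
print(matches)  # ['the quick brown fox']
```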
2. Sort

Input: (key, value) records
Output: same records, sorted by key

Map: identity function
Reduce: identity function

Trick: pick a partitioning function h such that k1 < k2 implies h(k1) < h(k2); concatenating the reducers' sorted outputs then yields a globally sorted result
[Example output partitions: (aardvark, ant) | (bee, cow, elephant) | (pig, sheep, yak, zebra)]
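The partitioning trick can be sketched as a simple range partitioner; the boundary keys below are illustrative and would normally be chosen by sampling the key distribution:

```python
def range_partitioner(key, boundaries):
    # Order-preserving: k1 < k2 implies h(k1) <= h(k2), so concatenating
    # the reducers' sorted outputs yields one globally sorted file
    for i, boundary in enumerate(boundaries):
        if key < boundary:
            return i
    return len(boundaries)

boundaries = ["b", "f", "q"]  # illustrative; normally chosen by sampling
keys = ["ant", "bee", "cow", "pig", "zebra"]
parts = {k: range_partitioner(k, boundaries) for k in keys}
print(parts)  # {'ant': 0, 'bee': 1, 'cow': 1, 'pig': 2, 'zebra': 3}
```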
3. Inverted Index
Input: (filename, text) records
Output: list of files containing each word

Map:
    foreach word in text.split():
        output(word, filename)

Combine: uniquify filenames for each word

Reduce:
    def reduce(word, filenames):
        output(word, sort(filenames))
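A self-contained local sketch of the inverted-index job, using a plain dict of sets to stand in for the shuffle and combine stages:

```python
def invert(files):
    postings = {}
    for filename, text in files:
        for word in text.split():                           # map: (word, filename)
            postings.setdefault(word, set()).add(filename)  # shuffle + uniquify
    return {w: sorted(names) for w, names in postings.items()}  # reduce: sort

files = [("hamlet.txt", "to be or not to be"),
         ("12th.txt", "be not afraid of greatness")]
index = invert(files)
print(index["be"])   # ['12th.txt', 'hamlet.txt']
```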
Inverted Index Example
hamlet.txt: "to be or not to be"
12th.txt: "be not afraid of greatness"

Map outputs:
  (to, hamlet.txt) (be, hamlet.txt) (or, hamlet.txt) (not, hamlet.txt)
  (be, 12th.txt) (not, 12th.txt) (afraid, 12th.txt) (of, 12th.txt) (greatness, 12th.txt)

Final index:
  afraid → (12th.txt)
  be → (12th.txt, hamlet.txt)
  greatness → (12th.txt)
  not → (12th.txt, hamlet.txt)
  of → (12th.txt)
  or → (hamlet.txt)
  to → (hamlet.txt)
4. Most Popular Words
Input: (filename, text) records
Output: the 100 words occurring in the most files

Two-stage solution:
  Job 1: create an inverted index, giving (word, list(file)) records
  Job 2: map each (word, list(file)) to (count, word); sort these records by count as in the sort job

Optimizations:
  Map to (word, 1) instead of (word, file) in Job 1
  Estimate the count distribution in advance by sampling
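A local sketch of the two-stage solution, collapsed into one function. Top 2 instead of 100 for brevity, and ties are broken alphabetically so the result is deterministic:

```python
from collections import Counter

def most_popular_words(files, n=100):
    # Job 1: count, for each word, how many files contain it
    # (equivalent to emitting (word, 1) once per distinct word per file)
    file_counts = Counter()
    for _, text in files:
        for word in set(text.split()):
            file_counts[word] += 1
    # Job 2: sort by count (descending), break ties alphabetically
    ranked = sorted(file_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:n]]

files = [("a.txt", "the fox"), ("b.txt", "the cow"), ("c.txt", "a fox")]
print(most_popular_words(files, n=2))  # ['fox', 'the']  (each occurs in 2 files)
```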
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop: Pig and Hive
Getting Started with Hadoop
Download from hadoop.apache.org
To install locally, unzip and set JAVA_HOME
Details: hadoop.apache.org/core/docs/current/quickstart.html
Three ways to write jobs:
Java API
Hadoop Streaming (for Python, Perl, etc)
Pipes API (C++)
Word Count in Java
public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable ONE = new IntWritable(1);

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      output.collect(new Text(itr.nextToken()), ONE);
    }
  }
}
Word Count in Java
public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
Word Count in Java
public static void main(String[] args) throws Exception {
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");

  conf.setMapperClass(MapClass.class);
  conf.setCombinerClass(Reduce.class);
  conf.setReducerClass(Reduce.class);

  FileInputFormat.setInputPaths(conf, args[0]);
  FileOutputFormat.setOutputPath(conf, new Path(args[1]));

  conf.setOutputKeyClass(Text.class);           // out keys are words (strings)
  conf.setOutputValueClass(IntWritable.class);  // values are counts

  JobClient.runJob(conf);
}
Word Count in Python with Hadoop Streaming

Mapper.py:

import sys
for line in sys.stdin:
    for word in line.split():
        print(word.lower() + "\t1")

Reducer.py:

import sys
counts = {}
for line in sys.stdin:
    word, count = line.split("\t")
    counts[word] = counts.get(word, 0) + int(count)
for word, count in counts.items():
    print(word + "\t" + str(count))
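The streaming scripts can be tested locally by simulating the pipeline `cat input | Mapper.py | sort | Reducer.py`; the functions below mirror the scripts' logic in one process:

```python
def run_mapper(lines):
    for line in lines:
        for word in line.split():
            yield word.lower() + "\t1"

def run_reducer(lines):
    counts = {}
    for line in lines:
        word, count = line.split("\t")
        counts[word] = counts.get(word, 0) + int(count)
    for word, count in sorted(counts.items()):
        yield word + "\t" + str(count)

inp = ["the quick brown fox", "the fox ate the mouse"]
shuffled = sorted(run_mapper(inp))  # stands in for Hadoop's shuffle/sort
out = list(run_reducer(shuffled))
print(out)  # ['ate\t1', 'brown\t1', 'fox\t2', 'mouse\t1', 'quick\t1', 'the\t3']
```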
Outline
MapReduce architecture
Fault tolerance in MapReduce
Sample applications
Getting started with Hadoop
Higher-level languages on top of Hadoop: Pig and Hive
Motivation
MapReduce is great, as many algorithms can be expressed by a series of MR jobs

But it's low-level: must think about keys, values, partitioning, etc.
Can we capture common job patterns?
Pig
Started at Yahoo! Research
Now runs about 30% of Yahoo!'s jobs

Features:
  Expresses sequences of MapReduce jobs
  Data model: nested "bags" of items
  Provides relational (SQL) operators (JOIN, GROUP BY, etc.)
  Easy to plug in Java functions
  Pig Pen development environment for Eclipse
An Example Problem
Suppose you have user data in one file, website data in another, and you need to find the top 5 most visited pages by users aged 18-25.
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
In MapReduce
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class MRExample {
    public static class LoadPages extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable k, Text val,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // Pull the key out
            String line = val.toString();
            int firstComma = line.indexOf(',');
            String key = line.substring(0, firstComma);
            String value = line.substring(firstComma + 1);
            Text outKey = new Text(key);
            // Prepend an index to the value so we know which file
            // it came from.
            Text outVal = new Text("1" + value);
            oc.collect(outKey, outVal);
        }
    }
    public static class LoadAndFilterUsers extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable k, Text val,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // Pull the key out
            String line = val.toString();
            int firstComma = line.indexOf(',');
            String value = line.substring(firstComma + 1);
            int age = Integer.parseInt(value);
            if (age < 18 || age > 25) return;
            String key = line.substring(0, firstComma);
            Text outKey = new Text(key);
            // Prepend an index to the value so we know which file
            // it came from.
            Text outVal = new Text("2" + value);
            oc.collect(outKey, outVal);
        }
    }
    public static class Join extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text key,
                Iterator<Text> iter,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // For each value, figure out which file it's from and
            // store it accordingly.
            List<String> first = new ArrayList<String>();
            List<String> second = new ArrayList<String>();
            while (iter.hasNext()) {
                Text t = iter.next();
                String value = t.toString();
                if (value.charAt(0) == '1')
                    first.add(value.substring(1));
                else second.add(value.substring(1));
                reporter.setStatus("OK");
            }
            // Do the cross product and collect the values
            for (String s1 : first) {
                for (String s2 : second) {
                    String outval = key + "," + s1 + "," + s2;
                    oc.collect(null, new Text(outval));
                    reporter.setStatus("OK");
                }
            }
        }
    }
    public static class LoadJoined extends MapReduceBase
        implements Mapper<Text, Text, Text, LongWritable> {
        public void map(Text k, Text val,
                OutputCollector<Text, LongWritable> oc,
                Reporter reporter) throws IOException {
            // Find the url
            String line = val.toString();
            int firstComma = line.indexOf(',');
            int secondComma = line.indexOf(',', firstComma);
            String key = line.substring(firstComma, secondComma);
            // drop the rest of the record, I don't need it anymore,
            // just pass a 1 for the combiner/reducer to sum instead.
            Text outKey = new Text(key);
            oc.collect(outKey, new LongWritable(1L));
        }
    }
    public static class ReduceUrls extends MapReduceBase
        implements Reducer<Text, LongWritable, WritableComparable, Writable> {
        public void reduce(Text key,
                Iterator<LongWritable> iter,
                OutputCollector<WritableComparable, Writable> oc,
                Reporter reporter) throws IOException {
            // Add up all the values we see
            long sum = 0;
            while (iter.hasNext()) {
                sum += iter.next().get();
                reporter.setStatus("OK");
            }
            oc.collect(key, new LongWritable(sum));
        }
    }
    public static class LoadClicks extends MapReduceBase
        implements Mapper<WritableComparable, Writable, LongWritable, Text> {
        public void map(WritableComparable key, Writable val,
                OutputCollector<LongWritable, Text> oc,
                Reporter reporter) throws IOException {
            oc.collect((LongWritable)val, (Text)key);
        }
    }
    public static class LimitClicks extends MapReduceBase
        implements Reducer<LongWritable, Text, LongWritable, Text> {
        int count = 0;
        public void reduce(LongWritable key,
                Iterator<Text> iter,
                OutputCollector<LongWritable, Text> oc,
                Reporter reporter) throws IOException {
            // Only output the first 100 records
            while (count < 100 && iter.hasNext()) {
                oc.collect(key, iter.next());
                count++;
            }
        }
    }
    public static void main(String[] args) throws IOException {
        JobConf lp = new JobConf(MRExample.class);
        lp.setJobName("Load Pages");
        lp.setInputFormat(TextInputFormat.class);
        lp.setOutputKeyClass(Text.class);
        lp.setOutputValueClass(Text.class);
        lp.setMapperClass(LoadPages.class);
        FileInputFormat.addInputPath(lp, new Path("/user/gates/pages"));
        FileOutputFormat.setOutputPath(lp,
            new Path("/user/gates/tmp/indexed_pages"));
        lp.setNumReduceTasks(0);
        Job loadPages = new Job(lp);

        JobConf lfu = new JobConf(MRExample.class);
        lfu.setJobName("Load and Filter Users");
        lfu.setInputFormat(TextInputFormat.class);
        lfu.setOutputKeyClass(Text.class);
        lfu.setOutputValueClass(Text.class);
        lfu.setMapperClass(LoadAndFilterUsers.class);
        FileInputFormat.addInputPath(lfu, new Path("/user/gates/users"));
        FileOutputFormat.setOutputPath(lfu,
            new Path("/user/gates/tmp/filtered_users"));
        lfu.setNumReduceTasks(0);
        Job loadUsers = new Job(lfu);

        JobConf join = new JobConf(MRExample.class);
        join.setJobName("Join Users and Pages");
        join.setInputFormat(KeyValueTextInputFormat.class);
        join.setOutputKeyClass(Text.class);
        join.setOutputValueClass(Text.class);
        join.setMapperClass(IdentityMapper.class);
        join.setReducerClass(Join.class);
        FileInputFormat.addInputPath(join, new Path("/user/gates/tmp/indexed_pages"));
        FileInputFormat.addInputPath(join, new Path("/user/gates/tmp/filtered_users"));
        FileOutputFormat.setOutputPath(join, new Path("/user/gates/tmp/joined"));
        join.setNumReduceTasks(50);
        Job joinJob = new Job(join);
        joinJob.addDependingJob(loadPages);
        joinJob.addDependingJob(loadUsers);

        JobConf group = new JobConf(MRExample.class);
        group.setJobName("Group URLs");
        group.setInputFormat(KeyValueTextInputFormat.class);
        group.setOutputKeyClass(Text.class);
        group.setOutputValueClass(LongWritable.class);
        group.setOutputFormat(SequenceFileOutputFormat.class);
        group.setMapperClass(LoadJoined.class);
        group.setCombinerClass(ReduceUrls.class);
        group.setReducerClass(ReduceUrls.class);
        FileInputFormat.addInputPath(group, new Path("/user/gates/tmp/joined"));
        FileOutputFormat.setOutputPath(group, new Path("/user/gates/tmp/grouped"));
        group.setNumReduceTasks(50);
        Job groupJob = new Job(group);
        groupJob.addDependingJob(joinJob);

        JobConf top100 = new JobConf(MRExample.class);
        top100.setJobName("Top 100 sites");
        top100.setInputFormat(SequenceFileInputFormat.class);
        top100.setOutputKeyClass(LongWritable.class);
        top100.setOutputValueClass(Text.class);
        top100.setOutputFormat(SequenceFileOutputFormat.class);
        top100.setMapperClass(LoadClicks.class);
        top100.setCombinerClass(LimitClicks.class);
        top100.setReducerClass(LimitClicks.class);
        FileInputFormat.addInputPath(top100, new Path("/user/gates/tmp/grouped"));
        FileOutputFormat.setOutputPath(top100,
            new Path("/user/gates/top100sitesforusers18to25"));
        top100.setNumReduceTasks(1);
        Job limit = new Job(top100);
        limit.addDependingJob(groupJob);

        JobControl jc = new JobControl("Find top 100 sites for users 18 to 25");
        jc.addJob(loadPages);
        jc.addJob(loadUsers);
        jc.addJob(joinJob);
        jc.addJob(groupJob);
        jc.addJob(limit);
        jc.run();
    }
}
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
In Pig Latin

Users = load 'users' as (name, age);
Fltrd = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Joined = join Fltrd by name, Pages by user;
Grouped = group Joined by url;
Summed = foreach Grouped generate group, COUNT(Joined) as clicks;
Sorted = order Summed by clicks desc;
Top5 = limit Sorted 5;
store Top5 into 'top5sites';
Ease of Translation
Notice how naturally the components of the job translate into Pig Latin.
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load
Fltrd = filter
Pages = load
Joined = join
Grouped = group
Summed = count()
Sorted = order
Top5 = limit
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
Ease of Translation
Notice how naturally the components of the job translate into Pig Latin.
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Users = load
Fltrd = filter
Pages = load
Joined = join
Grouped = group
Summed = count()
Sorted = order
Top5 = limit
Job 1
Job 2
Job 3
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
Hive
Developed at Facebook
Used for the majority of Facebook jobs
Relational database built on Hadoop
Maintains list of table schemas
SQL-like query language (HQL)
Can call Hadoop Streaming scripts from HQL
Supports table partitioning, clustering, complex data types, some optimizations
Creating a Hive Table
CREATE TABLE page_views(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'User IP address')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;
Partitioning breaks the table into separate files for each (dt, country) pair
Ex: /hive/page_view/dt=2008-06-08,country=US
    /hive/page_view/dt=2008-06-08,country=CA
Simple Query
SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01'
  AND page_views.date <= '2008-03-31';
Aggregation and Joins
Count users who visited each page, broken down by gender:

SELECT pv.page_url, u.gender, COUNT(DISTINCT u.id)
FROM page_views pv JOIN user u ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03'
GROUP BY pv.page_url, u.gender;

Sample output:

page_url   gender  count(userid)
home.php   MALE    12,141,412
home.php   FEMALE  15,431,579
photo.php  MALE    23,941,451
photo.php  FEMALE  21,231,314
Using a Hadoop StreamingMapper Script
SELECT TRANSFORM(page_views.userid,
page_views.date)
USING 'map_script.py'
AS dt, uid CLUSTER BY dtFROM page_views;
Conclusions
MapReduce's data-parallel programming model hides the complexity of distribution and fault tolerance

Principal philosophies:
  Make it scale, so you can throw hardware at problems
  Make it cheap, saving hardware, programmer, and administration costs (but requiring fault tolerance)

Hive and Pig further simplify programming

MapReduce is not suitable for all problems, but when it works, it may save you a lot of time
Yahoo! Super Computer Cluster - M45
Yahoo!'s cluster is part of the Open Cirrus Testbed created by HP, Intel, and Yahoo! (see press release at http://research.yahoo.com/node/2328).
The availability of the Yahoo! cluster was first announced in November 2007 (see press release at http://research.yahoo.com/node/1879).
The cluster has approximately 4,000 processor-cores and 1.5 petabytes of disks.
The Yahoo! cluster is intended to run the Apache open source software Hadoop and Pig.
Each selected university will share the partition with up to three other universities. The initial duration of use is 6 months, potentially renewable for another 6 months upon written agreement.
For further information, please contact: http://cloud.citris-uc.org/
Dr. Masoud Nikravesh, CITRIS and LBNL, Executive Director, CSE
Nikravesh@eecs.berkeley.edu, Phone: (510) 643-4522
Resources
Hadoop: http://hadoop.apache.org/core/
Hadoop docs: http://hadoop.apache.org/core/docs/current/
Pig: http://hadoop.apache.org/pig
Hive: http://hadoop.apache.org/hive
Hadoop video tutorials from Cloudera: http://www.cloudera.com/hadoop-training