MapReduce - cs.tut.fiaaltone3/kurssit/hadoop/MapReduce.pdf · MapReduce example •Find the number...
Transcript of MapReduce - cs.tut.fiaaltone3/kurssit/hadoop/MapReduce.pdf · MapReduce example •Find the number...
![Page 1: MapReduce - cs.tut.fiaaltone3/kurssit/hadoop/MapReduce.pdf · MapReduce example •Find the number of unique purchasing locations for each product •Data: –Users •(user id, name,](https://reader034.fdocuments.us/reader034/viewer/2022051802/5aee05a27f8b9ac57a8b516b/html5/thumbnails/1.jpg)
MapReduce
Hadoop Seminar, TUT, 2014-10-22
Antti Nieminen
![Page 2: MapReduce - cs.tut.fiaaltone3/kurssit/hadoop/MapReduce.pdf · MapReduce example •Find the number of unique purchasing locations for each product •Data: –Users •(user id, name,](https://reader034.fdocuments.us/reader034/viewer/2022051802/5aee05a27f8b9ac57a8b516b/html5/thumbnails/2.jpg)
MapReduce
• MapReduce is a programming model for distributed processing of large data sets
• Scales ~linearly – Twice as many nodes -> twice as fast
– Achieved by exploiting data locality • Data processing where the data is
• Simple programming model – Programmer only needs to write two functions:
Map and Reduce
![Page 3: MapReduce - cs.tut.fiaaltone3/kurssit/hadoop/MapReduce.pdf · MapReduce example •Find the number of unique purchasing locations for each product •Data: –Users •(user id, name,](https://reader034.fdocuments.us/reader034/viewer/2022051802/5aee05a27f8b9ac57a8b516b/html5/thumbnails/3.jpg)
Map & Reduce
• The programmer writes two functions: – Map maps input data to key/value pairs
– Reduce processes the list of values for a given key
• The MapReduce framework (such as Hadoop) does the rest – Distributes the job among nodes
– Moves the data to/from nodes
– Handles node failures
– etc.
![Page 4: MapReduce - cs.tut.fiaaltone3/kurssit/hadoop/MapReduce.pdf · MapReduce example •Find the number of unique purchasing locations for each product •Data: –Users •(user id, name,](https://reader034.fdocuments.us/reader034/viewer/2022051802/5aee05a27f8b9ac57a8b516b/html5/thumbnails/4.jpg)
MapReduce
MAP TRE-1, 2°C TRE 2
TRE
8
SHUFFLE
TRE 9
REDUCE
TKU
8
HKI
4
TKU 8
HKI 9
TKU 5 HKI 7 TRE 5
TRE 3
HKI HKI HKI HKI HKI
TKU TKU TKU TKU
5 9 6 4 4
2 2 8 8
TRE 9 TRE 8
TKU-1, 5°C HKI-1, 7°C TRE-2, 1°C HKI-1, 5°C HKI-2, 9°C HKI-2, 6°C
…
![Page 5: MapReduce - cs.tut.fiaaltone3/kurssit/hadoop/MapReduce.pdf · MapReduce example •Find the number of unique purchasing locations for each product •Data: –Users •(user id, name,](https://reader034.fdocuments.us/reader034/viewer/2022051802/5aee05a27f8b9ac57a8b516b/html5/thumbnails/5.jpg)
MapReduce
TRE-1, 2°C TKU-1, 5°C HKI-1, 7°C TRE-2, 1°C HKI-1, 5°C HKI-2, 9°C HKI-2, 6°C
…
MAP TRE 2 TKU 5 HKI 7 TRE 5
TRE 3
HKI HKI HKI HKI HKI
TKU TKU TKU TKU
5 9 6 4 4
2 2 8 8
TRE 9 TRE 8
TRE
8
TKU
8
HKI
4
SHUFFLE
TRE 9
REDUCE
TKU 8
HKI 9
Node 1
Node 2
Node 3
Node Y
Node Z
Node X
![Page 6: MapReduce - cs.tut.fiaaltone3/kurssit/hadoop/MapReduce.pdf · MapReduce example •Find the number of unique purchasing locations for each product •Data: –Users •(user id, name,](https://reader034.fdocuments.us/reader034/viewer/2022051802/5aee05a27f8b9ac57a8b516b/html5/thumbnails/6.jpg)
MapReduce
MAP REDUCE
Map(k1,v1) → list(k2,v2) Reduce(k2, list(v2)) → list(v3)
![Page 7: MapReduce - cs.tut.fiaaltone3/kurssit/hadoop/MapReduce.pdf · MapReduce example •Find the number of unique purchasing locations for each product •Data: –Users •(user id, name,](https://reader034.fdocuments.us/reader034/viewer/2022051802/5aee05a27f8b9ac57a8b516b/html5/thumbnails/7.jpg)
Map & Reduce in Hadoop
• In Hadoop, Map and Reduce functions can be written in – Java
• org.apache.hadoop.mapreduce.lib
– C++ using Hadoop Pipes
– any language, using Hadoop Streaming
• Also a number of third party programming frameworks for Hadoop MapReduce – For Java, Scala, Python, Ruby, PHP, …
– See eg. this blog post
![Page 8: MapReduce - cs.tut.fiaaltone3/kurssit/hadoop/MapReduce.pdf · MapReduce example •Find the number of unique purchasing locations for each product •Data: –Users •(user id, name,](https://reader034.fdocuments.us/reader034/viewer/2022051802/5aee05a27f8b9ac57a8b516b/html5/thumbnails/8.jpg)
Mapper Java example
public class MyMapper extends
Mapper<LongWritable, Text, Text, IntWritable> {
@Override
public void map(LongWritable key, Text value, Context context) {
String city = cityFromValue(value);
int temp = tempFromValue(value); context.write(new Text(city), new IntWritable(temp));
}
}
Input key and value types Output key and value types
• The Mapper input types depend on the defined InputFormat • By default TextInputFormat
• Key (LongWritable): position in the file • Value (Text): the line
![Page 9: MapReduce - cs.tut.fiaaltone3/kurssit/hadoop/MapReduce.pdf · MapReduce example •Find the number of unique purchasing locations for each product •Data: –Users •(user id, name,](https://reader034.fdocuments.us/reader034/viewer/2022051802/5aee05a27f8b9ac57a8b516b/html5/thumbnails/9.jpg)
Reducer Java example
public class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> { @Override public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { int maxValue = Integer.MIN_VALUE; for (IntWritable value : values) { maxValue = Math.max(maxValue, value.get()); } context.write(key, new IntWritable(maxValue)); } }
Input key and value types Output key and value types
![Page 10: MapReduce - cs.tut.fiaaltone3/kurssit/hadoop/MapReduce.pdf · MapReduce example •Find the number of unique purchasing locations for each product •Data: –Users •(user id, name,](https://reader034.fdocuments.us/reader034/viewer/2022051802/5aee05a27f8b9ac57a8b516b/html5/thumbnails/10.jpg)
Run MapReduce example
Job job = new Job(); job.setJarByClass(MyClass.class); job.setJobName("Max temperature"); FileInputFormat.addInputPath(job, new Path("~/input")); FileInputFormat.setOutputPath(job, new Path("~/output")); job.setMapperClass(MyMapper.class); job.setReducerClass(MyReducer.class); job.setMapOutputKeyClass(Text.class); job.setMapOutputValueClass(IntWritable.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); job.waitForCompletion(true);
![Page 11: MapReduce - cs.tut.fiaaltone3/kurssit/hadoop/MapReduce.pdf · MapReduce example •Find the number of unique purchasing locations for each product •Data: –Users •(user id, name,](https://reader034.fdocuments.us/reader034/viewer/2022051802/5aee05a27f8b9ac57a8b516b/html5/thumbnails/11.jpg)
Hadoop Streaming
• Map and Reduce functions can be implemented in any language with the Hadoop Streaming API
• Input is read from standard input
• Output is written to standard output
• Input/output items are lines of the form key\tvalue – \t is the tabulator character
• Reducer input lines are grouped by key – One reducer instance may receive multiple keys
![Page 12: MapReduce - cs.tut.fiaaltone3/kurssit/hadoop/MapReduce.pdf · MapReduce example •Find the number of unique purchasing locations for each product •Data: –Users •(user id, name,](https://reader034.fdocuments.us/reader034/viewer/2022051802/5aee05a27f8b9ac57a8b516b/html5/thumbnails/12.jpg)
Python example
• mapper.py
import sys for line in sys.stdin: city, temp = city_temp_from_line(line) print('%s\t%s' % (city, temp)) • reducer.py
import sys last_key = None max_val = None # < anything for line in sys.stdin: key, val = line.strip().split('\t') if key != last_key: print('%s\t%s' % (last_key, max_val)) max_val = None last_key = key max_val = max(val, max_val) if last_key: print('%s\t%s' % (last_key, max_val))
![Page 13: MapReduce - cs.tut.fiaaltone3/kurssit/hadoop/MapReduce.pdf · MapReduce example •Find the number of unique purchasing locations for each product •Data: –Users •(user id, name,](https://reader034.fdocuments.us/reader034/viewer/2022051802/5aee05a27f8b9ac57a8b516b/html5/thumbnails/13.jpg)
Run Hadoop Streaming
• Debug using Unix pipes: cat sample.txt | ./mapper.py | sort | ./reducer.py
• On Hadoop: hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \ -input sample.txt \ -output output \ -mapper ./mapper.py \ -reducer ./reducer.py
![Page 14: MapReduce - cs.tut.fiaaltone3/kurssit/hadoop/MapReduce.pdf · MapReduce example •Find the number of unique purchasing locations for each product •Data: –Users •(user id, name,](https://reader034.fdocuments.us/reader034/viewer/2022051802/5aee05a27f8b9ac57a8b516b/html5/thumbnails/14.jpg)
Combiners
A A
1 5
A 3 B 2
Map node 1
A
Reduce node for key A
![Page 15: MapReduce - cs.tut.fiaaltone3/kurssit/hadoop/MapReduce.pdf · MapReduce example •Find the number of unique purchasing locations for each product •Data: –Users •(user id, name,](https://reader034.fdocuments.us/reader034/viewer/2022051802/5aee05a27f8b9ac57a8b516b/html5/thumbnails/15.jpg)
Combiners
A A
1 5
A 3 B 2
Map node 1
A
Reduce node for key A
A 1 5 3
A 5
Combiner
![Page 16: MapReduce - cs.tut.fiaaltone3/kurssit/hadoop/MapReduce.pdf · MapReduce example •Find the number of unique purchasing locations for each product •Data: –Users •(user id, name,](https://reader034.fdocuments.us/reader034/viewer/2022051802/5aee05a27f8b9ac57a8b516b/html5/thumbnails/16.jpg)
Combiners
• Combiner can ”compress” data on a mapper node before sending it forward
• Combiner input/output types must equal the mapper output types
• In Hadoop Java, Combiners use the Reducer interface
job.setCombinerClass(MyReducer.class);
![Page 17: MapReduce - cs.tut.fiaaltone3/kurssit/hadoop/MapReduce.pdf · MapReduce example •Find the number of unique purchasing locations for each product •Data: –Users •(user id, name,](https://reader034.fdocuments.us/reader034/viewer/2022051802/5aee05a27f8b9ac57a8b516b/html5/thumbnails/17.jpg)
Reducer as a Combiner
• Reducer can be used as a Combiner if it is commutative and associative – Eg. max is
• max(1, 2, max(3,4,5)) = max(max(2, 4), max(1, 5, 3))
• true for any order of function applications…
– Eg. avg is not • avg(1, 2, avg(3, 4, 5)) = 2.33333 ≠
avg(avg(2, 4), avg(1, 5, 3)) = 3
• Note: if Reducer is not c&a, Combiners can still be used – The Combiner just has to be different from the Reducer
and designed for the specific case
![Page 18: MapReduce - cs.tut.fiaaltone3/kurssit/hadoop/MapReduce.pdf · MapReduce example •Find the number of unique purchasing locations for each product •Data: –Users •(user id, name,](https://reader034.fdocuments.us/reader034/viewer/2022051802/5aee05a27f8b9ac57a8b516b/html5/thumbnails/18.jpg)
MapReduce example
• Find the number of unique purchasing locations for each product
• Data: – Users
• (user id, name, location, …)
– Transactions • (transaction id, product id, product name, user id, …)
• The example stolen from here – Here are some more examples…
![Page 19: MapReduce - cs.tut.fiaaltone3/kurssit/hadoop/MapReduce.pdf · MapReduce example •Find the number of unique purchasing locations for each product •Data: –Users •(user id, name,](https://reader034.fdocuments.us/reader034/viewer/2022051802/5aee05a27f8b9ac57a8b516b/html5/thumbnails/19.jpg)
u1, Antti, FI u2, Bob, US u3, Carola, SE
p1, Apple, u1 p1, Apple, u2
p2, Banana, u2
u1 FI u2 US u3 SE
u1 p1
u2 p1 u2 p2
u1 FI p1
u2 US p1
p2
u3
p1 FI
p1 p2
US US
Users
Transactions
MAP 1 REDUCE 1
p1 FI p1 p2
US US
p1 FI p1 p2
US US
p1 FI US
p2 US
p1 2
p2 1
MAP 2 REDUCE 2
p1 FI
p1 p2
US US
![Page 20: MapReduce - cs.tut.fiaaltone3/kurssit/hadoop/MapReduce.pdf · MapReduce example •Find the number of unique purchasing locations for each product •Data: –Users •(user id, name,](https://reader034.fdocuments.us/reader034/viewer/2022051802/5aee05a27f8b9ac57a8b516b/html5/thumbnails/20.jpg)
The End
Questions? Comments?