Introduction to Data Miningukang/courses/19S-DM/L4-map... · 2019-03-08 · Introduction to Data...
1U Kang
Introduction to Data Mining
Lecture #4: MapReduce-2
U Kang
Seoul National University
In This Lecture
- Problems suited for MapReduce
  - Host size
  - Language model
  - Distributed grep
  - Reverse web link graph
  - Term-vector per host
  - Inverted index
- Measuring cost of MapReduce algorithms
Outline
- Problems Suited For Map-Reduce
- Pointers and Further Reading
Example: Host size
- Suppose we have a large web corpus
- Look at the metadata file
  - Lines of the form: (URL, size, date, …)
- For each host, find the total number of bytes
  - That is, the sum of the page sizes for all URLs from that particular host
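A minimal single-machine sketch of how the host-size problem might be phrased as Map and Reduce functions. The function names, the `run` driver that stands in for the shuffle, and the sample metadata records are all illustrative assumptions, not part of the lecture.

```python
from collections import defaultdict
from urllib.parse import urlparse

def map_host_size(record):
    """Map: emit (host, size) for one metadata record (URL, size, date)."""
    url, size, _date = record
    yield urlparse(url).netloc, size

def reduce_host_size(host, sizes):
    """Reduce: sum the page sizes seen for one host."""
    yield host, sum(sizes)

def run(records):
    """Toy driver: group map output by key, then call reduce per key."""
    groups = defaultdict(list)  # stands in for the shuffle phase
    for record in records:
        for host, size in map_host_size(record):
            groups[host].append(size)
    totals = {}
    for host, sizes in groups.items():
        for key, total in reduce_host_size(host, sizes):
            totals[key] = total
    return totals

# Made-up metadata lines of the form (URL, size, date)
metadata = [
    ("http://a.example/index.html", 100, "2019-03-01"),
    ("http://a.example/about.html", 250, "2019-03-02"),
    ("http://b.example/", 40, "2019-03-03"),
]
host_sizes = run(metadata)
```

Because addition is associative and commutative, the same reduce function could plausibly also serve as a combiner on the map side.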
Example: Language Model
- Statistical machine translation:
  - Need to count the number of times every 5-word sequence occurs in a large corpus of documents
  - Called n-gram (in this case, n = 5)
- Very easy with MapReduce:
  - Map: extract (5-word sequence, count) from document
  - Reduce: combine the counts
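The n-gram count could be sketched as follows. Whitespace tokenization, the function names, and the in-memory grouping that replaces the shuffle are assumptions made for illustration.

```python
from collections import defaultdict

N = 5  # n-gram length used on the slide

def map_ngrams(document, n=N):
    """Map: emit (n-word sequence, 1) for every window in the document."""
    words = document.split()  # assumption: whitespace tokenization
    for i in range(len(words) - n + 1):
        yield tuple(words[i:i + n]), 1

def reduce_counts(ngram, counts):
    """Reduce: combine the counts for one n-gram."""
    yield ngram, sum(counts)

def count_ngrams(documents, n=N):
    """Toy driver: group map output by n-gram, then reduce each group."""
    groups = defaultdict(list)
    for doc in documents:
        for ngram, one in map_ngrams(doc, n):
            groups[ngram].append(one)
    result = {}
    for ngram, counts in groups.items():
        for key, total in reduce_counts(ngram, counts):
            result[key] = total
    return result

ngram_counts = count_ngrams(["a b c d e f", "a b c d e"])
```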
More Examples
- Distributed Grep
  - Map(): emits a line if it matches a supplied pattern
- Reverse Web-Link graph
  - Map(): output <target, source> for each target in a source web page
  - Reduce(): output <target, list(source)>
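Both examples above can be sketched in a few lines. The slide gives no Reduce for grep; the sketch treats it as an identity step that passes matched lines through, which is how the MapReduce paper describes it. Function names and the sample inputs are hypothetical.

```python
import re
from collections import defaultdict

def map_grep(line, pattern):
    """Map: emit the line if it matches the supplied pattern."""
    if re.search(pattern, line):
        yield line, ""

def distributed_grep(lines, pattern):
    """Identity reduce: matched lines are copied to the output unchanged."""
    return [key for line in lines for key, _ in map_grep(line, pattern)]

def map_links(source, targets):
    """Map: emit <target, source> for each link found in the source page."""
    for target in targets:
        yield target, source

def reduce_links(target, sources):
    """Reduce: emit <target, list(source)>."""
    yield target, sorted(sources)

def reverse_link_graph(pages):
    """Toy driver; `pages` maps each source page to the targets it links to."""
    groups = defaultdict(list)
    for source, targets in pages.items():
        for target, src in map_links(source, targets):
            groups[target].append(src)
    result = {}
    for target, sources in groups.items():
        for key, value in reduce_links(target, sources):
            result[key] = value
    return result

matches = distributed_grep(["error: disk", "ok", "error: net"], r"^error")
reversed_graph = reverse_link_graph({"p1": ["p2", "p3"], "p2": ["p3"]})
```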
More Examples
- Term-Vector per Host
  - Term vector: summarizes the most important words that occur in a given host
  - Map: output <hostname, term vector> for a given document
  - Reduce: add up the term vectors for a host, keep only the frequent terms, and output <hostname, term vector>
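One way the term-vector job might look, as a sketch. The slide does not say how "frequent" is decided, so the `min_count` threshold is an assumption, as are the word-count representation of a term vector and the sample documents.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse

def map_term_vector(url, text):
    """Map: emit <hostname, term vector> for one document."""
    yield urlparse(url).netloc, Counter(text.lower().split())

def reduce_term_vector(host, vectors, min_count=2):
    """Reduce: add the vectors for a host and drop infrequent terms.

    min_count is a hypothetical cutoff chosen for this example.
    """
    total = Counter()
    for vector in vectors:
        total.update(vector)
    yield host, {term: c for term, c in total.items() if c >= min_count}

def term_vectors(docs, min_count=2):
    """Toy driver: group per-document vectors by host, then reduce."""
    groups = defaultdict(list)
    for url, text in docs:
        for host, vector in map_term_vector(url, text):
            groups[host].append(vector)
    result = {}
    for host, vectors in groups.items():
        for key, vector in reduce_term_vector(host, vectors, min_count):
            result[key] = vector
    return result

docs = [
    ("http://a.example/1", "data mining data"),
    ("http://a.example/2", "mining graphs"),
    ("http://b.example/", "hello"),
]
host_vectors = term_vectors(docs)
```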
More Examples
- Inverted index
  - Map(): output <word, document ID>
  - Reduce(): output <word, list(document ID)>
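A small sketch of the inverted index. Lower-casing, whitespace tokenization, and emitting each word once per document are illustrative choices, not something the slide specifies.

```python
from collections import defaultdict

def map_index(doc_id, text):
    """Map: emit <word, document ID> once per distinct word in the document."""
    for word in set(text.lower().split()):
        yield word, doc_id

def reduce_index(word, doc_ids):
    """Reduce: emit <word, list(document ID)> as a sorted posting list."""
    yield word, sorted(doc_ids)

def build_index(docs):
    """Toy driver; `docs` maps document ID to document text."""
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        for word, d in map_index(doc_id, text):
            groups[word].append(d)
    index = {}
    for word, doc_ids in groups.items():
        for key, postings in reduce_index(word, doc_ids):
            index[key] = postings
    return index

index = build_index({1: "map reduce", 2: "reduce join reduce"})
```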
Example: Join By Map-Reduce
- Compute the natural join R(A,B) ⋈ S(B,C)
- R and S are each stored in files
- Tuples are pairs (a,b) or (b,c)

R (Enrollment): A = Student id, B = Course id

A   B
a1  b1
a2  b1
a3  b2
a4  b3

S (Course Info.): B = Course id, C = Course name

B   C
b2  c1
b2  c2
b3  c3

R ⋈ S: A = Student id, C = Course name

A   C
a3  c1
a3  c2
a4  c3
Map-Reduce Join
- Use a hash function h from B-values to 1...k
- A Map process turns:
  - Each input tuple R(a,b) into key-value pair (b,(a,R))
  - Each input tuple S(b,c) into (b,(c,S))
- Map processes send each key-value pair with key b to Reduce process h(b)
  - Hadoop does this automatically; just tell it what k is.
- Each Reduce process matches all the pairs (b,(a,R)) with all (b,(c,S)) and outputs (a,b,c).
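The join steps above can be sketched on one machine as follows. The function names are hypothetical, and `hash(b) % k` stands in for h(b), so reducers are numbered 0..k-1 here rather than 1..k. The sample relations are the R and S tables from the join example.

```python
from collections import defaultdict

def map_r(tup):
    """Map for R: turn R(a, b) into (b, (a, "R"))."""
    a, b = tup
    yield b, (a, "R")

def map_s(tup):
    """Map for S: turn S(b, c) into (b, (c, "S"))."""
    b, c = tup
    yield b, (c, "S")

def reduce_join(b, values):
    """Reduce: pair every R-value with every S-value that shares key b."""
    r_side = [v for v, tag in values if tag == "R"]
    s_side = [v for v, tag in values if tag == "S"]
    for a in r_side:
        for c in s_side:
            yield a, b, c

def mapreduce_join(R, S, k=2):
    """Toy driver with k simulated Reduce processes."""
    reducers = [defaultdict(list) for _ in range(k)]
    pairs = [kv for t in R for kv in map_r(t)]
    pairs += [kv for t in S for kv in map_s(t)]
    for b, value in pairs:
        reducers[hash(b) % k][b].append(value)  # route key b to reducer h(b)
    result = []
    for groups in reducers:
        for b, values in groups.items():
            result.extend(reduce_join(b, values))
    return sorted(result)

R = [("a1", "b1"), ("a2", "b1"), ("a3", "b2"), ("a4", "b3")]
S = [("b2", "c1"), ("b2", "c2"), ("b3", "c3")]
joined = mapreduce_join(R, S, k=2)
```

The output is sorted only so that the result does not depend on which reducer a key happens to land on.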
Cost Measures for Algorithms
In MapReduce we quantify the cost of an algorithm using
1. Communication cost = total I/O of all processes
2. Elapsed communication cost = max of I/O along any path
3. (Elapsed) computation cost: analogous, but count only running time of processes
Example: Cost Measures
For a map-reduce algorithm:
- Communication cost = input file size + 2 × (sum of the sizes of all files passed from Map processes to Reduce processes) + the sum of the output sizes of the Reduce processes.
- Elapsed communication cost is the sum of the largest input + output for any map process, plus the same for any reduce process
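A worked instance of the two formulas, with made-up sizes. Each process is described by a hypothetical (input, output) pair in arbitrary units; the factor 2 appears because every intermediate file is counted once as Map output and once as Reduce input.

```python
def communication_cost(input_size, shuffle_size, output_size):
    """Input file size + 2 * (Map-to-Reduce data) + Reduce output size."""
    return input_size + 2 * shuffle_size + output_size

def elapsed_communication_cost(map_io, reduce_io):
    """Largest input + output of any Map process, plus the same for Reduce."""
    return max(i + o for i, o in map_io) + max(i + o for i, o in reduce_io)

# Made-up job: two Map and two Reduce processes, (input, output) per process
map_io = [(50, 30), (50, 40)]
reduce_io = [(45, 10), (25, 5)]

input_size = sum(i for i, _ in map_io)       # 100
shuffle_size = sum(o for _, o in map_io)     # 70, equals total Reduce input
output_size = sum(o for _, o in reduce_io)   # 15

total = communication_cost(input_size, shuffle_size, output_size)
elapsed = elapsed_communication_cost(map_io, reduce_io)

# The total is also the sum of input + output over all processes
per_process_total = sum(i + o for i, o in map_io + reduce_io)
```

Here the total is 100 + 2 × 70 + 15 = 255, while the elapsed cost is 90 + 55 = 145.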
What Cost Measures Mean
- Either the I/O (communication) or processing (computation) cost dominates
  - Ignore one or the other
- Total cost tells what you pay in rent from your friendly neighborhood cloud
- Elapsed cost is wall-clock time using parallelism
Cost of Map-Reduce Join
- Total communication cost = O(|R|+|S|+|R ⋈ S|)
- Elapsed communication cost = O(s)
  - We’re going to pick k (# of reducers) and the number of mappers so that the I/O limit s is respected
  - We put a limit s on the amount of input or output that any one process can have. s could be:
    - What fits in main memory
    - What fits on local disk
- In many cases, computation cost is linear in the input + output size
  - So computation cost is like comm. cost
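A rough sketch of these costs for the join, counting sizes in tuples (an assumption; the slides do not fix a unit). Each input tuple produces exactly one key-value pair, so the Map-to-Reduce traffic is |R| + |S|. The reducer estimate assumes keys hash evenly and looks only at reducer input, so it ignores skew and output size.

```python
import math

def join_communication_cost(r_size, s_size, join_size):
    """Total communication cost of the Map-Reduce join, in tuples."""
    input_size = r_size + s_size
    shuffle_size = r_size + s_size  # one key-value pair per input tuple
    return input_size + 2 * shuffle_size + join_size

def min_reducers(r_size, s_size, s_limit):
    """Smallest k keeping average reducer input within the I/O limit s.

    Assumes an even spread of keys; a skewed B-value can still overload
    one reducer.
    """
    return math.ceil((r_size + s_size) / s_limit)

# Sizes from the join example: |R| = 4, |S| = 3, |R join S| = 3
total = join_communication_cost(4, 3, 3)
k = min_reducers(4, 3, s_limit=4)
```

With these numbers the total is 3 × (4 + 3) + 3 = 24 tuples, which is linear in |R| + |S| + |R ⋈ S| as the bound states.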
Outline
- Problems Suited For Map-Reduce
- Pointers and Further Reading
Implementations
- Google
  - Not available outside Google
- Hadoop
  - An open-source implementation in Java
  - Uses HDFS for stable storage
  - Download: http://lucene.apache.org/hadoop/
Cloud Computing
- Ability to rent computing by the hour
  - Additional services, e.g., persistent storage
- Amazon’s “Elastic Compute Cloud” (EC2)
Reading
- Jeffrey Dean and Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters
  - http://static.googleusercontent.com/media/research.google.com/ko//archive/mapreduce-osdi04.pdf
  - Must Read!
- Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung: The Google File System
  - http://static.googleusercontent.com/media/research.google.com/ko//archive/gfs-sosp2003.pdf
Resources
- Hadoop Wiki
  - Introduction
    - http://wiki.apache.org/lucene-hadoop/
  - Getting Started
    - http://wiki.apache.org/lucene-hadoop/GettingStartedWithHadoop
  - Map/Reduce Overview
    - http://wiki.apache.org/lucene-hadoop/HadoopMapReduce
    - http://wiki.apache.org/lucene-hadoop/HadoopMapRedClasses
  - Eclipse Environment
    - http://wiki.apache.org/lucene-hadoop/EclipseEnvironment
- Javadoc
  - http://lucene.apache.org/hadoop/docs/api/
Resources
- Releases from Apache download mirrors
  - http://www.apache.org/dyn/closer.cgi/lucene/hadoop/
- Nightly builds of source
  - http://people.apache.org/dist/lucene/hadoop/nightly/
- Source code from subversion
  - http://lucene.apache.org/hadoop/version_control.html
Further Reading
- Programming model inspired by functional language primitives
- Partitioning/shuffling similar to many large-scale sorting systems
  - NOW-Sort ['97]
- Re-execution for fault tolerance
  - BAD-FS ['04] and TACC ['97]
- Locality optimization has parallels with Active Disks/Diamond work
  - Active Disks ['01], Diamond ['04]
- Backup tasks similar to Eager Scheduling in Charlotte system
  - Charlotte ['96]
- Dynamic load balancing solves similar problem as River's distributed queues
  - River ['99]
Questions?