Introduction to MapReduce Paradigm for Data Mining COSC 526 Class 2 Arvind Ramanathan Computational...
![Page 1: Introduction to MapReduce Paradigm for Data Mining COSC 526 Class 2 Arvind Ramanathan Computational Science & Engineering Division Oak Ridge National Laboratory,](https://reader036.fdocuments.us/reader036/viewer/2022062518/56649f565503460f94c798d4/html5/thumbnails/1.jpg)
Introduction to MapReduce Paradigm for Data Mining
COSC 526 Class 2
Arvind Ramanathan
Computational Science & Engineering Division
Oak Ridge National Laboratory, Oak Ridge
Ph: 865-576-7266
E-mail: [email protected]
2 Verification, Validation and Uncertainty Quantification of Machine Learning Algorithms: Phase I Demonstration
Last class …
• Class Logistics
• Introduction to big data
• Types of data and compute systems
• Bonferroni Principle and “how-not-to-design-an-experiment”
• The Big Data Mining Process
This class…
• Need for Map Reduce Paradigm
• Map Reduce
• Decision making and Design of Map Reduce algorithms
• Example usage for easy statistics:
– Word count
– Co-occurrence counts
What is common to data mining/analytic algorithms?
Acquire (data) → Extract and Clean → Aggregate and Integrate → Represent → Analyze and Model → Interpret
• Iterate over a large set of data
• Extract some quantities of interest from the data
• Shuffle and sort the data
• Aggregate intermediate results
• Make it look pretty!
Traditional Architecture of Data Mining
• Classical Machine Learning/ Data Mining
• Data is fetched from disk, loaded into main memory, and processed by the CPU
[Figure: a single machine with CPU, memory, and disk]
Compute Intensive vs. Data Intensive Computing
Compute Intensive
• Traditionally designed for optimizing floating point operations (FLOPS)
• Key assumption: Working set data will fit main memory
• Memory bandwidth is usually high (and optimized)
• “Computationally Dense” – meaning all applications will have to rethink how to optimize use of compute resources
Data Intensive
• Has to be optimized for data movement, storage, analysis
• Data ops not FLOPS are important
• Key assumption: Working data set will not fit (may not be even available on the same machine)
• Current architectures are optimized for either media or transactional use
Compute Intensive vs. Data Intensive Computing (2)
Berkin Ozisikyilmaz, Ramanathan Narayanan, Joseph Zambreno, Gokhan Memik, and Alok N. Choudhary. An architectural characterization study of data mining and bioinformatics workloads. In IISWC, pages 61–70, 2006
[Figure panels: integer, floating point]
Key take-home message: Current compute architectures are not optimized for data mining/analytic operations!
Programmers shoulder responsibility in traditional HPC environments!
[Figure: processors P1–P5 shown under two models, message passing and shared memory]
• Issues related to scheduling, data distribution, synchronization, inter-process communication, etc.
• Architectural considerations: SIMD/MIMD, network topology, etc.
• OS issues: mutexes, deadlocks, etc.
Scalable Algorithms for Data Mining
• Data sizes are vast (> 100 Terabytes)
• Even assuming nominal read speed of 35 MB/sec, it can take over a month to just access/read the data!
• How about answering more useful questions?
– Number of categories
– Types of datasets represented, etc.
• Takes even longer!!
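The "over a month" figure can be checked with a quick calculation. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a single disk read sequentially. The 1000-disk case is a hypothetical comparison, not a number from the slides.

```python
# Back-of-the-envelope estimate: time to stream a dataset once from disk.
DATA_BYTES = 100 * 10**12          # 100 terabytes (decimal units, an assumption)
READ_BYTES_PER_SEC = 35 * 10**6    # 35 MB/sec nominal read speed

def read_time_days(data_bytes, bytes_per_sec, readers=1):
    """Days needed to read data_bytes once using `readers` disks in parallel."""
    seconds = data_bytes / (bytes_per_sec * readers)
    return seconds / 86_400        # seconds per day

single = read_time_days(DATA_BYTES, READ_BYTES_PER_SEC)
# Hypothetical: the same data spread evenly over 1000 disks read in parallel.
parallel = read_time_days(DATA_BYTES, READ_BYTES_PER_SEC, readers=1000)
print(f"1 disk: {single:.1f} days; 1000 disks: {parallel * 24:.2f} hours")
```

A single disk needs roughly 33 days, which is why the rest of the lecture spreads both the data and the computation over many machines.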
Challenges
• How to ease access to data?
– (Reasonably) fast and efficient
– (Somewhat) fault-tolerant access
– Hadoop Distributed File System (HDFS) / Google File System
• How to distribute computation?
– Parallel programming is hard!
– Use commodity clusters for processing
– Hadoop MapReduce / Google MapReduce
MapReduce is an elegant paradigm for working with Big Data
What would you change in the underlying architectures?
• Hybrid Memory Cube (HMC)
• Non-volatile Random Access Memory
• Global Address Space (GAS)
Synergistic Challenges in Data Intensive and Exascale Computing (DOE ASCAC report 2013)
Let’s talk about distributing computations
[Figure: a cluster of nodes, each with its own CPU, memory, and disk, connected by network switches]
• Commodity clusters
• What do we do when we have supercomputers?
Programming Model for Data Mining
• Transferring data over network can take time
• Key Ideas:
– Bring the computation close to the data
– Replicate the data multiple times for reliability
• MapReduce:
– Provides a storage infrastructure (@Google: GFS; @class: Hadoop-HDFS)
– Provides a programming model (a parallel paradigm, easier than conventional MPI)
MapReduce Architecture
This is not the first lecture on MapReduce…
• Material is (in part) inspired by:
– William Cohen’s lectures (10601 class at CMU)
– Jure Leskovec (Stanford)
– Aditya Prakash (Virginia Tech)
– Cloudera
– And many many others!
• Materials “redrawn”, “reorganized” and “reworked” to reflect how we use it
1st Key Idea of MapReduce: Bring computations close to the data
• Programmers specify two functions:
– Map(in_k, in_v) → <out_k, inter_v> list
– Reduce(out_k, <inter_v> list) → <out_v> list
• All values with the same key are reduced together
• Let the “runtime” handle everything else:
– Scheduling, I/O, networking, inter-process communication, etc.
Visual interpretation of MapReduce
Input: (k1, v1) (k2, v2) (k3, v3) (k4, v4) (k5, v5) (k6, v6)
map: (a, 1) (b, 1) (c, 4) (b, 6) (a, 5) (d, 3) (b, 5) (c, 4)
Shuffle & Sort: aggregate values by key
a: [1, 5]  b: [1, 6, 5]  c: [4, 4]  d: [3]
reduce: (a, 6) (b, 12) (c, 8) (d, 3)
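The flow in the figure can be imitated in a single process. The following is a minimal sketch, not how Hadoop is implemented: `run_mapreduce`, `identity_map`, and `sum_reduce` are hypothetical names, and the input records are made up so that the mappers emit exactly the pairs shown above.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy single-process stand-in for the runtime: map, shuffle & sort, reduce."""
    # Map phase: every input record may emit any number of (key, value) pairs.
    intermediate = []
    for in_k, in_v in records:
        intermediate.extend(map_fn(in_k, in_v))

    # Shuffle & sort: collect all values that share a key.
    groups = defaultdict(list)
    for out_k, inter_v in intermediate:
        groups[out_k].append(inter_v)

    # Reduce phase: one call per key, in sorted key order.
    results = []
    for out_k in sorted(groups):
        results.extend(reduce_fn(out_k, groups[out_k]))
    return results

# Hypothetical input, chosen so the mappers emit the pairs from the figure.
records = [("k1", [("a", 1), ("b", 1)]), ("k2", [("c", 4), ("b", 6)]),
           ("k3", [("a", 5), ("d", 3)]), ("k4", [("b", 5), ("c", 4)])]

def identity_map(in_k, in_v):
    return in_v                      # each record already holds its pairs

def sum_reduce(out_k, values):
    return [(out_k, sum(values))]    # add up all values for one key

totals = run_mapreduce(records, identity_map, sum_reduce)
print(totals)
```

The programmer writes only `identity_map` and `sum_reduce`. Everything inside `run_mapreduce` is what a real runtime would distribute across machines.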
Other things programmer also considers
• Partition(out_k, numberOfPartitions):
– A simple hash, e.g., hash(out_k) mod n
– Divides the key space for parallel reduce operations
• Combine(out_k, inter_v) → <out_k, inter_v> list:
– Mini reduce function that runs in memory after the map phase
– Reduces network traffic
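A partitioner and a combiner can each be sketched in a few lines. This is an illustration under stated assumptions: `zlib.crc32` stands in for the hash function because Python's built-in `hash()` for strings changes between interpreter runs, and `combine` takes a whole list of pairs from one mapper rather than the per-key signature on the slide.

```python
import zlib
from collections import defaultdict

def partition(out_k, number_of_partitions):
    """Pick the reducer for a key: hash(out_k) mod n (crc32 is an assumption)."""
    return zlib.crc32(out_k.encode("utf-8")) % number_of_partitions

def combine(pairs):
    """Mini reduce on one mapper's output: sum the values that share a key."""
    partial = defaultdict(int)
    for out_k, inter_v in pairs:
        partial[out_k] += inter_v
    return sorted(partial.items())

mapper_output = [("d", 5), ("d", 3), ("b", 1)]   # hypothetical output of one mapper
combined = combine(mapper_output)                # fewer pairs cross the network

# Route each combined pair to one of two reducers.
buckets = defaultdict(list)
for out_k, inter_v in combined:
    buckets[partition(out_k, 2)].append((out_k, inter_v))
print(combined, dict(buckets))
```

Combining is only safe here because addition is associative and commutative. A reduce function without those properties cannot simply be reused as a combiner.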
Now, what does MapReduce look like?
Input: (k1, v1) (k2, v2) (k3, v3) (k4, v4) (k5, v5) (k6, v6)
map: (a, 1) (b, 1) (c, 4) (b, 6) (d, 5) (d, 3) (b, 5) (c, 4)
combine: (d, 5) and (d, 3) from the same mapper become (d, 8)
partition: each key is assigned to a reducer
Shuffle & Sort: aggregate values by key
a: [1]  b: [1, 6, 5]  c: [4, 4]  d: [8]
reduce: (a, 1) (b, 12) (c, 8) (d, 8)
Let’s understand the MapReduce Runtime
• Scheduling:
– Workers are assigned to map and reduce tasks
• Data distribution:
– Move processes to the data
• Synchronization:
– Gather, sort, and shuffle intermediate data
• Fault tolerance:
– Detect worker failures and restart their tasks
• Hadoop Distributed File System
Now for an example: WordCount
• Let’s look at a corpus of documents
• How do we write the algorithm?
Joe likes toast
Jane likes toast with jam
Joe burnt toast
Word Count (2)
def map_fn(doc_id, text):
    # Emit (word, 1) for every word in the document.
    for w in text.split():
        yield (w, 1)

def reduce_fn(term, values):
    # Sum all partial counts collected for one term.
    yield (term, sum(values))
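To see the two functions at work, the sketch below runs word count end to end on the three example documents in a single process. The document ids and the whitespace tokenizer are assumptions, and the in-memory grouping merely stands in for the distributed shuffle and sort.

```python
from collections import defaultdict

corpus = {                        # doc_id -> text (ids are hypothetical)
    "doc1": "Joe likes toast",
    "doc2": "Jane likes toast with jam",
    "doc3": "Joe burnt toast",
}

def map_fn(doc_id, text):
    for w in text.split():        # whitespace tokenizer (an assumption)
        yield (w, 1)

def reduce_fn(term, values):
    yield (term, sum(values))

# Map: one (word, 1) pair per word occurrence.
mapped = [pair for doc_id, text in corpus.items() for pair in map_fn(doc_id, text)]

# Shuffle & sort: group the 1s by word.
grouped = defaultdict(list)
for term, one in mapped:
    grouped[term].append(one)

# Reduce: one call per distinct word.
counts = dict(pair for term in sorted(grouped) for pair in reduce_fn(term, grouped[term]))
print(counts)
```

The map phase emits 11 pairs (one per word occurrence), and the reduce phase collapses them to 7 distinct words.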
WordCount (3): Slow Motion (SloMo) Map
WordCount (4): SloMo Shuffle & Sort
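Input (map output):
(Joe, 1) (likes, 1) (toast, 1) (Jane, 1) (likes, 1) (toast, 1) (with, 1) (jam, 1) (Joe, 1) (burnt, 1) (toast, 1)
Output (grouped by key):
(Joe, 1) (Joe, 1) (Jane, 1) (likes, 1) (likes, 1) (toast, 1) (toast, 1) (toast, 1) (with, 1) (jam, 1) (burnt, 1)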
WordCount (5): SloMo Reduce
Input (grouped by key):
(Joe, 1) (Joe, 1) (Jane, 1) (likes, 1) (likes, 1) (toast, 1) (toast, 1) (toast, 1) (with, 1) (jam, 1) (burnt, 1)
Output:
(Joe, 2) (Jane, 1) (likes, 2) (toast, 3) (with, 1) (jam, 1) (burnt, 1)
A look under the hood: what happens when you invoke WordCount?
[Figure: the User Program forks a Master and several Workers; the Master assigns map and reduce tasks; map workers read input splits 0–4 and write intermediate data locally; reduce workers read it remotely and write OutputFile0 and OutputFile1]
• Input is split into pieces of 64 MB each
• Multiple copies of the program are started across the cluster
• Master task is special
• M map tasks and R reduce tasks
• Idle workers are picked to run
• Worker reads the split it is assigned to
• Intermediate key/value pairs are written to a buffer
• Reduce workers are notified by the master about locations of files
• Reduce workers sort and present results
• Final results are stored with the correct intermediate key
How do commodity clusters use this?
[Figure: compute nodes attached to NAS and SAN storage]
• Main problem: how to handle the data store + compute?
2nd Key Idea of MapReduce: Replicate the data multiple times for reliability
• Hadoop Distributed File System:
– Store data (replicas) on the local disks
– Start running jobs on nodes that have the data
• Why?
– Not enough RAM to hold all the data in main memory
– Disk seeks are slow, but sequential disk throughput is usually high
File storage design
• Files stored as chunks (e.g., 128 MB)
• Reliability through replication:
– Each chunk replicated across 3+ chunkservers
• Single master to coordinate access + metadata:
– Centralized management
• No data caching:
– Little benefit for large data, streaming reads
• Simple API
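Chunking and replication can be illustrated with a small placement function. This is a sketch only: `place_chunks`, the server names, and the round-robin policy are all hypothetical. Real file systems also weigh racks, free space, and load when choosing replicas, and 128 MB is taken here as 128 × 2^20 bytes.

```python
import math

CHUNK_BYTES = 128 * 2**20         # 128 MB chunk size from the slide
REPLICAS = 3                      # each chunk is stored on 3 chunkservers

def place_chunks(file_bytes, servers, chunk_bytes=CHUNK_BYTES, replicas=REPLICAS):
    """Return {chunk_index: [server, ...]} using round-robin placement.

    Round-robin is a stand-in (an assumption), not the policy of any real system.
    """
    if replicas > len(servers):
        raise ValueError("need at least as many servers as replicas")
    n_chunks = math.ceil(file_bytes / chunk_bytes)
    return {
        i: [servers[(i + r) % len(servers)] for r in range(replicas)]
        for i in range(n_chunks)
    }

servers = ["cs0", "cs1", "cs2", "cs3", "cs4"]     # hypothetical chunkservers
layout = place_chunks(1 * 2**30, servers)         # a 1 GiB file
print(len(layout), layout[0], layout[7])
```

The returned mapping is the kind of metadata a single master would keep, while the chunk contents themselves live on the chunkservers.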
How does HDFS work?
• NameNode stores cluster metadata
• Files and directories are represented by inodes
• Inodes store attributes like permissions, etc.
• Data is stored across DataNodes
• Data blocks are replicated across DataNodes
WordCount Code: Main
Other input formats:
• KeyValueInputFormat
• SequenceFileInputFormat
WordCount: Map Function
WordCount: Reduce Function
MapReduce Limits
• Moving data is very expensive:
– Writing and reading are both expensive
• No reduce jobs can start until:
– All map jobs are done
– Data in its partition is shuffled/sorted
Limitations of MapReduce
• No control of the order in which reduce jobs are performed
– Only ordering is that reduce jobs start after map jobs
• Assume that the map and reduce jobs will take place:
– across different machines
– across different memory spaces
Programming Pitfalls
• Don’t make a static variable and assume that other processes can read it
– They can’t
– It appears that they can when run locally, but they can’t
• Do not communicate between mappers or between reducers
– Overhead is high
– You don’t know which mappers/reducers are actually running at any given point
– There’s no easy way to find out what machine they’re running on (because you shouldn’t be looking for them anyway)
Thanks to Shannon Quinn for his pointers!
Designing MapReduce Algorithms
A slightly more complex example: Term Co-occurrence
• Given a large text collection, compute a co-occurrence matrix over all words:
– M = N x N matrix (N = vocabulary size)
– Mij: number of times words i and j co-occur in a sentence
• Why?
– Distributional profiles are a way of measuring semantic distance
– Semantic distance is important for NLP tasks
Example of a large counting problem
• Term co-occurrence matrix computation:
– A large event space (no. of terms)
– A large number of observations (no. of documents)
– Keep track of interesting statistics about events
• Approach:
– Mappers generate partial counts
– Reducers aggregate counts
First Approach: “Pairs”
• Let each mapper take a sentence:
– Generate all co-occurring term pairs
– For all pairs, emit (a, b) → count
• Reducers sum up counts associated with these pairs
• Use combiners to aggregate results
• Advantages:
– Easy to implement, understand
• Disadvantages:
– Upper bound on pairs is unknown
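A minimal sketch of the "pairs" approach follows. It assumes that every ordered pair of distinct word positions in a sentence counts as one co-occurrence, which is one of several reasonable definitions. The function names and sample sentences are illustrative, and the dictionary grouping stands in for the shuffle and sort.

```python
from collections import defaultdict
from itertools import permutations

def pairs_map(sentence):
    """Emit ((a, b), 1) for every ordered pair of distinct word positions."""
    words = sentence.split()
    for a, b in permutations(words, 2):
        yield ((a, b), 1)

def pairs_reduce(pair, counts):
    """Sum the partial counts for one (a, b) pair."""
    return (pair, sum(counts))

sentences = ["Joe likes toast", "Joe burnt toast"]

grouped = defaultdict(list)                   # stands in for shuffle & sort
for s in sentences:
    for pair, one in pairs_map(s):
        grouped[pair].append(one)

cooccurrence = dict(pairs_reduce(p, c) for p, c in grouped.items())
print(cooccurrence[("Joe", "toast")])
```

Each key is a single small tuple, which keeps the code simple, but a sentence with n words emits n(n-1) intermediate pairs that all have to be shuffled.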
Second approach: “Stripes”
• Group together pairs into an associative array
• Each mapper takes a sentence:
– Generate all co-occurring term pairs
– For each term a, emit a → {b: count_b, c: count_c, …}
• Reducers perform element-wise sum of associative arrays
Example: the pairs (a, b) → 1, (a, c) → 2, (a, d) → 5, (a, e) → 3, (a, f) → 2 become the single stripe
a → {b: 1, c: 2, d: 5, e: 3, f: 2}
Element-wise sum in the reducer:
a → {b: 1, d: 5, e: 3}
a → {b: 1, c: 2, d: 5, f: 2}
-------------------------------------------------
a → {b: 2, c: 2, d: 10, e: 3, f: 2}
• Advantages:
– Far less sorting and shuffling of key-value pairs
– Can make better use of combiners
• Disadvantages:
– More difficult to implement
– Underlying objects are “larger” than typical intermediate results
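The "stripes" approach can be sketched with `collections.Counter`, whose `update` method adds counts key by key and so gives the element-wise sum directly. The function names are illustrative, and the co-occurrence definition (every pair of distinct positions in a sentence) is an assumption.

```python
from collections import Counter, defaultdict

def stripes_map(sentence):
    """Emit (a, {b: count_b, ...}) once per distinct term a in the sentence."""
    words = sentence.split()
    stripes = defaultdict(Counter)
    for i, a in enumerate(words):
        for j, b in enumerate(words):
            if i != j:                # skip pairing a position with itself
                stripes[a][b] += 1
    return list(stripes.items())

def stripes_reduce(term, stripes):
    """Element-wise sum of all associative arrays received for one term."""
    total = Counter()
    for stripe in stripes:
        total.update(stripe)          # adds counts key by key
    return term, dict(total)

# The two stripes for term "a" from the example above.
term, merged = stripes_reduce("a", [
    {"b": 1, "d": 5, "e": 3},
    {"b": 1, "c": 2, "d": 5, "f": 2},
])
print(term, merged)
```

Compared with pairs, each mapper emits one record per distinct term instead of one per pair, at the cost of building a larger object in memory.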
How do the runtimes compare?
Summary and To Dos
Summary
• MapReduce is a big data programming paradigm
– map jobs
– reduce jobs
• Careful consideration of data movement is required
Notes and What to expect next?
• Please form project teams as soon as possible
– 2 is good; 3 is okay.
– More team members means higher expectations!
• Assignment 1 is due today!
• Additional notes on Hadoop are up on the website
• Next class:
– Probability and statistics review (basics)
– Naïve Bayes and Logistic Regression on Hadoop
THANK YOU!!!