Hadoop/MapReduce as a Platform for Data-Intensive Computing
Jimmy Lin
University of Maryland (currently at Twitter)
Friday, December 2, 2011
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United StatesSee http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details
Our World: Large Data
Source: Wikipedia (Hard disk drive)
How much data?
9 PB of user data +>50 TB/day (11/2011)
processes 20 PB a day (2008)
36 PB of user data + 80-90 TB/day (6/2010)
Wayback Machine: 3 PB + 100 TB/month (3/2009)
LHC: ~15 PB a year(at full capacity)
LSST: 6-10 PB a year (~2015)
640K ought to be enough for anybody.
150 PB on 50k+ servers running 15k apps
S3: 449B objects, peak 290k request/second (7/2011)
Source: Wikipedia (Everest)
Why large data? Science, engineering, commerce
Science: emergence of the 4th Paradigm, data-intensive e-Science
Maximilien Brice, © CERN
Engineering The unreasonable effectiveness of data
Count and normalize!
Source: Wikipedia (Three Gorges Dam)
Commerce Know thy customers
Data → Insights → Competitive advantages
Source: Wikipedia (Shinjuku, Tokyo)
cheap commodity clusters
+ simple, distributed programming models
= data-intensive computing for the masses!
(or utility computing)
Source: flickr (turtlemom_nancy/2046347762)
Data nirvana requires the right infrastructure: store, manage, organize, analyze, distribute, visualize, …
How large data? Why large data?
Divide et impera
Chop the problem into smaller parts; combine the partial results
Figure: “Work” is split into w1, w2, w3; partial results r1, r2, r3 are combined into the “Result”
Source: Wikipedia (Forest)
Parallel computing is hard!
Message Passing: P1 P2 P3 P4 P5 exchange messages
Shared Memory: P1 P2 P3 P4 P5 read and write a common memory
Different programming models
Different programming constructs: mutexes, condition variables, barriers, …; masters/slaves, producers/consumers, work queues, …
Fundamental issues: scheduling, data distribution, synchronization, inter-process communication, robustness, fault tolerance, …
Common problems: livelock, deadlock, data starvation, priority inversion, …; dining philosophers, sleeping barbers, cigarette smokers, …
Architectural issues: Flynn’s taxonomy (SIMD, MIMD, etc.), network topology, bisection bandwidth, UMA vs. NUMA, cache coherence
The reality: programmer shoulders the burden of managing concurrency…(I want my students developing new algorithms, not debugging race conditions)
Source: Ricardo Guimarães Herrmann
Source: MIT Open Courseware
Source: NY Times (6/14/2006)
The datacenter is the computer!
MapReduce
MapReduce Functional programming meets distributed processing
Independent per-record processing in parallel Aggregation of intermediate results to generate final output
Programmers specify two functions:
map (k, v) → <k’, v’>*
reduce (k’, v’) → <k’, v’>*
All values with the same key are sent to the same reducer
The execution framework handles everything else: scheduling; data management and transport; synchronization; errors and faults
Recall “count and normalize”? Perfect!
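The two-function model above can be sketched in a minimal single-process simulation (all names here are illustrative, not Hadoop's API); `map_fn` and `reduce_fn` implement word count, the canonical "count" half of count-and-normalize:

```python
from collections import defaultdict

def map_fn(key, value):
    # Per-record processing: emit (word, 1) for every word in the line.
    for word in value.split():
        yield word, 1

def reduce_fn(key, values):
    # All values with the same key arrive together; aggregate them.
    yield key, sum(values)

def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase: apply map_fn to each input record independently.
    groups = defaultdict(list)
    for k, v in records:
        for k2, v2 in map_fn(k, v):
            groups[k2].append(v2)       # shuffle and sort: group values by key
    # Reduce phase: one reduce_fn call per distinct intermediate key.
    output = {}
    for k2 in sorted(groups):
        for k3, v3 in reduce_fn(k2, groups[k2]):
            output[k3] = v3
    return output
```

Dividing each count by the total then recovers relative frequencies, i.e. "count and normalize".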
Figure: four map tasks consume input pairs (k1 v1 … k6 v6) and emit intermediate pairs a 1, b 2; c 3, c 6; a 5, c 2; b 7, c 8
Shuffle and Sort: aggregate values by keys (a → 1 5; b → 2 7; c → 2 3 6 8)
Three reduce tasks emit final pairs r1 s1, r2 s2, r3 s3
Figure: MapReduce execution overview, adapted from (Dean and Ghemawat, OSDI 2004)
(1) The user program submits the job; (2) the master schedules map and reduce tasks onto workers; (3) map workers read input splits (split 0 … split 4); (4) map output is written to local disk as intermediate files; (5) reduce workers remote-read the intermediate files; (6) reducers write the output files (output file 0, output file 1)
Phases: input files → map phase → intermediate files (on local disk) → reduce phase → output files
MapReduce Implementations Google has a proprietary implementation in C++
Bindings in Java, Python
Hadoop is an open-source implementation in Java
Development led by Yahoo, used in production
Now an Apache project
Rapidly expanding software ecosystem
Source: Wikipedia (Rosetta Stone)
Statistical Machine Translation (Chris Dyer)
Translation Model
Language Model
Decoder
Foreign Input Sentence
maria no daba una bofetada a la bruja verde
English Output Sentence
mary did not slap the green witch
Word Alignment
Statistical Machine Translation
(vi, i saw)(la mesa pequeña, the small table)…
Phrase Extraction
i saw the small table
vi la mesa pequeña
Parallel Sentences
he sat at the tablethe service was good
Target-Language Text
Training Data
$\hat{e}_1^I = \arg\max_{e_1^I} P(e_1^I \mid f_1^J) = \arg\max_{e_1^I} P(e_1^I)\,P(f_1^J \mid e_1^I)$
Figure: translation options for “Maria no dio una bofetada a la bruja verde”
Candidate English tiles: Mary; not / did not / no / did not give; give; slap / a slap; to the / to / the; by; green witch / the witch
Translation as a Tiling Problem
Mary | did not | slap | the | green witch
$\hat{e}_1^I = \arg\max_{e_1^I} P(e_1^I \mid f_1^J) = \arg\max_{e_1^I} P(e_1^I)\,P(f_1^J \mid e_1^I)$
The Data Bottleneck
“Every time I fire a linguist, the performance of our … system goes up.”- Fred Jelinek
(Recap: the SMT training pipeline, with word alignment and phrase extraction over parallel sentences, and language modeling over target-language text.)
We’ve built MapReduce implementations of these two components! (2008)
HMM Alignment: Giza vs. MapReduce
Figures: HMM alignment running times on a single-core commodity server (Giza vs. the MapReduce implementation) and on a 38-processor cluster, with 1/38 of the single-core running time shown as a reference
What’s the point? The optimally-parallelized version doesn’t exist!
MapReduce occupies a sweet spot in the design space for a large class of problems: Fast… in terms of running time + scaling characteristics Easy… in terms of programming effort Cheap… in terms of hardware costs
Source: Wikipedia (DNA)
Sequence Assembly (Michael Schatz)
Strangely-Formatted Manuscript Dickens: A Tale of Two Cities
Text written on a long spool
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
… With Duplicates Dickens: A Tale of Two Cities
“Backup” on four more copies
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
Shredded Book Reconstruction
Dickens accidentally shreds the manuscript
How can he reconstruct the text? 5 copies × 138,656 words / 5 words per fragment = ~138k fragments
The short fragments from every copy are mixed together
Some fragments are identical
It was the best of of times, it was thetimes, it was the worst age of wisdom, it was the age of foolishness, …
It was the best worst of times, it wasof times, it was the the age of wisdom, it was the age of foolishness,
It was the the worst of times, it best of times, it was was the age of wisdom, it was the age of foolishness, …
It was was the worst of times,the best of times, it it was the age of wisdom, it was the age of foolishness, …
It it was the worst ofwas the best of times, times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
Greedy Assembly
It was the best of
of times, it was the
best of times, it was
times, it was the worst
was the best of times,
the best of times, it
of times, it was the
times, it was the age
It was the best of
of times, it was the
best of times, it was
times, it was the worst
was the best of times,
the best of times, it
it was the worst of
of times, it was the
times, it was the age
it was the age of
was the age of wisdom,
the age of wisdom, it
age of wisdom, it was
of wisdom, it was the
it was the age of
was the age of foolishness,
the worst of times, it
The repeated sequences make the correct reconstruction ambiguous!
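The greedy strategy on this slide can be sketched as follows (function names are illustrative); ties between equal overlaps are broken arbitrarily, which is exactly where repeats introduce ambiguity:

```python
def overlap(a, b):
    """Longest suffix of word-list a that equals a prefix of word-list b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

def greedy_assemble(fragments):
    """Greedy assembly: repeatedly merge the pair of fragments with the
    maximal overlap until no pair overlaps."""
    frags = [f.split() for f in set(fragments)]  # identical fragments collapse
    while len(frags) > 1:
        n, i, j = max((overlap(a, b), i, j)
                      for i, a in enumerate(frags)
                      for j, b in enumerate(frags) if i != j)
        if n == 0:
            break  # nothing left to merge
        merged = frags[i] + frags[j][n:]
        frags = [f for k, f in enumerate(frags) if k not in (i, j)] + [merged]
    return [" ".join(f) for f in frags]
```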
Alternative: model sequence reconstruction as a graph problem…
de Bruijn Graph Construction
Dk = (V, E)
V = all length-k subfragments (k < l)
E = directed edges between consecutive subfragments (nodes overlap by k-1 words)
Locally constructed graph reveals the global structure Overlaps between sequences implicitly computed
Original fragment: It was the best of
Nodes: It was the best / was the best of
Directed edge: It was the best → was the best of
de Bruijn, 1946; Idury and Waterman, 1995; Pevzner, Tang, and Waterman, 2001
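A minimal sketch of the construction over word fragments (the function name is illustrative):

```python
def de_bruijn_edges(fragments, k):
    """Build the de Bruijn graph D_k = (V, E): nodes are the length-k word
    windows of each fragment; a directed edge joins consecutive windows,
    which overlap by k-1 words."""
    edges = set()
    for fragment in fragments:
        words = fragment.split()
        nodes = [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]
        for a, b in zip(nodes, nodes[1:]):
            edges.add((a, b))  # overlaps computed implicitly by windowing
    return edges
```

Because every copy of a repeated phrase produces the same windows, identical fragments collapse onto the same nodes and edges, revealing the global structure from locally constructed pieces.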
de Bruijn Graph Assembly
the age of foolishness
It was the best
best of times, it
was the best of
the best of times,
of times, it was
times, it was the
it was the worst
was the worst of
worst of times, it
the worst of times,
it was the age
was the age of
the age of wisdom,
age of wisdom, it
of wisdom, it was
wisdom, it was the
A unique Eulerian tour of the graph reconstructs the original text
If a unique tour does not exist, try to simplify the graph as much as possible
de Bruijn Graph Assembly
the age of foolishness
It was the best of times, it
of times, it was the
it was the worst of times, it
it was the age of
the age of wisdom, it was the
A unique Eulerian tour of the graph reconstructs the original text
If a unique tour does not exist, try to simplify the graph as much as possible
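Hierholzer's algorithm finds such a tour in time linear in the number of edges; a compact sketch, assuming an Eulerian path exists:

```python
from collections import defaultdict, Counter

def eulerian_path(edges):
    """Hierholzer's algorithm for a directed Eulerian path.
    Assumes the edge list admits one (at most one node whose
    out-degree exceeds its in-degree by one)."""
    succ = defaultdict(list)
    out_deg, in_deg = Counter(), Counter()
    for a, b in edges:
        succ[a].append(b)
        out_deg[a] += 1
        in_deg[b] += 1
    # Start where out-degree exceeds in-degree; otherwise anywhere.
    start = next((v for v in succ if out_deg[v] - in_deg[v] == 1),
                 next(iter(succ)))
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if succ[v]:
            stack.append(succ[v].pop())  # follow an unused edge
        else:
            path.append(stack.pop())     # dead end: emit node
    return path[::-1]
```

Overlapping the nodes along the path (each shares k-1 words with its successor) recovers the original text.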
GATGCTTACTATGCGGGCCCC
CGGTCTAATGCTTACTATGC
GCTTACTATGCGGGCCCCTT
AATGCTTACTATGCGGGCCCCTT
TAATGCTTACTATGC
AATGCTTAGCTATGCGGGC
AATGCTTACTATGCGGGCCCCTT
AATGCTTACTATGCGGGCCCCTT
CGGTCTAGATGCTTACTATGC
AATGCTTACTATGCGGGCCCCTT
CGGTCTAATGCTTAGCTATGC
ATGCTTACTATGCGGGCCCCTT
?
Subject genome
Sequencer
Reads
Human genome: 3 Gbp. A few billion short reads (~100 GB compressed data)
Short Read Assembly
Genome assembly as finding an Eulerian tour of the de Bruijn graph
Human genome: >3 B nodes, >10 B edges
Present short read assemblers require tremendous computation:
Velvet (serial): >2 TB of RAM
ABySS (MPI): 168 cores × ~96 hours
SOAPdenovo (pthreads): 40 cores × 40 hours, >140 GB RAM
Can we get by with MapReduce on commodity clusters? Horizontal scale-out in the cloud!
(Zerbino & Birney, 2008)(Simpson et al., 2009)(Li et al., 2010)
Graph Compression
Challenges:
Nodes stored on different machines
Nodes can only access direct neighbors
Randomized solution:
Randomly assign H / T to each compressible node
Compress H → T links
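One round of the randomized scheme might look like the following local simulation (names and the chain representation are illustrative; in the MapReduce setting each node sees only its direct neighbors):

```python
import random

def compress_round(succ, rng):
    """One round of randomized compression over a chain of nodes.
    `succ` maps each node to its successor (None at the chain's end).
    Each node independently flips Heads/Tails; a Head whose successor
    is a Tail absorbs it, compressing the H -> T link."""
    coin = {v: rng.choice("HT") for v in succ}
    merged = dict(succ)
    for v, nxt in succ.items():
        if nxt is not None and coin[v] == "H" and coin.get(nxt) == "T":
            merged[v] = succ[nxt]   # skip over the absorbed Tail
            merged.pop(nxt, None)
    return merged
```

Each adjacent pair merges with probability 1/4 per round, so repeated rounds shrink simple paths to single nodes in an expected logarithmic number of rounds, matching the savings shown on the following slides.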
Fast Graph Compression
Initial Graph: 42 nodes
Fast Graph Compression
Round 1: 26 nodes (38% savings)
Fast Graph Compression
Round 2: 15 nodes (64% savings)
Fast Graph Compression
Round 3: 6 nodes (86% savings)
Fast Graph Compression
Round 4: 5 nodes (88% savings)
Contrail De Novo Assembly of the Human Genome
African male NA18507 (SRA000271, Bentley et al., 2008) Input: 3.5B 36bp reads, 210bp insert (~40x coverage)
Assembly progress (nodes / N Max):
Initial: >7 B nodes, 27 bp
Compressed: >1 B nodes, 303 bp
Clip Tips: 5.0 M nodes, 14,007 bp
Pop Bubbles: 4.2 M nodes, 20,594 bp
(Diagrams: clipping short dead-end tips such as B’, and popping “bubbles”, near-identical parallel paths B / B’ between nodes A and C.)
Aside: How to do this better…
MapReduce is a poor abstraction for graphs:
No separation of computation from graph structure
Poor locality: unnecessary data movement
Bulk synchronous parallel (BSP) as a better model: Google’s Pregel, and its open-source clone Giraph
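A toy vertex-centric superstep loop in the Pregel style (all names illustrative; real Pregel/Giraph partitions vertices across workers and adds combiners, aggregators, and fault tolerance), shown computing connected components by minimum-label propagation:

```python
def bsp_run(state, neighbors, compute, max_supersteps=30):
    """Run vertex-centric supersteps: each active vertex reads its inbox,
    updates its value, and sends messages delivered next superstep."""
    inbox = {v: [] for v in state}
    for step in range(max_supersteps):
        outbox = {v: [] for v in state}
        changed = False
        for v in state:
            if step == 0 or inbox[v]:          # vertex is active
                new_val, messages = compute(v, state[v], inbox[v], neighbors[v])
                changed = changed or new_val != state[v]
                state[v] = new_val
                for u, msg in messages:
                    outbox[u].append(msg)
        inbox = outbox                          # barrier: deliver messages
        if not changed and not any(outbox.values()):
            break                               # everything has halted
    return state

def min_label(v, value, msgs, nbrs):
    """Keep the minimum label seen; forward it on the first superstep
    or whenever it improves."""
    new = min([value] + msgs)
    if new < value or not msgs:
        return new, [(u, new) for u in nbrs]
    return new, []
```

Unlike MapReduce, the graph structure stays in place across supersteps; only messages move, which is the locality win the slide points at.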
Interesting (open?) question: how many hammers and how many nails?
Source: flickr (stuckincustoms/4051325193)
Source: Wikipedia (Tide)
Commoditization of large-data processing capabilities allows us to ride the rising tide!
Science, engineering, commerce
Source: flickr (60in3/2338247189)
Best thing since sliced bread?
Distributed programming models:
MapReduce is the first, definitely not the only, and probably not even the best
Alternatives: Pig, Dryad/DryadLINQ, Pregel, etc.
It’s all about the right level of abstraction The von Neumann architecture won’t cut it anymore
Separating the what from the how:
Developer specifies the computation that needs to be performed
Execution framework handles actual execution and hides system-level details from developers
Source: NY Times (6/14/2006)
The datacenter is the computer!
What are the appropriate abstractions for the datacenter computer?
What new abstractions do applications demand?
What exciting applications do new abstractions enable?
How do we achieve true impact and change the world?
Education Teaching students to “think at web scale”
Rise of the data scientist: necessary skill set?
Source: flickr (infidelic/3008675635)
Questions?