MapReduce & BigTable - Zoo | Yale University · MapReduce & BigTable ... • Inspired from map and...
Lecture Roadmap
• Cloud Computing Overview
• Challenges in the Clouds
• Distributed File Systems: GFS
• Data Process & Analysis: MapReduce
• Database: BigTable
Last Lecture
Google File System (GFS) - Lec 8
MapReduce - Lec 9
BigTable - Lec 9
Google Applications, e.g., Gmail and Google Maps
Today’s Lecture
Google File System (GFS) - Lec 8
MapReduce - Lec 9
BigTable - Lec 9
Google Applications, e.g., Gmail and Google Maps
Google File System [SOSP’03]
GFS
Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5
Metadata (on the master): chunks 1, 2, 3
Data (stored on the chunkservers)
Google File System [SOSP’03]
The design insights:
- Metadata is used for indexing chunks
- Huge files -> 64 MB for each chunk -> fewer chunks
- Reduces client-master interaction and metadata size
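The "fewer chunks -> less metadata" point can be checked with a quick back-of-envelope calculation (the 64 MB chunk size is from the slide; the 200 TB figure reuses the word-count example later in this lecture, and the 4 KB block size is an assumed typical local-file-system block for comparison):

```python
# Back-of-envelope sketch: larger chunks mean far fewer metadata entries
# the single GFS master must keep in memory.
FILE_DATA = 200 * 10**12     # 200 TB of data (assumed, from the word-count example)
CHUNK_64MB = 64 * 2**20      # GFS chunk size (from the slide)
BLOCK_4KB = 4 * 2**10        # a typical local-file-system block size (assumed)

chunks_64mb = FILE_DATA // CHUNK_64MB
blocks_4kb = FILE_DATA // BLOCK_4KB

print(chunks_64mb)                 # ~3 million chunk entries for 200 TB
print(blocks_4kb // chunks_64mb)   # 64 MB chunks need ~16,000x fewer entries
```

With only a few million entries, the master can keep all chunk metadata in memory, which is what makes the single-master design workable.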
GFS
Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5
Metadata (chunk replicas): Chunkserver1: 1, 2; Chunkserver2: 1, 3; Chunkserver3: 1, 2; Chunkserver4: 3, 2; Chunkserver5: 3
Each of the three chunks is stored on three chunkservers
Data
Google File System [SOSP’03]
The design insights:
- Replicas are used to ensure availability
- The master can choose the nearest replica for the client
- Read and append-only workloads make it easy to manage replicas
Google File System [SOSP’03]: the read path
GFS
Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5
1. The client sends a read request to the master, asking for the IP address and the ID of each chunk
2. The master replies with, for each chunk, the chunkserver’s IP address and the chunk ID:
<Chunkserver1’s IP, Chunk1> <Chunkserver1’s IP, Chunk2> <Chunkserver2’s IP, Chunk3>
3. The client then reads the chunk data directly from those chunkservers
Why does GFS try to avoid random writes?
After GFS, what do we need to do next?
GFS
Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5
Processing and Analyzing Data: Very Important!
Processing and Analyzing Data
• A toy problem: the word count
- We have 10 billion documents
- Average document size is 20KB => 10 billion docs = 200TB
• Our solution:

for each document d {
    for each word w in d {
        word_count[w]++;
    }
}

• Running this on one machine takes approximately one month.
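The "one month" figure follows from simple arithmetic (the ~100 MB/s sequential disk throughput is an assumed typical figure, not from the slide):

```python
# Rough single-machine estimate: how long just to READ 200 TB once,
# ignoring all CPU work, at an assumed sequential disk throughput.
data_bytes = 200 * 10**12        # 200 TB, from the slide
read_bytes_per_s = 100 * 10**6   # ~100 MB/s sequential read (assumed)

seconds = data_bytes / read_bytes_per_s
days = seconds / 86400
print(round(days, 1))  # ~23 days: on the order of a month, before any processing
```

So even with zero computation, a single machine is I/O-bound for weeks, which is the motivation for spreading both the data (GFS) and the computation (MapReduce) across many machines.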
MapReduce Programming Model
• Inspired by the map and reduce operations commonly used in functional programming languages like LISP
• Users implement an interface of two primary methods:
- 1. Map: <key1, value1> -> <key2, value2>
- 2. Reduce: <key2, value2[ ]> -> <value3>
Input -> Map(k,v) --> (k’,v’) -> Group (k’,v’)s by k’ -> Reduce(k’,v’[]) --> v’’ -> Output
• Map, a pure function written by the user, takes an input key/value pair and produces a set of intermediate key/value pairs, e.g., <doc_id, doc_content> -> <word, count>
• After the map phase, all the intermediate values for a given intermediate key are combined into a list and given to a reducer, which aggregates/merges the result.
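The Map/Group/Reduce contract above can be sketched as a single-process toy driver (the function and variable names here are illustrative, not from the lecture; a real MapReduce runs the phases on many machines):

```python
from collections import defaultdict

def map_reduce(inputs, map_fn, reduce_fn):
    """Toy single-process sketch of the model: map_fn(k, v) yields
    (k2, v2) pairs; pairs are grouped by k2; reduce_fn(k2, [v2, ...])
    returns the final value for that key."""
    groups = defaultdict(list)
    for k, v in inputs:
        for k2, v2 in map_fn(k, v):    # Map: <key1, value1> -> <key2, value2>
            groups[k2].append(v2)      # Group (k', v') pairs by k'
    # Reduce: <key2, value2[]> -> <value3>
    return {k2: reduce_fn(k2, vs) for k2, vs in sorted(groups.items())}

# Example use: build a tiny inverted index (word -> sorted list of doc ids)
docs = [(1, "the cat"), (2, "the dog")]
index = map_reduce(docs,
                   lambda doc_id, text: ((w, doc_id) for w in text.split()),
                   lambda word, ids: sorted(ids))
print(index)  # {'cat': [1], 'dog': [2], 'the': [1, 2]}
```

The user supplies only the two pure functions; the framework owns the grouping step in the middle, which is exactly what lets it parallelize the rest.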
MapReduce [OSDI’04]
GFS: Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5
• GFS is responsible for storing data for MapReduce
- Data is split into chunks and distributed across nodes
- Each chunk is replicated
- Offers redundant storage for massive amounts of data
MapReduce runs on top of GFS
Heard of Hadoop? HDFS + Hadoop MapReduce
MapReduce [OSDI’04]
• Two core components
- JobTracker: assigns tasks to different workers
- TaskTracker: executes the map and reduce programs
MapReduce layer: JobTracker TaskTracker TaskTracker TaskTracker TaskTracker TaskTracker
GFS layer: Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5
MapReduce [OSDI’04]
MapReduce layer: JobTracker TaskTracker TaskTracker TaskTracker TaskTracker TaskTracker
GFS layer: Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5
Document A1, Document A2, and Document A3 are stored as chunks in GFS and processed by the TaskTrackers
Why we care about word count
• Word count is challenging over massive amounts of data
• The fundamentals of statistics are often aggregate functions
• Most aggregation functions have a distributive nature
• MapReduce breaks complex tasks into smaller pieces which can be executed in parallel
Map Phase (On a Worker)
Count the # of occurrences of each word in a large amount of input data

Map(input_key, input_value) {
    foreach word w in input_value:
        emit(w, 1);
}

• Input to the Mapper
(3414, ‘the cat sat on the mat’)
(3437, ‘the aardvark sat on the sofa’)
• Output from the Mapper
(‘the’, 1), (‘cat’, 1), (‘sat’, 1), (‘on’, 1), (‘the’, 1), (‘mat’, 1), (‘the’, 1), (‘aardvark’, 1), (‘sat’, 1), (‘on’, 1), (‘the’, 1), (‘sofa’, 1)
Reducer (On a Worker)
• After the Map, all the intermediate values for a given intermediate key are combined together into a list
Add up all the values associated with each intermediate key:

Reduce(output_key, intermediate_vals) {
    set count = 0;
    foreach v in intermediate_vals:
        count += v;
    emit(output_key, count);
}

• Input to the Reducer (the grouped Mapper output)
(‘aardvark’, [1]), (‘cat’, [1]), (‘mat’, [1]), (‘on’, [1, 1]), (‘sat’, [1, 1]), (‘sofa’, [1]), (‘the’, [1, 1, 1, 1])
• Output from the Reducer
(‘the’, 4), (‘sat’, 2), (‘on’, 2), (‘sofa’, 1), (‘mat’, 1), (‘cat’, 1), (‘aardvark’, 1)
Grouping/Shuffling
• Input of the grouping (Mapper Output)
(‘the’, 1), (‘cat’, 1), (‘sat’, 1), (‘on’, 1), (‘the’, 1), (‘mat’, 1), (‘the’, 1), (‘aardvark’, 1), (‘sat’, 1), (‘on’, 1), (‘the’, 1), (‘sofa’, 1)
• After the Map, all the intermediate values for a given intermediate key are combined together into a list (Reducer Input)
(‘aardvark’, [1]), (‘cat’, [1]), (‘mat’, [1]), (‘on’, [1, 1]), (‘sat’, [1, 1]), (‘sofa’, [1]), (‘the’, [1, 1, 1, 1])
Map + Reduce
Mapper Input:
the cat sat on the mat
the aardvark sat on the sofa
Mapping:
(‘the’, 1), (‘cat’, 1), (‘sat’, 1), (‘on’, 1), (‘the’, 1), (‘mat’, 1), (‘the’, 1), (‘aardvark’, 1), (‘sat’, 1), (‘on’, 1), (‘the’, 1), (‘sofa’, 1)
Grouping:
(‘aardvark’, [1]), (‘cat’, [1]), (‘mat’, [1]), (‘on’, [1, 1]), (‘sat’, [1, 1]), (‘sofa’, [1]), (‘the’, [1, 1, 1, 1])
Reducing:
(‘aardvark’, 1), (‘cat’, 1), (‘mat’, 1), (‘on’, 2), (‘sat’, 2), (‘sofa’, 1), (‘the’, 4)
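The full word-count pipeline above runs end-to-end as a small program (a single-process sketch; the phase names mirror the slides, while a real cluster would run each phase on many workers):

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit (word, 1) for every word in every document
    return [(w, 1) for _, text in records for w in text.split()]

def group_phase(pairs):
    # Grouping/Shuffling: collect all values for each intermediate key
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(groups):
    # Reduce: sum the 1s emitted for each word
    return {k: sum(vs) for k, vs in groups.items()}

records = [(3414, 'the cat sat on the mat'),
           (3437, 'the aardvark sat on the sofa')]
counts = reduce_phase(group_phase(map_phase(records)))
print(counts['the'], counts['sat'], counts['aardvark'])  # 4 2 1
```

Because addition is distributive, the reduce step can also be applied to partial sums from different workers, which is why word count parallelizes so cleanly.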
Let’s use MapReduce to help Google Maps
We want to compute the average temperature for each state (e.g., over a map of India)
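A sketch of this job (the input record format, state names, and temperatures below are all assumed for illustration). Note that an average, unlike a sum, is not directly distributive, so the mapper emits raw temperatures keyed by state and the reducer computes the mean per key:

```python
from collections import defaultdict

def map_fn(_, value):
    # Assumed record format: "state,temperature"
    state, temp = value.split(',')
    yield state, float(temp)

def reduce_fn(state, temps):
    # Average all temperatures reported for this state
    yield state, sum(temps) / len(temps)

# Hypothetical sensor records (station_id, "state,temperature")
records = [(1, 'Goa,30'), (2, 'Goa,34'), (3, 'Punjab,18'), (4, 'Punjab,22')]

groups = defaultdict(list)
for k, v in records:
    for state, temp in map_fn(k, v):
        groups[state].append(temp)

averages = {s: out for s, temps in groups.items() for _, out in reduce_fn(s, temps)}
print(averages)  # {'Goa': 32.0, 'Punjab': 20.0}
```

To make the job fully distributive (so partial results can be combined), a common variant emits (sum, count) pairs instead of raw values and merges them in the reducer.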
Lecture Roadmap
• Cloud Computing Overview
• Challenges in the Clouds
• Distributed File Systems: GFS
• Data Process & Analysis: MapReduce
• Database: BigTable
Today’s Lecture
Google File System (GFS) - Lec 8
MapReduce - Lec 9
BigTable - Lec 9
Google Applications, e.g., Gmail and Google Maps
Motivation for BigTable
• Lots of (semi-)structured data at Google
- URLs: content, crawl metadata, links, anchors [Search Engine]
- Per-user data: user preference settings, queries [Hangout]
- Geographic locations: physical entities and satellite image data [Google Maps and Google Earth]
• Scale is large:
- Billions of URLs, many versions per page (~20KB/version)
- Hundreds of millions of users, thousands of queries/sec
- 100TB+ of satellite image data
Why not just use a commercial DB?
• Scale is too large for most commercial databases
• Even if it weren’t, the cost would be very high
- Building internally means the system can be applied across many projects for low incremental cost
• Low-level storage optimizations help performance significantly
• Fun and challenging to build large-scale DB systems :)
BigTable [OSDI’06]
• Distributed multi-level map:
- With an interesting data model
• Fault-tolerant, persistent
• Scalable:
- Thousands of servers
- Terabytes of in-memory data
- Petabytes of disk-based data
- Millions of reads/writes per second, efficient scans
BigTable Status in 2006
• Design/initial implementation started at the beginning of 2004
• Currently ~100 BigTable cells
• Production use or active development for many projects
- Google Print
- My Search History
- Crawling/indexing pipeline
- Google Maps/Google Earth
• The largest BigTable cell manages ~200TB of data spread over several thousand machines (larger cells planned)
Building Blocks for BigTable
• BigTable is built from these building blocks:
- Google File System (GFS): stores persistent state and data
- Scheduler: schedules jobs involved in BigTable serving
- Lock service: master election, location bootstrapping
- MapReduce: often used to process BigTable data
(Recall the difference between a database and a file system.)
Two questions:
1. What is the data model?
2. How to implement it?
BigTable’s Data Model Design
• BigTable is NOT a relational database
• BigTable appears as a large table
- A BigTable is a sparse, distributed, persistent multidimensional sorted map

Webtable example (rows are sorted; columns are “language” and “content”):
row com.aaa: “language” = EN, “content” = <!DOCTYPE html PUBLIC ...
row com.cnn.www: “language” = EN, “content” = <!DOCTYPE html PUBLIC ...
row com.weather: “language” = EN, “content” = <!DOCTYPE html PUBLIC ...
... : ...
BigTable’s Data Model Design
• (row, column, timestamp) identifies a cell’s content
Webtable example: in row com.cnn.www, the “content” column holds multiple timestamped versions (t2, t3, t4, t6, t11, ...) of <!DOCTYPE html ..., while “language” = EN
Rows
Webtable example: sorted rows com.aaa, com.cnn.www, com.weather, ...; columns “language” and “content”; each “content” cell keeps timestamped versions (t2, t3, ...)
• Row name is an arbitrary string and is used as the key
- Access to data in a row is atomic
- Row creation is implicit upon storing data
• Rows are ordered lexicographically
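Lexicographic ordering is why the Webtable rows above use reversed hostnames like com.cnn.www: pages from the same domain sort next to each other, so a scan over one domain touches a contiguous row range (and therefore few tablets). A tiny demonstration:

```python
# Reversed hostnames as row keys keep pages from the same domain adjacent
# in the sorted row order (the example URLs are illustrative).
urls = ['maps.google.com', 'www.cnn.com', 'news.google.com']
row_keys = sorted('.'.join(reversed(u.split('.'))) for u in urls)
print(row_keys)  # ['com.cnn.www', 'com.google.maps', 'com.google.news']
```

With plain hostnames the two google.com pages would sort far apart (under 'm' and 'n'); reversed, they are neighbors and can live in the same tablet.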
Column
Webtable example: row com.cnn.www has columns “language” (= EN), “content” (timestamped versions of <!DOCTYPE html ...), “anchor:cnnsi.com” (= CNN), and “anchor:mylook.ca” (= CNN.com)
• Columns have a two-level name structure: family:optional_qualifier
- e.g., in “anchor:cnnsi.com”, “anchor” is the family and “cnnsi.com” is the qualifier
• Column family
- Unit of access control
- Has associated type information
• Qualifier gives unbounded columns
- Additional level of indexing, if desired
• Reading the anchor columns:
- www.cnn.com is referenced by Sports Illustrated (cnnsi.com) and mylook (mylook.ca)
- The value of (“com.cnn.www”, “anchor:cnnsi.com”) is “CNN”, the reference text from cnnsi.com
Tablet and Table
• A table starts as one tablet
• As it grows, it is split into multiple tablets
- Approximate size: 100-200 MB per tablet by default
Webtable example: after splitting, Tablet 1 holds rows com.aaa, com.cnn.www, com.weather, ... and Tablet 2 holds rows com.tech, com.wikipedia, com.zoom, each row with its “language” (EN) and “content” (<!DOCTYPE html PUBLIC ...) columns
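Since rows are sorted, splitting a table is just cutting the sorted row sequence into contiguous ranges once a tablet exceeds its size budget. A toy sketch (the per-row sizes and the 200-unit threshold are made-up illustrative numbers; real tablets split near 100-200 MB, per the slide):

```python
def split_into_tablets(sorted_rows, row_sizes, max_size):
    """Walk the sorted rows and start a new tablet whenever the
    current one would exceed the size threshold."""
    tablets, current, size = [], [], 0
    for row in sorted_rows:
        if current and size + row_sizes[row] > max_size:
            tablets.append(current)   # close the current tablet
            current, size = [], 0
        current.append(row)
        size += row_sizes[row]
    if current:
        tablets.append(current)
    return tablets

rows = ['com.aaa', 'com.cnn.www', 'com.weather',
        'com.tech', 'com.wikipedia', 'com.zoom']
rows.sort()  # tablets hold contiguous row ranges, so rows must stay sorted
sizes = {r: 60 for r in rows}  # pretend each row's data is 60 units
tablets = split_into_tablets(rows, sizes, max_size=200)
print(tablets)
```

Each resulting tablet covers a contiguous, non-overlapping row range, which is exactly what the location lookup described next relies on.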
Building Blocks for BigTable
• Recall the building blocks: GFS, Scheduler, Lock service, MapReduce
Having seen the data model, the next question: 2. How to implement it?
BigTable Architecture
BigTable client library
Tablet Servers (each serving multiple tablets)
BigTable Master
Chubby Lock Service, Google File System, Cluster Scheduling
The First Thing: Locating Tablets
• Since tablets move around from server to server, given a row, how do clients find the right machine?
- We need to find the tablet whose row range covers the target row
• One solution: use the BigTable master
- A central server would almost certainly be a bottleneck in a large system
• Instead: store special tables containing tablet location information in the BigTable cell itself
The First Thing: Locating Tablets
• 3-level hierarchical lookup scheme for tablets
- Location is the IP:port of the relevant server
- 1st level: bootstrapped from the lock service (Chubby), points to the owner of META0
- 2nd level: uses META0 data to find the owner of the appropriate META1 tablet
- 3rd level: the META1 table holds the locations of the tablets of all other tables
Example: search for row key “Hi”
- Chubby: pointer to META0 location
- META0 table: Key: A (1.1.1.1), Key: F (1.1.1.2), Key: I (1.1.1.3), Key: S (1.1.1.4), ... -> 1.1.1.2
- META1 table (at 1.1.1.2): Key: F (1.1.1.7), Key: G (1.1.1.8), Key: H (1.1.1.9), Key: Ha (1.1.1.100), Key: Hc (1.1.1.101), ..., Key: Hi (1.1.1.104) -> 1.1.1.9
- The actual tablet in table T is then read from the returned tablet server
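Each lookup level can be modeled as a sorted list of (start_key, location) entries, where the client picks the last entry whose key is <= the target row. This is one plausible reading of the slide's range semantics; the keys and addresses below are hypothetical, not the slide's:

```python
import bisect

def lookup(level, row_key):
    """Pick the entry owning row_key: the last (start_key, location)
    pair whose start_key is <= row_key (assumed range semantics)."""
    keys = [k for k, _ in level]
    i = bisect.bisect_right(keys, row_key) - 1
    return level[i][1]

# Hypothetical META0 / META1 contents (all keys and locations made up):
meta0 = [('A', 'meta1@10.0.0.2'), ('M', 'meta1@10.0.0.3')]
meta1 = [('A', 'tablet@10.0.0.7'), ('H', 'tablet@10.0.0.9')]

meta1_loc = lookup(meta0, 'Hi')   # 1st hop: which META1 server owns 'Hi'?
tablet_loc = lookup(meta1, 'Hi')  # 2nd hop: which tablet server owns 'Hi'?
print(meta1_loc, tablet_loc)
```

Clients can cache these locations, so in the common case they skip the hierarchy entirely and contact the tablet server directly.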
BigTable Architecture: the lookup in action
BigTable client library -> 1.1.1.2 -> 1.1.1.9
Tablet Servers (each serving multiple tablets), BigTable Master, Chubby Lock Service, Google File System, Cluster Scheduling
The client library first obtains the address of the META1 server (1.1.1.2), then the address of the tablet server holding the target row (1.1.1.9), and then reads from that tablet server directly, without going through the BigTable master.
BigTable’s APIs
• Metadata operations
- Create/delete tables, create/delete column families, change metadata
• Writes: single-row, atomic
- Set(): write cells in a row
- DeleteCells(): delete cells in a row
- DeleteRow(): delete all cells in a row
• Reads: Scanner abstraction
- Read arbitrary cells in a BigTable table
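A tiny in-memory stand-in makes the API surface concrete (the method names Set, DeleteCells, and DeleteRow come from the slide, but the signatures and behavior here are assumptions for illustration, not BigTable's real client API):

```python
class ToyTable:
    """In-memory sketch of the single-row write API plus a scanner."""
    def __init__(self):
        self.rows = {}  # row -> {column: value}

    def Set(self, row, column, value):
        # Single-row atomic write (trivially atomic in one process)
        self.rows.setdefault(row, {})[column] = value

    def DeleteCells(self, row, columns):
        # Delete the named cells in a row
        for c in columns:
            self.rows.get(row, {}).pop(c, None)

    def DeleteRow(self, row):
        # Delete all cells in a row
        self.rows.pop(row, None)

    def Scan(self):
        # Scanner abstraction: yield rows in sorted (lexicographic) order
        for row in sorted(self.rows):
            yield row, self.rows[row]

t = ToyTable()
t.Set('com.cnn.www', 'language', 'EN')
t.Set('com.cnn.www', 'anchor:cnnsi.com', 'CNN')
t.Set('com.aaa', 'language', 'EN')
print([row for row, _ in t.Scan()])  # ['com.aaa', 'com.cnn.www']
```

Note what is absent: no multi-row transactions and no joins; atomicity stops at the row boundary, which keeps the implementation scalable.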
BigTable’s Write Path
Client sends Put/Delete to a Tablet Server
1. Write to Log: the mutation is appended to the log (Log Server, backed by the file system)
2. Write to memstore: the mutation is then applied to the in-memory memstore
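The log-then-memstore ordering is a write-ahead log: a mutation is durable before it is visible, so a crashed tablet server can rebuild its memstore by replaying the log. A minimal sketch (the file name and record format are assumptions for illustration):

```python
import json
import os
import tempfile

class ToyTabletServer:
    """Sketch of the write path: append to a log, then apply to memstore."""
    def __init__(self, log_path):
        self.log_path = log_path
        self.memstore = {}  # (row, column) -> value

    def put(self, row, column, value):
        # 1. Append the mutation to the log so it survives a crash
        with open(self.log_path, 'a') as log:
            log.write(json.dumps(['put', row, column, value]) + '\n')
        # 2. Only then apply it to the in-memory memstore
        self.memstore[(row, column)] = value

    def recover(self):
        # Replay the log to rebuild the memstore after a restart
        self.memstore = {}
        with open(self.log_path) as log:
            for line in log:
                op, row, column, value = json.loads(line)
                self.memstore[(row, column)] = value

log_file = os.path.join(tempfile.mkdtemp(), 'wal.log')
server = ToyTabletServer(log_file)
server.put('com.cnn.www', 'language', 'EN')
server.memstore = {}   # simulate a crash: in-memory state is lost
server.recover()
print(server.memstore)  # {('com.cnn.www', 'language'): 'EN'}
```

Writing the log sequentially (an append) rather than updating files in place also matches GFS's strength: appends are cheap, random writes are not.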