
CPSC 426/526: MapReduce & BigTable

Ennan Zhai, Computer Science Department

Yale University

Lecture Roadmap

• Cloud Computing Overview
• Challenges in the Clouds
• Distributed File Systems: GFS
• Data Processing & Analysis: MapReduce
• Database: BigTable

Last Lecture

Google Applications, e.g., Gmail and Google Maps
MapReduce - Lec 9 | BigTable - Lec 9
Google File System (GFS) - Lec 8

Today’s Lecture

Google Applications, e.g., Gmail and Google Maps
MapReduce - Lec 9 | BigTable - Lec 9
Google File System (GFS) - Lec 8

Recall: How does GFS work?

Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

Google File System [SOSP’03]

GFS

Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

Google File System [SOSP’03]

GFS

Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

Data

Google File System [SOSP’03]

GFS

Metadata 1 2 3

Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

Google File System [SOSP’03]

GFS

Metadata 1 2 3

Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

Google File System [SOSP’03]
The design insights:
- Metadata is used for indexing chunks
- Huge files -> 64 MB for each chunk -> fewer chunks
- Reduce client-master interaction and metadata size
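
As a rough illustration of why big chunks shrink the master's metadata, here is a back-of-envelope sketch in Python; the 200 TB total and the ~64 bytes of per-chunk metadata are assumptions for illustration, not figures from the slides.

    # Back-of-envelope: how much chunk metadata does the master hold?
    # Assumptions (illustrative only): 200 TB of file data, ~64 bytes of
    # master metadata per chunk.
    TB = 10**12
    data_bytes = 200 * TB
    metadata_per_chunk = 64

    for chunk_mb in (1, 64):
        chunks = data_bytes // (chunk_mb * 10**6)
        metadata_gb = chunks * metadata_per_chunk / 10**9
        print(f"{chunk_mb:>2} MB chunks -> {chunks:,} chunks, ~{metadata_gb:.1f} GB of metadata")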

GFS

Metadata 1 2 3

Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

Data

Google File System [SOSP’03]

GFS

Metadata 1 2 1 3 1 2 3 2 3

Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

Data

Google File System [SOSP’03]

GFS

Metadata 1 2 1 3 1 2 3 2 3

Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

Data

Google File System [SOSP’03]
The design insights:
- Replicas are used to ensure availability
- Master can choose the nearest replicas for the client
- Read and append-only access makes it easy to manage replicas

GFS

Metadata 1 2 1 3 1 2 3 2 3

Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

Google File System [SOSP’03]

read

GFS

Metadata 1 2 1 3 1 2 3 2 3

Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

Google File System [SOSP’03]

read:
- IP address for each chunk
- the ID for each chunk

GFS

Metadata 1 2 1 3 1 2 3 2 3

Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

Google File System [SOSP’03]

read: <Chunkserver1’s IP, Chunk1>, <Chunkserver1’s IP, Chunk2>, <Chunkserver2’s IP, Chunk3>

GFS

Metadata 1 2 1 3 1 2 3 2 3

Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

Google File System [SOSP’03]

read

GFS

Metadata 1 2 1 3 1 2 3 2 3

Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

Google File System [SOSP’03]

read
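
Putting the read flow above into a tiny sketch: the client asks the master only for <chunkserver IP, chunk ID> pairs, then fetches the data directly from the chunkservers. This is a minimal illustration with made-up class and method names, not the actual GFS client interface.

    class Master:
        def __init__(self):
            # Metadata only: filename -> [(chunkserver_ip, chunk_id), ...]
            self.metadata = {"bigfile": [("10.0.0.1", "chunk1"),
                                         ("10.0.0.1", "chunk2"),
                                         ("10.0.0.2", "chunk3")]}

    class Chunkserver:
        def __init__(self, chunks):
            self.chunks = chunks                      # chunk_id -> bytes

        def read_chunk(self, chunk_id):
            return self.chunks[chunk_id]

    def read_file(master, chunkservers, filename):
        # 1. One small metadata lookup at the master.
        locations = master.metadata[filename]
        # 2. Bulk data flows directly between client and chunkservers.
        return b"".join(chunkservers[ip].read_chunk(cid) for ip, cid in locations)

    servers = {"10.0.0.1": Chunkserver({"chunk1": b"aa", "chunk2": b"bb"}),
               "10.0.0.2": Chunkserver({"chunk3": b"cc"})}
    print(read_file(Master(), servers, "bigfile"))    # b'aabbcc'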

Why does GFS try to avoid random writes?

Put GFS in a Datacenter

Put GFS in a Datacenter

Each rack has a master

Put GFS in a Datacenter

Each rack has a master

After GFS, what do we need to do next?

GFS

Metadata 1 2 1 3 1 2 3 2 3

Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

GFS

Metadata 1 2 1 3 1 2 3 2 3

Processing and Analyzing Data

Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

Very Important!


• A toy problem: word count
- We have 10 billion documents
- Average document size is 20 KB => 10 billion docs = 200 TB

• Our solution:

Processing and Analyzing Data

• A toy problem: word count
- We have 10 billion documents
- Average document size is 20 KB => 10 billion docs = 200 TB

• Our solution:

for each document d:
    for each word w in d:
        word_count[w]++

Processing and Analyzing Data

• A toy problem: word count
- We have 10 billion documents
- Average document size is 20 KB => 10 billion docs = 200 TB

• Our solution:

for each document d:
    for each word w in d:
        word_count[w]++

Approximately one month on a single machine.
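
A quick sanity check of that estimate, assuming a single machine scanning the data sequentially at roughly 100 MB/s (the scan rate is an assumption, not a number from the slides):

    data_bytes = 200 * 10**12          # 200 TB of documents
    scan_rate = 100 * 10**6            # ~100 MB/s sequential read (assumed)
    seconds = data_bytes / scan_rate
    print(f"{seconds / 86400:.0f} days")   # ~23 days, i.e. roughly one month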

Processing and Analyzing Data

• Inspired by the map and reduce operations commonly used in functional programming languages like LISP

• Users implement an interface with two primary methods:
- 1. Map: <key1, value1> -> <key2, value2>
- 2. Reduce: <key2, value2> -> <value3>

MapReduce Programming Model

MapReduce Programming Model

Input -> Map(k,v) --> (k’,v’) -> Group (k’,v’)s by k’ -> Reduce(k’,v’[]) --> v’’ -> Output

• Inspired by the map and reduce operations commonly used in functional programming languages like LISP

• Users implement an interface with two primary methods:
- 1. Map: <key1, value1> -> <key2, value2>
- 2. Reduce: <key2, value2[ ]> -> <value3>

MapReduce Programming Model

Input -> Map(k,v) --> (k’,v’) -> Group (k’,v’)s by k’ -> Reduce(k’,v’[]) --> v’’ -> Output

• Map, a pure function written by the user, takes an input key/value pair (e.g., a pair like <doc_id, doc_content>) and produces a set of intermediate key/value pairs

• Inspired by the map and reduce operations commonly used in functional programming languages like LISP

• Users implement an interface with two primary methods:
- 1. Map: <key1, value1> -> <key2, value2>
- 2. Reduce: <key2, value2[ ]> -> <value3>

MapReduce Programming Model

Input -> Map(k,v) --> (k’,v’) -> Group (k’,v’)s by k’ -> Reduce(k’,v’[]) --> v’’ -> Output

• After the map phase, all the intermediate values for a given intermediate key are combined into a list and given to a reducer for aggregating/merging the result.
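
To make the two-method interface concrete, here is a minimal single-process sketch of the Map -> Group -> Reduce pipeline in Python. It only illustrates the programming model on this slide; a real MapReduce runtime distributes the map and reduce tasks across many machines.

    from collections import defaultdict

    def run_mapreduce(inputs, map_fn, reduce_fn):
        """inputs: iterable of (key1, value1); map_fn emits (key2, value2) pairs;
        reduce_fn turns (key2, [value2, ...]) into a final value."""
        # Map phase: each input pair may emit many intermediate pairs.
        intermediate = []
        for k1, v1 in inputs:
            intermediate.extend(map_fn(k1, v1))
        # Group phase: collect all values that share an intermediate key.
        groups = defaultdict(list)
        for k2, v2 in intermediate:
            groups[k2].append(v2)
        # Reduce phase: one reduce call per intermediate key.
        return {k2: reduce_fn(k2, vals) for k2, vals in groups.items()}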

GFS

Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

MapReduce [OSDI’04]

• GFS is responsible for storing data for MapReduce
- Data is split into chunks and distributed across nodes
- Each chunk is replicated
- Offers redundant storage for massive amounts of data

GFS

Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

MapReduce [OSDI’04]

MapReduce

GFS

Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

MapReduce [OSDI’04]

MapReduce

Heard of Hadoop? HDFS + Hadoop MapReduce

GFS

MapReduce [OSDI’04]

Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

MapReduce: JobTracker TaskTracker TaskTracker TaskTracker TaskTracker TaskTracker

GFS

Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

MapReduce [OSDI’04]
• Two core components:
- JobTracker: assigns tasks to different workers
- TaskTracker: executes map and reduce programs

MapReduce: JobTracker TaskTracker TaskTracker TaskTracker TaskTracker TaskTracker

GFS

Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

MapReduce [OSDI’04]
• Two core components:
- JobTracker: assigns tasks to different workers
- TaskTracker: executes map and reduce programs

MapReduce: JobTracker TaskTracker TaskTracker TaskTracker TaskTracker TaskTracker

MapReduce

GFS

JobTracker TaskTracker TaskTracker TaskTracker TaskTracker TaskTracker

Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

MapReduce [OSDI’04]

Document A1, Document A2, Document A3

MapReduce

GFS

JobTracker TaskTracker TaskTracker TaskTracker TaskTracker TaskTracker

Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

MapReduce [OSDI’04]

Document A1, Document A2, Document A3

MapReduce

GFS

JobTracker TaskTracker TaskTracker TaskTracker TaskTracker TaskTracker

Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

Document A1, Document A2, Document A3

MapReduce

GFS

JobTracker TaskTracker TaskTracker TaskTracker TaskTracker TaskTracker

Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

Document A1, Document A2, Document A3, TaskTracker

Word Count Example

• Word count is challenging over massive amounts of data
• Fundamental statistics are often aggregate functions
• Most aggregation functions are distributive in nature
• MapReduce breaks complex tasks into smaller pieces that can be executed in parallel

Why we care about word count

Count the # of occurrences of each word in a large amount of input data

Map(input_key, input_value) {
    foreach word w in input_value:
        emit(w, 1);
}

Map Phase (On a Worker)

Count the # of occurrences of each word in a large amount of input data

Map(input_key, input_value) {
    foreach word w in input_value:
        emit(w, 1);
}

(3414, ‘the cat sat on the mat’)
(3437, ‘the aardvark sat on the sofa’)

• Input to the Mapper

Map Phase (On a Worker)

Count the # of occurrences of each word in a large amount of input data

Map(input_key, input_value) {
    foreach word w in input_value:
        emit(w, 1);
}

(3414, ‘the cat sat on the mat’)
(3437, ‘the aardvark sat on the sofa’)

• Output from the Mapper

(‘the’, 1), (‘cat’, 1), (‘sat’, 1), (‘on’, 1),(‘the’, 1), (‘mat’, 1), (‘the’, 1), (‘aardvark’, 1), (‘sat’, 1), (‘on’, 1), (‘the’, 1), (‘sofa’, 1)

• Input to the Mapper

Map Phase (On a Worker)

Count the # of occurrences of each word in a large amount of input data

Map(input_key, input_value) {
    foreach word w in input_value:
        emit(w, 1);
}

(3414, ‘the cat sat on the mat’)
(3437, ‘the aardvark sat on the sofa’)

• Output from the Mapper

(‘the’, 1), (‘cat’, 1), (‘sat’, 1), (‘on’, 1),(‘the’, 1), (‘mat’, 1), (‘the’, 1), (‘aardvark’, 1), (‘sat’, 1), (‘on’, 1), (‘the’, 1), (‘sofa’, 1)

• Input to the Mapper

Map Phase (On a Worker)

• After the Map, all the intermediate values for a given intermediate key are combined together into a list

Reducer (On a Worker)

• After the Map, all the intermediate values for a given intermediate key are combined together into a list

Add up all the values associated with each intermediate key:

Reduce(output_key, intermediate_vals) {
    set count = 0;
    foreach v in intermediate_vals:
        count += v;
    emit(output_key, count);
}

Reducer (On a Worker)

• Input of the Reducer

(‘the’, 1), (‘cat’, 1), (‘sat’, 1), (‘on’, 1),(‘the’, 1), (‘mat’, 1), (‘the’, 1), (‘aardvark’, 1), (‘sat’, 1), (‘on’, 1), (‘the’, 1), (‘sofa’, 1)

Reducer (On a Worker)

Add up all the values associated with each intermediate key:

Reduce(output_key, intermediate_vals) {
    set count = 0;
    foreach v in intermediate_vals:
        count += v;
    emit(output_key, count);
}

• Input of the Reducer

(‘the’, 1), (‘cat’, 1), (‘sat’, 1), (‘on’, 1),(‘the’, 1), (‘mat’, 1), (‘the’, 1), (‘aardvark’, 1), (‘sat’, 1), (‘on’, 1), (‘the’, 1), (‘sofa’, 1)

Reducer (On a Worker)

Add up all the values associated with each intermediate key:

Reduce(output_key, intermediate_vals) {
    set count = 0;
    foreach v in intermediate_vals:
        count += v;
    emit(output_key, count);
}

• Output from the Reducer

(‘the’, 4), (‘sat’, 2), (‘on’, 2), (‘sofa’, 1), (‘mat’, 1), (‘cat’, 1), (‘aardvark’, 1)

• Input of the Reducer

(‘the’, 1), (‘cat’, 1), (‘sat’, 1), (‘on’, 1),(‘the’, 1), (‘mat’, 1), (‘the’, 1), (‘aardvark’, 1), (‘sat’, 1), (‘on’, 1), (‘the’, 1), (‘sofa’, 1)

Reducer (On a Worker)

Add up all the values associated with each intermediate key:

Reduce(output_key, intermediate_vals) {
    set count = 0;
    foreach v in intermediate_vals:
        count += v;
    emit(output_key, count);
}

• Output from the Reducer

(‘the’, 4), (‘sat’, 2), (‘on’, 2), (‘sofa’, 1), (‘mat’, 1), (‘cat’, 1), (‘aardvark’, 1)

Grouping + Reducer
• Input of the grouping

(‘the’, 1), (‘cat’, 1), (‘sat’, 1), (‘on’, 1),(‘the’, 1), (‘mat’, 1), (‘the’, 1), (‘aardvark’, 1), (‘sat’, 1), (‘on’, 1), (‘the’, 1), (‘sofa’, 1)

• After the Map, all the intermediate values for a given intermediate key are combined together into a list

(‘the’, 1), (‘cat’, 1), (‘sat’, 1), (‘on’, 1),(‘the’, 1), (‘mat’, 1), (‘the’, 1), (‘aardvark’, 1), (‘sat’, 1), (‘on’, 1), (‘the’, 1), (‘sofa’, 1)

Mapper Input:
the cat sat on the mat
the aardvark sat on the sofa

Mapping ->

Mapper Output:
(‘the’, 1), (‘cat’, 1), (‘sat’, 1), (‘on’, 1), (‘the’, 1), (‘mat’, 1), (‘the’, 1), (‘aardvark’, 1), (‘sat’, 1), (‘on’, 1), (‘the’, 1), (‘sofa’, 1)

Grouping/Shuffling ->

Reducer Input:
aardvark, 1
cat, 1
mat, 1
on, [1, 1]
sat, [1, 1]
sofa, 1
the, [1, 1, 1, 1]

Reducing ->

Final Output (Map + Reduce):
aardvark, 1
cat, 1
mat, 1
on, 2
sat, 2
sofa, 1
the, 4

High-Level Picture for MR
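
The same picture as a tiny self-contained Python script, mirroring the Map and Reduce pseudocode above on the two example sentences:

    from collections import defaultdict

    docs = {3414: "the cat sat on the mat",
            3437: "the aardvark sat on the sofa"}

    # Map: emit (word, 1) for every word of every document.
    pairs = [(w, 1) for text in docs.values() for w in text.split()]

    # Group/shuffle: gather the 1s emitted for each word.
    grouped = defaultdict(list)
    for word, one in pairs:
        grouped[word].append(one)

    # Reduce: sum each word's list.
    counts = {word: sum(ones) for word, ones in grouped.items()}
    print(counts)   # {'the': 4, 'cat': 1, 'sat': 2, 'on': 2, 'mat': 1, 'aardvark': 1, 'sofa': 1}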

Let’s use MapReduce to help Google Map (a map of India)

We want to compute the average temperature for each state

Let’s use MapReduce to help Google Map
MP: 75, CG: 72, OR: 72
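
A hedged sketch of the per-state average in the same Map/Group/Reduce style; the readings below are made up for illustration (the slide only shows a few values such as MP: 75, CG: 72, OR: 72):

    from collections import defaultdict

    # (station_id, (state, temperature)) input records -- illustrative values.
    readings = [(1, ("MP", 75)), (2, ("MP", 77)),
                (3, ("CG", 72)),
                (4, ("OR", 72)), (5, ("OR", 70))]

    # Map: re-key each temperature by its state.
    pairs = [(state, temp) for _, (state, temp) in readings]

    # Group: collect every temperature reported for a state.
    grouped = defaultdict(list)
    for state, temp in pairs:
        grouped[state].append(temp)

    # Reduce: average each state's list.
    averages = {state: sum(t) / len(t) for state, t in grouped.items()}
    print(averages)   # {'MP': 76.0, 'CG': 72.0, 'OR': 71.0}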

Lecture Roadmap

• Cloud Computing Overview
• Challenges in the Clouds
• Distributed File Systems: GFS
• Data Processing & Analysis: MapReduce
• Database: BigTable

Today’s Lecture

Google Applications, e.g., Gmail and Google Maps
MapReduce - Lec 9 | BigTable - Lec 9
Google File System (GFS) - Lec 8

Motivation for BigTable
• Lots of (semi-)structured data at Google
- URLs: content, crawl metadata, links, anchors [Search Engine]
- Per-user data: user preference settings, queries [Hangout]
- Geographic locations: physical entities and satellite image data [Google Maps and Google Earth]
• Scale is large:
- Billions of URLs, many versions/page (~20 KB/version)
- Hundreds of millions of users, thousands of queries/sec
- 100 TB+ of satellite image data

Why not just use a commercial DB?
• Scale is too large for most commercial databases
• Even if it weren't, the cost would be very high
- Building internally means the system can be applied across many projects for low incremental cost
• Low-level storage optimizations help performance significantly
• Fun and challenging to build large-scale DB systems :)

BigTable [OSDI’06]

• Distributed multi-level map:
- With an interesting data model
• Fault-tolerant, persistent
• Scalable:
- Thousands of servers
- Terabytes of in-memory data
- Petabytes of disk-based data
- Millions of reads/writes per second, efficient scans

BigTable Status in 2006
• Design/initial implementation started beginning of 2004
• Currently ~100 BigTable cells
• Production use or active development for many projects:
- Google Print
- My Search History
- Crawling/indexing pipeline
- Google Maps/Google Earth
• Largest BigTable cell manages ~200 TB of data spread over several thousand machines (larger cells planned)

Building Blocks for BigTable

• BigTable uses these building blocks:
- Google File System (GFS): stores persistent state and data
- Scheduler: schedules jobs involved in BigTable serving
- Lock service: master election, location bootstrapping
- MapReduce: often used to process BigTable data

Remember: what is the difference between a database and a file system?

Building Blocks for BigTable

• BigTable uses these building blocks:
- Google File System (GFS): stores persistent state and data
- Scheduler: schedules jobs involved in BigTable serving
- Lock service: master election, location bootstrapping
- MapReduce: often used to process BigTable data

1. What is the data model?
2. How to implement it?

BigTable’s Data Model Design

• BigTable is NOT a relational database
• BigTable appears as a large table
- A BigTable is a sparse, distributed, persistent multidimensional sorted map


Webtable example (rows sorted lexicographically):

rows          | "language" | "content"
com.aaa       | EN         | <!DOCTYPE html PUBLIC ...
com.cnn.www   | EN         | <!DOCTYPE html PUBLIC ...
com.weather   | EN         | <!DOCTYPE html PUBLIC ...
...           | ...        | <!DOCTYPE html PUBLIC ...

BigTable’s Data Model Design

• (row, column, timestamp) -> cell content

[Webtable example figure: each (row, column) cell stores multiple timestamped versions of its content, e.g., versions at t2, t3, t4, t6, t11]

Rows

[Webtable example figure, as above]

• Row name is an arbitrary string and is used as key
- Access to data in a row is atomic
- Row creation is implicit upon storing data
• Rows ordered lexicographically

Column

[Webtable example figure, now including anchor columns:]

rows          | "language" | "content"           | "anchor:cnnsi.com" | "anchor:mylook.ca"
com.aaa       | EN         | <!DOCTYPE html .... |                    |
com.cnn.www   | EN         | <!DOCTYPE html .... | CNN                | CNN.com
com.weather   | EN         | <!DOCTYPE html .... |                    |
...           | ...        | ...                 |                    |

• Columns have a two-level name structure: family:optional_qualifier
- e.g., in "anchor:cnnsi.com", the family is "anchor" and the qualifier is "cnnsi.com"
• Column family
- Unit of access control
- Has associated type information
• Qualifier gives unbounded columns
- Additional level of indexing, if desired
• Anchor example:
- www.cnn.com is referenced by Sports Illustrated (cnnsi.com) and mylook (mylook.ca)
- The value ("com.cnn.www", "anchor:cnnsi.com") is "CNN", the reference text from cnnsi.com
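
One way to picture this sparse multidimensional sorted map is as a nested dictionary keyed by row, then column, then timestamp. This is only an illustration of the data model (with made-up timestamps), not how BigTable actually stores data:

    # table[row][column][timestamp] -> value; absent cells simply don't exist (sparse).
    webtable = {
        "com.aaa": {
            "language": {2: "EN"},
            "content":  {2: "<!DOCTYPE html ..."},
        },
        "com.cnn.www": {
            "language":          {2: "EN"},
            "content":           {3: "<!DOCTYPE html ...", 6: "<!DOCTYPE html ..."},
            "anchor:cnnsi.com":  {9: "CNN"},
            "anchor:mylook.ca":  {8: "CNN.com"},
        },
    }

    def read_latest(table, row, column):
        # Each cell can hold several timestamped versions;
        # a plain read returns the newest one.
        versions = table[row][column]
        return versions[max(versions)]

    print(read_latest(webtable, "com.cnn.www", "anchor:cnnsi.com"))   # CNN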

Tablet and Table

• A table starts as one tablet
• As it grows, it is split into multiple tablets
- Approximate size: 100-200 MB per tablet by default

[Webtable example figure: the table's rows (com.aaa, com.cnn.www, com.weather, com.tech, com.wikipedia, com.zoom, ...) and their "language"/"content" cells are split across Tablet 1 and Tablet 2]

Building Blocks for BigTable

• BigTable uses these building blocks:
- Google File System (GFS): stores persistent state and data
- Scheduler: schedules jobs involved in BigTable serving
- Lock service: master election, location bootstrapping
- MapReduce: often used to process BigTable data

1. What is the data model?
2. How to implement it?

BigTable Architecture

[BigTable architecture figure:]
- BigTable client library
- BigTable Master
- Tablet Servers, each serving a set of Tablets
- Chubby (lock service), Google File System, Cluster Scheduling

The First Thing: Locating Tablets
• Since tablets move around from server to server, given a row, how do clients find the right machine?
- We need to find the tablet whose row range covers the target row
• One solution: use the BigTable master
- A central server would almost certainly be a bottleneck in a large system
• Instead: store special tables containing tablet location information in the BigTable cell itself

The First Thing: Locating Tablets
• 3-level hierarchical lookup scheme for tablets
- Location is IP:port of relevant server
- 1st level: bootstrapped from lock service, points to owner of META0
- 2nd level: uses META0 data to find owner of appropriate META1 tablet
- 3rd level: META1 table holds locations of tablets of all other tables

[Lookup chain figure: Chubby -> pointer to META0 location -> META0 table -> META1 table -> actual tablet in table T]

[Lookup example figure: search row key "Hi"]
- Chubby -> pointer to META0 location
- META0 table: Key: A (1.1.1.1), Key: F (1.1.1.2), Key: I (1.1.1.3), Key: S (1.1.1.4), ...
- META1 table: Key: F (1.1.1.7), Key: G (1.1.1.8), Key: H (1.1.1.9), Key: Ha (1.1.1.100), Key: Hc (1.1.1.101), Key: Hi (1.1.1.104), ...
- Actual tablet in table T
- Servers highlighted in the figure: 1.1.1.2, then 1.1.1.9
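
A minimal sketch of walking that hierarchy with sorted start keys, built from the example entries above; the bisection logic and the toy tables are illustrative, not BigTable's client code:

    import bisect

    # Each level maps the start key of a tablet's row range to the server holding it.
    META0 = {"A": "1.1.1.1", "F": "1.1.1.2", "I": "1.1.1.3", "S": "1.1.1.4"}
    META1 = {"F": "1.1.1.7", "G": "1.1.1.8", "H": "1.1.1.9",
             "Ha": "1.1.1.100", "Hc": "1.1.1.101", "Hi": "1.1.1.104"}

    def owner(table, row_key):
        # The tablet covering row_key is the one with the greatest start key <= row_key.
        keys = sorted(table)
        return table[keys[bisect.bisect_right(keys, row_key) - 1]]

    def locate(row_key):
        meta0 = "META0 owner from Chubby"     # 1st level: lock service bootstrap
        meta1 = owner(META0, row_key)         # 2nd level: META0 -> right META1 tablet
        tablet = owner(META1, row_key)        # 3rd level: META1 -> actual tablet server
        return meta0, meta1, tablet

    print(locate("Hi"))   # ('META0 owner from Chubby', '1.1.1.2', '1.1.1.104') with this toy data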

[BigTable architecture figure repeated, animating the lookup from the example above: the client library walks the location hierarchy and then contacts the servers 1.1.1.2 and 1.1.1.9 shown in the figure]

Metadata Operations
- Create/delete tables
- Create/delete column families
- Change metadata

BigTable’s APIs
• Metadata operations
- Create/delete tables, column families, change metadata
• Writes: single-row, atomic
- Set(): write cells in a row
- DeleteCells(): delete cells in a row
- DeleteRow(): delete all cells in a row
• Reads: scanner abstraction
- Read arbitrary cells in a BigTable table

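
A toy sketch of that API surface; the method names Set, DeleteCells, and DeleteRow come from the slide, but the signatures and this in-memory table are illustrative, not Google's client library:

    class ToyBigTable:
        """In-memory stand-in that mimics single-row writes and a scanner."""
        def __init__(self):
            self.rows = {}                                  # row -> {column: value}

        def Set(self, row, column, value):                  # write cells in a row
            self.rows.setdefault(row, {})[column] = value

        def DeleteCells(self, row, columns):                # delete cells in a row
            for c in columns:
                self.rows.get(row, {}).pop(c, None)

        def DeleteRow(self, row):                           # delete all cells in a row
            self.rows.pop(row, None)

        def Scan(self, start_row, end_row):                 # scanner abstraction
            for row in sorted(self.rows):
                if start_row <= row < end_row:
                    yield row, self.rows[row]

    t = ToyBigTable()
    t.Set("com.cnn.www", "language", "EN")
    t.Set("com.cnn.www", "anchor:cnnsi.com", "CNN")
    for row, cells in t.Scan("com.a", "com.z"):
        print(row, cells)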

BigTable’s Write Path

[Write path figure:]
Client -- Put/Delete --> Tablet Server
Tablet Server -- write to log --> Log Server -- append --> File System
Tablet Server -- write to memstore --> Memstore
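
A minimal sketch of that path, assuming the usual log-then-memstore ordering (append the mutation to a file-system-backed log first, then apply it in memory); the class and method names are illustrative:

    class ToyTabletServer:
        def __init__(self, log_file):
            self.log = log_file          # append-only log on the file system
            self.memstore = {}           # in-memory (row, column) -> value

        def put(self, row, column, value):
            # 1. Append to the log so the write survives a crash.
            self.log.write(f"PUT {row} {column} {value}\n")
            self.log.flush()
            # 2. Apply to the memstore, where subsequent reads will see it.
            self.memstore[(row, column)] = value

    with open("tablet.log", "a") as log:
        server = ToyTabletServer(log)
        server.put("com.cnn.www", "language", "EN")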

Next Lecture
• In Lec 10, I will cover:
- Transactions in distributed systems
- Consistency models
- Two-phase commit
- Consensus protocol: Paxos