
CPSC 426/526: MapReduce & BigTable

Ennan Zhai, Computer Science Department

Yale University

Lecture Roadmap

• Cloud Computing Overview
• Challenges in the Clouds
• Distributed File Systems: GFS
• Data Processing & Analysis: MapReduce
• Database: BigTable

Last Lecture

Google Applications, e.g., Gmail and Google Maps
MapReduce - Lec 9 | BigTable - Lec 9
Google File System (GFS) - Lec 8

Today’s Lecture

Google Applications, e.g., Gmail and Google Maps
MapReduce - Lec 9 | BigTable - Lec 9
Google File System (GFS) - Lec 8

Recall: How does GFS work?

Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

Google File System [SOSP’03]

GFS

Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

Google File System [SOSP’03]

GFS

Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

Data

Google File System [SOSP’03]

GFS

Metadata 1 2 3

Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

Google File System [SOSP’03]

GFS

Metadata 1 2 3

Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

Google File System [SOSP’03]
The design insights:
- Metadata is used for indexing chunks
- Huge files -> 64 MB for each chunk -> fewer chunks
- Reduce client-master interaction and metadata size
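
As a rough illustration of why big chunks shrink the master's metadata, here is a back-of-envelope sketch in Python; the 200 TB total and the ~64 bytes of per-chunk metadata are assumptions for illustration, not figures from the slides.

    # Back-of-envelope: how much chunk metadata does the master hold?
    # Assumptions (illustrative only): 200 TB of file data, ~64 bytes of
    # master metadata per chunk.
    TB = 10**12
    data_bytes = 200 * TB
    metadata_per_chunk = 64

    for chunk_mb in (1, 64):
        chunks = data_bytes // (chunk_mb * 10**6)
        metadata_gb = chunks * metadata_per_chunk / 10**9
        print(f"{chunk_mb:>2} MB chunks -> {chunks:,} chunks, ~{metadata_gb:.1f} GB of metadata")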

GFS

Metadata 1 2 3

Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

Data

Google File System [SOSP’03]

GFS

Metadata 1 2 1 3 1 2 3 2 3

Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

Data

Google File System [SOSP’03]

GFS

Metadata 1 2 1 3 1 2 3 2 3

Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

Data

Google File System [SOSP’03]
The design insights:
- Replicas are used to ensure availability
- Master can choose the nearest replicas for the client
- Read and append-only access makes it easy to manage replicas

GFS

Metadata 1 2 1 3 1 2 3 2 3

Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

Google File System [SOSP’03]

read

GFS

Metadata 1 2 1 3 1 2 3 2 3

Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

Google File System [SOSP’03]

read:
- IP address for each chunk
- the ID for each chunk

GFS

Metadata 1 2 1 3 1 2 3 2 3

Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

Google File System [SOSP’03]

read: <Chunkserver1’s IP, Chunk1>, <Chunkserver1’s IP, Chunk2>, <Chunkserver2’s IP, Chunk3>

GFS

Metadata 1 2 1 3 1 2 3 2 3

Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

Google File System [SOSP’03]

read

GFS

Metadata 1 2 1 3 1 2 3 2 3

Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

Google File System [SOSP’03]

read
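
Putting the read flow above into a tiny sketch: the client asks the master only for <chunkserver IP, chunk ID> pairs, then fetches the data directly from the chunkservers. This is a minimal illustration with made-up class and method names, not the actual GFS client interface.

    class Master:
        def __init__(self):
            # Metadata only: filename -> [(chunkserver_ip, chunk_id), ...]
            self.metadata = {"bigfile": [("10.0.0.1", "chunk1"),
                                         ("10.0.0.1", "chunk2"),
                                         ("10.0.0.2", "chunk3")]}

    class Chunkserver:
        def __init__(self, chunks):
            self.chunks = chunks                      # chunk_id -> bytes

        def read_chunk(self, chunk_id):
            return self.chunks[chunk_id]

    def read_file(master, chunkservers, filename):
        # 1. One small metadata lookup at the master.
        locations = master.metadata[filename]
        # 2. Bulk data flows directly between client and chunkservers.
        return b"".join(chunkservers[ip].read_chunk(cid) for ip, cid in locations)

    servers = {"10.0.0.1": Chunkserver({"chunk1": b"aa", "chunk2": b"bb"}),
               "10.0.0.2": Chunkserver({"chunk3": b"cc"})}
    print(read_file(Master(), servers, "bigfile"))    # b'aabbcc'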

Why does GFS try to avoid random writes?

Put GFS in a Datacenter

Put GFS in a Datacenter

Each rack has a master

Put GFS in a Datacenter

Each rack has a master

After GFS, what do we need to do next?

GFS

Metadata 1 2 1 3 1 2 3 2 3

Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

GFS

Metadata 1 2 1 3 1 2 3 2 3

Processing and Analyzing Data

Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

Very Important!


• A toy problem: word count
- We have 10 billion documents
- Average document size is 20 KB => 10 billion docs = 200 TB

• Our solution:

Processing and Analyzing Data

• A toy problem: word count
- We have 10 billion documents
- Average document size is 20 KB => 10 billion docs = 200 TB

• Our solution:

for each document d:
    for each word w in d:
        word_count[w]++

Processing and Analyzing Data

• A toy problem: word count
- We have 10 billion documents
- Average document size is 20 KB => 10 billion docs = 200 TB

• Our solution:

for each document d:
    for each word w in d:
        word_count[w]++

Approximately one month on a single machine.
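
A quick sanity check of that estimate, assuming a single machine scanning the data sequentially at roughly 100 MB/s (the scan rate is an assumption, not a number from the slides):

    data_bytes = 200 * 10**12          # 200 TB of documents
    scan_rate = 100 * 10**6            # ~100 MB/s sequential read (assumed)
    seconds = data_bytes / scan_rate
    print(f"{seconds / 86400:.0f} days")   # ~23 days, i.e. roughly one month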

Processing and Analyzing Data

• Inspired by the map and reduce operations commonly used in functional programming languages like LISP

• Users implement an interface with two primary methods:
- 1. Map: <key1, value1> -> <key2, value2>
- 2. Reduce: <key2, value2> -> <value3>

MapReduce Programming Model

MapReduce Programming Model

Input -> Map(k,v) --> (k’,v’) -> Group (k’,v’)s by k’ -> Reduce(k’,v’[]) --> v’’ -> Output

• Inspired by the map and reduce operations commonly used in functional programming languages like LISP

• Users implement an interface with two primary methods:
- 1. Map: <key1, value1> -> <key2, value2>
- 2. Reduce: <key2, value2[ ]> -> <value3>

MapReduce Programming Model

Input -> Map(k,v) --> (k’,v’) -> Group (k’,v’)s by k’ -> Reduce(k’,v’[]) --> v’’ -> Output

• Map, a pure function written by the user, takes an input key/value pair (e.g., a pair like <doc_id, doc_content>) and produces a set of intermediate key/value pairs

• Inspired by the map and reduce operations commonly used in functional programming languages like LISP

• Users implement an interface with two primary methods:
- 1. Map: <key1, value1> -> <key2, value2>
- 2. Reduce: <key2, value2[ ]> -> <value3>

MapReduce Programming Model

Input -> Map(k,v) --> (k’,v’) -> Group (k’,v’)s by k’ -> Reduce(k’,v’[]) --> v’’ -> Output

• After the map phase, all the intermediate values for a given intermediate key are combined into a list and given to a reducer for aggregating/merging the result.
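
To make the two-method interface concrete, here is a minimal single-process sketch of the Map -> Group -> Reduce pipeline in Python. It only illustrates the programming model on this slide; a real MapReduce runtime distributes the map and reduce tasks across many machines.

    from collections import defaultdict

    def run_mapreduce(inputs, map_fn, reduce_fn):
        """inputs: iterable of (key1, value1); map_fn emits (key2, value2) pairs;
        reduce_fn turns (key2, [value2, ...]) into a final value."""
        # Map phase: each input pair may emit many intermediate pairs.
        intermediate = []
        for k1, v1 in inputs:
            intermediate.extend(map_fn(k1, v1))
        # Group phase: collect all values that share an intermediate key.
        groups = defaultdict(list)
        for k2, v2 in intermediate:
            groups[k2].append(v2)
        # Reduce phase: one reduce call per intermediate key.
        return {k2: reduce_fn(k2, vals) for k2, vals in groups.items()}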

GFS

Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

MapReduce [OSDI’04]

• GFS is responsible for storing data for MapReduce
- Data is split into chunks and distributed across nodes
- Each chunk is replicated
- Offers redundant storage for massive amounts of data

GFS

Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

MapReduce [OSDI’04]

MapReduce

GFS

Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

MapReduce [OSDI’04]

MapReduce

Heard of Hadoop? HDFS + Hadoop MapReduce

GFS

MapReduce [OSDI’04]

Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

MapReduce: JobTracker TaskTracker TaskTracker TaskTracker TaskTracker TaskTracker

GFS

Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

MapReduce [OSDI’04]
• Two core components:
- JobTracker: assigns tasks to different workers
- TaskTracker: executes map and reduce programs

MapReduce: JobTracker TaskTracker TaskTracker TaskTracker TaskTracker TaskTracker

GFS

Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

MapReduce [OSDI’04]
• Two core components:
- JobTracker: assigns tasks to different workers
- TaskTracker: executes map and reduce programs

MapReduce: JobTracker TaskTracker TaskTracker TaskTracker TaskTracker TaskTracker

MapReduce

GFS

JobTracker TaskTracker TaskTracker TaskTracker TaskTracker TaskTracker

Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

MapReduce [OSDI’04]

Document A1, Document A2, Document A3

MapReduce

GFS

JobTracker TaskTracker TaskTracker TaskTracker TaskTracker TaskTracker

Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

MapReduce [OSDI’04]

Document A1, Document A2, Document A3

MapReduce

GFS

JobTracker TaskTracker TaskTracker TaskTracker TaskTracker TaskTracker

Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

Document A1, Document A2, Document A3

MapReduce

GFS

JobTracker TaskTracker TaskTracker TaskTracker TaskTracker TaskTracker

Master Chunkserver1 Chunkserver2 Chunkserver3 Chunkserver4 Chunkserver5

Document A1, Document A2, Document A3, TaskTracker

Word Count Example

• Word count is challenging over massive amounts of data
• Fundamental statistics are often aggregate functions
• Most aggregation functions are distributive in nature
• MapReduce breaks complex tasks into smaller pieces that can be executed in parallel

Why we care about word count

Count the # of occurrences of each word in a large amount of input data

Map(input_key, input_value) {
    foreach word w in input_value:
        emit(w, 1);
}

Map Phase (On a Worker)

Count the # of occurrences of each word in a large amount of input data

Map(input_key, input_value) {
    foreach word w in input_value:
        emit(w, 1);
}

(3414, ‘the cat sat on the mat’)
(3437, ‘the aardvark sat on the sofa’)

• Input to the Mapper

Map Phase (On a Worker)

Count the # of occurrences of each word in a large amount of input data

Map(input_key, input_value) {
    foreach word w in input_value:
        emit(w, 1);
}

(3414, ‘the cat sat on the mat’)
(3437, ‘the aardvark sat on the sofa’)

• Output from the Mapper

(‘the’, 1), (‘cat’, 1), (‘sat’, 1), (‘on’, 1),(‘the’, 1), (‘mat’, 1), (‘the’, 1), (‘aardvark’, 1), (‘sat’, 1), (‘on’, 1), (‘the’, 1), (‘sofa’, 1)

• Input to the Mapper

Map Phase (On a Worker)

Count the # of occurrences of each word in a large amount of input data

Map(input_key, input_value) {
    foreach word w in input_value:
        emit(w, 1);
}

(3414, ‘the cat sat on the mat’)
(3437, ‘the aardvark sat on the sofa’)

• Output from the Mapper

(‘the’, 1), (‘cat’, 1), (‘sat’, 1), (‘on’, 1),(‘the’, 1), (‘mat’, 1), (‘the’, 1), (‘aardvark’, 1), (‘sat’, 1), (‘on’, 1), (‘the’, 1), (‘sofa’, 1)

• Input to the Mapper

Map Phase (On a Worker)

• After the Map, all the intermediate values for a given intermediate key are combined together into a list

Reducer (On a Worker)

• After the Map, all the intermediate values for a given intermediate key are combined together into a list

Add up all the values associated with each intermediate key:

Reduce(output_key, intermediate_vals) {
    set count = 0;
    foreach v in intermediate_vals:
        count += v;
    emit(output_key, count);
}

Reducer (On a Worker)

• Input of the Reducer

(‘the’, 1), (‘cat’, 1), (‘sat’, 1), (‘on’, 1),(‘the’, 1), (‘mat’, 1), (‘the’, 1), (‘aardvark’, 1), (‘sat’, 1), (‘on’, 1), (‘the’, 1), (‘sofa’, 1)

Reducer (On a Worker)

Add up all the values associated with each intermediate key:

Reduce(output_key, intermediate_vals) {
    set count = 0;
    foreach v in intermediate_vals:
        count += v;
    emit(output_key, count);
}

• Input of the Reducer

(‘the’, 1), (‘cat’, 1), (‘sat’, 1), (‘on’, 1),(‘the’, 1), (‘mat’, 1), (‘the’, 1), (‘aardvark’, 1), (‘sat’, 1), (‘on’, 1), (‘the’, 1), (‘sofa’, 1)

Reducer (On a Worker)

Add up all the values associated with each intermediate key:

Reduce(output_key, intermediate_vals) {
    set count = 0;
    foreach v in intermediate_vals:
        count += v;
    emit(output_key, count);
}

• Output from the Reducer

(‘the’, 4), (‘sat’, 2), (‘on’, 2), (‘sofa’, 1), (‘mat’, 1), (‘cat’, 1), (‘aardvark’, 1)

• Input of the Reducer

(‘the’, 1), (‘cat’, 1), (‘sat’, 1), (‘on’, 1),(‘the’, 1), (‘mat’, 1), (‘the’, 1), (‘aardvark’, 1), (‘sat’, 1), (‘on’, 1), (‘the’, 1), (‘sofa’, 1)

Reducer (On a Worker)

Add up all the values associated with each intermediate key:

Reduce(output_key, intermediate_vals) {
    set count = 0;
    foreach v in intermediate_vals:
        count += v;
    emit(output_key, count);
}

• Output from the Reducer

(‘the’, 4), (‘sat’, 2), (‘on’, 2), (‘sofa’, 1), (‘mat’, 1), (‘cat’, 1), (‘aardvark’, 1)

Grouping + Reducer
• Input of the grouping

(‘the’, 1), (‘cat’, 1), (‘sat’, 1), (‘on’, 1),(‘the’, 1), (‘mat’, 1), (‘the’, 1), (‘aardvark’, 1), (‘sat’, 1), (‘on’, 1), (‘the’, 1), (‘sofa’, 1)

• After the Map, all the intermediate values for a given intermediate key are combined together into a list

(‘the’, 1), (‘cat’, 1), (‘sat’, 1), (‘on’, 1),(‘the’, 1), (‘mat’, 1), (‘the’, 1), (‘aardvark’, 1), (‘sat’, 1), (‘on’, 1), (‘the’, 1), (‘sofa’, 1)

Mapper Input:
the cat sat on the mat
the aardvark sat on the sofa

Mapping ->

Mapper Output:
(‘the’, 1), (‘cat’, 1), (‘sat’, 1), (‘on’, 1), (‘the’, 1), (‘mat’, 1), (‘the’, 1), (‘aardvark’, 1), (‘sat’, 1), (‘on’, 1), (‘the’, 1), (‘sofa’, 1)

Grouping/Shuffling ->

Reducer Input:
aardvark, 1
cat, 1
mat, 1
on, [1, 1]
sat, [1, 1]
sofa, 1
the, [1, 1, 1, 1]

Reducing ->

Final Output (Map + Reduce):
aardvark, 1
cat, 1
mat, 1
on, 2
sat, 2
sofa, 1
the, 4

High-Level Picture for MR
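
The same picture as a tiny self-contained Python script, mirroring the Map and Reduce pseudocode above on the two example sentences:

    from collections import defaultdict

    docs = {3414: "the cat sat on the mat",
            3437: "the aardvark sat on the sofa"}

    # Map: emit (word, 1) for every word of every document.
    pairs = [(w, 1) for text in docs.values() for w in text.split()]

    # Group/shuffle: gather the 1s emitted for each word.
    grouped = defaultdict(list)
    for word, one in pairs:
        grouped[word].append(one)

    # Reduce: sum each word's list.
    counts = {word: sum(ones) for word, ones in grouped.items()}
    print(counts)   # {'the': 4, 'cat': 1, 'sat': 2, 'on': 2, 'mat': 1, 'aardvark': 1, 'sofa': 1}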

Let’s use MapReduce to help Google Map (a map of India)

We want to compute the average temperature for each state

Let’s use MapReduce to help Google Map
MP: 75, CG: 72, OR: 72
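
A hedged sketch of the per-state average in the same Map/Group/Reduce style; the readings below are made up for illustration (the slide only shows a few values such as MP: 75, CG: 72, OR: 72):

    from collections import defaultdict

    # (station_id, (state, temperature)) input records -- illustrative values.
    readings = [(1, ("MP", 75)), (2, ("MP", 77)),
                (3, ("CG", 72)),
                (4, ("OR", 72)), (5, ("OR", 70))]

    # Map: re-key each temperature by its state.
    pairs = [(state, temp) for _, (state, temp) in readings]

    # Group: collect every temperature reported for a state.
    grouped = defaultdict(list)
    for state, temp in pairs:
        grouped[state].append(temp)

    # Reduce: average each state's list.
    averages = {state: sum(t) / len(t) for state, t in grouped.items()}
    print(averages)   # {'MP': 76.0, 'CG': 72.0, 'OR': 71.0}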

Lecture Roadmap

• Cloud Computing Overview
• Challenges in the Clouds
• Distributed File Systems: GFS
• Data Processing & Analysis: MapReduce
• Database: BigTable

Today’s Lecture

Google Applications, e.g., Gmail and Google Maps
MapReduce - Lec 9 | BigTable - Lec 9
Google File System (GFS) - Lec 8

Motivation for BigTable
• Lots of (semi-)structured data at Google
- URLs: content, crawl metadata, links, anchors [Search Engine]
- Per-user data: user preference settings, queries [Hangout]
- Geographic locations: physical entities and satellite image data [Google Maps and Google Earth]
• Scale is large:
- Billions of URLs, many versions/page (~20 KB/version)
- Hundreds of millions of users, thousands of queries/sec
- 100 TB+ of satellite image data

Why not just use a commercial DB?
• Scale is too large for most commercial databases
• Even if it weren't, the cost would be very high
- Building internally means the system can be applied across many projects for low incremental cost
• Low-level storage optimizations help performance significantly
• Fun and challenging to build large-scale DB systems :)

BigTable [OSDI’06]

• Distributed multi-level map:
- With an interesting data model
• Fault-tolerant, persistent
• Scalable:
- Thousands of servers
- Terabytes of in-memory data
- Petabytes of disk-based data
- Millions of reads/writes per second, efficient scans

BigTable Status in 2006
• Design/initial implementation started beginning of 2004
• Currently ~100 BigTable cells
• Production use or active development for many projects:
- Google Print
- My Search History
- Crawling/indexing pipeline
- Google Maps/Google Earth
• Largest BigTable cell manages ~200 TB of data spread over several thousand machines (larger cells planned)

Building Blocks for BigTable

• BigTable uses these building blocks:
- Google File System (GFS): stores persistent state and data
- Scheduler: schedules jobs involved in BigTable serving
- Lock service: master election, location bootstrapping
- MapReduce: often used to process BigTable data

Remember: what is the difference between a database and a file system?

Building Blocks for BigTable

• BigTable uses these building blocks:
- Google File System (GFS): stores persistent state and data
- Scheduler: schedules jobs involved in BigTable serving
- Lock service: master election, location bootstrapping
- MapReduce: often used to process BigTable data

1. What is the data model?
2. How to implement it?

BigTable’s Data Model Design

• BigTable is NOT a relational database
• BigTable appears as a large table
- A BigTable is a sparse, distributed, persistent multidimensional sorted map


Webtable example (rows sorted lexicographically):

rows          | "language" | "content"
com.aaa       | EN         | <!DOCTYPE html PUBLIC ...
com.cnn.www   | EN         | <!DOCTYPE html PUBLIC ...
com.weather   | EN         | <!DOCTYPE html PUBLIC ...
...           | ...        | <!DOCTYPE html PUBLIC ...

BigTable’s Data Model Design

• (row, column, timestamp) -> cell content

[Webtable example figure: each (row, column) cell stores multiple timestamped versions of its content, e.g., versions at t2, t3, t4, t6, t11]

Rows

[Webtable example figure, as above]

• Row name is an arbitrary string and is used as key
- Access to data in a row is atomic
- Row creation is implicit upon storing data
• Rows ordered lexicographically

Column

[Webtable example figure, now including anchor columns:]

rows          | "language" | "content"           | "anchor:cnnsi.com" | "anchor:mylook.ca"
com.aaa       | EN         | <!DOCTYPE html .... |                    |
com.cnn.www   | EN         | <!DOCTYPE html .... | CNN                | CNN.com
com.weather   | EN         | <!DOCTYPE html .... |                    |
...           | ...        | ...                 |                    |

• Columns have a two-level name structure: family:optional_qualifier
- e.g., in "anchor:cnnsi.com", the family is "anchor" and the qualifier is "cnnsi.com"
• Column family
- Unit of access control
- Has associated type information
• Qualifier gives unbounded columns
- Additional level of indexing, if desired
• Anchor example:
- www.cnn.com is referenced by Sports Illustrated (cnnsi.com) and mylook (mylook.ca)
- The value ("com.cnn.www", "anchor:cnnsi.com") is "CNN", the reference text from cnnsi.com
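
One way to picture this sparse multidimensional sorted map is as a nested dictionary keyed by row, then column, then timestamp. This is only an illustration of the data model (with made-up timestamps), not how BigTable actually stores data:

    # table[row][column][timestamp] -> value; absent cells simply don't exist (sparse).
    webtable = {
        "com.aaa": {
            "language": {2: "EN"},
            "content":  {2: "<!DOCTYPE html ..."},
        },
        "com.cnn.www": {
            "language":          {2: "EN"},
            "content":           {3: "<!DOCTYPE html ...", 6: "<!DOCTYPE html ..."},
            "anchor:cnnsi.com":  {9: "CNN"},
            "anchor:mylook.ca":  {8: "CNN.com"},
        },
    }

    def read_latest(table, row, column):
        # Each cell can hold several timestamped versions;
        # a plain read returns the newest one.
        versions = table[row][column]
        return versions[max(versions)]

    print(read_latest(webtable, "com.cnn.www", "anchor:cnnsi.com"))   # CNN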

Tablet and Table

• A table starts as one tablet
• As it grows, it is split into multiple tablets
- Approximate size: 100-200 MB per tablet by default

[Webtable example figure: the table's rows (com.aaa, com.cnn.www, com.weather, com.tech, com.wikipedia, com.zoom, ...) and their "language"/"content" cells are split across Tablet 1 and Tablet 2]

Building Blocks for BigTable

• BigTable uses these building blocks:
- Google File System (GFS): stores persistent state and data
- Scheduler: schedules jobs involved in BigTable serving
- Lock service: master election, location bootstrapping
- MapReduce: often used to process BigTable data

1. What is the data model?
2. How to implement it?

BigTable Architecture

[BigTable architecture figure:]
- BigTable client library
- BigTable Master
- Tablet Servers, each serving a set of Tablets
- Chubby (lock service), Google File System, Cluster Scheduling

The First Thing: Locating Tablets
• Since tablets move around from server to server, given a row, how do clients find the right machine?
- We need to find the tablet whose row range covers the target row
• One solution: use the BigTable master
- A central server would almost certainly be a bottleneck in a large system
• Instead: store special tables containing tablet location information in the BigTable cell itself

The First Thing: Locating Tablets
• 3-level hierarchical lookup scheme for tablets
- Location is IP:port of relevant server
- 1st level: bootstrapped from lock service, points to owner of META0
- 2nd level: uses META0 data to find owner of appropriate META1 tablet
- 3rd level: META1 table holds locations of tablets of all other tables

[Lookup chain figure: Chubby -> pointer to META0 location -> META0 table -> META1 table -> actual tablet in table T]

[Lookup example figure: search row key "Hi"]
- Chubby -> pointer to META0 location
- META0 table: Key: A (1.1.1.1), Key: F (1.1.1.2), Key: I (1.1.1.3), Key: S (1.1.1.4), ...
- META1 table: Key: F (1.1.1.7), Key: G (1.1.1.8), Key: H (1.1.1.9), Key: Ha (1.1.1.100), Key: Hc (1.1.1.101), Key: Hi (1.1.1.104), ...
- Actual tablet in table T
- Servers highlighted in the figure: 1.1.1.2, then 1.1.1.9
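
A minimal sketch of walking that hierarchy with sorted start keys, built from the example entries above; the bisection logic and the toy tables are illustrative, not BigTable's client code:

    import bisect

    # Each level maps the start key of a tablet's row range to the server holding it.
    META0 = {"A": "1.1.1.1", "F": "1.1.1.2", "I": "1.1.1.3", "S": "1.1.1.4"}
    META1 = {"F": "1.1.1.7", "G": "1.1.1.8", "H": "1.1.1.9",
             "Ha": "1.1.1.100", "Hc": "1.1.1.101", "Hi": "1.1.1.104"}

    def owner(table, row_key):
        # The tablet covering row_key is the one with the greatest start key <= row_key.
        keys = sorted(table)
        return table[keys[bisect.bisect_right(keys, row_key) - 1]]

    def locate(row_key):
        meta0 = "META0 owner from Chubby"     # 1st level: lock service bootstrap
        meta1 = owner(META0, row_key)         # 2nd level: META0 -> right META1 tablet
        tablet = owner(META1, row_key)        # 3rd level: META1 -> actual tablet server
        return meta0, meta1, tablet

    print(locate("Hi"))   # ('META0 owner from Chubby', '1.1.1.2', '1.1.1.104') with this toy data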

[BigTable architecture figure repeated, animating the lookup from the example above: the client library walks the location hierarchy and then contacts the servers 1.1.1.2 and 1.1.1.9 shown in the figure]

Metadata Operations
- Create/delete tables
- Create/delete column families
- Change metadata

BigTable’s APIs
• Metadata operations
- Create/delete tables, column families, change metadata
• Writes: single-row, atomic
- Set(): write cells in a row
- DeleteCells(): delete cells in a row
- DeleteRow(): delete all cells in a row
• Reads: scanner abstraction
- Read arbitrary cells in a BigTable table

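
A toy sketch of that API surface; the method names Set, DeleteCells, and DeleteRow come from the slide, but the signatures and this in-memory table are illustrative, not Google's client library:

    class ToyBigTable:
        """In-memory stand-in that mimics single-row writes and a scanner."""
        def __init__(self):
            self.rows = {}                                  # row -> {column: value}

        def Set(self, row, column, value):                  # write cells in a row
            self.rows.setdefault(row, {})[column] = value

        def DeleteCells(self, row, columns):                # delete cells in a row
            for c in columns:
                self.rows.get(row, {}).pop(c, None)

        def DeleteRow(self, row):                           # delete all cells in a row
            self.rows.pop(row, None)

        def Scan(self, start_row, end_row):                 # scanner abstraction
            for row in sorted(self.rows):
                if start_row <= row < end_row:
                    yield row, self.rows[row]

    t = ToyBigTable()
    t.Set("com.cnn.www", "language", "EN")
    t.Set("com.cnn.www", "anchor:cnnsi.com", "CNN")
    for row, cells in t.Scan("com.a", "com.z"):
        print(row, cells)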

BigTable’s Write Path

[Write path figure:]
Client -- Put/Delete --> Tablet Server
Tablet Server -- write to log --> Log Server -- append --> File System
Tablet Server -- write to memstore --> Memstore
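
A minimal sketch of that path, assuming the usual log-then-memstore ordering (append the mutation to a file-system-backed log first, then apply it in memory); the class and method names are illustrative:

    class ToyTabletServer:
        def __init__(self, log_file):
            self.log = log_file          # append-only log on the file system
            self.memstore = {}           # in-memory (row, column) -> value

        def put(self, row, column, value):
            # 1. Append to the log so the write survives a crash.
            self.log.write(f"PUT {row} {column} {value}\n")
            self.log.flush()
            # 2. Apply to the memstore, where subsequent reads will see it.
            self.memstore[(row, column)] = value

    with open("tablet.log", "a") as log:
        server = ToyTabletServer(log)
        server.put("com.cnn.www", "language", "EN")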

Next Lecture
• In Lec 10, I will cover:
- Transactions in distributed systems
- Consistency models
- Two-phase commit
- Consensus protocol: Paxos