Google Bigtable - ict.ac.cnprof.ict.ac.cn/DComputing/uploads/2011/DC_6_bigtable.… ·  ·...

Google Bigtable
Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber
Google, Inc.

UWCS OS Seminar Discussion
Erik Paulson
2 October 2006

See also the (other) UW presentation by Jeff Dean in September of 2005 (see the link on the seminar page, or just google for “google bigtable”)



Before we begin…

• Intersection of databases and distributed systems

• Will try to explain (or at least warn) when we hit a patch of database

• Remember this is a discussion!


Google Scale

• Lots of data
    – Copies of the web, satellite data, user data, email and USENET, Subversion backing store
• Many incoming requests
• No commercial system big enough
    – Couldn’t afford it if there was one
    – Might not have made appropriate design choices
• Firm believers in the End-to-End argument
• 450,000 machines (NYTimes estimate, June 14th, 2006)


Building Blocks

• Scheduler (Google WorkQueue)
• Google Filesystem
• Chubby Lock service
• Two other pieces helpful but not required
    – Sawzall
    – MapReduce (despite what the Internet says)
• BigTable: build a more application-friendly storage service using these parts


Google File System

• Large-scale distributed “filesystem”
• Master: responsible for metadata
• Chunk servers: responsible for reading and writing large chunks of data
• Chunks replicated on 3 machines, master responsible for ensuring replicas exist
• SOSP ’03 Paper


Chubby

• {lock/file/name} service
• Coarse-grained locks, can store small amount of data in a lock
• 5 replicas, need a majority vote to be active
• Also an OSDI ’06 Paper


Data model: a big map

• <Row, Column, Timestamp> triple for key - lookup, insert, and delete API
• Arbitrary “columns” on a row-by-row basis
• Column family:qualifier. Family is heavyweight, qualifier lightweight
• Column-oriented physical store - rows are sparse!
• Does not support a relational model
• No table-wide integrity constraints
• No multirow transactions
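
A minimal sketch of the "big map" idea, assuming an in-memory dictionary keyed by the <Row, Column, Timestamp> triple. The class name `BigMap` and its methods `put`, `get`, `delete`, and `columns` are hypothetical names chosen for illustration; they are not Bigtable's actual API.

```python
from typing import Dict, Optional, Set, Tuple

# Key is the (row, "family:qualifier", timestamp) triple from the slide.
Key = Tuple[str, str, int]


class BigMap:
    """Illustrative sparse map; hypothetical, not Bigtable's real interface."""

    def __init__(self) -> None:
        self._cells: Dict[Key, bytes] = {}

    def put(self, row: str, column: str, timestamp: int, value: bytes) -> None:
        """Insert one versioned cell."""
        self._cells[(row, column, timestamp)] = value

    def get(self, row: str, column: str,
            timestamp: Optional[int] = None) -> Optional[bytes]:
        """Return the cell at `timestamp`, or the newest version if omitted."""
        if timestamp is not None:
            return self._cells.get((row, column, timestamp))
        versions = [(ts, v) for (r, c, ts), v in self._cells.items()
                    if r == row and c == column]
        if not versions:
            return None
        return max(versions, key=lambda pair: pair[0])[1]

    def delete(self, row: str, column: str) -> None:
        """Remove every version of one cell."""
        doomed = [k for k in self._cells if k[0] == row and k[1] == column]
        for k in doomed:
            del self._cells[k]

    def columns(self, row: str) -> Set[str]:
        """Columns present on this row; rows are sparse, so this varies."""
        return {c for (r, c, _) in self._cells if r == row}
```

Each row carries only the columns that were written to it, which is what "arbitrary columns on a row-by-row basis" amounts to in this toy model.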


SSTable

• Immutable, sorted file of key-value pairs
• Chunks of data plus an index
    – Index is of block ranges, not values

[Figure: an SSTable drawn as an index plus three 64K blocks]
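
The block-range index can be sketched as follows. This is a toy in-memory model under stated assumptions: blocks hold a fixed number of entries rather than 64K bytes, keys are unique, and the index stores the last key of each block. The class name and the `block_size` parameter are hypothetical.

```python
import bisect
from typing import List, Optional, Tuple


class SSTable:
    """Immutable, sorted key-value pairs split into blocks plus an index."""

    def __init__(self, items: List[Tuple[str, str]], block_size: int = 2) -> None:
        ordered = sorted(items)
        # Assumption: block_size counts entries; the slide's blocks are 64K bytes.
        self._blocks = [ordered[i:i + block_size]
                        for i in range(0, len(ordered), block_size)]
        # The index records the last key of each block (a range), not the values.
        self._index = [block[-1][0] for block in self._blocks]

    def get(self, key: str) -> Optional[str]:
        # Binary-search the index for the first block whose last key >= key.
        i = bisect.bisect_left(self._index, key)
        if i == len(self._blocks):
            return None  # key sorts after every block
        # Only that one block has to be scanned.
        for k, v in self._blocks[i]:
            if k == key:
                return v
        return None
```

A lookup touches the small index first and then reads at most one block, which is the point of indexing block ranges instead of individual values.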


Tablet

• Contains some range of rows of the table
• Built out of multiple SSTables

[Figure: a tablet (Start: aardvark, End: apple) built from two SSTables, each drawn as an index plus three 64K blocks]


Table

• Multiple tablets make up the table
• SSTables can be shared
• Tablets do not overlap, SSTables can overlap

[Figure: four SSTables backing two tablets, one covering aardvark to apple and one covering apple_two_E to boat]
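
Because tablets do not overlap, mapping a row key to its tablet reduces to a search over sorted row ranges. The sketch below assumes inclusive [start, end] ranges and uses a hypothetical `Table` class; it illustrates the non-overlap property only and is not the lookup scheme Bigtable itself uses to locate tablets.

```python
import bisect
from typing import List, Optional, Tuple

RowRange = Tuple[str, str]  # inclusive (start_row, end_row); an assumption


class Table:
    """A table as a sorted list of non-overlapping tablet row ranges."""

    def __init__(self, ranges: List[RowRange]) -> None:
        self._ranges = sorted(ranges)
        # Enforce the slide's rule: tablets do not overlap.
        for (_, prev_end), (next_start, _) in zip(self._ranges, self._ranges[1:]):
            if next_start <= prev_end:
                raise ValueError("tablets overlap")
        self._ends = [end for _, end in self._ranges]

    def find_tablet(self, row: str) -> Optional[RowRange]:
        # First tablet whose end row is >= the requested row.
        i = bisect.bisect_left(self._ends, row)
        if i < len(self._ranges) and self._ranges[i][0] <= row:
            return self._ranges[i]
        return None  # row falls in a gap or outside the table
```

With the two tablets from the figure, `Table([("aardvark", "apple"), ("apple_two_E", "boat")]).find_tablet("banana")` returns the second range.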


Finding a tablet


Servers

• Tablet servers manage tablets, multiple tablets per server. Each tablet is 100-200 megs
    – Each tablet lives at only one server
    – Tablet server splits tablets that get too big
• Master responsible for load balancing and fault tolerance
    – Use Chubby to monitor health of tablet servers, restart failed servers
    – GFS replicates data. Prefer to start tablet server on same machine that the data is already at
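
A rough sketch of "splits tablets that get too big", assuming a tablet is just a dictionary of rows and that the split point is the middle row key. The function name `split_tablet`, the `max_bytes` threshold, and the midpoint policy are all assumptions for illustration; the slides do not say how the split point is chosen.

```python
from typing import Dict, List


def split_tablet(rows: Dict[str, bytes], max_bytes: int) -> List[Dict[str, bytes]]:
    """Return [rows] if small enough, else two halves split at the middle key."""
    size = sum(len(key.encode()) + len(value) for key, value in rows.items())
    if size <= max_bytes or len(rows) < 2:
        return [rows]
    keys = sorted(rows)
    mid = len(keys) // 2
    # The two halves cover adjacent, non-overlapping row ranges.
    left = {k: rows[k] for k in keys[:mid]}
    right = {k: rows[k] for k in keys[mid:]}
    return [left, right]
```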


Editing a table

• Mutations are logged, then applied to an in-memory version
• Logfile stored in GFS

[Figure: a series of inserts and deletes applied to a memtable, in front of a tablet (apple_two_E to boat) backed by two SSTables]
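
The log-then-apply order can be sketched like this. The class `TabletWriter` is hypothetical, a Python list stands in for the logfile in GFS, and `None` is used as the deletion marker; none of these details come from the slides.

```python
from typing import Dict, List, Optional, Tuple

LogRecord = Tuple[str, str, Optional[str]]  # (operation, key, value)


class TabletWriter:
    """Logs each mutation first, then applies it to the in-memory memtable."""

    def __init__(self) -> None:
        self.log: List[LogRecord] = []  # stand-in for the logfile in GFS
        self.memtable: Dict[str, Optional[str]] = {}

    def insert(self, key: str, value: str) -> None:
        self.log.append(("insert", key, value))  # 1. log the mutation
        self.memtable[key] = value               # 2. apply it in memory

    def delete(self, key: str) -> None:
        self.log.append(("delete", key, None))
        # Keep a deletion record (None) so older SSTable data stays hidden.
        self.memtable[key] = None

    @classmethod
    def recover(cls, log: List[LogRecord]) -> "TabletWriter":
        """Rebuild the memtable after a restart by replaying the log."""
        writer = cls()
        for op, key, value in log:
            if op == "insert" and value is not None:
                writer.insert(key, value)
            else:
                writer.delete(key)
        return writer
```

Because every mutation reaches the log before the memtable, replaying the log reproduces the in-memory state that was lost.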


Compactions

• Minor compaction – convert the memtable into an SSTable
    – Reduce memory usage
    – Reduce log traffic on restart
• Merging compaction
    – Reduce number of SSTables
    – Good place to apply policy “keep only N versions”
• Major compaction
    – Merging compaction that results in only one SSTable
    – No deletion records, only live data
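
The three compaction kinds can be sketched with dictionaries standing in for SSTables. Assumptions: each SSTable is a dict from key to value, `None` is a deletion record, lists are ordered oldest to newest, and versioning ("keep only N versions") is left out. All function names are hypothetical.

```python
from typing import Dict, List, Optional

SSTable = Dict[str, Optional[str]]  # None marks a deletion record


def minor_compaction(memtable: SSTable, sstables: List[SSTable]) -> List[SSTable]:
    """Freeze the memtable as the newest SSTable (list is oldest to newest)."""
    return sstables + [dict(sorted(memtable.items()))]


def merging_compaction(sstables: List[SSTable],
                       keep_deletions: bool = True) -> SSTable:
    """Merge several SSTables into one; newer entries shadow older ones."""
    merged: SSTable = {}
    for table in sstables:  # oldest first, so later updates win
        merged.update(table)
    if not keep_deletions:
        merged = {k: v for k, v in merged.items() if v is not None}
    return dict(sorted(merged.items()))


def major_compaction(sstables: List[SSTable]) -> List[SSTable]:
    """Merge everything into exactly one SSTable holding only live data."""
    return [merging_compaction(sstables, keep_deletions=False)]
```

A merging compaction over a subset must keep deletion records, since an older SSTable outside the merge may still hold the deleted key; only a major compaction sees all the data and can drop them.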


Locality Groups

• Group column families together into an SSTable
    – Avoid mingling data, i.e. page contents and page metadata
    – Can keep some groups all in memory
• Can compress locality groups
• Bloom Filters on locality groups – avoid searching SSTable
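
A small Bloom filter sketch showing how a negative answer lets a read skip an SSTable entirely. The bit-array size, the number of hash functions, and the use of SHA-256 with a per-hash prefix are arbitrary choices for illustration, not details from the slides.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Probabilistic set: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key: str) -> Iterator[int]:
        # Derive num_hashes bit positions by prefixing the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        # False means the key is definitely absent, so the SSTable
        # does not need to be searched at all.
        return all((self.bits >> pos) & 1 for pos in self._positions(key))
```

A reader would call `might_contain` before opening an SSTable and skip the disk access whenever it returns `False`.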


Microbenchmarks


Application at Google


Lessons learned

• Interesting point: only implement some of the requirements, since the last is probably not needed
• Many types of failure possible
• Big systems need proper systems-level monitoring
• Value simple designs