Bigtable

CSE 490h – Introduction to Distributed Computing, Winter 2008

Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 License.



GFS vs Bigtable

• GFS provides raw data storage
• We need:
    • More sophisticated storage
    • Flexible enough to be useful
    • Store semi-structured data
    • Reliable, scalable, etc.


Examples

• URLs: contents, crawl metadata, links, anchors, pagerank, …
• Per-user data: user preference settings, recent queries/search results, …
• Geographic locations: physical entities (shops, restaurants, etc.), roads, satellite image data, user annotations, …


Commercial DB

• Why not use a commercial database?
    • Not scalable enough
    • Too expensive
    • We need to perform low-level optimizations


MapReduce/BigTable is a step backwards?

Schemas are good.

Separation of the schema from the application is good.

High-level access languages are good.


BigTable Features

• Distributed key-value map
• Fault-tolerant, persistent
• Scalable
    • Thousands of servers
    • Terabytes of in-memory data
    • Petabytes of disk-based data
    • Millions of reads/writes per second, efficient scans
• Self-managing
    • Servers can be added/removed dynamically
    • Servers adjust to load imbalance


Basic Data Model

• Distributed multi-dimensional sparse map
    (row, column, timestamp) → cell contents

[Figure: row “www.cnn.com”, column “contents:”, with three versions of “<html>…” at timestamps t3, t11, and t17; axes labeled Rows, Columns, Timestamps]

• Good match for most Google applications
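
The (row, column, timestamp) → cell contents mapping can be illustrated with an ordinary sorted map. This is only a minimal in-memory sketch, not Bigtable's implementation; the names CellKey, SparseTable, and ReadCell are made up for the example, and timestamps are assumed to be plain integers.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical key for the sketch: (row, column, timestamp).
struct CellKey {
    std::string row;
    std::string column;
    int64_t timestamp;
};

// Orders by row, then column, then timestamp descending, so the newest
// version of a cell comes first.
struct CellKeyLess {
    bool operator()(const CellKey& a, const CellKey& b) const {
        if (a.row != b.row) return a.row < b.row;
        if (a.column != b.column) return a.column < b.column;
        return a.timestamp > b.timestamp;
    }
};

// Sparse: only cells that were actually written occupy an entry.
using SparseTable = std::map<CellKey, std::string, CellKeyLess>;

// Returns the newest value whose timestamp is <= at_or_before,
// or nullptr if the cell has no such version.
const std::string* ReadCell(const SparseTable& table, const std::string& row,
                            const std::string& column, int64_t at_or_before) {
    auto it = table.lower_bound(CellKey{row, column, at_or_before});
    if (it == table.end() || it->first.row != row ||
        it->first.column != column) {
        return nullptr;
    }
    return &it->second;
}
```

Sorting timestamps in descending order means a single lower_bound call lands on the most recent version that is not newer than the requested time.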


Rows

• Name is an arbitrary string
    • Access to data in a row is atomic
    • Row creation is implicit upon storing data
• Rows ordered lexicographically
    • Rows close together lexicographically usually reside on one or a small number of machines
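
Because rows are sorted lexicographically, the choice of row key decides which rows end up near each other. One common illustration is reversing hostname components (the API slide in this deck uses the row key "com.cnn.www"). The sketch below assumes that convention; ReverseHostname and SortedRowKeys are hypothetical helper names, not Bigtable functions.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Reverses the dot-separated components of a hostname,
// e.g. "www.cnn.com" becomes "com.cnn.www".
std::string ReverseHostname(const std::string& host) {
    std::vector<std::string> parts;
    std::stringstream ss(host);
    std::string part;
    while (std::getline(ss, part, '.')) parts.push_back(part);

    std::string out;
    for (std::size_t i = parts.size(); i > 0; --i) {
        if (i != parts.size()) out += '.';
        out += parts[i - 1];
    }
    return out;
}

// std::set keeps strings in lexicographic order, standing in for the
// row ordering of a table.
std::vector<std::string> SortedRowKeys(const std::vector<std::string>& hosts) {
    std::set<std::string> keys;
    for (const auto& h : hosts) keys.insert(ReverseHostname(h));
    return std::vector<std::string>(keys.begin(), keys.end());
}
```

With reversed keys, pages from the same domain sort next to each other, so they are likely to fall into the same contiguous row range.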


Columns

[Figure: row “www.cnn.com” with columns “contents:”, “anchor:cnnsi.com”, and “anchor:stanford.edu”; cell values shown are “<html>…”, “CNN”, and “CNN home page”]

• Columns have two-level name structure: family:optional_qualifier
• Column family
    • Unit of access control
    • Has associated type information
• Qualifier gives unbounded columns
    • Additional level of indexing, if desired


Column Families

Must be created before data can be stored

Small number of column families

Unbounded number of columns
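
The family:qualifier naming and the rule that a family must exist before data is stored can be sketched as follows. This is an illustrative model only; SplitColumn and FamilyCheckedRow are invented names, and a real table would track families per table rather than per row.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>

// Splits "family:qualifier" at the first ':'; the qualifier may be empty.
std::pair<std::string, std::string> SplitColumn(const std::string& column) {
    const std::string::size_type colon = column.find(':');
    if (colon == std::string::npos) return {column, ""};
    return {column.substr(0, colon), column.substr(colon + 1)};
}

// Hypothetical row that rejects writes to undeclared column families.
class FamilyCheckedRow {
public:
    void CreateFamily(const std::string& family) { families_.insert(family); }

    // Returns false if the column's family was never created.
    bool Set(const std::string& column, const std::string& value) {
        if (families_.count(SplitColumn(column).first) == 0) return false;
        cells_[column] = value;  // any qualifier is accepted within a family
        return true;
    }

    std::size_t ColumnCount() const { return cells_.size(); }

private:
    std::set<std::string> families_;            // small, declared up front
    std::map<std::string, std::string> cells_;  // unbounded set of columns
};
```

The set of families stays small and explicit, while the number of distinct qualifiers under a family is not limited.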


Timestamps

• Used to store different versions of data in a cell
    • New writes default to current time, but timestamps for writes can also be set explicitly by clients


Timestamps

• Garbage collection
    • Per-column-family settings to tell Bigtable to GC
        • “Only retain most recent K values in a cell”
        • “Keep values until they are older than K seconds”
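
The two garbage-collection settings can be sketched on the versions of a single cell. This is a simplified model under stated assumptions: timestamps are in seconds, and Versions, KeepMostRecent, and DropOlderThan are hypothetical names rather than Bigtable APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <string>

// Versions of one cell keyed by timestamp, newest first.
using Versions = std::map<int64_t, std::string, std::greater<int64_t>>;

// "Only retain most recent K values in a cell."
void KeepMostRecent(Versions& versions, std::size_t k) {
    auto it = versions.begin();
    for (std::size_t i = 0; i < k && it != versions.end(); ++i) ++it;
    versions.erase(it, versions.end());
}

// "Keep values until they are older than K seconds."
void DropOlderThan(Versions& versions, int64_t now, int64_t max_age_seconds) {
    const int64_t cutoff = now - max_age_seconds;
    // In descending order, upper_bound(cutoff) is the first timestamp
    // strictly smaller than the cutoff.
    versions.erase(versions.upper_bound(cutoff), versions.end());
}
```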


APIs

• Metadata operations
    • Create/delete tables, column families, change metadata
• Writes
    • Set(): write cells in a row
    • DeleteCells(): delete cells in a row
    • DeleteRow(): delete all cells in a row
• Reads
    • Scanner: read arbitrary cells in a bigtable
        • Each row read is atomic
        • Can restrict returned rows to a particular range
        • Can ask for just data from 1 row, all rows, etc.
        • Can ask for all columns, just certain column families, or specific columns
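
A scan restricted to a row range and a column family might look like the sketch below. It does not reproduce the real Scanner interface; Table, ScanResult, and Scan are names assumed for illustration, and the table is modeled as nested sorted maps.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// row -> (column -> value); both levels are kept sorted.
using Row = std::map<std::string, std::string>;
using Table = std::map<std::string, Row>;

struct ScanResult {
    std::string row;
    std::string column;
    std::string value;
};

// Hypothetical scanner: returns cells for rows in [start_row, end_row)
// whose column belongs to `family`. An empty family matches all columns.
std::vector<ScanResult> Scan(const Table& table, const std::string& start_row,
                             const std::string& end_row,
                             const std::string& family) {
    std::vector<ScanResult> out;
    const std::string prefix = family + ":";
    for (auto r = table.lower_bound(start_row);
         r != table.end() && r->first < end_row; ++r) {
        for (const auto& cell : r->second) {
            if (family.empty() ||
                cell.first.compare(0, prefix.size(), prefix) == 0) {
                out.push_back(ScanResult{r->first, cell.first, cell.second});
            }
        }
    }
    return out;
}
```

Because rows are sorted, the range restriction is a single lower_bound followed by a forward walk, which is what makes scans over contiguous rows cheap.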


API

Create / delete tables and column families

    // Open the table
    Table *T = OpenOrDie("/bigtable/web/webtable");

    // Write a new anchor and delete an old anchor
    RowMutation r1(T, "com.cnn.www");
    r1.Set("anchor:www.c-span.org", "CNN");
    r1.Delete("anchor:www.abc.com");
    Operation op;
    Apply(&op, &r1);


Locality Groups

• Column families can be assigned to a locality group
    • Used to organize underlying storage representation for performance
    • Scans over one locality group are O(bytes_in_locality_group), not O(bytes_in_table)
• Data in a locality group can be explicitly memory-mapped


Implementation

• Single-master distributed system
• Three major components
    • Library that is linked into every client
    • One master server
        • Assigning tablets to tablet servers
        • Detecting addition and expiration of tablet servers
        • Balancing tablet-server load
        • Garbage collection
        • Metadata operations
    • Many tablet servers
        • Each tablet server handles read and write requests to the tablets it serves
        • Splits tablets that have grown too large


[Figure: Implementation]


Building Blocks – underlying Google infrastructure

• “Chubby” for the following tasks
    • Store the location of the root tablet, schema information, access control lists
    • Synchronize and detect tablet servers
• What is Chubby?
    • Highly available persistent lock service
    • Simple file system with directories and small files
    • Reads and writes to files are atomic
    • When a session ends, the client loses all locks


Chubby

• Namespace that consists of directories and small files
    • Each directory or file can be used as a lock
• Chubby client maintains a session with the Chubby service
    • The session expires if the client is unable to renew its session lease within the expiration time
    • If the session expires, the client loses any locks and open handles
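
The session and lease behaviour described above can be sketched from the client's point of view. This is not the Chubby API: Session and its methods are hypothetical, the lock paths are made up, and time is passed in explicitly (in milliseconds) so the example stays deterministic. Mutual exclusion between clients is deliberately left out.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <string>

// Hypothetical model of one client session with a lock service.
class Session {
public:
    Session(int64_t now_ms, int64_t lease_ms)
        : lease_ms_(lease_ms), expires_at_ms_(now_ms + lease_ms) {}

    // Once the lease has run out, all locks are dropped.
    bool Alive(int64_t now_ms) {
        if (now_ms >= expires_at_ms_) locks_.clear();
        return now_ms < expires_at_ms_;
    }

    // Extends the lease; fails if the session has already expired.
    bool Renew(int64_t now_ms) {
        if (!Alive(now_ms)) return false;
        expires_at_ms_ = now_ms + lease_ms_;
        return true;
    }

    // Records a lock on a file or directory path; fails if expired.
    bool Acquire(const std::string& path, int64_t now_ms) {
        if (!Alive(now_ms)) return false;
        locks_.insert(path);
        return true;
    }

    bool Holds(const std::string& path, int64_t now_ms) {
        return Alive(now_ms) && locks_.count(path) > 0;
    }

private:
    int64_t lease_ms_;
    int64_t expires_at_ms_;
    std::set<std::string> locks_;
};
```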


Building Blocks

• Relies on a lock service called Chubby
    • Ensure there is at most one active master
    • Store bootstrap location of Bigtable data
    • Finalize tablet server death
    • Store column family information
    • Store access control lists


Building Blocks - Continued

• GFS to store log and data files
• SSTable is used internally to store data files
• What is SSTable?
    • Ordered
    • Immutable
    • Mappings from keys to values, both arbitrary byte arrays
    • Optimized for storage in GFS and can be optionally mapped into memory


SSTable

• Operations
    • Look up value for key
    • Iterate over all key/value pairs in specified range
• Sequence of blocks (64 KB)
    • Block index used to locate blocks
• How do we find block by block index?
    • Binary search on in-memory index
    • Or, map complete SSTable into memory
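
The block-index lookup can be sketched with an in-memory stand-in. This is a simplification, not the real file format: blocks are sized by entry count instead of 64 KB, the index stores the last key of each block, and SSTableSketch is an invented name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// One block of sorted key/value pairs.
struct Block {
    std::map<std::string, std::string> entries;
};

// Hypothetical immutable table: a sequence of blocks plus an index.
class SSTableSketch {
public:
    // Builds blocks of at most `per_block` entries from sorted input.
    SSTableSketch(const std::map<std::string, std::string>& sorted,
                  std::size_t per_block) {
        assert(per_block > 0);
        for (const auto& kv : sorted) {
            if (blocks_.empty() ||
                blocks_.back().entries.size() == per_block) {
                blocks_.push_back(Block());
                index_.push_back(kv.first);
            }
            blocks_.back().entries.insert(kv);
            index_.back() = kv.first;  // index holds each block's last key
        }
    }

    // Binary search on the index picks the one block that could hold `key`;
    // only that block is then searched.
    const std::string* Lookup(const std::string& key) const {
        auto it = std::lower_bound(index_.begin(), index_.end(), key);
        if (it == index_.end()) return nullptr;
        const Block& block = blocks_[it - index_.begin()];
        auto found = block.entries.find(key);
        return found == block.entries.end() ? nullptr : &found->second;
    }

    std::size_t BlockCount() const { return blocks_.size(); }

private:
    std::vector<Block> blocks_;
    std::vector<std::string> index_;
};
```

In a disk-backed version the index would stay in memory and the chosen block would be the only one read from storage, which is the point of keeping the index small.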


Memtable

Recently committed updates are stored in memory in a sorted buffer called a memtable; older updates are stored in a sequence of SSTables.


Building Blocks - Continued

• Bigtable depends on the Google cluster management system for the following:
    • Scheduling jobs
    • Managing resources on shared machines
    • Monitoring machine status
    • Dealing with machine failures


Tablets

• Each tablet is assigned to one tablet server
    • A tablet holds a contiguous range of rows
        • Clients can often choose row keys to achieve locality
    • Aim for ~100 MB to 200 MB of data per tablet
• Tablet server is responsible for ~100 tablets
    • Fast recovery:
        • 100 machines each pick up 1 tablet for failed machine
    • Fine-grained load balancing:
        • Migrate tablets away from overloaded machine
        • Master makes load-balancing decisions


Tablet Location

• 3-level hierarchy for location storing
    • One file in Chubby for location of root tablet
    • Root tablet contains location of Metadata tablets
    • Metadata table contains location of user tablets
        • Row key: [Tablet’s Table ID] + [End Row]
• Client library caches tablet locations
    • Moves up the hierarchy if location N/A


How to locate a Tablet?

Given a row, how do clients find the location of the tablet whose row range covers the target row?


• METADATA: key is table id + end row, data is location
• Aggressive caching and prefetching at client side
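
Keying METADATA by table id plus end row means the tablet covering a given row is the first entry at or after that row. The sketch below shows only this last lookup step, not the Chubby file or the root tablet, and leaves out client-side caching. Metadata, MetadataKey, and LocateTablet are hypothetical names, and the end row is assumed to be inclusive.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical METADATA contents: key = (table id, end row of tablet),
// value = location of the tablet server.
using MetadataKey = std::pair<std::string, std::string>;
using Metadata = std::map<MetadataKey, std::string>;

// Finds the tablet whose row range covers `row`: the first entry for the
// table whose end row is >= row. Returns "" if no tablet covers it.
std::string LocateTablet(const Metadata& meta, const std::string& table_id,
                         const std::string& row) {
    auto it = meta.lower_bound(MetadataKey(table_id, row));
    if (it == meta.end() || it->first.first != table_id) return "";
    return it->second;
}
```

Storing the end row rather than the start row is what lets a plain lower_bound answer the question without scanning backwards.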


Tablet Assignment

Master server keeps track of the set of live tablet servers and current assignments of tablets to servers.

When a tablet is unassigned, the master assigns the tablet to a tablet server with sufficient room.

The master uses Chubby to monitor the health of tablet servers, and to restart/replace failed servers.


Tablet Assignment (Chubby)

• Tablet server registers itself by getting a lock in a specific Chubby directory
    • Chubby gives a “lease” on the lock, which must be renewed periodically
    • Server loses lock if it gets disconnected
• Master monitors this directory to find which servers exist/are alive
    • If server not contactable/has lost lock, master grabs lock and reassigns tablets
    • GFS replicates data. Prefer to start tablet server on same machine that the data is already at
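
The master's reaction to a lost lock can be sketched as a reassignment over a simple in-memory view. This is an assumption-heavy illustration: MasterSketch is an invented type, "sufficient room" is approximated by picking the live server with the fewest tablets, and data locality is ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>

// Hypothetical master state: live servers and tablet assignments.
struct MasterSketch {
    std::set<std::string> live_servers;             // servers holding a lock
    std::map<std::string, std::string> assignment;  // tablet -> server

    // Number of tablets currently assigned to `server`.
    std::size_t Load(const std::string& server) const {
        std::size_t n = 0;
        for (const auto& a : assignment) {
            if (a.second == server) ++n;
        }
        return n;
    }

    // Picks the live server with the fewest tablets; "" if none are alive.
    std::string LeastLoaded() const {
        std::string best;
        for (const auto& s : live_servers) {
            if (best.empty() || Load(s) < Load(best)) best = s;
        }
        return best;
    }

    // Called when a server's lock is found to be lost: the server is
    // removed and each of its tablets moves to the least-loaded survivor.
    void HandleLostLock(const std::string& dead) {
        live_servers.erase(dead);
        for (auto& a : assignment) {
            if (a.second == dead) a.second = LeastLoaded();
        }
    }
};
```

Reassigning one tablet at a time spreads a failed server's tablets across the survivors instead of moving them all to one machine.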


Tablets

[Figure: a table divided into tablets holding Rows A - E, Rows F - R, and Rows S - Z]

• As a table grows, it is split into tablets


Tablet Changes

• Tablet created/deleted/merged: initiated by the master
• Tablet split: initiated by the tablet server
    • Server commits the split by recording the new tablet’s info in Metadata
    • Notifies the master


R/W in Tablet

• Server authorizes the sender
    • Reads the list of permitted users from a Chubby file
• Write
    • Valid mutation is written to the commit log, then inserted into the memtable
• Read
    • Executed on a merged view of the SSTables and the memtable
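
The write and read paths can be sketched with plain containers. This is a toy model rather than the tablet server's code: TabletSketch is a made-up type, the commit log is an in-memory vector, sorted maps stand in for SSTables, and authorization, timestamps, and deletions are left out.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

using SortedRun = std::map<std::string, std::string>;

// Hypothetical single-tablet store.
struct TabletSketch {
    std::vector<std::string> commit_log;  // stand-in for the log in GFS
    SortedRun memtable;                   // recent writes, sorted in memory
    std::vector<SortedRun> sstables;      // immutable runs, oldest first

    // Write: append to the log first, then apply to the memtable.
    void Write(const std::string& key, const std::string& value) {
        commit_log.push_back(key + "=" + value);
        memtable[key] = value;
    }

    // Read: merged view. The memtable holds the newest data, followed by
    // the SSTables from newest to oldest.
    const std::string* Read(const std::string& key) const {
        auto m = memtable.find(key);
        if (m != memtable.end()) return &m->second;
        for (auto s = sstables.rbegin(); s != sstables.rend(); ++s) {
            auto f = s->find(key);
            if (f != s->end()) return &f->second;
        }
        return nullptr;
    }
};
```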


[Figure: Tablet Serving]


Compaction

• Minor compaction
    • When the memtable size exceeds a threshold, a new memtable is created
    • The old one is converted to an SSTable and written to GFS
    • Shrinks memory usage and reduces the amount of log to read during recovery
• Merging compaction
    • Reads a few SSTables and the memtable, and writes out a single new SSTable
• Major compaction
    • Rewrites all SSTables into exactly one SSTable
    • Reclaims resources used by deleted data
    • Deleted data disappears (especially important for sensitive data)
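
Minor and major compaction can be sketched on a toy tablet. The model is deliberately simplified: CompactingTablet is an invented type, the threshold counts entries instead of bytes, deletions are represented by a hypothetical marker value kTombstone, and the major compaction here merges only the SSTables.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

using SortedRun = std::map<std::string, std::string>;
const char kTombstone[] = "<deleted>";  // hypothetical deletion marker

struct CompactingTablet {
    SortedRun memtable;
    std::vector<SortedRun> sstables;  // oldest first

    // Minor compaction: once the memtable holds more than `threshold`
    // entries, freeze it as a new SSTable and start a fresh memtable.
    void MaybeMinorCompact(std::size_t threshold) {
        if (memtable.size() <= threshold) return;
        sstables.push_back(memtable);
        memtable.clear();
    }

    // Major compaction: rewrite all SSTables into exactly one. Newer runs
    // overwrite older ones, and deleted entries are dropped for good.
    void MajorCompact() {
        SortedRun merged;
        for (const auto& run : sstables) {
            for (const auto& kv : run) merged[kv.first] = kv.second;
        }
        for (auto it = merged.begin(); it != merged.end();) {
            if (it->second == kTombstone) {
                it = merged.erase(it);
            } else {
                ++it;
            }
        }
        sstables.clear();
        sstables.push_back(merged);
    }
};
```

Until a major compaction runs, a deleted value can still sit in an older SSTable underneath its deletion marker, which is why the major compaction is the step that actually makes deleted data disappear.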


Refinements – Locality Groups

• Client groups multiple column families together
    • A separate SSTable for each LG in a tablet
    • Dividing families not accessed together
        • Example: (language & checksum) vs. (page content)
    • More efficient reads
• Tuning params for each group
    • An LG can be declared to be in memory
    • Useful for small pieces accessed frequently


Refinements – Compression

• Client can compress SSTables for an LG
    • Compression format applied to each SSTable block
        • A small portion of the table can be read without complete decompression
    • Usually two-pass compression
        • Long common strings through a large window
        • Fast search for repetitions in a small window (16 KB)
• Large reduction in space
    • Helped by data layout


Conclusion

• Bigtable has achieved its goals of high performance, data availability and scalability.
    • It has been successfully deployed in real apps (Personalized Search, Orkut, Google Maps, …)
• Significant advantages of building its own storage system: flexibility in designing the data model, and control over the implementation and the other infrastructure on which Bigtable relies.


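Course Evaluation (1)

• Goal: system understanding + communication skills
• Describe one system in 4 pages
• Describe the system with 4 figures (half a page each, Microsoft Visio)
• Describe the figures in the most concise text possible
• Work in groups of 2
• Must not be exactly the same as what I have already seen
    • Otherwise the score will not be higher than 75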


Course Evaluation (2)

• Submit before July 27
• [email protected] & [email protected]
• Email subject: Re: 数据中心计算课程评估 (Data Center Computing course evaluation)