Bigtable
CSE 490h – Introduction to Distributed Computing, Winter 2008
Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 License.
GFS vs Bigtable
GFS provides raw data storage. We need:
- More sophisticated storage
- Flexible enough to be useful
- Store semi-structured data
- Reliable, scalable, etc.
Examples
- URLs: contents, crawl metadata, links, anchors, pagerank, …
- Per-user data: user preference settings, recent queries/search results, …
- Geographic locations: physical entities (shops, restaurants, etc.), roads, satellite image data, user annotations, …
Commercial DB
Why not use a commercial database?
- Not scalable enough
- Too expensive
- We need to perform low-level optimizations
Is MapReduce/BigTable a step backwards?
Schemas are good.
Separation of the schema from the application is good.
High-level access languages are good.
BigTable Features
- Distributed key-value map
- Fault-tolerant, persistent
- Scalable:
  - Thousands of servers
  - Terabytes of in-memory data
  - Petabytes of disk-based data
  - Millions of reads/writes per second, efficient scans
- Self-managing:
  - Servers can be added/removed dynamically
  - Servers adjust to load imbalance
Basic Data Model
Distributed multi-dimensional sparse map:
(row, column, timestamp) → cell contents
[Figure: a row "www.cnn.com" with column "contents:" holding the value "<html>…" at timestamps t3, t11, t17; the three axes of the map are rows, columns, and timestamps.]
• Good match for most of Google applications
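The three-dimensional sparse map above can be sketched with nested ordered maps. This is a toy model for illustration only, not Bigtable's storage format; the helper names `Set` and `Latest` are hypothetical:

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <string>

// Toy model of the Bigtable data model: a sparse, sorted map
//   (row, column, timestamp) -> cell contents.
// Timestamps are ordered newest-first so begin() is the latest version.
using Versions = std::map<uint64_t, std::string, std::greater<uint64_t>>;
using Row = std::map<std::string, Versions>;  // column name -> versions
using Table = std::map<std::string, Row>;     // row key -> row

// Write a cell value at an explicit timestamp.
inline void Set(Table& t, const std::string& row, const std::string& col,
                uint64_t ts, const std::string& value) {
  t[row][col][ts] = value;
}

// Read the most recent version of a cell; empty string if absent.
inline std::string Latest(const Table& t, const std::string& row,
                          const std::string& col) {
  auto r = t.find(row);
  if (r == t.end()) return "";
  auto c = r->second.find(col);
  if (c == r->second.end() || c->second.empty()) return "";
  return c->second.begin()->second;  // newest-first ordering
}
```

Because the outer maps are sorted, rows and columns come back in lexicographic order, which matches the ordering guarantees described on the following slides.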
Rows
- Name is an arbitrary string
- Access to data in a row is atomic
- Row creation is implicit upon storing data
- Rows are ordered lexicographically: rows close together lexicographically usually reside on one or a small number of machines
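Lexicographic ordering is what makes range scans cheap: a scan over a key range touches one contiguous run of rows. A sketch with `std::map` (the `ScanRange` helper is hypothetical, for illustration only):

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Rows sorted lexicographically: a scan over [begin, end) touches one
// contiguous run of the map. This is why nearby keys (e.g. pages under
// the same reversed domain) usually land on the same tablet.
inline std::vector<std::string> ScanRange(
    const std::map<std::string, std::string>& rows,
    const std::string& begin, const std::string& end) {
  std::vector<std::string> keys;
  for (auto it = rows.lower_bound(begin);
       it != rows.end() && it->first < end; ++it) {
    keys.push_back(it->first);
  }
  return keys;
}
```

For example, storing URLs as reversed domains ("com.cnn.www") keeps all of a site's pages adjacent, so `ScanRange(rows, "com.cnn.", "com.cnn.zzz")` visits exactly that site.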
Columns
[Figure: row "www.cnn.com" with column "contents:" = "<html>…", plus anchor columns such as "anchor:cnnsi.com" and "anchor:stanford.edu" holding link text like "CNN" and "CNN home page".]
- Columns have a two-level name structure: family:optional_qualifier
- Column family:
  - Unit of access control
  - Has associated type information
- The qualifier gives unbounded columns: an additional level of indexing, if desired
Column Families
- Must be created before data can be stored
- Small number of column families
- Unbounded number of columns
Timestamps
Used to store different versions of data in a cell. New writes default to the current time, but timestamps for writes can also be set explicitly by clients.
Timestamps
Garbage collection: per-column-family settings tell Bigtable what to GC:
- "Only retain most recent K values in a cell"
- "Keep values until they are older than K seconds"
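The two GC policies can be sketched over a cell's version map. A toy sketch (the helper names are hypothetical, and timestamps are assumed to be in seconds for the age-based policy):

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <iterator>
#include <map>
#include <string>

// One cell's versions, newest timestamp first.
using Versions = std::map<uint64_t, std::string, std::greater<uint64_t>>;

// "Only retain the most recent K values in a cell."
inline void RetainNewestK(Versions& v, size_t k) {
  while (v.size() > k) v.erase(std::prev(v.end()));  // drop the oldest
}

// "Keep values until they are older than K seconds."
// Assumes timestamps are in seconds for illustration.
inline void ExpireOlderThan(Versions& v, uint64_t now, uint64_t k_seconds) {
  for (auto it = v.begin(); it != v.end();) {
    if (now - it->first > k_seconds) it = v.erase(it);
    else ++it;
  }
}
```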
APIs
- Metadata operations: create/delete tables and column families, change metadata
- Writes:
  - Set(): write cells in a row
  - DeleteCells(): delete cells in a row
  - DeleteRow(): delete all cells in a row
- Reads: a Scanner reads arbitrary cells in a bigtable
  - Each row read is atomic
  - Can restrict returned rows to a particular range
  - Can ask for just data from 1 row, all rows, etc.
  - Can ask for all columns, just certain column families, or specific columns
7/10/2013 EECS 584, Fall 2011
API
Create / delete tables and column families
// Open the table
Table *T = OpenOrDie("/bigtable/web/webtable");

// Write a new anchor and delete an old one
RowMutation r1(T, "com.cnn.www");
r1.Set("anchor:www.c-span.org", "CNN");
r1.Delete("anchor:www.abc.com");
Operation op;
Apply(&op, &r1);
Locality Groups
- Column families can be assigned to a locality group
- Used to organize the underlying storage representation for performance: scans over one locality group are O(bytes_in_locality_group), not O(bytes_in_table)
- Data in a locality group can be explicitly memory-mapped
Implementation
Single-master distributed system with three major components:
- A library linked into every client
- One master server, responsible for:
  - Assigning tablets to tablet servers
  - Detecting addition and expiration of tablet servers
  - Balancing tablet-server load
  - Garbage collection
  - Metadata operations
- Many tablet servers:
  - Handle read and write requests to their tablets
  - Split tablets that have grown too large
Implementation
[Figure: Bigtable architecture diagram (not preserved in this transcript).]
Building Blocks – underlying Google infrastructure
Bigtable uses "Chubby" for the following tasks:
- Store the root tablet location, schema information, and access control lists
- Synchronize and detect tablet servers

What is Chubby?
- Highly available, persistent lock service
- Simple file system with directories and small files
- Reads and writes to files are atomic
- When a session ends, clients lose all their locks
Chubby
- Namespace that consists of directories and small files; each directory or file can be used as a lock
- A Chubby client maintains a session with the Chubby service
- The session expires if the client is unable to renew its session lease within the expiration time
- If the session expires, the client loses any locks and open handles
Building Blocks
Bigtable relies on the lock service called Chubby to:
- Ensure there is at most one active master
- Store the bootstrap location of Bigtable data
- Finalize tablet server death
- Store column family information
- Store access control lists
Building Blocks - Continued
- GFS is used to store log and data files
- SSTable is used internally to store data files

What is an SSTable?
- Ordered, immutable
- Mappings from keys to values, where both are arbitrary byte arrays
- Optimized for storage in GFS, and can be optionally mapped into memory
SSTable
Operations:
- Look up the value for a key
- Iterate over all key/value pairs in a specified range

Structure:
- Sequence of blocks (64 KB)
- A block index is used to locate blocks

How do we find a block via the block index?
- Binary search on the in-memory index
- Or, map the complete SSTable into memory
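The index lookup can be sketched as a binary search over the first key of each block. A toy block index, for illustration only (the struct and helper names are hypothetical, not the real SSTable file format):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Toy SSTable block index: one entry per ~64 KB block, holding the
// smallest key stored in that block. Finding the block that may contain
// `key` is a binary search over this in-memory index.
struct BlockIndexEntry {
  std::string first_key;  // smallest key in the block
  uint64_t offset;        // block's byte offset in the file
};

// Returns the position of the block whose key range may contain `key`,
// or -1 if `key` sorts before every block.
inline int FindBlock(const std::vector<BlockIndexEntry>& index,
                     const std::string& key) {
  auto it = std::upper_bound(
      index.begin(), index.end(), key,
      [](const std::string& k, const BlockIndexEntry& e) {
        return k < e.first_key;
      });
  if (it == index.begin()) return -1;  // key precedes the first block
  return static_cast<int>(std::distance(index.begin(), it)) - 1;
}
```

Only the chosen block then needs to be read (or decompressed) from disk; the index itself stays in memory.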
Memtable
Of these updates, the recently committed ones are stored in memory in a sorted buffer called a memtable; the older updates are stored in a sequence of SSTables.
Building Blocks - Continued
Bigtable depends on the Google cluster management system for:
- Scheduling jobs
- Managing resources on shared machines
- Monitoring machine status
- Dealing with machine failures
Tablets
- Each tablet is assigned to one tablet server
- A tablet holds a contiguous range of rows
  - Clients can often choose row keys to achieve locality
  - Aim for ~100 MB to 200 MB of data per tablet
- A tablet server is responsible for ~100 tablets
- Fast recovery: 100 machines each pick up 1 tablet from a failed machine
- Fine-grained load balancing:
  - Migrate tablets away from overloaded machines
  - The master makes load-balancing decisions
Tablet Location
3-level hierarchy for storing tablet locations:
- One file in Chubby holds the location of the root tablet
- The root tablet contains the locations of METADATA tablets
- The METADATA table contains the locations of user tablets
  - Row key: [tablet's table ID] + [end row]
- The client library caches tablet locations, and moves up the hierarchy if a location is unavailable
How to locate a Tablet?
Given a row, how do clients find the location of the tablet whose row range covers the target row?
METADATA: key = table ID + end row, data = location. Aggressive caching and prefetching on the client side.
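Keying METADATA rows by a tablet's END row is what makes the lookup a single ordered-map search: the tablet covering a target row is the first entry whose key is not smaller than (table ID, row). A toy sketch (the '\0' separator and helper name are assumptions for illustration; the real key encoding differs):

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy METADATA table: key = table id + '\0' + tablet's end row,
// value = tablet location.
using Metadata = std::map<std::string, std::string>;

// Because each tablet is keyed by its END row, the tablet covering
// `row` is the first entry whose key is >= (table_id, row).
inline std::string LocateTablet(const Metadata& md,
                                const std::string& table_id,
                                const std::string& row) {
  auto it = md.lower_bound(table_id + '\0' + row);
  if (it == md.end()) return "";  // no covering tablet
  return it->second;
}
```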
Tablet Assignment
The master keeps track of the set of live tablet servers and the current assignment of tablets to servers.
When a tablet is unassigned, the master assigns it to a tablet server with sufficient room.
The master uses Chubby to monitor the health of tablet servers, and to restart/replace failed servers.
Tablet Assignment (Chubby)
- A tablet server registers itself by acquiring a lock in a specific Chubby directory
  - Chubby gives a "lease" on the lock, which must be renewed periodically
  - The server loses the lock if it gets disconnected
- The master monitors this directory to find which servers exist/are alive
  - If a server is not contactable/has lost its lock, the master grabs the lock and reassigns its tablets
- GFS replicates the data; prefer to start a tablet server on the same machine where the data already resides
Tablets
As a table grows, it is split into tablets, e.g. rows A–E, rows F–R, rows S–Z.
Tablet Changes
- Tablet created/deleted/merged: initiated by the master
- Tablet split: initiated by the tablet server
  - The server commits the split by recording the new tablet's info in METADATA
  - It then notifies the master
R/W in Tablet
- The server authorizes the sender by reading a list of permitted users from a Chubby file
- Write: a valid mutation is written to the commit log, then applied to the memtable
- Read: executed on a merged view of the SSTables and the memtable
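The merged-view read can be sketched as: check the memtable first (it holds the newest committed writes), then fall back to the SSTables from newest to oldest. A toy sketch with ordinary maps (illustrative only; real reads merge sorted iterators and handle deletion markers):

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

using KV = std::map<std::string, std::string>;

// Toy merged read: the memtable holds the newest committed writes;
// older data lives in a stack of immutable SSTables (oldest first in
// the vector). A read checks the memtable, then SSTables newest-first.
inline std::string MergedRead(const KV& memtable,
                              const std::vector<KV>& sstables,
                              const std::string& key) {
  auto m = memtable.find(key);
  if (m != memtable.end()) return m->second;
  for (auto it = sstables.rbegin(); it != sstables.rend(); ++it) {
    auto s = it->find(key);
    if (s != it->end()) return s->second;
  }
  return "";  // not found anywhere
}
```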
Tablet Serving
Compaction:
- Minor compaction (when memtable size > threshold):
  - Start a new memtable
  - The old one is converted to an SSTable and written to GFS
  - Shrinks memory usage and reduces the log length needed for recovery
- Merging compaction: reads and merges a few SSTables and the memtable into one new SSTable
- Major compaction:
  - Rewrites all SSTables into exactly one SSTable
  - Reclaims resources for deleted data; deleted data disappears (important for sensitive data)
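The minor-compaction step can be sketched as: freeze the full memtable, turn it into a new SSTable, and start a fresh memtable. A toy sketch (hypothetical helper names; an in-memory vector stands in for writing an SSTable file to GFS):

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

using KV = std::map<std::string, std::string>;

// Sketch of a minor compaction: the frozen memtable becomes a new
// immutable SSTable (appended newest-last), and a fresh, empty
// memtable takes over new writes.
inline void MinorCompaction(KV& memtable, std::vector<KV>& sstables) {
  sstables.push_back(memtable);  // "write" the SSTable
  memtable.clear();              // new, empty memtable
}

// Trigger the compaction when the memtable crosses a size threshold.
inline void MaybeCompact(KV& memtable, std::vector<KV>& sstables,
                         size_t threshold) {
  if (memtable.size() > threshold) MinorCompaction(memtable, sstables);
}
```

Merging and major compactions then reduce the number of SSTables this vector accumulates, so reads touch fewer files.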
Refinements – Locality Groups
- The client can group multiple column families together into a locality group
- A separate SSTable is generated for each locality group in a tablet
- Separate families that are not accessed together
  - Example: (language & checksum) vs. (page contents)
- More efficient reads; tuning parameters per group
- A locality group can be declared in-memory: useful for small pieces of data accessed frequently
Refinements – Compression
- The client can choose to compress the SSTables for a locality group
- The compression format is applied to each SSTable block separately, so a small portion of the table can be read without decompressing the whole file
- Usually a two-pass compression scheme:
  - First pass: long common strings across a large window
  - Second pass: fast search for repetitions in a small (16 KB) window
- Achieves large reductions, helped by the data layout
Conclusion
Bigtable has achieved its goals of high performance, data availability, and scalability. It has been successfully deployed in real applications (Personalized Search, Orkut, Google Maps, …).
Building their own storage system gave significant advantages: flexibility in designing the data model, and control over the implementation and the other infrastructure on which Bigtable relies.
Course assessment (1). Goal: systematic understanding + presentation skills
- 4 pages describing a system
- 4 figures (half a page each, in Microsoft Visio) describing the system
- Describe the figures in the most concise prose possible
- Work in groups of 2
- Must not be identical to anything I have already seen
- Otherwise the grade will not exceed 75