Google's Database

download Google's Database

of 51

Transcript of Google's Database

  • 7/29/2019 Google's Database

    1/51

    Bigtable: A Distributed Storage System forStructured Data

    Presenter: Guangdong LiuJan 24 th , 2012

  • 7/29/2019 Google's Database

    2/51

    Dec 8 th , 2011Dec 8 th , 2011

    Googles Motivation: Scale!

    Consider Google Lots of data

    Copies of the web, satellite data, user data,email and USENET, Subversion backing store

    Millions of machines Different projects/applications Hundreds of millions of users

    Many incoming requests

  • 7/29/2019 Google's Database

    3/51

    Dec 8 th , 2011Dec 8 th , 2011

    Why not A DBMS?

    Few DBMSs support the requisite scale Required DB with wide scalability, wide

    applicability, high performance and highavailability

    Couldnt afford it if there was one Most DBMSs require very expensive

    infrastructure DBMSs provide more than Google needs

    E.g., full transactions, SQL Google has highly optimized lower-level systems

    that could be exploited

    GFS, Chubby, MapReduce, Job scheduling

  • 7/29/2019 Google's Database

    4/51

    Dec 8 th , 2011Dec 8 th , 2011

    What is a Bigtable?

    A BigTable is a sparse, distributed,persistent multidimensional sorted map. Themap is indexed by a row key, a column key,and a timestamp; each value in the map is anuninterpreted array of bytes.

  • 7/29/2019 Google's Database

    5/51

    Dec 8 th , 2011Dec 8 th , 2011

    Relative to a DBMS, BigTable provides Simplified data retrieval mechanism

    A map -> string

    No relational operators Atomic updates only possible at row level Arbitrary number of columns per row Arbitrary data type for each column Designed for Googles application set Provides extremely large scale (data,

    throughput) at extremely small cost

  • 7/29/2019 Google's Database

    6/51

    Dec 8 th , 2011Dec 8 th , 2011

    Data model: Row

    Row keys are arbitrary strings Row is the unit of transactional consistency

    Every read or write of data under a single rowis atomic

    Data is maintained in lexicographic order by rowkey

    Rows with consecutive keys (Row Range) aregrouped together as tablets . Unit of distribution and load-balancing reads of short row ranges are efficient and

    typically require communication with only asmall number of machines

  • 7/29/2019 Google's Database

    7/51

    Dec 8 th , 2011Dec 8 th , 2011

    Data model: Column

    Column keys are grouped into sets called column families , which form the unit of access control.

    Data stored under a column family is usually of thesame type

    A column family must be created before data canbe stored in a column key

    After a family has been created, any column keywithin the family can be used for queries

    Column key is named using the following syntax:family :qualifier

    Access control and disk/memory accounting areperformed at column family level

  • 7/29/2019 Google's Database

    8/51

    Dec 8 th , 2011Dec 8 th , 2011

    Data model: timestamps

    Each cell in Bigtable can contain multiple versionsof data, each indexed by timestamp Timestamps are 64-bit integers Assigned by:

    Bigtable: real-time in microseconds Client application: when unique timestamps are

    a necessity Data is stored in decreasing timestamp order, so

    that most recent data is easily accessed Application specifies how many versions (n) of data items

    are maintained in a cell Bigtable garbage collects obsolete versions

  • 7/29/2019 Google's Database

    9/51

    Dec 8 th , 2011Dec 8 th , 2011

    Data Model

    Example: Zoo

  • 7/29/2019 Google's Database

    10/51

    Dec 8 th , 2011Dec 8 th , 2011

    Data Model

    Example: Zoo

    row key col. key timestamp

  • 7/29/2019 Google's Database

    11/51

    Dec 8 th , 2011Dec 8 th , 2011

    Data Model

    Example: Zoo

    row key col. key timestamp

    - (zebras, length, 2006) --> 7 ft - (zebras, weight, 2007) --> 600 lbs- (zebras, weight, 2006) --> 620 lbs

  • 7/29/2019 Google's Database

    12/51

    Dec 8 th , 2011Dec 8 th , 2011

    Data Model

    Example: Zoo

    row key col. key timestamp

    - (zebras, length, 2006) --> 7 ft - (zebras, weight, 2007) --> 600 lbs- (zebras, weight, 2006) --> 620 lbs

    Each key is sorted inLexicographic order

  • 7/29/2019 Google's Database

    13/51

    Dec 8 th , 2011Dec 8 th , 2011

    Data Model

    Example: Zoo

    row key col. key timestamp

    - (zebras, length, 2006) --> 7 ft - (zebras, weight, 2007) --> 600 lbs- (zebras, weight, 2006) --> 620 lbs

    Timestamp orderingis defined as mostrecent appearsfirst

  • 7/29/2019 Google's Database

    14/51

    Dec 8 th , 2011Dec 8 th , 2011

    Data Model

    Example: Web Indexing

  • 7/29/2019 Google's Database

    15/51

    Dec 8 th , 2011Dec 8 th , 2011

    Data Model

  • 7/29/2019 Google's Database

    16/51

    Dec 8 th , 2011Dec 8 th , 2011

    Data Model

    Row

  • 7/29/2019 Google's Database

    17/51

    Dec 8 th , 2011Dec 8 th , 2011

    Data Model

    Columns

  • 7/29/2019 Google's Database

    18/51

    Dec 8 th , 2011Dec 8 th , 2011

    Data Model

    Cells

  • 7/29/2019 Google's Database

    19/51

    Dec 8 th , 2011Dec 8 th , 2011

    Data Model

    timestamps

  • 7/29/2019 Google's Database

    20/51

    Dec 8 th , 2011Dec 8 th , 2011

    Data Model

    Column family

  • 7/29/2019 Google's Database

    21/51

    Dec 8 th , 2011Dec 8 th , 2011

    Data Model

    Column famil y

    family: qualifier

  • 7/29/2019 Google's Database

    22/51

    Dec 8 th , 2011Dec 8 th , 2011

    Data Model

    Column family

    family: qualifier

  • 7/29/2019 Google's Database

    23/51

    Dec 8 th , 2011Dec 8 th , 2011

    Client APIs

    Bigtable APIs provide functions for: Creating/deleting tables, column families Changing cluster , table and column family

    metadata such as access control rights Support of single row transactions Allowing cells to be used as integer counters Executing client supplied scripts in the address

    space of servers

  • 7/29/2019 Google's Database

    24/51

    Dec 8 th , 2011Dec 8 th , 2011

    Building Blocks

    Bigtable uses the distributed Google FileSystem (GFS) to store log and data files

    The Google SSTable file format is usedinternally to store Bigtable data

    An SSTable provides a persistent , orderedimmutable map from keys to values Operations are provided to look up the value

    associated with a specified key, and to iterate overall key/value pairs in a specified key range

  • 7/29/2019 Google's Database

    25/51

    Dec 8 th , 2011Dec 8 th , 2011

    Building Blocks

    Bigtable relies on a highly-available andpersistent distributed lock service calledChubby

    Chubby provides a namespace that consistsof directories and small files. Eachdirectory or file can be used as a lock Consists of 5 active replicas, one replica is the

    master and serves requests Service is functional when majority of thereplicas are running and in communication withone another when there is a quorum

  • 7/29/2019 Google's Database

    26/51

    Dec 8 th , 2011Dec 8 th , 2011

    Building Blocks

    Shared pool of machines that also run other distributed applications

  • 7/29/2019 Google's Database

    27/51

    Dec 8 th , 2011Dec 8 th , 2011

    SSTable

    An SSTable provides a persistent , orderedimmutable map from keys to values

    Each SSTable contains a sequence of blocks A block index (stored at the end of SSTable)

    is used to locate blocks Each SSTable contains a sequence of blocks A block index (stored at the end of SSTable)

    is used to locate blocks The index is loaded into memory when the

    SSTable is open One disk access to get the block

  • 7/29/2019 Google's Database

    28/51

    Dec 8 th , 2011Dec 8 th , 2011

    SSTable

    64KBBlock

    64KBBlock

    64KBBlock

    Index (block ranges)

  • 7/29/2019 Google's Database

    29/51

    Dec 8 th , 2011Dec 8 th , 2011

    Tablet

    Contains some range of rows of the table Built out of multiple SSTables

    Index

    64Kblock

    64Kblock

    64Kblock

    SSTable

    Index

    64Kblock

    64Kblock

    64Kblock

    SSTable

    Tablet Start :aardvark End:apple

  • 7/29/2019 Google's Database

    30/51

    Dec 8 th , 2011Dec 8 th , 2011

    Table

    Multiple tablets make up the table SSTables can be shared Tablets do not overlap, SSTables can overlap

    SSTable SSTable SSTable SSTable

    Tabletaardvark apple

    Tabletapple boat

  • 7/29/2019 Google's Database

    31/51

    Dec 8 th , 2011Dec 8 th , 2011

    Organization

    A Bigtable cluster stores tables Each table consists of tablets Initially each table consists of one tablet As a table grows it is automatically split into

    multiple tablets Tablets are assigned to tablet servers

    Multiple tablets per server. Each tablet is 100-200 MB

    Each tablet lives at only one server

  • 7/29/2019 Google's Database

    32/51

    Dec 8 th , 2011Dec 8 th , 2011

    Implementation: Three major components

    A library that is linked into every client One master server

    Assigning tablets to tablet servers Detecting the addition and deletion of tablet servers

    Balancing tablet-server load Garbage collection of files in GFS

    Many tablet servers Tablet servers manage tablets Tablet server splits tablets that get too big

    Client communicates directly with tablet serverfor reads/writes.

  • 7/29/2019 Google's Database

    33/51

    Dec 8 th , 2011Dec 8 th , 2011

    Tablet Location

    Since tablets move around from server toserver, given a row, how do clients find theright machine? Need to find a tablet whose row range covers

    the target row

  • 7/29/2019 Google's Database

    34/51

    Dec 8 th , 2011Dec 8 th , 2011

    Tablet Location

    A 3-level hierarchy analogous to that of a B+-tree to store tablet location information : A file stored in chubby contains location of the root

    tablet Root tablet contains location of Metadata tablets

    The root tablet never splits Each meta-data tablet contains the locations of a set of

    user tablets

    Client reads the Chubby file that points to the roottablet This starts the location process

    Client library caches tablet locations

    Moves up the hierarchy if location N/A

  • 7/29/2019 Google's Database

    35/51

    Dec 8 th , 2011Dec 8 th , 2011

    Tablet AssignmentBigTable

    BigTable Master

    Performs metadata opsand load balancing

    BigTable Tablet Server BigTable Tablet Server

    Serves data Serves data

    Cluster scheduling system GFS Chubby

    Holds tabletdata, logs

    Holds metadata, handlesmaster election

    Handles failover,monitoring

    BigTable Client

    BigTable ClientLibrary

  • 7/29/2019 Google's Database

    36/51

    Dec 8 th , 2011Dec 8 th , 2011

    Tablet Server

    When a tablet server starts, it creates andacquires exclusive lock on, a uniquely-namedfile in a specific Chubby directory Call this servers directory

    A tablet server stops serving its tablets ifit loses its exclusive lock This may happen if there is a network

    connection failure that causes the tablet serverto lose its Chubby session

  • 7/29/2019 Google's Database

    37/51

    Dec 8 th , 2011Dec 8 th , 2011

    Tablet Server

    A tablet server will attempt to reacquirean exclusive lock on its file as long as thefile still exists

    If the file no longer exists then the tabletserver will never be able to serve again Kills itself At some point it can restart; it goes to a pool of

    unassigned tablet servers

  • 7/29/2019 Google's Database

    38/51

    Dec 8 th , 2011Dec 8 th , 2011

    Master Startup Operation

    Upon start up the master needs to discoverthe current tablet assignment. Grabs unique master lock in Chubby

    Prevents concurrent master instantiations Scans servers directory in Chubby for live servers Communicates with every live tablet server

    Discover all tablets Scans METADATA table to learn the set of

    tablets Unassigned tablets are marked for assignment

  • 7/29/2019 Google's Database

    39/51

    Dec 8 th , 2011Dec 8 th , 2011

    Master Operation

    Detect tablet server failures/resumption Master periodically asks each tablet server

    for the status of its lock

  • 7/29/2019 Google's Database

    40/51

    Dec 8 th , 2011Dec 8 th , 2011

    Master Operation

    Tablet server lost its lock or mastercannot contact tablet server: Master attempts to acquire exclusive lock on

    the servers file in the servers directory

    If master acquires the lock then the tabletsassigned to the tablet server are assigned toothers

    Master deletes the servers file in the

    servers directory Assignment of tablets should be balanced If master loses its Chubby session then it

    kills itself An election can take place to find a new master

  • 7/29/2019 Google's Database

    41/51

    Dec 8 th , 2011Dec 8 th , 2011

    Tablet Server Failure

    Tablet server

    GFS Chunkserver

    SSTable SSTable SSTable

    Tablet Tablet Tablet

    Tablet server

    GFS Chunkserver

    SSTable

    (replica)SSTable

    SSTable

    Tablet Tablet Tablet

    (replica)SSTable

    Logicalview:

    Physicallayout:

    SSTable

    Chubby Server Master

  • 7/29/2019 Google's Database

    42/51

    Dec 8 th , 2011Dec 8 th , 2011

    Tablet Server Failure

    Chubby Server

    Tablet server

    GFS Chunkserver

    SSTable SSTable SSTable

    Tablet Tablet Tablet

    Tablet server

    GFS Chunkserver

    SSTable

    (replica)SSTable

    SSTable

    Tablet Tablet Tablet

    (replica)SSTable

    Logicalview:

    Physicallayout: SSTable

    X

    X X X X

    Master

  • 7/29/2019 Google's Database

    43/51

    Dec 8 th , 2011Dec 8 th , 2011

    Tablet Server Failure

    Tablet server

    GFS Chunkserver

    SSTable SSTable SSTable

    Tablet Tablet Tablet

    (replica)SSTable

    Logicalview:

    Physicallayout:

    Tablet

    (other tablet serversdrafted to serve other abandoned tablets )

    Backup copy of tabletmade primary

    Message sent to tabletserver by master

    Extra replica of tablet created automatically by GFS

    Chubby Server Master

  • 7/29/2019 Google's Database

    44/51

    Dec 8 th , 2011Dec 8 th , 2011

    Tablet Serving

    Commit log stores the updates that aremade to the data

    Recent updates are stored in memtable

    Older updates are stored in SStable files

  • 7/29/2019 Google's Database

    45/51

    Dec 8 th , 2011Dec 8 th , 2011

    Tablet Serving

    Recovery process Reads/Writes that arrive at tablet server

    o Is the request Well-formed?o Authorization: Chubby holds the permission

    fileo If a mutation occurs it is wrote to commit log

    and finally a group commit is used

  • 7/29/2019 Google's Database

    46/51

    Dec 8 th , 2011Dec 8 th , 2011

    Tablet Serving

    Tablet recovery processo Read metadata containing SSTABLES and redo

    points

    o Redo points are pointers into any commit logso Apply redo points

  • 7/29/2019 Google's Database

    47/51

    Dec 8 th , 2011Dec 8 th , 2011

    Compactions

    Minor compaction convert the memtableinto an SSTable Reduce memory usage Reduce log traffic on restart

    Merging compaction Reduce number of SSTables Good place to apply policy keep only N

    versions

    Major compaction Merging compaction that results in only oneSSTable

    No deletion records, only live data

  • 7/29/2019 Google's Database

    48/51

    Dec 8 th , 2011Dec 8 th , 2011

    Refinements

    Locality groups Clients can group multiple column families

    together into a locality group. Compression

    Compression applied to each SSTable blockseparately

    Uses Bentley and McIlroy's scheme and fast compression algorithm

    Caching for read performance Uses Scan Cache and Block Cache

    Bloom filters Reduce the number of disk accesses

  • 7/29/2019 Google's Database

    49/51

    Dec 8 th , 2011Dec 8 th , 2011

    Refinements

    Commit-log implementation Suppose one log per tablet rather have one logper tablet server

    Exploiting SSTable immutability No need to synchronize accesses to file system

    when reading SSTables Concurrency control over rows efficient Deletes work like garbage collection on

    removing obsolete SSTables Enables quick tablet split: parent SSTables

    used by children

  • 7/29/2019 Google's Database

    50/51

    Dec 8 th , 2011Dec 8 th , 2011

    Source

    Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C.Hsieh, Deborah A. Wallach, Mike Burrows, TusharChandra, Andrew Fikes, and Robert E. Gruber.Bigtable: A Distributed Storage System forStructured Data. In 7 th OSDI (2006)

    http://www.csd.uwo.ca/courses/CS9842/Lectures /google- bigtable .ppt

    http://www.csd.uwo.ca/courses/CS9842/Lectures/google-bigtable.ppthttp://www.csd.uwo.ca/courses/CS9842/Lectures/google-bigtable.ppthttp://www.csd.uwo.ca/courses/CS9842/Lectures/google-bigtable.ppthttp://www.csd.uwo.ca/courses/CS9842/Lectures/google-bigtable.ppthttp://www.csd.uwo.ca/courses/CS9842/Lectures/google-bigtable.ppthttp://www.csd.uwo.ca/courses/CS9842/Lectures/google-bigtable.ppthttp://www.csd.uwo.ca/courses/CS9842/Lectures/google-bigtable.ppthttp://www.csd.uwo.ca/courses/CS9842/Lectures/google-bigtable.ppthttp://www.csd.uwo.ca/courses/CS9842/Lectures/google-bigtable.ppthttp://www.csd.uwo.ca/courses/CS9842/Lectures/google-bigtable.ppthttp://www.csd.uwo.ca/courses/CS9842/Lectures/google-bigtable.ppthttp://www.csd.uwo.ca/courses/CS9842/Lectures/google-bigtable.ppthttp://www.csd.uwo.ca/courses/CS9842/Lectures/google-bigtable.ppthttp://www.csd.uwo.ca/courses/CS9842/Lectures/google-bigtable.ppthttp://www.csd.uwo.ca/courses/CS9842/Lectures/google-bigtable.ppt
  • 7/29/2019 Google's Database

    51/51

    Dec 8 th , 2011

    2007 The Board of Regents of the University of Nebraska All rights reserved

    Thanks