Accumulo Extensions to Google's Bigtable Design - Community Central
Accumulo design
Uploaded by scsorensen
Category: Data & Analytics
Transcript of Accumulo design
APACHE ACCUMULO
From a design perspective
SCALABLE KEY-VALUE STORE BASED ON GOOGLE'S BIGTABLE
BIGTABLE FEATURES
• Distributes data across many commodity servers
• Sorts data by key for fast lookup of values by key
• Scan across multiple key value pairs
• Highly consistent writes to single row
• Support for MapReduce jobs
DATA MODEL
Key: Row ID, Column (Family, Qualifier), Timestamp
Value
Row ID  Col Fam    Col Qual    Timestamp  Value
Bob     Email      id0023      20120301   Hey joe, can you send ...
Bob     Email      id0024      20120302   Re: next Thursday ...
Bob     UserPrefs  Background  20130101   Grey
Fred    Email      id0001      20080302   Welcome to gmail ...
Sarah   Email      id0004      20130201   Hi again ...
Sara    Videos     ytid009     20100303   nsu736:)jdudjdk$:)378;'$$)
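The data model above can be sketched as a sorted map from a composite key to a value. This is a minimal in-memory illustration, not the Accumulo API: the class name SortedTableSketch, the helper methods, and the NUL separator are all assumptions made for the example, and it sorts timestamps ascending for simplicity, whereas Accumulo sorts timestamps in descending order so the newest version comes first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical in-memory model of a sorted key-value table; not the Accumulo API.
class SortedTableSketch {
    // Keys are kept in sorted order, as in a Bigtable-style store.
    private final TreeMap<String, String> cells = new TreeMap<>();

    // Compose row ID, column family, column qualifier and timestamp into one sortable key.
    // The NUL separator is an assumption so that a row ID never runs into the next field.
    static String key(String row, String family, String qualifier, String timestamp) {
        return String.join("\u0000", row, family, qualifier, timestamp);
    }

    void put(String row, String family, String qualifier, String timestamp, String value) {
        cells.put(key(row, family, qualifier, timestamp), value);
    }

    // Return every value in one row. Because keys are sorted, a row is a contiguous range.
    List<String> scanRow(String row) {
        String start = row + "\u0000";
        String end = row + "\u0001"; // sorts after every key that begins with row + NUL
        return new ArrayList<>(cells.subMap(start, true, end, false).values());
    }
}
```

Scanning row "Bob" in this sketch walks one contiguous slice of the map, which is the property that makes lookups and range scans by key cheap.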
[Diagram labels: Tablet servers, HDFS DataNodes, Commit Layer, Replication Layer]
SINCE 2006
• Several BigTable implementations
• Apache HBase
• Apache Cassandra
• Apache Accumulo
• others …
BIGTABLE IS BIGTABLE, RIGHT?
HBASE
• Open source Apache project started by developers at Powerset (later bought by Microsoft)
• Now used at Facebook, StumbleUpon, other big web sites
• Fast reads
• Row-oriented API
• Each column family has its own set of files
CASSANDRA
• Apache project started at Facebook
• Combines elements of BigTable and Amazon's Dynamo into one system
• Used at Netflix, other web sites
• Fast writes
• Tunable consistency
[Diagram labels: Tablet servers, Commit and Replication Layer]
CONSISTENCY
• Highly consistent means: writes in one place
• Eventually consistent: writes in more than one place
• Writes in more than one place: network partition tolerance
• Partition tolerance: geographically distributed servers
• *Google uses Spanner to synchronize multiple databases
[Diagrams: Tablet servers, Data Center A, Data Center B]
OVERVIEW
• Both (HBase and Cassandra) are highly scalable
• Used to build web applications that can serve millions of users at once
• Serves as a low-latency persistence layer for real time service of requests
• Available in single data center or cross data center options
USE CASE
• Most data comes from users
• Schema defined by the application
• Data builds up over time
[Diagram labels: Many Users, Web application, Db]
ACCUMULO
• Can support the web application use-case
• But what are those other extra features for?
ACCUMULO ‘EXTRAS’
• Dynamic Column Families
• Column Visibility
• Key-value oriented API
• Iterators
• Batch Scanners
BIG ORGANIZATIONS
• Missions other than internet services
• Various disparate operational systems that generate data
• Desire to look across and analyze that data
• Desire to deliver results to their own population
USE CASE IS DISCOVERING AND ANALYZING ALL DATA
ISSUES
• Scale
• Unknown or multiple schemas
• Support for analysis without data movement
• Varying levels of sensitivity in the same system
• Support a high number of low-latency user requests
[Diagram labels: Many Users, Analyze, Db, Data sets]
SCALE?
CHECK (IT’S BIGTABLE)
NO CONTROL OVER THE SCHEMA, OR MANY DIFFERENT SCHEMAS?
MAP EXISTING FIELDS TO COLUMNS DYNAMICALLY
INCLUDING COLUMN FAMILIES
VARYING LEVELS OF DATA SENSITIVITY?
COLUMN VISIBILITY
DATA MODEL
Key: Row ID, Column (Family, Qualifier, Visibility), Timestamp
Value
Row ID  Col Fam    Col Qual    Col Vis         Timestamp  Value
Bob     Email      id0023      personal comms  20120301   Hey joe, can you send ...
Bob     Email      id0024      personal comms  20120302   Re: next Thursday ...
Bob     UserPrefs  Background  prefs           20130101   Grey
Fred    Email      id0001      personal comms  20080302   Welcome to gmail ...
Sarah   Email      id0004      personal comms  20130201   Hi again ...
Sara    Videos     ytid009     public post     20100303   nsu736:)jdudjdk$:)378;'$$)
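Column visibility can be pictured as a per-cell label that is checked against the reader's authorizations at scan time. The sketch below is a simplification under stated assumptions, not the Accumulo API: each cell carries a single label (treating "personal comms", "prefs" and "public post" from the table as whole labels), whereas Accumulo visibilities are boolean expressions that combine labels with & and |. The names VisibilitySketch, Cell and visibleValues are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of cell-level visibility filtering; not the Accumulo API.
class VisibilitySketch {
    // One cell: a row ID, a single visibility label, and a value.
    static final class Cell {
        final String row;
        final String visibility;
        final String value;

        Cell(String row, String visibility, String value) {
            this.row = row;
            this.visibility = visibility;
            this.value = value;
        }
    }

    // Return only the values whose label is among the reader's authorizations.
    // Cells the reader may not see are skipped, not reported as errors.
    static List<String> visibleValues(List<Cell> cells, Set<String> authorizations) {
        List<String> out = new ArrayList<>();
        for (Cell c : cells) {
            if (authorizations.contains(c.visibility)) {
                out.add(c.value);
            }
        }
        return out;
    }
}
```

A reader holding only the "public post" authorization would see Sara's video cell and none of the email cells, even though all of them sit in the same table.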
DATA OF VARYING SENSITIVITY LEVELS CAN BE PHYSICALLY CO-LOCATED
FRAMEWORKS LIKE HADOOP MAPREDUCE LOVE IT WHEN DATA IS ALL TOGETHER
LOOK ACROSS DATASETS?
SECONDARY INDICES
• Application-created data: known
• Pre-existing data: unknown
DATA DISCOVERY!
SECONDARY INDICES
RowID     Col Qual  Value
RID00001  age       54
RID00001  name      bob
RID00002  name      fred
RID00003  age       43
RID00003  height    5’9”
RID00003  name      harry
RID00004  name      carl
RID00005  name      evan

RowID  Col Fam  Col Qual
43     age      RID00003
54     age      RID00001
5’9”   height   RID00003
bob    name     RID00001
carl   name     RID00004
evan   name     RID00005
fred   name     RID00002
harry  name     RID00003
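The second table is the first one inverted: each value becomes the row ID, the field name becomes the column family, and the original row ID becomes the column qualifier. A minimal sketch of that inversion, using plain Java collections in place of real tables; SecondaryIndexSketch, buildIndex, lookup and the NUL separator are hypothetical names and choices made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch of a value-to-row-ID index; not the Accumulo API.
class SecondaryIndexSketch {
    private static final String SEP = "\u0000";

    // records maps rowId -> (field -> value). Each index entry is value, field, rowId,
    // joined into one string so that the sorted set orders entries by value first.
    static TreeSet<String> buildIndex(Map<String, Map<String, String>> records) {
        TreeSet<String> index = new TreeSet<>();
        for (Map.Entry<String, Map<String, String>> record : records.entrySet()) {
            for (Map.Entry<String, String> field : record.getValue().entrySet()) {
                index.add(field.getValue() + SEP + field.getKey() + SEP + record.getKey());
            }
        }
        return index;
    }

    // Find the row IDs whose given field has the given value, via a prefix scan of the index.
    static List<String> lookup(TreeSet<String> index, String field, String value) {
        String prefix = value + SEP + field + SEP;
        List<String> rowIds = new ArrayList<>();
        for (String entry : index.tailSet(prefix, true)) {
            if (!entry.startsWith(prefix)) {
                break; // sorted order: no later entry can match
            }
            rowIds.add(entry.substring(prefix.length()));
        }
        return rowIds;
    }
}
```

Because the index is itself a sorted table, finding every record with a given value is the same kind of range scan as reading a row in the original table.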
PARTIAL ROW SCANS
BATCH SCANNERS
RowID     Col Qual  Value
RID00001  age       54
RID00001  name      bob
RID00002  name      fred
RID00003  age       43
RID00003  height    5’9”
RID00003  name      harry
RID00004  name      carl
RID00005  name      evan
[Diagram label: Batch Scanner]
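An index lookup usually returns several row IDs scattered across the original table, and a batch scanner fetches all of them in one request. The sketch below shows only the access pattern, under stated assumptions: it reads the ranges one after another from an in-memory map keyed by row ID and field, whereas Accumulo's BatchScanner queries many ranges in parallel and does not promise sorted results. BatchLookupSketch and fetchRows are hypothetical names.

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a multi-row lookup; not the Accumulo API.
class BatchLookupSketch {
    private static final String SEP = "\u0000";

    // table is keyed by rowId + SEP + field. Returns every cell of every requested row.
    static Map<String, String> fetchRows(TreeMap<String, String> table, Collection<String> rowIds) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String rowId : rowIds) {
            // Each requested row is one contiguous key range in the sorted table.
            result.putAll(table.subMap(rowId + SEP, true, rowId + "\u0001", false));
        }
        return result;
    }
}
```

Passing the row IDs returned by an index lookup to a call like this is the two-step pattern the slides describe: scan the index, then batch-fetch the matching records.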
COLUMN VISIBILITY APPLIES TO INDEXES TOO
ANALYSIS?
MAPREDUCE: CHECK
SHUFFLE-SORTED?
• Between Map and Reduce phases is shuffle-sort
• Sorting by key is necessary so all the values for a given key end up next to each other …
• BigTable also sorts keys …
ITERATORS
Value combine(Iterator<Value> values)
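To make the combine signature concrete, here is a minimal summing sketch. It assumes values are plain Long counters instead of the Value type on the slide, and the class and method names are hypothetical, not the Accumulo iterator API. Because the table already keeps entries sorted by key, all values for one key sit next to each other, so a single pass can fold each run of equal keys, much as a reducer does after the shuffle-sort.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a summing combiner; not the Accumulo iterator API.
class SummingCombineSketch {
    // Fold all values stored under one key into a single value.
    static long combine(Iterator<Long> values) {
        long sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        return sum;
    }

    // Apply combine() to each run of equal keys in an already-sorted list of entries.
    static Map<String, Long> combineSorted(List<Map.Entry<String, Long>> sorted) {
        Map<String, Long> out = new LinkedHashMap<>();
        int i = 0;
        while (i < sorted.size()) {
            String key = sorted.get(i).getKey();
            List<Long> run = new ArrayList<>();
            // Collect the adjacent entries that share this key.
            while (i < sorted.size() && sorted.get(i).getKey().equals(key)) {
                run.add(sorted.get(i).getValue());
                i++;
            }
            out.put(key, combine(run.iterator()));
        }
        return out;
    }
}
```

Running a function like this inside the store as data is written or compacted is one way to read the PRE-COMPUTATION slide: aggregates are maintained alongside the data instead of being recomputed for each query.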
PRE-COMPUTATION
[Diagram labels: Many Users, Analyze, Db, Data sets]
ACCUMULO