Database as a Service - Tutorial @ICDE 2010
1
Database as a Service
Seminar, ICDE 2010, Long Beach, March 04
Wolfgang Lehner | Dresden University of Technology, Germany
Kai-Uwe Sattler | Ilmenau University of Technology, Germany
2
Introduction: Motivation, SaaS, Cloud Computing, Use Cases
3
Software as a Service (SaaS)
Traditional software: build your own.
On-demand utility: plug in, subscribe, pay-per-use.
4
Comparison of Business Models
Traditional packaged software vs. Software as a Service (SaaS):
- Designed for customers to install, manage and maintain vs. designed for delivery as Internet-based services.
- Solutions architected to be run by an individual company in a dedicated instantiation of the software vs. designed to run thousands of different customers on a single code base.
- Infrequent major upgrades, sold individually to each installed-base customer vs. frequent small upgrades that minimize customer disruption and enhance satisfaction.
- Version control and upgrade fees vs. fixing a problem for one customer fixes it for everyone.
5
Avoid the Hidden Costs of Traditional SW
Traditional software: SW licenses, maintenance, hardware, IT staff, training, customization.
SaaS: subscription fee, training, customization.
The Long Tail
Dozens of markets of millions, or millions of markets of dozens?
(Figure: $/customer over # of customers, spanning your large customers, your typical customers, and the (currently) "non-addressable" customers.)
What if you lower your cost of sale (i.e., lower the barrier to entry) and also lower the cost of operations? The new addressable market >> the current market.
6
7
EC2 & S3
What is Cloud? Gartner's Definition
Cloud Computing: a style of computing where massively scalable, IT-enabled capabilities are provided "as a service" across the Internet to multiple external customers.
"It's about economies of scale, with effective and dynamic sharing."
- Acquisition model: service ("All that matters is results; I don't care how it is done")
- Business model: pay for usage ("I don't want to own assets; I want to pay for elastic usage, like a utility")
- Technical model: scalable, elastic, shareable
- Access model: Internet ("I want accessibility from anywhere, from any device")
8
To Qualify as a Cloud: Common, Location-independent, Online Utility on Demand*
- Common implies multi-tenancy, not single or isolated tenancy.
- Utility implies pay-for-use pricing.
- On demand implies ~infinite, ~immediate, ~invisible scalability.
Alternatively, a "Zero-One-Infinity" definition:**
- 0: on-premise infrastructure, acquisition cost, adoption cost, support cost.
- 1: a coherent and resilient environment, not a brittle "software stack".
- Infinity: scalability in response to changing need; integratability/interoperability with legacy assets and other services; customizability/programmability from data, through logic, up into the user interface, without compromising robust multi-tenancy.
* Joe Weinman, Vice President of Solutions Sales, AT&T, 3 Nov. 2008
** From The Jargon File: "Allow none of foo, one of foo, or any number of foo"
9
Cloud Differentials: Service Models
- Cloud Software as a Service (SaaS): use the provider's applications over a network.
- Cloud Platform as a Service (PaaS): deploy customer-created applications to a cloud.
- Cloud Infrastructure as a Service (IaaS): rent processing, storage, network capacity, and other fundamental computing resources.
10
Cloud Differentials: Characteristics
- Size/Location: large scale (AWS, Google, IBM/Google) vs. small scale (SMB, academia)
- Purpose: general purpose vs. special purpose (e.g., DB cloud)
- Administration/Jurisdiction: public vs. private
- Platform: physical vs. virtual; homogeneous vs. heterogeneous
- Design paradigms: storage, CPU, bandwidth
- Usage model: exclusive, shared, pseudo-shared
11
Use Cases: Large-Scale Data Analytics
Outsource your data and use cloud resources for analysis:
- historical and mostly non-critical data
- parallelizable, read-mostly, highly variant workloads
- relaxed ACID guarantees
Examples (Hadoop PoweredBy): Yahoo!: research for ad systems and Web search; Facebook: reporting and analytics; Netseer.com: crawling and log analysis; Journey Dynamics: traffic speed forecasting.
12
Use Cases: Database Hosting
Public datasets:
- Biological databases: a single repository instead of > 700 separate databases
- Semantic Web data, linked data, ...
- Sloan Digital Sky Survey, TwitterCache
- Already on Amazon AWS: annotated human genome data, US census, Freebase, ...
Also: archiving, metadata indexing, ...
13
Use Cases: Service Hosting
Data management for SaaS solutions: run the services near the data (= ASP).
Already many existing applications: CRM (e.g. Salesforce, SugarCRM), Web analytics, supply chain management, help desk management, enterprise resource planning (e.g. SAP Business ByDesign), ...
14
Foundations & Architectures
Virtualization; programming models; consistency models & replication; SLAs & workload management; security.
15
Topics covered in this Seminar
Virtualization, Distributed Storage, Logical Data Model, Storage Model, Query & Programming Model, Multi-Tenancy, Replication, Service Level Agreements, Security
17
... it‘s simple!
18
Virtualization
Separating the abstract view of computing resources from the implementation of these resources:
- adds flexibility and agility to the computing infrastructure
- softens problems related to provisioning, manageability, ...
- lowers TCO: fewer computing resources
Classical driving factor: server consolidation (EDBT 2008 tutorial, Aboulnaga et al.): consolidate an e-mail server, a Web server, and a database server (each on Linux) onto virtualized hardware for improved utilization.
What can be virtualized: the big four.
19
20
Different Types of Virtualization
(Figure: applications APP 1-5 run on operating systems inside Virtual Machine 1 and Virtual Machine 2; a Virtual Machine Monitor (VMM) maps each VM's virtual CPUs, memory, and network onto the physical machine and physical storage.)
21
Virtual Machines
- Technique with a long history (since the 1960's); prominent since the IBM 370 mainframe series; today applied to large-scale commodity hardware and operating systems.
- Virtual Machine Monitor (Hypervisor): strong isolation between virtual machines (security, privacy, fault tolerance); flexible mapping between virtual machines and physical resources; classical operations: pause, resume, checkpoint, migrate (administration / load balancing).
- Software deployment: preconfigured virtual appliances; repositories of virtual appliances on the web.
22
DBMS on top of Virtual Machines
... yet another application? ... overhead?
(Example: SQL Server within VMware.)
23
Virtualization Design Advisor
- What fraction of node resources goes to which DBMS? (configuring the VM parameters)
- What parameter settings are best for a given resource configuration? (configuring the DBMS parameters)
Example: Workload 1 is TPC-H (10 GB); Workload 2 is TPC-H (10 GB), only Q18 (132 copies). The virtualization design advisor assigns 20% of the CPU to Workload 1 and 80% to Workload 2.
24
Some Experiments
- Workload definition based on TPC-H: Q18 is one of the most CPU-intensive queries; Q21 is one of the least CPU-intensive queries.
- Workload units: C = 25x Q18; I = 1x Q21.
- Experiment: sensitivity to workload resource needs. W1 = 5C + 5I; W2 = kC + (10-k)I (increasing k makes W2 more CPU-intensive). Evaluated on DB2 and Postgres.
26
Virtualization in DBaaS environments
(Figure: layered architecture with a HW layer, a VM layer of multiple virtual machines, a DB server layer, an instance layer, and a DB layer.)
27
Existing Tools for Node Virtualization
(Figure: the layered stack annotated with existing tools: a DB advisor (indexes, MQTs, MDC, redistribution of tables) and a DB workload manager on the DB layer; VM configuration (monitoring, resource configuration, (manual) migration) and a node resource model on the VM layer.)
Static environment assumptions:
- The advisor expects a static hardware environment.
- The VM expects static (peak) resource requirements.
- Interactions between the layers can improve performance/utilization.
28
Layer Interactions (2)
Experiment: DB2 on Linux, TPC-H workload on a 1 GB database. Ranges for the resource grants: main memory (bufferpool) from 50 MB to 1 GB; additional storage (indexes) from 5% to 30% of the DB size. Varying the advisor output (17-26 indexes) yields different possible improvements and different expected performance after improvement.
(Figures: expected performance (5%-25%) and possible improvement (from <1% up to 90%) plotted over bufferpool size (200 MB to 1 GB) and index storage; the outputs of the DB advisor and the VM configuration interact.)
29
Storage Virtualization
General goal: provide a layer of indirection to allow the definition of virtual storage devices; minimize/avoid downtime (local and remote mirroring); improve performance (distribution/balancing, provisioning, control of placement); reduce the cost of storage administration.
Operations:
- create, destroy, grow, shrink virtual devices
- change size, performance, reliability, ... (workload fluctuations, hierarchical storage management)
- versioning, snapshots, point-in-time copies (backup, checkpoints)
- exploit CPU and memory in the storage system (caching, executing low-level DBMS functions)
30
Virtualization in DBaaS Environments (2)
(Figure: the layered architecture (HW, VM, DB server, instance, DB layers) extended by a storage layer with shared disks and local disks.)
31
Virtualization in DBaaS Environments (2)
(Figure: the layered stack including the storage layer, annotated with tools: DB advisor (indexes, MQTs, MDC, redistribution of tables), DB workload manager, storage configuration (device bundling, replication, archiving), and a storage resource model.)
One way to go? Paravirtualization
- CPU and memory paravirtualization extends the guest to allow direct interaction with the underlying hypervisor; it reduces the monitor cost, including memory and system-call operations. The gains from paravirtualization are workload-specific.
- Device paravirtualization places a high-performance, virtualization-aware device driver into the guest. Paravirtualized drivers are more CPU-efficient (less CPU overhead for virtualization) and can also take advantage of hardware features such as partial offload.
33
Outline
Virtualization, Distributed Storage, Logical Data Model, Storage Model, Query & Programming Model, Multi-Tenancy, Replication, Service Level Agreements, Security
34
Multi-Tenancy
Goal: consolidate multiple customers onto the same operational system.
Requirements:
- Extensibility: customer-specific schema changes
- Security: preventing unauthorized data accesses by other tenants
- Performance/scalability: scale-up & scale-out
- Maintenance: on the tenant level instead of on the database level
The approaches range from a separate DB per tenant (flexible, but limited scalability) over a shared DB with separate schemas to a shared DB with a shared schema (best resource utilization).
35
Flexible Schema Approaches
Goal: allow tenant-specific schema additions (columns).
Approaches: universal table, extension table, pivot table.
36
Flexible Schema Approaches: Comparison
The approaches span a spectrum from "application owns the schema" (private tables, extension tables: best performance) to "database owns the schema" (universal table, chunk folding, pivot table, XML columns: flexible schema evolution).
- Universal table: requires techniques for handling sparse data; fine-grained index support is not possible.
- Pivot table: requires joins for reconstructing logical tuples.
- Chunk folding: similar to pivot tables; groups of columns are combined into a chunk and mapped into a chunk table; requires complex query transformations.
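A minimal Python sketch of the pivot-table mapping compared above (the table name, tenant ID, and values are invented for illustration): every logical column value becomes one (tenant, table, row, column, value) entry, and reconstructing a logical tuple means regrouping the entries per row; in SQL this is exactly the per-column join the comparison warns about.

```python
from collections import defaultdict

# One (tenant, table, row, column, value) entry per logical column value;
# "hospital" plays the role of a tenant-specific extension column.
pivot = [
    (42, "account", 1, "name", "Acme"),
    (42, "account", 1, "hospital", "St. Mary"),
    (42, "account", 2, "name", "Gump"),
]

def reconstruct(tenant, table, columns):
    """Rebuild the logical tuples of one tenant from the pivot entries."""
    rows = defaultdict(dict)
    for t, tab, row, col, val in pivot:
        if t == tenant and tab == table and col in columns:
            rows[row][col] = val
    return [{"row": r, **cols} for r, cols in sorted(rows.items())]

print(reconstruct(42, "account", ["name", "hospital"]))
```

Sparse extension columns simply have no entry for a row, which is the flexibility the approach buys at the cost of reconstruction joins.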
37
Access Control in Multi-Tenant DBs
Shared-DB approaches require row-level access control:
- Query transformation: ... WHERE TenantID = 42 ... (potential security risks)
- DBMS-level control, e.g. IBM DB2 LBAC (label-based access control): controls read/write access to individual rows and columns via security labels with policies; requires a separate account for each tenant.
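The query-transformation approach can be sketched as a simple rewrite step. This is a deliberately simplified illustration; real systems rewrite the query plan rather than the SQL string, and naive string rewriting is one source of the security risks mentioned above:

```python
def add_tenant_filter(sql, tenant_id):
    """Append a TenantID predicate so a tenant only sees its own rows."""
    sql = sql.rstrip().rstrip(";")
    clause = f"TenantID = {int(tenant_id)}"  # int() rejects injection attempts
    if " where " in sql.lower():
        return f"{sql} AND {clause};"
    return f"{sql} WHERE {clause};"

print(add_tenant_filter("SELECT * FROM accounts", 42))
# -> SELECT * FROM accounts WHERE TenantID = 42;
```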
38
In a Nutshell
How shall virtualization be handled on ...
- the machine level (VM to HW), the DBMS level (database to instance to database server), the schema level (multi-tenancy)
... using ...
- allocation between layers, configuration inside layers, flexible schemas
... when ...
- the characteristics of the workloads are known, virtual machines are transparent, tenant-specific schema extensions exist
... demanding that ...
- SLAs and security are respected, each node's utilization is maximized, the number of nodes is minimized
39
Outline
Virtualization, Distributed Storage, Logical Data Model, Storage Model, Query & Programming Model, Multi-Tenancy, Replication, Service Level Agreements, Security
40
MapReduce Background
- Programming model and associated implementation for large-scale data processing (Google); related approaches: Apache Hadoop and Microsoft Dryad.
- User-defined map & reduce functions:
  map(in_key, in_value) -> list of (out_key, intermediate_value)
  reduce(out_key, list of intermediate_value) -> list of out_value
- The infrastructure hides the details of parallelization and provides fault tolerance, data distribution, I/O scheduling, load balancing, ...
Logic Flow of WordCount
Input: "Hadoop Map/Reduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner..."
The mapper emits (word, 1) pairs, e.g. (Hadoop, 1), (Map, 1), (Reduce, 1), (is, 1), (a, 1). The sort/shuffle phase groups the pairs per word, e.g. Hadoop -> [1, 1, 1, ..., 1]. The reducer sums each list, e.g. (Hadoop, 5), (Map, 12), (Reduce, 12), (is, 42), (a, 23).
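The data flow above can be simulated in a few lines. This is a minimal in-memory sketch of the map / sort-shuffle / reduce phases of the model, not of Hadoop itself:

```python
from itertools import groupby

def map_fn(line):
    # Map phase: emit (word, 1) for every word in the input line.
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    # Reduce phase: sum the counts collected for one word.
    return (word, sum(counts))

def word_count(lines):
    intermediate = [kv for line in lines for kv in map_fn(line)]  # map
    intermediate.sort(key=lambda kv: kv[0])                       # sort/shuffle
    return [reduce_fn(k, [v for _, v in group])                   # reduce
            for k, group in groupby(intermediate, key=lambda kv: kv[0])]

print(word_count(["a b a", "b a"]))
# -> [('a', 3), ('b', 2)]
```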
42
MapReduce Disadvantages
- Extremely rigid data flow (map phases followed by reduce phases).
- Common operations must be coded by hand: join, filter, split, projection, aggregates, sorting, distinct.
- User plans may be suboptimal and lead to performance degradation.
- Semantics are hidden inside the map and reduce functions: inflexible, difficult to maintain, extend and optimize.
Remedy: combine high-level declarative querying with low-level MapReduce programming via dataflow programming languages such as Hive, JAQL and Pig.
43
Pig Latin
- On top of MapReduce/Hadoop.
- Mix of the declarative style of SQL and the procedural style of MapReduce.
- Consists of two parts: Pig Latin, a data processing language, and the Pig infrastructure, an evaluator for Pig Latin programs. Pig compiles Pig Latin into physical plans, which are executed over Hadoop.
- 30% of all queries at Yahoo! are in Pig Latin.
- Open source: http://incubator.apache.org/pig
44
Example

Visits:
User | URL        | Time
Amy  | cnn.com    | 8:00
Amy  | bbc.com    | 10:00
Amy  | flickr.com | 10:05
Fred | cnn.com    | 12:00

URL Info:
URL        | Category | PageRank
cnn.com    | News     | 0.9
bbc.com    | News     | 0.8
flickr.com | Photos   | 0.7
espn.com   | Sports   | 0.9

Task: determine the most visited websites in each category.
45
Implementation in MapReduceimport java.io.IOException; import java.util.ArrayList; import java.util.Iterator; import java.util.List; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.LongWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.io.Writable; import org.apache.hadoop.io.WritableComparable; import org.apache.hadoop.mapred.FileInputFormat; import org.apache.hadoop.mapred.FileOutputFormat; import org.apache.hadoop.mapred.JobConf; import org.apache.hadoop.mapred.KeyValueTextInputFormat; import org.apache.hadoop.mapred.Mapper; import org.apache.hadoop.mapred.MapReduceBase; import org.apache.hadoop.mapred.OutputCollector; import org.apache.hadoop.mapred.RecordReader; import org.apache.hadoop.mapred.Reducer; import org.apache.hadoop.mapred.Reporter; import org.apache.hadoop.mapred.SequenceFileInputFormat; import org.apache.hadoop.mapred.SequenceFileOutputFormat; import org.apache.hadoop.mapred.TextInputFormat; import org.apache.hadoop.mapred.jobcontrol.Job; import org.apache.hadoop.mapred.jobcontrol.JobControl; import org.apache.hadoop.mapred.lib.IdentityMapper; public class MRExample { public static class LoadPages extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> { public void map(LongWritable k, Text val, OutputCollector<Text, Text> oc, Reporter reporter) throws IOException { // Pull the key out String line = val.toString(); int firstComma = line.indexOf(','); String key = line.substring(0, firstComma); String value = line.substring(firstComma + 1); Text outKey = new Text(key); // Prepend an index to the value so we know which file // it came from. 
Text outVal = new Text("1" + value); oc.collect(outKey, outVal); } } public static class LoadAndFilterUsers extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> { public void map(LongWritable k, Text val, OutputCollector<Text, Text> oc, Reporter reporter) throws IOException { // Pull the key out String line = val.toString(); int firstComma = line.indexOf(','); String value = line.substring(firstComma + 1); int age = Integer.parseInt(value); if (age < 18 || age > 25) return; String key = line.substring(0, firstComma); Text outKey = new Text(key); // Prepend an index to the value so we know which file // it came from. Text outVal = new Text("2" + value); oc.collect(outKey, outVal); } } public static class Join extends MapReduceBase implements Reducer<Text, Text, Text, Text> { public void reduce(Text key, Iterator<Text> iter, OutputCollector<Text, Text> oc, Reporter reporter) throws IOException { // For each value, figure out which file it's from and store it // accordingly. List<String> first = new ArrayList<String>(); List<String> second = new ArrayList<String>(); while (iter.hasNext()) { Text t = iter.next(); String value = t.toString(); if (value.charAt(0) == '1') first.add(value.substring(1)); else second.add(value.substring(1));
reporter.setStatus("OK"); } // Do the cross product and collect the values for (String s1 : first) { for (String s2 : second) { String outval = key + "," + s1 + "," + s2; oc.collect(null, new Text(outval)); reporter.setStatus("OK"); } } } } public static class LoadJoined extends MapReduceBase implements Mapper<Text, Text, Text, LongWritable> { public void map( Text k, Text val, OutputCollector<Text, LongWritable> oc, Reporter reporter) throws IOException { // Find the url String line = val.toString(); int firstComma = line.indexOf(','); int secondComma = line.indexOf(',', firstComma); String key = line.substring(firstComma, secondComma); // drop the rest of the record, I don't need it anymore, // just pass a 1 for the combiner/reducer to sum instead. Text outKey = new Text(key); oc.collect(outKey, new LongWritable(1L)); } } public static class ReduceUrls extends MapReduceBase implements Reducer<Text, LongWritable, WritableComparable, Writable> { public void reduce( Text key, Iterator<LongWritable> iter, OutputCollector<WritableComparable, Writable> oc, Reporter reporter) throws IOException { // Add up all the values we see long sum = 0; while (iter.hasNext()) { sum += iter.next().get(); reporter.setStatus("OK"); } oc.collect(key, new LongWritable(sum)); } } public static class LoadClicks extends MapReduceBase implements Mapper<WritableComparable, Writable, LongWritable, Text> { public void map( WritableComparable key, Writable val, OutputCollector<LongWritable, Text> oc, Reporter reporter) throws IOException { oc.collect((LongWritable)val, (Text)key); } } public static class LimitClicks extends MapReduceBase implements Reducer<LongWritable, Text, LongWritable, Text> { int count = 0; public void reduce( LongWritable key, Iterator<Text> iter, OutputCollector<LongWritable, Text> oc, Reporter reporter) throws IOException { // Only output the first 100 records while (count < 100 && iter.hasNext()) { oc.collect(key, iter.next()); count++; } } } public static void 
main(String[] args) throws IOException { JobConf lp = new JobConf(MRExample.class); lp.setJobName("Load Pages"); lp.setInputFormat(TextInputFormat.class);
lp.setOutputKeyClass(Text.class); lp.setOutputValueClass(Text.class); lp.setMapperClass(LoadPages.class); FileInputFormat.addInputPath(lp, new Path("/user/gates/pages")); FileOutputFormat.setOutputPath(lp, new Path("/user/gates/tmp/indexed_pages")); lp.setNumReduceTasks(0); Job loadPages = new Job(lp); JobConf lfu = new JobConf(MRExample.class); lfu.setJobName("Load and Filter Users"); lfu.setInputFormat(TextInputFormat.class); lfu.setOutputKeyClass(Text.class); lfu.setOutputValueClass(Text.class); lfu.setMapperClass(LoadAndFilterUsers.class); FileInputFormat.addInputPath(lfu, new Path("/user/gates/users")); FileOutputFormat.setOutputPath(lfu, new Path("/user/gates/tmp/filtered_users")); lfu.setNumReduceTasks(0); Job loadUsers = new Job(lfu); JobConf join = new JobConf(MRExample.class); join.setJobName("Join Users and Pages"); join.setInputFormat(KeyValueTextInputFormat.class); join.setOutputKeyClass(Text.class); join.setOutputValueClass(Text.class); join.setMapperClass(IdentityMapper.class); join.setReducerClass(Join.class); FileInputFormat.addInputPath(join, new Path("/user/gates/tmp/indexed_pages")); FileInputFormat.addInputPath(join, new Path("/user/gates/tmp/filtered_users")); FileOutputFormat.setOutputPath(join, new Path("/user/gates/tmp/joined")); join.setNumReduceTasks(50); Job joinJob = new Job(join); joinJob.addDependingJob(loadPages); joinJob.addDependingJob(loadUsers); JobConf group = new JobConf(MRExample.class); group.setJobName("Group URLs"); group.setInputFormat(KeyValueTextInputFormat.class); group.setOutputKeyClass(Text.class); group.setOutputValueClass(LongWritable.class); group.setOutputFormat(SequenceFileOutputFormat.class); group.setMapperClass(LoadJoined.class); group.setCombinerClass(ReduceUrls.class); group.setReducerClass(ReduceUrls.class); FileInputFormat.addInputPath(group, new Path("/user/gates/tmp/joined")); FileOutputFormat.setOutputPath(group, new Path("/user/gates/tmp/grouped")); group.setNumReduceTasks(50); Job groupJob = new 
Job(group); groupJob.addDependingJob(joinJob); JobConf top100 = new JobConf(MRExample.class); top100.setJobName("Top 100 sites"); top100.setInputFormat(SequenceFileInputFormat.class); top100.setOutputKeyClass(LongWritable.class); top100.setOutputValueClass(Text.class); top100.setOutputFormat(SequenceFileOutputFormat.class); top100.setMapperClass(LoadClicks.class); top100.setCombinerClass(LimitClicks.class); top100.setReducerClass(LimitClicks.class); FileInputFormat.addInputPath(top100, new Path("/user/gates/tmp/grouped")); FileOutputFormat.setOutputPath(top100, new Path("/user/gates/top100sitesforusers18to25")); top100.setNumReduceTasks(1); Job limit = new Job(top100); limit.addDependingJob(groupJob); JobControl jc = new JobControl("Find top 100 sites for users 18 to 25"); jc.addJob(loadPages); jc.addJob(loadUsers); jc.addJob(joinJob); jc.addJob(groupJob); jc.addJob(limit); jc.run(); } }
46
Example Workflow in Pig Latin
Dataflow: load Visits; group by url; foreach url generate count; load URL Info; join on url; group by category; foreach category generate top10 URLs.

visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);
store topUrls into '/data/topURLs';

- Operates directly over files.
- Schemas are optional and can be assigned dynamically.
- User-defined functions (UDFs) can be used in every construct: load, store, group, filter, foreach.
47
Compilation in MapReduce
Dataflow: load Visits; group by url; foreach url generate count; load URL Info; join on url; group by category; foreach category generate top10 URLs.
- Every group or join operation forms a map-reduce boundary.
- The other operations are pipelined into the map and reduce phases.
(The example compiles into three map-reduce jobs: Map1/Reduce1, Map2/Reduce2, Map3/Reduce3.)
48
Hive
Data warehouse infrastructure built on top of Hadoop, providing data summarization and ad-hoc querying.
- Simple query language: Hive QL (based on SQL)
- Extensible via custom mappers and reducers
- Subproject of Hadoop; no special "Hive format"
- http://hadoop.apache.org/hive/
49
Hive - Example

LOAD DATA INPATH '/data/visits' INTO TABLE visits;

INSERT OVERWRITE TABLE visitCounts
SELECT url, category, count(*)
FROM visits
GROUP BY url, category;

LOAD DATA INPATH '/data/urlInfo' INTO TABLE urlInfo;

INSERT OVERWRITE TABLE visitCounts
SELECT vc.*, ui.*
FROM visitCounts vc JOIN urlInfo ui ON (vc.url = ui.url);

INSERT OVERWRITE TABLE gCategories
SELECT category, count(*)
FROM visitCounts
GROUP BY category;

INSERT OVERWRITE TABLE topUrls
SELECT TRANSFORM (visitCounts) USING 'top10';
50
JAQL
- Higher-level query language for JSON documents, developed at IBM's Almaden research center.
- Supports several operations known from SQL: grouping, joining, sorting.
- Built-in support for loops, conditionals, recursion.
- Custom Java methods extend JAQL; JAQL scripts are compiled to MapReduce jobs.
- Various I/O: local FS, HDFS, HBase, custom I/O adapters.
- http://www.jaql.org/
51
JAQL - Example

registerFunction("top", "de.tuberlin.cs.dima.jaqlextensions.top10");

$visits = hdfsRead("/data/visits");
$visitCounts = $visits
  -> group by $url = $
     into { $url, num: count($) };

$urlInfo = hdfsRead("data/urlInfo");
$visitCounts = join $visitCounts, $urlInfo
               where $visitCounts.url == $urlInfo.url;

$gCategories = $visitCounts
  -> group by $category = $
     into { $category, num: count($) };

$topUrls = top10($gCategories);
hdfsWrite("/data/topUrls", $topUrls);
52
Outline
Virtualization, Distributed Storage, Logical Data Model, Storage Model, Query & Programming Model, Multi-Tenancy, Replication, Service Level Agreements, Security
53
ACID vs. BASE
ACID (traditional distributed data management): strong consistency; isolation; focus on "commit"; availability?; pessimistic; difficult evolution (e.g. of the schema).
BASE = Basically Available, Soft state, Eventually consistent (Web-scale data management): weak consistency; availability first; best effort; optimistic (aggressive); fast and simple; easier evolution.
54
CAP Theorem [Brewer 2000]
- Consistency: all clients have the same view, even in case of updates.
- Availability: all clients find a replica of the data, even in the presence of failures.
- Tolerance to network partitions: the system properties hold even when the network (system) is partitioned.
You can have at most two of these properties for any shared-data system.
55
CAP Theorem: Trade-offs
- Forfeit consistency: no consistency guarantees; updates with conflict resolution.
- Forfeit availability: on a partition event, simply wait until the data is consistent again; pessimistic locking.
- Forfeit partition tolerance: all nodes are in contact with each other, or everything is put into a single box; 2-phase commit.
56
CAP: Explanations
(Scenario: 1. process PA updates object o (PA := update(o)); 2. a message M propagates the update; 3. process PB reads o (PB := read(o)).)
A network partition means M is not delivered. Solutions?
- Synchronous message: <PA, M> is atomic; possible latency problems (availability).
- Transaction <PA, M, PB>: requires control over when PB happens; impacts partition tolerance or availability.
57
Consistency Models [Vogels 2008]
(Scenario: a distributed storage system holds data item D; process A performs the update D0 -> D1; processes A, B, and C subsequently read D.)
- Strong consistency: after the update completes, any subsequent access from A, B, or C will return D1.
- Weak consistency: does not guarantee that subsequent accesses will return D1; a number of conditions need to be met before D1 is returned.
- Eventual consistency: a special form of weak consistency; guarantees that, if no new updates are made, eventually all accesses will return D1.
58
Variations of Eventual Consistency
- Causal consistency: if A notifies B about the update, B will read D1 (but C need not!).
- Read-your-writes: A will always read D1 after its own update.
- Session consistency: read-your-writes inside a session.
- Monotonic reads: if a process has seen Dk, any subsequent access will never return any Di with i < k.
- Monotonic writes: guarantees to serialize the writes of the same process.
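Monotonic reads, for instance, can be enforced on the client side. A minimal sketch (the replica/version interface is invented for illustration):

```python
class Session:
    """Client session that tracks the highest version it has observed."""
    def __init__(self):
        self.max_seen = 0

    def read(self, replica_version, value):
        # A response older than what this session has already seen would
        # violate monotonic reads; a real client would retry another replica.
        if replica_version < self.max_seen:
            raise RuntimeError("stale replica: violates monotonic reads")
        self.max_seen = replica_version
        return value

s = Session()
print(s.read(3, "D3"))   # ok: the session has now seen version 3
try:
    s.read(2, "D2")      # an older version is rejected
except RuntimeError as err:
    print(err)
```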
59
Database Replication
Store the same data on multiple nodes in order to improve reliability, accessibility, and fault tolerance. The design space ranges from 1-copy consistency to relaxed consistency, and from single-master over multi-master to optimistic replication.
Optimistic strategies (lazy replication):
- allow replicas to diverge; require conflict resolution
- allow data to be accessed without a-priori synchronization
- propagate updates in the background; occasional conflicts are fixed after they happen
- improved availability, flexibility, and scalability, but see the CAP theorem
60
Optimistic Replication: Elements
(Figure: the five phases of optimistic replication: 1. operation submission, 2. propagation, 3. scheduling, 4. conflict resolution, 5. commitment.)
Y. Saito, M. Shapiro: Optimistic Replication, ACM Computing Surveys, 5(3):1-44, 2005
61
Conflict Resolution & Update Propagation
Strategies for handling conflicts: prohibit them (single master); ignore them (Thomas write rule); reduce them (dividing objects, ...); syntactic detect & repair (vector clocks); semantic detect & repair (app-specific ordering or preconditions).
Epidemic information dissemination:
- Updates pass through the system like infectious diseases.
- Pairwise communication: a site contacts others (randomly chosen) and sends its information, e.g. about updates.
- All sites process messages in the same way.
- Proactive behaviour: no failure recovery necessary!
- Basic approaches: anti-entropy, rumor mongering, ...
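Syntactic conflict detection with vector clocks can be sketched as follows (the site names are invented): an update supersedes another iff its clock is component-wise greater or equal and strictly greater somewhere; otherwise the two updates are concurrent and need semantic resolution.

```python
def dominates(a, b):
    """True iff vector clock a happened after b (component-wise >=, one >)."""
    sites = set(a) | set(b)
    return (all(a.get(s, 0) >= b.get(s, 0) for s in sites)
            and any(a.get(s, 0) > b.get(s, 0) for s in sites))

def relation(a, b):
    if dominates(a, b):
        return "a after b"
    if dominates(b, a):
        return "b after a"
    return "concurrent (conflict)"   # neither dominates: resolve semantically

print(relation({"s1": 2, "s2": 1}, {"s1": 1, "s2": 1}))  # -> a after b
print(relation({"s1": 2, "s2": 0}, {"s1": 1, "s2": 3}))  # -> concurrent (conflict)
```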
62
Outline
Virtualization, Distributed Storage, Logical Data Model, Storage Model, Query & Programming Model, Multi-Tenancy, Replication, Service Level Agreements, Security
63
The Notion of QoS and Predictability
A Service Level Agreement (SLA) is a common understanding about services, guarantees, and responsibilities. It has a legal part (fees, penalties, ...) and a technical part: service level objectives, i.e. specific measurable characteristics such as importance and performance goals (deadline constraints, percentile constraints). SLAs cut across the stack: application server / middleware, DBMS, OS / hardware.
64
Techniques for QoS in Data Management
- Provide sufficient resources. Capacity planning: "How many boxes for customer X?" Cost vs. performance trade-off.
- Shielding: a dedicated (virtual) system per customer. Scalability? Cost efficiency?
- Scheduling: ordering requests by priority. At which level?
65
Workload Management
Purpose: achieve performance goals for classes of requests (queries, transactions); resource provisioning.
Aspects:
- Specification of service-level objectives
- Workload classification and modeling
- Admission control & scheduling
Approaches: static prioritization (DB2 Query Patroller, Oracle Resource Manager, ...), goal-oriented approaches, economic approaches, utility-based approaches.
67
WLM: Model
Incoming transactions are assigned to workload classes; admission control and scheduling then decide which requests execute, aiming at per-class response-time goals.
- Admission control: limit the number of simultaneously executing requests (multiprogramming level, MPL).
- Scheduling: ordering the requests by priority.
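A minimal sketch of MPL-based admission control with priority scheduling (lower number = higher priority; the interface is invented for illustration):

```python
import heapq

class AdmissionController:
    def __init__(self, mpl):
        self.mpl = mpl          # multiprogramming level: max concurrent requests
        self.running = set()
        self.queue = []         # min-heap of (priority, request)

    def submit(self, request, priority):
        heapq.heappush(self.queue, (priority, request))
        self._admit()

    def finish(self, request):
        self.running.discard(request)
        self._admit()           # a slot freed up: admit the next request

    def _admit(self):
        # Admit waiting requests in priority order while slots are free.
        while self.queue and len(self.running) < self.mpl:
            _, request = heapq.heappop(self.queue)
            self.running.add(request)

ac = AdmissionController(mpl=2)
for req, prio in [("q1", 1), ("q2", 3), ("q3", 2)]:
    ac.submit(req, prio)
print(sorted(ac.running))   # q1 and q2 occupy the two slots; q3 waits
```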
68
Utility Functions
A utility function is a preference specification: it maps possible system states (e.g. resource provisioning to jobs) to a real scalar value, representing a performance feature (response time, throughput, ...) and/or an economic value.
Goal: determine the most valuable feasible state, i.e. maximize utility:
- explore the space of alternative mappings (a search problem)
- runtime monitoring and control
Kephart, Das: Achieving self-management via utility functions. IEEE Internet Computing 2007
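The utility-maximization loop can be sketched with a toy performance model. Both the sigmoid-shaped preference and the demand/share model below are invented for illustration; a real system would plug in measured or learned models:

```python
def utility(response_time, deadline=2.0):
    # Sigmoid-like preference: high utility while under the deadline,
    # dropping quickly once the deadline is exceeded.
    return 1.0 / (1.0 + (response_time / deadline) ** 4)

def predicted_response_time(demand, cpu_share):
    # Toy performance model: response time grows as the share shrinks.
    return demand / max(cpu_share, 1e-9)

def best_allocation(demands, steps=10):
    """Exhaustively search CPU splits between two workloads for max utility."""
    best, best_u = None, float("-inf")
    for k in range(1, steps):
        shares = (k / steps, 1 - k / steps)
        u = sum(utility(predicted_response_time(d, s))
                for d, s in zip(demands, shares))
        if u > best_u:
            best, best_u = shares, u
    return best

print(best_allocation((0.2, 0.8)))   # the CPU-hungry workload gets the larger share
```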
69
Workload Modeling & Prediction
Goal: predict the resource requirements for a given workload, i.e. find the correlation between query features and performance features. Approaches: regression, correlation analysis, Kernel Canonical Correlation Analysis (KCCA).
KCCA correlates a job feature matrix (built from query plans / job descriptions) with a performance feature matrix (built from performance statistics) via a query-plan projection and a performance projection. Prediction: calculate the job's coordinates in the query-plan projection from its feature vector, then infer the job's coordinates in the performance projection.
Ganapathi et al.: Predicting Multiple Metrics for Queries: Better Decisions Enabled by Machine Learning. ICDE 2009
70
Outline
Virtualization, Distributed Storage, Logical Data Model, Storage Model, Query & Programming Model, Multi-Tenancy, Replication, Service Level Agreements, Security
71
Overview and Challenges
(Figure: the data owner pre-processes and outsources its data to an un-trusted service provider running a query engine; users send queries through a query pre-/post-processor and receive the query results.)
Challenges: data confidentiality/privacy; private information retrieval / access privacy; completeness and correctness of results.
72
Challenges I – Data Confidentiality/Privacy
Need to store data in the cloud, but we do not trust the service provider with sensitive information → encrypt the data and store it, but still be able to run queries over the encrypted data; do most of the work at the server
Two issues: privacy during transmission (well studied, e.g. through SSL/TLS) and privacy of stored data
Querying over encrypted data is challenging: some content information must be maintained on the server side, e.g. range queries require order-preserving data encryption mechanisms
→ privacy/performance trade-off
73
Query Processing on Encrypted Data
[Figure: at the client site, the query translator uses metadata to turn the original query into a client-side and a server-side part; the (un-trusted) service provider's query executor evaluates the server-side query and returns encrypted results, which the client post-processes in a temporary result before handing the final result to the user]
74
Executing SQL over Encrypted Data
Hacigümüş et al. (SIGMOD 2002), main steps:
Partition sensitive domains (order-preserving: supports comparison; random: query rewriting becomes hard)
Rewrite queries to target partitions
Execute queries and return results
Prune/post-process results on the client
Privacy/precision trade-off: larger segments/partitions → increased privacy, decreased precision, increased overheads in query processing
75
Relational Encryption

Original table (data owner):
NAME   SALARY   PID
John   50000    2
Marry  110000   2
James  95000    3
Lisa   105000   4

Encrypted table (service provider site):
etuple               N_ID   S_ID   P_ID
fErf!$Q!!vddf>></|   50     1      10
F%%3w&%gfErf!$       65     2      10
&%gfsdf$%343v<l      50     2      20
%%33w&%gfs##!        65     2      20

Store an etuple (encrypted with an arbitrary encryption function, e.g. AES, RSA, Blowfish, DES, ...) for each tuple in the original table
Create a coarse index (bucket ids) for each (or selected) attribute(s) in the original table
76
Index and Identification Functions
Partition function divides domain values into partitions (buckets):
Partition(R.A) = { [0,200], (200,400], (400,600], (600,800], (800,1000] }
[Figure: the domain 0..1000 split into these five buckets with partition ids 2, 7, 5, 1, 4]
The partitioning function has an impact on performance as well as privacy
Identification function assigns a partition id to each partition of attribute A, e.g. ident_R.A( (200,400] ) = 7
Any function can be used as identification function, e.g., hash functions
The bucket-to-id mapping is kept as meta-data on the client
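The partition and identification functions above can be sketched directly; the buckets and ids follow the slide's example, while the range-rewrite helper is an illustrative addition in the spirit of the Hacigümüş et al. scheme.

```python
import bisect

# Bucketization sketch: the client keeps the partition and identification
# functions as metadata and rewrites range predicates into bucket-id
# predicates for the server. Buckets and ids follow the slide's example.

# Partition(R.A) = { [0,200], (200,400], (400,600], (600,800], (800,1000] }
BOUNDS = [200, 400, 600, 800, 1000]   # upper bound of each bucket
# ident assigns an arbitrary id to each bucket, e.g. ident((200,400]) = 7
IDENT = {0: 2, 1: 7, 2: 5, 3: 1, 4: 4}

def bucket_of(value):
    """Return the bucket id the value falls into."""
    return IDENT[bisect.bisect_left(BOUNDS, value)]

def rewrite_range(lo, hi):
    """Rewrite 'A BETWEEN lo AND hi' into the set of bucket ids the server
    must return; the client prunes false positives after decryption."""
    i = bisect.bisect_left(BOUNDS, lo)
    j = bisect.bisect_left(BOUNDS, hi)
    return {IDENT[k] for k in range(i, j + 1)}

print(bucket_of(250))             # falls into (200,400], id 7
print(sorted(rewrite_range(150, 450)))
```

Note the imprecision: a query for 150..450 must fetch three whole buckets, so the server returns extra tuples the client discards, which is exactly the privacy/precision trade-off on the previous slide.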
77
Challenges II – Private Information Retrieval (PIR)
User queries should be invisible to the service provider
More formally: the database is modeled as a string x of length N stored at a remote server; the user wants to retrieve the bit x_i for some i without disclosing any information about i to the server
Paradox: imagine buying in a store without the seller knowing what you buy
[Figure: the user sends i and receives x_i from the server holding x = x_1, x_2, ..., x_N]
78
Information-Theoretic 2-server PIR
User picks a random set Q1 ⊆ {1,...,n}, sends Q1 to server 1 and Q2 = Q1 ⊕ {i} to server 2 (⊕ on sets = symmetric difference)
Server 1 answers a1 = ⊕_{l ∈ Q1} x_l; server 2 answers a2 = ⊕_{l ∈ Q2} x_l
User recovers x_i = a1 ⊕ a2
Each query set on its own is uniformly random, so neither (non-colluding) server learns anything about i
[Figure: example bit database queried by the two servers]
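The protocol is short enough to execute; here is a minimal sketch with a bit database, random subset queries, and XOR answers (0-based indices instead of the slide's 1-based ones).

```python
import secrets

# 2-server information-theoretic PIR: the user sends a random index set Q1
# to server 1 and Q2 = Q1 XOR {i} to server 2; each server XORs the bits it
# is asked for, and XORing the two answers yields x_i. Each query set alone
# is uniformly random, so a single server learns nothing about i.

def query_sets(n, i):
    q1 = {l for l in range(n) if secrets.randbits(1)}
    q2 = q1 ^ {i}              # symmetric difference: flip membership of i
    return q1, q2

def server_answer(x, q):
    a = 0
    for l in q:                # a = XOR of the requested bits
        a ^= x[l]
    return a

x = [0, 1, 1, 0, 1, 0, 1, 1]  # the database, one bit per cell
i = 2
q1, q2 = query_sets(len(x), i)
xi = server_answer(x, q1) ^ server_answer(x, q2)
print(xi)                      # recovers x[2] = 1
```

Correctness follows because every index except i appears in both query sets or in neither, so its bit cancels in the XOR.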
79
Conclusion & Outlook
Current infrastructures: MS Azure, Amazon RDS + SimpleDB, Amazon Dynamo, Google BigTable, Yahoo! PNUTS
Conclusion
Challenges & Trends
80
Current Solutions
[Figure: solutions arranged between "one DB per client" (virtualization: Amazon RDS, Microsoft SQL Azure) and "one DB for all clients" (distributed storage: Amazon SimpleDB / Dynamo, Amazon S3, Google Bigtable, Cassandra, Voldemort, Yahoo! PNUTS), with replication as a further dimension]
81
Microsoft SQL Azure
Cloud database service for the Azure platform
Allows to create a SQL server = group of databases spread across multiple physical machines (incl. geo-location)
Supports relational model and T-SQL (tables, views, indices, triggers, stored procedures)
Deployment and administration using SQL Server Management Studio
Current limitations: individual database size max. 10 GB; no support for CLR, distributed queries & transactions, spatial data
82
Microsoft SQL Azure: Details
Databases implemented as replicated data partitions across multiple physical nodes; provide load balancing and failover
API: SQL, ADO.NET, ODBC; Tabular Data Streams; SQL Server Authentication; Sync Framework
Prices: 1 GB database: $9.99/month, 10 GB: $99.99/month, + data transfer
SLA: 99.9% availability
83
Microsoft Azure: Other Services
Azure Blob: blob storage; PUT/GET interface via REST
Azure Table: structured storage; LINQ, ADO.NET interface
[Figure: a storage account contains tables (e.g. Customer, Order); each table holds entities (e.g. Customer #1, Customer #2) with properties such as Name and Address]
Properties can be defined per entity; max size of an entity: 1 MB
Partition key: used for assigning entities to partitions; row key: unique ID within a partition
Sort order: single index per table
Atomic transactions within a partition
84
Amazon RDS
Amazon Relational Database Service: Web Service to set up and operate a MySQL database
Full-featured MySQL 5.1
Automated database backup
Java-based command line tools and Web Service API for instance administration
Native DB access
Prices: small DB instance (1.7 GB memory, 1 ECU): $0.11/hour; largest DB instance (68 GB, 26 ECU): $3.10/hour; + $0.10 per GB-month storage + data transfer
85
Amazon Data Services
Amazon Simple Storage Service (S3):
Distributed blob storage for objects (1 byte ... 5 GB data)
REST-based interface to read, write, and delete objects identified by a unique, user-defined key
Atomic single-key updates; no locking
Eventual consistency (partially read-after-write)
Aug 2009: more than 64 billion objects
Amazon SimpleDB (= Amazon Dynamo???):
Distributed structured storage
Web Service API for access
Eventual consistency
86
Amazon SimpleDB
Data model:
Relational-like data model: a domain = collection of items described by key-value pairs; max size 10 GB
Attributes can be added to individual items (256 per item)
[Figure: a domain (e.g. Customer) contains items (e.g. Customer #1, Customer #2) with attribute-value pairs such as Name: Wolfgang, City: Dresden]
Queries:
Restricted to a single domain
SFW syntax + count() + multi-attribute predicates
Only string-valued data: lexicographical comparisons
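Because comparisons are lexicographic on strings, numeric attributes are commonly zero-padded on the client before being stored, so that string order matches numeric order. This is a client-side convention, not a SimpleDB API; the width below is an arbitrary choice.

```python
# SimpleDB compares values lexicographically as strings, so "9" > "10".
# A common client-side workaround is to zero-pad numbers to a fixed width
# before storing them (the width here is an arbitrary choice).

def encode_int(n, width=10):
    return str(n).zfill(width)

prices = [5, 40, 312, 9]
# Plain string sort gets the order wrong ...
assert sorted(map(str, prices)) == ['312', '40', '5', '9']
# ... while zero-padded strings sort like the numbers themselves
padded = sorted(encode_int(p) for p in prices)
assert [int(p) for p in padded] == [5, 9, 40, 312]
print(padded[0])   # '0000000005'
```

The same trick applies to dates (store them in ISO 8601 form) so that range predicates over string attributes behave as intended.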
87
Amazon Dynamo
Highly available and scalable key-value data store for the Amazon platform
Manages the state of Amazon services providing bestseller lists, shopping carts, customer preferences, product catalogs → these require only primary-key access (e.g. product id, customer id)
Completely decentralized, minimal need for manual administration (e.g. partitioning, redistribution)
Assumptions:
Simple query model: put/get operations on keys, small objects (< 1 MB)
Weaker consistency but high availability ("always writable" data store), no isolation guarantees
Efficiency: running on commodity hardware, guaranteed latency = SLAs, e.g. 300 ms response time for 99.9% of requests at a peak load of 500 requests/sec.
88
Dynamo: Partitioning and Replication
Partitioning scheme based on consistent hashing
Virtual nodes: each physical node is responsible for more than one virtual node
Replication: each data item is replicated at N nodes
[Figure: the key space as a ring with nodes A..E; node C is responsible for the key range (B,C], and replicas of keys from that range are placed on the nodes following C on the ring]
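The ring construction can be sketched in a few lines; this is a small illustration in the spirit of (but much simpler than) Dynamo's scheme, with arbitrary node names, hash function, and virtual-node count.

```python
import bisect
import hashlib

# Consistent hashing with virtual nodes: each physical node owns several
# points on the ring, and a key is stored on the first N distinct nodes
# found clockwise from the key's hash (its preference list).

def h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=8):
        # One ring point per (node, virtual node) pair, sorted by hash
        self.ring = sorted((h(f"{n}#{v}"), n) for n in nodes for v in range(vnodes))
        self.points = [p for p, _ in self.ring]

    def preference_list(self, key, n=3):
        """Walk clockwise from the key's position, collecting n distinct
        nodes: the coordinator plus n-1 replica holders."""
        idx = bisect.bisect(self.points, h(key)) % len(self.ring)
        out = []
        while len(out) < n:
            node = self.ring[idx % len(self.ring)][1]
            if node not in out:
                out.append(node)
            idx += 1
        return out

ring = Ring(["A", "B", "C", "D", "E"])
print(ring.preference_list("cart:4711"))
```

Virtual nodes smooth out load: when a physical node joins or leaves, its many small ring segments are redistributed across all remaining nodes instead of landing on a single neighbor.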
89
Dynamo: Data Versioning
Provides eventual consistency → asynchronous propagation of updates
Updates result in a new version of the data
Vector clocks for capturing causalities between different versions of the same object
Vector clock = list of (node, counter) pairs
Determine causal ordering/parallel branches of versions
Update requests have to specify which version is to be updated
Reconciliation during client reads!
[Figure: write(D)@NA yields D1([NA,1]) and then D2([NA,2]); parallel writes at NB and NC yield D3([NA,2],[NB,1]) and D4([NA,2],[NC,1]); reconcile(D)@NA merges them into D5([NA,3],[NB,1],[NC,1])]
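The causality test on vector clocks can be sketched as follows, using the D2/D3/D4 versions from the figure; clocks are represented as dicts here for simplicity.

```python
# Vector-clock comparison as used in Dynamo's data versioning: one version
# causally succeeds another iff all its counters are >= and (if unequal)
# at least one is strictly greater.

def descends(vc_new, vc_old):
    """True if vc_new is a causal successor of (or equal to) vc_old."""
    return all(vc_new.get(node, 0) >= c for node, c in vc_old.items())

def conflicting(a, b):
    """Parallel branches: neither clock descends from the other, so the
    client must reconcile the versions on read."""
    return not descends(a, b) and not descends(b, a)

d2 = {"NA": 2}                 # D2([NA,2])
d3 = {"NA": 2, "NB": 1}        # D3: written at NB after seeing D2
d4 = {"NA": 2, "NC": 1}        # D4: written at NC after seeing D2
print(descends(d3, d2))        # True: D3 happened after D2
print(conflicting(d3, d4))     # True: parallel branches, reconcile on read
```

Reconciliation then produces a merged version whose clock takes the element-wise maximum of the branches plus a new counter increment at the coordinating node, as D5 does in the figure.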
90
Dynamo: Replica Maintenance
Consistency among replicas:
Quorum protocol: R nodes must participate in a read, W nodes in a write; R + W > N
Sloppy quorum: reads/writes are performed on the first N healthy nodes
Preference list: list of nodes which are responsible for storing a given key
For highest availability: W = 1
Replica synchronization:
Anti-entropy via Merkle trees: hash trees where leaves are hashes of keys and non-leaves are hashes of their children
If the hash values of two nodes are equal, there is no need to check their children
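A simplified sketch of the Merkle-tree comparison (an illustration, not Dynamo's implementation; it assumes a power-of-two leaf count and, when the roots differ, compares leaves directly rather than walking the tree top-down):

```python
import hashlib

# Merkle-tree anti-entropy sketch: each replica builds a hash tree over its
# key range; if the roots match, the replicas are in sync and no further
# data needs to be exchanged.

def h(data):
    return hashlib.sha256(data).digest()

def build(leaves):
    """Build the tree bottom-up; returns the list of levels, root level last.
    Assumes a power-of-two number of leaves."""
    level = [h(leaf.encode()) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def diff_leaves(t1, t2):
    """Equal roots prune the whole comparison; otherwise return the indices
    of differing leaves (the recursive top-down walk is omitted here)."""
    if t1[-1] == t2[-1]:
        return []
    return [i for i, (a, b) in enumerate(zip(t1[0], t2[0])) if a != b]

r1 = build(["k1=v1", "k2=v2", "k3=v3", "k4=v4"])
r2 = build(["k1=v1", "k2=XX", "k3=v3", "k4=v4"])
print(diff_leaves(r1, r2))   # only leaf 1 must be synchronized
```

In the common case where replicas agree, a single root comparison replaces transferring the whole key range, which is what makes anti-entropy cheap.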
91
Google BigTable
Fast and large-scale DBMS for Google applications and services
Designed to scale into the PB range
Uses the distributed Google File System (GFS) for storing data and log files
Depends on a cluster management system for managing resources, monitoring state, scheduling, ...
Can be used as input source and output target for MapReduce programs
92
BigTable: Data Model
Bigtable = sparse, distributed, multi-dimensional sorted map
Indexed by row key, column key, timestamp; value = array of bytes
Row keys up to 64 KB; column keys grouped into column families
Timestamp (64-bit int) used for versioning
Data is maintained in lexicographic order by row keys
Row range is dynamically partitioned ➪ tablet = unit of distribution and load balancing
Read/write ops under a single row key are atomic
[Figure: a cell addressed by row key and column key holds multiple timestamped versions t1, t2 of the value]
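The logical model maps naturally onto a dictionary keyed by (row, column) with timestamped versions per cell. A toy sketch, using the example row key from the Bigtable paper; the read semantics (latest version at or before a timestamp) are a simplification of the real API:

```python
# Toy sketch of Bigtable's logical model: a sparse map indexed by
# (row key, column key, timestamp) with values as uninterpreted bytes.

class Table:
    def __init__(self):
        self.cells = {}                      # (row, column) -> {ts: value}

    def put(self, row, column, ts, value):
        self.cells.setdefault((row, column), {})[ts] = value

    def get(self, row, column, ts=None):
        """Return the latest version, or the latest at or before ts."""
        versions = self.cells.get((row, column), {})
        valid = [t for t in versions if ts is None or t <= ts]
        return versions[max(valid)] if valid else None

t = Table()
t.put("com.cnn.www", "contents:", 1, b"<html>v1")
t.put("com.cnn.www", "contents:", 2, b"<html>v2")
print(t.get("com.cnn.www", "contents:"))        # b'<html>v2'
print(t.get("com.cnn.www", "contents:", ts=1))  # b'<html>v1'
```

The real system additionally keeps rows sorted for range scans and groups columns into families for access control and locality; this sketch only shows the (row, column, timestamp) → value addressing.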
93
BigTable: System Architecture
Single-master distributed storage system; the master server is responsible for:
Assigning tablets to tablet servers
Load balancing on tablet servers
Detecting addition and expiration of tablet servers
Garbage collection of GFS files
Tablet servers:
Manage sets of tablets (10...1000 tablets per server, 100...200 MB per tablet)
Handle read/write requests
Split tablets
Distributed, persistent lock/name service: Chubby
Uses Paxos for replica consistency (5 replicas)
Provides a namespace consisting of directories and files; allows discovery of tablet servers
94
BigTable: Tablets
Internally stored in SSTables: immutable, sorted files of key-value pairs, organized in 64 KB blocks + an index over block ranges
Tablet location:
Chubby contains the location of the root tablet
The root tablet contains the location of all tablets of a METADATA table
METADATA tablets contain the location of user tablets + end key row (sparse index)
The three-level scheme addresses 2^34 tablets
Cached by the client library
[Figure: Chubby file → root tablet → METADATA tablets → user tables]
95
BigTable: Tablets /2
Tablet assignment:
Starting tablet servers acquire an exclusive lock in Chubby → allows discovery of tablet servers
Periodic checks by the master on the lock status of tablet servers
Replication of data performed by GFS
Tablet serving:
Updates (mutations) are logged and then applied to an in-memory version (memtable)
Compactions: convert the memtable into an SSTable; merge SSTables
96
Yahoo! PNUTS
Yahoo!'s data serving platform
Data & query model:
Simple relational model: tables of records with attributes (incl. blob types)
Flexible schema evolution by adding attributes at any time
Queries: single-table selection & projection
Updates & deletions based on primary-key access
Storage model: records as parsed JSON objects; filesystem-based hash tables or MySQL InnoDB engine
97
PNUTS Architecture
[Figure: clients access the system via a REST API; routers and a tablet controller direct requests to storage units; a message broker propagates updates between replicas]
98
PNUTS: Consistency & Replication
Consistency model:
Per-record timeline consistency: all replicas apply all updates in the same order
User-specific guarantees: read-any, read-latest, read-newer-than, writes, write-after-version
Partitioning and replication:
Tables horizontally partitioned into tablets (100 MB ... 10 GB)
Each server is responsible for 100+ tablets
Asynchronous replication using the message broker (publish/subscribe): guarantees delivery of messages (incl. logging), provides partial ordering of messages
Record-level membership + mastership-migration protocol
99
Comparison

                         Dynamo       Bigtable                      PNUTS                              Amazon RDS   SQL Azure
Query Model              get          get + key-based range scans   single-table selection+projection  SQL          SQL
Logical Data Model       key-value    flexible tables               flexible tables                    relational   relational
Consistency Model        eventual     relaxed                       per-record timeline consistency    strict       strict
Transaction Guarantees   ?            row-level                     row-level                          ACID         ACID
Replication              data-level   GFS                           record-level                       DB-level     DB-level
100
Conclusion
DBaaS = outsourcing databases to reduce TCO
Reduce operational/administration costs
Pay-as-you-go model
Wide spectrum of solutions: "rent a database" ... cloud databases
Use cases: database hosting, hosted services, large-scale data analytics
101
Challenges & Trends
[Figure: challenges mapped onto the tutorial topics: virtualization, distributed storage, logical data model, storage model, query & programming model, service level agreements]
Resource provisioning:
• Virtualization on system and database level
Service-level agreements:
• Shielding: one (virtual) box per client
• Limiting functionality: SQL vs. put/get operations
• Workload management
Scalability and availability:
• Through redundancy and partitioning
• But may affect the consistency model
Expressiveness:
• Limiting functionality: SQL vs. put/get vs. MR
Confidentiality and trust:
• Data encryption
• Information distribution
102
References
F. Chang et al.: Bigtable: A Distributed Storage System for Structured Data, OSDI 2006
B. F. Cooper, R. Ramakrishnan, U. Srivastava, A. Silberstein, P. Bohannon, H.-A. Jacobsen, N. Puz, D. Weaver, R. Yerneni: PNUTS: Yahoo!'s Hosted Data Serving Platform, Proceedings of the VLDB Endowment, 1(2), August 2008
R. Baldoni, M. Raynal: Fundamentals of Distributed Computing: A Practical Tour of Vector Clock Systems, IEEE Distributed Systems Online, 2002
E. Brewer: Towards Robust Distributed Systems, PODC 2000
S. Gilbert, N. Lynch: Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services, ACM SIGACT News, 2002
W. Vogels: Eventually Consistent – Revisited, ACM Queue 6(6), 2008
D. Karger et al.: Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web, STOC '97
Y. Saito, M. Shapiro: Optimistic Replication, ACM Computing Surveys 37(1), 2005
S. Aulbach, T. Grust, D. Jacobs, A. Kemper, J. Rittinger: Multi-Tenant Databases for Software as a Service: Schema-Mapping Techniques, SIGMOD 2008: 1195-1206
103
References
G. DeCandia et al.: Dynamo: Amazon's Highly Available Key-Value Store, SOSP 2007
P. Bernstein et al.: Data Management Issues in Supporting Large-Scale Web Services, IEEE Data Engineering Bulletin, Dec. 2006
M. Brantner et al.: Building a Database on S3, SIGMOD 2008
A. Aboulnaga, C. Amza, K. Salem: Virtualization and Databases: State of the Art and Research Challenges, EDBT 2008: 746-747
A. A. Soror, U. F. Minhas, A. Aboulnaga, K. Salem, P. Kokosielis, S. Kamath: Automatic Virtual Machine Configuration for Database Workloads, SIGMOD 2008: 953-966
C. Olston, B. Reed, U. Srivastava, R. Kumar, A. Tomkins: Pig Latin: A Not-So-Foreign Language for Data Processing, SIGMOD 2008
R. Pike, S. Dorward, R. Griesemer, S. Quinlan: Interpreting the Data: Parallel Analysis with Sawzall, Scientific Programming 13(4):277-298, October 2005
104
References
R. Chaiken, B. Jenkins, P. Larson, B. Ramsey, D. Shakib, S. Weaver, J. Zhou: SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets, Proceedings of the VLDB Endowment, 1(2), August 2008
B. Hore, S. Mehrotra, G. Tsudik: A Privacy-Preserving Index for Range Queries, VLDB 2004: 720-731
H. Hacigümüş, B. Iyer, C. Li, S. Mehrotra: Executing SQL over Encrypted Data in the Database-Service-Provider Model, SIGMOD 2002
D. Agrawal, A. El Abbadi, F. Emekçi, A. Metwally: Database Management as a Service: Challenges and Opportunities, ICDE 2009: 1709-1716
A. Shamir: How to Share a Secret, Communications of the ACM 22(11):612-613, Nov. 1979
F. Kerschbaum, J. Vayssière: Privacy-Preserving Data Analytics as an Outsourced Service, ACM Workshop on Secure Web Services 2008
B. Chor, O. Goldreich, E. Kushilevitz, M. Sudan: Private Information Retrieval, FOCS 1995