Database as a Service - Tutorial @ICDE 2010
1
Database as a Service
Seminar, ICDE 2010, Long Beach, March 04
Wolfgang Lehner | Dresden University of Technology, Germany
Kai-Uwe Sattler | Ilmenau University of Technology, Germany
2
Introduction: Motivation, SaaS, Cloud Computing, Use Cases
3
Software as a Service (SaaS)
Traditional software: build your own.
On-demand utility: plug in, subscribe, pay-per-use.
4
Comparison of Business Models
Traditional packaged software vs. Software as a Service (SaaS):
- Designed for customers to install, manage and maintain vs. designed for delivery as Internet-based services.
- Solutions architected to be run by an individual company in a dedicated instantiation of the software vs. designed to run thousands of different customers on a single code base.
- Infrequent major upgrades, sold individually to each installed-base customer vs. frequent small upgrades that minimize customer disruption and enhance satisfaction.
- Version control and upgrade fees vs. fixing a problem for one customer fixes it for everyone.
5
Avoid the Hidden Costs of Traditional SW
Traditional software: SW licenses, maintenance, hardware, IT staff, training, customization.
SaaS: subscription fee, training, customization.
The Long Tail
Dozens of markets of millions, or millions of markets of dozens?
(Figure: $/customer over # of customers, spanning your large customers, your typical customers, and the (currently) "non-addressable" customers.)
What if you lower your cost of sale (i.e., lower the barrier to entry) and also lower the cost of operations? The new addressable market >> the current market.
6
7
EC2 & S3
What is Cloud? Gartner's Definition
Cloud Computing: a style of computing where massively scalable, IT-enabled capabilities are provided "as a service" across the Internet to multiple external customers.
"It's about economies of scale, with effective and dynamic sharing."
- Acquisition model: service ("All that matters is results; I don't care how it is done")
- Business model: pay for usage ("I don't want to own assets; I want to pay for elastic usage, like a utility")
- Technical model: scalable, elastic, shareable
- Access model: Internet ("I want accessibility from anywhere, from any device")
8
To Qualify as a Cloud: Common, Location-independent, Online Utility on Demand*
- Common implies multi-tenancy, not single or isolated tenancy.
- Utility implies pay-for-use pricing.
- On demand implies ~infinite, ~immediate, ~invisible scalability.
Alternatively, a "Zero-One-Infinity" definition:**
- 0: on-premise infrastructure, acquisition cost, adoption cost, support cost.
- 1: a coherent and resilient environment, not a brittle "software stack".
- Infinity: scalability in response to changing need; integratability/interoperability with legacy assets and other services; customizability/programmability from data, through logic, up into the user interface, without compromising robust multi-tenancy.
* Joe Weinman, Vice President of Solutions Sales, AT&T, 3 Nov. 2008
** From The Jargon File: "Allow none of foo, one of foo, or any number of foo"
9
Cloud Differentials: Service Models
- Cloud Software as a Service (SaaS): use the provider's applications over a network.
- Cloud Platform as a Service (PaaS): deploy customer-created applications to a cloud.
- Cloud Infrastructure as a Service (IaaS): rent processing, storage, network capacity, and other fundamental computing resources.
10
Cloud Differentials: Characteristics
- Size/Location: large scale (AWS, Google, IBM/Google) vs. small scale (SMB, academia)
- Purpose: general purpose vs. special purpose (e.g., DB cloud)
- Administration/Jurisdiction: public vs. private
- Platform: physical vs. virtual; homogeneous vs. heterogeneous
- Design paradigms: storage, CPU, bandwidth
- Usage model: exclusive, shared, pseudo-shared
11
Use Cases: Large-Scale Data Analytics
Outsource your data and use cloud resources for analysis:
- historical and mostly non-critical data
- parallelizable, read-mostly, highly variant workloads
- relaxed ACID guarantees
Examples (Hadoop PoweredBy): Yahoo!: research for ad systems and Web search; Facebook: reporting and analytics; Netseer.com: crawling and log analysis; Journey Dynamics: traffic speed forecasting.
12
Use Cases: Database Hosting
Public datasets:
- Biological databases: a single repository instead of > 700 separate databases
- Semantic Web data, linked data, ...
- Sloan Digital Sky Survey, TwitterCache
- Already on Amazon AWS: annotated human genome data, US census, Freebase, ...
Also: archiving, metadata indexing, ...
13
Use Cases: Service Hosting
Data management for SaaS solutions: run the services near the data (= ASP).
Already many existing applications: CRM (e.g. Salesforce, SugarCRM), Web analytics, supply chain management, help desk management, enterprise resource planning (e.g. SAP Business ByDesign), ...
14
Foundations & Architectures
Virtualization; programming models; consistency models & replication; SLAs & workload management; security.
15
Topics covered in this Seminar
Virtualization, Distributed Storage, Logical Data Model, Storage Model, Query & Programming Model, Multi-Tenancy, Replication, Service Level Agreements, Security
17
... it‘s simple!
18
Virtualization
Separating the abstract view of computing resources from the implementation of these resources:
- adds flexibility and agility to the computing infrastructure
- softens problems related to provisioning, manageability, ...
- lowers TCO: fewer computing resources
Classical driving factor: server consolidation (EDBT 2008 tutorial, Aboulnaga et al.): consolidate an e-mail server, a Web server, and a database server (each on Linux) onto virtualized hardware for improved utilization.
What can be virtualized: the big four.
19
20
Different Types of Virtualization
(Figure: applications APP 1-5 run on operating systems inside Virtual Machine 1 and Virtual Machine 2; a Virtual Machine Monitor (VMM) maps each VM's virtual CPUs, memory, and network onto the physical machine and physical storage.)
21
Virtual Machines
- Technique with a long history (since the 1960's); prominent since the IBM 370 mainframe series; today applied to large-scale commodity hardware and operating systems.
- Virtual Machine Monitor (Hypervisor): strong isolation between virtual machines (security, privacy, fault tolerance); flexible mapping between virtual machines and physical resources; classical operations: pause, resume, checkpoint, migrate (administration / load balancing).
- Software deployment: preconfigured virtual appliances; repositories of virtual appliances on the web.
22
DBMS on top of Virtual Machines
... yet another application? ... overhead?
(Example: SQL Server within VMware.)
23
Virtualization Design Advisor
- What fraction of node resources goes to which DBMS? (configuring the VM parameters)
- What parameter settings are best for a given resource configuration? (configuring the DBMS parameters)
Example: Workload 1 is TPC-H (10 GB); Workload 2 is TPC-H (10 GB), only Q18 (132 copies). The virtualization design advisor assigns 20% of the CPU to Workload 1 and 80% to Workload 2.
24
Some Experiments
- Workload definition based on TPC-H: Q18 is one of the most CPU-intensive queries; Q21 is one of the least CPU-intensive queries.
- Workload units: C = 25x Q18; I = 1x Q21.
- Experiment: sensitivity to workload resource needs. W1 = 5C + 5I; W2 = kC + (10-k)I (increasing k makes W2 more CPU-intensive). Evaluated on DB2 and Postgres.
26
Virtualization in DBaaS environments
(Figure: layered architecture with a HW layer, a VM layer of multiple virtual machines, a DB server layer, an instance layer, and a DB layer.)
27
Existing Tools for Node Virtualization
(Figure: the layered stack annotated with existing tools: a DB advisor (indexes, MQTs, MDC, redistribution of tables) and a DB workload manager on the DB layer; VM configuration (monitoring, resource configuration, (manual) migration) and a node resource model on the VM layer.)
Static environment assumptions:
- The advisor expects a static hardware environment.
- The VM expects static (peak) resource requirements.
- Interactions between the layers can improve performance/utilization.
28
Layer Interactions (2)
Experiment: DB2 on Linux, TPC-H workload on a 1 GB database. Ranges for the resource grants: main memory (bufferpool) from 50 MB to 1 GB; additional storage (indexes) from 5% to 30% of the DB size. Varying the advisor output (17-26 indexes) yields different possible improvements and different expected performance after improvement.
(Figures: expected performance (5%-25%) and possible improvement (from <1% up to 90%) plotted over bufferpool size (200 MB to 1 GB) and index storage; the outputs of the DB advisor and the VM configuration interact.)
29
Storage Virtualization
General goal: provide a layer of indirection to allow the definition of virtual storage devices; minimize/avoid downtime (local and remote mirroring); improve performance (distribution/balancing, provisioning, control of placement); reduce the cost of storage administration.
Operations:
- create, destroy, grow, shrink virtual devices
- change size, performance, reliability, ... (workload fluctuations, hierarchical storage management)
- versioning, snapshots, point-in-time copies (backup, checkpoints)
- exploit CPU and memory in the storage system (caching, executing low-level DBMS functions)
30
Virtualization in DBaaS Environments (2)
(Figure: the layered architecture (HW, VM, DB server, instance, DB layers) extended by a storage layer with shared disks and local disks.)
31
Virtualization in DBaaS Environments (2)
(Figure: the layered stack including the storage layer, annotated with tools: DB advisor (indexes, MQTs, MDC, redistribution of tables), DB workload manager, storage configuration (device bundling, replication, archiving), and a storage resource model.)
One way to go? Paravirtualization
- CPU and memory paravirtualization extends the guest to allow direct interaction with the underlying hypervisor; it reduces the monitor cost, including memory and system-call operations. The gains from paravirtualization are workload-specific.
- Device paravirtualization places a high-performance, virtualization-aware device driver into the guest. Paravirtualized drivers are more CPU-efficient (less CPU overhead for virtualization) and can also take advantage of hardware features such as partial offload.
33
Outline
Virtualization, Distributed Storage, Logical Data Model, Storage Model, Query & Programming Model, Multi-Tenancy, Replication, Service Level Agreements, Security
34
Multi-Tenancy
Goal: consolidate multiple customers onto the same operational system.
Requirements:
- Extensibility: customer-specific schema changes
- Security: preventing unauthorized data accesses by other tenants
- Performance/scalability: scale-up & scale-out
- Maintenance: on the tenant level instead of on the database level
The approaches range from a separate DB per tenant (flexible, but limited scalability) over a shared DB with separate schemas to a shared DB with a shared schema (best resource utilization).
35
Flexible Schema Approaches
Goal: allow tenant-specific schema additions (columns).
Approaches: universal table, extension table, pivot table.
36
Flexible Schema Approaches: Comparison
The approaches span a spectrum from "application owns the schema" (private tables, extension tables: best performance) to "database owns the schema" (universal table, chunk folding, pivot table, XML columns: flexible schema evolution).
- Universal table: requires techniques for handling sparse data; fine-grained index support is not possible.
- Pivot table: requires joins for reconstructing logical tuples.
- Chunk folding: similar to pivot tables; groups of columns are combined into a chunk and mapped into a chunk table; requires complex query transformations.
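A minimal Python sketch of the pivot-table mapping compared above (the table name, tenant ID, and values are invented for illustration): every logical column value becomes one (tenant, table, row, column, value) entry, and reconstructing a logical tuple means regrouping the entries per row; in SQL this is exactly the per-column join the comparison warns about.

```python
from collections import defaultdict

# One (tenant, table, row, column, value) entry per logical column value;
# "hospital" plays the role of a tenant-specific extension column.
pivot = [
    (42, "account", 1, "name", "Acme"),
    (42, "account", 1, "hospital", "St. Mary"),
    (42, "account", 2, "name", "Gump"),
]

def reconstruct(tenant, table, columns):
    """Rebuild the logical tuples of one tenant from the pivot entries."""
    rows = defaultdict(dict)
    for t, tab, row, col, val in pivot:
        if t == tenant and tab == table and col in columns:
            rows[row][col] = val
    return [{"row": r, **cols} for r, cols in sorted(rows.items())]

print(reconstruct(42, "account", ["name", "hospital"]))
```

Sparse extension columns simply have no entry for a row, which is the flexibility the approach buys at the cost of reconstruction joins.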
37
Access Control in Multi-Tenant DBs
Shared-DB approaches require row-level access control:
- Query transformation: ... WHERE TenantID = 42 ... (potential security risks)
- DBMS-level control, e.g. IBM DB2 LBAC (label-based access control): controls read/write access to individual rows and columns via security labels with policies; requires a separate account for each tenant.
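The query-transformation approach can be sketched as a simple rewrite step. This is a deliberately simplified illustration; real systems rewrite the query plan rather than the SQL string, and naive string rewriting is one source of the security risks mentioned above:

```python
def add_tenant_filter(sql, tenant_id):
    """Append a TenantID predicate so a tenant only sees its own rows."""
    sql = sql.rstrip().rstrip(";")
    clause = f"TenantID = {int(tenant_id)}"  # int() rejects injection attempts
    if " where " in sql.lower():
        return f"{sql} AND {clause};"
    return f"{sql} WHERE {clause};"

print(add_tenant_filter("SELECT * FROM accounts", 42))
# -> SELECT * FROM accounts WHERE TenantID = 42;
```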
38
In a Nutshell
How shall virtualization be handled on ...
- the machine level (VM to HW), the DBMS level (database to instance to database server), the schema level (multi-tenancy)
... using ...
- allocation between layers, configuration inside layers, flexible schemas
... when ...
- the characteristics of the workloads are known, virtual machines are transparent, tenant-specific schema extensions exist
... demanding that ...
- SLAs and security are respected, each node's utilization is maximized, the number of nodes is minimized
39
Outline
Virtualization, Distributed Storage, Logical Data Model, Storage Model, Query & Programming Model, Multi-Tenancy, Replication, Service Level Agreements, Security
40
MapReduce Background
- Programming model and associated implementation for large-scale data processing (Google); related approaches: Apache Hadoop and Microsoft Dryad.
- User-defined map & reduce functions:
  map(in_key, in_value) -> list of (out_key, intermediate_value)
  reduce(out_key, list of intermediate_value) -> list of out_value
- The infrastructure hides the details of parallelization and provides fault tolerance, data distribution, I/O scheduling, load balancing, ...
Logic Flow of WordCount
Input: "Hadoop Map/Reduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner..."
The mapper emits (word, 1) pairs, e.g. (Hadoop, 1), (Map, 1), (Reduce, 1), (is, 1), (a, 1). The sort/shuffle phase groups the pairs per word, e.g. Hadoop -> [1, 1, 1, ..., 1]. The reducer sums each list, e.g. (Hadoop, 5), (Map, 12), (Reduce, 12), (is, 42), (a, 23).
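The data flow above can be simulated in a few lines. This is a minimal in-memory sketch of the map / sort-shuffle / reduce phases of the model, not of Hadoop itself:

```python
from itertools import groupby

def map_fn(line):
    # Map phase: emit (word, 1) for every word in the input line.
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    # Reduce phase: sum the counts collected for one word.
    return (word, sum(counts))

def word_count(lines):
    intermediate = [kv for line in lines for kv in map_fn(line)]  # map
    intermediate.sort(key=lambda kv: kv[0])                       # sort/shuffle
    return [reduce_fn(k, [v for _, v in group])                   # reduce
            for k, group in groupby(intermediate, key=lambda kv: kv[0])]

print(word_count(["a b a", "b a"]))
# -> [('a', 3), ('b', 2)]
```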
42
MapReduce Disadvantages
- Extremely rigid data flow (map phases followed by reduce phases).
- Common operations must be coded by hand: join, filter, split, projection, aggregates, sorting, distinct.
- User plans may be suboptimal and lead to performance degradation.
- Semantics are hidden inside the map and reduce functions: inflexible, difficult to maintain, extend and optimize.
Remedy: combine high-level declarative querying with low-level MapReduce programming via dataflow programming languages such as Hive, JAQL and Pig.
43
Pig Latin
- On top of MapReduce/Hadoop.
- Mix of the declarative style of SQL and the procedural style of MapReduce.
- Consists of two parts: Pig Latin, a data processing language, and the Pig infrastructure, an evaluator for Pig Latin programs. Pig compiles Pig Latin into physical plans, which are executed over Hadoop.
- 30% of all queries at Yahoo! are in Pig Latin.
- Open source: http://incubator.apache.org/pig
44
Example

Visits:
User | URL        | Time
Amy  | cnn.com    | 8:00
Amy  | bbc.com    | 10:00
Amy  | flickr.com | 10:05
Fred | cnn.com    | 12:00

URL Info:
URL        | Category | PageRank
cnn.com    | News     | 0.9
bbc.com    | News     | 0.8
flickr.com | Photos   | 0.7
espn.com   | Sports   | 0.9

Task: determine the most visited websites in each category.
45
Implementation in MapReduceimport java.io.IOException; import java.util.ArrayList; import java.util.Iterator; import java.util.List; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.LongWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.io.Writable; import org.apache.hadoop.io.WritableComparable; import org.apache.hadoop.mapred.FileInputFormat; import org.apache.hadoop.mapred.FileOutputFormat; import org.apache.hadoop.mapred.JobConf; import org.apache.hadoop.mapred.KeyValueTextInputFormat; import org.apache.hadoop.mapred.Mapper; import org.apache.hadoop.mapred.MapReduceBase; import org.apache.hadoop.mapred.OutputCollector; import org.apache.hadoop.mapred.RecordReader; import org.apache.hadoop.mapred.Reducer; import org.apache.hadoop.mapred.Reporter; import org.apache.hadoop.mapred.SequenceFileInputFormat; import org.apache.hadoop.mapred.SequenceFileOutputFormat; import org.apache.hadoop.mapred.TextInputFormat; import org.apache.hadoop.mapred.jobcontrol.Job; import org.apache.hadoop.mapred.jobcontrol.JobControl; import org.apache.hadoop.mapred.lib.IdentityMapper; public class MRExample { public static class LoadPages extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> { public void map(LongWritable k, Text val, OutputCollector<Text, Text> oc, Reporter reporter) throws IOException { // Pull the key out String line = val.toString(); int firstComma = line.indexOf(','); String key = line.substring(0, firstComma); String value = line.substring(firstComma + 1); Text outKey = new Text(key); // Prepend an index to the value so we know which file // it came from. 
Text outVal = new Text("1" + value); oc.collect(outKey, outVal); } } public static class LoadAndFilterUsers extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> { public void map(LongWritable k, Text val, OutputCollector<Text, Text> oc, Reporter reporter) throws IOException { // Pull the key out String line = val.toString(); int firstComma = line.indexOf(','); String value = line.substring(firstComma + 1); int age = Integer.parseInt(value); if (age < 18 || age > 25) return; String key = line.substring(0, firstComma); Text outKey = new Text(key); // Prepend an index to the value so we know which file // it came from. Text outVal = new Text("2" + value); oc.collect(outKey, outVal); } } public static class Join extends MapReduceBase implements Reducer<Text, Text, Text, Text> { public void reduce(Text key, Iterator<Text> iter, OutputCollector<Text, Text> oc, Reporter reporter) throws IOException { // For each value, figure out which file it's from and store it // accordingly. List<String> first = new ArrayList<String>(); List<String> second = new ArrayList<String>(); while (iter.hasNext()) { Text t = iter.next(); String value = t.toString(); if (value.charAt(0) == '1') first.add(value.substring(1)); else second.add(value.substring(1));
reporter.setStatus("OK"); } // Do the cross product and collect the values for (String s1 : first) { for (String s2 : second) { String outval = key + "," + s1 + "," + s2; oc.collect(null, new Text(outval)); reporter.setStatus("OK"); } } } } public static class LoadJoined extends MapReduceBase implements Mapper<Text, Text, Text, LongWritable> { public void map( Text k, Text val, OutputCollector<Text, LongWritable> oc, Reporter reporter) throws IOException { // Find the url String line = val.toString(); int firstComma = line.indexOf(','); int secondComma = line.indexOf(',', firstComma); String key = line.substring(firstComma, secondComma); // drop the rest of the record, I don't need it anymore, // just pass a 1 for the combiner/reducer to sum instead. Text outKey = new Text(key); oc.collect(outKey, new LongWritable(1L)); } } public static class ReduceUrls extends MapReduceBase implements Reducer<Text, LongWritable, WritableComparable, Writable> { public void reduce( Text key, Iterator<LongWritable> iter, OutputCollector<WritableComparable, Writable> oc, Reporter reporter) throws IOException { // Add up all the values we see long sum = 0; while (iter.hasNext()) { sum += iter.next().get(); reporter.setStatus("OK"); } oc.collect(key, new LongWritable(sum)); } } public static class LoadClicks extends MapReduceBase implements Mapper<WritableComparable, Writable, LongWritable, Text> { public void map( WritableComparable key, Writable val, OutputCollector<LongWritable, Text> oc, Reporter reporter) throws IOException { oc.collect((LongWritable)val, (Text)key); } } public static class LimitClicks extends MapReduceBase implements Reducer<LongWritable, Text, LongWritable, Text> { int count = 0; public void reduce( LongWritable key, Iterator<Text> iter, OutputCollector<LongWritable, Text> oc, Reporter reporter) throws IOException { // Only output the first 100 records while (count < 100 && iter.hasNext()) { oc.collect(key, iter.next()); count++; } } } public static void 
main(String[] args) throws IOException { JobConf lp = new JobConf(MRExample.class); lp.setJobName("Load Pages"); lp.setInputFormat(TextInputFormat.class);
lp.setOutputKeyClass(Text.class); lp.setOutputValueClass(Text.class); lp.setMapperClass(LoadPages.class); FileInputFormat.addInputPath(lp, new Path("/user/gates/pages")); FileOutputFormat.setOutputPath(lp, new Path("/user/gates/tmp/indexed_pages")); lp.setNumReduceTasks(0); Job loadPages = new Job(lp); JobConf lfu = new JobConf(MRExample.class); lfu.setJobName("Load and Filter Users"); lfu.setInputFormat(TextInputFormat.class); lfu.setOutputKeyClass(Text.class); lfu.setOutputValueClass(Text.class); lfu.setMapperClass(LoadAndFilterUsers.class); FileInputFormat.addInputPath(lfu, new Path("/user/gates/users")); FileOutputFormat.setOutputPath(lfu, new Path("/user/gates/tmp/filtered_users")); lfu.setNumReduceTasks(0); Job loadUsers = new Job(lfu); JobConf join = new JobConf(MRExample.class); join.setJobName("Join Users and Pages"); join.setInputFormat(KeyValueTextInputFormat.class); join.setOutputKeyClass(Text.class); join.setOutputValueClass(Text.class); join.setMapperClass(IdentityMapper.class); join.setReducerClass(Join.class); FileInputFormat.addInputPath(join, new Path("/user/gates/tmp/indexed_pages")); FileInputFormat.addInputPath(join, new Path("/user/gates/tmp/filtered_users")); FileOutputFormat.setOutputPath(join, new Path("/user/gates/tmp/joined")); join.setNumReduceTasks(50); Job joinJob = new Job(join); joinJob.addDependingJob(loadPages); joinJob.addDependingJob(loadUsers); JobConf group = new JobConf(MRExample.class); group.setJobName("Group URLs"); group.setInputFormat(KeyValueTextInputFormat.class); group.setOutputKeyClass(Text.class); group.setOutputValueClass(LongWritable.class); group.setOutputFormat(SequenceFileOutputFormat.class); group.setMapperClass(LoadJoined.class); group.setCombinerClass(ReduceUrls.class); group.setReducerClass(ReduceUrls.class); FileInputFormat.addInputPath(group, new Path("/user/gates/tmp/joined")); FileOutputFormat.setOutputPath(group, new Path("/user/gates/tmp/grouped")); group.setNumReduceTasks(50); Job groupJob = new 
Job(group); groupJob.addDependingJob(joinJob); JobConf top100 = new JobConf(MRExample.class); top100.setJobName("Top 100 sites"); top100.setInputFormat(SequenceFileInputFormat.class); top100.setOutputKeyClass(LongWritable.class); top100.setOutputValueClass(Text.class); top100.setOutputFormat(SequenceFileOutputFormat.class); top100.setMapperClass(LoadClicks.class); top100.setCombinerClass(LimitClicks.class); top100.setReducerClass(LimitClicks.class); FileInputFormat.addInputPath(top100, new Path("/user/gates/tmp/grouped")); FileOutputFormat.setOutputPath(top100, new Path("/user/gates/top100sitesforusers18to25")); top100.setNumReduceTasks(1); Job limit = new Job(top100); limit.addDependingJob(groupJob); JobControl jc = new JobControl("Find top 100 sites for users 18 to 25"); jc.addJob(loadPages); jc.addJob(loadUsers); jc.addJob(joinJob); jc.addJob(groupJob); jc.addJob(limit); jc.run(); } }
46
Example Workflow in Pig Latin
Dataflow: load Visits; group by url; foreach url generate count; load URL Info; join on url; group by category; foreach category generate top10 URLs.

visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);
store topUrls into '/data/topURLs';

- Operates directly over files.
- Schemas are optional and can be assigned dynamically.
- User-defined functions (UDFs) can be used in every construct: load, store, group, filter, foreach.
47
Compilation in MapReduce
Dataflow: load Visits; group by url; foreach url generate count; load URL Info; join on url; group by category; foreach category generate top10 URLs.
- Every group or join operation forms a map-reduce boundary.
- The other operations are pipelined into the map and reduce phases.
(The example compiles into three map-reduce jobs: Map1/Reduce1, Map2/Reduce2, Map3/Reduce3.)
48
Hive
Data warehouse infrastructure built on top of Hadoop, providing data summarization and ad-hoc querying.
- Simple query language: Hive QL (based on SQL)
- Extensible via custom mappers and reducers
- Subproject of Hadoop; no special "Hive format"
- http://hadoop.apache.org/hive/
49
Hive - Example

LOAD DATA INPATH '/data/visits' INTO TABLE visits;

INSERT OVERWRITE TABLE visitCounts
SELECT url, category, count(*)
FROM visits
GROUP BY url, category;

LOAD DATA INPATH '/data/urlInfo' INTO TABLE urlInfo;

INSERT OVERWRITE TABLE visitCounts
SELECT vc.*, ui.*
FROM visitCounts vc JOIN urlInfo ui ON (vc.url = ui.url);

INSERT OVERWRITE TABLE gCategories
SELECT category, count(*)
FROM visitCounts
GROUP BY category;

INSERT OVERWRITE TABLE topUrls
SELECT TRANSFORM (visitCounts) USING 'top10';
50
JAQL
- Higher-level query language for JSON documents, developed at IBM's Almaden research center.
- Supports several operations known from SQL: grouping, joining, sorting.
- Built-in support for loops, conditionals, recursion.
- Custom Java methods extend JAQL; JAQL scripts are compiled to MapReduce jobs.
- Various I/O: local FS, HDFS, HBase, custom I/O adapters.
- http://www.jaql.org/
51
JAQL - Example

registerFunction("top", "de.tuberlin.cs.dima.jaqlextensions.top10");

$visits = hdfsRead("/data/visits");
$visitCounts = $visits
  -> group by $url = $
     into { $url, num: count($) };

$urlInfo = hdfsRead("data/urlInfo");
$visitCounts = join $visitCounts, $urlInfo
               where $visitCounts.url == $urlInfo.url;

$gCategories = $visitCounts
  -> group by $category = $
     into { $category, num: count($) };

$topUrls = top10($gCategories);
hdfsWrite("/data/topUrls", $topUrls);
52
Outline
Virtualization, Distributed Storage, Logical Data Model, Storage Model, Query & Programming Model, Multi-Tenancy, Replication, Service Level Agreements, Security
53
ACID vs. BASE
ACID (traditional distributed data management): strong consistency; isolation; focus on "commit"; availability?; pessimistic; difficult evolution (e.g. of the schema).
BASE = Basically Available, Soft state, Eventually consistent (Web-scale data management): weak consistency; availability first; best effort; optimistic (aggressive); fast and simple; easier evolution.
54
CAP Theorem [Brewer 2000]
- Consistency: all clients have the same view, even in case of updates.
- Availability: all clients find a replica of the data, even in the presence of failures.
- Tolerance to network partitions: the system properties hold even when the network (system) is partitioned.
You can have at most two of these properties for any shared-data system.
55
CAP Theorem: Trade-offs
- Forfeit consistency: no consistency guarantees; updates with conflict resolution.
- Forfeit availability: on a partition event, simply wait until the data is consistent again; pessimistic locking.
- Forfeit partition tolerance: all nodes are in contact with each other, or everything is put into a single box; 2-phase commit.
56
CAP: Explanations
(Scenario: 1. process PA updates object o (PA := update(o)); 2. a message M propagates the update; 3. process PB reads o (PB := read(o)).)
A network partition means M is not delivered. Solutions?
- Synchronous message: <PA, M> is atomic; possible latency problems (availability).
- Transaction <PA, M, PB>: requires control over when PB happens; impacts partition tolerance or availability.
57
Consistency Models [Vogels 2008]
(Scenario: a distributed storage system holds data item D; process A performs the update D0 -> D1; processes A, B, and C subsequently read D.)
- Strong consistency: after the update completes, any subsequent access from A, B, or C will return D1.
- Weak consistency: does not guarantee that subsequent accesses will return D1; a number of conditions need to be met before D1 is returned.
- Eventual consistency: a special form of weak consistency; guarantees that, if no new updates are made, eventually all accesses will return D1.
58
Variations of Eventual Consistency
- Causal consistency: if A notifies B about the update, B will read D1 (but C need not!).
- Read-your-writes: A will always read D1 after its own update.
- Session consistency: read-your-writes inside a session.
- Monotonic reads: if a process has seen Dk, any subsequent access will never return any Di with i < k.
- Monotonic writes: guarantees to serialize the writes of the same process.
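Monotonic reads, for instance, can be enforced on the client side. A minimal sketch (the replica/version interface is invented for illustration):

```python
class Session:
    """Client session that tracks the highest version it has observed."""
    def __init__(self):
        self.max_seen = 0

    def read(self, replica_version, value):
        # A response older than what this session has already seen would
        # violate monotonic reads; a real client would retry another replica.
        if replica_version < self.max_seen:
            raise RuntimeError("stale replica: violates monotonic reads")
        self.max_seen = replica_version
        return value

s = Session()
print(s.read(3, "D3"))   # ok: the session has now seen version 3
try:
    s.read(2, "D2")      # an older version is rejected
except RuntimeError as err:
    print(err)
```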
59
Database Replication
Store the same data on multiple nodes in order to improve reliability, accessibility, and fault tolerance. The design space ranges from 1-copy consistency to relaxed consistency, and from single-master over multi-master to optimistic replication.
Optimistic strategies (lazy replication):
- allow replicas to diverge; require conflict resolution
- allow data to be accessed without a-priori synchronization
- propagate updates in the background; occasional conflicts are fixed after they happen
- improved availability, flexibility, and scalability, but see the CAP theorem
60
Optimistic Replication: Elements
(Figure: the five phases of optimistic replication: 1. operation submission, 2. propagation, 3. scheduling, 4. conflict resolution, 5. commitment.)
Y. Saito, M. Shapiro: Optimistic Replication, ACM Computing Surveys, 5(3):1-44, 2005
61
Conflict Resolution & Update Propagation
Strategies for handling conflicts: prohibit them (single master); ignore them (Thomas write rule); reduce them (dividing objects, ...); syntactic detect & repair (vector clocks); semantic detect & repair (app-specific ordering or preconditions).
Epidemic information dissemination:
- Updates pass through the system like infectious diseases.
- Pairwise communication: a site contacts others (randomly chosen) and sends its information, e.g. about updates.
- All sites process messages in the same way.
- Proactive behaviour: no failure recovery necessary!
- Basic approaches: anti-entropy, rumor mongering, ...
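Syntactic conflict detection with vector clocks can be sketched as follows (the site names are invented): an update supersedes another iff its clock is component-wise greater or equal and strictly greater somewhere; otherwise the two updates are concurrent and need semantic resolution.

```python
def dominates(a, b):
    """True iff vector clock a happened after b (component-wise >=, one >)."""
    sites = set(a) | set(b)
    return (all(a.get(s, 0) >= b.get(s, 0) for s in sites)
            and any(a.get(s, 0) > b.get(s, 0) for s in sites))

def relation(a, b):
    if dominates(a, b):
        return "a after b"
    if dominates(b, a):
        return "b after a"
    return "concurrent (conflict)"   # neither dominates: resolve semantically

print(relation({"s1": 2, "s2": 1}, {"s1": 1, "s2": 1}))  # -> a after b
print(relation({"s1": 2, "s2": 0}, {"s1": 1, "s2": 3}))  # -> concurrent (conflict)
```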
62
Outline
Virtualization, Distributed Storage, Logical Data Model, Storage Model, Query & Programming Model, Multi-Tenancy, Replication, Service Level Agreements, Security
63
The Notion of QoS and Predictability
A Service Level Agreement (SLA) is a common understanding about services, guarantees, and responsibilities. It has a legal part (fees, penalties, ...) and a technical part: service level objectives, i.e. specific measurable characteristics such as importance and performance goals (deadline constraints, percentile constraints). SLAs cut across the stack: application server / middleware, DBMS, OS / hardware.
64
Techniques for QoS in Data Management
- Provide sufficient resources. Capacity planning: "How many boxes for customer X?" Cost vs. performance trade-off.
- Shielding: a dedicated (virtual) system per customer. Scalability? Cost efficiency?
- Scheduling: ordering requests by priority. At which level?
65
Workload Management
Purpose: achieve performance goals for classes of requests (queries, transactions); resource provisioning.
Aspects:
- Specification of service-level objectives
- Workload classification and modeling
- Admission control & scheduling
Approaches: static prioritization (DB2 Query Patroller, Oracle Resource Manager, ...), goal-oriented approaches, economic approaches, utility-based approaches.
67
WLM: Model
Incoming transactions are assigned to workload classes; admission control and scheduling then decide which requests execute, aiming at per-class response-time goals.
- Admission control: limit the number of simultaneously executing requests (multiprogramming level, MPL).
- Scheduling: ordering the requests by priority.
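A minimal sketch of MPL-based admission control with priority scheduling (lower number = higher priority; the interface is invented for illustration):

```python
import heapq

class AdmissionController:
    def __init__(self, mpl):
        self.mpl = mpl          # multiprogramming level: max concurrent requests
        self.running = set()
        self.queue = []         # min-heap of (priority, request)

    def submit(self, request, priority):
        heapq.heappush(self.queue, (priority, request))
        self._admit()

    def finish(self, request):
        self.running.discard(request)
        self._admit()           # a slot freed up: admit the next request

    def _admit(self):
        # Admit waiting requests in priority order while slots are free.
        while self.queue and len(self.running) < self.mpl:
            _, request = heapq.heappop(self.queue)
            self.running.add(request)

ac = AdmissionController(mpl=2)
for req, prio in [("q1", 1), ("q2", 3), ("q3", 2)]:
    ac.submit(req, prio)
print(sorted(ac.running))   # q1 and q2 occupy the two slots; q3 waits
```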
68
Utility Functions
A utility function is a preference specification: it maps possible system states (e.g. resource provisioning to jobs) to a real scalar value, representing a performance feature (response time, throughput, ...) and/or an economic value.
Goal: determine the most valuable feasible state, i.e. maximize utility:
- explore the space of alternative mappings (a search problem)
- runtime monitoring and control
Kephart, Das: Achieving self-management via utility functions. IEEE Internet Computing 2007
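The utility-maximization loop can be sketched with a toy performance model. Both the sigmoid-shaped preference and the demand/share model below are invented for illustration; a real system would plug in measured or learned models:

```python
def utility(response_time, deadline=2.0):
    # Sigmoid-like preference: high utility while under the deadline,
    # dropping quickly once the deadline is exceeded.
    return 1.0 / (1.0 + (response_time / deadline) ** 4)

def predicted_response_time(demand, cpu_share):
    # Toy performance model: response time grows as the share shrinks.
    return demand / max(cpu_share, 1e-9)

def best_allocation(demands, steps=10):
    """Exhaustively search CPU splits between two workloads for max utility."""
    best, best_u = None, float("-inf")
    for k in range(1, steps):
        shares = (k / steps, 1 - k / steps)
        u = sum(utility(predicted_response_time(d, s))
                for d, s in zip(demands, shares))
        if u > best_u:
            best, best_u = shares, u
    return best

print(best_allocation((0.2, 0.8)))   # the CPU-hungry workload gets the larger share
```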
69
Workload Modeling & Prediction
Goal: predict the resource requirements for a given workload, i.e. find the correlation between query features and performance features. Approaches: regression, correlation analysis, Kernel Canonical Correlation Analysis (KCCA).
KCCA correlates a job feature matrix (built from query plans / job descriptions) with a performance feature matrix (built from performance statistics) via a query-plan projection and a performance projection. Prediction: calculate the job's coordinates in the query-plan projection from its feature vector, then infer the job's coordinates in the performance projection.
Ganapathi et al.: Predicting Multiple Metrics for Queries: Better Decisions Enabled by Machine Learning. ICDE 2009
70
Outline
Virtualization, Distributed Storage, Logical Data Model, Storage Model, Query & Programming Model, Multi-Tenancy, Replication, Service Level Agreements, Security
71
Overview and Challenges
(Figure: the data owner pre-processes and outsources its data to an un-trusted service provider running a query engine; users send queries through a query pre-/post-processor and receive the query results.)
Challenges: data confidentiality/privacy; private information retrieval / access privacy; completeness and correctness of results.
72
Challenges I – Data Confidentiality/Privacy
Need to store data in the cloud, but we do not trust the service provider with sensitive information → encrypt the data and store it, but still be able to run queries over the encrypted data; do most of the work at the server
Two issues: privacy during transmission (well studied, e.g. through SSL/TLS) and privacy of stored data
Querying over encrypted data is challenging: some content information must be maintained on the server side, e.g. range queries require order-preserving data encryption mechanisms
→ privacy/performance trade-off
73
Query Processing on Encrypted Data
[Figure: at the client site, the query translator uses metadata to turn the original query into a client-side and a server-side part; the (un-trusted) service provider's query executor evaluates the server-side query and returns encrypted results, which the client post-processes in a temporary result before handing the final result to the user]
74
Executing SQL over Encrypted Data
Hacigümüş et al. (SIGMOD 2002), main steps:
Partition sensitive domains (order-preserving: supports comparison; random: query rewriting becomes hard)
Rewrite queries to target partitions
Execute queries and return results
Prune/post-process results on the client
Privacy/precision trade-off: larger segments/partitions → increased privacy, decreased precision, increased overheads in query processing
75
Relational Encryption

Original table (data owner):
NAME   SALARY   PID
John   50000    2
Marry  110000   2
James  95000    3
Lisa   105000   4

Encrypted table (service provider site):
etuple               N_ID   S_ID   P_ID
fErf!$Q!!vddf>></|   50     1      10
F%%3w&%gfErf!$       65     2      10
&%gfsdf$%343v<l      50     2      20
%%33w&%gfs##!        65     2      20

Store an etuple (encrypted with an arbitrary encryption function, e.g. AES, RSA, Blowfish, DES, ...) for each tuple in the original table
Create a coarse index (bucket ids) for each (or selected) attribute(s) in the original table
76
Index and Identification Functions
Partition function divides domain values into partitions (buckets):
Partition(R.A) = { [0,200], (200,400], (400,600], (600,800], (800,1000] }
[Figure: the domain 0..1000 split into these five buckets with partition ids 2, 7, 5, 1, 4]
The partitioning function has an impact on performance as well as privacy
Identification function assigns a partition id to each partition of attribute A, e.g. ident_R.A( (200,400] ) = 7
Any function can be used as identification function, e.g., hash functions
The bucket-to-id mapping is kept as meta-data on the client
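The partition and identification functions above can be sketched directly; the buckets and ids follow the slide's example, while the range-rewrite helper is an illustrative addition in the spirit of the Hacigümüş et al. scheme.

```python
import bisect

# Bucketization sketch: the client keeps the partition and identification
# functions as metadata and rewrites range predicates into bucket-id
# predicates for the server. Buckets and ids follow the slide's example.

# Partition(R.A) = { [0,200], (200,400], (400,600], (600,800], (800,1000] }
BOUNDS = [200, 400, 600, 800, 1000]   # upper bound of each bucket
# ident assigns an arbitrary id to each bucket, e.g. ident((200,400]) = 7
IDENT = {0: 2, 1: 7, 2: 5, 3: 1, 4: 4}

def bucket_of(value):
    """Return the bucket id the value falls into."""
    return IDENT[bisect.bisect_left(BOUNDS, value)]

def rewrite_range(lo, hi):
    """Rewrite 'A BETWEEN lo AND hi' into the set of bucket ids the server
    must return; the client prunes false positives after decryption."""
    i = bisect.bisect_left(BOUNDS, lo)
    j = bisect.bisect_left(BOUNDS, hi)
    return {IDENT[k] for k in range(i, j + 1)}

print(bucket_of(250))             # falls into (200,400], id 7
print(sorted(rewrite_range(150, 450)))
```

Note the imprecision: a query for 150..450 must fetch three whole buckets, so the server returns extra tuples the client discards, which is exactly the privacy/precision trade-off on the previous slide.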
77
Challenges II – Private Information Retrieval (PIR)
User queries should be invisible to the service provider
More formally: the database is modeled as a string x of length N stored at a remote server; the user wants to retrieve the bit x_i for some i without disclosing any information about i to the server
Paradox: imagine buying in a store without the seller knowing what you buy
[Figure: the user sends i and receives x_i from the server holding x = x_1, x_2, ..., x_N]
78
Information-Theoretic 2-server PIR
User picks a random set Q1 ⊆ {1,...,n}, sends Q1 to server 1 and Q2 = Q1 ⊕ {i} to server 2 (⊕ on sets = symmetric difference)
Server 1 answers a1 = ⊕_{l ∈ Q1} x_l; server 2 answers a2 = ⊕_{l ∈ Q2} x_l
User recovers x_i = a1 ⊕ a2
Each query set on its own is uniformly random, so neither (non-colluding) server learns anything about i
[Figure: example bit database queried by the two servers]
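The protocol is short enough to execute; here is a minimal sketch with a bit database, random subset queries, and XOR answers (0-based indices instead of the slide's 1-based ones).

```python
import secrets

# 2-server information-theoretic PIR: the user sends a random index set Q1
# to server 1 and Q2 = Q1 XOR {i} to server 2; each server XORs the bits it
# is asked for, and XORing the two answers yields x_i. Each query set alone
# is uniformly random, so a single server learns nothing about i.

def query_sets(n, i):
    q1 = {l for l in range(n) if secrets.randbits(1)}
    q2 = q1 ^ {i}              # symmetric difference: flip membership of i
    return q1, q2

def server_answer(x, q):
    a = 0
    for l in q:                # a = XOR of the requested bits
        a ^= x[l]
    return a

x = [0, 1, 1, 0, 1, 0, 1, 1]  # the database, one bit per cell
i = 2
q1, q2 = query_sets(len(x), i)
xi = server_answer(x, q1) ^ server_answer(x, q2)
print(xi)                      # recovers x[2] = 1
```

Correctness follows because every index except i appears in both query sets or in neither, so its bit cancels in the XOR.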
79
Conclusion & Outlook
Current infrastructures: MS Azure, Amazon RDS + SimpleDB, Amazon Dynamo, Google BigTable, Yahoo! PNUTS
Conclusion
Challenges & Trends
80
Current Solutions
[Figure: solutions arranged between "one DB per client" (virtualization: Amazon RDS, Microsoft SQL Azure) and "one DB for all clients" (distributed storage: Amazon SimpleDB / Dynamo, Amazon S3, Google Bigtable, Cassandra, Voldemort, Yahoo! PNUTS), with replication as a further dimension]
81
Microsoft SQL Azure
Cloud database service for the Azure platform
Allows to create a SQL server = group of databases spread across multiple physical machines (incl. geo-location)
Supports relational model and T-SQL (tables, views, indices, triggers, stored procedures)
Deployment and administration using SQL Server Management Studio
Current limitations: individual database size max. 10 GB; no support for CLR, distributed queries & transactions, spatial data
82
Microsoft SQL Azure: Details
Databases implemented as replicated data partitions across multiple physical nodes; provide load balancing and failover
API: SQL, ADO.NET, ODBC; Tabular Data Streams; SQL Server Authentication; Sync Framework
Prices: 1 GB database: $9.99/month, 10 GB: $99.99/month, + data transfer
SLA: 99.9% availability
83
Microsoft Azure: Other Services
Azure Blob: blob storage; PUT/GET interface via REST
Azure Table: structured storage; LINQ, ADO.NET interface
[Figure: a storage account contains tables (e.g. Customer, Order); each table holds entities (e.g. Customer #1, Customer #2) with properties such as Name and Address]
Properties can be defined per entity; max size of an entity: 1 MB
Partition key: used for assigning entities to partitions; row key: unique ID within a partition
Sort order: single index per table
Atomic transactions within a partition
84
Amazon RDS
Amazon Relational Database Service: Web Service to set up and operate a MySQL database
Full-featured MySQL 5.1
Automated database backup
Java-based command line tools and Web Service API for instance administration
Native DB access
Prices: small DB instance (1.7 GB memory, 1 ECU): $0.11/hour; largest DB instance (68 GB, 26 ECU): $3.10/hour; + $0.10 per GB-month storage + data transfer
85
Amazon Data Services
Amazon Simple Storage Service (S3):
Distributed blob storage for objects (1 byte ... 5 GB data)
REST-based interface to read, write, and delete objects identified by a unique, user-defined key
Atomic single-key updates; no locking
Eventual consistency (partially read-after-write)
Aug 2009: more than 64 billion objects
Amazon SimpleDB (= Amazon Dynamo???):
Distributed structured storage
Web Service API for access
Eventual consistency
86
Amazon SimpleDB
Data model:
Relational-like data model: a domain = collection of items described by key-value pairs; max size 10 GB
Attributes can be added to individual items (256 per item)
[Figure: a domain (e.g. Customer) contains items (e.g. Customer #1, Customer #2) with attribute-value pairs such as Name: Wolfgang, City: Dresden]
Queries:
Restricted to a single domain
SFW syntax + count() + multi-attribute predicates
Only string-valued data: lexicographical comparisons
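Because comparisons are lexicographic on strings, numeric attributes are commonly zero-padded on the client before being stored, so that string order matches numeric order. This is a client-side convention, not a SimpleDB API; the width below is an arbitrary choice.

```python
# SimpleDB compares values lexicographically as strings, so "9" > "10".
# A common client-side workaround is to zero-pad numbers to a fixed width
# before storing them (the width here is an arbitrary choice).

def encode_int(n, width=10):
    return str(n).zfill(width)

prices = [5, 40, 312, 9]
# Plain string sort gets the order wrong ...
assert sorted(map(str, prices)) == ['312', '40', '5', '9']
# ... while zero-padded strings sort like the numbers themselves
padded = sorted(encode_int(p) for p in prices)
assert [int(p) for p in padded] == [5, 9, 40, 312]
print(padded[0])   # '0000000005'
```

The same trick applies to dates (store them in ISO 8601 form) so that range predicates over string attributes behave as intended.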
87
Amazon Dynamo
Highly available and scalable key-value data store for the Amazon platform
Manages the state of Amazon services providing bestseller lists, shopping carts, customer preferences, product catalogs → these require only primary-key access (e.g. product id, customer id)
Completely decentralized, minimal need for manual administration (e.g. partitioning, redistribution)
Assumptions:
Simple query model: put/get operations on keys, small objects (< 1 MB)
Weaker consistency but high availability ("always writable" data store), no isolation guarantees
Efficiency: running on commodity hardware, guaranteed latency = SLAs, e.g. 300 ms response time for 99.9% of requests at a peak load of 500 requests/sec.
88
Dynamo: Partitioning and Replication
Partitioning scheme based on consistent hashing
Virtual nodes: each physical node is responsible for more than one virtual node
Replication: each data item is replicated at N nodes
[Figure: the key space as a ring with nodes A..E; node C is responsible for the key range (B,C], and replicas of keys from that range are placed on the nodes following C on the ring]
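The ring construction can be sketched in a few lines; this is a small illustration in the spirit of (but much simpler than) Dynamo's scheme, with arbitrary node names, hash function, and virtual-node count.

```python
import bisect
import hashlib

# Consistent hashing with virtual nodes: each physical node owns several
# points on the ring, and a key is stored on the first N distinct nodes
# found clockwise from the key's hash (its preference list).

def h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=8):
        # One ring point per (node, virtual node) pair, sorted by hash
        self.ring = sorted((h(f"{n}#{v}"), n) for n in nodes for v in range(vnodes))
        self.points = [p for p, _ in self.ring]

    def preference_list(self, key, n=3):
        """Walk clockwise from the key's position, collecting n distinct
        nodes: the coordinator plus n-1 replica holders."""
        idx = bisect.bisect(self.points, h(key)) % len(self.ring)
        out = []
        while len(out) < n:
            node = self.ring[idx % len(self.ring)][1]
            if node not in out:
                out.append(node)
            idx += 1
        return out

ring = Ring(["A", "B", "C", "D", "E"])
print(ring.preference_list("cart:4711"))
```

Virtual nodes smooth out load: when a physical node joins or leaves, its many small ring segments are redistributed across all remaining nodes instead of landing on a single neighbor.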
89
Dynamo: Data Versioning
Provides eventual consistency → asynchronous propagation of updates
Updates result in a new version of the data
Vector clocks for capturing causalities between different versions of the same object
Vector clock = list of (node, counter) pairs
Determine causal ordering/parallel branches of versions
Update requests have to specify which version is to be updated
Reconciliation during client reads!
[Figure: write(D)@NA yields D1([NA,1]) and then D2([NA,2]); parallel writes at NB and NC yield D3([NA,2],[NB,1]) and D4([NA,2],[NC,1]); reconcile(D)@NA merges them into D5([NA,3],[NB,1],[NC,1])]
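The causality test on vector clocks can be sketched as follows, using the D2/D3/D4 versions from the figure; clocks are represented as dicts here for simplicity.

```python
# Vector-clock comparison as used in Dynamo's data versioning: one version
# causally succeeds another iff all its counters are >= and (if unequal)
# at least one is strictly greater.

def descends(vc_new, vc_old):
    """True if vc_new is a causal successor of (or equal to) vc_old."""
    return all(vc_new.get(node, 0) >= c for node, c in vc_old.items())

def conflicting(a, b):
    """Parallel branches: neither clock descends from the other, so the
    client must reconcile the versions on read."""
    return not descends(a, b) and not descends(b, a)

d2 = {"NA": 2}                 # D2([NA,2])
d3 = {"NA": 2, "NB": 1}        # D3: written at NB after seeing D2
d4 = {"NA": 2, "NC": 1}        # D4: written at NC after seeing D2
print(descends(d3, d2))        # True: D3 happened after D2
print(conflicting(d3, d4))     # True: parallel branches, reconcile on read
```

Reconciliation then produces a merged version whose clock takes the element-wise maximum of the branches plus a new counter increment at the coordinating node, as D5 does in the figure.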
90
Dynamo: Replica Maintenance
Consistency among replicas:
Quorum protocol: R nodes must participate in a read, W nodes in a write; R + W > N
Sloppy quorum: reads/writes are performed on the first N healthy nodes
Preference list: list of nodes which are responsible for storing a given key
For highest availability: W = 1
Replica synchronization:
Anti-entropy via Merkle trees: hash trees where leaves are hashes of keys and non-leaves are hashes of their children
If the hash values of two nodes are equal, there is no need to check their children
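A simplified sketch of the Merkle-tree comparison (an illustration, not Dynamo's implementation; it assumes a power-of-two leaf count and, when the roots differ, compares leaves directly rather than walking the tree top-down):

```python
import hashlib

# Merkle-tree anti-entropy sketch: each replica builds a hash tree over its
# key range; if the roots match, the replicas are in sync and no further
# data needs to be exchanged.

def h(data):
    return hashlib.sha256(data).digest()

def build(leaves):
    """Build the tree bottom-up; returns the list of levels, root level last.
    Assumes a power-of-two number of leaves."""
    level = [h(leaf.encode()) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def diff_leaves(t1, t2):
    """Equal roots prune the whole comparison; otherwise return the indices
    of differing leaves (the recursive top-down walk is omitted here)."""
    if t1[-1] == t2[-1]:
        return []
    return [i for i, (a, b) in enumerate(zip(t1[0], t2[0])) if a != b]

r1 = build(["k1=v1", "k2=v2", "k3=v3", "k4=v4"])
r2 = build(["k1=v1", "k2=XX", "k3=v3", "k4=v4"])
print(diff_leaves(r1, r2))   # only leaf 1 must be synchronized
```

In the common case where replicas agree, a single root comparison replaces transferring the whole key range, which is what makes anti-entropy cheap.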
91
Google BigTable
Fast and large-scale DBMS for Google applications and services
Designed to scale into the PB range
Uses the distributed Google File System (GFS) for storing data and log files
Depends on a cluster management system for managing resources, monitoring state, scheduling, ...
Can be used as input source and output target for MapReduce programs
92
BigTable: Data Model
Bigtable = sparse, distributed, multi-dimensional sorted map
Indexed by row key, column key, timestamp; value = array of bytes
Row keys up to 64 KB; column keys grouped into column families
Timestamp (64-bit int) used for versioning
Data is maintained in lexicographic order by row keys
Row range is dynamically partitioned ➪ tablet = unit of distribution and load balancing
Read/write ops under a single row key are atomic
[Figure: a cell addressed by row key and column key holds multiple timestamped versions t1, t2 of the value]
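The logical model maps naturally onto a dictionary keyed by (row, column) with timestamped versions per cell. A toy sketch, using the example row key from the Bigtable paper; the read semantics (latest version at or before a timestamp) are a simplification of the real API:

```python
# Toy sketch of Bigtable's logical model: a sparse map indexed by
# (row key, column key, timestamp) with values as uninterpreted bytes.

class Table:
    def __init__(self):
        self.cells = {}                      # (row, column) -> {ts: value}

    def put(self, row, column, ts, value):
        self.cells.setdefault((row, column), {})[ts] = value

    def get(self, row, column, ts=None):
        """Return the latest version, or the latest at or before ts."""
        versions = self.cells.get((row, column), {})
        valid = [t for t in versions if ts is None or t <= ts]
        return versions[max(valid)] if valid else None

t = Table()
t.put("com.cnn.www", "contents:", 1, b"<html>v1")
t.put("com.cnn.www", "contents:", 2, b"<html>v2")
print(t.get("com.cnn.www", "contents:"))        # b'<html>v2'
print(t.get("com.cnn.www", "contents:", ts=1))  # b'<html>v1'
```

The real system additionally keeps rows sorted for range scans and groups columns into families for access control and locality; this sketch only shows the (row, column, timestamp) → value addressing.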
93
BigTable: System Architecture
Single-master distributed storage system; the master server is responsible for:
Assigning tablets to tablet servers
Load balancing on tablet servers
Detecting addition and expiration of tablet servers
Garbage collection of GFS files
Tablet servers:
Manage sets of tablets (10...1000 tablets per server, 100...200 MB per tablet)
Handle read/write requests
Split tablets
Distributed, persistent lock/name service: Chubby
Uses Paxos for replica consistency (5 replicas)
Provides a namespace consisting of directories and files; allows discovery of tablet servers
94
BigTable: Tablets
Internally stored in SSTables: immutable, sorted files of key-value pairs, organized in 64 KB blocks + an index over block ranges
Tablet location:
Chubby contains the location of the root tablet
The root tablet contains the location of all tablets of a METADATA table
METADATA tablets contain the location of user tablets + end key row (sparse index)
The three-level scheme addresses 2^34 tablets
Cached by the client library
[Figure: Chubby file → root tablet → METADATA tablets → user tables]
95
BigTable: Tablets /2
Tablet assignment:
Starting tablet servers acquire an exclusive lock in Chubby → allows discovery of tablet servers
Periodic checks by the master on the lock status of tablet servers
Replication of data performed by GFS
Tablet serving:
Updates (mutations) are logged and then applied to an in-memory version (memtable)
Compactions: convert the memtable into an SSTable; merge SSTables
96
Yahoo! PNUTS
Yahoo!'s data serving platform
Data & query model:
Simple relational model: tables of records with attributes (incl. blob types)
Flexible schema evolution by adding attributes at any time
Queries: single-table selection & projection
Updates & deletions based on primary-key access
Storage model: records as parsed JSON objects; filesystem-based hash tables or MySQL InnoDB engine
97
PNUTS Architecture
[Figure: clients access the system via a REST API; routers and a tablet controller direct requests to storage units; a message broker propagates updates between replicas]
98
PNUTS: Consistency & Replication
Consistency model:
Per-record timeline consistency: all replicas apply all updates in the same order
User-specific guarantees: read-any, read-latest, read-newer-than, writes, write-after-version
Partitioning and replication:
Tables horizontally partitioned into tablets (100 MB ... 10 GB)
Each server is responsible for 100+ tablets
Asynchronous replication using the message broker (publish/subscribe): guarantees delivery of messages (incl. logging), provides partial ordering of messages
Record-level membership + mastership-migration protocol
99
Comparison

                         Dynamo       Bigtable                      PNUTS                              Amazon RDS   SQL Azure
Query Model              get          get + key-based range scans   single-table selection+projection  SQL          SQL
Logical Data Model       key-value    flexible tables               flexible tables                    relational   relational
Consistency Model        eventual     relaxed                       per-record timeline consistency    strict       strict
Transaction Guarantees   ?            row-level                     row-level                          ACID         ACID
Replication              data-level   GFS                           record-level                       DB-level     DB-level
100
Conclusion
DBaaS = outsourcing databases to reduce TCO
Reduce operational/administration costs
Pay-as-you-go model
Wide spectrum of solutions: "rent a database" ... cloud databases
Use cases: database hosting, hosted services, large-scale data analytics
101
Challenges & Trends
[Figure: challenges mapped onto the tutorial topics: virtualization, distributed storage, logical data model, storage model, query & programming model, service level agreements]
Resource provisioning:
• Virtualization on system and database level
Service-level agreements:
• Shielding: one (virtual) box per client
• Limiting functionality: SQL vs. put/get operations
• Workload management
Scalability and availability:
• Through redundancy and partitioning
• But may affect the consistency model
Expressiveness:
• Limiting functionality: SQL vs. put/get vs. MR
Confidentiality and trust:
• Data encryption
• Information distribution
102
References
F. Chang et al.: Bigtable: A Distributed Storage System for Structured Data, OSDI 2006
B. F. Cooper, R. Ramakrishnan, U. Srivastava, A. Silberstein, P. Bohannon, H.-A. Jacobsen, N. Puz, D. Weaver, R. Yerneni: PNUTS: Yahoo!'s Hosted Data Serving Platform, Proceedings of the VLDB Endowment, 1(2), August 2008
R. Baldoni, M. Raynal: Fundamentals of Distributed Computing: A Practical Tour of Vector Clock Systems, IEEE Distributed Systems Online, 2002
E. Brewer: Towards Robust Distributed Systems, PODC 2000
S. Gilbert, N. Lynch: Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services, ACM SIGACT News, 2002
W. Vogels: Eventually Consistent – Revisited, ACM Queue 6(6), 2008
D. Karger et al.: Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web, STOC '97
Y. Saito, M. Shapiro: Optimistic Replication, ACM Computing Surveys 37(1), 2005
S. Aulbach, T. Grust, D. Jacobs, A. Kemper, J. Rittinger: Multi-Tenant Databases for Software as a Service: Schema-Mapping Techniques, SIGMOD 2008: 1195-1206
103
References
G. DeCandia et al.: Dynamo: Amazon's Highly Available Key-Value Store, SOSP 2007
P. Bernstein et al.: Data Management Issues in Supporting Large-Scale Web Services, IEEE Data Engineering Bulletin, Dec. 2006
M. Brantner et al.: Building a Database on S3, SIGMOD 2008
A. Aboulnaga, C. Amza, K. Salem: Virtualization and Databases: State of the Art and Research Challenges, EDBT 2008: 746-747
A. A. Soror, U. F. Minhas, A. Aboulnaga, K. Salem, P. Kokosielis, S. Kamath: Automatic Virtual Machine Configuration for Database Workloads, SIGMOD 2008: 953-966
C. Olston, B. Reed, U. Srivastava, R. Kumar, A. Tomkins: Pig Latin: A Not-So-Foreign Language for Data Processing, SIGMOD 2008
R. Pike, S. Dorward, R. Griesemer, S. Quinlan: Interpreting the Data: Parallel Analysis with Sawzall, Scientific Programming 13(4):277-298, October 2005
104
References
R. Chaiken, B. Jenkins, P. Larson, B. Ramsey, D. Shakib, S. Weaver, J. Zhou: SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets, Proceedings of the VLDB Endowment, 1(2), August 2008
B. Hore, S. Mehrotra, G. Tsudik: A Privacy-Preserving Index for Range Queries, VLDB 2004: 720-731
H. Hacigümüş, B. Iyer, C. Li, S. Mehrotra: Executing SQL over Encrypted Data in the Database-Service-Provider Model, SIGMOD 2002
D. Agrawal, A. El Abbadi, F. Emekçi, A. Metwally: Database Management as a Service: Challenges and Opportunities, ICDE 2009: 1709-1716
A. Shamir: How to Share a Secret, Communications of the ACM 22(11):612-613, Nov. 1979
F. Kerschbaum, J. Vayssière: Privacy-Preserving Data Analytics as an Outsourced Service, ACM Workshop on Secure Web Services 2008
B. Chor, O. Goldreich, E. Kushilevitz, M. Sudan: Private Information Retrieval, FOCS 1995