Consistent Join Queries for Cloud Data Stores

Zhou Wei 1,2

Guillaume Pierre1, Chi-Hung Chi2

1 VU University Amsterdam2 Tsinghua University Beijing

Background

• Cloud Computing Platforms for Web Apps– Unlimited resources– Scalable NoSQL data stores– Pay-as-you-go

Problems of Cloud data stores

• Problem 1: No Join queries– Only support queries within a single table– No queries across tables

• Problem 2: Relaxed data consistency– Single-row transaction only– No consistency guarantee across multi-items

• Web apps need consistent join queries

Why Web App needs join?

• Data Normalization Foreign-key Equi-joins

PK Title Comment1 IPhone Cool2 IPhone Nice3 IPhone Expensive

PostPK Title1 IPhone

PK Post_id Comment1 1 Cool2 1 Nice3 1 Expensive

Comment

Why consistent?

• Application correctness is not an option• Hard to maintain correctness with an

inconsistent data store– Require skillful programmers– Error-Prone– Cost more time

Alternatives

• SQL databases– Centralized– Full Replication– NOT scalable to update

workload– Complicated queries– ACID Transaction

• NoSQL data stores– Distributed– Partitioned– Scalable to update

workload– No Join Queries– Single-row Transaction

Google BigtableAmazon SimpleDBCassandra …

MySQLOracle …

Our Position

• Scalable NoSQL data stores supporting equi-joins queries for Web applications

• Join query as a ACID transaction• Queries of Web applications access only a

small portion of data– Also true for join queries– All of them are foreign-key equi-joins

How it works?

Web Application

Cloud Data Store

Load Data

Checkpoint

Transactions

TM TM TM

CloudTPSJoin Queries

Outline

• Join Algorithm• Transaction Protocol• Performance Evaluation• Conclusion

Join Algorithm

• Web applications are response-time sensitive– Table scan is not an option

• The basic idea– Linking records by FK relationships in advance– Each join query has well identified “root records”– Begin from accessing each “root record”– Recursively obtaining its related records

Link records for forward join

PK FK1 22 33 2

Comment

Given referring row identify referenced row(FK) (PK)

Comment

(1,3)2

Link records for backward join

PK FK1 22 33 2

Comment Post Index

Give referenced row identify referring rows(PK) (FK)

Joining two tables

SELECT * FROM post, comment WHERE post.id = 1 and comment.post_id = post.id

comment

post.id = comment.post_id

Multi-Joins

• The table of the “root record” is the root

• Recursively join child tables

• Length of critical execution path = 2

Comment

Revision

Outline

Distributed Transaction

Transaction

•SubT(1)•SubT(2)•SubT(3) TM

SubT(1) SubT(2) SubT(3)

2 Phase Commit (2PC)

Problem

• 2PC requires all records identified in advance• Join queries/index management identify

accessed records on the fly• Solution: extend 2PC to allow adding records

dynamically

The Original 2PC

Data1 Data2

2. Vote 3. Commit/Abort1. Submit

Extended 2PC

• Dynamically add data items to the transaction

Data1 Data2

2. Vote1. Submit

Commit Add Data3

Extended 2PC

Data1 Data2

3. Derive

Extended 2PC

Data1 Data2

4. Vote

Commit

Extended 2PC

Data1 Data2

5. Commit

Outline

Experiment Setup

• Implement a prototype of CloudTPS – Java Servlet– Deployed in Tomcat v6.0

• Environments– 1. Das3+Hbase+Tomcat (99% reqs < 100ms)– 2. EC2+SimpleDB+Tomcat (90% reqs < 100ms)• High-CPU medium instances

Micro-benchmarks

• Generate specific workload– Containing only Join queries/RW transactions– Changing the number of accessed data items– Changing the length of critical execution path– Switch between Update indexes/Not update

• All configured with 10 CloudTPS nodes• Measuring maximum sustainable throughput– Under response time constraint

Two-table join evaluation

Transaction Per Second Record Per Second

Increasing number of accessed records in a transaction

Multi-Joins Evaluation

Increasing the height of join-query treeAll join queries access 12 records

RW transactions evaluation

Transaction Per Second Record Per Second

Increasing number of accessed records in a transaction

Scalability Evaluation

• Macro-Benchmark– Derived from TPC-W, simulating an online book

store– Our workload includes only the join queries and

RW transactions– No simple queries

Scalability Results

Linearly Scalability: Any increase of workload can be addressed by adding more machines

CloudTPS vs. Postgresql

Postgresql: binary-replication

Fault Tolerance

• Recovers from 3 network partitions and 2 machine failures

Conclusion

• We demonstrate supporting consistent join queries over partitioned and distributed data

• Join queries from Web applications only access a small portion of data

• It scales linearly even under an intensive workload of only join queries and read-write transactions

Questions?

Thank you!

Details in Macro-benchmark workload

Join queries RW transactions

PK SK1 A2 B3 B

Secondary-key Queries

• Select * from Table1 Where SK = ‘B’• Create a separate index table for the SK• Translate into a join query

Table 1Index Table

PK’(SK) T1A 1B 2,3

Index Management• Indexes link the related records

PK Title1 Hello2 Good3 Well

PK FK1 12 13 2

post comment

T11,23-

Index Management (cont.)

• Case: a new comment inserted• Updating indexes atomically

PK FK1 12 13 24 2

post comment

T11,23, 4

• Consistency Violated Case 1

PK FK1 12 13 24 2

post comment

T11,23-

• Consistency Violated Case 2

PK Title1 Hello2 Good3 OK

PK FK1 12 13 2

post comment

T11,23, 4

Fault tolerance

• Maintain ACID during server failure– Transactions that reach an agreement of “COMMIT”

should Commit– Abort other transactions

• Replication to N transaction managers– Values of Data Items– States of Transactions (Vote result and “redo logs”)

• Failover of a failed server– Transfer ownership of data items– Transfer leadership of transactions being coordinated

Optimization for Join Queries

• Join Query is a Read-only transaction– Remove the final Commit phase– Sub-transactions removed immediately after local

execution– RO transactions will not block other transactions

Concurrency Control

SubT(1)

SubT(1)5

Timestamp Ordering

SubT(2)

SubT(2)5

Read-Only Transaction

ROSubT(1)

SubT(1)5

SubT(2)5

Join(Data1) TM TM

SubT(1)7

Read-Only Transaction (Cont.)

ROSubT(1)6

Join(Data1) TM TM

SubT(1)7

Read-only Transaction (cont.)

NoneROSubT(2)6

Result

SubT(1)7

Consistent Join Queries for Cloud Data Stores

Documents

Transcript of Consistent Join Queries for Cloud Data Stores

Hiding Tree-Structured Data and Queries from Untrusted Data Stores

Multimedia DBs. Multimedia dbs A multimedia database stores text, strings and images Similarity queries (content based retrieval) Given an image find.

Working with Queries Keyboard Shortcuts psalltraining.com General · 2015. 9. 30. · Field Data Types Data Type Description Text (Default) Stores text, numbers, or a combination

Access: Queries Ad-hoc Reporting Chapter T. Access Queries Queries Access Properties Sorting Selection Criteria Calculations.

RETAIL - ibef.org · Departmental stores Hypermarkets Supermarkets/ convenience stores Specialty stores Cash and carry stores Pantaloon has 209 stores cash and carry

Automated military unit identification in battlefield ...public.lanl.gov/stroud/GSMO1997.pdfelement that generates (and stores in memory) random MTI screens that are consistent with

GMVC96 - Goodman MFGSelf-Diagnostic Control Board – Continuously monitors the system for consistent, reliable operation, stores last diagnostic codes in memory; and indicates condition

Your Tasks A.Create queries and reports that will meet the needs of the business. – Reports should have a consistent house style. B. Explain any testing.

Queries & Reports - Admin Portal Index · •Disable Pop-up blockers •Position Funding Queries •Employee Information Queries •Budget Queries •Reports ... • You need the

Analytics: The Agile Way...Traditional Analytics via Data at Rest Analytics via Event-Stream Processing 1. Receives and stores data, typically from a single source. 1. Stores queries

SUPREME COURT OF THE UNITED STATES · ment. Respondent Abercrombie & Fitch Stores, Inc., operates several lines of clothing stores, each with its own “style.” Consistent with

Super Stores Vs. Specialty Stores

Please direct all queries to: Queries Committee · 2019-11-05 · 50 Please direct all queries to: Queries Committee Chairperson P.O. Box 5313, Helena, MT 59604. Queries submitted

An Information-Theoretic Perspective of Consistent ...dimacs.rutgers.edu/Workshops/Next15/Slides/Cadambe.pdf · • Modern key-value stores - Amazon Dynamo DB, Couch DB, Apache Cassandra

Spatial Queries Nearest Neighbor and Join Queries.

Join Queries CS 146. Introduction: Join Queries So far, our SELECT queries have retrieved data from…

NoSQL: HBase and Neo4j - ce.uniroma2.it · Suitable use cases for column-family stores • Queries that involve only a few columns • Aggregation queries against vast amounts of

Fast and strongly-consistent per-item resilience in …Fast and strongly-consistent per-item resilience in key-value stores Konstantin Taranov, Gustavo Alonso, Torsten Hoefler Systems

Batch Processing - Cornell University · Iterative processing on real-time data stream Interactive queries on a consistent view of results Argue that currently Streaming systems cannot

DS4300 -Large-Scale Storage and Retrieval · Key-Value Store Features Consistency: Consistency guaranteed only on a single key. Distributed stores are eventually consistent. Resolution