Cb15 presentation-yingyi

BIG DATA QUERY LANDSCAPE – N1QL AND MORE Yingyi Bu | Couchbase

Transcript of Cb15 presentation-yingyi

Page 1: Cb15 presentation-yingyi

BIG DATA QUERY LANDSCAPE – N1QL AND MORE

Yingyi Bu | Couchbase

Page 2: Cb15 presentation-yingyi

©2015 Couchbase Inc. 2

About Myself

Sr. Software Engineer @ Couchbase

Committer @ AsterixDB

(Research Project under Apache Incubation)

PhD Student @ UC Irvine

N1QL SQL++

[email protected]

@buyingyi

Page 3: Cb15 presentation-yingyi

Agenda

Introduction

Operational Query Processing

Analytical Query Processing

Comparison and Unification

Summary

Page 4: Cb15 presentation-yingyi

Introduction

Page 5: Cb15 presentation-yingyi

Introduction

NoSQL

SQL-on-Hadoop

SQL++

Unification

Research Projects

Page 6: Cb15 presentation-yingyi

Introduction

Language Unification Research

SQL Backward Compatible

Rich Data Model

Configurable Semantics

System Unification Research

A Single Language Interface

Scale-out for Both Workloads

Resource Scheduling Underneath

SQL++

Page 7: Cb15 presentation-yingyi

Operational Query Processing

Page 8: Cb15 presentation-yingyi

Operational Query Processing

import java.net.URI;
import java.util.ArrayList;
import com.couchbase.client.CouchbaseClient;

ArrayList<URI> nodes = new ArrayList<URI>();

// Add one or more nodes of your cluster
nodes.add(URI.create("http://127.0.0.1:8091/pools"));

// Try to connect to the client
CouchbaseClient client = null;
try {
    client = new CouchbaseClient(nodes, "default", "");
} catch (Exception e) {
    System.err.println("Error connecting to Couchbase: " + e.getMessage());
    System.exit(1);
}

// Put the key-value pair into Couchbase
client.set("hello", "couchbase!").get();

// Return the result and cast it to a string
String result = (String) client.get("hello");
System.out.println(result);

Put

Get

JSON

Filtering

Flatten

Group-by

Aggregation

Join

Ordering

Page 9: Cb15 presentation-yingyi

N1QL – SQL for NoSQL

Nested Data

Heterogeneous Data

Dynamic typing

[
  { "beer-sample": {
      "brewery_id": "bro",
      "abv": {"m1": 1, "m2": 2},
      "category": "North American Lager",
      "type": "beer"
  } },
  { "beer-sample": {
      "abv": 9.5,
      "brewery_id": "brouwerij"
  } }
]

SELECT category, type, abv.m1
FROM `beer-sample`
WHERE type = "beer"

[
  {
    "category": "North American Lager",
    "type": "beer",
    "m1": 1
  }
]

Standard SELECT pipeline

Joins, subqueries, set operators

UNNEST and NEST
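
The behavior of the query on this slide can be sketched in plain Python. This is only an illustration of how nested and absent fields behave, not how the Couchbase query engine is implemented; the `MISSING` sentinel and the helpers `nav` and `run_query` are hypothetical names introduced here.

```python
# Sentinel standing in for N1QL's MISSING value (hypothetical name).
MISSING = object()

def nav(value, *path):
    """Follow a field path; return MISSING instead of raising when a
    field is absent or the current value is not an object."""
    for field in path:
        if not isinstance(value, dict) or field not in value:
            return MISSING
        value = value[field]
    return value

def run_query(docs):
    """Emulate: SELECT category, type, abv.m1 FROM `beer-sample` WHERE type = "beer"."""
    results = []
    for doc in docs:
        if nav(doc, "type") != "beer":  # MISSING never equals "beer"
            continue
        projected = {
            "category": nav(doc, "category"),
            "type": nav(doc, "type"),
            "m1": nav(doc, "abv", "m1"),
        }
        # Fields that evaluate to MISSING are left out of the result object.
        results.append({k: v for k, v in projected.items() if v is not MISSING})
    return results

# The two documents from the slide, without the bucket-name wrapper.
docs = [
    {"brewery_id": "bro", "abv": {"m1": 1, "m2": 2},
     "category": "North American Lager", "type": "beer"},
    {"abv": 9.5, "brewery_id": "brouwerij"},
]
print(run_query(docs))
```

The second document has no `type` field and a numeric `abv`, yet neither the filter nor the path `abv.m1` raises an error; the document is simply filtered out.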

Page 10: Cb15 presentation-yingyi

Cassandra

SQL-like query language

Feature       N1QL   Cassandra
Lookup        ✔      ✔
Filtering     ✔      ✔
Ordering      ✔      ✔
Aggregation   ✔      ✖
Join          ✔      ✖
Subqueries    ✔      ✖
Unnest        ✔      ✖
Schema-free   ✔      ✖

SELECT firstname, lastname FROM users WHERE birth_year = 1981 AND country = 'FR' ALLOW FILTERING;

SELECT * FROM posts WHERE userid='john doe' AND (blog_title, posted_at) > ('John''s Blog', '2012-01-01')

Page 11: Cb15 presentation-yingyi

MongoDB

JavaScript-like language

Feature       N1QL   MongoDB
Lookup        ✔      ✔
Filtering     ✔      ✔
Ordering      ✔      ✔
Aggregation   ✔      ✔
Join          ✔      ✖
Subqueries    ✔      ✖
Unnest        ✔      ✔
Schema-free   ✔      ✔

db.sales.aggregate([
  { $group: {
      _id: { month: { $month: "$date" }, day: { $dayOfMonth: "$date" }, year: { $year: "$date" } },
      totalPrice: { $sum: { $multiply: [ "$price", "$quantity" ] } },
      averageQuantity: { $avg: "$quantity" },
      count: { $sum: 1 }
  } }
])

db.users.find( { age: { $gt: 18 } }, { name: 1, address: 1 } ).limit(5)
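
What the `$group` stage above computes can be sketched in plain Python. This is a rough equivalent for illustration, not MongoDB's implementation; the function name `group_sales` and the sample documents are made up.

```python
from collections import defaultdict
from datetime import datetime

def group_sales(sales):
    """Group by (month, day, year) and compute totalPrice,
    averageQuantity and count, like the $group stage."""
    groups = defaultdict(lambda: {"totalPrice": 0, "quantitySum": 0, "count": 0})
    for sale in sales:
        date = sale["date"]
        acc = groups[(date.month, date.day, date.year)]
        acc["totalPrice"] += sale["price"] * sale["quantity"]
        acc["quantitySum"] += sale["quantity"]
        acc["count"] += 1
    return [
        {"_id": {"month": month, "day": day, "year": year},
         "totalPrice": acc["totalPrice"],
         "averageQuantity": acc["quantitySum"] / acc["count"],
         "count": acc["count"]}
        for (month, day, year), acc in groups.items()
    ]

# Made-up sample documents, for illustration only.
sales = [
    {"price": 10, "quantity": 2, "date": datetime(2014, 3, 1, 8)},
    {"price": 20, "quantity": 1, "date": datetime(2014, 3, 1, 9)},
    {"price": 5, "quantity": 10, "date": datetime(2014, 3, 15, 9)},
]
for row in group_sales(sales):
    print(row)
```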

Page 12: Cb15 presentation-yingyi

Analytical Query Processing

Page 13: Cb15 presentation-yingyi

Hive

INSERT OVERWRITE TABLE school_summary
SELECT subq1.school, COUNT(1)
FROM (SELECT a.status, b.school, b.gender
      FROM status_updates a JOIN profiles b
      ON (a.userid = b.userid
          AND a.ds='2009-03-20')) subq1
GROUP BY subq1.school

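[Figure: Hive query plan executed as two MapReduce jobs (M1/R1 and M2/R2). Operators shown: Scan (a), Scan (b), Filter, Project, ReduceSink, Join, Group-by, FileSink in the first job; Scan, ReduceSink, Group-by, FileSink in the second.]

More data types than SQL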

Hadoop or Tez as runtime
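
The way such a query decomposes into two MapReduce jobs can be sketched in plain Python. This is a simplified in-memory illustration, not Hive's actual plan or code: the first job here does only the filter and the join, the second does the count. The helper `map_reduce`, the mapper and reducer names, and the sample rows are all hypothetical.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Tiny in-memory stand-in for one MapReduce job."""
    shuffled = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            shuffled[key].append(value)  # shuffle: group values by key
    output = []
    for key, values in shuffled.items():
        output.extend(reducer(key, values))
    return output

def join_mapper(tagged):
    table, row = tagged
    if table == "a":
        if row["ds"] == "2009-03-20":  # filter applied on the map side
            yield row["userid"], ("a", row["status"])
    else:
        yield row["userid"], ("b", row["school"])

def join_reducer(userid, values):
    statuses = [v for tag, v in values if tag == "a"]
    schools = [v for tag, v in values if tag == "b"]
    # One output row (the school) per matching (status, profile) pair.
    return [school for _ in statuses for school in schools]

def count_mapper(school):
    yield school, 1

def count_reducer(school, ones):
    return [(school, sum(ones))]

# Made-up sample rows, for illustration only.
status_updates = [
    {"userid": 1, "status": "hi", "ds": "2009-03-20"},
    {"userid": 1, "status": "yo", "ds": "2009-03-20"},
    {"userid": 2, "status": "old", "ds": "2009-03-19"},
    {"userid": 3, "status": "hey", "ds": "2009-03-20"},
]
profiles = [
    {"userid": 1, "school": "UCI"},
    {"userid": 2, "school": "UCSD"},
    {"userid": 3, "school": "UCSD"},
]
tagged = [("a", r) for r in status_updates] + [("b", r) for r in profiles]
joined = map_reduce(tagged, join_mapper, join_reducer)      # job 1 (M1/R1)
summary = map_reduce(joined, count_mapper, count_reducer)   # job 2 (M2/R2)
print(dict(summary))
```

Each job writes its full output before the next one starts, which is the main cost of running a multi-operator query on a MapReduce runtime.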

Page 14: Cb15 presentation-yingyi

Impala

INSERT OVERWRITE TABLE school_summary
SELECT subq1.school, COUNT(1)
FROM (SELECT a.status, b.school, b.gender
      FROM status_updates a JOIN profiles b
      ON (a.userid = b.userid
          AND a.ds='2009-03-20')) subq1
GROUP BY subq1.school

[Figure: Impala query plan. Operators shown: HDFS Scan (a), HDFS Scan (b), Filter, Project, Hash Join, Pre-Agg, Merge-Agg, HDFS Write.]

ANSI SQL-92

HDFS/HBase as the storage

Native MPP execution engine
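
The Pre-Agg and Merge-Agg operators in the plan correspond to two-phase aggregation, which can be sketched in plain Python. This is a generic illustration of the technique rather than Impala's code; the function names and the sample partitions are made up.

```python
from collections import Counter

def pre_aggregate(partition):
    """Local phase: count rows per school within one partition."""
    return Counter(row["school"] for row in partition)

def merge_aggregate(partials):
    """Global phase: sum the partial counts from all partitions."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return dict(total)

# Made-up join output split across two hypothetical nodes.
partitions = [
    [{"school": "UCI"}, {"school": "UCSD"}, {"school": "UCI"}],
    [{"school": "UCI"}, {"school": "UCSD"}],
]
print(merge_aggregate(pre_aggregate(p) for p in partitions))
```

Only the small partial counts cross the network, not the joined rows themselves.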

Page 15: Cb15 presentation-yingyi

Spark SQL

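import org.apache.spark.sql.hive.HiveContext

// Assumes sc is an existing SparkContext
val ctx = new HiveContext(sc)
val users = ctx.table("users")
val young = users.where(users("age") < 21)
println(young.count())

SELECT count(*) FROM users
WHERE age < 21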

SQL DataFrames

[Figure: Spark SQL query planning. SQL and DataFrames → Unresolved Logical Plan → Logical Plan → Physical Plans → Selected Physical Plan → RDDs, with a Catalog and a Cost Model.]

Page 16: Cb15 presentation-yingyi

Drill

ANSI SQL-92

Nested Data

Schema Inference

Centralized schema

Static

Managed by DBAs

Self-describing or schema-less

Dynamically evolving

Managed by applications

Embedded in data

CSV, JSON, Parquet, ORC
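
The idea of a schema that is embedded in the data, rather than declared centrally, can be sketched in plain Python. This is a toy illustration and not Drill's actual inference algorithm; `json_type`, `infer_schema` and the sample records are hypothetical.

```python
import json

def json_type(value):
    """Map a Python value parsed from JSON to a JSON type name."""
    if value is None:
        return "null"
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    return "object"

def infer_schema(records):
    """Collect, per field, the set of JSON types observed in the data."""
    schema = {}
    for record in records:
        for field, value in record.items():
            schema.setdefault(field, set()).add(json_type(value))
    return schema

# Made-up self-describing records, for illustration only.
lines = [
    '{"name": "a", "abv": 9.5}',
    '{"name": "b", "abv": {"m1": 1}, "tags": ["x"]}',
]
records = [json.loads(line) for line in lines]
schema = infer_schema(records)
print({field: sorted(types) for field, types in schema.items()})
```

The field `abv` ends up with two observed types, which is the kind of heterogeneity a static, DBA-managed schema would reject up front.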

Page 17: Cb15 presentation-yingyi

Comparison and Unification

Page 18: Cb15 presentation-yingyi

Comparison and Unification

AsterixDB – System Unification Research

Query language?

Language Comparisons

SQL++ – Language Unification Research

N1QL and SQL++

SQL++

Unification

Research Projects

Page 19: Cb15 presentation-yingyi

NoSQL data model with schema flexibility

Declarative full-fledged query language (AQL)

Partitioned native LSM-based storage

Secondary index (B-Tree, R-Tree, and keyword index)

Single-row transaction

Spatial/temporal data types

External data (HDFS) access and indexing

Native MPP query execution engine

AsterixDB (Apache incubator)

Operational

Analytical

Page 20: Cb15 presentation-yingyi

Query Language?

SELECT subq1.school, COUNT(1)
FROM (SELECT a.status, a.date, b.school, b.region
      FROM status_updates a JOIN profiles b
      ON (a.userid = b.userid
          AND a.date='2009-03-20')) subq1
GROUP BY subq1.school

Relational JSON

Nested tuples/collections

Partial/missing schema

Heterogeneity

Complex values

Replace COUNT(1) with “(select * from subq1 order by date limit 3)”;

“school” is not in the schema of the “profiles” table;

“school” is missing in some profiles;

“school” is a nested tuple.

Page 21: Cb15 presentation-yingyi

Language Comparison: Data Model

System      Top-level Values   Heterogeneity   Arrays   Bags   Maps   Nested Tuples   Primitive Values
Hive        Bags/Tuples        ✖               ✔        ✖      P      ✔               ✔
Impala      Bags/Tuples        ✖               ✖        ✖      ✖      ✖               ✔
Spark SQL   Bags/Tuples        ✖               ✔        ✖      ✔      ✔               ✔
Drill       Bags/Tuples        ✖               ✔        ✖      ✔      ✔               ✔
N1QL        Bags/Tuples        ✔               ✔        ✖      ✖      ✔               ✔
Cassandra   Bags/Tuples        ✖               P        ✖      P      ✖               ✔
MongoDB     Bags/Tuples        ✔               ✔        ✖      ✖      ✔               ✔
AsterixDB   Any Values         ✔               ✖        ✔      ✖      ✔               ✔

Page 22: Cb15 presentation-yingyi

Language Comparison: Types

System      Dynamic Type Check   Static Type Check   Any Type   Open Type   Union Type   Optional
Hive        ✖                    ✔                   ✖          ✖           ✖            ✖
Impala      ✖                    ✔                   ✖          ✖           ✖            ✖
Spark SQL   ✖                    ✔                   ✖          ✖           ✖            ✖
Drill       ✖                    ✔                   ✖          ✖           ✖            ✖
N1QL        ✔                    ✖                   –          –
Cassandra   ✖                    ✔                   ✖          ✖           ✖            ✖
MongoDB     ✔                    ✖                   –          –
AsterixDB   ✔                    ✔                   ✔          ✔           ✖            ✔

Page 23: Cb15 presentation-yingyi

Language Comparison: Path Navigation

System      Tuple Nav. absent   Tuple Nav. mismatch   Array Nav. absent   Array Nav. mismatch   Map Nav. absent   Map Nav. mismatch
Hive        error               error                 null                error                 null              error
Impala      error               error                 --                  --                    --                --
Spark SQL   error               error                 error               error                 null              error
Drill       error               error                 error               error                 null              error
N1QL        missing             missing               missing             missing               --                --
Cassandra   error               error                 --                  --                    --                --
MongoDB     missing             missing               --                  --                    --                --
AsterixDB   null                error                 error               error                 --                --

No Errors!
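
The tuple-navigation columns of this comparison can be read as two settings, one for an absent field and one for a type mismatch. A minimal Python sketch of that idea follows; the names `tuple_nav`, `NavigationError` and `outcome` are hypothetical, and the per-system settings are copied from the table rather than checked against the systems themselves.

```python
MISSING = object()  # sentinel for "missing" in this sketch
NULL = None         # Python None stands in for SQL null

class NavigationError(Exception):
    """Raised when a setting is 'error'."""

def tuple_nav(value, field, on_absent, on_mismatch):
    """Evaluate value.field; on_absent and on_mismatch are each one of
    'missing', 'null' or 'error'."""
    def fallback(mode, reason):
        if mode == "error":
            raise NavigationError(reason)
        return MISSING if mode == "missing" else NULL

    if not isinstance(value, dict):
        return fallback(on_mismatch, "value is not a tuple")
    if field not in value:
        return fallback(on_absent, "no field named " + field)
    return value[field]

# (absent, mismatch) settings taken from the tuple navigation columns.
SETTINGS = {
    "Hive": ("error", "error"),
    "N1QL": ("missing", "missing"),
    "AsterixDB": ("null", "error"),
}

def outcome(system, value, field):
    """Describe what navigating value.field yields under one system's settings."""
    absent, mismatch = SETTINGS[system]
    try:
        result = tuple_nav(value, field, absent, mismatch)
    except NavigationError:
        return "error"
    if result is MISSING:
        return "missing"
    return "null" if result is NULL else repr(result)

doc = {"abv": 9.5}
for system in SETTINGS:
    # Absent field "category", then navigation into the number 9.5.
    print(system, outcome(system, doc, "category"), outcome(system, doc["abv"], "m1"))
```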

Page 24: Cb15 presentation-yingyi

Language Comparison: SELECT Clause

System      Project Tuples with Non-scalar Subqueries   Project Tuples with Nested Collections   Project Non-Tuples
Hive        ✖                                           ✔                                        ✖
Impala      ✖                                           ✖                                        ✖
Spark SQL   ✖                                           ✔                                        ✖
Drill       ✖                                           ✔                                        ✖
N1QL        ✔                                           ✔                                        ✔
Cassandra   ✖                                           ✖                                        ✖
MongoDB     ✖                                           ✔                                        ✔
AsterixDB   ✔                                           ✔                                        ✔

Page 25: Cb15 presentation-yingyi

Language Comparison: FROM Clause

System      Subquery   Joins   Inner Unnest   Outer Unnest   Ordinal Positions
Hive        ✔          ✔       ✔              ✔              ✔
Impala      ✔          ✔       ✖              ✖              ✖
Spark SQL   ✔          ✔       ✖              ✖              ✖
Drill       ✔          ✔       ✔              ✖              ✖
N1QL        ✔          ✔       ✔              ✔              ✖
Cassandra   ✖          ✖       ✖              ✖              ✖
MongoDB     ✖          ✖       ✔              ✖              ✖
AsterixDB   ✔          ✔       ✔              ✖              ✔
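
The difference between the inner and outer unnest columns can be sketched in plain Python. This illustrates the general semantics only, not any one system's implementation; the function `unnest` and the sample documents are made up, and `None` stands in for the absent element on the outer side.

```python
def unnest(docs, array_field, outer=False):
    """Pair each document with each element of doc[array_field].

    Inner unnest drops documents whose array is absent or empty;
    outer unnest keeps them, paired with None.
    """
    for doc in docs:
        elements = doc.get(array_field)
        if isinstance(elements, list) and elements:
            for element in elements:
                yield doc["name"], element
        elif outer:
            yield doc["name"], None

# Made-up sample documents, for illustration only.
docs = [
    {"name": "ann", "hobbies": ["chess", "golf"]},
    {"name": "bob"},
]
print(list(unnest(docs, "hobbies")))
print(list(unnest(docs, "hobbies", outer=True)))
```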

Page 26: Cb15 presentation-yingyi

JSON data model

INNER/OUTER FLATTEN CLAUSE

Arbitrary subqueries in SELECT

Configurable parameters for semantics

Path navigations

Equality evaluations

Collection coercions

SQL++ (The “++” Part)

Supported by N1QL!

Made consistent in N1QL!

Page 27: Cb15 presentation-yingyi

SQL++ Configuration for N1QL

Configuration   Parameter           Value     Parameter                 Value
@path           tuple_nav.absent    missing   tuple_nav.type_mismatch   missing
                array_nav.absent    missing   array_nav.type_mismatch   missing
                map_nav.absent      missing   map_nav.type_mismatch     missing
@eq             complex             yes       type_mismatch             false
                null_eq_null        null      null_eq_value             null
                null_eq_missing     missing   missing_eq_missing        missing
                missing_eq_value    missing   null_and_missing          missing
                null_and_true       null      null_and_null             null
                missing_and_true    missing   missing_and_missing       missing
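
How the equality parameters in the `@eq` group could drive an evaluator can be sketched in plain Python. This is a hypothetical illustration of configurable semantics, not SQL++ or N1QL source code; `sqlpp_eq`, `kind` and `EQ_CONFIG` are names introduced here, and the values in `EQ_CONFIG` are copied from the slide.

```python
MISSING = object()  # sentinel for "missing" in this sketch
NULL = None         # Python None stands in for null

# Values copied from the @eq rows of the configuration on this slide.
EQ_CONFIG = {
    "type_mismatch": False,
    "null_eq_null": NULL,
    "null_eq_value": NULL,
    "null_eq_missing": MISSING,
    "missing_eq_missing": MISSING,
    "missing_eq_value": MISSING,
}

def kind(value):
    """Coarse JSON type name, so that 1 and 1.0 are both numbers."""
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    return type(value).__name__

def sqlpp_eq(left, right, config=EQ_CONFIG):
    """Evaluate left = right under the given equality configuration."""
    if left is MISSING and right is MISSING:
        return config["missing_eq_missing"]
    if left is MISSING or right is MISSING:
        other = right if left is MISSING else left
        return config["null_eq_missing"] if other is NULL else config["missing_eq_value"]
    if left is NULL and right is NULL:
        return config["null_eq_null"]
    if left is NULL or right is NULL:
        return config["null_eq_value"]
    if kind(left) != kind(right):
        return config["type_mismatch"]
    # "complex yes": objects and arrays are compared deeply.
    return left == right

print(sqlpp_eq(1, "1"))
print(sqlpp_eq({"a": [1, 2]}, {"a": [1, 2]}))
print(sqlpp_eq(NULL, MISSING) is MISSING)
```

Swapping in a different `config` dictionary changes the semantics without touching the evaluator, which is the point of making these choices parameters.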

Page 28: Cb15 presentation-yingyi

Summary

N1QL in a Bigger Context

Page 29: Cb15 presentation-yingyi

Summary

Operational Query Processing

Rich Data Model

SQL is BACK, but with EXTENSIONS!

Analytical Query Processing

Rich Data Model is a MUST!

Unification

The trend!

Page 30: Cb15 presentation-yingyi

Thank you.

Q & A