Phoenix: We put the SQL back in NoSQL
James Taylor, jtaylor@salesforce.com

Page 1: Phoenix James Taylor jtaylor@salesforce.com We put the SQL back in NoSQL.

Page 2: Agenda

- What is Phoenix?
- Why SQL?
- What is next?
- Q&A

Page 3: What is Phoenix?

- SQL layer on top of HBase
- Delivered as an embedded JDBC driver
- Targets low-latency queries over HBase data
- Columns modeled as multi-part row key and key values
- Versioned schema repository
- Query engine transforms SQL into puts, deletes, and scans
- Uses native HBase APIs instead of Map/Reduce
- Brings the computation to the data:
  - Aggregate, insert, and delete data through coprocessors
  - Push predicates through custom filters
- 100% Java
- Open source here: https://github.com/forcedotcom/phoenix

Page 4: Why SQL?

- Broaden HBase adoption
  - Give folks an API they already know
- Reduce the amount of code users need to write

    SELECT TRUNC(date, 'DAY'), AVG(cpu_usage)
    FROM web_stat
    WHERE domain LIKE 'Salesforce%'
    GROUP BY TRUNC(date, 'DAY')

- Performance optimizations transparent to the user
  - Aggregation
  - Stats gathering
  - Secondary indexing
- Leverage existing tooling
  - SQL client
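
To make the "less code" point concrete, here is a rough sketch of what a client might otherwise write by hand to get the same daily averages as the query on this slide, once the matching rows are already in memory. The `Stat` and `DailyAverage` classes are hypothetical and not part of Phoenix, days are truncated in UTC as an assumption, and the HBase scan that would fetch the rows is left out entirely.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for one row of web_stat; not a Phoenix type.
final class Stat {
    final String domain;
    final long dateMillis;   // epoch milliseconds, assumed UTC
    final double cpuUsage;

    Stat(String domain, long dateMillis, double cpuUsage) {
        this.domain = domain;
        this.dateMillis = dateMillis;
        this.cpuUsage = cpuUsage;
    }
}

final class DailyAverage {
    private static final long DAY_MILLIS = 24L * 60 * 60 * 1000;

    // Hand-written counterpart of the SELECT ... GROUP BY TRUNC(date,'DAY') query.
    static Map<Long, Double> averageByDay(List<Stat> rows, String domainPrefix) {
        Map<Long, double[]> sumAndCount = new TreeMap<>();
        for (Stat row : rows) {
            if (!row.domain.startsWith(domainPrefix)) {
                continue; // WHERE domain LIKE 'prefix%'
            }
            // TRUNC(date,'DAY'): round down to the start of the day
            long day = Math.floorDiv(row.dateMillis, DAY_MILLIS) * DAY_MILLIS;
            double[] acc = sumAndCount.computeIfAbsent(day, d -> new double[2]);
            acc[0] += row.cpuUsage; // running sum
            acc[1] += 1;            // running count
        }
        Map<Long, Double> result = new TreeMap<>();
        for (Map.Entry<Long, double[]> e : sumAndCount.entrySet()) {
            result.put(e.getKey(), e.getValue()[0] / e.getValue()[1]); // AVG
        }
        return result;
    }
}
```

Even this version covers only the grouping and averaging; the four-line SQL statement also expresses the scan and the filtering.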

Page 8: But I can’t surface x,y,z in SQL…

- Define multi-part row keys

    CREATE TABLE web_stat (
        domain VARCHAR NOT NULL,
        feature VARCHAR NOT NULL,
        date DATE NOT NULL,
        usage BIGINT,
        active_visitor INTEGER,
        CONSTRAINT pk PRIMARY KEY (domain, feature, date));
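
As a rough illustration of what a multi-part row key means at the byte level, the sketch below concatenates the three primary-key columns of web_stat into a single byte array. The encoding is an assumption chosen so that byte order follows column order (a zero byte after each variable-length string, then an 8-byte big-endian timestamp with the sign bit flipped); it is not taken from the slides and may differ from what Phoenix actually writes. `RowKeySketch` is a hypothetical name.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class RowKeySketch {
    // Builds domain + 0x00 + feature + 0x00 + 8-byte date (assumed layout).
    static byte[] compose(String domain, String feature, long dateMillis) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        appendString(out, domain);
        appendString(out, feature);
        // Flip the sign bit so negative timestamps sort before positive ones
        // when the bytes are compared as unsigned values.
        byte[] date = ByteBuffer.allocate(Long.BYTES)
                .putLong(dateMillis ^ Long.MIN_VALUE)
                .array();
        out.write(date, 0, date.length);
        return out.toByteArray();
    }

    private static void appendString(ByteArrayOutputStream out, String value) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        out.write(bytes, 0, bytes.length);
        out.write(0); // separator marking the end of a variable-length part
    }

    // Unsigned lexicographic comparison, the order in which HBase sorts row keys.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }
}
```

With a layout like this, rows for one domain and feature sit next to each other in date order, which is what lets a query on the leading key columns become a contiguous range scan.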

Page 10: But I can’t surface x,y,z in SQL…

- Define multi-part row keys
- Implement my whizz-bang custom function
  - Derive class from ScalarFunction
  - Add annotation to define name, args, and types
  - Implement evaluate method
  - Register function
  (blog on this coming soon: http://phoenix-hbase.blogspot.com/)

Page 12: But I can’t surface x,y,z in SQL…

- Define multi-part row keys
- Implement my whizz-bang built-in function
- Run snapshot in time queries
  - Set CURRENT_SCN property on connection to earlier timestamp
    - Queries will see only rows before timestamp
    - Schema in-place at that point in time will be used
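
A minimal sketch of the snapshot-in-time setup over JDBC might look like the following. The property key string "CurrentSCN" is an assumption about how the driver spells the CURRENT_SCN setting named on the slide, so check it against the Phoenix release in use. `SnapshotQuery` is a hypothetical helper, and actually opening a connection requires the Phoenix driver on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.Properties;

final class SnapshotQuery {
    // Assumed key for the CURRENT_SCN connection property named on the slide.
    static final String CURRENT_SCN_KEY = "CurrentSCN";

    // Connection properties that pin the connection to an earlier timestamp.
    static Properties snapshotProperties(long timestampMillis) {
        Properties props = new Properties();
        props.setProperty(CURRENT_SCN_KEY, Long.toString(timestampMillis));
        return props;
    }

    // Opens a connection whose queries should see only rows before the timestamp.
    static Connection openAsOf(String jdbcUrl, long timestampMillis) throws SQLException {
        return DriverManager.getConnection(jdbcUrl, snapshotProperties(timestampMillis));
    }
}
```

A caller would pass its own Phoenix JDBC URL and an epoch-millisecond timestamp to `openAsOf`, then run ordinary SELECT statements on the returned connection.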

Page 14: But I can’t surface x,y,z in SQL…

- Define multi-part row keys
- Implement my whizz-bang built-in function
- Run snapshot in time queries
- Nest child entities inside of a row
  - Declare new child entity as nested table
  - Prefix column qualifier of nested entities with:
    - table name + child primary key + child column name
  - Restrict join to be only through parent/child relation
  - Execute query by scanning nested child rows
  TBD: https://github.com/forcedotcom/phoenix/issues/19

Page 16: But I can’t surface x,y,z in SQL…

- Define multi-part row keys
- Implement my whizz-bang built-in function
- Run snapshot in time queries
- Nest child entities inside of a row
- Prevent hot spotting on writes
  - “Salt” row key on upsert by mod-ing with cluster size
  - Query for fully qualified key by inserting salt byte
  - Range scan by concatenating results of scan over all possible salt bytes
  - Or alternately
    - Define column used for hash to derive row key prefix
  TBD: https://github.com/forcedotcom/phoenix/issues/74
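
The salting idea above can be sketched as follows. This is an illustration under stated assumptions, not the planned Phoenix implementation: the hash function (`Arrays.hashCode`), the one-byte prefix layout, and the `SaltedKeys` class are all hypothetical, and `bucketCount` stands in for the "cluster size" on the slide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SaltedKeys {
    // Salt byte = hash of the row key, mod the number of buckets.
    static byte saltByte(byte[] rowKey, int bucketCount) {
        if (bucketCount < 1 || bucketCount > 256) {
            throw new IllegalArgumentException("bucketCount must be in 1..256");
        }
        return (byte) Math.floorMod(Arrays.hashCode(rowKey), bucketCount);
    }

    // Key written on upsert: salt byte followed by the original row key.
    // A point lookup recomputes the same salt byte and prepends it.
    static byte[] salted(byte[] rowKey, int bucketCount) {
        return prepend(saltByte(rowKey, bucketCount), rowKey);
    }

    // A range scan cannot know the salt, so it issues one scan per possible
    // salt byte; the caller concatenates or merges the results.
    static List<byte[]> scanStartKeys(byte[] startKey, int bucketCount) {
        if (bucketCount < 1 || bucketCount > 256) {
            throw new IllegalArgumentException("bucketCount must be in 1..256");
        }
        List<byte[]> keys = new ArrayList<>();
        for (int salt = 0; salt < bucketCount; salt++) {
            keys.add(prepend((byte) salt, startKey));
        }
        return keys;
    }

    private static byte[] prepend(byte salt, byte[] key) {
        byte[] out = new byte[key.length + 1];
        out[0] = salt;
        System.arraycopy(key, 0, out, 1, key.length);
        return out;
    }
}
```

The trade-off is visible in the code: writes with adjacent keys spread across buckets, while every range scan fans out into `bucketCount` scans.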

Page 18: But I can’t surface x,y,z in SQL…

- Define multi-part row keys
- Implement my whizz-bang built-in function
- Run snapshot in time queries
- Nest child entities inside of a row
- Prevent hot spotting on writes
- Increment atomic counter
  - Surface the HBase put-and-increment functionality through the standard SQL sequence support
  TBD: https://github.com/forcedotcom/phoenix/issues/18

Page 20: But I can’t surface x,y,z in SQL…

- Define multi-part row keys
- Implement my whizz-bang built-in function
- Run snapshot in time queries
- Nest child entities inside of a row
- Prevent hot spotting on writes
- Increment atomic counter
- Sample table data
  - Support the standard SQL TABLESAMPLE clause
  - Implement filter that uses a skip next hint
  - Base next key on the table stats “guide posts”
  TBD: https://github.com/forcedotcom/phoenix/issues/22

Page 22: But I can’t surface x,y,z in SQL…

- Define multi-part row keys
- Implement my whizz-bang built-in function
- Run snapshot in time queries
- Nest child entities inside of a row
- Prevent hot spotting on writes
- Increment atomic counter
- Sample table data
- Declare columns at query time

    SELECT col1, col2, col3
    FROM my_table(col2 VARCHAR, col3 INTEGER)
    WHERE col3 > 10

  TBD: https://github.com/forcedotcom/phoenix/issues/9

Page 23: Conclusion

- Phoenix fits the 80/20 use case rule
- Let us know what you’d like to see added
- Get involved – we need your help!
- Think about how your new feature can be surfaced in SQL

Page 24: Thank you!

Questions/comments?

Page 25: Query Processing

    SELECT feature, SUM(txns)
    FROM product_metrics
    WHERE org_id = :1
    AND date >= :2 AND date <= :3
    AND io_time > 100
    GROUP BY feature

Product Metrics HTable
- Row Key: ORG_ID, DATE, FEATURE
- Key Values: TXNS, IO_TIME, RESPONSE_TIME

- Scan
  - Start key: ORG_ID (:1) + DATE (:2)
  - End key: ORG_ID (:1) + DATE (:3)
- Filter
  - Filter: IO_TIME > 100
- Aggregation
  - Intercepts scan on region server
  - Builds map of distinct FEATURE values
  - Returns one row per distinct group
  - Client does final merge
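
The aggregation step on this slide can be sketched as a two-phase computation: each region builds a partial map of FEATURE to SUM(txns) over its own rows, and the client merges the partial maps. The classes below are hypothetical in-memory stand-ins, not Phoenix or HBase types, and the scan range on ORG_ID and DATE is assumed to have already selected the rows.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for one row of product_metrics within the scan range.
final class Metric {
    final String feature;
    final long txns;
    final long ioTime;

    Metric(String feature, long txns, long ioTime) {
        this.feature = feature;
        this.txns = txns;
        this.ioTime = ioTime;
    }
}

final class GroupBySketch {
    // Region-server side: apply the filter (io_time > threshold) and build a
    // map with one entry per distinct FEATURE value seen in this region.
    static Map<String, Long> aggregateRegion(List<Metric> regionRows, long ioTimeThreshold) {
        Map<String, Long> partial = new HashMap<>();
        for (Metric row : regionRows) {
            if (row.ioTime > ioTimeThreshold) {
                partial.merge(row.feature, row.txns, Long::sum);
            }
        }
        return partial;
    }

    // Client side: final merge of the per-region partial sums.
    static Map<String, Long> mergeOnClient(List<Map<String, Long>> partials) {
        Map<String, Long> merged = new TreeMap<>();
        for (Map<String, Long> partial : partials) {
            partial.forEach((feature, sum) -> merged.merge(feature, sum, Long::sum));
        }
        return merged;
    }
}
```

Because SUM is associative, only one row per distinct group has to travel from each region to the client, rather than every matching row.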

Page 26: Phoenix Query Optimizations

- Start/stop key of scan based on AND-ed columns
  - Through SUBSTR, ROUND, TRUNC, LIKE
- Parallelized on client by chunking over start/stop key of scan
- Aggregation on region-servers through coprocessor
  - Inline for GROUP BY over row key ordered columns
  - In memory map per group otherwise
- WHERE clause executed through custom filters
  - Incremental evaluation with early termination
  - Evaluated through byte pointers
- IN and OR over same column (in progress)
  - Becomes batched get or filter with next row hint
- Top N queries (future)
  - Through coprocessor keeping top N rows
- TABLESAMPLE (future)
  - Becomes filter with next row hint
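
As one example of the first optimization, a predicate such as domain LIKE 'Salesforce%' on a leading row key column can be turned into a scan range instead of a full-table filter. The sketch below shows one common way to derive the range from the prefix; it is an illustration rather than Phoenix's actual code, and `PrefixScanRange` is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class PrefixScanRange {
    // Inclusive start key: the prefix itself.
    static byte[] startKey(String prefix) {
        return prefix.getBytes(StandardCharsets.UTF_8);
    }

    // Exclusive stop key: the smallest key that sorts after every key
    // beginning with the prefix. Trailing 0xFF bytes cannot be incremented,
    // so they are dropped first. An empty result means "no upper bound".
    static byte[] stopKey(String prefix) {
        byte[] key = prefix.getBytes(StandardCharsets.UTF_8);
        int end = key.length;
        while (end > 0 && key[end - 1] == (byte) 0xFF) {
            end--;
        }
        if (end == 0) {
            return new byte[0];
        }
        byte[] stop = Arrays.copyOf(key, end);
        stop[end - 1]++; // increment the last remaining byte
        return stop;
    }
}
```

For the prefix "Salesforce" this yields the range from "Salesforce" (inclusive) to "Salesforcf" (exclusive), so the scan touches only rows whose key starts with that prefix.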

Page 27: Phoenix Performance

Page 28: Phoenix Performance