Qubole Data Platform – Cloud-native platform for AI ... · Uber Hudi Yes Yes Manual Parquet +...


Page 1:
Page 2:

Copyright 2017 © Qubole

Qubole Data Platform – Cloud-native platform for AI, Machine Learning, and Analytics

Data Lake

Data Prep and Ingestion, Analytics, AI and Machine Learning

Self-service access

Multiple use cases

Financial governance

Elastic scale

Security

Data Engineers

Data Analysts

Data Scientists

Platform Administrators

Cloud-Native Data Platform for AI, Machine Learning, and Analytics

. . .

Page 3:

Page 4:

● GDPR & CCPA

● Right to erasure

● Right to rectification

Page 5:

● Regenerating data
  ○ New table with deletions/updates on original table
  ○ Drop original
  ○ Rename new table
  ○ Expensive process

● Re-structure tables
  ○ User in partitions
  ○ Fast deletions
  ○ Update limited to partition
  ○ Restructuring and updates expensive
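
The "regenerating data" workaround can be sketched roughly as follows. This is only an illustration: Python's built-in sqlite3 module stands in for a Hive warehouse, and the table name events, its columns, and the helper regenerate_without_user are hypothetical. It shows why the approach is expensive: every surviving row is copied in order to delete a few.

```python
import sqlite3


def regenerate_without_user(conn, user_id):
    """Remove one user's rows by rewriting the whole table (hypothetical helper)."""
    cur = conn.cursor()
    # 1. New table that receives every row except the ones being deleted.
    cur.execute("CREATE TABLE events_new (user_id INTEGER, payload TEXT)")
    cur.execute(
        "INSERT INTO events_new SELECT user_id, payload FROM events WHERE user_id <> ?",
        (user_id,),
    )
    # 2. Drop the original table.
    cur.execute("DROP TABLE events")
    # 3. Rename the new table into place.
    cur.execute("ALTER TABLE events_new RENAME TO events")
    conn.commit()
    # Every surviving row was copied, which is what makes this expensive.
    return cur.execute("SELECT COUNT(*) FROM events").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, payload TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])
print(regenerate_without_user(conn, 2))  # number of rows rewritten to delete one user
```

On a real data lake the copy step rewrites the full table (or at best a full partition), so the cost grows with table size rather than with the number of deleted rows.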

Page 6:

● Fast and inexpensive Updates and Deletes
● Minimal impact on read performance
● Available/extendible to Apache Hive, Apache Spark and Presto

Page 7:

Solution             Update/Delete?   Primary Engine     Cross Engine Read   Cross Engine Write
Databricks Delta     YES              Databricks Spark   No                  No
Hive ACID v2         YES              Hive               No                  Limited
Apache Iceberg (I)   IN PROGRESS      Spark              Yes                 Limited

Page 8:

● ORC
● Fastest update/deletes
● No degradation in read performance in stable/compacted state
● Open source so extendible to all engines

Page 9:

Transactions

Locks

Differential Files

Page 10:

● Transaction and Write-Ids
  ○ Transaction Opened and Committed/Rolled-back for each operation
  ○ Aborted, uncommitted transactions not visible
  ○ Write Ids: determine write location
  ○ Atomicity and Isolation

Page 11:

Queries

FileSystem

Metastore

CREATE TABLE sample (a int, b int) TBLPROPERTIES('transactional_properties'='insert_only')

INSERT INTO TABLE sample VALUES(10,10)

sample
|_____ delta_00001_00001_0000
       |_______ 000000_0

Committed Transactions: {1}
Aborted Transactions: {}

INSERT INTO TABLE sample VALUES(20,20)

|_____ delta_00002_00002_0000
       |_______ 000000_0

Committed Transactions: {1, 2}
Aborted Transactions: {}

INSERT INTO TABLE sample VALUES(30,30) -- FAIL

|_____ delta_00003_00003_0000
       |_______ 000000_0

Committed Transactions: {1, 2}
Aborted Transactions: {3}
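
How a reader might use the metastore state above to decide which directories to scan can be sketched as follows. This is a simplified illustration, not Hive's actual code: the function visible_deltas and the regular expression are hypothetical, transaction ids and write ids are treated as the same number (as in the walkthrough), and directory names follow the shortened form used on the slide.

```python
import re

# Matches the delta directory names used in the slides, e.g. delta_00003_00003_0000.
DELTA_RE = re.compile(r"^delta_(\d+)_(\d+)_(\d+)$")


def visible_deltas(directories, committed):
    """Return the delta directories a reader should scan.

    A delta is treated as visible only if every write id in its [min, max]
    range belongs to a committed transaction; aborted or open writes are skipped.
    """
    visible = []
    for name in directories:
        match = DELTA_RE.match(name)
        if match is None:
            continue  # not a delta directory
        lo, hi = int(match.group(1)), int(match.group(2))
        if all(write_id in committed for write_id in range(lo, hi + 1)):
            visible.append(name)
    return visible


dirs = ["delta_00001_00001_0000", "delta_00002_00002_0000", "delta_00003_00003_0000"]
# Transaction 3 aborted, so its directory exists on disk but is ignored by readers.
print(visible_deltas(dirs, committed={1, 2}))
```

The failed insert leaves delta_00003_00003_0000 on the file system, but because write id 3 is not in the committed set the sketch never returns it, which is the atomicity and isolation behaviour the slide describes.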

Page 12:

● Differential Directories
  ○ Delta
  ○ Delete Delta

● Table Types
  ○ Insert Only Tables
    ■ Base + [Delta]
  ○ Full Acid Tables
    ■ Base + [Delta] + [Delete Delta]
    ■ Files have additional columns for Synthetic-RowId

● Base directories

Page 13:

Queries

CREATE TABLE sample (a int, b int) TBLPROPERTIES('transactional'='true')

INSERT OVERWRITE TABLE sample VALUES(10,10)

Base File

..   Txn   Bucket   Row Id   a      b
..   1     1234     1        10     10

DELETE FROM sample WHERE a = 10

Delete_Delta File

..   Txn   Bucket   Row Id   a      b
..   1     1234     1        null   null
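
The read-time effect of the Base File and Delete_Delta File rows above can be sketched as an anti-join on the synthetic row id (Txn, Bucket, Row Id). This is a minimal illustration under assumptions: the Row type and merge_on_read function are hypothetical, and real readers work on ORC files rather than Python lists.

```python
from typing import NamedTuple, Optional


class Row(NamedTuple):
    """One record with the synthetic row-id columns followed by user columns."""
    txn: int
    bucket: int
    row_id: int
    a: Optional[int]
    b: Optional[int]


def merge_on_read(base_rows, delete_rows):
    """Return user columns of base rows that have no matching delete record."""
    # Load the keys of all delete records into memory.
    deleted = {(r.txn, r.bucket, r.row_id) for r in delete_rows}
    return [
        (r.a, r.b)
        for r in base_rows
        if (r.txn, r.bucket, r.row_id) not in deleted
    ]


base = [Row(1, 1234, 1, 10, 10)]
delete_delta = [Row(1, 1234, 1, None, None)]
print(merge_on_read(base, delete_delta))  # prints []
```

The delete record carries only the key; its user columns are null, so the original row is never rewritten in place.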

Page 14:

sample
|_____ delta_00001_00001_0000
|      |_______ 000000_0
|
|_____ delta_00002_00002_0000
|      |_______ 000000_0
|
|_____ delete_delta_00003_00003_0000
       |_______ 000000_0

sample
|_____ delta_00001_00001_0000
|      |_______ 000000_0
|
|_____ delta_00002_00002_0000
|      |_______ 000000_0
|
|_____ delete_delta_00003_00003_0000
|      |_______ 000000_0
|
|_____ delta_00001_00002_0000
       |_______ 000000_0

sample
|_____ delta_00001_00001_0000
|      |_______ 000000_0
|
|_____ delta_00002_00002_0000
|      |_______ 000000_0
|
|_____ delete_delta_00003_00003_0000
|      |_______ 000000_0
|
|_____ delta_00001_00002_0000
|      |_______ 000000_0
|
|_____ base_0004
       |_______ 000000_0

sample
|
|_____ base_0004
       |_______ 000000_0
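
The last step of the sequence above, where only base_0004 remains, can be sketched as a cleanup pass that drops every directory covered by the newest base. This is a simplified illustration: the function clean and both regular expressions are hypothetical, directory names follow the shortened form used on the slide, and the sketch ignores that a real cleaner also has to wait for in-flight readers.

```python
import re

# delta_<min>_<max>_<stmt> and delete_delta_<min>_<max>_<stmt>, as named on the slide.
ACID_DIR_RE = re.compile(r"^(delta|delete_delta)_(\d+)_(\d+)_\d+$")
BASE_RE = re.compile(r"^base_(\d+)$")


def clean(directories):
    """Drop directories made obsolete by the newest base (simplified cleaner)."""
    bases = [int(m.group(1)) for m in map(BASE_RE.match, directories) if m]
    if not bases:
        return list(directories)  # nothing compacted into a base yet, keep everything
    newest = max(bases)
    kept = []
    for name in directories:
        base = BASE_RE.match(name)
        delta = ACID_DIR_RE.match(name)
        if base:
            if int(base.group(1)) == newest:
                kept.append(name)  # keep only the newest base
        elif delta:
            if int(delta.group(3)) > newest:
                kept.append(name)  # written after the base, still needed
        else:
            kept.append(name)  # unknown entries are left alone
    return kept


dirs = [
    "delta_00001_00001_0000",
    "delta_00002_00002_0000",
    "delete_delta_00003_00003_0000",
    "delta_00001_00002_0000",
    "base_0004",
]
print(clean(dirs))  # prints ['base_0004']
```

A delta written after the base, for example a hypothetical delta_00005_00005_0000, would survive the pass because its write id is higher than the base's.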

Page 18:

[Figure: scan pipeline for the query SELECT c1 FROM T WHERE c2 = 'x'. Labels in the diagram: TBS, isValid, Filter c2 = X, Result, EMPTYPAGE. Sample column values: y y x z z y and y y y y y y.]

Page 19:

● Hive's RecordReader
  ○ + Minimal work required
  ○ - Heavy performance penalty

Page 20:

○ + Extendible across formats
○ - Materialization at Join
○ - Load all Delete_Deltas in-memory

Page 21:
Page 24:

● Performance improvements
● Support Bucketed Tables
● ACID tables with non-ACID data from past
● Support ORC generated by Hive Streaming Connection API

Page 25:
Page 26:

Comparisons

Solution             Updates/Deletes   Snapshot Isolation   Compaction/Cleanup   File Formats
Databricks Delta     Yes               Yes                  Manual               Parquet
Hive ACID v2         Yes               Yes                  Automatic            ORC/All
Apache Iceberg (I)   No                Yes                  None                 Avro, Parquet, ORC
Uber Hudi            Yes               Yes                  Manual               Parquet + Avro

Page 27:

Comparisons

Solution             Updates/Deletes     Primary Engine     Compaction/Cleanup   File Formats
Databricks Delta     Available           Databricks Spark   Manual               Parquet
Hive ACID v2         Available           Hive               Automatic            ORC/All
Apache Iceberg (I)   Under development   Spark, Presto      None                 Avro, Parquet, ORC
Uber Hudi            Available           Spark              Manual               Parquet + Avro