Introduction to Apache Tajo: Future of Data Warehouse
![Page 1: Introduction to Apache Tajo: Future of Data Warehouse](https://reader033.fdocuments.us/reader033/viewer/2022042600/5878c3aa1a28ab26728b58e1/html5/thumbnails/1.jpg)
Introduction to Apache Tajo: Future of Data Warehouse
Jihoon Son / Gruter Inc.
![Page 2: Introduction to Apache Tajo: Future of Data Warehouse](https://reader033.fdocuments.us/reader033/viewer/2022042600/5878c3aa1a28ab26728b58e1/html5/thumbnails/2.jpg)
I am
● Jihoon Son (@jihoonson)
  ○ Ph.D. at Korea Univ.
  ○ Tajo project co-founder
  ○ Committer and PMC member of Apache Tajo
  ○ Research engineer at Gruter
  ○ LinkedIn
    ■ https://www.linkedin.com/in/jihoonson
![Page 3: Introduction to Apache Tajo: Future of Data Warehouse](https://reader033.fdocuments.us/reader033/viewer/2022042600/5878c3aa1a28ab26728b58e1/html5/thumbnails/3.jpg)
Today's Topic: Tajo
● What is Tajo?
  ○ Tajo / tάːzo / 타조
  ○ Ostrich in Korean
    ■ Fastest two-legged animal in the world
![Page 4: Introduction to Apache Tajo: Future of Data Warehouse](https://reader033.fdocuments.us/reader033/viewer/2022042600/5878c3aa1a28ab26728b58e1/html5/thumbnails/4.jpg)
Today's Topic: Tajo
● What is Apache Tajo?
  ○ Our Ostrich can do SQL processing on big data!
    ■ SQL-on-Hadoop system
    ■ Apache Top-level project
![Page 5: Introduction to Apache Tajo: Future of Data Warehouse](https://reader033.fdocuments.us/reader033/viewer/2022042600/5878c3aa1a28ab26728b58e1/html5/thumbnails/5.jpg)
Maybe You Think ...
SQL-on-Hadoop? Boring..
![Page 7: Introduction to Apache Tajo: Future of Data Warehouse](https://reader033.fdocuments.us/reader033/viewer/2022042600/5878c3aa1a28ab26728b58e1/html5/thumbnails/7.jpg)
SQL-on-Hadoop Systems
![Page 8: Introduction to Apache Tajo: Future of Data Warehouse](https://reader033.fdocuments.us/reader033/viewer/2022042600/5878c3aa1a28ab26728b58e1/html5/thumbnails/8.jpg)
SQL-on-Hadoop Systems
![Page 9: Introduction to Apache Tajo: Future of Data Warehouse](https://reader033.fdocuments.us/reader033/viewer/2022042600/5878c3aa1a28ab26728b58e1/html5/thumbnails/9.jpg)
SQL-on-Hadoop Systems
Long-running ETL jobs
Low-latency interactive analysis
![Page 10: Introduction to Apache Tajo: Future of Data Warehouse](https://reader033.fdocuments.us/reader033/viewer/2022042600/5878c3aa1a28ab26728b58e1/html5/thumbnails/10.jpg)
SQL-on-Hadoop Systems
Long-running ETL jobs
● Requirements
  ○ Stable query execution
    ■ Fault-tolerance
      ● Can avoid query resubmission
  ○ Adaptation to dynamic environment
    ■ Available resources, unpredictable delays, ...
![Page 11: Introduction to Apache Tajo: Future of Data Warehouse](https://reader033.fdocuments.us/reader033/viewer/2022042600/5878c3aa1a28ab26728b58e1/html5/thumbnails/11.jpg)
SQL-on-Hadoop Systems
Low-latency interactive analysis
● Requirements
  ○ Fast query execution
    ■ Several query execution techniques
    ■ In-memory processing
![Page 12: Introduction to Apache Tajo: Future of Data Warehouse](https://reader033.fdocuments.us/reader033/viewer/2022042600/5878c3aa1a28ab26728b58e1/html5/thumbnails/12.jpg)
Tajo is designed for Both Workloads
Long-running ETL jobs
Low-latency interactive analysis
![Page 13: Introduction to Apache Tajo: Future of Data Warehouse](https://reader033.fdocuments.us/reader033/viewer/2022042600/5878c3aa1a28ab26728b58e1/html5/thumbnails/13.jpg)
Who are using Tajo?
![Page 14: Introduction to Apache Tajo: Future of Data Warehouse](https://reader033.fdocuments.us/reader033/viewer/2022042600/5878c3aa1a28ab26728b58e1/html5/thumbnails/14.jpg)
Use Cases: SK Telecom
● Data warehousing & analysis
  ○ 1st telco in South Korea
    ■ 40 TB/day compressed data (2014)
![Page 15: Introduction to Apache Tajo: Future of Data Warehouse](https://reader033.fdocuments.us/reader033/viewer/2022042600/5878c3aa1a28ab26728b58e1/html5/thumbnails/15.jpg)
SK Telecom: Before Tajo
[Diagram labels: Operational Systems; Marketing; Sales; ERP; SCM; ETL (three stages); Integration Layer; ODS; Staging Area; Data Vault; Data Warehouse; Data Marts; Strategic Marts; Hadoop; MPP DBMS]
![Page 16: Introduction to Apache Tajo: Future of Data Warehouse](https://reader033.fdocuments.us/reader033/viewer/2022042600/5878c3aa1a28ab26728b58e1/html5/thumbnails/16.jpg)
SK Telecom: After Tajo
[Diagram labels: Operational Systems; Marketing; Sales; ERP; SCM; ETL (three stages); Integration Layer; ODS; Staging Area; Data Vault; Data Warehouse; Data Marts; Strategic Marts]
![Page 17: Introduction to Apache Tajo: Future of Data Warehouse](https://reader033.fdocuments.us/reader033/viewer/2022042600/5878c3aa1a28ab26728b58e1/html5/thumbnails/17.jpg)
SK Telecom: After Tajo
[Diagram labels: Operational Systems; Marketing; Sales; ERP; SCM; ETL (three stages); Integration Layer; ODS; Staging Area; Data Vault; Data Warehouse; Data Marts; Strategic Marts]
● Long-running ETL jobs
● Ad-hoc analysis
![Page 18: Introduction to Apache Tajo: Future of Data Warehouse](https://reader033.fdocuments.us/reader033/viewer/2022042600/5878c3aa1a28ab26728b58e1/html5/thumbnails/18.jpg)
Use Cases: SK Telecom
● Significantly reduced ETL & analysis time
  ○ Daily analysis becomes possible
  ○ More exploratory analysis is newly available with remaining resources
![Page 19: Introduction to Apache Tajo: Future of Data Warehouse](https://reader033.fdocuments.us/reader033/viewer/2022042600/5878c3aa1a28ab26728b58e1/html5/thumbnails/19.jpg)
Use Cases: Bluehole Studio
● Game log analysis
  ○ Finding principal causes of service-quality deficiencies
![Page 20: Introduction to Apache Tajo: Future of Data Warehouse](https://reader033.fdocuments.us/reader033/viewer/2022042600/5878c3aa1a28ab26728b58e1/html5/thumbnails/20.jpg)
Use Cases: Bluehole Studio
● Tajo on EMR
![Page 21: Introduction to Apache Tajo: Future of Data Warehouse](https://reader033.fdocuments.us/reader033/viewer/2022042600/5878c3aa1a28ab26728b58e1/html5/thumbnails/21.jpg)
Use Cases: Bluehole Studio
● Their first log analysis system
  ○ Easy and rapid deployment of Tajo
  ○ Gentle learning curve thanks to standard SQL
● Immediate action becomes possible for user complaints and hidden bugs
![Page 22: Introduction to Apache Tajo: Future of Data Warehouse](https://reader033.fdocuments.us/reader033/viewer/2022042600/5878c3aa1a28ab26728b58e1/html5/thumbnails/22.jpg)
Use Cases: Melon
● Data discovery
  ○ Music streaming service (26 million users)
  ○ Analysis of purchase history for target marketing
● Significantly reduced analysis time
  ○ Faster analysis by replacing Hive with Tajo
  ○ More analysis becomes possible
![Page 23: Introduction to Apache Tajo: Future of Data Warehouse](https://reader033.fdocuments.us/reader033/viewer/2022042600/5878c3aa1a28ab26728b58e1/html5/thumbnails/23.jpg)
So, Why should you use Tajo?
![Page 24: Introduction to Apache Tajo: Future of Data Warehouse](https://reader033.fdocuments.us/reader033/viewer/2022042600/5878c3aa1a28ab26728b58e1/html5/thumbnails/24.jpg)
So, Why should you use Tajo?
● Easy to use
![Page 25: Introduction to Apache Tajo: Future of Data Warehouse](https://reader033.fdocuments.us/reader033/viewer/2022042600/5878c3aa1a28ab26728b58e1/html5/thumbnails/25.jpg)
So, Why should you use Tajo?
● Easy to use
  ○ ANSI-SQL standard compliance (2003)
    ■ CTAS, Window functions, ...
![Page 26: Introduction to Apache Tajo: Future of Data Warehouse](https://reader033.fdocuments.us/reader033/viewer/2022042600/5878c3aa1a28ab26728b58e1/html5/thumbnails/26.jpg)
So, Why should you use Tajo?
● Easy to use
  ○ ANSI-SQL standard compliance (2003)
    ■ CTAS, Window functions, ...
  ○ Mature SQL features
    ■ Most existing queries can be executed without modification
![Page 27: Introduction to Apache Tajo: Future of Data Warehouse](https://reader033.fdocuments.us/reader033/viewer/2022042600/5878c3aa1a28ab26728b58e1/html5/thumbnails/27.jpg)
So, Why should you use Tajo?
● Easy to use
  ○ ANSI-SQL standard compliance (2003)
    ■ CTAS, Window functions, ...
  ○ Mature SQL features
    ■ Most existing queries can be executed without modification
  ○ Various data format support
    ■ Text, JSON, ORC, Parquet, …
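The two SQL features named above, CTAS (CREATE TABLE AS SELECT) and window functions, can be illustrated with a small sketch. It runs on SQLite through Python's `sqlite3` module purely as a stand-in engine (window functions need SQLite 3.25 or newer), so it says nothing about Tajo's own syntax details or behavior. The `orders` and `running_totals` tables and their columns are hypothetical.

```python
import sqlite3

# SQLite is only a stand-in engine here, not Tajo.
# Table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, order_day INTEGER, amount REAL);
    INSERT INTO orders VALUES
        ('alice', 1, 10.0), ('alice', 2, 30.0),
        ('bob',   1, 20.0), ('bob',   3,  5.0);

    -- CTAS: materialize a query result as a new table.
    -- Window function: running total per customer, ordered by day.
    CREATE TABLE running_totals AS
    SELECT customer,
           order_day,
           SUM(amount) OVER (PARTITION BY customer ORDER BY order_day)
               AS running_amount
    FROM orders;
""")

rows = conn.execute(
    "SELECT customer, order_day, running_amount "
    "FROM running_totals ORDER BY customer, order_day"
).fetchall()

for row in rows:
    print(row)
```

Because both constructs are part of standard SQL, a statement of this shape is the kind of existing query the slide suggests could be moved between engines with little or no change.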
![Page 28: Introduction to Apache Tajo: Future of Data Warehouse](https://reader033.fdocuments.us/reader033/viewer/2022042600/5878c3aa1a28ab26728b58e1/html5/thumbnails/28.jpg)
So, Why should you use Tajo?
● Optimized performance
![Page 29: Introduction to Apache Tajo: Future of Data Warehouse](https://reader033.fdocuments.us/reader033/viewer/2022042600/5878c3aa1a28ab26728b58e1/html5/thumbnails/29.jpg)
So, Why should you use Tajo?
● Optimized performance
  ○ Optimized code
    ■ Optimized I/O performance
      ● Nearly max I/O performance (~120 MB/s) per disk
    ■ Off-heap data processing
      ● Mitigating GC overhead
![Page 30: Introduction to Apache Tajo: Future of Data Warehouse](https://reader033.fdocuments.us/reader033/viewer/2022042600/5878c3aa1a28ab26728b58e1/html5/thumbnails/30.jpg)
So, Why should you use Tajo?
● Optimized performance
  ○ Cost-based query plan optimization
    ■ Join ordering
    ■ Best algorithm selection
      ● According to input size
    ■ Progressive optimization
      ● Further optimizes the query plan during query execution
      ● Especially effective for long-running queries
    ■ => Efficient star schema processing
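As a rough illustration of join ordering and size-based algorithm selection on a star schema, here is a minimal Python sketch. It is not Tajo's optimizer: the greedy smallest-first ordering, the 64 MB broadcast cutoff, the `Table` class, both function names, and all table sizes are assumptions made for the example (the table names are borrowed from TPC-DS).

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical cutoff: an input at or below this size is broadcast to every node.
BROADCAST_THRESHOLD_BYTES = 64 * 1024 * 1024


@dataclass(frozen=True)
class Table:
    name: str
    size_bytes: int  # estimated input size, e.g. from catalog statistics


def choose_join_algorithm(left: Table, right: Table) -> str:
    """Pick a join strategy from the estimated input sizes alone."""
    if min(left.size_bytes, right.size_bytes) <= BROADCAST_THRESHOLD_BYTES:
        return "broadcast"    # ship the small side to every node
    return "repartition"      # shuffle both sides on the join key


def plan_star_join(fact: Table, dimensions: List[Table]) -> List[Tuple[str, str]]:
    """Greedy join ordering: join the fact table with the smallest dimensions first."""
    ordered = sorted(dimensions, key=lambda t: t.size_bytes)
    return [(dim.name, choose_join_algorithm(fact, dim)) for dim in ordered]


MB, GB = 1024 ** 2, 1024 ** 3
fact = Table("store_sales", 500 * GB)  # sizes are made up for illustration
dims = [Table("customer", 2 * GB), Table("item", 50 * MB), Table("date_dim", 10 * MB)]

plan = plan_star_join(fact, dims)
for name, algorithm in plan:
    print(f"store_sales JOIN {name}: {algorithm}")
```

A progressive optimizer would go one step further and re-run a decision like `choose_join_algorithm` with the sizes actually observed after each stage finishes, rather than relying only on up-front estimates.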
![Page 31: Introduction to Apache Tajo: Future of Data Warehouse](https://reader033.fdocuments.us/reader033/viewer/2022042600/5878c3aa1a28ab26728b58e1/html5/thumbnails/31.jpg)
So, Why should you use Tajo?
● Various storage type support
![Page 32: Introduction to Apache Tajo: Future of Data Warehouse](https://reader033.fdocuments.us/reader033/viewer/2022042600/5878c3aa1a28ab26728b58e1/html5/thumbnails/32.jpg)
So, Why should you use Tajo?
● Various storage type support
![Page 33: Introduction to Apache Tajo: Future of Data Warehouse](https://reader033.fdocuments.us/reader033/viewer/2022042600/5878c3aa1a28ab26728b58e1/html5/thumbnails/33.jpg)
Logical Data Warehouse with Tajo
[Diagram labels: Application; Global view; DBMS; NoSQL; Cloud storage; On-premise storage]
![Page 34: Introduction to Apache Tajo: Future of Data Warehouse](https://reader033.fdocuments.us/reader033/viewer/2022042600/5878c3aa1a28ab26728b58e1/html5/thumbnails/34.jpg)
Logical Data Warehouse with Tajo
[Diagram labels: Application; Global view; DBMS; NoSQL; Cloud storage; On-premise storage]
● Fast delivery
● Easy maintenance
● Simple data flow
![Page 35: Introduction to Apache Tajo: Future of Data Warehouse](https://reader033.fdocuments.us/reader033/viewer/2022042600/5878c3aa1a28ab26728b58e1/html5/thumbnails/35.jpg)
How fast is Tajo?
![Page 36: Introduction to Apache Tajo: Future of Data Warehouse](https://reader033.fdocuments.us/reader033/viewer/2022042600/5878c3aa1a28ab26728b58e1/html5/thumbnails/36.jpg)
Evaluation on Cloud Environment
● Google Cloud Platform
  ○ Instance type: n1-standard-8
    ■ 8 cores, 30 GB RAM
![Page 37: Introduction to Apache Tajo: Future of Data Warehouse](https://reader033.fdocuments.us/reader033/viewer/2022042600/5878c3aa1a28ab26728b58e1/html5/thumbnails/37.jpg)
Target Systems
● Hive (0.12)
  ○ Baseline performance
  ○ Default configuration provided by GCP
    ■ Uses all available CPU and memory
● Tajo (0.11.0)
  ○ Default configuration provided by GCP
    ■ Uses all available CPU and memory
![Page 38: Introduction to Apache Tajo: Future of Data Warehouse](https://reader033.fdocuments.us/reader033/viewer/2022042600/5878c3aa1a28ab26728b58e1/html5/thumbnails/38.jpg)
Target Systems
● Spark-SQL (1.5.0)
  ○ Default configuration provided by GCP
    ■ Uses all available CPU and memory
    ■ Tungsten enabled by default
  ○ spark.sql.shuffle.partitions is adjusted for better performance
![Page 39: Introduction to Apache Tajo: Future of Data Warehouse](https://reader033.fdocuments.us/reader033/viewer/2022042600/5878c3aa1a28ab26728b58e1/html5/thumbnails/39.jpg)
TPC-DS
● Data
  ○ 24 tables
    ■ Plain text format
    ■ Stored on Google Cloud Storage
● Query
  ○ Queries that can be executed on every system without modification
    ■ Exception: Hive 0.12 doesn't support implicit joins, so every query had to be changed for Hive
![Page 40: Introduction to Apache Tajo: Future of Data Warehouse](https://reader033.fdocuments.us/reader033/viewer/2022042600/5878c3aa1a28ab26728b58e1/html5/thumbnails/40.jpg)
SF 1000, 50 instances
![Page 41: Introduction to Apache Tajo: Future of Data Warehouse](https://reader033.fdocuments.us/reader033/viewer/2022042600/5878c3aa1a28ab26728b58e1/html5/thumbnails/41.jpg)
SF 1000, 50 instances
![Page 42: Introduction to Apache Tajo: Future of Data Warehouse](https://reader033.fdocuments.us/reader033/viewer/2022042600/5878c3aa1a28ab26728b58e1/html5/thumbnails/42.jpg)
SF 1000, 50 instances
Cannot be run on 1TB
![Page 43: Introduction to Apache Tajo: Future of Data Warehouse](https://reader033.fdocuments.us/reader033/viewer/2022042600/5878c3aa1a28ab26728b58e1/html5/thumbnails/43.jpg)
SF 10000, 50 instances
![Page 44: Introduction to Apache Tajo: Future of Data Warehouse](https://reader033.fdocuments.us/reader033/viewer/2022042600/5878c3aa1a28ab26728b58e1/html5/thumbnails/44.jpg)
SF 10000, 50 instances
![Page 45: Introduction to Apache Tajo: Future of Data Warehouse](https://reader033.fdocuments.us/reader033/viewer/2022042600/5878c3aa1a28ab26728b58e1/html5/thumbnails/45.jpg)
Demo
![Page 46: Introduction to Apache Tajo: Future of Data Warehouse](https://reader033.fdocuments.us/reader033/viewer/2022042600/5878c3aa1a28ab26728b58e1/html5/thumbnails/46.jpg)
Simple Demo on EMR
● Using TPC-H data set, but
  ○ Lineitem table is stored on HDFS
  ○ Orders table is stored on PostgreSQL
  ○ Other tables are stored on S3
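The idea behind this demo, a single query that joins tables living on different storage systems, can be sketched in a few lines of Python. This is a toy illustration, not Tajo code and not the demo itself: the two scan functions are hypothetical stand-ins that return a handful of made-up in-memory rows instead of reading from HDFS or PostgreSQL, and the hash join is just one simple way to combine them.

```python
from typing import Dict, Iterator, Tuple


def scan_lineitem_on_hdfs() -> Iterator[Tuple[int, float]]:
    """Hypothetical stand-in for an HDFS scan: (l_orderkey, l_extendedprice)."""
    yield from [(1, 100.0), (1, 50.0), (2, 70.0), (3, 20.0)]


def scan_orders_on_postgresql() -> Iterator[Tuple[int, str]]:
    """Hypothetical stand-in for a PostgreSQL scan: (o_orderkey, o_orderstatus)."""
    yield from [(1, "F"), (2, "O")]


def revenue_by_order_status() -> Dict[str, float]:
    """Hash join of the two sources, then a sum of price per order status."""
    # Build side: load the smaller orders table into a hash table.
    status_by_order = dict(scan_orders_on_postgresql())

    # Probe side: stream lineitem rows and aggregate the matches.
    revenue: Dict[str, float] = {}
    for orderkey, price in scan_lineitem_on_hdfs():
        status = status_by_order.get(orderkey)
        if status is not None:  # inner join: drop rows without a matching order
            revenue[status] = revenue.get(status, 0.0) + price
    return revenue


result = revenue_by_order_status()
print(result)
```

The point of the sketch is that the join logic only sees rows, so where each table is stored is hidden behind its scan function. That separation is what lets one query span several storage systems.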
![Page 47: Introduction to Apache Tajo: Future of Data Warehouse](https://reader033.fdocuments.us/reader033/viewer/2022042600/5878c3aa1a28ab26728b58e1/html5/thumbnails/47.jpg)
Apache Tajo
● Is excellent for both long-running ETL jobs and exploratory ad-hoc analysis
● Is very fast
● Supports query federation on diverse data sources
![Page 48: Introduction to Apache Tajo: Future of Data Warehouse](https://reader033.fdocuments.us/reader033/viewer/2022042600/5878c3aa1a28ab26728b58e1/html5/thumbnails/48.jpg)
Get Involved!
● We are recruiting contributors!
● General
  ○ http://tajo.apache.org/
● Getting Started
  ○ http://tajo.apache.org/docs/current/getting_started.html
● Downloads
  ○ http://tajo.apache.org/downloads.html
● Issue tracker
  ○ http://issues.apache.org/jira/browse/TAJO
● Join the mailing list
  ○ [email protected]
  ○ [email protected]
![Page 49: Introduction to Apache Tajo: Future of Data Warehouse](https://reader033.fdocuments.us/reader033/viewer/2022042600/5878c3aa1a28ab26728b58e1/html5/thumbnails/49.jpg)
Useful Links
● EMR bootstrap
  ○ https://github.com/awslabs/emr-bootstrap-actions/tree/master/tajo
● How to set up Tajo on EMR
  ○ http://www.gruter.com/blog/setting-up-a-tajo-cluster-on-amazon-emr/
![Page 50: Introduction to Apache Tajo: Future of Data Warehouse](https://reader033.fdocuments.us/reader033/viewer/2022042600/5878c3aa1a28ab26728b58e1/html5/thumbnails/50.jpg)
Q & A