Introduction to Apache Tajo: Data Warehouse for Big Data
Jihoon Son / Gruter Inc.
About Me
● Jihoon Son (@jihoonson)
  ○ Tajo project co-founder
  ○ Committer and PMC member of Apache Tajo
  ○ Research engineer at Gruter
Outline
● About Tajo
● Features of the Recent Release
● Demo
● Roadmap
What is Tajo?
● Tajo /tάːzo/ (타조)
  ○ An ostrich in Korean
  ○ The world's fastest two-legged animal
What is Tajo?
● Apache Top-level Project
  ○ Big data warehouse system
    ■ ANSI-SQL compliant
    ■ Mature SQL features
      ● Various types of join, window functions
  ○ Rapid query execution with its own distributed DAG engine
    ■ Low-latency queries and long-running batch queries in a single system
    ■ Fault tolerance
  ○ Beyond SQL-on-Hadoop
    ■ Supports various types of storage
Architecture Overview
[Diagram: JDBC clients, TSQL, the web UI, and the REST API submit queries to a Tajo Master. Each Tajo Master has a Catalog Server, backed by a DBMS or HCatalog, to manage metadata, and it allocates each query to a Query Master. Every Tajo Worker hosts a Query Master, a Query Executor, and a Storage Service on top of the underlying storage. The Query Master sends tasks to the workers and monitors them.]
Who Is Using Tajo?
● Use case: replacement of a commercial DW
  ○ 1st telco in South Korea
    ■ Replacement of long-running ETL workloads on datasets of several TB
    ■ Lots of daily reports about user behavior
    ■ Ad-hoc analysis on TB datasets
  ○ Benefits
    ■ Simplified architecture for data analysis
      ● A unified system for DW ETL, OLAP, and Hadoop ETL
    ■ Much lower cost, more data analysis within the same SLA
      ● Saved the license fee of the commercial DW
Who Is Using Tajo?
● Use case: data discovery
  ○ Music streaming service (26 million users)
    ■ Analysis of purchase history for target marketing
  ○ Benefits
    ■ Interactive queries on large datasets
    ■ Data analysis with familiar BI tools
Recent Release: 0.11
● Feature highlights
  ○ Query federation
  ○ JDBC-based storage support
  ○ Self-describing data formats support
  ○ Multi-query support
  ○ More stable and efficient join execution
  ○ Index support
  ○ Python UDF/UDAF support
Recent Release: 0.11
● Today's topics
  ○ Query federation
  ○ JDBC-based storage support
  ○ Self-describing data formats support
Query Federation with Tajo
Your Data
● Your data might be spread across multiple heterogeneous sites
  ○ Cloud, DBMS, Hadoop, NoSQL, …
[Diagram: your data scattered across a DBMS, applications, cloud storage, on-premise storage, and NoSQL stores]
Your Data
● Even within a single site, your data might be stored in different data formats
[Diagram: JSON, CSV, Parquet, ORC, log files, ...]
Your Data
● How to analyze distributed data?
  ○ Traditionally ...
[Diagram: data from the DBMS, applications, cloud storage, on-premise storage, and NoSQL stores passes through ETL transforms to build a global view]
● Long delivery
● Complex data flow
● Human-intensive
Your Data with Tajo
● Query federation
[Diagram: a global view sits directly over the DBMS, applications, cloud storage, on-premise storage, and NoSQL stores]
● Fast delivery
● Easy maintenance
● Simple data flow
Storage and Data Format Support
[Figure: the supported data formats and storage types]
Create Table
> CREATE EXTERNAL TABLE archive1 (id BIGINT, ...) USING text WITH ('text.delimiter'='|') LOCATION 'hdfs://localhost:8020/archive1';
> CREATE EXTERNAL TABLE user (user_id BIGINT, ...) USING orc WITH ('orc.compression.kind'='snappy') LOCATION 's3://user';
> CREATE EXTERNAL TABLE table1 (key TEXT, ...) USING hbase LOCATION 'hbase:zk://localhost:2181/uptodate';
> ...
(Callouts: the USING clause names the data format; the LOCATION clause gives the storage URI.)
Create Table
> CREATE EXTERNAL TABLE archive1 (id BIGINT, ...) USING text WITH ('text.delimiter'='|','text.null'='\\N','compression.codec'='org.apache.hadoop.io.compress.SnappyCodec','timezone'='UTC+9','text.skip.headerlines'='2') LOCATION 'hdfs://localhost:8020/tajo/warehouse/archive1';
> CREATE EXTERNAL TABLE archive2 (id BIGINT, ...) USING text WITH ('text.delimiter'='|','text.null'='\\N','compression.codec'='org.apache.hadoop.io.compress.SnappyCodec','timezone'='UTC+9','text.skip.headerlines'='2') LOCATION 'hdfs://localhost:8020/tajo/warehouse/archive2';
> CREATE EXTERNAL TABLE archive3 (id BIGINT, ...) USING text WITH ('text.delimiter'='|','text.null'='\\N','compression.codec'='org.apache.hadoop.io.compress.SnappyCodec','timezone'='UTC+9','text.skip.headerlines'='2') LOCATION 'hdfs://localhost:8020/tajo/warehouse/archive3';
> ...
Too tedious!
Introduction to Tablespace
● Tablespace
  ○ Registered storage space
  ○ A tablespace is identified by a unique URI
  ○ Configurations and policies are shared by all tables in a tablespace
    ■ Storage type
    ■ Default data format and supported data formats
  ○ It allows users to reuse registered storage configurations and policies
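The reuse described above can be illustrated with a small Python sketch. It is not Tajo's implementation: the `Tablespace` class and `resolve_table` helper are hypothetical names, and the rule that table-level options override tablespace-level ones is an assumption made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Tablespace:
    """Hypothetical stand-in for a registered tablespace (not Tajo's API)."""
    name: str
    uri: str
    default_format: str = "text"
    configs: dict = field(default_factory=dict)


def resolve_table(space, table, data_format=None, overrides=None):
    """Derive a table's location, format, and options from its tablespace."""
    return {
        "location": f"{space.uri.rstrip('/')}/{table}",
        "format": data_format or space.default_format,
        # Assumption: table-level options win over shared tablespace options.
        "options": {**space.configs, **(overrides or {})},
    }


warehouse = Tablespace(
    name="warehouse",
    uri="hdfs://localhost:8020/tajo/warehouse",
    configs={"text.delimiter": "|", "text.null": "\\N"},
)
archive1 = resolve_table(warehouse, "archive1")
archive2 = resolve_table(warehouse, "archive2", overrides={"text.delimiter": ","})
```

Both tables pick up the shared location prefix and text options without repeating them, which is the point of registering the storage once.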
Tablespaces, Databases, and Tables
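[Diagram: in the namespace, tables (Table1, Table2, Table3, ...) belong to databases (Database1, ...). In the physical space, tables are placed in tablespaces (Tablespace1, Tablespace2, Tablespace3, ...), which sit on the underlying storage (Storage1, Storage2, ...).]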
Tablespace Configuration
{
  "spaces" : {
    "warehouse" : {
      "uri" : "hdfs://localhost:8020/tajo/warehouse",
      "configs" : {
        "text.delimiter" : "|",
        "text.null" : "\\N",
        "compression.codec" : "org.apache.hadoop.io.compress.SnappyCodec",
        "timezone" : "UTC+9",
        "text.skip.headerlines" : "2"
      }
    },
    "hbase1" : {
      "uri" : "hbase:zk://localhost:2181/table1"
    }
  }
}
(Callouts: "warehouse" and "hbase1" are tablespace names; each "uri" value is the tablespace URI.)
Create Table
> CREATE TABLE archive1 (id BIGINT, ...) TABLESPACE warehouse;
(Callouts: "warehouse" is the tablespace name. The data format is omitted, so the default data format, TEXT, is used.)
"warehouse" : {
  "uri" : "hdfs://localhost:8020/tajo/warehouse",
  "configs" : {
    "text.delimiter" : "|",
    "text.null" : "\\N",
    "compression.codec" : "org.apache.hadoop.io.compress.SnappyCodec",
    "timezone" : "UTC+9",
    "text.skip.headerlines" : "2"
  }
},
Create Table
> CREATE TABLE archive1 (id BIGINT, ...) TABLESPACE warehouse;
> CREATE TABLE archive2 (id BIGINT, ...) TABLESPACE warehouse;
> CREATE TABLE archive3 (id BIGINT, ...) TABLESPACE warehouse;
> CREATE TABLE user (user_id BIGINT, ...) TABLESPACE aws USING orc WITH ('orc.compression.kind'='snappy');
> CREATE TABLE table1 (key TEXT, ...) TABLESPACE hbase1;
> ...
Querying on Different Data Silos
● How does a worker access different data sources?
  ○ Storage service
    ■ Returns a proper handler for the underlying storage
> SELECT ... FROM hdfs_table, hbase_table, ...
[Diagram: each Tajo Worker runs a Query Engine and a Storage Service. The Storage Service supplies an HDFS handler on workers reading from HDFS and an HBase handler on the worker reading from HBase.]
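One way to picture "return a proper handler" is a registry keyed by the scheme of the table URI. The Python sketch below is only an illustration under that assumption; `HdfsHandler`, `HBaseHandler`, and `handler_for` are hypothetical names, not Tajo classes.

```python
class HdfsHandler:
    """Hypothetical handler for tables stored on HDFS."""
    kind = "hdfs"


class HBaseHandler:
    """Hypothetical handler for tables stored in HBase."""
    kind = "hbase"


# Registry keyed by URI scheme (an assumption made for this sketch).
HANDLERS = {"hdfs": HdfsHandler, "hbase": HBaseHandler}


def handler_for(table_uri):
    """Return a handler instance chosen by the text before the first ':'."""
    scheme = table_uri.split(":", 1)[0]
    if scheme not in HANDLERS:
        raise ValueError(f"no storage handler registered for scheme {scheme!r}")
    return HANDLERS[scheme]()


hdfs_handler = handler_for("hdfs://localhost:8020/archive1")
hbase_handler = handler_for("hbase:zk://localhost:2181/uptodate")
```

With this shape, the query engine stays unaware of where a table lives; supporting a new storage type means registering one more handler.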
JDBC-based Storage Support
JDBC-based Storage
● Storage providing the JDBC interface
  ○ PostgreSQL, MySQL, MariaDB, ...
● Databases of JDBC-based storage are mapped to Tajo databases
[Diagram: in the JDBC-based storage, jdbc_db1 and jdbc_db2 each hold Table1, Table2, Table3, …; in Tajo they appear as tajo_db1 and tajo_db2 with the same tables.]
Tablespace Configuration
{
  "spaces": {
    "pgsql_db1": {
      "uri": "jdbc:postgresql://hostname:port/db1",
      "configs": {
        "mapped_database": "tajo_db1",
        "connection_properties": {
          "user": "tajo",
          "password": "xxxx"
        }
      }
    }
  }
}
(Callouts: "pgsql_db1" is the tablespace name, "db1" is the PostgreSQL database name, and "tajo_db1" is the Tajo database name.)
Return to Query Federation
● How to correlate data on JDBC-based storage and others?
  ○ Need to have a global view of metadata across different storage types
    ■ Tajo also has its own metadata for its data
    ■ Each JDBC-based storage has its own metadata for its data
    ■ Each NoSQL storage has metadata for its data
    ■ …
Metadata Federation
● Federating metadata of underlying storage
[Diagram: the Catalog Interface sits on top of a Linked Metadata Manager, which combines the Tajo catalog metadata provider (backed by a DBMS or HCatalog), a DBMS metadata provider, a NoSQL metadata provider, ... The metadata listed in the diagram:]
● Tablespace
● Database
● Tables
● Schema names
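The idea of linking several metadata providers behind one catalog interface can be sketched in Python as follows. This is a simplified illustration, not Tajo's Linked Metadata Manager: `DictProvider` and `LinkedCatalog` are hypothetical names, the providers are in-memory dictionaries, and the `body` column is made up for the example.

```python
class DictProvider:
    """Hypothetical metadata provider backed by an in-memory dict."""

    def __init__(self, databases):
        self._databases = databases  # {database: {table: [column, ...]}}

    def databases(self):
        return list(self._databases)

    def table_schema(self, database, table):
        return self._databases[database][table]


class LinkedCatalog:
    """Presents several providers as a single catalog."""

    def __init__(self, providers):
        self._providers = providers

    def databases(self):
        # The global view is the union of every provider's databases.
        return sorted(db for p in self._providers for db in p.databases())

    def table_schema(self, database, table):
        # Delegate to whichever provider owns the database.
        for provider in self._providers:
            if database in provider.databases():
                return provider.table_schema(database, table)
        raise KeyError(database)


tajo_catalog = DictProvider({"default": {"archive": ["id", "body"]}})
pgsql_catalog = DictProvider({"tajo_db1": {"account": ["key", "name"]}})
catalog = LinkedCatalog([tajo_catalog, pgsql_catalog])
```

A planner that talks only to `catalog` can resolve `tajo_db1.account` and `default.archive` in the same query without knowing which system holds the metadata.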
Querying on JDBC-based Storage
● A plan is converted into a SQL string
● Query generation
  ○ Diverse SQL syntax of different types of storage
  ○ Different SQL builder for each storage type
[Diagram: the Tajo Master receives "SELECT ..." and hands a query plan to a Tajo Worker, which sends a generated "SELECT ..." to the JDBC-based storage.]
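To make "a different SQL builder for each storage type" concrete, here is a minimal Python sketch. It assumes a toy plan made of a table, a column list, and a filter string passed through verbatim; the class names are hypothetical, and the only dialect difference modeled is identifier quoting (double quotes in PostgreSQL, backticks in MySQL).

```python
class SqlBuilder:
    """Base builder; subclasses override only what differs per dialect."""
    quote = '"'

    def ident(self, name):
        # Quote an identifier, doubling any embedded quote character.
        q = self.quote
        return q + name.replace(q, q + q) + q

    def select(self, table, columns, where=None):
        cols = ", ".join(self.ident(c) for c in columns)
        sql = f"SELECT {cols} FROM {self.ident(table)}"
        if where:
            sql += f" WHERE {where}"  # the filter text is not rewritten here
        return sql


class PostgreSqlBuilder(SqlBuilder):
    quote = '"'


class MySqlBuilder(SqlBuilder):
    quote = "`"


BUILDERS = {"postgresql": PostgreSqlBuilder(), "mysql": MySqlBuilder()}

# A toy "plan": scan two columns of account with one filter.
plan = {"table": "account", "columns": ["key", "name"], "where": "name = 'tajo'"}
pg_sql = BUILDERS["postgresql"].select(**plan)
my_sql = BUILDERS["mysql"].select(**plan)
```

The same plan yields two different SQL strings, one per target system, which is why the generation step has to be storage-specific.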
Operation Push Down
● Tajo can exploit the processing capability of underlying storage
  ○ DBMSs, MongoDB, HBase, …
● Operations are pushed down into underlying storage
  ○ Leveraging the advanced features provided by underlying storage
    ■ Ex) DBMSs' query optimization, index, ...
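Example 1
SELECT count(*)
FROM account ac, archive ar
WHERE ac.key = ar.id and ac.name = 'tajo'
[Diagram: account lives in a DBMS and archive in HDFS. The plan scans archive (full scan) and scans account with the filter ac.name = 'tajo', joins them on ac.key = ar.id, and then applies two group by count(*) steps. The filter on account is pushed down to the DBMS, which returns the result only.]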
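The effect of pushing the filter in Example 1 down to the DBMS can be tried out with a small Python sketch. Everything here is a stand-in: an in-memory SQLite database plays the DBMS, a Python list plays the archive data on HDFS, the sample rows are made up, and `account.key` is assumed to be unique so that a set lookup is a valid join.

```python
import sqlite3

# An in-memory SQLite database plays the role of the DBMS holding "account".
dbms = sqlite3.connect(":memory:")
dbms.execute("CREATE TABLE account (key INTEGER, name TEXT)")
dbms.executemany(
    "INSERT INTO account VALUES (?, ?)",
    [(1, "tajo"), (2, "hive"), (3, "tajo")],
)

# A plain list plays the role of the "archive" data on HDFS (full scan).
archive = [{"id": 1}, {"id": 1}, {"id": 3}, {"id": 4}]

# Pushed-down part: the DBMS applies the filter and returns only the matches.
pushed_sql = "SELECT key FROM account WHERE name = ?"
matching_keys = {row[0] for row in dbms.execute(pushed_sql, ("tajo",))}

# Remaining part, done by the engine: join on ac.key = ar.id, then count(*).
count = sum(1 for row in archive if row["id"] in matching_keys)
```

Only the two matching account keys cross the boundary between the DBMS and the engine; without push down, every account row would have to be transferred and filtered afterwards.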
Example 2
SELECT ac.name, count(*)
FROM account ac
GROUP BY ac.name
[Diagram: account lives in a DBMS. Both the scan of account and the group by count(*) are pushed down to the DBMS, which returns the result only.]
Self-describing Data Formats Support
Self-describing Data Formats
● Some data formats include schema information as well as data
  ○ JSON, ORC, Parquet, …
● Tajo 0.11 natively supports self-describing data formats
  ○ Since they already have schema information, Tajo doesn't need to store it separately
  ○ Instead, Tajo can infer the schema at query execution time
Create Table with Nested Data Format
{ "title" : "Hand of the King", "name" : { "first_name": "Eddard", "last_name": "Stark"}}
{ "title" : "Assassin", "name" : { "first_name": "Arya", "last_name": "Stark"}}
{ "title" : "Dancing Master", "name" : { "first_name": "Syrio", "last_name": "Forel"}}
> CREATE EXTERNAL TABLE schemaful_table ( title TEXT, name RECORD ( first_name TEXT, last_name TEXT ) ) USING json LOCATION 'hdfs:///json_table';
(Callout: RECORD is a nested type.)
How about This Data?
{"id":"2937257761","type":"ForkEvent","actor":{"id":1088854,"login":"CAOakleyII","gravatar_id":"","url":"https://api.github.com/users/CAOakleyII","avatar_url":"https://avatars.githubusercontent.com/u/1088854?"},"repo":{"id":11909954,"name":"skycocker/chromebrew","url":"https://api.github.com/repos/skycocker/chromebrew"},"payload":{"forkee":{"id":38339291,"name":"chromebrew","full_name":"CAOakleyII/chromebrew","owner":{"login":"CAOakleyII","id":1088854,"avatar_url":"https://avatars.githubusercontent.com/u/1088854?v=3","gravatar_id":"","url":"https://api.github.com/users/CAOakleyII","html_url":"https://github.com/CAOakleyII","followers_url":"https://api.github.com/users/CAOakleyII/followers","following_url":"https://api.github.com/users/CAOakleyII/following{/other_user}","gists_url":"https://api.github.com/users/CAOakleyII/gists{/gist_id}","starred_url":"https://api.github.com/users/CAOakleyII/starred{/owner}{/repo}","subscriptions_url":"https://api.github.com/users/CAOakleyII/subscriptions","organizations_url":"https://api.github.com/users/CAOakleyII/orgs","repos_url":"https://api.github.com/users/CAOakleyII/repos","events_url":"https://api.github.com/users/CAOakleyII/events{/privacy}","received_events_url":"https://api.github.com/users/CAOakleyII/received_events","type":"User","site_admin":false},"private":false,"html_url":"https://github.com/CAOakleyII/chromebrew","description":"Package manager for Chrome OS","fork":true,"url":"https://api.github.com/repos/CAOakleyII/chromebrew","forks_url":"https://api.github.com/repos/CAOakleyII/chromebrew/forks","keys_url":"https://api.github.com/repos/CAOakleyII/chromebrew/keys{/key_id}","collaborators_url":"https://api.github.com/repos/CAOakleyII/chromebrew/collaborators{/collaborator}","teams_url":"https://api.github.com/repos/CAOakleyII/chromebrew/teams","hooks_url":"https://api.github.com/repos/CAOakleyII/chromebrew/hooks","issue_events_url":"https://api.github.com/repos/CAOakleyII/chromebrew/issues/events{/number}","events_url":"https://api.github.com/repos/CAOakleyII/chromebrew/events","assignees_url":"https://api.github.com/repos/CAOakleyII/chromebrew/assignees{/user}","branches_url":"https://api.github.com/repos/CAOakleyII/chromebrew/branches{/branch}","tags_url":"https://api.github.com/repos/CAOakleyII/chromebrew/tags","blobs_url":"https://api.github.com/repos/CAOakleyII/chromebrew/git/blobs{/sha}","git_tags_url":"https://api.github.com/repos/CAOakleyII/chromebrew/git/tags{/sha}","git_refs_url":"https://api.github.com/repos/CAOakleyII/chromebrew/git/refs{/sha}","trees_url":"https://api.github.com/repos/CAOakleyII/chromebrew/git/trees{/sha}","statuses_url":"https://api.github.com/repos/CAOakleyII/chromebrew/statuses/{sha}","languages_url":"https://api.github.com/repos/CAOakleyII/chromebrew/languages","stargazers_url":"https://api.github.com/repos/CAOakleyII/chromebrew/stargazers","contributors_url":"https://api.github.com/repos/CAOakleyII/chromebrew/contributors","subscribers_url":"https://api.github.com/repos/CAOakleyII/chromebrew/subscribers","subscription_url":"https://api.github.com/repos/CAOakleyII/chromebrew/subscription","commits_url":"https://api.github.com/repos/CAOakleyII/chromebrew/commits{/sha}","git_commits_url":"https://api.github.com/repos/CAOakleyII/chromebrew/git/commits{/sha}","comments_url":"https://api.github.com/repos/CAOakleyII/chromebrew/comments{/number}","issue_comment_url":"https://api.github.com/repos/CAOakleyII/chromebrew/issues/comments{/number}","contents_url":"https://api.github.com/repos/CAOakleyII/chromebrew/contents/{+path}","compare_url":"https://api.github.com/repos/CAOakleyII/chromebrew/compare/{base}...{head}","merges_url":"https://api.github.com/repos/CAOakleyII/chromebrew/merges","archive_url":"https://api.github.com/repos/CAOakleyII/chromebrew/{archive_format}{/ref}","downloads_url":"https://api.github.com/repos/CAOakleyII/chromebrew/downloads","issues_url":"https://api.github.com/repos/CAOakleyII/chromebrew/issues{/number}","pulls_url":"https://api.github.com/repos/CAOakleyII/chromebrew/pulls{/number}","milestones_url":"https://api.github.com/repos/CAOakleyII/chromebrew/milestones{/number}","notifications_url":"https://api.github.com/repos/CAOakleyII/chromebrew/notifications{?since,all,participating}","labels_url":"https://api.github.com/repos/CAOakleyII/chromebrew/labels{/name}","releases_url":"https://api.github.com/repos/CAOakleyII/chromebrew/releases{/id}","created_at":"2015-07-01T00:00:00Z","updated_at":"2015-06-28T10:11:09Z","pushed_at":"2015-06-09T07:46:57Z","git_url":"git://github.com/CAOakleyII/chromebrew.git","ssh_url":"git@github.com:CAOakleyII/chromebrew.git","clone_url":"https://github.com/CAOakleyII/chromebrew.git","svn_url":"https://github.com/CAOakleyII/chromebrew","homepage":"http://skycocker.github.io/chromebrew/","size":846,"stargazers_count":0,"watchers_count":0,"language":null,"has_issues":false,"has_downloads":true,"has_wiki":true,"has_pages":false,"forks_count":0,"mirror_url":null,"open_issues_count":0,"forks":0,"open_issues":0,"watchers":0,"default_branch":"master","public":true}},"public":true,"created_at":"2015-07-01T00:00:01Z"}
...
Create Schemaless Table
> CREATE EXTERNAL TABLE schemaless_table (*) USING json LOCATION 'hdfs:///json_table';
(Callout: (*) allows any schema.)
That's all!
Schema-free Query Execution
> CREATE EXTERNAL TABLE schemaful_table (id BIGINT, name TEXT, ...) USING text LOCATION 'hdfs:///csv_table';
> CREATE EXTERNAL TABLE schemaless_table (*) USING json LOCATION 'hdfs:///json_table';
> SELECT name.first_name, name.last_name from schemaless_table;
> SELECT title, count(*) FROM schemaful_table, schemaless_table WHERE name = name.last_name GROUP BY title;
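What a query such as `SELECT name.first_name, name.last_name` has to do on schemaless JSON can be approximated in Python: parse each line and follow the dotted path. The sketch below reuses the three sample records from the nested-format slide; `resolve` and `select` are hypothetical helpers, and returning `None` for a missing field is an assumption, not a statement about Tajo's behavior.

```python
import json

LINES = [
    '{"title": "Hand of the King", "name": {"first_name": "Eddard", "last_name": "Stark"}}',
    '{"title": "Assassin", "name": {"first_name": "Arya", "last_name": "Stark"}}',
    '{"title": "Dancing Master", "name": {"first_name": "Syrio", "last_name": "Forel"}}',
]


def resolve(record, path):
    """Follow a dotted path; a missing field yields None (assumed behavior)."""
    value = record
    for part in path.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def select(lines, paths):
    """Project the given dotted paths out of every JSON line."""
    return [tuple(resolve(json.loads(line), p) for p in paths) for line in lines]


rows = select(LINES, ["name.first_name", "name.last_name"])
sparse = select(LINES, ["title", "name.middle_name"])
```

No schema is declared anywhere; the column references in the query alone decide which parts of each record are read.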
Schema Inference
● Table schema is inferred at query time
● Example
Query
SELECT
  a, b.b1, b.b2.c1
FROM
  t;
Inferred schema
(
  a text,
  b record (
    b1 text,
    b2 record (
      c1 text
    )
  )
)
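The inference in this example can be reproduced with a few lines of Python: every dotted column reference contributes one path, intermediate names become records, and leaves default to `text` as in the slide. This is a sketch of the idea only; `infer_schema` and `render` are hypothetical helpers, and the real system may choose leaf types differently.

```python
def infer_schema(column_refs):
    """Build a nested schema (dicts for records) from dotted column references."""
    schema = {}
    for ref in column_refs:
        node = schema
        *parents, leaf = ref.split(".")
        for name in parents:
            child = node.setdefault(name, {})
            if not isinstance(child, dict):
                # The name was first seen as a leaf; promote it to a record.
                child = node[name] = {}
            node = child
        # Assumption: leaves are typed as text, matching the slide's example.
        node.setdefault(leaf, "text")
    return schema


def render(schema):
    """Format the schema in the slide's notation."""
    parts = []
    for name, kind in schema.items():
        if isinstance(kind, dict):
            parts.append(f"{name} record ({render(kind)})")
        else:
            parts.append(f"{name} {kind}")
    return ", ".join(parts)


schema = infer_schema(["a", "b.b1", "b.b2.c1"])
rendered = render(schema)
```

Only the columns the query mentions end up in the schema, so fields that exist in the data but are never referenced cost nothing to describe.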
Demo
Demo with Command line
Roadmap
Roadmap
● 0.12
  ○ Improved YARN integration
  ○ Authentication support
  ○ JavaScript stored procedure support
  ○ Scalar subquery support
  ○ Hive UDF support
Roadmap
● Next generation (beyond 0.12)
  ○ Exploiting modern hardware
  ○ Approximate query processing
  ○ Genetic query optimization
  ○ And more …
tajo> select question from you;