穆黎森:Interactive batch query at scale
Interactive Batch Query At Scale
Adhoc query system for game analytics based on Drill
!1
Related Topics
• Java Programming
• Relational Algebra
• Distributed Database
• Hadoop Ecosystem
!2
About Us
• Elex-tech
• Game Development, Game Publishing
• SNS Games, Web Games, Mobile Games, Apps
• Global Market
!3
• The Problem
• Brief on Drill
• Design Considerations
• Enhancement from Xingcloud
• Now & Future
!4
The Problem
!5
The Problem
• How many logins today?
• How many individual users this week?
• Total income today?
• Number of paying users this month?
• …
!6
The Problem: Facts
• How many X during time period of Y
• Fact Table
user id   event  amount  timestamp
user_001  login  -       1383729081
user_002  login  -       1383729082
user_001  pay    4.99    1383729084
user_003  login  -       1383729090
!7
The Problem: Facts
• How many logins today?
• How many individual users this week?
• Total income today?
• Number of paying users this month?
• …
!8
The Problem: Facts
• How many logins today?
• select count(*) from fact where event='login' and date(timestamp)='2013-12-06';
!9
The Problem: Facts
• How many individual users this week?
• select count(distinct uid) from fact where event='login' and timestamp>='?' and timestamp<'?';
!10
The Problem: Facts
• Total income today?
• select sum(amount) from fact where event='pay' and timestamp>='?' and timestamp<'?';
!11
The Problem: Facts
• Number of paying users this month?
• select count(distinct uid) from fact where event='pay' and timestamp>='?' and timestamp<'?';
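The four fact queries on these slides can be tried end to end with a small sketch. It assumes an in-memory SQLite database standing in for the real storage, uses the sample rows from the Fact Table slide, and picks a hypothetical time window (`WINDOW`) that covers all of them; the helper names `load_facts` and `scalar` are made up for illustration.

```python
import sqlite3

# Fact rows from the slides: (uid, event, amount, timestamp in epoch seconds).
FACTS = [
    ("user_001", "login", None, 1383729081),
    ("user_002", "login", None, 1383729082),
    ("user_001", "pay", 4.99, 1383729084),
    ("user_003", "login", None, 1383729090),
]

# Hypothetical half-open window [start, end) that covers every sample row.
WINDOW = (1383729000, 1383730000)


def load_facts():
    """Create an in-memory fact table and load the sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE fact (uid TEXT, event TEXT, amount REAL, timestamp INTEGER)"
    )
    conn.executemany("INSERT INTO fact VALUES (?, ?, ?, ?)", FACTS)
    return conn


def scalar(conn, sql, params):
    """Run a query that returns a single value."""
    return conn.execute(sql, params).fetchone()[0]


RANGE = "timestamp >= ? AND timestamp < ?"
conn = load_facts()

logins = scalar(
    conn, f"SELECT COUNT(*) FROM fact WHERE event='login' AND {RANGE}", WINDOW
)
active_users = scalar(
    conn, f"SELECT COUNT(DISTINCT uid) FROM fact WHERE event='login' AND {RANGE}", WINDOW
)
income = scalar(
    conn, f"SELECT SUM(amount) FROM fact WHERE event='pay' AND {RANGE}", WINDOW
)
paying_users = scalar(
    conn, f"SELECT COUNT(DISTINCT uid) FROM fact WHERE event='pay' AND {RANGE}", WINDOW
)

print(logins, active_users, income, paying_users)
```

All four questions share one shape: an aggregate over the fact table, filtered by event and by a time range.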
!12
The Problem: Dimensions
• How many logins today from China?
• How many individual users of each server this week?
• Total income today by new user?
• Number of paying users this month from Adwords?
• …
!13
The Problem: Dimensions
• The user X's property Y is of value Z
• Dimension Table
user id   reg_time  language  refer     …
user_001  20100612  en        adwords
user_002  20110927  cn        facebook
user_003  20121010  fr        admob
user_004  20130522  it        tapjoy
!14
Fact & Dimension
• Aggregation on Join
Dimension Table
user id   reg_time  language  refer     …
user_001  20100612  en        adwords
user_002  20110927  cn        facebook
user_003  20121010  fr        admob
user_004  20130522  it        tapjoy
Fact Table
user id   event  amount  timestamp
user_001  login  -       1383729081
user_002  login  -       1383729082
user_001  pay    4.99    1383729084
user_003  login  -       1383729090
!15
Fact & Dimension
• How many logins today from China?
• How many individual users of each server this week?
• Total income today by new user?
• Number of paying users this month from adwords?
• …
!16
Fact & Dimension
SELECT COUNT DISTINCT (on uid)
JOIN (1 fact, n dimension, on uid)
WHERE (filter by value of dimensions/facts)
GROUP BY (value of dimension)
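The query shape above (count distinct over a join of one fact table with a dimension table, filtered and grouped by a dimension value) can be sketched with the sample rows from the earlier slides. This is only an illustration on an in-memory SQLite database; the table name `dim`, the choice of `language` as the grouping dimension, and the time window are assumptions, not part of the talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE fact (uid TEXT, event TEXT, amount REAL, timestamp INTEGER);
    CREATE TABLE dim  (uid TEXT PRIMARY KEY, reg_time TEXT, language TEXT, refer TEXT);
    """
)
conn.executemany(
    "INSERT INTO fact VALUES (?, ?, ?, ?)",
    [
        ("user_001", "login", None, 1383729081),
        ("user_002", "login", None, 1383729082),
        ("user_001", "pay", 4.99, 1383729084),
        ("user_003", "login", None, 1383729090),
    ],
)
conn.executemany(
    "INSERT INTO dim VALUES (?, ?, ?, ?)",
    [
        ("user_001", "20100612", "en", "adwords"),
        ("user_002", "20110927", "cn", "facebook"),
        ("user_003", "20121010", "fr", "admob"),
        ("user_004", "20130522", "it", "tapjoy"),
    ],
)

# SELECT COUNT DISTINCT / JOIN on uid / WHERE on fact values / GROUP BY a dimension.
QUERY = """
SELECT d.language, COUNT(DISTINCT f.uid)
FROM fact AS f
JOIN dim AS d ON f.uid = d.uid
WHERE f.event = 'login' AND f.timestamp >= ? AND f.timestamp < ?
GROUP BY d.language
ORDER BY d.language
"""

# Hypothetical window that covers all sample rows.
rows = conn.execute(QUERY, (1383729000, 1383730000)).fetchall()
print(rows)
```

user_004 never logs in, so the inner join drops it and no row for `it` appears in the result.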
!17
Fact & Dimension
• SQL
• -> Syntax tree
• -> Logical Plan
• -> Physical Plan
[Diagram: plan with the operators scan: Fact, scan: Dimension (twice), filter (three times), Join (twice), agg.]
pre-aggregation?
!19
!20
Combinatorial Explosion!
!21
Access Pattern
       Facts           Dimensions
Write  Append          Insert, update
Read   by date, event  by user id, prop value, full table
!22
Volume
• 200GB new Facts
• 50GB Dimension updates
!23
Storage Architecture
[Diagram: Query, Drill, MySQL StorageEngine, HBase StorageEngine, MySQL, HBase, Data Loader.]
!24
• The Problem
• Brief on Drill
• Design Considerations
• Our work
• Now & Future
!25
http://www.slideshare.net/MapRTechnologies/technical-overview-of-apache-drill-by-jac
!26
http://www.slideshare.net/jasonfrantz/drill-architecture-20120913
!27
• The Problem
• Brief on Drill
• Design Considerations
• Our work
• Now & Future
!28
http://www.slideshare.net/jasonfrantz/drill-architecture-20120913
!29
Data Model
{
  name: "icecream",
  price: {
    basic: 4.99,
    coupon: true
  }
}
• Various types
• Nested values
• price.basic
• Schema-free
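To make the `price.basic` notation concrete, here is a minimal sketch that maps a nested record to dotted paths. The function name `flatten` is hypothetical and this is not how Drill itself resolves paths; it only shows how a nested, schema-free record can be addressed column by column.

```python
def flatten(record, prefix=""):
    """Map a nested dict to {dotted.path: leaf value}."""
    flat = {}
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested values, extending the dotted path.
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


record = {"name": "icecream", "price": {"basic": 4.99, "coupon": True}}
flat = flatten(record)
print(flat)
```

No schema is declared up front: the set of paths is discovered from the record itself.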
!30
Design Considerations
• As fast as possible
• Space efficient
• Time efficient
!31
about Space Efficiency
• Compact data representation
• Java object overhead: high
• JVM friendly(GC)
• Simpler object graph
• Less tenured space, less full GC
!32
about Time Efficiency
• Cache friendly
• data access Locality
• Superscalar: pipeline friendly
• the inner loop problem
• SIMD friendly
• opportunity to operate on a vector of values
• JVM friendly(JNI)
!33
ValueVector & RecordBatch
ValueVector
!34
ValueVector & RecordBatch
• ValueVector
• small memory overhead
• backed by DirectByteBuffer
• further encoding
• continuous access/random access
!35
ValueVector & RecordBatch
{
  name: "icecream",
  price: {
    basic: 4.99,
    coupon: true
  }
}
RecordBatch
name: VarChar          icecream…
price.basic: float     4.99…
price.coupon: boolean  T…
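The column layout above can be approximated in a few lines. This is a sketch, not Drill's ValueVector classes: `VarCharVector` and `to_record_batch` are hypothetical names, Python's `array` and `bytearray` stand in for the DirectByteBuffer backing, and the second record is invented so that the batch holds more than one row.

```python
from array import array


class VarCharVector:
    """Variable-width strings kept in one byte buffer plus an offsets array."""

    def __init__(self):
        self.data = bytearray()
        # Value i occupies data[offsets[i]:offsets[i + 1]].
        self.offsets = array("I", [0])

    def append(self, text):
        self.data += text.encode("utf-8")
        self.offsets.append(len(self.data))

    def __len__(self):
        return len(self.offsets) - 1

    def get(self, i):
        start, end = self.offsets[i], self.offsets[i + 1]
        return self.data[start:end].decode("utf-8")


def to_record_batch(records):
    """Turn row-oriented records into one vector per dotted column path."""
    names = VarCharVector()
    basic = array("d")      # fixed-width doubles, no per-value objects
    coupon = bytearray()    # one byte per boolean
    for r in records:
        names.append(r["name"])
        basic.append(r["price"]["basic"])
        coupon.append(1 if r["price"]["coupon"] else 0)
    return {"name": names, "price.basic": basic, "price.coupon": coupon}


records = [
    {"name": "icecream", "price": {"basic": 4.99, "coupon": True}},
    {"name": "waffle", "price": {"basic": 2.50, "coupon": False}},  # hypothetical row
]
batch = to_record_batch(records)
print(batch["name"].get(0), list(batch["price.basic"]), bytes(batch["price.coupon"]))
```

Each column is a handful of flat buffers rather than one object per value, which is the property the space-efficiency slide asks for.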
!36
ValueVector & RecordBatch
[Diagram: plan with scan: Fact, scan: Dimension, filter, filter, Join, agg.]
• Data passed in RecordBatch
• Inner loop: next() vs for
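The "next() vs for" contrast can be sketched as two ways of summing payments. The function names and the column arrays are hypothetical, and Python will not show the speed difference; the point is only the shape of the inner loop, which in JIT-compiled code decides how well it pipelines.

```python
from array import array


def sum_row_at_a_time(rows):
    """Record-at-a-time: one next() call per row crosses the operator boundary."""
    it = iter(rows)
    total = 0.0
    while True:
        try:
            row = next(it)
        except StopIteration:
            return total
        if row["event"] == "pay":
            total += row["amount"]


def sum_batch(events, amounts):
    """Batch-at-a-time: one call per batch, then a tight loop over the columns."""
    total = 0.0
    for i in range(len(amounts)):
        if events[i] == "pay":
            total += amounts[i]
    return total


# Sample facts from the slides; a missing amount is stored as 0.0 here.
rows = [
    {"event": "login", "amount": 0.0},
    {"event": "login", "amount": 0.0},
    {"event": "pay", "amount": 4.99},
    {"event": "login", "amount": 0.0},
]
events = [r["event"] for r in rows]
amounts = array("d", (r["amount"] for r in rows))

print(sum_row_at_a_time(rows), sum_batch(events, amounts))
```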
!37
Review the Considerations
• Cache friendly
• Superscalar: pipeline friendly
• SIMD friendly
• Compact data representation
• JVM friendly(GC)
• JVM friendly(JNI)
[Figure: the RecordBatch column layout (name, price.basic, price.coupon), repeated.]
!38
• The Problem
• Brief on Drill
• Design Considerations
• Our work
• Now & Future
!39
Our work, primarily
• Adhoc batch query
!40
Reports: 2-dimensional tables generally
!41
Adhoc batch query
DailyActiveUser 2013-07-26 2013-07-27
en 576 491
cn 361 945
!42
Adhoc batch query
Fact
user id  event  time
user_13  login  2013-07-26
user_13  login  2013-07-26
user_76  pay    2013-07-27
Dimension
user id  nation
user_13  cn
user_76  en
DAU  2013-07-26  2013-07-27
en   576         491
cn   361         945
!43
Adhoc batch query
DAU 2013-07-26 2013-07-27
en 576 491
cn 361 945
!44
Adhoc batch query
DAU  2013-07-26  2013-07-27
en   576         491
cn   361         945
[Diagram: one plan per report cell, four plans in total. Each plan has the operators scan: Fact, scan: Dimension, filter, filter, Join, agg, with the filters (date='2013-07-26', nation='en'), (date='2013-07-27', nation='en'), (date='2013-07-26', nation='cn') and (date='2013-07-27', nation='cn').]
!45
[Diagram: the same four per-cell plans as on the previous slide.]
!46
[Diagram: the four plans merged into one. A single scan: Fact and a single scan: Dimension feed four filter, filter, Join branches, one per (date, nation) pair, each ending in its own agg.]
!47
Adhoc batch query
• Benefits
• Eliminate duplicate scans
• Merge similar scans
• Possibility
• SQL usually parses into a tree, while the LogicalPlan in Drill is a DAG
!48
More Benefits: Intermediate result reuse
!49
Adhoc batch query
[Diagram: the merged plan with shared scans. A single scan: Fact and a single scan: Dimension feed four filter, filter, Join branches, one per (date, nation) pair, each ending in its own agg.]
!50
Adhoc batch query
[Diagram: the merged plan after sharing the filters on one side: two shared Filter operators (nation='en', nation='cn'), with four per-branch date filters, four Joins and four aggs remaining.]
!51
Adhoc batch query
[Diagram: the fully merged plan. scan: Fact and scan: Dimension feed four shared Filter operators (date='2013-07-26', date='2013-07-27', nation='en', nation='cn'), which feed four Joins, each followed by its own agg.]
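The merging walked through on these slides can be sketched as building all four report-cell plans through one builder that reuses any operator it has already created. This is an illustrative assumption about the mechanism, not Xingcloud's implementation: `PlanBuilder` and `cell_plan` are hypothetical names, and placing the date filter on the fact side and the nation filter on the dimension side is a guess from the diagrams.

```python
class PlanBuilder:
    """Creates operator nodes and reuses any node that was requested before."""

    def __init__(self):
        self.nodes = {}  # (op, arg, input ids) -> node id

    def node(self, op, arg=None, inputs=()):
        key = (op, arg, tuple(inputs))
        if key not in self.nodes:
            self.nodes[key] = len(self.nodes)
        return self.nodes[key]


def cell_plan(b, date, nation):
    """Plan for one report cell: scan, filter, join, aggregate."""
    fact = b.node("filter", f"date='{date}'", [b.node("scan", "Fact")])
    dim = b.node("filter", f"nation='{nation}'", [b.node("scan", "Dimension")])
    join = b.node("join", "uid", [fact, dim])
    return b.node("agg", "count distinct uid", [join])


cells = [(d, n) for n in ("en", "cn") for d in ("2013-07-26", "2013-07-27")]

# Merged: every cell goes through the same builder, so equal subplans are shared.
merged = PlanBuilder()
roots = [cell_plan(merged, d, n) for d, n in cells]

# Unmerged: each cell gets its own builder, so nothing is shared.
unmerged_count = 0
for d, n in cells:
    single = PlanBuilder()
    cell_plan(single, d, n)
    unmerged_count += len(single.nodes)

print(unmerged_count, len(merged.nodes))
```

With the four cells of the DAU report this gives 24 operators unmerged against 14 merged (2 scans, 4 filters, 4 joins, 4 aggs), and the result is a DAG rather than a tree because the scans and filters have several parents.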
!52
More Benefits: More Batched, More Offline
!53
Single Query
!54
Batched 3 Queries
!55
Batched Query, from a report
!56
Batched Query, from tens of reports, with 1k+ operators
!57
Jobs vs Predictions
• Offline job
• becomes predictions of what data users may be interested in
• by merging more queries together
• daily predictions & hourly predictions
!58
More Benefits: Utilising multi-core
!59
Utilising Multi-core
• Original:
• Pull data from root
• Downwards recursively
[Diagram: plan with scan: Fact and scan: Dimension, filters date='2013-07-26' and nation='en', Join, agg.]
!60
Utilising Multi-core
• Now:
• Push data from Leaf
• Data driven upwards
• Pooled execution
[Diagram: plan with scan: Fact and scan: Dimension, filters date='2013-07-26' and nation='en', Join, agg.]
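The difference between the two execution models can be sketched on a simplified scan, filter, count-distinct pipeline. Everything here is an assumption made for illustration: the rows are invented and already joined, the operator classes are hypothetical, and a `ThreadPoolExecutor` with two partitions stands in for whatever "pooled execution" means in the real system.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical, already joined rows: (uid, event, date, nation).
ROWS = [
    ("user_13", "login", "2013-07-26", "cn"),
    ("user_13", "login", "2013-07-26", "cn"),
    ("user_76", "pay", "2013-07-27", "en"),
    ("user_21", "login", "2013-07-26", "en"),
]


def on_date(row):
    return row[2] == "2013-07-26"


# Pull model: the root asks its child for rows, recursively down to the scan.
def scan(rows):
    yield from rows


def filter_op(child, pred):
    return (row for row in child if pred(row))


def count_distinct(child):
    return len({row[0] for row in child})


pull_result = count_distinct(filter_op(scan(ROWS), on_date))


# Push model: leaves hand batches to their parent; data drives the plan upwards.
class CountDistinct:
    def __init__(self):
        self.seen = set()
        self.lock = threading.Lock()  # several scans may push concurrently

    def push(self, batch):
        with self.lock:
            self.seen.update(row[0] for row in batch)

    def result(self):
        return len(self.seen)


class Filter:
    def __init__(self, pred, parent):
        self.pred = pred
        self.parent = parent

    def push(self, batch):
        kept = [row for row in batch if self.pred(row)]
        if kept:
            self.parent.push(kept)


agg = CountDistinct()
flt = Filter(on_date, agg)
partitions = [ROWS[:2], ROWS[2:]]  # each partition is scanned by a pool worker
with ThreadPoolExecutor(max_workers=2) as pool:
    list(pool.map(flt.push, partitions))
push_result = agg.result()

print(pull_result, push_result)
```

In the pull version a single thread drives the whole plan from the root; in the push version independent leaves can run on different cores and only the shared aggregation state needs a lock.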
!61
Adhoc batch query
• Benefits
• Eliminate duplicate scans
• Merge similar scans
• Merge intermediate operators
• Unified processing for adhoc & batch queries
• Multi-core execution of a single plan
!62
• The Problem
• Brief on Drill
• Design Considerations
• Our work
• Now & Future
!63
About Xingcloud
• Now
• http://a.xingcloud.com
• 2 billion inserts/updates daily
• 200k+ aggregations per day, 6k sec in total
• query response time: <1 sec to 100 sec, 10 sec on avg.
• Future
• Plan Merge
• Unified process for batch, adhoc & stream process, SQL oriented
• SQL(t): Plan with time window
!64
About Drill
• Now
• Distributed Join
• on Parquet/ORCFile on HDFS
• Write interface of storage engines
• Future
• 1.0 M2: December 2013
• 1.0 GA: Early 2014
• more detail on https://issues.apache.org/jira/browse/DRILL
!65
References
• http://incubator.apache.org/drill/index.html#resources
• http://www.slideshare.net/jasonfrantz/drill-architecture-20120913
• http://prezi.com/j43vb1umlgqv/timothy-chen/
• http://www.cs.virginia.edu/kim/publicity/pldi09tutorials/memory-efficient-java-tutorial.pdf
• http://www.cs.yale.edu/homes/dna/talks/Column_Store_Tutorial_VLDB09.pdf
!66
Q & A
!67