HADOOP NATIVE SQL
What is HAWQ?
Apache HAWQ (incubating) is an elastic parallel-processing SQL engine that runs natively in Apache™ Hadoop® and directly accesses data for advanced analytics.
Why HAWQ?
Hadoop Native SQL is a business imperative
1. Hadoop: the new Data Warehouse
Data is moving out of traditional data warehouses and into Apache Hadoop.
● IT’S ABOUT COST
● IT’S ABOUT COLLABORATION
● IT’S ABOUT OPEN SOURCE
● IT’S ABOUT SCALE
● IT’S ABOUT ANALYTICS
● IT’S ABOUT CLOUD
IT’S ABOUT SQL!
SQL continues to be the most valuable workload on Hadoop today
“MASHING BIG DATA WITH BIG MACHINES IS ‘BEAUTIFUL, DESIRABLE, INVESTABLE’ - IT COULD TRANSFORM GE'S BUSINESS - AND THE ECONOMY.”
Jeff Immelt, CEO, GE
Sophisticated Analytics drive competitive advantage
2. The rise of the Data Scientist
Data science enables leveraging data assets for competitive advantage.
● Data science 800% growth in two years[1]
● Needs tools capable of rich analytics on massive data
● SQL and Machine Learning are two powerful enabling tools
● Deep ANSI SQL compliance is a requirement for many existing tools
IT’S ABOUT PREDICTIVE INSIGHTS!
[1] Source: indeed.com, http://www.indeed.com/jobtrends?q=Data-science&relative=1
Hadoop Native SQL must embrace the Hadoop ecosystem
3. Hadoop SQL ecosystem
                               Apache HAWQ    Apache   Apache   Cloudera
                               (incubating)   Hive     Drill    Impala
100% Apache Governance         Yes            Yes      Yes      No
Native HCatalog Integration    Yes            Yes      No       Yes
Native Yarn Integration        Yes            Yes      Yes      Yes
Native Ambari Integration      Yes            Yes      No       No
Support ACID consistency       Yes            Yes      No       No
Native Machine Learning        Yes            No       No       No
Row Level Security             Yes            No       No       Yes*
Focus Low Latency & Analytic Queries
Simple Batch
Schema detection
Low latency Queries
SQL Patterns
Scalable Performance drives rapid iteration
4. TPC-DS Performance - Impala
[Chart labels: HAWQ Faster / Impala Faster]
• HAWQ Faster on 45 / 60 TPC-DS queries completed*
• 4.55x mean avg.
• 12 hrs faster total
* Impala supported 74 / 99 queries and 12 crashed mid-run
4. TPC-DS Performance - Hive w/ Tez
• HAWQ Faster on 46 / 62 TPC-DS queries completed*
• 3.44x mean avg.
• 9 hrs faster total
* Hive supported 60 / 99 queries and 5 crashed mid-run
[Chart labels: HAWQ Faster / Impala Faster]
5. TPC-DS - Standards Support
* Impala required rewriting date ranges to support partition elimination
TPC-DS Query 46
SELECT ...
FROM ...
WHERE … ss_date between '1999-01-01' and '2001-12-31'
...
Modified to run in Impala
SELECT ...
FROM ...
WHERE ...
  -- partition key filter
  ss_sold_date_sk in (
    2451181, 2451182, 2451188, 2451189, 2451195, 2451196, 2451202, 2451203,
    2451209, 2451210, 2451216, 2451217, 2451223, 2451224, 2451230, 2451231,
    2451237, 2451238, 2451244, 2451245, 2451251, 2451252, 2451258, 2451259,
    2451265, 2451266, 2451272, 2451273, 2451279, 2451280, 2451286, 2451287,
    2451293, 2451294, 2451300, 2451301, 2451307, 2451308, 2451314, 2451315,
    2451321, 2451322, 2451328, 2451329, 2451335, 2451336, 2451342, 2451343,
    2451349, 2451350, 2451356, 2451357, 2451363, 2451364, 2451370, 2451371,
    2451377, 2451378, 2451384, 2451385, 2451391, 2451392, 2451398, 2451399,
    2451405, 2451406, 2451412, 2451413, 2451419, 2451420, 2451426, 2451427,
    2451433, 2451434, 2451440, 2451441, 2451447, 2451448, 2451454, 2451455,
    2451461, 2451462, 2451468, 2451469, 2451475, 2451476, 2451482, 2451483,
    2451489, 2451490, 2451496, 2451497, 2451503, 2451504, 2451510, 2451511,
    2451517, 2451518, 2451524, 2451525, 2451531, 2451532, 2451538, 2451539,
    2451545, 2451546, 2451552, 2451553, 2451559, 2451560, 2451566, 2451567,
    2451573, 2451574, 2451580, 2451581, 2451587, 2451588, 2451594, 2451595,
    2451601, 2451602, 2451608, 2451609, 2451615, 2451616, 2451622, 2451623,
    2451629, 2451630, 2451636, 2451637, 2451643, 2451644, 2451650, 2451651,
    2451657, 2451658, 2451664, 2451665, 2451671, 2451672, 2451678, 2451679,
    2451685, 2451686, 2451692, 2451693, 2451699, 2451700, 2451706, 2451707,
    2451713, 2451714, 2451720, 2451721, 2451727, 2451728, 2451734, 2451735,
    2451741, 2451742, 2451748, 2451749, 2451755, 2451756, 2451762, 2451763,
    2451769, 2451770, 2451776, 2451777, 2451783, 2451784, 2451790, 2451791,
    2451797, 2451798, 2451804, 2451805, 2451811, 2451812, 2451818, 2451819,
    2451825, 2451826, 2451832, 2451833, 2451839, 2451840, 2451846, 2451847, ...
HAWQ Architecture
Historical Timeline (1986 to 2015)
Michael Stonebraker develops Postgres at UCB
Postgres adds support for SQL
Open Source PostgreSQL
PostgreSQL 7.0 released
PostgreSQL 8.0 released
Greenplum forks PostgreSQL
Hadoop 1.0 Released
HAWQ goes Apache
HAWQ project launched
Hadoop 2.0 Released
PostgreSQL
backend/access/
bootstrap/
catalog/
commands/
executor/
foreign/
lib/
libpq/
main/
nodes/
optimizer/
parser/
po/
port/
postmaster/
regex/
...
HAWQ
backend/access/
bootstrap/
catalog/
cdb/
commands/
executor/
foreign/
gp_libpq_fe/
gpopt/
lib/
libgppc/
libpq/
main/
nodes/
optimizer/
parser/
...
Similarities with PostgreSQL
High Level Architecture
[Diagram components: HDFS, Ambari, pxf (repeated across nodes), hbase, Yarn]
High Level Architecture
Parser
Session Manager
Query Rewrite
Planner ORCA
Dispatch
Interconnect Executor PXF
Storage Manager
libhdfs3
Resource Manager libyarn
Catalog
Resource Enforcer
HAWQ Ambari Integration
Storage Manager Design
PostgreSQL
➜ Single node
➜ Local storage
➜ Local Catalog
➜ Distributed design
➜ HDFS block storage
➜ libHDFS3
HDFS
➜ Append Only
➜ Master Catalog
➜ Metadata dispatch
HAWQ
Data Access
➜ HAWQ supports querying unmanaged data via
  ○ native hcatalog integration
  ○ pxf
  ○ external tables
➜ HAWQ supports managed transactional tables
  ○ Managed tables are able to provide transaction isolation.
  ○ Provide atomicity of data inserts
  ○ Provide consistent views of the data
HCatalog Access
SELECT *
FROM hcatalog.ops.weblogs
WHERE ts between '2015-09-01' and '2015-09-30';
[Diagram labels: Hive, PXF, HCatalog; weblogs (id double, date timestamp, ...); disk heap: pg_class, ...; in-memory: pg_exttable, pg_class, ...]
PXF Design
➜ Master / agent process model
➜ Exposed as external tables in HAWQ
➜ Extensible design
  ○ Fragmenter
  ○ Accessor
  ○ Resolver
[Diagram: a pxf master and several pxf agents in front of HDFS, Hive, HBase, ...]
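The three extension points can be pictured as a small pipeline. The sketch below is a simplified Python model, not the PXF plugin API (which these slides do not show): every class and method name is an assumption made for illustration. A fragmenter splits a data source into fragments, an accessor reads the raw records of one fragment, and a resolver turns each record into typed fields.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Fragment:
    """A unit of work one agent can read independently (hypothetical)."""
    source: str
    start: int
    end: int


class LineFragmenter:
    """Splits a list of text lines into fixed-size fragments."""

    def __init__(self, lines: List[str], fragment_size: int):
        self.lines = lines
        self.fragment_size = fragment_size

    def fragments(self, source: str) -> List[Fragment]:
        total = len(self.lines)
        return [
            Fragment(source, start, min(start + self.fragment_size, total))
            for start in range(0, total, self.fragment_size)
        ]


class LineAccessor:
    """Reads the raw records that belong to a single fragment."""

    def __init__(self, lines: List[str]):
        self.lines = lines

    def read(self, fragment: Fragment) -> Iterator[str]:
        yield from self.lines[fragment.start:fragment.end]


class CsvResolver:
    """Converts one raw "id,name" record into typed fields."""

    def resolve(self, record: str) -> list:
        ident, name = record.split(",")
        return [int(ident), name]


def scan(source: str, lines: List[str], fragment_size: int) -> List[list]:
    """Run fragmenter -> accessor -> resolver over every fragment."""
    fragmenter = LineFragmenter(lines, fragment_size)
    accessor = LineAccessor(lines)
    resolver = CsvResolver()
    rows = []
    for fragment in fragmenter.fragments(source):
        for record in accessor.read(fragment):
            rows.append(resolver.resolve(record))
    return rows
```

Swapping any one of the three classes (for example a resolver for a different record format) leaves the other two untouched, which is the point of separating the roles.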
Concurrent Transactional Inserts
Files in hdfs
/hawq_data/.../
    0
    1
    2
    ...
Catalog metadata
 segno | eof | …
-------+-----+...
     0 | 100 |
     1 |  20 |
     2 |  40 |
   ...
Concurrent Transactional Inserts
Files in hdfs
/hawq_data/.../
    0   <- session 1 inserts
    1
    2
    ...
Catalog metadata
 segno | eof | …
-------+-----+...
     0 | 120 | (mvcc)
     1 |  20 |
     2 |  40 |
   ...
Concurrent Transactional Inserts
Files in hdfs
/hawq_data/.../
    0   <- session 1 inserts
    1   <- session 2 inserts
    2
    ...
Catalog metadata
 segno | eof | …
-------+-----+...
     0 | 120 | (mvcc)
     1 | 220 | (mvcc)
     2 |  40 |
   ...
Concurrent Transactional Inserts
Files in hdfs
/hawq_data/.../
    0   <- abort / truncate
    1   <- commit
    2
    ...
Catalog metadata
 segno | eof | …
-------+-----+...
     0 | 100 | (mvcc)
     1 | 220 | (mvcc)
     2 |  40 |
   ...
HAWQ relies on HDFS Truncate support (HDFS-3107) to truncate aborted inserts so that later sessions can insert atomically
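The sequence above can be replayed with a toy model. This is a minimal sketch under stated assumptions, not HAWQ's storage code: the files are in-memory byte arrays rather than HDFS files, the catalog is a plain dict, MVCC versioning of the catalog rows is left out, and all class and method names are hypothetical. It shows only the core idea: inserted bytes land past the committed eof, commit publishes a new eof, and abort truncates the file back to the old one.

```python
class SegmentFile:
    """In-memory stand-in for one append-only file (hypothetical model)."""

    def __init__(self, data: bytes = b""):
        self.data = bytearray(data)

    def append(self, payload: bytes) -> None:
        self.data.extend(payload)

    def truncate(self, length: int) -> None:
        del self.data[length:]


class Table:
    """Tracks the committed end-of-file (eof) of each segment file."""

    def __init__(self, committed_sizes):
        self.files = {
            segno: SegmentFile(b"x" * size)
            for segno, size in committed_sizes.items()
        }
        self.catalog_eof = dict(committed_sizes)
        self.in_use = set()

    def begin_insert(self) -> int:
        """Give each concurrent session its own segment file."""
        segno = min(s for s in self.files if s not in self.in_use)
        self.in_use.add(segno)
        return segno

    def insert(self, segno: int, payload: bytes) -> None:
        # Bytes land past the committed eof, so readers cannot see them yet.
        self.files[segno].append(payload)

    def commit(self, segno: int) -> None:
        # Publishing the new eof is what makes the rows visible.
        self.catalog_eof[segno] = len(self.files[segno].data)
        self.in_use.discard(segno)

    def abort(self, segno: int) -> None:
        # Cut the file back to the last committed eof.
        self.files[segno].truncate(self.catalog_eof[segno])
        self.in_use.discard(segno)


table = Table({0: 100, 1: 20, 2: 40})
s1 = table.begin_insert()        # session 1 gets file 0
s2 = table.begin_insert()        # session 2 gets file 1
table.insert(s1, b"a" * 20)      # file 0 grows to 120 bytes, eof still 100
table.insert(s2, b"b" * 200)     # file 1 grows to 220 bytes, eof still 20
table.abort(s1)                  # file 0 is truncated back to 100 bytes
table.commit(s2)                 # file 1 eof becomes 220
```

After the abort, file 0 is exactly as long as its catalog eof again, so the next session that picks it up can append without stepping over leftover bytes.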
HAWQ Metadata management
Master Catalog
➜ Stores all the system metadata
➜ Based on PostgreSQL style catalog representation
➜ Supports master mirroring for fault tolerance
➜ Provides for fully transactional DDL operations
Query Annotation
➜ Metadata is needed at query execution time on the workers
➜ The most efficient method of providing metadata is to dispatch it with the query
➜ Achieved by walking the plan prior to dispatch and annotating it with query metadata
Local Catalog Cache
➜ Each worker has a native understanding of all bootstrap types
➜ Data dispatched with the query is added to a local cache for the duration of a query.
➜ Each worker is effectively stateless and receives the needed metadata at execution time.
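A rough sketch of the annotate-then-dispatch idea follows. It is an illustration only: the plan node shape, the catalog contents, and the names `collect_metadata`, `dispatch`, and `Worker` are all hypothetical and do not come from the HAWQ source. The master walks the plan, gathers the catalog entries the plan touches, and ships them with the plan; the worker builds a cache from that bundle and discards it when the query ends.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical master catalog: table name -> column types.
MASTER_CATALOG = {
    "a": {"i": "int"},
    "b": {"j": "int"},
    "c": {"k": "text"},
}


@dataclass
class PlanNode:
    op: str
    table: Optional[str] = None
    children: List["PlanNode"] = field(default_factory=list)


def collect_metadata(node: PlanNode, catalog: Dict[str, dict]) -> Dict[str, dict]:
    """Walk the plan and gather catalog entries for every table it scans."""
    needed = {}
    if node.table is not None:
        needed[node.table] = catalog[node.table]
    for child in node.children:
        needed.update(collect_metadata(child, catalog))
    return needed


def dispatch(plan: PlanNode, catalog: Dict[str, dict]) -> dict:
    """Bundle the plan with only the metadata it needs."""
    return {"plan": plan, "metadata": collect_metadata(plan, catalog)}


class Worker:
    """Stateless worker: its cache lives only as long as one query."""

    def execute(self, message: dict) -> List[str]:
        local_cache = dict(message["metadata"])  # filled from the dispatched query
        return sorted(local_cache)               # stand-in for real execution


plan = PlanNode(
    "HashJoin",
    children=[PlanNode("SeqScan", "a"), PlanNode("SeqScan", "b")],
)
message = dispatch(plan, MASTER_CATALOG)
tables_seen = Worker().execute(message)
```

Table `c` is in the master catalog but is never sent, because the plan does not scan it.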
HAWQ Distributed Query Engine
Motion 2 phase aggregation Dispatch
explain select * from a join b on (a.i=b.j);
                    QUERY PLAN
--------------------------------------------------
 Gather Motion 2:1
   ->  Hash Join
         Hash Cond: a.i = b.j
         ->  Seq Scan on a
         ->  Hash
               ->  Redistribute Motion 2:2
                     Hash Key: b.j
                     ->  Seq Scan on b
● GATHER Motion: Data from all nodes is brought to 1 location
● REDISTRIBUTE Motion: Data is hash partitioned between virtual segments
● BROADCAST Motion: Data is broadcast to all virtual segments
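The three motion types can be modeled on plain Python lists, one list of rows per virtual segment. This is a conceptual sketch, not the interconnect implementation: the function names are hypothetical, and Python's built-in `hash` stands in for whatever hash function the engine actually uses.

```python
from typing import Callable, List

Row = dict
Segments = List[List[Row]]  # one list of rows per virtual segment


def gather(segments: Segments) -> List[Row]:
    """GATHER: bring rows from all segments to one location."""
    return [row for seg in segments for row in seg]


def redistribute(segments: Segments, key: Callable[[Row], int], n: int) -> Segments:
    """REDISTRIBUTE: hash-partition rows across n segments by key."""
    out: Segments = [[] for _ in range(n)]
    for seg in segments:
        for row in seg:
            out[hash(key(row)) % n].append(row)
    return out


def broadcast(segments: Segments, n: int) -> Segments:
    """BROADCAST: give each of n segments a full copy of the rows."""
    everything = gather(segments)
    return [list(everything) for _ in range(n)]


# Table b spread over two virtual segments.
b = [[{"j": 1}, {"j": 2}], [{"j": 3}, {"j": 1}]]
b_by_j = redistribute(b, key=lambda r: r["j"], n=2)
```

After `redistribute`, both rows with `j = 1` sit on the same segment, which is what lets a hash join on `b.j` run locally on each segment.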
HAWQ Distributed Query Engine
Motion 2 phase aggregation Pipelines
Join A: “copartitioned” join    Join B: “redistributed” join    Join C: “broadcast” join

      GATHER                       GATHER                         GATHER
        |                            |                              |
       Join                         Join                           Join
      /    \                       /    \                         /    \
     A      B                     A   REDISTRIBUTE               A   BROADCAST
                                           |                            |
                                           B                            B
HAWQ Distributed Query Engine
Motion 2 phase aggregation Pipelines
explain select count(*) from b group by j;
                    QUERY PLAN
--------------------------------------------------
 Gather Motion 2:1
   ->  HashAggregate
         Group By: b.j
         ->  Redistribute Motion 2:2
               Hash Key: b.j
               ->  HashAggregate
                     Group By: b.j
                     ->  Seq Scan on b
● Similar in concept to COMBINE/REDUCE in Hadoop
● Local aggregation occurs on the data processed by each virtual segment
● 2nd phase aggregation occurs after GATHER/REDISTRIBUTE to accumulate partial aggregations from individual virtual segments
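The plan above can be traced with a small sketch of `count(*) ... group by j`. It is a simplified model with hypothetical function names; a modulo on the integer key stands in for the real hash partitioning.

```python
from collections import Counter
from typing import Dict, List, Tuple


def local_aggregate(rows: List[int]) -> Dict[int, int]:
    """Phase 1: each virtual segment counts its own rows per group key."""
    return dict(Counter(rows))


def redistribute_partials(
    partials: List[Dict[int, int]], n: int
) -> List[List[Tuple[int, int]]]:
    """Send each (key, partial_count) pair to the segment that owns the key."""
    out: List[List[Tuple[int, int]]] = [[] for _ in range(n)]
    for partial in partials:
        for key, count in partial.items():
            out[key % n].append((key, count))
    return out


def final_aggregate(pairs: List[Tuple[int, int]]) -> Dict[int, int]:
    """Phase 2: sum the partial counts that arrived for each key."""
    totals: Dict[int, int] = {}
    for key, count in pairs:
        totals[key] = totals.get(key, 0) + count
    return totals


def count_group_by(segments: List[List[int]]) -> Dict[int, int]:
    n = len(segments)
    partials = [local_aggregate(rows) for rows in segments]
    owned = redistribute_partials(partials, n)
    result: Dict[int, int] = {}
    for pairs in owned:  # gather the per-segment results
        result.update(final_aggregate(pairs))
    return result


# Values of column j as stored on two virtual segments.
counts = count_group_by([[1, 1, 2], [2, 3, 1]])
```

Only one `(key, count)` pair per key and segment crosses the network, instead of one row per input tuple, which is the same saving a combiner gives a MapReduce job.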
HAWQ Distributed Query Engine
Motion 2 phase aggregation Pipelines
● Each Executor node operates on a “pull” based model
● Several nodes may be active at any time
● Most nodes are non-blocking
● Optimized such that inactive executor nodes do not occupy resources.
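A pull-based pipeline is easy to sketch with Python generators, where a parent node asks its child for the next row and nothing runs until it does. This is an analogy rather than HAWQ's executor: the node names and the `log` list used to observe scan activity are assumptions for the example.

```python
from typing import Callable, Iterable, Iterator


def seq_scan(rows: Iterable[dict], log: list) -> Iterator[dict]:
    """Leaf node: produces a row only when its parent asks for one."""
    for row in rows:
        log.append(("scan", row["id"]))
        yield row


def select(child: Iterator[dict], predicate: Callable[[dict], bool]) -> Iterator[dict]:
    """Non-blocking node: passes matching rows through one at a time."""
    for row in child:
        if predicate(row):
            yield row


def limit(child: Iterator[dict], n: int) -> Iterator[dict]:
    """Stops pulling from its child once n rows have been returned."""
    if n <= 0:
        return
    for count, row in enumerate(child, start=1):
        yield row
        if count >= n:
            return


log = []
table = [{"id": i} for i in range(1, 7)]
plan = limit(select(seq_scan(table, log), lambda r: r["id"] % 2 == 0), 2)
result = list(plan)
```

Because the top node stops pulling after two rows, the scan never touches rows 5 and 6; idle nodes in a pull model do no work.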
Resource Manager Design
Yarn
➜ Provisions containers to Yarn Applications
➜ Provides multi-tenant Resource Management across applications
➜ Support for different scheduling policies
  ○ fair scheduler
  ○ capacity scheduler
HAWQ RM
➜ Requests resources from Yarn when needed
➜ Returns resources to Yarn when unused
➜ Provides low-latency allocation of HAWQ containers to queries
➜ Determines how many resources to allocate to a query
HAWQ Dispatch
➜ Allocates HAWQ virtual segments to a query
➜ Assigns HDFS blocks to HAWQ virtual segments
➜ Allocates resources within Yarn containers to individual chunks of a distributed query plan
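To make the block-assignment step concrete, here is one possible policy sketched in Python. The slides do not describe the actual algorithm, so this is purely an assumption for illustration: prefer a virtual segment on a host that holds a replica of the block, and otherwise fall back to the least-loaded segment. All names are hypothetical.

```python
from typing import Dict, List


def assign_blocks(
    block_hosts: Dict[str, List[str]],
    segment_hosts: Dict[str, str],
) -> Dict[str, List[str]]:
    """Assign each block to exactly one virtual segment.

    Prefers a segment whose host stores a replica of the block; among
    the candidates, the least-loaded segment wins (ties broken by name).
    """
    assignment: Dict[str, List[str]] = {seg: [] for seg in segment_hosts}
    for block in sorted(block_hosts):
        replicas = block_hosts[block]
        local = [s for s, h in segment_hosts.items() if h in replicas]
        candidates = local or list(segment_hosts)
        target = min(candidates, key=lambda s: (len(assignment[s]), s))
        assignment[target].append(block)
    return assignment


blocks = {
    "blk_1": ["host1", "host2"],
    "blk_2": ["host1", "host3"],
    "blk_3": ["host2", "host3"],
    "blk_4": ["host4"],  # no virtual segment runs on host4
}
segments = {"vseg0": "host1", "vseg1": "host2"}
block_plan = assign_blocks(blocks, segments)
```

In this run the first three blocks are read locally and only `blk_4` is read remotely, by the segment with the lighter load.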
Resource Manager Design
Yarn | HAWQ RM | HAWQ Dispatch
[Diagram sequence over the cluster components (pxf, HDFS, Ambari, hbase, Yarn): slots labeled Q1 and Q2, then slots labeled Q1 with HDFS]
HAWQ Extensibility
➜ User Defined Functions
➜ User Defined Aggregates
➜ User Defined Operators
➜ User Defined Types
➜ Supports multiple languages
HAWQ Machine Learning
Apache MADlib (incubating)
➜ Leverages robust extensibility
➜ Provides in-database machine learning capabilities
➜ Supports
  ○ Clustering
  ○ Regression
  ○ Classification
  ○ Topic Modeling
  ○ … and much more
http://madlib.incubator.apache.org/
Website: http://hawq.incubator.apache.org/
Wiki: https://cwiki.apache.org/confluence/display/HAWQ
Github mirror: https://github.com/apache/incubator-hawq/
Bug reporting: https://issues.apache.org/jira/browse/HAWQ
HADOOP NATIVE SQL
Questions?