Impala Presentation - Orlando
Impala: A Modern, Open-Source SQL Engine for Hadoop
Ricky Saltzer, Cloudera
About Me
• Tools Developer at Cloudera (Raleigh, NC)
• Hadoop / Impala Enthusiast
• Co-Author of Impala in Action
• Coupon: save 45% with code impalafla (expires 01/25/2014) at www.manning.com/saltzer
About Cloudera
• Formed in 2008
• First company to release a commercial Hadoop distribution
• Employs more than 70 Hadoop committers, PMC members, and contributors
• 15 projects were founded by Cloudera employees
• 5 books published on Hadoop and related technologies
• Tens of thousands of nodes under management
• Creators of Cloudera Manager
• Install, Configure, and Manage a Hadoop cluster in minutes
Agenda
• Quick overview of Hadoop
• What is Impala?
• Goals and user view of Impala
• Impala architecture
• Scaling
• Performance comparisons
• Future roadmap
What is Apache Hadoop?
What’s Wrong With MapReduce?
• Batch oriented
• High latency
• Not all paradigms fit
• Only for developers
Wait, What about Hive?
• SQL to MapReduce engine, originally created by Facebook
• Still a great tool for batch oriented jobs (e.g. conversions)
• Was never designed to be real-time, or to handle a large volume of concurrent queries.
What is Impala?
General Purpose SQL Engine
• Analytical workloads
• Linearly scalable
• Handles queries that run from milliseconds to hours
• Thousands of concurrent queries
Runs Directly in Hadoop
• Compatible with multiple storage managers
• Reads widely used Hadoop file formats
• Runs on the same nodes as Hadoop
High Performance
• True MPP query engine
• C++ instead of Java
• Runtime code generation (LLVM IR)
• Completely new execution engine (not MapReduce)
• Novel IO manager
Completely Open Source
Apache License 2.0
http://github.com/cloudera/impala
User View of Impala
• Runs as a distributed service; each node with data runs an Impala Daemon.
• Truly distributed: any Impala daemon can accept a query
• Highly available: no single point of failure
• Submit queries via JDBC, ODBC, the Impala shell, or Hue
• Thrift application drivers (Ruby, Python, etc.)
• Queries are distributed to the nodes with the relevant data
• Uses the same metastore as Hive
There is NO Impala Format!
- Bring Your Own Data
Supported File Formats
• Parquet
  • High-performance columnar format, based on Dremel
• RCFile
  • Original Hadoop columnar file (slower)
• Avro
  • Data serialization system supporting rich, complex types
• Sequence
  • Default MapReduce output, optimized for key-value data
• Text
  • Raw text data (e.g. CSV)
Supported Compression
• Snappy
  • Fast, great compression
• GZIP
  • Slow, best compression
• BZIP2
  • Moderate speed, great compression
• LZO
  • Text only
SQL
• SQL-92 minus correlated subqueries
• Similar to HiveQL
• ORDER BY requires LIMIT (coming soon…)
• No complex data types (coming soon…)
• Other goodies:
  • INSERT INTO SELECT
  • CREATE TABLE AS SELECT
  • LOAD INTO
• UDFs and UDAFs, native (C++) and Java (a minimal native UDF sketch follows this list)
• JOINs must fit in the aggregate memory of all executing nodes
• Continuing to add additional SQL functionality
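To make the native UDF bullet above concrete, here is a minimal sketch of a scalar UDF written against Impala's C++ UDF header; the function name AddInts is purely illustrative and not part of Impala.

#include <impala_udf/udf.h>

using namespace impala_udf;

// Illustrative scalar UDF: adds two INT arguments, returning NULL if either is NULL.
IntVal AddInts(FunctionContext* context, const IntVal& a, const IntVal& b) {
  if (a.is_null || b.is_null) return IntVal::null();
  return IntVal(a.val + b.val);
}

Compiled into a shared library and registered with a CREATE FUNCTION statement, the function can then be called from any Impala query.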
Impala Architecture
• Impala Daemon: receives, coordinates, and executes user queries; runs as a distributed service
• Statestore: manages the cluster state
• Catalog Server: provides automatic metadata propagation
Impala Daemon
• Runs as a distributed process on each HDFS DataNode
• Each Impala Daemon is capable of facilitating a user query; there is no “master server”.
• Subcomponents:
  • Planner: translates user queries into a logical execution plan
  • Coordinator: receives plans and coordinates execution on remote Impala daemons
  • Executor: scans local data and finishes work given by the coordinator as fast as possible
Statestore
• Central system state repository
  • Name service (membership)
  • Metadata
  • Future: scheduling-relevant and diagnostic state
• Soft state
  • All data can be reconstructed from the rest of the system
  • The cluster continues to function without the statestore
• Heartbeats (a rough sketch follows this list)
  • Pushes new data
  • Checks for liveness
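As a rough sketch of the heartbeat mechanism just described (illustrative only; Subscriber and HeartbeatPass are made-up names, not statestore code), each pass pushes the latest topic deltas to every subscriber and uses the response as a liveness check:

#include <functional>
#include <string>
#include <vector>

// Hypothetical subscriber record; a real implementation would hold a Thrift client.
struct Subscriber {
  std::string id;
  int missed_heartbeats = 0;
};

// One heartbeat pass: push new data to every subscriber and use the result of the
// RPC ('send' stands in for it) to decide whether the daemon is still alive.
void HeartbeatPass(std::vector<Subscriber>& subscribers,
                   const std::vector<std::string>& topic_deltas,
                   const std::function<bool(const Subscriber&,
                                            const std::vector<std::string>&)>& send) {
  for (auto& sub : subscribers) {
    if (send(sub, topic_deltas)) {
      sub.missed_heartbeats = 0;  // responded: daemon is alive
    } else if (++sub.missed_heartbeats >= 3) {
      // Treat as failed; the membership change goes out on subsequent heartbeats.
    }
  }
}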
Metadata / Catalog Server
• Impala metadata sources:
  • Hive Metastore
    • Logical table representations
  • HDFS
    • Block replica locations
    • Block replica volume IDs
• Catalog Server
  • Caches metadata
  • Automatically propagates changes
  • Updates are atomic (a snapshot-swap sketch follows this list)
  • Manual refresh needed for outside changes (e.g. DDL through Hive)
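One way to picture the "cached, atomically updated" behavior above is a snapshot-swapped cache, as in the sketch below; CatalogCache and TableMeta are hypothetical names used only for illustration:

#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical cached table metadata (schema, block locations, and so on).
struct TableMeta {
  std::string schema;
};

// Illustrative catalog cache: readers grab an immutable snapshot, and an update
// swaps in a whole new map atomically, so a query never sees a half-applied change.
class CatalogCache {
 public:
  using Snapshot = std::shared_ptr<const std::map<std::string, TableMeta>>;

  Snapshot Get() const { return std::atomic_load(&tables_); }

  // Called when the catalog server propagates a change (or on a manual refresh).
  void Publish(std::map<std::string, TableMeta> updated) {
    std::atomic_store(&tables_,
        Snapshot(new std::map<std::string, TableMeta>(std::move(updated))));
  }

 private:
  Snapshot tables_ = std::make_shared<const std::map<std::string, TableMeta>>();
};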
What happens when I submit a query?
High Level Query Execution
[Diagram: an SQL application connects via ODBC/JDBC to one of several Impala nodes, each running a query planner, query coordinator, and query executor alongside an HDFS DataNode and HBase; the Hive Metastore, HDFS NameNode, and the Statestore & Catalog are shared services.]
The client sends the SQL request via ODBC, JDBC, or Thrift to any Impala daemon.
High Level Query Execution
[Diagram: the same cluster; the receiving node's planner and coordinator fan plan fragments out to the executors on the remote nodes.]
The planner turns the request into collections of plan fragments; the coordinator initiates execution on the remote nodes.
High Level Query Execution
[Diagram: the same cluster; results flow from the executors back through the coordinating node to the client.]
Intermediate results are streamed between nodes, and the query results are streamed back to the client.
Query Planning
Query Planning: Overview
• 2-phase planning process:
  • Single-node plan: a left-deep tree of plan operators
  • Plan partitioning: partition the single-node plan to maximize scan locality (i.e. less data movement)
• Parallelization of operators:
  • All query operators are fully distributed
  • Operations are heavily pipelined
Left Deep Tree
        X
       / \
      X   r6
     / \
    X   r5
   / \
  r0  r1
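In code, a left-deep plan is simply a tree in which each join's left child is the plan built so far and the right child is a base-table scan, as in the sketch below (illustrative only; PlanNode, Scan, and LeftDeepJoin are not Impala class names):

#include <memory>
#include <string>
#include <vector>

// Minimal plan node: either a base-table scan or a join over two child plans.
struct PlanNode {
  std::string op;  // e.g. "SCAN r0" or "HASH JOIN"
  std::unique_ptr<PlanNode> left, right;
};

std::unique_ptr<PlanNode> Scan(const std::string& table) {
  return std::unique_ptr<PlanNode>(new PlanNode{"SCAN " + table, nullptr, nullptr});
}

// Builds r0 JOIN r1 JOIN r2 ... as a left-deep tree: each new table becomes the
// right input of a join whose left input is everything joined so far.
std::unique_ptr<PlanNode> LeftDeepJoin(const std::vector<std::string>& tables) {
  auto plan = Scan(tables[0]);  // assumes at least one table
  for (size_t i = 1; i < tables.size(); ++i) {
    plan = std::unique_ptr<PlanNode>(
        new PlanNode{"HASH JOIN", std::move(plan), Scan(tables[i])});
  }
  return plan;
}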
Query Planning: Single-Node Plan
• Plan operators:
  • Scan
  • HashJoin
  • HashAggregation
  • Union
  • Top-N
  • Exchange
Single-Node: Example Query
SELECT t1.custid, SUM(t2.revenue) AS revenue
FROM largehdfstable t1
JOIN largehdfstable t2 ON (t1.id1 = t2.id)
JOIN smallhbasetable t3 ON (t1.id2 = t3.id)
WHERE t3.category = 'Online'
GROUP BY t1.custid
ORDER BY revenue DESC LIMIT 10;
Query Planning: Distributed Plans
• Goals:
  • Maximize scan locality, minimize data movement
  • Full distribution of all query operators (where semantically correct)
• Parallel joins (a simplified cost sketch follows this list):
  • Broadcast: the join is colocated; the right-hand side table is broadcast to each executing node. Preferred for smaller tables.
  • Partitioned: both tables are hash-partitioned on the join columns
  • Cost-based decisions are made when statistics are available
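The broadcast-versus-partition choice comes down to how many bytes each strategy has to move over the network. The sketch below is a simplified illustration of that trade-off, not Impala's actual cost model:

#include <cstdint>

enum class JoinStrategy { kBroadcast, kPartitioned };

// Simplified cost comparison: broadcasting copies the right-hand table to every
// executing node; partitioning hash-shuffles both sides once. Pick the cheaper one.
JoinStrategy ChooseJoinStrategy(int64_t lhs_bytes, int64_t rhs_bytes, int num_nodes) {
  int64_t broadcast_cost = rhs_bytes * num_nodes;  // RHS shipped to each node
  int64_t partition_cost = lhs_bytes + rhs_bytes;  // both sides shuffled once
  return broadcast_cost <= partition_cost ? JoinStrategy::kBroadcast
                                          : JoinStrategy::kPartitioned;
}

Broadcasting wins when the right-hand table is small enough that copying it everywhere is cheaper than shuffling both sides, which is why small dimension tables are broadcast and large fact-to-fact joins are partitioned.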
Query Planning: Distributed Plans
• Parallel aggregation (a minimal two-phase sketch follows this list):
  • Pre-aggregation where the data is first materialized
  • Merge aggregation partitioned by the grouping columns
• Parallel Top-N:
  • Initial Top-N operation where the data is first materialized
  • Final Top-N in the single-node plan fragment
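As a minimal single-process picture of the pre-aggregate/merge split for a SUM ... GROUP BY (illustrative only; Impala's executor streams row batches rather than materializing whole hash maps):

#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using PartialAgg = std::unordered_map<std::string, int64_t>;  // group key -> partial SUM

// Phase 1 (runs on every node while scanning): pre-aggregate locally.
PartialAgg PreAggregate(const std::vector<std::pair<std::string, int64_t>>& rows) {
  PartialAgg partial;
  for (const auto& row : rows) partial[row.first] += row.second;
  return partial;
}

// Phase 2 (after repartitioning by the grouping column): merge the partial results.
PartialAgg MergeAggregate(const std::vector<PartialAgg>& partials) {
  PartialAgg merged;
  for (const auto& p : partials)
    for (const auto& kv : p) merged[kv.first] += kv.second;
  return merged;
}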
Query Planning: Distributed Plans
1. All scans are local
2. The first join is partitioned
3. The second join is broadcast
4. Pre-aggregate the join result
5. Merge aggregation after repartitioning
6. Initial Top-N in the aggregation fragment
7. Final Top-N in the coordinator
[Diagram: the distributed plan for the example query, with the seven steps above annotated on its plan fragments.]
Under the Hood
Impala’s Execution Engine
• Written in C++ for minimal execution overhead
• The internal in-memory tuple format puts fixed-width data at fixed offsets
• Uses intrinsics/special CPU instructions for text parsing, crc32 computation, etc.
• Runtime code generation using LLVM for big loops
Runtime Code Generation
• Example of a “big loop”:
  • Inserting a batch of rows into a hash table
• Known at query compile time:
  • Number of tuples in a batch
  • Tuple layout
  • Column types
• Generated at query compile time:
  • Unrolled loops
  • All function calls inlined
  • No dead code
  • Minimal branching
Code Path Optimizations
Interpreted (hot code path, called per row):

void MaterializeTuple(char* tuple) {
  for (int i = 0; i < num_slots_; ++i) {
    char* slot = tuple + offsets_[i];
    switch (types_[i]) {
      case BOOLEAN:
        *slot = ParseBoolean();
        break;
      case INT:
        *slot = ParseInt();
        break;
      case FLOAT: …
      case STRING: …
      // etc.
    }
  }
}

Code generated for a specific tuple layout:

void MaterializeTuple(char* tuple) {
  // i = 0
  *(tuple + 0) = ParseInt();
  // i = 1
  *(tuple + 4) = ParseBoolean();
  // i = 2
  *(tuple + 5) = ParseInt();
}
Was it worth it?
Yes.
Code Path Optimizations
Codegen on?   Time    Instructions      Branches         Branch Misses   Miss %
Yes           0.63s   52,605,701,380    9,050,446,359    145,461,106     1.607
No            1.7s    102,345,521,322   17,131,519,396   370,150,103     2.161
Impala’s IO Manager
• Maximizes disk throughput
• Designed to take advantage of the aggregate throughput of your disks
• Interleaves computation and IO as much as possible
• Services all queries on a single node
• Thread per disk (a minimal sketch follows this list)
  • Each thread keeps the disk busy with read-aheads
• Supports blocking and non-blocking IO
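A minimal sketch of the thread-per-disk idea: one worker thread per disk drains a queue of scan ranges so the spindle never sits idle, no matter how many queries are running. DiskQueue and ScanRange are made-up names; the real IO manager is considerably more sophisticated.

#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical scan-range request handed to the IO manager by a query.
struct ScanRange {
  int fd;
  long offset;
  long len;
};

// Illustrative per-disk worker: queries enqueue ranges, the disk's thread drains them.
class DiskQueue {
 public:
  DiskQueue() : worker_([this] { Run(); }) {}
  ~DiskQueue() {
    { std::lock_guard<std::mutex> l(mu_); done_ = true; }
    cv_.notify_one();
    worker_.join();
  }

  void Add(const ScanRange& range) {
    { std::lock_guard<std::mutex> l(mu_); ranges_.push(range); }
    cv_.notify_one();
  }

 private:
  void Run() {
    std::unique_lock<std::mutex> l(mu_);
    while (true) {
      cv_.wait(l, [this] { return done_ || !ranges_.empty(); });
      if (done_ && ranges_.empty()) return;
      ScanRange r = ranges_.front();
      ranges_.pop();
      l.unlock();
      (void)r;  // issue the read-ahead sized read here and hand the buffer to the query
      l.lock();
    }
  }

  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<ScanRange> ranges_;
  bool done_ = false;
  std::thread worker_;
};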
Impala’s IO Manager
[Diagram: within one Impala daemon, multiple queries share a single IO manager, which fans their requests out across all of the node's disks.]
Comparing Impala to Dremel
• What is Dremel?
  • Columnar storage for data with nested data structures
  • Distributed, scalable aggregation on top of that
  • Scales linearly
  • Whitepaper (http://bit.ly/19txiiX) published by Google in 2010
• The same ideas in the Hadoop stack:
  • Columnar storage: Parquet
  • Distributed aggregation: Impala
  • Contains much more than Dremel had at its time, including JOINs
More About Parquet
• What is it?
  • A container format for all popular serialization formats: Avro, Thrift, Protocol Buffers
  • Jointly developed by Cloudera & Twitter, with many more contributors now!
  • Open source, on GitHub
• Features (a toy run-length encoder follows this list)
  • Stores data in native types (i.e. bool, ints, doubles)
  • Supports fully shredded nested data
  • Support for fast index pages (fast lookups)
  • Extensible value encodings (e.g. run-length, dictionary, delta)
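To make the value-encoding bullet concrete: run-length encoding, the simplest of the listed encodings, stores each repeated value once together with a count. A toy encoder (not Parquet's actual RLE/bit-packing hybrid) could look like:

#include <cstdint>
#include <utility>
#include <vector>

// Toy run-length encoder: turns a column of values into (value, run length) pairs.
std::vector<std::pair<int32_t, uint32_t>> RunLengthEncode(
    const std::vector<int32_t>& values) {
  std::vector<std::pair<int32_t, uint32_t>> runs;
  for (int32_t v : values) {
    if (!runs.empty() && runs.back().first == v) {
      ++runs.back().second;  // extend the current run
    } else {
      runs.push_back({v, 1});  // start a new run
    }
  }
  return runs;
}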
Why Columnar Storage?
[Diagram: the same table laid out row-major vs. column-major, from Twitter's "Dremel Made Simple" blog.]
The most efficient IO is one that never happens at all.
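That avoided IO is easiest to see in a tiny sketch: with a row-major layout, summing one column still drags every column of every row through the scan, while a column-major layout reads only the column being aggregated (an illustrative in-memory example, not Impala code):

#include <numeric>
#include <vector>

// Row-major layout: all columns of a row are stored together, so summing one
// column still pulls roughly sizeof(Row) bytes per row through the IO path.
struct Row { long id; double revenue; int category; };

double SumRevenueRowMajor(const std::vector<Row>& rows) {
  double total = 0;
  for (const auto& r : rows) total += r.revenue;
  return total;
}

// Column-major layout: the revenue column is its own contiguous array, so only
// that column's bytes are read; the other columns generate no IO at all.
double SumRevenueColumnMajor(const std::vector<double>& revenue_column) {
  return std::accumulate(revenue_column.begin(), revenue_column.end(), 0.0);
}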
Nested Data?
[Diagram: how a nested record is shredded into columns, from Twitter's "Dremel Made Simple" blog.]
Does it Scale?
Scaling Impala
• Impala loves memory
• Impala will use all of your disks when possible
• Linearly scalable
  • Double your cluster size? Expect queries to run twice as fast.
  • HDFS takes care of block placement; no sharding!
Double the Hardware, Same Users
Double the Hardware, Double the Users
Double the Hardware, Double the Data
Wanna Race?
Performance Benchmark #1
Impala vs Hive 0.12
Hive 0.12 = Stinger Phase 2 on YARN/MR2
• Size: 3TB (TPC-DS scale 3,000)
• Machines: 5 nodes (8-core, 96GB memory, 12 disks, 1Gbps Ethernet)
• Format: Hive: ORC, Impala: Parquet
Winner: Impala
Hive is not the bar. One trillion rows per second is the bar.
- Josh Wills, Director of Data Science, Cloudera
Performance Benchmark #2
Impala vs DBMS-Y
DBMS-Y = a popular analytical database
• Size: 30TB (TPC-DS scale 30,000)
• Machines: 20 nodes (8-core, 96GB memory, 12 disks, 1Gbps Ethernet)
• Format: DBMS-Y: proprietary columnar, Impala: Parquet
Winner: Impala
Future Roadmap
• Additional SQL
  • ORDER BY without LIMIT
  • Analytical window functions
  • DECIMAL datatype
  • Nested data types (remember Parquet)
• Resource management
  • We didn’t talk about LLAMA: Impala on YARN
  • Confidently share a cluster between components (e.g. Impala, MR, Spark)
Future Roadmap
• Performance
  • Multi-threaded operators (yes, they are single-threaded now)
  • In-memory caching; the ability to pin tables/partitions in memory
  • Background/incremental statistics
• And more...
Feature Request? Please voice your requests!
Tell us which feature is important to you!