Impala Presentation - Orlando
Impala: A Modern, Open-Source SQL Engine for Hadoop
Ricky Saltzer, Cloudera
About Me
• Tools Developer at Cloudera (Raleigh, NC)
• Hadoop / Impala Enthusiast
• Co-Author of Impala in Action
• Coupon: save 45% with code impalafla (expires 01/25/2014) at www.manning.com/saltzer
About Cloudera
• Formed in 2008
• First company to release a commercial Hadoop distribution
• Employs more than 70 Hadoop committers, PMC members, and contributors
• 15 projects were founded by Cloudera employees
• 5 books published on Hadoop and related technologies
• Tens of thousands of nodes under management
• Creators of Cloudera Manager
• Install, Configure, and Manage a Hadoop cluster in minutes
Agenda
• Quick overview of Hadoop
• What is Impala?
• Goals and user view of Impala
• Impala architecture
• Scaling
• Performance comparisons
• Future roadmap
What is Apache Hadoop?
What’s Wrong With MapReduce?
• Batch oriented
• High latency
• Not all paradigms fit
• Only for developers
Wait, What about Hive?
• SQL to MapReduce engine, originally created by Facebook
• Still a great tool for batch oriented jobs (e.g. conversions)
• Was never designed to be real-time, or to handle a large volume of concurrent queries.
What is Impala?
General Purpose SQL Engine
• Analytical workloads
• Linearly scalable
• Handles queries that run from milliseconds to hours
• Thousands of concurrent queries
Runs Directly in Hadoop
• Compatible with multiple storage managers
• Reads widely used Hadoop file formats
• Runs on the same nodes as Hadoop
High Performance
• True MPP query engine
• C++ instead of Java
• Runtime code generation (LLVM IR)
• Completely new execution engine (not MapReduce)
• Novel IO manager
Completely Open Source
Apache License 2.0
http://github.com/cloudera/impala
User View of Impala
• Runs as a distributed service; each node with data runs an Impala Daemon.
• Truly distributed: any Impala daemon can accept a query
• Highly available: no single point of failure
• Submit queries via JDBC, ODBC, the Impala shell, or Hue
• Thrift application drivers (Ruby, Python, etc.)
• Queries are distributed to the nodes with the relevant data
• Uses the same metastore as Hive
There is NO Impala Format!
- Bring Your Own Data
Supported File Formats
• Parquet
  • High-performance columnar format, based on Dremel
• RCFile
  • Original Hadoop columnar file (slower)
• Avro
  • Data serialization system supporting rich, complex types
• Sequence
  • Default MapReduce output, optimized for key-value data
• Text
  • Raw text data (e.g. CSV)
Supported Compression
• Snappy
  • Fast, great compression
• GZIP
  • Slow, best compression
• BZIP2
  • Moderate speed, great compression
• LZO
  • Text only
SQL
• SQL-92 minus correlated subqueries
• Similar to HiveQL
• ORDER BY requires LIMIT (coming soon…)
• No complex data types (coming soon…)
• Other goodies:
  • INSERT INTO SELECT
  • CREATE TABLE AS SELECT
  • LOAD INTO
• UDFs and UDAFs, native (C++) and Java (a minimal native UDF sketch follows this list)
• JOINs must fit in the aggregate memory of all executing nodes
• Continuing to add additional SQL functionality
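To make the native UDF bullet above concrete, here is a minimal sketch of a scalar UDF written against Impala's C++ UDF header; the function name AddInts is purely illustrative and not part of Impala.

#include <impala_udf/udf.h>

using namespace impala_udf;

// Illustrative scalar UDF: adds two INT arguments, returning NULL if either is NULL.
IntVal AddInts(FunctionContext* context, const IntVal& a, const IntVal& b) {
  if (a.is_null || b.is_null) return IntVal::null();
  return IntVal(a.val + b.val);
}

Compiled into a shared library and registered with a CREATE FUNCTION statement, the function can then be called from any Impala query.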
Impala Architecture
• Impala Daemon: receives, coordinates, and executes user queries; runs as a distributed service
• Statestore: manages the cluster state
• Catalog Server: provides automatic metadata propagation
Impala Daemon
• Runs as a distributed process on each HDFS DataNode
• Each Impala Daemon is capable of facilitating a user query; there is no “master server”.
• Subcomponents:
  • Planner: translates user queries into a logical execution plan
  • Coordinator: receives plans and coordinates execution on remote Impala daemons
  • Executor: scans local data and finishes work given by the coordinator as fast as possible
Statestore
• Central system state repository
  • Name service (membership)
  • Metadata
  • Future: scheduling-relevant and diagnostic state
• Soft state
  • All data can be reconstructed from the rest of the system
  • The cluster continues to function without the statestore
• Heartbeats (a rough sketch follows this list)
  • Pushes new data
  • Checks for liveness
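As a rough sketch of the heartbeat mechanism just described (illustrative only; Subscriber and HeartbeatPass are made-up names, not statestore code), each pass pushes the latest topic deltas to every subscriber and uses the response as a liveness check:

#include <functional>
#include <string>
#include <vector>

// Hypothetical subscriber record; a real implementation would hold a Thrift client.
struct Subscriber {
  std::string id;
  int missed_heartbeats = 0;
};

// One heartbeat pass: push new data to every subscriber and use the result of the
// RPC ('send' stands in for it) to decide whether the daemon is still alive.
void HeartbeatPass(std::vector<Subscriber>& subscribers,
                   const std::vector<std::string>& topic_deltas,
                   const std::function<bool(const Subscriber&,
                                            const std::vector<std::string>&)>& send) {
  for (auto& sub : subscribers) {
    if (send(sub, topic_deltas)) {
      sub.missed_heartbeats = 0;  // responded: daemon is alive
    } else if (++sub.missed_heartbeats >= 3) {
      // Treat as failed; the membership change goes out on subsequent heartbeats.
    }
  }
}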
Metadata / Catalog Server
• Impala metadata sources:
  • Hive Metastore
    • Logical table representations
  • HDFS
    • Block replica locations
    • Block replica volume IDs
• Catalog Server
  • Caches metadata
  • Automatically propagates changes
  • Updates are atomic (a snapshot-swap sketch follows this list)
  • Manual refresh needed for outside changes (e.g. DDL through Hive)
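One way to picture the "cached, atomically updated" behavior above is a snapshot-swapped cache, as in the sketch below; CatalogCache and TableMeta are hypothetical names used only for illustration:

#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical cached table metadata (schema, block locations, and so on).
struct TableMeta {
  std::string schema;
};

// Illustrative catalog cache: readers grab an immutable snapshot, and an update
// swaps in a whole new map atomically, so a query never sees a half-applied change.
class CatalogCache {
 public:
  using Snapshot = std::shared_ptr<const std::map<std::string, TableMeta>>;

  Snapshot Get() const { return std::atomic_load(&tables_); }

  // Called when the catalog server propagates a change (or on a manual refresh).
  void Publish(std::map<std::string, TableMeta> updated) {
    std::atomic_store(&tables_,
        Snapshot(new std::map<std::string, TableMeta>(std::move(updated))));
  }

 private:
  Snapshot tables_ = std::make_shared<const std::map<std::string, TableMeta>>();
};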
What happens when I submit a query?
High Level Query Execution
[Diagram: an SQL application connects via ODBC/JDBC to one of several Impala nodes, each running a query planner, query coordinator, and query executor alongside an HDFS DataNode and HBase; the Hive Metastore, HDFS NameNode, and the Statestore & Catalog are shared services.]
The client sends the SQL request via ODBC, JDBC, or Thrift to any Impala daemon.
High Level Query Execution
[Diagram: the same cluster; the receiving node's planner and coordinator fan plan fragments out to the executors on the remote nodes.]
The planner turns the request into collections of plan fragments; the coordinator initiates execution on the remote nodes.
High Level Query Execution
[Diagram: the same cluster; results flow from the executors back through the coordinating node to the client.]
Intermediate results are streamed between nodes, and the query results are streamed back to the client.
Query Planning
Query Planning: Overview
• 2-phase planning process:
  • Single-node plan: a left-deep tree of plan operators
  • Plan partitioning: partition the single-node plan to maximize scan locality (i.e. less data movement)
• Parallelization of operators:
  • All query operators are fully distributed
  • Operations are heavily pipelined
Left Deep Tree
        X
       / \
      X   r6
     / \
    X   r5
   / \
  r0  r1
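In code, a left-deep plan is simply a tree in which each join's left child is the plan built so far and the right child is a base-table scan, as in the sketch below (illustrative only; PlanNode, Scan, and LeftDeepJoin are not Impala class names):

#include <memory>
#include <string>
#include <vector>

// Minimal plan node: either a base-table scan or a join over two child plans.
struct PlanNode {
  std::string op;  // e.g. "SCAN r0" or "HASH JOIN"
  std::unique_ptr<PlanNode> left, right;
};

std::unique_ptr<PlanNode> Scan(const std::string& table) {
  return std::unique_ptr<PlanNode>(new PlanNode{"SCAN " + table, nullptr, nullptr});
}

// Builds r0 JOIN r1 JOIN r2 ... as a left-deep tree: each new table becomes the
// right input of a join whose left input is everything joined so far.
std::unique_ptr<PlanNode> LeftDeepJoin(const std::vector<std::string>& tables) {
  auto plan = Scan(tables[0]);  // assumes at least one table
  for (size_t i = 1; i < tables.size(); ++i) {
    plan = std::unique_ptr<PlanNode>(
        new PlanNode{"HASH JOIN", std::move(plan), Scan(tables[i])});
  }
  return plan;
}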
Query Planning: Single-Node Plan
• Plan operators:
  • Scan
  • HashJoin
  • HashAggregation
  • Union
  • Top-N
  • Exchange
Single-Node: Example Query
SELECT t1.custid, SUM(t2.revenue) AS revenue
FROM largehdfstable t1
JOIN largehdfstable t2 ON (t1.id1 = t2.id)
JOIN smallhbasetable t3 ON (t1.id2 = t3.id)
WHERE t3.category = 'Online'
GROUP BY t1.custid
ORDER BY revenue DESC LIMIT 10;
Query Planning: Distributed Plans
• Goals:
  • Maximize scan locality, minimize data movement
  • Full distribution of all query operators (where semantically correct)
• Parallel joins (a simplified cost sketch follows this list):
  • Broadcast: the join is colocated; the right-hand side table is broadcast to each executing node. Preferred for smaller tables.
  • Partitioned: both tables are hash-partitioned on the join columns
  • Cost-based decisions are made when statistics are available
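The broadcast-versus-partition choice comes down to how many bytes each strategy has to move over the network. The sketch below is a simplified illustration of that trade-off, not Impala's actual cost model:

#include <cstdint>

enum class JoinStrategy { kBroadcast, kPartitioned };

// Simplified cost comparison: broadcasting copies the right-hand table to every
// executing node; partitioning hash-shuffles both sides once. Pick the cheaper one.
JoinStrategy ChooseJoinStrategy(int64_t lhs_bytes, int64_t rhs_bytes, int num_nodes) {
  int64_t broadcast_cost = rhs_bytes * num_nodes;  // RHS shipped to each node
  int64_t partition_cost = lhs_bytes + rhs_bytes;  // both sides shuffled once
  return broadcast_cost <= partition_cost ? JoinStrategy::kBroadcast
                                          : JoinStrategy::kPartitioned;
}

Broadcasting wins when the right-hand table is small enough that copying it everywhere is cheaper than shuffling both sides, which is why small dimension tables are broadcast and large fact-to-fact joins are partitioned.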
Query Planning: Distributed Plans
• Parallel aggregation (a minimal two-phase sketch follows this list):
  • Pre-aggregation where the data is first materialized
  • Merge aggregation partitioned by the grouping columns
• Parallel Top-N:
  • Initial Top-N operation where the data is first materialized
  • Final Top-N in the single-node plan fragment
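As a minimal single-process picture of the pre-aggregate/merge split for a SUM ... GROUP BY (illustrative only; Impala's executor streams row batches rather than materializing whole hash maps):

#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using PartialAgg = std::unordered_map<std::string, int64_t>;  // group key -> partial SUM

// Phase 1 (runs on every node while scanning): pre-aggregate locally.
PartialAgg PreAggregate(const std::vector<std::pair<std::string, int64_t>>& rows) {
  PartialAgg partial;
  for (const auto& row : rows) partial[row.first] += row.second;
  return partial;
}

// Phase 2 (after repartitioning by the grouping column): merge the partial results.
PartialAgg MergeAggregate(const std::vector<PartialAgg>& partials) {
  PartialAgg merged;
  for (const auto& p : partials)
    for (const auto& kv : p) merged[kv.first] += kv.second;
  return merged;
}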
Query Planning: Distributed Plans
1. All scans are local
2. The first join is partitioned
3. The second join is broadcast
4. Pre-aggregate the join result
5. Merge aggregation after repartitioning
6. Initial Top-N in the aggregation fragment
7. Final Top-N in the coordinator
[Diagram: the distributed plan for the example query, with the seven steps above annotated on its plan fragments.]
Under the Hood
Impala’s Execution Engine
• Written in C++ for minimal execution overhead
• The internal in-memory tuple format puts fixed-width data at fixed offsets
• Uses intrinsics/special CPU instructions for text parsing, crc32 computation, etc.
• Runtime code generation using LLVM for big loops
Runtime Code Generation
• Example of a “big loop”:
  • Inserting a batch of rows into a hash table
• Known at query compile time:
  • Number of tuples in a batch
  • Tuple layout
  • Column types
• Generated at query compile time:
  • Unrolled loops
  • All function calls inlined
  • No dead code
  • Minimal branching
Code Path Optimizations
Interpreted (hot code path, called per row):

void MaterializeTuple(char* tuple) {
  for (int i = 0; i < num_slots_; ++i) {
    char* slot = tuple + offsets_[i];
    switch (types_[i]) {
      case BOOLEAN:
        *slot = ParseBoolean();
        break;
      case INT:
        *slot = ParseInt();
        break;
      case FLOAT: …
      case STRING: …
      // etc.
    }
  }
}

Code generated for a specific tuple layout:

void MaterializeTuple(char* tuple) {
  // i = 0
  *(tuple + 0) = ParseInt();
  // i = 1
  *(tuple + 4) = ParseBoolean();
  // i = 2
  *(tuple + 5) = ParseInt();
}
Was it worth it?
Yes.
Code Path Optimizations
Codegen on?   Time    Instructions      Branches         Branch Misses   Miss %
Yes           0.63s   52,605,701,380    9,050,446,359    145,461,106     1.607
No            1.7s    102,345,521,322   17,131,519,396   370,150,103     2.161
Impala’s IO Manager
• Maximizes disk throughput
• Designed to take advantage of the aggregate throughput of your disks
• Interleaves computation and IO as much as possible
• Services all queries on a single node
• Thread per disk (a minimal sketch follows this list)
  • Each thread keeps the disk busy with read-aheads
• Supports blocking and non-blocking IO
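A minimal sketch of the thread-per-disk idea: one worker thread per disk drains a queue of scan ranges so the spindle never sits idle, no matter how many queries are running. DiskQueue and ScanRange are made-up names; the real IO manager is considerably more sophisticated.

#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical scan-range request handed to the IO manager by a query.
struct ScanRange {
  int fd;
  long offset;
  long len;
};

// Illustrative per-disk worker: queries enqueue ranges, the disk's thread drains them.
class DiskQueue {
 public:
  DiskQueue() : worker_([this] { Run(); }) {}
  ~DiskQueue() {
    { std::lock_guard<std::mutex> l(mu_); done_ = true; }
    cv_.notify_one();
    worker_.join();
  }

  void Add(const ScanRange& range) {
    { std::lock_guard<std::mutex> l(mu_); ranges_.push(range); }
    cv_.notify_one();
  }

 private:
  void Run() {
    std::unique_lock<std::mutex> l(mu_);
    while (true) {
      cv_.wait(l, [this] { return done_ || !ranges_.empty(); });
      if (done_ && ranges_.empty()) return;
      ScanRange r = ranges_.front();
      ranges_.pop();
      l.unlock();
      (void)r;  // issue the read-ahead sized read here and hand the buffer to the query
      l.lock();
    }
  }

  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<ScanRange> ranges_;
  bool done_ = false;
  std::thread worker_;
};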
Impala’s IO Manager
[Diagram: within one Impala daemon, multiple queries share a single IO manager, which fans their requests out across all of the node's disks.]
Comparing Impala to Dremel
• What is Dremel?
  • Columnar storage for data with nested data structures
  • Distributed, scalable aggregation on top of that
  • Scales linearly
  • Whitepaper (http://bit.ly/19txiiX) published by Google in 2010
• The same ideas in the Hadoop stack:
  • Columnar storage: Parquet
  • Distributed aggregation: Impala
  • Contains much more than Dremel had at its time, including JOINs
More About Parquet
• What is it?
  • A container format for all popular serialization formats: Avro, Thrift, Protocol Buffers
  • Jointly developed by Cloudera & Twitter, with many more contributors now!
  • Open source, on GitHub
• Features (a toy run-length encoder follows this list)
  • Stores data in native types (i.e. bool, ints, doubles)
  • Supports fully shredded nested data
  • Support for fast index pages (fast lookups)
  • Extensible value encodings (e.g. run-length, dictionary, delta)
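To make the value-encoding bullet concrete: run-length encoding, the simplest of the listed encodings, stores each repeated value once together with a count. A toy encoder (not Parquet's actual RLE/bit-packing hybrid) could look like:

#include <cstdint>
#include <utility>
#include <vector>

// Toy run-length encoder: turns a column of values into (value, run length) pairs.
std::vector<std::pair<int32_t, uint32_t>> RunLengthEncode(
    const std::vector<int32_t>& values) {
  std::vector<std::pair<int32_t, uint32_t>> runs;
  for (int32_t v : values) {
    if (!runs.empty() && runs.back().first == v) {
      ++runs.back().second;  // extend the current run
    } else {
      runs.push_back({v, 1});  // start a new run
    }
  }
  return runs;
}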
Why Columnar Storage?
[Diagram: the same table laid out row-major vs. column-major, from Twitter's "Dremel Made Simple" blog.]
The most efficient IO is one that never happens at all.
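That avoided IO is easiest to see in a tiny sketch: with a row-major layout, summing one column still drags every column of every row through the scan, while a column-major layout reads only the column being aggregated (an illustrative in-memory example, not Impala code):

#include <numeric>
#include <vector>

// Row-major layout: all columns of a row are stored together, so summing one
// column still pulls roughly sizeof(Row) bytes per row through the IO path.
struct Row { long id; double revenue; int category; };

double SumRevenueRowMajor(const std::vector<Row>& rows) {
  double total = 0;
  for (const auto& r : rows) total += r.revenue;
  return total;
}

// Column-major layout: the revenue column is its own contiguous array, so only
// that column's bytes are read; the other columns generate no IO at all.
double SumRevenueColumnMajor(const std::vector<double>& revenue_column) {
  return std::accumulate(revenue_column.begin(), revenue_column.end(), 0.0);
}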
Nested Data?
[Diagram: how a nested record is shredded into columns, from Twitter's "Dremel Made Simple" blog.]
Does it Scale?
Scaling Impala
• Impala loves memory
• Impala will use all of your disks when possible
• Linearly scalable
  • Double your cluster size? Expect queries to run twice as fast.
  • HDFS takes care of block placement; no sharding!
Double the Hardware, Same Users
Double the Hardware, Double the Users
Double the Hardware, Double the Data
Wanna Race?
Performance Benchmark #1
Impala vs Hive 0.12
Hive 0.12 = Stinger Phase 2 on YARN/MR2
• Size: 3TB (TPC-DS scale 3,000)
• Machines: 5 nodes (8-core, 96GB memory, 12 disks, 1Gbps Ethernet)
• Format: Hive: ORC, Impala: Parquet
Winner: Impala
Hive is not the bar. One trillion rows per second is the bar.
- Josh Wills, Director of Data Science, Cloudera
Performance Benchmark #2
Impala vs DBMS-Y
DBMS-Y = a popular analytical database
• Size: 30TB (TPC-DS scale 30,000)
• Machines: 20 nodes (8-core, 96GB memory, 12 disks, 1Gbps Ethernet)
• Format: DBMS-Y: proprietary columnar, Impala: Parquet
Winner: Impala
Future Roadmap
• Additional SQL
  • ORDER BY without LIMIT
  • Analytical window functions
  • DECIMAL datatype
  • Nested data types (remember Parquet)
• Resource management
  • We didn’t talk about LLAMA: Impala on YARN
  • Confidently share a cluster between components (e.g. Impala, MR, Spark)
Future Roadmap
• Performance
  • Multi-threaded operators (yes, they are single-threaded now)
  • In-memory caching; the ability to pin tables/partitions in memory
  • Background/incremental statistics
• And more...
Feature Request? Please voice your requests!
Tell us which feature is important to you!