Impala 2.0 Update #impalajp
Impala 2.0 Update
Sho Shimauchi, Cloudera
2014/10/31
Today’s Topic
• What is Cloudera Impala?
• Impala 1.4 / 2.0 update
  • Performance Improvement
  • Query Language
  • Resource Management and Security
  • Others
Who am I?
• Pre-sales Solutions Architect
• joined Cloudera in 2011, the first Japanese employee at Cloudera
• email: [email protected]
• Twitter: @shiumachi
Cloudera Impala
What is Impala?
• MPP SQL query engine for the Hadoop environment
  • written in native code for maximum hardware efficiency
• open-source!
  • http://impala.io/
• Supported by Cloudera, Amazon, and MapR
• History
  • 2012/10 Public Beta released
  • 2013/04 Impala 1.0 released
  • current version: Impala 2.0
Impala is easy to use
• create tables as virtual views over data stored in HDFS / HBase
  • schema metadata is stored in Metastore
  • shared with Hive, Pig, etc.
• connect via ODBC / JDBC
• authenticate via Kerberos / LDAP
• run standard SQL
  • ANSI SQL-92 based
  • limited to SELECT and bulk INSERT
  • correlated subqueries: available in 2.0 (not supported in earlier versions)
  • UDF / UDAF
Impala 1.4 (2014/07)
• DECIMAL(<precision>, <scale>)
• HDFS caching DDL
• column definition based on Parquet file (CREATE TABLE … LIKE PARQUET)
• ORDER BY without LIMIT
• LDAP connections through TLS
• SHOW PARTITIONS
• YARN integrated resource manager will be production ready
• Llama HA support
• CREATE TABLE … STORED AS AVRO
• SUMMARY command in impala-shell (provides high-level summary of query plan)
• faster COMPUTE STATS
• Performance improvements for partition pruning
• impala-shell supports UTF-8 characters
• additional built-ins from EDW systems
Impala 2.0 (2014/10)
• hash table can spill to disk
  • join and aggregate tables of arbitrary size
• Subquery enhancements
  • allowed in WHERE clauses
  • EXISTS / NOT EXISTS
  • IN / NOT IN can operate on the result set from a subquery
  • correlated / uncorrelated subqueries
  • scalar subqueries
• SQL 2003 compliant analytic window functions
  • LEAD(), LAG(), RANK(), FIRST_VALUE(), etc.
• New Data Type: VARCHAR, CHAR
• Security Enhancements
  • multiple authentication methods
  • GRANT / REVOKE / CREATE ROLE / DROP ROLE / SHOW ROLES / etc.
• text + gzip / bzip2 / Snappy
• Hint inside views
• QUERY_TIMEOUT_S
• DATE_PART() / EXTRACT()
• Parquet default block size is changed to 256MB (was: 1GB)
• LEFT ANTI JOIN / RIGHT ANTI JOIN
• impala-shell can read settings from $HOME/.impalarc
Performance Improvement
HDFS caching
• When HDFS files are cached in memory, Impala can read the cached data without any disk reads, and without making an additional copy of the data in memory
  • avoids checksumming and data copies
  • new HDFS API is available in CDH 5.0
• configure cache with Impala DDL
  • CREATE TABLE tbl_name CACHED IN '<pool>'
  • ALTER TABLE tbl_name ADD PARTITION … CACHED IN '<pool>'
Partition Pruning improvement
• Previously, Impala typically queried tables with up to approximately 3000 partitions. With the performance improvement in partition pruning, Impala can now comfortably handle tables with tens of thousands of partitions.
Spilling to Disk SQL Operation
• write temporary data to disk when Impala is close to exceeding its memory limit
• In PROFILE, the BlockMgr.BytesWritten counter reports how much data was written to disk during the query
Query Language
Subquery
Scalar subquery: produces a result set with a single row containing a single column
  SELECT x FROM t1 WHERE x > (SELECT MAX(y) FROM t2);
Uncorrelated subquery: does not refer to any tables from the outer block of the query
  SELECT x FROM t1 WHERE x IN (SELECT y FROM t2);
Correlated subquery: compares one or more values from the outer query block to values referenced in the WHERE clause of the subquery
  SELECT employee_name, employee_id FROM employees one WHERE
    salary > (SELECT avg(salary) FROM employees two WHERE one.dept_id = two.dept_id);
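To make the correlated case concrete, here is a minimal Python sketch of what the employees query computes: for each row, the inner average is taken over the rows sharing that row's dept_id. The sample rows and the helper name `above_department_average` are made up for illustration, and this says nothing about how Impala actually plans or executes the query.

```python
from collections import defaultdict

# Hypothetical sample rows: (employee_name, employee_id, dept_id, salary).
employees = [
    ("alice", 1, 10, 5000),
    ("bob", 2, 10, 3000),
    ("carol", 3, 20, 4000),
    ("dave", 4, 20, 4000),
    ("erin", 5, 20, 7000),
]


def above_department_average(rows):
    """Return (name, id) for rows whose salary exceeds their department's average."""
    totals = defaultdict(lambda: [0, 0])  # dept_id -> [salary sum, row count]
    for _name, _emp_id, dept_id, salary in rows:
        totals[dept_id][0] += salary
        totals[dept_id][1] += 1
    # One average per department, i.e. the value the correlated subquery
    # would return for every outer row with that dept_id.
    averages = {dept: total / count for dept, (total, count) in totals.items()}
    return [
        (name, emp_id)
        for name, emp_id, dept_id, salary in rows
        if salary > averages[dept_id]
    ]


result = above_department_average(employees)
```

Precomputing one average per department is a design choice of the sketch; evaluating the inner query once per outer row would give the same answer.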
Analytic Functions (a.k.a. Window Functions)
• supported in 2.0 and later
• supported functions
  • RANK() / DENSE_RANK()
  • FIRST_VALUE() / LAST_VALUE()
  • LAG() / LEAD()
  • ROW_NUMBER()
• Aggregate functions are already implemented
  • MAX(), MIN(), AVG(), SUM(), etc.
Analytic Functions Example
For each day, the query prints the closing price alongside the previous day's closing price:
select stock_symbol, closing_date, closing_price,
  lag(closing_price,1) over (partition by stock_symbol order by closing_date) as "yesterday closing"
  from stock_ticker
  order by closing_date;
+--------------+---------------------+---------------+-------------------+
| stock_symbol | closing_date        | closing_price | yesterday closing |
+--------------+---------------------+---------------+-------------------+
| JDR          | 2014-09-13 00:00:00 | 12.86         | NULL              |
| JDR          | 2014-09-14 00:00:00 | 12.89         | 12.86             |
| JDR          | 2014-09-15 00:00:00 | 12.94         | 12.89             |
| JDR          | 2014-09-16 00:00:00 | 12.55         | 12.94             |
| JDR          | 2014-09-17 00:00:00 | 14.03         | 12.55             |
| JDR          | 2014-09-18 00:00:00 | 14.75         | 14.03             |
| JDR          | 2014-09-19 00:00:00 | 13.98         | 14.75             |
+--------------+---------------------+---------------+-------------------+
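The semantics of LAG() over a partition can be sketched in plain Python. This is only an emulation for illustration: the function name `lag_by_partition` and the output key "lag" are assumptions, and the three sample rows are taken from the stock_ticker output on this slide.

```python
from itertools import groupby
from operator import itemgetter


def lag_by_partition(rows, partition_key, order_key, value_key, offset=1):
    """Emulate LAG(value, offset) OVER (PARTITION BY ... ORDER BY ...)."""
    out = []
    # Sort so that rows of one partition are adjacent and ordered within it.
    ordered = sorted(rows, key=itemgetter(partition_key, order_key))
    for _, group in groupby(ordered, key=itemgetter(partition_key)):
        group = list(group)
        for i, row in enumerate(group):
            # The first `offset` rows of a partition have no predecessor (NULL).
            previous = group[i - offset][value_key] if i >= offset else None
            out.append({**row, "lag": previous})
    return out


ticker = [
    {"stock_symbol": "JDR", "closing_date": "2014-09-13", "closing_price": 12.86},
    {"stock_symbol": "JDR", "closing_date": "2014-09-14", "closing_price": 12.89},
    {"stock_symbol": "JDR", "closing_date": "2014-09-15", "closing_price": 12.94},
]
rows = lag_by_partition(ticker, "stock_symbol", "closing_date", "closing_price")
```

The value never crosses a partition boundary, so with several stock symbols each symbol's first day would get None, matching the NULL in the first output row.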
Approximation features
• APPX_COUNT_DISTINCT query option
  • rewrite COUNT(DISTINCT) calls to use NDV()
  • speeds up the operation
  • allows multiple COUNT(DISTINCT) in a single query
• APPX_MEDIAN()
  • returns a value that is approximately the median (midpoint) of values in the set of input values
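To illustrate why an approximate distinct count can be cheaper than an exact COUNT(DISTINCT), here is a small Python sketch of a K-minimum-values estimator. This is a stand-in technique chosen for brevity; the slides do not describe how NDV() is implemented, and nothing here should be read as Impala's algorithm. The names `approx_distinct` and `_unit_hash` and the default k=1024 are assumptions.

```python
import hashlib
import heapq


def _unit_hash(value):
    """Hash a value to a float in [0, 1) using the first 8 bytes of SHA-1."""
    digest = hashlib.sha1(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def approx_distinct(values, k=1024):
    """Estimate the distinct count while remembering only k hash values."""
    kept = set()  # the k smallest distinct hashes seen so far
    heap = []     # the same hashes, negated, so heap[0] is the largest kept
    for value in values:
        h = _unit_hash(value)
        if h in kept:
            continue
        if len(heap) < k:
            kept.add(h)
            heapq.heappush(heap, -h)
        elif h < -heap[0]:
            # Replace the largest kept hash with the new, smaller one.
            evicted = -heapq.heappushpop(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    if len(heap) < k:
        return len(heap)  # fewer than k distinct values: the count is exact
    # The k-th smallest hash sits near k / n, so n is roughly (k - 1) / that hash.
    return int((k - 1) / -heap[0])


estimate = approx_distinct(i % 20000 for i in range(100000))
```

Memory stays bounded by k regardless of how many distinct values arrive, at the cost of a relative error of a few percent for k=1024.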
Approx. functions example
[localhost:21000] > select min(x), max(x), avg(x) from million_numbers;
+-------------------+-------------------+-------------------+
| min(x)            | max(x)            | avg(x)            |
+-------------------+-------------------+-------------------+
| 4.725693727250069 | 49994.56852674231 | 24945.38563793553 |
+-------------------+-------------------+-------------------+
[localhost:21000] > select appx_median(x) from million_numbers;
+----------------+
| appx_median(x) |
+----------------+
| 24721.6        |
+----------------+
CREATE TABLE … LIKE PARQUET
• CREATE TABLE ... LIKE PARQUET 'hdfs_path_of_parquet_file'
• The column names and data types are automatically configured based on the Parquet data file
ORDER BY without LIMIT
• LIMIT clause is now optional for queries that use the ORDER BY clause
• Impala automatically uses a temporary disk work area to perform the sort if the sort operation would otherwise exceed the Impala memory limit for a particular data node.
DECODE()
SELECT event, DECODE(day_of_week, 1, "Monday", 2, "Tuesday", 3, "Wednesday", 4, "Thursday", 5, "Friday", 6, "Saturday", 7, "Sunday", "Unknown day")
  FROM calendar;
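The query above maps each day_of_week value to a name, with the trailing unpaired argument acting as the default. A minimal Python sketch of that lookup behaviour follows; the function `decode` is an illustrative emulation, not Impala code, and it assumes that a call without a default yields NULL (None here) when nothing matches.

```python
def decode(expr, *args):
    """Emulate DECODE(expr, search1, result1, ..., [default])."""
    pairs, default = args, None
    if len(args) % 2 == 1:
        # An odd number of trailing arguments means the last one is the default.
        pairs, default = args[:-1], args[-1]
    for search, result in zip(pairs[0::2], pairs[1::2]):
        if expr == search:
            return result  # first matching search value wins
    return default


day_name = decode(3, 1, "Monday", 2, "Tuesday", 3, "Wednesday", "Unknown day")
```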
ANTI JOIN
LEFT ANTI JOIN / RIGHT ANTI JOIN are supported in Impala 2.0
[localhost:21000] > create table t1 (x int);
[localhost:21000] > insert into t1 values (1), (2), (3), (4), (5), (6);
[localhost:21000] > create table t2 (y int);
[localhost:21000] > insert into t2 values (2), (4), (6);
[localhost:21000] > select x from t1 left anti join t2 on (t1.x = t2.y);
+---+
| x |
+---+
| 1 |
| 3 |
| 5 |
+---+
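A left anti join keeps the left-hand rows that have no match on the right. The same result can be sketched in Python with the t1 / t2 values from this slide; the helper `left_anti_join` is hypothetical, and treating None like SQL NULL (never equal to anything) is an assumption of the sketch.

```python
def left_anti_join(left, right, left_key, right_key):
    """Return the left rows whose key matches no right row."""
    # NULL (None) keys never compare equal in SQL, so they cannot produce a match.
    right_keys = {right_key(r) for r in right if right_key(r) is not None}
    return [row for row in left if left_key(row) not in right_keys]


t1 = [1, 2, 3, 4, 5, 6]
t2 = [2, 4, 6]
unmatched = left_anti_join(t1, t2, lambda x: x, lambda y: y)
```

A RIGHT ANTI JOIN is the mirror image: swap the two inputs and their key functions.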
new data types
• DECIMAL (Impala 1.4)
  • column_name DECIMAL[(precision[,scale])]
  • DECIMAL with no precision or scale values is equivalent to DECIMAL(9,0)
• VARCHAR (Impala 2.0)
  • STRING with a max length
• CHAR (Impala 2.0)
  • STRING with a fixed length
new built-in functions
• EXTRACT(): returns one date or time field from a TIMESTAMP value
• TRUNC(): truncates date/time values to year, month, etc.
• ADD_MONTHS(): alias for MONTHS_ADD()
• ROUND(): rounds DECIMAL values
• for computing properties for statistical distributions
  • STDDEV()
  • STDDEV_SAMP() / STDDEV_POP()
  • VARIANCE()
  • VARIANCE_SAMP() / VARIANCE_POP()
• MAX_INT() / MIN_SMALLINT()
• IS_INF() / IS_NAN()
SHOW PARTITIONS
[localhost:21000] > show partitions census;
+-------+-------+--------+------+---------+
| year  | #Rows | #Files | Size | Format  |
+-------+-------+--------+------+---------+
| 2000  | -1    | 0      | 0B   | TEXT    |
| 2004  | -1    | 0      | 0B   | TEXT    |
| 2008  | -1    | 0      | 0B   | TEXT    |
| 2010  | -1    | 0      | 0B   | TEXT    |
| 2011  | 4     | 1      | 22B  | TEXT    |
| 2012  | 4     | 1      | 22B  | TEXT    |
| 2013  | 1     | 1      | 231B | PARQUET |
| Total | 9     | 3      | 275B |         |
+-------+-------+--------+------+---------+
SUMMARY
• impala-shell command
• easy-to-digest overview of the timings for the different phases of execution for a query
[localhost:21000] > select avg(ss_sales_price) from store_sales where ss_coupon_amt = 0;
+---------------------+
| avg(ss_sales_price) |
+---------------------+
| 37.80770926328327   |
+---------------------+
[localhost:21000] > summary;
+--------------+--------+----------+----------+-------+------------+----------+---------------+-----------------+
| Operator     | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail          |
+--------------+--------+----------+----------+-------+------------+----------+---------------+-----------------+
| 03:AGGREGATE | 1      | 1.03ms   | 1.03ms   | 1     | 1          | 48.00 KB | -1 B          | MERGE FINALIZE  |
| 02:EXCHANGE  | 1      | 0ns      | 0ns      | 1     | 1          | 0 B      | -1 B          | UNPARTITIONED   |
| 01:AGGREGATE | 1      | 30.79ms  | 30.79ms  | 1     | 1          | 80.00 KB | 10.00 MB      |                 |
| 00:SCAN HDFS | 1      | 5.45s    | 5.45s    | 2.21M | -1         | 64.05 MB | 432.00 MB     | tpc.store_sales |
+--------------+--------+----------+----------+-------+------------+----------+---------------+-----------------+
SET statement
• Before Impala 2.0, SET could be used only in impala-shell
• In Impala 2.0, you can use SET in client applications through the JDBC / ODBC APIs.
Resource Management and Security
Admission Control (Impala 1.3)
• Fast and lightweight resource management mechanism
  • avoids oversubscription of resources for concurrent workloads
  • queries are queued when reaching configurable limits
• Runs on every impalad
  • no SPOF
YARN and Llama
• Llama: Low Latency Application MAster
  • Subdivides coarse-grain YARN scheduling into finer-granularity for low-latency and short-lived queries
  • Llama registers one long-lived AM per YARN pool
  • Llama caches resources allocated by YARN for a short time, so that they can be quickly re-allocated to Impala queries
    • much faster than waiting for YARN
• Impala 1.4: GA. Llama HA support
Query Timeout
• A new query option, QUERY_TIMEOUT_S, lets you specify a timeout period in seconds for individual queries
• Note: The timeout clock for queries and sessions only starts ticking when the query or session is idle
Security
• Impala 2.0 can accept either kind of auth. request
  • ex) host A with Kerberos, and host B with LDAP
• Security related statements
  • GRANT
  • REVOKE
  • CREATE ROLE
  • DROP ROLE
  • SHOW ROLES
  • SHOW ROLE GRANT
• --disk_spill_encryption option
Others
Text + gzip, bzip2, and Snappy
• In Impala 2.0 and later, Impala supports using text data files that employ gzip, bzip2, or Snappy compression
• use ROW FORMAT with delimiter and escape character to create table
CREATE TABLE csv_compressed (a STRING, b STRING, c STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";
impala-shell
• UTF-8 support (1.4)
• .impalarc file (2.0)
[impala]
verbose=true
default_db=tpc_benchmarking
write_delimited=true
output_delimiter=,
output_file=/home/tester1/benchmark_results.csv
show_profiles=true
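The sample .impalarc above has the shape of an INI file with a single [impala] section. As a rough sketch, assuming that INI reading is an adequate model of the file (the slides do not show how impala-shell itself parses it), Python's standard configparser can load the same text. The helper name `load_impalarc` is hypothetical.

```python
import configparser

# The sample settings from the slide, reproduced as a string for the sketch.
SAMPLE = """\
[impala]
verbose=true
default_db=tpc_benchmarking
write_delimited=true
output_delimiter=,
output_file=/home/tester1/benchmark_results.csv
show_profiles=true
"""


def load_impalarc(text):
    """Parse .impalarc-style text and return the [impala] section as a dict."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return dict(parser["impala"])  # every value is returned as a string


settings = load_impalarc(SAMPLE)
```

To read a real file instead of a string, `parser.read(path)` with the path to $HOME/.impalarc would be the equivalent call.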
Documentation
• Cluster Sizing Guidelines for Impala
  • http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/impala_cluster_sizing.html