Pushing the performance envelope - Identifying Performance Bottlenecks in Big SQL Hadoop space.

© 2015 IBM Corporation Pushing The Performance Envelope Identifying Performance Bottlenecks in Big SQL/Hadoop space. Roy Cecil [ [email protected]]/ 10/26/2015


Page 1: Pushing the performance envelope - Identifying Performance Bottlenecks in Big SQL Hadoop space.

© 2015 IBM Corporation

Pushing The Performance Envelope: Identifying Performance Bottlenecks in Big SQL/Hadoop space.

Roy Cecil [ [email protected]]/ 10/26/2015

Page 2:

Please Note:

• IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM’s sole discretion.

• Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision.

• The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract.

• The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.

Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.

Page 3:

Agenda

Page 4:

Big SQL Architecture Overview

Page 5:

Architecture Overview – IBM Open Platform

Page 6:

Insight - IBM BigInsights for Apache Hadoop

[Diagram: IBM BigInsights for Apache Hadoop stack. Labels: IBM Open Platform (IOP) with Apache Hadoop (full open source); IBM BigInsights Analyst (Big SQL, BigSheets); IBM BigInsights Data Scientist (Big SQL, BigSheets, Big R, Machine Learning with Big R, Text Analytics); IBM BigInsights Enterprise Management (POSIX Distributed File System; multi-workload, multi-tenant scheduling); 24 x 7 Support; Hadoop Systems.]

Page 7:

Architecture Overview

[Diagram: management nodes hosting the Big SQL head node (Big SQL Master Node with DDL FMP and UDF FMP), the Big SQL Scheduler, and the database service, Hive Metastore and Hive Server; three compute nodes, each hosting a Big SQL Worker Node (Java I/O FMP, Native I/O FMP, UDF FMP, temp data), an HDFS Data Node with HDFS data, an MR Task Tracker, and other services.]

*FMP = Fenced mode process

Page 8:

A Look into DataNode

• Big SQL does not own the data. Therefore, indexes cannot be built.

• Data is scatter partitioned: there is NO co-location of data.

• The bufferpool cache is only for temporary data (within the current query).

• SORTHEAP is used for sort operations. Sorts spill to the bufferpool and to disk if SORTHEAP is insufficient.

• The Big SQL Optimizer and Query Re-write Engine selects the best access plans.

[Diagram: a compute node running a Big SQL Worker Node. Labels: Big SQL Optimizer & Query Re-write Engine; Big SQL Runtime; SORTHEAP; DB2 temp tablespace; Java I/O reader FMP; Native I/O reader FMP; HDFS data.]

Page 9:

Readers/Writers

• Readers and writers are responsible for reading/writing data from/to HDFS for the Big SQL engine.

• Native I/O reader (also known as dfsReader and the C++ reader)
  - The high-speed interface for common file formats: Delimited, Parquet, RC, Avro, and Sequencefile

• Java I/O reader
  - Handles all other formats via standard Hadoop/Hive APIs

• Both perform multi-threaded direct I/O on local data

• The database engine understands storage format capabilities
  - The projection list is pushed into the I/O format whenever possible
  - Predicates are pushed as close to the data as possible (into the storage format, if possible)
  - Predicates that cannot be pushed down are evaluated within the database engine

• The database engine is only aware of which nodes need to read
  - The Scheduler directs the readers to their portion of work

[Diagram: a compute node with a Big SQL Worker Node (Java I/O FMP, Native I/O FMP, UDF FMP, temp data), an HDFS Data Node with HDFS data, an MR Task Tracker, and other services.]
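The projection and predicate pushdown described on this slide can be pictured with a toy reader. The following is a minimal Python sketch, not Big SQL code: `scan`, `query`, the row dictionaries and the two example predicates are all hypothetical names chosen for illustration. It only shows the division of labor: the reader drops rows and columns early, and the engine evaluates whatever predicate could not be pushed down.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional

Row = Dict[str, object]


def scan(rows: Iterable[Row],
         columns: List[str],
         pushed_predicate: Optional[Callable[[Row], bool]] = None) -> Iterator[Row]:
    """Toy reader: applies the pushed-down predicate, then the projection list."""
    for row in rows:
        if pushed_predicate is not None and not pushed_predicate(row):
            continue  # filtered inside the reader; the engine never sees this row
        yield {name: row[name] for name in columns}


def query(rows: Iterable[Row]) -> List[Row]:
    """Toy engine: pushes what it can into the reader, evaluates the rest itself."""
    # Assumed pushable: a simple equality comparison on a column.
    pushed = lambda r: r["year"] == 2015
    # Assumed NOT pushable: evaluated by the engine on the reduced rows.
    residual = lambda r: len(str(r["city"])) > 4
    projected = scan(rows, ["city", "sales"], pushed)
    return [r for r in projected if residual(r)]
```

In this sketch the reader returns fewer rows and fewer columns than it reads, which is the effect the slide attributes to pushing the projection list and predicates toward the storage format.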

Page 10:

Scheduler

• The Scheduler is the main RDBMS↔Hadoop service interface

• Interfaces with the Hive Metastore for table metadata
  - The compiler asks it for some "hadoop" metadata, such as partitioning columns

• Acts like the MapReduce job tracker for Big SQL
  - Big SQL provides query predicates for the scheduler to perform partition elimination
  - Determines splits for each “table” involved in the query
  - Schedules splits on available Big SQL nodes (favoring scheduling locally to the data)
  - Serves work (splits) to I/O engines
  - Coordinates “commits” after INSERTs

[Diagram: management nodes hosting the Big SQL Master Node (DDL FMP, UDF FMP), the Big SQL Scheduler, the database service and the Hive Metastore, connected to a Big SQL Worker Node (Java I/O FMP, Native I/O FMP, UDF FMP) alongside an HDFS Data Node and MR Task Tracker.]
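The idea of scheduling splits "favoring scheduling locally to the data" can be sketched as a small greedy assignment. This is an illustrative Python sketch under stated assumptions, not the Big SQL scheduler's actual algorithm: `assign_splits`, the split names and the least-loaded tie-breaking rule are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def assign_splits(split_locations: Dict[str, List[str]],
                  workers: List[str]) -> Dict[str, List[str]]:
    """Assign each split to a worker, preferring workers that hold the data."""
    load: Dict[str, int] = {w: 0 for w in workers}
    plan: Dict[str, List[str]] = defaultdict(list)
    for split, hosts in sorted(split_locations.items()):
        # Workers that store a replica of this split locally.
        local = [h for h in hosts if h in load]
        # Fall back to any worker when no local worker is available.
        candidates = local if local else workers
        # Least-loaded candidate wins; ties are broken by name for determinism.
        chosen = min(candidates, key=lambda w: (load[w], w))
        plan[chosen].append(split)
        load[chosen] += 1
    return dict(plan)
```

A split whose replicas all live on nodes without a worker is still scheduled, only remotely, which is why locality is described as a preference rather than a requirement.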

Page 11:

Metrics Driven Performance

Page 12:

Performance Management

• Data/Event correlation

• Form Hypothesis

• Performance Tuning

• Big SQL Metrics

• Hadoop Metrics

• Operating System Metrics

• Configuration

• Software

• Hardware

• Baseline

[Cycle diagram phases: Change Management, Monitoring, Correlation, Optimizing]

Page 13:

Categories of Metrics

Page 14:

Ambari Console – System/Hadoop Metrics

Page 15:

Historical view – Drill Down

Page 16:

Exploit the power of Hadoop Metrics

HDFS

MapReduce

RPC

Resource Manager

Others

Page 17:

Add Hadoop Metrics

Page 18:

Big SQL Metrics - Data Server Manager

Page 19:

DSM Welcome Screen

Page 20:

Adding a connection to your Big SQL database

$ db2 get dbm cfg | grep SVCENAME
TCP/IP Service name (SVCENAME) = db2c_bigsql

$ grep db2c_bigsql /etc/services
db2c_bigsql 32051/tcp

Page 21:

Overview Tab

Page 23:

Locking Tab

Page 24:

Applications

Page 25:

Workload

Page 26:

Memory

Page 27:

I/O

Page 28:

Storage Tab

Page 29:

Alerts

Page 30:

Case Study

Page 31:

TPC-DS ( query 16 )

Page 32:

Query 16 – Execution Overview

Page 33:

Query 16 – Statement View

Page 34:

Query 16 – Applications View

Page 35:

Query 16 – Statement View ( Detailed )

Page 36:

Where are we writing?

Page 37:

Query 16 - Plans

Page 38:

Query 16 - Plans

Page 39:

Query 16 Plans

Page 40:

Query 16 plans

Page 41:

Query 16- Force Application off.

Page 43:

Query 16 – After ANALYZE

Page 44:

Query 16 – After ANALYZE

Page 45:

Top 10 Performance Tips

Page 47:

Spread the Big SQL data path over as many disks as possible.

Share disks between Big SQL, HDFS (dfs.data.dir) or GPFS, and MapReduce intermediate data (mapred.local.dir).

[Diagram labels: Big SQL [bigsql_db_path]; MapRed cache [mapred.local.dir]; HDFS/GPFS [dfs.data.dir]]

Page 48:

• Big SQL needs to share cluster resources with other Hadoop components

• When installing Big SQL, the user specifies the percentage of cluster resources to dedicate to Big SQL
  - The default is 25%
  - The recommended range is 25% to 75%

Page 49:

Out Of The Box results

PARQUET is the optimal storage format for Big SQL

For more details: http://bit.ly/1W7KOAk

Page 50:

Big SQL (and Hive) provide the ability to partition a table based on a data value.

This improves query performance by eliminating those partitions that do not contain the data value of interest.

Big SQL stores different data partitions as separate files in HDFS and scans only the partitions required by a query, thereby improving runtime.

Partition on a column commonly referenced in range-delimiting or equality predicates.

Date ranges are ideal for use as partition columns.
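Partition elimination on a date column can be illustrated with a few lines of Python. This is a simplified sketch, not how Big SQL or Hive implement it: `prune_partitions` and the mapping from partition key to file names are hypothetical, and a real engine works from metastore metadata rather than an in-memory dictionary.

```python
from datetime import date
from typing import Dict, List


def prune_partitions(partitions: Dict[date, List[str]],
                     start: date, end: date) -> List[str]:
    """Return only the files of partitions whose key lies in [start, end]."""
    files: List[str] = []
    for key in sorted(partitions):
        if start <= key <= end:
            # Partition may contain matching rows: its files must be scanned.
            files.extend(partitions[key])
        # Otherwise the whole partition is skipped without reading any file.
    return files
```

With one partition per day, a query restricted to two days touches only the files of those two partitions, however many days the table holds.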

Page 51:

Big SQL’s engine internally works with data in units of 32K pages and works most efficiently when the table definition allows a row to fit within 32K of memory. To exploit this optimization, use VARCHAR(n) instead of STRING whenever possible.

Use the bigsql.string.size property (via SET HADOOP PROPERTY) to lower the default size of the VARCHAR to which STRING is mapped when creating new tables.
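The 32K guidance can be checked with simple arithmetic over a table definition. The Python sketch below is a rough estimate under stated assumptions: the default mapping of STRING to a 32672-byte VARCHAR is an assumption to verify against your release, the fixed type widths are simplified, and per-row and per-column overhead is ignored. `max_row_bytes` and `fits_in_page` are hypothetical helpers, not Big SQL functions.

```python
PAGE_SIZE = 32 * 1024  # 32K page, as stated on the slide

# ASSUMPTION: default VARCHAR length that STRING is mapped to; verify locally.
DEFAULT_STRING_SIZE = 32672

# Simplified fixed widths in bytes; real storage adds overhead not modeled here.
FIXED_WIDTHS = {"INT": 4, "BIGINT": 8, "DOUBLE": 8, "DATE": 4}


def max_row_bytes(columns, string_size=DEFAULT_STRING_SIZE):
    """Sum the declared maximum widths of (name, type, length) column tuples."""
    total = 0
    for _name, col_type, length in columns:
        if col_type == "STRING":
            total += string_size
        elif col_type == "VARCHAR":
            total += length
        else:
            total += FIXED_WIDTHS[col_type]
    return total


def fits_in_page(columns, string_size=DEFAULT_STRING_SIZE):
    """True when the declared maximum row width fits within one 32K page."""
    return max_row_bytes(columns, string_size) <= PAGE_SIZE
```

Under these assumptions, two STRING columns alone exceed a page, while the same columns declared as modest VARCHAR(n) values fit easily, which is the motivation for the tip above.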

Page 52:

Big SQL uses a powerful cost-based optimizer to select an optimal plan for each query. Having up-to-date statistics is key to good query performance.

More on best practices around ANALYZE: http://ibm.co/1PDXR8r

Page 53:

Informational constraints are defined like referential integrity constraints, but they are not enforced.

Informational constraints give the optimizer hints about unique values, which can prevent it from doing unnecessary sorting and aggregation. They also help the optimizer make better selectivity estimates.

Page 54:

Big SQL comes with a powerful WLM (Workload Management) tool.

With the WLM tool it is easy to define different workloads and assign resources (CPU/memory) to them.

This allows better exploitation of your system without sacrificing the QoS requirements of your workloads.

It also improves the overall throughput of multi-stream workloads.

Page 55:

The Self Tuning Memory Manager (STMM) is a thread that observes the memory usage patterns on your cluster and adjusts the various buffers to ensure your queries perform optimally.

Turn on STMM and ensure that the database is activated on all nodes.

Page 57:

Disclaimer: Based on IBM internal tests comparing BigInsights Big SQL, Cloudera Impala and Hortonworks Hive (current versions available as of 9/01/2014) running on identical hardware. The test workload was based on the latest revision of the TPC-DS benchmark specification at 10TB data size. Successful executions measure the ability to execute queries a) directly from the specification without modification, b) after simple modifications, c) after extensive query rewrites. All minor modifications are either permitted by the TPC-DS benchmark specification or are of a similar nature. All queries were reviewed and attested by a TPC certified auditor. Development effort measured time required by a skilled SQL developer familiar with each system to modify queries so they will execute correctly. Performance test measured scaled query throughput per hour of 4 concurrent users executing a common subset of 46 queries across all 3 systems at 10TB data size. Results may not be typical and will vary based on actual workload, configuration, applications, queries and other variables in a production environment. Cloudera, the Cloudera logo, Cloudera Impala are trademarks of Cloudera. Hortonworks, the Hortonworks logo and other Hortonworks trademarks are trademarks of Hortonworks Inc. in the United States and other countries.

Page 58:

Audited Results

Page 60:

Big SQL runs more SQL out-of-box

Porting Effort:
  Big SQL 4.1: 1 hour
  Spark SQL 1.5.0: 3-4 weeks

Big SQL is the only engine that can execute all 99 queries with minimal porting effort

Page 61:

Big SQL vs. Spark SQL @ 1TB TPC-DS

• Single Stream Results:
  - Big SQL was faster than Spark SQL on 76 / 99 queries
  - When Big SQL was slower, it was only slower by 1.6X on average
  - Query vs. query, Big SQL was on average 5.5X faster
  - Removing Top 5 / Bottom 5, Big SQL was 2.5X faster
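The slide does not say how the averages were computed. One plausible reading is a per-query ratio of elapsed times, averaged with and without the extreme queries. The Python sketch below follows that reading as an assumption only: `speedup_summary`, its arguments and the use of an arithmetic mean are hypothetical, and it contains no measured data.

```python
from statistics import mean
from typing import Dict


def speedup_summary(big_sql: Dict[str, float],
                    spark_sql: Dict[str, float],
                    trim: int = 5) -> Dict[str, float]:
    """Summarize per-query speedups from two maps of query name to elapsed seconds."""
    # Compare only the queries that both engines completed.
    common = sorted(set(big_sql) & set(spark_sql))
    # Ratio > 1 means the first engine was faster on that query.
    ratios = sorted(spark_sql[q] / big_sql[q] for q in common)
    faster = sum(1 for r in ratios if r > 1.0)
    # Drop the `trim` smallest and largest ratios when enough queries remain.
    trimmed = ratios[trim:len(ratios) - trim] if len(ratios) > 2 * trim else ratios
    return {
        "faster": faster,
        "total": len(ratios),
        "mean": mean(ratios),
        "trimmed_mean": mean(trimmed),
    }
```

A large gap between the plain and trimmed averages indicates that a handful of queries dominate the headline figure, which is presumably why the slide reports both.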

Page 62:

But, … what happens when you scale it?

1 TB, Single Stream:
• Big SQL was faster on 76 / 99 queries
• Big SQL averaged 5.5X faster
• Removing Top / Bottom 5, Big SQL averaged 2.5X faster

1 TB, 4 Concurrent Streams:
• Spark SQL FAILED on 3 queries
• Big SQL was 4.4X faster*

10 TB, Single Stream:
• Big SQL was faster on 80 / 99 queries
• Spark SQL FAILED on 7 queries
• Big SQL averaged 6.2X faster*
• Removing Top / Bottom 5, Big SQL averaged 4.6X faster

10 TB, 4 Concurrent Streams:
• Big SQL elapsed time for workload was better than linear
• Spark SQL could not complete the workload (numerous issues). Partial results possible with only 2 concurrent streams.

*Compares only queries that both Big SQL and Spark SQL could complete (benefits Spark SQL)

[Axis labels: More Users, More Data]

Page 63:

Recommendation: Use Both. Right Tool for the Right Job

Not mutually exclusive. Big SQL & Spark SQL can co-exist in the cluster

Big SQL:
• Migrating existing workloads to Hadoop
• Security
• Many Concurrent Users
• Best in-class Performance
• Avoid maintaining 2 versions of SQL queries (RDBMS vs. Hadoop)

Spark SQL:
• Machine Learning
• Large Scale / Complex Transformations
• Very Good Performance
• Ideal tool for Data Engineers and Data Scientists

… invoke Big SQL from Spark for best of both…

Page 65:

© 2015 IBM Corporation

Thank You

[email protected]

Sr. Performance Engineer, IBM Software Labs, Dublin

Page 66:

We Value Your Feedback!

Don’t forget to submit your Insight session and speaker feedback! Your feedback is very important to us – we use it to continually improve the conference.

Access the Insight Conference Connect tool at insight2015survey.com to quickly submit your surveys from your smartphone, laptop or conference kiosk.

Page 67:

Notices and Disclaimers

Copyright © 2015 by International Business Machines Corporation (IBM). No part of this document may be reproduced or transmitted in any form without written permission from IBM.

U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM.

Information in these presentations (including information relating to products that have not yet been announced by IBM) has been reviewed for accuracy as of the date of initial publication and could include unintentional technical or typographical errors. IBM shall have no responsibility to update this information. THIS DOCUMENT IS DISTRIBUTED "AS IS" WITHOUT ANY WARRANTY, EITHER EXPRESS OR IMPLIED. IN NO EVENT SHALL IBM BE LIABLE FOR ANY DAMAGE ARISING FROM THE USE OF THIS INFORMATION, INCLUDING BUT NOT LIMITED TO, LOSS OF DATA, BUSINESS INTERRUPTION, LOSS OF PROFIT OR LOSS OF OPPORTUNITY. IBM products and services are warranted according to the terms and conditions of the agreements under which they are provided.

Any statements regarding IBM's future direction, intent or product plans are subject to change or withdrawal without notice.

Performance data contained herein was generally obtained in a controlled, isolated environments. Customer examples are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual performance, cost, savings or other results in other operating environments may vary.

References in this document to IBM products, programs, or services does not imply that IBM intends to make such products, programs or services available in all countries in which IBM operates or does business.

Workshops, sessions and associated materials may have been prepared by independent session speakers, and do not necessarily reflect the views of IBM. All materials and discussions are provided for informational purposes only, and are neither intended to, nor shall constitute legal or other guidance or advice to any individual participant or their specific situation.

It is the customer’s responsibility to insure its own compliance with legal requirements and to obtain advice of competent legal counsel as to the identification and interpretation of any relevant laws and regulatory requirements that may affect the customer’s business and any actions the customer may need to take to comply with such laws. IBM does not provide legal advice or represent or warrant that its services or products will ensure that the customer is in compliance with any law.

Page 68:

Notices and Disclaimers (con’t)

Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products in connection with this publication and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. IBM does not warrant the quality of any third-party products, or the ability of any such third-party products to interoperate with IBM’s products. IBM EXPRESSLY DISCLAIMS ALL WARRANTIES, EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.

The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents, copyrights, trademarks or other intellectual property right.

• IBM, the IBM logo, ibm.com, Aspera®, Bluemix, Blueworks Live, CICS, Clearcase, Cognos®, DOORS®, Emptoris®, Enterprise Document Management System™, FASP®, FileNet®, Global Business Services ®, Global Technology Services ®, IBM ExperienceOne™, IBM SmartCloud®, IBM Social Business®, Information on Demand, ILOG, Maximo®, MQIntegrator®, MQSeries®, Netcool®, OMEGAMON, OpenPower, PureAnalytics™, PureApplication®, pureCluster™, PureCoverage®, PureData®, PureExperience®, PureFlex®, pureQuery®, pureScale®, PureSystems®, QRadar®, Rational®, Rhapsody®, Smarter Commerce®, SoDA, SPSS, Sterling Commerce®, StoredIQ, Tealeaf®, Tivoli®, Trusteer®, Unica®, urban{code}®, Watson, WebSphere®, Worklight®, X-Force® and System z® Z/OS, are trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at: www.ibm.com/legal/copytrade.shtml.
