Impala vs HANA

HANA vs Impala, on AWS Cloud
SAP HANA Version 1.0 SPS05 vs Cloudera Hadoop Impala 1.0 GA
Written by: Aron MacDonald
19/05/2013 (Last updated 30/05/2013)




Overview

SAP HANA is a database which runs completely in memory. Disks are only used to keep database logs, e.g. for recovery in the event of a power failure. HANA works only with structured data.

Cloudera Impala runs on Hadoop, so it is scalable, but it still relies on disks for data storage and data retrieval. It isn’t a fully fledged SQL database, but it currently supports INSERTs and SELECTs. Impala supports queries on structured and unstructured data.

The purpose of this document is to compare and contrast the core functionality of 2 new products, which have both been designed to provide real-time reporting of ‘Big Data’.

The key criteria compared in this document include:

 – Costs on AWS

 – Data Capacity

 – Data Load times

 – Query run times

The comparisons were performed on the cheapest configuration options available on Amazon Web Services (AWS).

The additional tools each environment offers are not compared in this document. HANA has a significant number of query modelling tools and integration tools with other SAP products (e.g. real-time replication of OLTP data) that are not available at this time for Impala.

Note: The Appendix at the end of this document has some links for additional reading and some diagrams of the architecture of both HANA and Impala.


AWS Pricing Options (US East, N Virginia)

• A small HANA system costs $300 p/month. HANA on AWS utilises EBS volumes, so it can be stopped and started to reduce costs when not in use.

• An Extra Small Hadoop cluster running Impala costs $175 p/month. Cloudera Hadoop on AWS currently requires Instance Stores, so it can NOT be stopped and started to reduce costs.

• Rows highlighted in the table are the configuration options I have used for performance testing.

*AWS has a lot of other API types (e.g. m1.xlarge) that can be used by Cloudera Manager. I’ve only focussed on the cheaper ones (<$1/hour), but prices range from $0.04 p/hour to over $3/hour depending on your requirement. AWS also offers discounts of up to 70% if you sign up for 1-3 years, rather than pay as you go monthly.

** As of Cloudera Manager 4.5 there is no longer a node limit. 1776 nodes is used only as an illustration of a 1 Petabyte Cluster.

System | Size (based on hardware specs of a single instance) | Usage | AWS API Name | AWS Instance Spec | Instances | Cost p/hour USD (1 Instance) | Cost p/mth USD (Total Req. Instances) | Cost p/year USD (Total Req. Instances)
SAP HANA on AWS | Small | DEV | m2.xlarge | 17.1Gb Mem, 2 Virtual Core, 420 Gb | 1 | 0.41 | 299.30 | 3,591.60
SAP HANA on AWS | Medium | DEV | m2.2xlarge | 34.2Gb Mem, 4 Virtual Core, 850 Gb | 1 | 0.82 | 598.60 | 7,183.20
SAP HANA on AWS | Large | DEV | m2.4xlarge | 68.4Gb Mem, 8 Virtual Core, 1.690 Tb | 1 | 1.64 | 1,197.20 | 14,366.40
SAP HANA on AWS | Small | PROD (add 0.99 /hr Licence Cost SAP) | m2.xlarge | 17.1Gb Mem, 2 Virtual Core, 420 Gb | 1 | 1.40 | 1,022.00 | 12,264.00
SAP HANA on AWS | Medium | PROD (add 0.99 /hr Licence Cost SAP) | m2.2xlarge | 34.2Gb Mem, 4 Virtual Core, 850 Gb | 1 | 1.81 | 1,321.30 | 15,855.60
SAP HANA on AWS | Large | PROD (add 0.99 /hr Licence Cost SAP) | m2.4xlarge | 68.4Gb Mem, 8 Virtual Core, 1.690 Tb | 1 | 2.63 | 1,919.90 | 23,038.80
HADOOP Cluster (using Cloudera Manager) * | Small (1 Node) | Minimum (Master + 1 Data Node) | m1.medium | 3.75Gb Mem, 1 Virtual Core, 410Gb | 2 | 0.12 | 175.20 | 2,102.40
HADOOP Cluster (using Cloudera Manager) * | Small (3 Node) | Minimum recommended (Master + 3 Data Node) | m1.medium | 3.75Gb Mem, 1 Virtual Core, 410Gb | 4 | 0.12 | 350.40 | 4,204.80
HADOOP Cluster (using Cloudera Manager) * | Medium (3 node) | Minimum recommended (Master + 3 Data Node) | m1.large | 7.5Gb Mem, 2 Virtual Core, 850Gb | 4 | 0.24 | 700.80 | 8,409.60
HADOOP Cluster (using Cloudera Manager) * | Large (3 Node) | Minimum recommended (Master + 3 Data Node) | m1.xlarge | 15Gb Mem, 4 Virtual Core, 1.69 Tb | 4 | 0.48 | 1,401.60 | 16,819.20
HADOOP Cluster (using Cloudera Manager) * | Large (1776 Nodes) | 1 Petabyte Cluster with 26 Tb Memory & 7100 Virtual Cores (1 Master + 1776 Node Cluster, with Replication Factor 3). Note: Cost Excludes Cloudera Manager charges for Clusters with more than 50 Nodes | m1.xlarge | 15Gb Mem, 4 Virtual Core, 1.69 Tb (Assuming 3 way replication of data, this would give | 1777** | 0.48 | 622,660.80 | 7,471,929.60
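The monthly and yearly figures in the pricing table appear to follow from the hourly rate multiplied by the number of instances. The Python sketch below reproduces a few of them. It is illustrative only: the 730 hours per month constant is an assumption inferred from the table (0.41 USD/hour against 299.30 USD/month), and the function names are made up for this example.

```python
# Assumption: 730 billable hours per month, inferred from the pricing table
# (0.41 USD/hour -> 299.30 USD/month). Not stated explicitly in the source.
HOURS_PER_MONTH = 730


def monthly_cost(hourly_rate_usd, instances=1):
    """Monthly cost in USD for a number of identically priced instances."""
    return round(hourly_rate_usd * instances * HOURS_PER_MONTH, 2)


def yearly_cost(hourly_rate_usd, instances=1):
    """Yearly cost in USD, taken as 12 times the monthly cost."""
    return round(monthly_cost(hourly_rate_usd, instances) * 12, 2)


if __name__ == "__main__":
    # HANA Small DEV: one m2.xlarge instance.
    print("HANA Small DEV  :", monthly_cost(0.41), yearly_cost(0.41))
    # HANA Small PROD: same instance plus the 0.99 USD/hour SAP licence.
    print("HANA Small PROD :", monthly_cost(0.41 + 0.99))
    # Hadoop Small (1 Node): master + 1 data node, both m1.medium.
    print("Hadoop 1 node   :", monthly_cost(0.12, instances=2))
    # Hadoop Large (1776 Nodes): master + 1776 data nodes, all m1.xlarge.
    print("Hadoop 1776 node:", monthly_cost(0.48, instances=1777))
```

The same two functions can be used to price other node counts, for example the 9 and 18 node clusters used in the query tests.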


Example Dataset (SAP SPL Line item table)

• Used an SAP SPL line item table as the basis for comparison.

• Example data was sourced from a productive SAP ECC system.

• 1 period of historical data from a moderately large company code was used (~510K rows of data / 115Mb uncompressed data).

• Master data & data values were replaced with more generic values for data security reasons.

• To simulate larger volumes of data, the data was duplicated to 12 periods and then to 10 company codes.

• 510K records were duplicated to ~60 Million.

• For optimal query run-times, a COLUMN store table type was used in HANA, and the PARQUET file format was used in Impala.

Table ZSPLA - Example of a Special Purpose Ledger (SPL) Line item table in SAP

FIELD | Field Description | HANA Data Type | IMPALA Data Type
RYEAR | Year | INT | INT
RBUKRS | Company Code | CHAR(4) | STRING
DOCNR | Document Number | CHAR(10) | STRING
DOCLN | Document Line Nr | CHAR(6) | STRING
POPER | Posting Period | INT | INT
RTCUR | Transaction Currency | CHAR(5) | STRING
DRCRK | DR/CR Indicator | CHAR(1) | STRING
RACCT | Account | CHAR(10) | STRING
RCNTR | Cost Center | CHAR(10) | STRING
RPRCTR | Profit Center | CHAR(10) | STRING
RZZKOKRS | Controlling Area | CHAR(4) | STRING
RMVCT | Transaction Type | CHAR(3) | STRING
RJVNAM | Joint Venture | CHAR(6) | STRING
REGROU | Equity Group | CHAR(3) | STRING
RORDNR | Internal Order | CHAR(12) | STRING
ZZPOSID | WBS | CHAR(24) | STRING
ZZRRECIN | Recover Indicator | CHAR(2) | STRING
TSL | Amount (Transaction Currency) | DECIMAL(17,2) | DOUBLE*
HSL | Amount (Company Code Currency) | DECIMAL(17,2) | DOUBLE*
KSL | Amount (Group Currency USD) | DECIMAL(17,2) | DOUBLE*
SGTXT | Line item Text | CHAR(50) | STRING
DOCTY | Document Type | CHAR(2) | STRING
BUDAT | Posting Date | DATE | TIMESTAMP
WSDAT | Document Date | DATE | TIMESTAMP
CPUDT | CPU Date | DATE | TIMESTAMP
CPUTM | CPU Time | CHAR(6) | STRING
USNAM | User Name | CHAR(12) | STRING

* Impala 1.0 does not currently support the Decimal type, so additional rounding is needed for Financial reporting
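To illustrate why amounts held as DOUBLE need an extra rounding step, here is a small Python sketch. It is not Impala code, and the amounts are invented for the example rather than taken from the dataset. It compares a binary floating point running total with a fixed-point one.

```python
from decimal import Decimal

# Hypothetical line item amounts (not from the example dataset):
# ten postings of 0.10 in some currency.
amounts = ["0.10"] * 10

# DOUBLE-style accumulation, similar in spirit to SUM() over a DOUBLE column.
double_total = 0.0
for amount in amounts:
    double_total += float(amount)

# Fixed-point accumulation, similar in spirit to SUM() over DECIMAL(17,2).
decimal_total = Decimal("0.00")
for amount in amounts:
    decimal_total += Decimal(amount)

# The additional rounding step needed when only DOUBLE is available.
rounded_total = round(double_total, 2)

print(double_total)   # 0.9999999999999999
print(rounded_total)  # 1.0
print(decimal_total)  # 1.00
```

Rounding the aggregated result to 2 decimal places, as in the last step, is one way to apply the workaround in a reporting layer. Whether that is acceptable depends on the accuracy requirements of the report.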


Data Load Times & Compression

HANA Small vs Impala Small (1 Node Cluster)

• Initially 510K records were loaded into a HANA Column store table & Impala Parquet table

• Scripts were run on HANA & Impala to:

 – Insert 6 Million records per Company Code

 – Execute ‘Select Count(*), SUM(*)’ on table after the Insert

Summary Results:

- Compression of Example Dataset
  - HANA: 10 times compression (125 Mb / 6 Million Records)
  - Impala: 7 times compression (186 Mb / 6 Million Records)

- Load Times
  - HANA: Load time degrades as table size increases (60 Million records took ~1 Hour)
  - Impala: Load times are more constant (60 Million records took ~20 Minutes)

- Select Times
  - HANA: Simple Select statement on table consistently took less than 2 seconds, irrespective of table size
  - Impala: Simple Select statement on table degraded at a noticeable rate as table size increased, e.g. it increased from 6 seconds to over 30 seconds as the table grew. [Note: This may in part be due to a memory leak bug being fixed in the next patch release]

[Chart: Time (hh:mm:ss), from 00:00:00 to 00:12:58, against Total Number of Records in the Table (Millions), from 6 to 61. Series: Impala - Insert 6 Million; Impala - Select Count & SUM; HANA - Insert 6 Million; HANA - Select Count & SUM]

Note: Similar Impala queries running on other table types (e.g. using HBASE tables) are significantly slower than on Parquet tables. E.g. ‘Select Count(*), SUM(*)’ on an HBASE Impala table takes more than 30 seconds with only 510K records. This is still a significant improvement over similar HIVE queries on the same HBASE table, which may take several minutes.


Query Run Times (on table with 60 Million Records)

• Overall run times on HANA and Impala were both good; however, HANA returned most results in about 1 second.

• Performance on Impala improves as more Nodes are added to the Cluster

• Impala with 18 Nodes (AWS m1.medium) performed similarly to HANA (AWS m2.xlarge)

• The Impala run times in the table were also improved further by between 30% & 65% when re-executed. Running queries on the tables primes data into the cache on each of the cluster nodes. Cloudera are apparently working on further enhancements to memory caching, which will hopefully be released later this year.

Abbreviated Select Statement (Group by statements were used on most of the queries) | Records Returned | HANA Small (seconds) | Impala Small, 1 Node (seconds) | Impala Small, 3 Nodes (seconds) | Impala Small, 9 Nodes (seconds) | Impala Small, 18 Nodes (seconds)
select count(*) | 1 | 1 | 4 | 1 | 1 | 1
select count(*), sum(ksl) | 1 | 1 | 14 | 3 | 1 | 1
select rbukrs, count(*), sum(ksl) | 10 | 3 | 28 | 5 | 4 | 3
select rbukrs, poper, count(*), sum(ksl) | 120 | 5 | 34 | 7 | 4 | 2
select rbukrs, poper, count(*), sum(ksl) where ( rbukrs = "A009" OR rbukrs = "A010") and poper = 9 | 2 | 1 | 34 | 6 | 5 | 3
select rbukrs, poper, count(*), sum(ksl) where rbukrs = "A009" and poper = 9 | 1 | 1 | 26 | 5 | 8 | 2
select rbukrs, racct, count(*), sum(ksl) where rbukrs = "A009" and poper = 9 | 623 | 1 | 34 | 8 | 5 | 2
select rbukrs, racct, count(*), sum(ksl) where rbukrs = "A009" and poper = 9 and racct = "606500001" | 1 | 1 | 35 | 8 | 5 | 3
select ryear,poper,docnr,docln,rbukrs,racct,rprctr,tsl,hsl,ksl where rbukrs = "A009" and poper = 9 and racct = "606500001" | 14 | 4 | 79 | 16 | 8 | 4
select * where rbukrs = "A009" and poper = 9 and racct = "606500001" | 14 | 8 | 163 | 42 | 18 | 9
Est. Monthly Cost of Production Environment on AWS (HANA m2.xlarge, Impala m1.medium) | | $1022 | $175 | $350 | $876 | $1664


Data Limits (based on the Example Dataset)

• The row limits in the table are crude, theoretical limits based on the example dataset used, assuming the system does not store other tables.

• The Impala limit is based on 90% of the total HDFS disk space of the Hadoop Cluster.

• The HANA limit is based on a maximum 75% memory utilisation (less space used by general system processes).

* Based on only a 3 Node Cluster with Replication Factor of 3 (i.e. data replicated on 3 Data nodes for hardware failure protection)

System | AWS Size | Theoretical Row Limit
SAP HANA on AWS | SMALL | 0.2 Billion rows
SAP HANA on AWS | MEDIUM | 0.6 Billion rows
SAP HANA on AWS | LARGE | 1.5 Billion rows
HADOOP on AWS* | SMALL (3 Nodes) | 12 Billion rows
HADOOP on AWS* | MEDIUM (3 Nodes) | 25 Billion rows
HADOOP on AWS* | LARGE (3 Nodes) | 50 Billion rows
HADOOP on AWS* | LARGE (1776 Node Cluster) | 30 Trillion rows
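The Impala row limits can be approximated from figures given elsewhere in this document: 90% of HDFS disk space, Replication Factor 3, and the Parquet compression result of 186 Mb per 6 Million records. The Python sketch below is a reconstruction of that arithmetic, not the author's own calculation. It assumes decimal units (1 Gb = 10^9 bytes), and the function name is hypothetical. The HANA limits are not reproduced, because the space used by general system processes is not stated.

```python
def impala_row_limit(data_nodes, disk_gb_per_node,
                     compressed_mb=186.0, rows_in_sample=6_000_000,
                     usable_fraction=0.90, replication_factor=3):
    """Rough row capacity of a cluster for the example dataset.

    Assumptions (reconstructed, not confirmed by the source):
    decimal units, and capacity = usable disk / replication / bytes per row.
    """
    bytes_per_row = compressed_mb * 1e6 / rows_in_sample  # about 31 bytes
    usable_bytes = data_nodes * disk_gb_per_node * 1e9 * usable_fraction
    return usable_bytes / replication_factor / bytes_per_row


if __name__ == "__main__":
    clusters = [
        ("SMALL (3 Nodes)", 3, 410),
        ("MEDIUM (3 Nodes)", 3, 850),
        ("LARGE (3 Nodes)", 3, 1690),
        ("LARGE (1776 Node Cluster)", 1776, 1690),
    ]
    for name, nodes, disk_gb in clusters:
        rows = impala_row_limit(nodes, disk_gb)
        print(f"{name}: {rows / 1e9:,.1f} Billion rows")
```

The results land within a few percent of the table values, which suggests the published limits were rounded.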


Impala Query Run Times (1 Billion Records)

• This example does not include HANA stats, as the AWS Small HANA box will not support 1 Billion records for the example dataset, only 0.2 Billion.

Abbreviated Select Statement | Records Returned | Impala Small, 3 Nodes (seconds) | Impala Small, 9 Nodes (seconds) | Impala Small, 18 Nodes (seconds)
select count(*) | 1 | 18 | 5 | 3
select count(*), sum(ksl) | 1 | 80 | 31 | 17
select rbukrs, count(*), sum(ksl) | 10 | 168 | 55 | 25
select rbukrs, poper, count(*), sum(ksl) | 120 | 229 | 65 | 27
select rbukrs, poper, count(*), sum(ksl) where ( rbukrs = "A009" OR rbukrs = "A010") and poper = 9 | 2 | 237 | 66 | 21
select rbukrs, poper, count(*), sum(ksl) where rbukrs = "A009" and poper = 9 | 1 | 205 | 59 | 17
select rbukrs, racct, count(*), sum(ksl) where rbukrs = "A009" and poper = 9 | 623 | 285 | 76 | 26
select rbukrs, racct, count(*), sum(ksl) where rbukrs = "A009" and poper = 9 and racct = "606500001" | 1 | 294 | 85 | 25
select ryear,poper,docnr,docln,rbukrs,racct,rprctr,tsl,hsl,ksl where rbukrs = "A009" and poper = 9 and racct = "606500001" | 14 | 449 | 152 | 70
select * where rbukrs = "A009" and poper = 9 and racct = "606500001" | 14 | Timeout | 336 | 174
Est. Monthly Cost of Production Environment on AWS (Impala m1.medium) | | $350 | $876 | $1664


1 Billion Rows on 18 Node cluster

• As data was loaded, a gradual slowdown in both Insert & Select times was noted.

• After 800 Million records the cluster started to perform slowly, but after ‘Impala’ was restarted, runtimes returned to normal. (A memory leak bug is the cause; it will be fixed in the next patch release.)

• Based on runtime stats of a 3, 9 & 18 Node Cluster, runtimes appear quite linear in both the number of nodes and the number of records.

• To achieve query run times of < 10 Seconds*, it is estimated that a 300 node cluster would be required (~USD $26K / mth on AWS).

*Impala also has the option to partition your tables (e.g. by Year / Period). This would significantly reduce run times in your queries, assuming the appropriate WHERE clause restrictions are used. Partitions would increase runtimes for those queries which need to span all partitions. In the above scenario a partition by YEAR & Period should reduce the longest running query, illustrated above, to less than 10 seconds, without the need to add additional nodes to the cluster.
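The 300 node estimate can be sanity-checked with a short Python sketch. It assumes, as the bullets above suggest, that runtime is inversely proportional to the number of data nodes. It also assumes the estimate is based on the slowest 1 Billion row query on 18 nodes (174 seconds), which the source does not state explicitly. The 730 hours per month constant is inferred from the pricing table, and the function names are hypothetical.

```python
import math

HOURS_PER_MONTH = 730          # assumption: inferred from the pricing table
M1_MEDIUM_USD_PER_HOUR = 0.12  # m1.medium rate from the pricing table


def scaled_runtime(runtime_s, nodes, target_nodes):
    """Runtime on target_nodes, assuming runtime scales as 1 / nodes."""
    return runtime_s * nodes / target_nodes


def nodes_for_target(runtime_s, nodes, target_s):
    """Smallest node count that meets target_s under the same assumption."""
    return math.ceil(runtime_s * nodes / target_s)


def monthly_cost(data_nodes):
    """Monthly cost in USD of the data nodes plus one master node."""
    return (data_nodes + 1) * M1_MEDIUM_USD_PER_HOUR * HOURS_PER_MONTH


if __name__ == "__main__":
    slowest_s, measured_nodes = 174, 18  # 'select *' on the 18 node cluster
    print("Runtime on 300 nodes (s):",
          scaled_runtime(slowest_s, measured_nodes, 300))
    print("Nodes needed for 10 s   :",
          nodes_for_target(slowest_s, measured_nodes, 10))
    print("Monthly cost, 300 nodes :", round(monthly_cost(300)))
```

Under these assumptions a 300 node cluster comes out at roughly 10 seconds and roughly USD 26K per month, in line with the estimate quoted above.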

[Chart: Runtime (Seconds), from 0 to 140, against Total Number of Records in the Table (Millions), from 0 to 1200. Series: Runtime: select ryear,poper,docnr,docln,rbukrs,racct,rprctr,tsl,hsl,ksl where rbukrs = "A009" and poper = 9 and racct = "606500001"; Runtime: Insert 500K records; Runtime: Select count(*), Sum(ksl)]


TPC-H Query Run Times (lineitem table 60 Million Rows)

• Ran comparison queries using the lineitem table from TPC-H. TPC Benchmark™ H (TPC-H) is a decision support benchmark. See the Appendix for more details, e.g. if you wish to use the same dataset for your own benchmark.

• Overall run times on HANA and on Impala with a 3 Node Cluster using Parquet format tables were good.

• Unfortunately HANA was unable to complete the MERGE DELTA operation (to optimize memory utilisation) after the initial load with 1 Partition, so the table was split into 5 ‘ROUND ROBIN’ Partitions.

• Also, due to the memory limitations on the small AWS HANA box, it was necessary to perform an additional LOAD ALL statement prior to executing the queries to ensure table data was primed for reporting. This pre-step took approx. 10 seconds to complete.

Select Statement | Records Returned | HANA Small (seconds) | Impala Small, 1 Node, Parquet (seconds) | Impala Small, 3 Nodes, Parquet (seconds) | Impala Small, 1 Node, Text (seconds) | Impala Small, 3 Nodes, Text (seconds)
select count(*) from lineitem | 1 | 1 | 3 | 1 | 74 | 31
select count(*), sum(l_extendedprice) from lineitem | 1 | 4 | 12 | 3 | 73 | 29
select l_shipmode, count(*), sum(l_extendedprice) from lineitem group by l_shipmode | 7 | 8 | 23 | 5 | 74 | 28
select l_shipmode, count(*), sum(l_extendedprice) from lineitem where l_shipmode = 'AIR' group by l_shipmode | 1 | 1 | 20 | 4 | 73 | 28
select l_shipmode, l_linestatus, count(*), sum(l_extendedprice) from lineitem group by l_shipmode, l_linestatus | 14 | 10 | 32 | 7 | 74 | 28
select l_shipmode, l_linestatus, count(*), sum(l_extendedprice) from lineitem where l_shipmode = 'AIR' and l_linestatus = 'F' group by l_shipmode, l_linestatus | 1 | 1 | 27 | 5 | 72 | 29
select count(*) from lineitem where l_shipmode = 'AIR' and l_linestatus = 'F' and l_suppkey = 1 | 45 | 1 | 23 | 5 | 73 | 30
select l_shipmode, l_linestatus, l_extendedprice from lineitem where l_shipmode = 'AIR' and l_linestatus = 'F' and l_suppkey = 1 | 45 | 1 | 29 | 5 | 73 | 31
select * from lineitem where l_shipmode = 'AIR' and l_linestatus = 'F' and l_suppkey = 1 | 45 | 1 | 104 | 21 | 73 | 30

Size: HANA (5 Part.) 1.9Gb; Impala Parquet (40 files x 80mb) 3.2Gb; Impala Text (1 file – No Compression) 7.2Gb

Est. Monthly Cost of Production Environment on AWS (HANA m2.xlarge, Impala m1.medium): HANA Small $1022; Impala 1 Node Parquet $175; Impala 3 Nodes Parquet $350; Impala 1 Node Text $175; Impala 3 Nodes Text $350


Quick Technical Summary

AWS Small:

Item | HANA (Ver 1.0 SPS05) | HADOOP & IMPALA (1.0 GA)
Capacity | 17Gb | 410Gb
Rows of Data Supported* | 0.2 Billion | 12 Billion
Cost per year (DEV) | $ 3,591.60 | $ 4,204.80
Cost per year (PROD) | $ 12,264.00 | $ 4,204.80

AWS Limit: (Limited only by Budget)

Item | HANA (Ver 1.0 SPS05) | HADOOP & IMPALA (1.0 GA)
Capacity | 64Gb | Petabytes (limited by your budget)
Rows of Data Supported* | 1.5 Billion | Theoretically limited only by Budget
Cost per year (DEV) | $ 14,366.40 | Many Millions+++
Cost per year (PROD) | $ 23,038.80 | Many Millions+++

Non AWS Limit: (Limited only by Budget)

Item | HANA (Ver 1.0 SPS05) | HADOOP & IMPALA (1.0 GA)
Capacity | 15Tb (Current Limit though expected to increase) | Petabytes (limited by your budget)
Rows of Data Supported* | 375 Billion | Theoretically limited only by Budget
Cost | Many Millions+++ | Many Millions+++

*Based on my sample Data structure only (similar to main fields stored in SAP SPL Line item table)

Feature | HANA (Ver 1.0 SPS05) | HADOOP & IMPALA (1.0 GA)
OS | Several (incl Linux) | Linux
Data Storage | Memory Only (Backup Logs written to Disk) | Disk & Memory
Structured Data | YES | YES
Unstructured Data | NO | YES
Other files (e.g. Images, videos, etc) | NO | YES, but not accessible via IMPALA
Software | Licenced | Open Source
Integration to SAP OLTP Data | Realtime (SLT) ** | Scheduled (BO Data Service or Custom)
Integration to SAP OLAP Data (BW) | Realtime (if BW on HANA) ** | Scheduled (BO Data Service or Custom)
Reporting Tools | Business Objects + other Third Party | Business Objects + other Third Party
SQL Statements | YES | Limited (Primarily support SELECT queries)
SQL Modelling Tool | YES | NO
Predictive Analysis Tools | YES | NO
Application Development Tools | Many Open Source tools (e.g. Eclipse with JAVA & PYTHON addons) (or ABAP if running HANA Netweaver version) | Open Source Tools
System Monitoring Tool | HANA Studio | Cloudera Manager (Free for < 50 Node Cluster)

** Not supported on AWS HANA version, Custom Scheduled JDBC integration required


Big Data Real-time Analytics

• Storing data in column storage provides the best response times for real-time analytic reporting and reduces storage space, whether in memory or on disk (e.g. PARQUET/SNAPPY file format for IMPALA, or COLUMN store tables in HANA)

• In Memory Computing is definitely the future for the fastest possible reporting of large data sets

• However, storing VERY large datasets perpetually in memory, on the off chance they are needed, is a more costly route

Analogy :

• At the moment I would liken HANA to a Ferrari: Sexy, goes very fast, but has limited luggage space.

• Impala by comparison is a fleet of MPVs: Good performance and good capacity, at an affordable price.

• Hadoop (without Impala) is a fleet of Long Haul trucks: Moderate performance, Excellent Capacity and drives overnight.

The good thing about Cloudera’s Hadoop offering is that when you buy the Trucks they throw in the MPVs for free.

If ALL your corporate data (ERP & NON ERP) was printed out on paper, would you haul it around in a Ferrari, MPVs or Trucks?

The answer is probably that it would depend on the amount of paper and how quickly you need to deliver it.

Weaknesses:

HANA can’t store unstructured data and is a costly place to store COLD/FROZEN/HISTORICAL data

IMPALA does not have sophisticated query modelling tools and does not yet support real time replication logic from your OLTP

Solution:

Don't think of Impala and HANA as competitors but rather different vehicles to solve different delivery requirements.


Use Cases

Use Case* | Potential Tool
Real-time Reporting of SAP OLTP data, including joins and data transformations | SAP HANA
Summarise Unstructured DATA LOGS (scheduled) | HADOOP MAP/REDUCE
Realtime reporting of Summarised Data Logs, including Joins to other NON OLTP Data | IMPALA
Near Realtime reporting of Social Media Data | IMPALA + HADOOP MAP/REDUCE (scheduled to collect recent Social Media Data)
Realtime reporting of recent OLTP data joined with recent Social Media Data | HANA + HADOOP MAP/REDUCE (scheduled to collect recent Social Media Data and load into HANA)
Image Analysis Process (scheduled) | HADOOP MAP/REDUCE (scheduled job to run sophisticated program which analyses Image/Video files and stores the results in a structured file)
Image Analysis Reporting | IMPALA (to report on results file)
Predictive Analysis Reporting (comparing OLTP & NON OLTP DATA) | HANA + HADOOP MAP/REDUCE (scheduled to collect & transfer applicable Historic or relevant Non OLTP Data to HANA)

*These are some non-specific use cases which make use of the different tools


Wish List

SAP HANA/BW/ECC

• Add Near Line Storage (NLS) and Archiving capability to HADOOP IMPALA tables, so OLD data can still be accessed

• Integration with Impala to allow a single query to aggregate CURRENT data in SAP and Historical data in IMPALA, similar to what's being attempted by SAP with a new federation layer connecting to Sybase

IMPALA

• Memory Management options, for caching Parquet tables (or subsets of them, e.g. Current Month) in Memory of Cluster Nodes

• SQL modelling Tool

• Decimal Values (for Financial reporting)

• Support for AWS EBS (not Instance Store), and automated remapping of internal IPs (based on public DNS or Elastic IPs) upon Stop and Start, to reduce costs when the cluster is not in use. Adding AWS Start / Stop functionality to Cloudera Manager would be very useful

COLLABORATION

• Better integration tools for Bi-Directional transfer of Data between IMPALA & HANA


Appendix

Links to setting up your own test environments:

HANA:

http://scn.sap.com/docs/DOC-28294 

IMPALA:

http://blog.cloudera.com/blog/2012/10/set-up-a-hadoophbase-cluster-on-ec2-in-about-an-hour/ 

AWS:

http://aws.amazon.com/ec2/instance-types/ 

http://aws.amazon.com/ec2/pricing/ 


Cloudera Impala Architecture

http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/ 

SQL statements supported:

http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_langref_sql.html 


TPC-H Dataset

Useful Links:

http://www.tpc.org/tpch/ 

http://www.tpc.org/tpch/spec/tpch_2_15.0.zip 

http://www.haidongji.com/2011/03/30/data-generation-with-tpc-hs-dbgen-for-load-testing/ 

https://github.com/kj-ki/tpc-h-impala 

Rough Steps in HADOOP:

1) Download TPC-H Zip file

2) Compile and run DBGEN to generate dataset

(e.g. ./dbgen -vf -s 10 generates the 7.2Gb Lineitem table with 60 Million rows)

3) Copy TEXT files to HDFS

4) Create External Impala Table(s) pointing to HDFS TEXT files

5) Create New Impala Parquet Table(s) populated from TEXT tables


TPC-H Dataset (cont)

Rough Steps in HANA:

1) Download TPC-H Zip file

2) Compile and run DBGEN to generate dataset

(e.g. ./dbgen -vf -s 10 generates the 7.2Gb Lineitem table with 60 Million rows)

3) Create table in HANA studio

create column table "lineitem"
(
    L_ORDERKEY integer,
    L_PARTKEY integer,
    L_SUPPKEY integer,
    L_LINENUMBER integer,
    L_QUANTITY numeric (20,2),
    L_EXTENDEDPRICE numeric (20,2),
    L_DISCOUNT numeric (3,2),
    L_TAX numeric (3,2),
    L_RETURNFLAG character(1),
    L_LINESTATUS character(1),
    L_SHIPDATE date,
    L_COMMITDATE date,
    L_RECEIPTDATE date,
    L_SHIPINSTRUCT character(25),
    L_SHIPMODE character(10),
    L_COMMENT varchar(44),
    primary key (L_ORDERKEY, L_LINENUMBER)
)
PARTITION BY ROUNDROBIN PARTITIONS 5;

4) Import table

IMPORT FROM CSV FILE '/sap/tpc-h/tpch_2_15.0/dbgen/lineitem.tbl' INTO "00_TPCH"."lineitem" WITH RECORD DELIMITED BY '\n' FIELD DELIMITED BY '|';