Impala vs HANA
7/18/2019 Impala vs HANA
http://slidepdf.com/reader/full/impala-vs-hana 1/19
HANA vs Impala, on AWS Cloud
SAP HANA Version 1.0 SPS05
vs
Cloudera Hadoop Impala 1.0 GA
Written by: Aron [email protected]
19/05/2013
(Last updated 30/05/2013)
Overview
SAP HANA is a database which runs completely in memory. Disks are only used to keep database logs, e.g. for recovery in the event of a power failure. HANA works only with structured data.
Cloudera Impala runs on Hadoop, so it is scalable, but it still relies on disks for data storage and retrieval. It is not a fully fledged SQL database, but it currently supports INSERTs and SELECTs. Impala supports queries on structured and unstructured data.
The purpose of this document is to compare and contrast the core functionality of two new products, which have both been designed to provide real-time reporting of 'Big Data'.
The key criteria compared in this document include:
– Costs on AWS
– Data Capacity
– Data Load times
– Query run times
The comparisons were performed on the cheapest configuration options available on Amazon Web Services (AWS). The additional tools each environment offers are not compared in this document. HANA has a significant number of
query modelling tools and integration tools with other SAP products (e.g. real-time replication of OLTP data) that are not available at this time for Impala.
Note: The Appendix at the end of this document has some links for additional reading and some diagrams of the architecture of both HANA and Impala.
AWS Pricing Options (US East, N Virginia)
• A small HANA system costs ~$300 p/month. HANA on AWS utilises EBS volumes, so it can be
stopped and started to reduce costs when not in use.
• An Extra Small Hadoop cluster running Impala costs ~$175 p/month. Cloudera Hadoop on
AWS currently requires Instance Stores, so it can NOT be stopped and started to reduce costs.
• The rows highlighted in the table are the configuration options I have used for performance testing.
*AWS has a lot of other instance types (e.g. m1.xlarge) that can be used by Cloudera Manager. I've only focussed on the cheaper
ones (<$1/hour), but prices range from $0.04/hour to over $3/hour depending on your requirements. AWS also offers discounts
of up to 70% if you sign up for 1-3 years, rather than pay as you go monthly.
** As of Cloudera Manager 4.5 there is no longer a node limit. 1776 nodes is used only as an illustration of a 1 Petabyte cluster.
System sizes, usage, AWS instance specs and costs (cost per hour is per instance; monthly and yearly costs are for the total required instances):

SAP HANA on AWS
– Small DEV: m2.xlarge (17.1Gb Mem, 2 Virtual Cores, 420Gb), 1 instance: $0.41/hr, $299.30/mth, $3,591.60/yr
– Medium DEV: m2.2xlarge (34.2Gb Mem, 4 Virtual Cores, 850Gb), 1 instance: $0.82/hr, $598.60/mth, $7,183.20/yr
– Large DEV: m2.4xlarge (68.4Gb Mem, 8 Virtual Cores, 1.69Tb), 1 instance: $1.64/hr, $1,197.20/mth, $14,366.40/yr
– Small PROD (adds $0.99/hr SAP licence cost): m2.xlarge (17.1Gb Mem, 2 Virtual Cores, 420Gb), 1 instance: $1.40/hr, $1,022.00/mth, $12,264.00/yr
– Medium PROD (adds $0.99/hr SAP licence cost): m2.2xlarge (34.2Gb Mem, 4 Virtual Cores, 850Gb), 1 instance: $1.81/hr, $1,321.30/mth, $15,855.60/yr
– Large PROD (adds $0.99/hr SAP licence cost): m2.4xlarge (68.4Gb Mem, 8 Virtual Cores, 1.69Tb), 1 instance: $2.63/hr, $1,919.90/mth, $23,038.80/yr

HADOOP Cluster on AWS (using Cloudera Manager)*
– Small (1 Node), minimum (Master + 1 Data Node): m1.medium (3.75Gb Mem, 1 Virtual Core, 410Gb), 2 instances: $0.12/hr, $175.20/mth, $2,102.40/yr
– Small (3 Nodes), minimum recommended (Master + 3 Data Nodes): m1.medium (3.75Gb Mem, 1 Virtual Core, 410Gb), 4 instances: $0.12/hr, $350.40/mth, $4,204.80/yr
– Medium (3 Nodes), minimum recommended (Master + 3 Data Nodes): m1.large (7.5Gb Mem, 2 Virtual Cores, 850Gb), 4 instances: $0.24/hr, $700.80/mth, $8,409.60/yr
– Large (3 Nodes), minimum recommended (Master + 3 Data Nodes): m1.xlarge (15Gb Mem, 4 Virtual Cores, 1.69Tb), 4 instances: $0.48/hr, $1,401.60/mth, $16,819.20/yr
– Large (1776 Nodes), a 1 Petabyte cluster with 26Tb memory & 7,100 virtual cores (1 Master + 1776-node cluster, assuming 3-way replication of data; cost excludes Cloudera Manager charges for clusters with more than 50 nodes): m1.xlarge (15Gb Mem, 4 Virtual Cores, 1.69Tb), 1777** instances: $0.48/hr, $622,660.80/mth, $7,471,929.60/yr
Example Dataset (SAP SPL Line item table)
• Used an SAP SPL line item table as the basis for comparison.
• Example data was sourced from a productive SAP ECC system.
• 1 period of historical data from a moderately large company code was used (~510K rows of data / 115Mb uncompressed data).
• Master data & data values were replaced with more generic values for data security reasons.
• To simulate larger volumes of data, the data was duplicated to 12 periods and then to 10 company codes.
• 510K records were thus duplicated to ~60 Million.
• For optimal query run-times a COLUMN store table type was used in HANA and in Impala (using the PARQUET file format).
Table ZSPLA – Example of a Special Purpose Ledger (SPL) line item table in SAP

FIELD | Field Description | HANA Data Type | IMPALA Data Type
RYEAR | Year | INT | INT
RBUKRS | Company Code | CHAR(4) | STRING
DOCNR | Document Number | CHAR(10) | STRING
DOCLN | Document Line Nr | CHAR(6) | STRING
POPER | Posting Period | INT | INT
RTCUR | Transaction Currency | CHAR(5) | STRING
DRCRK | DR/CR Indicator | CHAR(1) | STRING
RACCT | Account | CHAR(10) | STRING
RCNTR | Cost Center | CHAR(10) | STRING
RPRCTR | Profit Center | CHAR(10) | STRING
RZZKOKRS | Controlling Area | CHAR(4) | STRING
RMVCT | Transaction Type | CHAR(3) | STRING
RJVNAM | Joint Venture | CHAR(6) | STRING
REGROU | Equity Group | CHAR(3) | STRING
RORDNR | Internal Order | CHAR(12) | STRING
ZZPOSID | WBS | CHAR(24) | STRING
ZZRRECIN | Recovery Indicator | CHAR(2) | STRING
TSL | Amount (Transaction Currency) | DECIMAL(17,2) | DOUBLE*
HSL | Amount (Company Code Currency) | DECIMAL(17,2) | DOUBLE*
KSL | Amount (Group Currency USD) | DECIMAL(17,2) | DOUBLE*
SGTXT | Line item Text | CHAR(50) | STRING
DOCTY | Document Type | CHAR(2) | STRING
BUDAT | Posting Date | DATE | TIMESTAMP
WSDAT | Document Date | DATE | TIMESTAMP
CPUDT | CPU Date | DATE | TIMESTAMP
CPUTM | CPU Time | CHAR(6) | STRING
USNAM | User Name | CHAR(12) | STRING
* Impala 1.0 does not currently support a DECIMAL type, so additional rounding is needed for financial reporting.
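To illustrate the mapping above, here is a minimal sketch of what the Impala side of the table could look like. The slides do not show the actual DDL, so the statement below is a hypothetical reconstruction (only a few of the ZSPLA columns are spelled out):

```sql
-- Hypothetical reconstruction of the Impala Parquet table for ZSPLA.
-- Impala 1.0 has no DECIMAL type, so amounts become DOUBLE,
-- and DATE columns become TIMESTAMP.
CREATE TABLE zspla (
  ryear  INT,       -- Year
  rbukrs STRING,    -- Company Code (CHAR(4) in HANA)
  docnr  STRING,    -- Document Number
  poper  INT,       -- Posting Period
  racct  STRING,    -- Account
  ksl    DOUBLE,    -- Amount, Group Currency USD (DECIMAL(17,2) in HANA)
  budat  TIMESTAMP  -- Posting Date (DATE in HANA)
  -- ... remaining ZSPLA columns follow the same CHAR->STRING,
  --     DECIMAL->DOUBLE and DATE->TIMESTAMP mapping
) STORED AS PARQUETFILE;

-- Because amounts are DOUBLE, financial aggregates need explicit rounding:
SELECT rbukrs, ROUND(SUM(ksl), 2) AS ksl_total
FROM zspla
GROUP BY rbukrs;
```

The rounding in the final query is the "additional rounding" the footnote refers to.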
Data Load Times & Compression
HANA Small vs Impala Small (1 Node Cluster)
• Initially 510K records were loaded into a HANA column store table & an Impala Parquet table
• Scripts were run on HANA & Impala to:
– Insert 6 Million records per company code
– Execute 'Select Count(*), SUM(*)' on the table after each insert
Summary Results:
– Compression of the example dataset:
– HANA: 10 times compression (125 Mb / 6 Million records)
– Impala: 7 times compression (186 Mb / 6 Million records)
– Load times:
– HANA: load time degrades as table size increases (60 Million records took ~1 hour)
– Impala: load times are more constant (60 Million records took ~20 minutes)
– Select times:
– HANA: a simple select statement on the table consistently took less than 2 seconds, irrespective of table size
– Impala: a simple select statement on the table degraded at a noticeable rate as table size increased, e.g. from 6 seconds to over 30 seconds as the table grew. [Note: This may in part be due to a memory leak bug being fixed in the next patch release]
[Chart: runtime (hh:mm:ss, 00:00:00 to ~00:13:00) against the total number of records in the table (6 to 61 Million). Series: Impala - Insert 6 Million; Impala - Select Count & SUM; HANA - Insert 6 Million; HANA - Select Count & SUM.]
Note: Similar Impala queries run against other table types (e.g. HBASE tables) are significantly slower than Parquet tables. E.g. 'Select Count(*), SUM(*)' on an HBASE Impala table takes more than 30 seconds with only 510K records. This is still a significant improvement over similar Hive queries on the same HBASE table, which may take several minutes.
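The duplication scripts mentioned above are not reproduced in the slides; a minimal sketch of the approach on the Impala side might look like this (the table name ZSPLA comes from the earlier slide, but the literal company codes here are illustrative assumptions):

```sql
-- Duplicate the base company code's rows into another company code by
-- re-inserting them with the key column rewritten (~6 Million rows at a
-- time); the same pattern rewrites POPER to fan out to 12 periods.
INSERT INTO zspla
SELECT ryear, 'A002' AS rbukrs, docnr, docln, poper,
       rtcur, drcrk, racct, rcntr, rprctr, rzzkokrs, rmvct, rjvnam,
       regrou, rordnr, zzposid, zzrrecin, tsl, hsl, ksl,
       sgtxt, docty, budat, wsdat, cpudt, cputm, usnam
FROM zspla
WHERE rbukrs = 'A001';

-- Timed check after each insert:
SELECT COUNT(*), SUM(ksl) FROM zspla;
```

The equivalent HANA script would be an INSERT INTO ... SELECT against the column store table, timed the same way.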
Query Run Times (on a table with 60 Million records)
• Overall run times on HANA and Impala were both good; HANA consistently returned results in around 1 second.
• Performance on Impala improves as more nodes are added to the cluster.
• Impala with 18 nodes (AWS m1.medium) performed similarly to HANA (AWS m2.xlarge).
• The Impala run times in the table were also improved further, by between 30% & 65%, when the queries were re-executed: running queries on the tables primes data into the cache on each of the cluster nodes. Cloudera are apparently working on further enhancements to memory caching, which will hopefully be released later this year.
Time in seconds (a GROUP BY clause was used on most of the queries):

Abbreviated Select Statement | Records Returned | HANA Small | Impala Small (1 Node) | Impala Small (3 Nodes) | Impala Small (9 Nodes) | Impala Small (18 Nodes)
select count(*) | 1 | 1 | 4 | 1 | 1 | 1
select count(*), sum(ksl) | 1 | 1 | 14 | 3 | 1 | 1
select rbukrs, count(*), sum(ksl) | 10 | 3 | 28 | 5 | 4 | 3
select rbukrs, poper, count(*), sum(ksl) | 120 | 5 | 34 | 7 | 4 | 2
select rbukrs, poper, count(*), sum(ksl) where (rbukrs = "A009" OR rbukrs = "A010") and poper = 9 | 2 | 1 | 34 | 6 | 5 | 3
select rbukrs, poper, count(*), sum(ksl) where rbukrs = "A009" and poper = 9 | 1 | 1 | 26 | 5 | 8 | 2
select rbukrs, racct, count(*), sum(ksl) where rbukrs = "A009" and poper = 9 | 623 | 1 | 34 | 8 | 5 | 2
select rbukrs, racct, count(*), sum(ksl) where rbukrs = "A009" and poper = 9 and racct = "606500001" | 1 | 1 | 35 | 8 | 5 | 3
select ryear,poper,docnr,docln,rbukrs,racct,rprctr,tsl,hsl,ksl where rbukrs = "A009" and poper = 9 and racct = "606500001" | 14 | 4 | 79 | 16 | 8 | 4
select * where rbukrs = "A009" and poper = 9 and racct = "606500001" | 14 | 8 | 163 | 42 | 18 | 9

Est. monthly cost of a production environment on AWS (HANA m2.xlarge, Impala m1.medium): $1022 | $175 | $350 | $876 | $1664
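The statements in the table are abbreviated; written out in full against the ZSPLA table, one of the filtered queries would look something like this (the implied FROM and GROUP BY clauses are assumptions based on the note that GROUP BY was used):

```sql
SELECT rbukrs, poper, COUNT(*), SUM(ksl)
FROM zspla
WHERE (rbukrs = 'A009' OR rbukrs = 'A010')
  AND poper = 9
GROUP BY rbukrs, poper;
```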
Data Limits (based on the example dataset)
• The row limits below are crude, theoretical limits, based on the example dataset used, assuming the system does not store other tables.
• The Impala limit is based on 90% of the total HDFS disk space of the Hadoop cluster.
• The HANA limit is based on a maximum 75% memory utilisation (less the space used by general system processes).
* Based on only a 3-node cluster with a replication factor of 3 (i.e. data replicated on 3 data nodes for hardware failure protection)
System | AWS Size | Theoretical Row Limit
SAP HANA on AWS | SMALL | 0.2 Billion rows
SAP HANA on AWS | MEDIUM | 0.6 Billion rows
SAP HANA on AWS | LARGE | 1.5 Billion rows
HADOOP on AWS* | SMALL (3 Nodes) | 12 Billion rows
HADOOP on AWS* | MEDIUM (3 Nodes) | 25 Billion rows
HADOOP on AWS* | LARGE (3 Nodes) | 50 Billion rows
HADOOP on AWS* | LARGE (1776 Node Cluster) | 30 Trillion rows
Impala Query Run Times (1 Billion records)
• This example does not include HANA stats, as the AWS Small HANA box will not support 1 Billion records for the example dataset, only 0.2 Billion.
Time in seconds:

Abbreviated Select Statement | Records Returned | Impala Small (3 Nodes) | Impala Small (9 Nodes) | Impala Small (18 Nodes)
select count(*) | 1 | 18 | 5 | 3
select count(*), sum(ksl) | 1 | 80 | 31 | 17
select rbukrs, count(*), sum(ksl) | 10 | 168 | 55 | 25
select rbukrs, poper, count(*), sum(ksl) | 120 | 229 | 65 | 27
select rbukrs, poper, count(*), sum(ksl) where (rbukrs = "A009" OR rbukrs = "A010") and poper = 9 | 2 | 237 | 66 | 21
select rbukrs, poper, count(*), sum(ksl) where rbukrs = "A009" and poper = 9 | 1 | 205 | 59 | 17
select rbukrs, racct, count(*), sum(ksl) where rbukrs = "A009" and poper = 9 | 623 | 285 | 76 | 26
select rbukrs, racct, count(*), sum(ksl) where rbukrs = "A009" and poper = 9 and racct = "606500001" | 1 | 294 | 85 | 25
select ryear,poper,docnr,docln,rbukrs,racct,rprctr,tsl,hsl,ksl where rbukrs = "A009" and poper = 9 and racct = "606500001" | 14 | 449 | 152 | 70
select * where rbukrs = "A009" and poper = 9 and racct = "606500001" | 14 | Timeout | 336 | 174

Est. monthly cost of a production environment on AWS (Impala m1.medium): $350 | $876 | $1664
1 Billion Rows on an 18-Node Cluster
• As data was loaded, a gradual decline in both insert & select times was noted.
• After 800 Million records the cluster started to perform slowly, but after Impala was restarted, runtimes returned to normal. (A memory leak bug is the cause; it will be fixed in the next patch release.)
• Based on the runtime stats of the 3, 9 & 18 node clusters, runtimes appear quite linear in both the number of nodes and the number of records.
• To achieve query run times of < 10 seconds* it is estimated that a 300-node cluster would be required (~USD $26K / mth on AWS).
* Impala also has the option to partition your tables (e.g. by year / period). This would significantly reduce run times in your queries, assuming the appropriate WHERE clause restrictions are used. Partitions would increase runtimes for those queries which need to span all partitions. In the above scenario a partition by YEAR & PERIOD should reduce the longest running query, illustrated above, to less than 10 seconds, without the need to add additional nodes to the cluster.
[Chart: runtime (seconds, 0 to 140) against the total number of records in the table (0 to 1200 Million). Series: Runtime of select ryear,poper,docnr,docln,rbukrs,racct,rprctr,tsl,hsl,ksl where rbukrs = "A009" and poper = 9 and racct = "606500001"; Runtime of Insert 500K records; Runtime of Select count(*), Sum(ksl).]
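The partitioning option described in the footnote could be sketched as follows. This is a hypothetical reconstruction (the slides show no partition DDL, and the table name zspla_part is illustrative); in Impala the partition columns are declared in a PARTITIONED BY clause rather than in the column list:

```sql
-- Hypothetical: ZSPLA partitioned by year and posting period.
CREATE TABLE zspla_part (
  rbukrs STRING,
  racct  STRING,
  ksl    DOUBLE
  -- ... remaining non-partition columns
)
PARTITIONED BY (ryear INT, poper INT)
STORED AS PARQUETFILE;

-- A query restricted on the partition columns scans only the matching
-- partition's files instead of the whole table:
SELECT rbukrs, racct, COUNT(*), ROUND(SUM(ksl), 2)
FROM zspla_part
WHERE ryear = 2013 AND poper = 9 AND rbukrs = 'A009'
GROUP BY rbukrs, racct;
```

Queries without a restriction on ryear/poper would still need to read every partition, which is why partitioning can slow down the full-span queries.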
TPC-H Query Run Times (lineitem table 60 Million Rows)
• Ran comparison queries using lineitem table from TPC-H. TPC Benchmark™H (TPC-H) is a decision support benchmark. SeeAppendix for more details e.g. if you wish to use the same dataset for your own benchmark
• Overall Run times on HANA and Impala with 3 Node Cluster using Parquet format tables were good
• Unfortunately HANA was unable to complete the MERGE DELTA operation after the initial load(to optimize memory utilisation) with 1Partition, so the table was split into 5 ‘ROUND ROBIN’ Partitions.
• Also due to the memory limitations on the small AWS HANA box, it was necessary to perform an additional LOAD ALL statement prior toexecuting the queries to ensure table data was primed for reporting. This pre-step took approx. 10 seconds to complete.
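The two HANA operations referred to above can be issued from the SQL console; a minimal sketch (the table name matches the DDL in the Appendix; treat the exact syntax as an assumption against HANA 1.0 SPS05):

```sql
-- Move newly loaded rows from the delta store into the compressed main
-- column store (this failed with 1 partition on the small AWS box):
MERGE DELTA OF "lineitem";

-- Prime the whole table into memory before running the benchmark queries:
LOAD "lineitem" ALL;
```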
Time in seconds:

Select Statement | Records Returned | HANA Small | Impala Small (1 Node) Parquet | Impala Small (3 Nodes) Parquet | Impala Small (1 Node) Text | Impala Small (3 Nodes) Text
select count(*) from lineitem | 1 | 1 | 3 | 1 | 74 | 31
select count(*), sum(l_extendedprice) from lineitem | 1 | 4 | 12 | 3 | 73 | 29
select l_shipmode, count(*), sum(l_extendedprice) from lineitem group by l_shipmode | 7 | 8 | 23 | 5 | 74 | 28
select l_shipmode, count(*), sum(l_extendedprice) from lineitem where l_shipmode = 'AIR' group by l_shipmode | 1 | 1 | 20 | 4 | 73 | 28
select l_shipmode, l_linestatus, count(*), sum(l_extendedprice) from lineitem group by l_shipmode, l_linestatus | 14 | 10 | 32 | 7 | 74 | 28
select l_shipmode, l_linestatus, count(*), sum(l_extendedprice) from lineitem where l_shipmode = 'AIR' and l_linestatus = 'F' group by l_shipmode, l_linestatus | 1 | 1 | 27 | 5 | 72 | 29
select count(*) from lineitem where l_shipmode = 'AIR' and l_linestatus = 'F' and l_suppkey = 1 | 45 | 1 | 23 | 5 | 73 | 30
select l_shipmode, l_linestatus, l_extendedprice from lineitem where l_shipmode = 'AIR' and l_linestatus = 'F' and l_suppkey = 1 | 45 | 1 | 29 | 5 | 73 | 31
select * from lineitem where l_shipmode = 'AIR' and l_linestatus = 'F' and l_suppkey = 1 | 45 | 1 | 104 | 21 | 73 | 30

Size: HANA 1.9Gb (5 partitions); Parquet 3.2Gb (40 files x 80Mb); Text 7.2Gb (1 file, no compression)

Est. monthly cost of a production environment on AWS (HANA m2.xlarge, Impala m1.medium): $1022 | $175 | $350 | $175 | $350
Quick Technical Summary

Capacity & Cost | HANA (Ver 1.0 SPS05) | HADOOP & IMPALA (1.0 GA)
AWS Small: Capacity | 17Gb | 410Gb
AWS Small: Rows of data supported* | 0.2 Billion | 12 Billion
AWS Small: Cost per year (DEV) | $3,591.60 | $4,204.80
AWS Small: Cost per year (PROD) | $12,264.00 | $4,204.80
AWS Limit: Capacity | 64Gb | Petabytes (limited by your budget)
AWS Limit: Rows of data supported* | 1.5 Billion | Theoretically limited only by budget
AWS Limit: Cost per year (DEV) | $14,366.40 | Many Millions+++
AWS Limit: Cost per year (PROD) | $23,038.80 | Many Millions+++
Non-AWS Limit: Capacity | 15Tb (current limit, though expected to increase) | Petabytes (limited by your budget)
Non-AWS Limit: Rows of data supported* | 375 Billion | Theoretically limited only by budget
Non-AWS Limit: Cost | Many Millions+++ | Many Millions+++

* Based on my sample data structure only (similar to the main fields stored in an SAP SPL line item table)

Feature | HANA (Ver 1.0 SPS05) | HADOOP & IMPALA (1.0 GA)
OS | Several (incl. Linux) | Linux
Data Storage | Memory only (backup logs written to disk) | Disk & memory
Structured Data | YES | YES
Unstructured Data | NO | YES
Other files (e.g. images, videos, etc.) | NO | YES, but not accessible via IMPALA
Software | Licenced | Open Source
Integration to SAP OLTP Data | Realtime (SLT)** | Scheduled (BO Data Services or custom)
Integration to SAP OLAP Data (BW) | Realtime (if BW on HANA)** | Scheduled (BO Data Services or custom)
Reporting Tools | Business Objects + other third party | Business Objects + other third party
SQL Statements | YES | Limited (primarily supports SELECT queries)
SQL Modelling Tool | YES | NO
Predictive Analysis Tools | YES | NO
Application Development Tools | Many open source tools (e.g. Eclipse with JAVA & PYTHON addons), or ABAP if running the HANA Netweaver version | Open source tools
System Monitoring Tool | HANA Studio | Cloudera Manager (free for < 50 node clusters)

** Not supported on the AWS HANA version; custom scheduled JDBC integration required
Big Data Real-time Analytics
• Storing data in column storage provides the best response times for real-time analytic reporting and reduces storage space, whether in memory or on disk (e.g. the PARQUET/SNAPPY file format for IMPALA, or COLUMN store tables in HANA).
• In-memory computing is definitely the future for the fastest possible reporting of large data sets.
• However, storing VERY large datasets perpetually in memory, on the off chance they are needed, is a more costly route.
Analogy:
• At the moment I would liken HANA to a Ferrari: sexy, goes very fast, but has limited luggage space.
• Impala by comparison is a fleet of MPVs: good performance and good capacity, at an affordable price.
• Hadoop (without Impala) is a fleet of long-haul trucks: moderate performance, excellent capacity, and it drives overnight.
The good thing about Cloudera's Hadoop offering is that when you buy the trucks they throw in the MPVs for free.
If ALL your corporate data (ERP & non-ERP) was printed out on paper, would you haul it around in a Ferrari, MPVs or trucks?
The answer is probably that it would depend on the amount of paper and how quickly you need to deliver it.
Weaknesses:
HANA can't store unstructured data and is costly for storing COLD/FROZEN/HISTORICAL data.
IMPALA does not have sophisticated query modelling tools and does not yet support real-time replication from your OLTP system.
Solution:
Don't think of Impala and HANA as competitors, but rather as different vehicles to solve different delivery requirements.
Use Cases

Use Case* | Potential Tool
Real-time reporting of SAP OLTP data, including joins and data transformations | SAP HANA
Summarise unstructured data logs (scheduled) | HADOOP MAP/REDUCE
Realtime reporting of summarised data logs, including joins to other non-OLTP data | IMPALA
Near-realtime reporting of social media data | IMPALA + HADOOP MAP/REDUCE (scheduled to collect recent social media data)
Realtime reporting of recent OLTP data joined with recent social media data | HANA + HADOOP MAP/REDUCE (scheduled to collect recent social media data and load it into HANA)
Image analysis process (scheduled) | HADOOP MAP/REDUCE (scheduled job to run a sophisticated program which analyses image/video files and stores the results in a structured file)
Image analysis reporting | IMPALA (to report on the results file)
Predictive analysis reporting (comparing OLTP & non-OLTP data) | HANA + HADOOP MAP/REDUCE (scheduled to collect & transfer applicable historic or relevant non-OLTP data to HANA)

* These are some non-specific use cases which make use of the different tools
Wish List

SAP HANA/BW/ECC
• Add Near Line Storage (NLS) and archiving capability to HADOOP IMPALA tables, so OLD data can still be accessed.
• Integration with Impala to allow a single query to aggregate CURRENT data in SAP and historical data in IMPALA, similar to what's being attempted by SAP with a new federation layer connecting to Sybase.

IMPALA
• Memory management options for caching Parquet tables (or subsets of them, e.g. the current month) in the memory of cluster nodes.
• An SQL modelling tool.
• Decimal values (for financial reporting).
• Support for AWS EBS (not Instance Store), and automated remapping of internal IPs (based on public DNS or Elastic IPs) upon stop and start, to reduce costs when the cluster is not in use. Adding AWS start/stop functionality to Cloudera Manager would be very useful.

COLLABORATION
• Better integration tools for bi-directional transfer of data between IMPALA & HANA.
Appendix
Links to setting up your own test environments:
HANA:
http://scn.sap.com/docs/DOC-28294
IMPALA:
http://blog.cloudera.com/blog/2012/10/set-up-a-hadoophbase-cluster-on-ec2-in-about-an-hour/
AWS:
http://aws.amazon.com/ec2/instance-types/
http://aws.amazon.com/ec2/pricing/
Cloudera Impala Architecture
http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/
SQL statements supported:
http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_langref_sql.html
SAP HANA Architecture
http://www.slideshare.net/SAPCommunityNetwork/architecture-and-technology-in-sap-hana
TPC-H Dataset
Useful Links:
http://www.tpc.org/tpch/
http://www.tpc.org/tpch/spec/tpch_2_15.0.zip
http://www.haidongji.com/2011/03/30/data-generation-with-tpc-hs-dbgen-for-load-testing/
https://github.com/kj-ki/tpc-h-impala
Rough Steps in HADOOP:
1) Download the TPC-H zip file
2) Compile and run DBGEN to generate the dataset
(e.g. ./dbgen -vf -s 10 generates the 7.2Gb lineitem table with 60 Million rows)
3) Copy the TEXT files to HDFS
4) Create external Impala table(s) pointing at the HDFS TEXT files
5) Create new Impala Parquet table(s) populated from the TEXT tables
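Steps 4 and 5 above could be sketched as follows. The slides do not show the actual statements, so this is an assumed reconstruction: the HDFS path is illustrative, and the column list follows the lineitem DDL shown for HANA on the next slide (with CHAR/VARCHAR mapped to STRING, NUMERIC to DOUBLE and DATE to TIMESTAMP, as Impala 1.0 has no DECIMAL type):

```sql
-- 4) External table over the pipe-delimited dbgen output in HDFS
--    (the LOCATION path is an illustrative assumption):
CREATE EXTERNAL TABLE lineitem_text (
  l_orderkey      INT,
  l_partkey       INT,
  l_suppkey       INT,
  l_linenumber    INT,
  l_quantity      DOUBLE,
  l_extendedprice DOUBLE,
  l_discount      DOUBLE,
  l_tax           DOUBLE,
  l_returnflag    STRING,
  l_linestatus    STRING,
  l_shipdate      TIMESTAMP,
  l_commitdate    TIMESTAMP,
  l_receiptdate   TIMESTAMP,
  l_shipinstruct  STRING,
  l_shipmode      STRING,
  l_comment       STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/user/impala/tpch/lineitem';

-- 5) Parquet copy of the text table:
CREATE TABLE lineitem LIKE lineitem_text STORED AS PARQUETFILE;
INSERT OVERWRITE TABLE lineitem SELECT * FROM lineitem_text;
```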
TPC-H Dataset (cont)
Rough Steps in HANA:
1) Download the TPC-H zip file
2) Compile and run DBGEN to generate the dataset
(e.g. ./dbgen -vf -s 10 generates the 7.2Gb lineitem table with 60 Million rows)
3) Create the table in HANA Studio:

create column table "lineitem"
(
  L_ORDERKEY integer,
  L_PARTKEY integer,
  L_SUPPKEY integer,
  L_LINENUMBER integer,
  L_QUANTITY numeric(20,2),
  L_EXTENDEDPRICE numeric(20,2),
  L_DISCOUNT numeric(3,2),
  L_TAX numeric(3,2),
  L_RETURNFLAG character(1),
  L_LINESTATUS character(1),
  L_SHIPDATE date,
  L_COMMITDATE date,
  L_RECEIPTDATE date,
  L_SHIPINSTRUCT character(25),
  L_SHIPMODE character(10),
  L_COMMENT varchar(44),
  primary key (L_ORDERKEY, L_LINENUMBER)
)
PARTITION BY ROUNDROBIN PARTITIONS 5;

4) Import the table:

IMPORT FROM CSV FILE '/sap/tpc-h/tpch_2_15.0/dbgen/lineitem.tbl' INTO "00_TPCH"."lineitem" WITH RECORD DELIMITED BY '\n' FIELD DELIMITED BY '|';