Taming the Elephant: The Power of SQL on Hadoop
Transcript of Taming the Elephant: The Power of SQL on Hadoop
Grab some coffee and enjoy the pre-show banter before the top of the hour!
Hot Technologies of 2014
HOST: Eric Kavanagh
THIS YEAR is…
SQL on Hadoop
SQL has been the de facto query language for decades
Hadoop provides an innovative data platform, but accessing and leveraging the file system has so far often meant the need for a whole new skill set
The marriage of highly performant SQL and Hadoop can be a giant step forward
ANALYST:
John O’Brien Principal & CEO, Radiant Advisors
ANALYST:
Robin Bloor Chief Analyst, The Bloor Group
GUEST:
John Santaferraro Vice President of Marketing, Actian
THE LINE UP
INTRODUCING
John O’Brien
© Copyright 2014 Radiant Advisors. All Rights Reserved
TAMING THE ELEPHANT: THE POWER OF SQL-ON-HADOOP
Hot Technologies – Inside Analysis July 16, 2014
John O’Brien | Principal Advisor and CEO, Radiant Advisors @obrienjw @radiantadvisors [email protected]
Enable Highly Iterative: Access, assemble, verify, deploy process, and modern data platform. Enable fail-fast, short shelf life, personalized to enterprise context
Self-Sufficiency is the New Self-Service: Agility and data integration through abstraction usage. Enable many business analysts, not just programmers, with pre-built
Intuitive Visualization Tools Oriented: All forms of business analytics required from SQL, nPath, Graph, Textual, Statistical, Predictive to achieve business goals
The Power of SQL-on-Hadoop HOW SQL UNLOCKS DISCOVERY
[Diagram: Business Value plotted against Users Involved, on Hadoop v1]
User groups: Power Users • Analysts & Casual Users • Very Few Data Scientists • More Analysts • Many Many Consumers
Technologies: MapReduce • PIG • Hive • HCatalog • DB • BI Tool • Hadoop Distributed File System
The Power of SQL-on-Hadoop UNLOCKING BIG DATA VALUE
Have to meet the casual users' expectations
The Power of SQL-on-Hadoop INDEPENDENT BENCHMARK DOWNLOAD
The Power of SQL-on-Hadoop KEY EVALUATION CONSIDERATIONS
Evaluation Criteria
SQL Capability: Tools Compatibility • ANSI SQL • Analytic SQL • User Defined Functions
Scalability: How many nodes max? • All nodes in cluster? • Subset of cluster? • Data duplication?
Speed: Response time • Ad-hoc workloads • Without caching • Concurrency
Architecture: YARN compatible • Data file formats • Data Lake strategy • Semantic Layer
[Diagram: three Hadoop stacks side by side]
Batch-oriented SQL (Hadoop v1): PIG • Hive-QL • MapReduce • Hadoop HDFS
Interactive SQL (Hadoop v2): PIG • Hive 0.13 • MapReduce • Tez • YARN • Hadoop HDFS
Architectural SQL (Hadoop v2 with more SQL options): PIG • Hive • M/R • Tez • YARN • MPP Engines (Impala, HAWQ, InfiniDB, Presto) • *Vortex • Hadoop HDFS
The Power of SQL-on-Hadoop EVOLVING ARCHITECTURE FOR SQL
[Diagram: modern data platform with three classes of data store]
Flexibility Class: Apache Hadoop (HDFS, MapReduce, PIG / Hive). Discovery, Scalable, Programs
Reference Class: Enterprise Data Warehouses, Master Reference Data. Stable, Context, SQL
Optimized Class: Highly Optimized for Analytics (In-memory, MOLAP, MPP, Columnar) and Highly Specialized for Analytics (Graphs, Document Stores, Text Analytics). Discovery & Analytics Oriented
Sources: Operational Systems, Big Data, Streams
Generate Hive SQL
Extending SQL Access to Big Data and Hadoop via Hive and other HDFS SQL engines
The Power of SQL-on-Hadoop MODERN DATA PLATFORM UNIFIED SQL
THANK YOU!
For more information www.RadiantAdvisors.com
Twitter: @RadiantAdvisors
RSS: feed://radiantadvisors.com/feed/
Email us at: [email protected]
LinkedIn: www.linkedin.com/company/radiant-advisors
1. What file format do you recommend loading data into for SQL? (e.g. RC, ORC, Sequence, Parquet, JSON, proprietary)
2. Are the data files accessible by other Hadoop engines (Hive, PIG, MapReduce, Spark) or duplicated for SQL access?
3. Where is the schema metadata stored in Hadoop? (e.g. Hive metastore, HCatalog, other)
4. How easy is it for a business analyst to create and work with schema definitions?
The Power of SQL-on-Hadoop ANALYST QUESTIONS
INTRODUCING
Dr. Robin Bloor
Hadoop
The Obvious Role of Hadoop is as the Staging Area for Data Refinement
But it can also be a file system for a database
Big Data Architecture In Overview
Think Logical, Implement Physical
Two Data Flows
Within The Data Hub
Nevertheless, the main workload is SQL, and SQL with analytics
SQL on Hadoop
It’s not about SQL on Hadoop; it’s about fast SQL on Hadoop
Hadoop both as a file system and a database is probably desirable
INTRODUCING
John Santaferraro
Confidential © 2014 Actian Corporation
Actian Analytics Platform Hadoop SQL Edition Hot Technologies Webinar
John Santaferraro, VP of Solution and Product Marketing, Actian
July 16, 2014
Actian Turns Data into Transformational Value
Data Explosion • Actian Analytics Platform™ (Analyze, Act, Connect) • Transformational Value
Transformational Value: Customer Delight • Competitive Advantage • World-Class Risk Management • Disruptive New Business Models
Best in Class Usage: Discovery without limitations • Low latency at any scale • Reactive to predictive • Static to dynamic • Segment of 1
Best in Class Capabilities: Design-time & run-time optimization • Linear parallelism • Rich analytics DNA • Pipeline architecture • Affordable unlimited scale
Actian Analytics Platform: Next Generation Big Data Analytics
[Diagram: Actian Analytics Platform™]
Data: Enterprise Data • Machine Data • Social Data • SaaS Data • Data Warehouse • Amazon Redshift
Components: Connections for Any Data • Massively Parallel Integration • Libraries of Analytics • Hadoop • High-Performance, Low Latency Analytics in Database • Visual Data Science and Analytics Workbench • Real-Time Analytic Services
Consumers: Business Processes • Users • Machines • Applications
High Performance Data Science Natively in Hadoop
What is HOT?
• Turns Hadoop into a High-Performance, Fully-Functional Analytics Platform
What makes it HOT?
• Highest performing, most industrialized SQL access to Hadoop data
• Only end-to-end analytic processing natively in Hadoop
• Most consumable, accessible, manageable Hadoop analytics
What does this mean to YOU?
• Removes all barriers for business access to big data analytics
• Unleashes millions of business-savvy SQL users with no constraints on Hadoop data
• Accelerates time to value and turns Hadoop data into transformational value
Actian Analytics Platform – Hadoop SQL Edition Industrialized, High-Performance SQL in Hadoop
Actian Analytics Platform – Hadoop SQL Edition Transform Hadoop into a High Performance Analytics Platform
[Diagram labels: HADOOP • YARN • Namenode • SQL • several Datanodes on HDFS, each with an X100 engine • Visual Data Science & Analytics Workbench]
High Performance Dataflow Engine: Read • Load • Blend & Enrich • Data Science & Analytics
High Performance, Industrialized SQL Database: Actian Vector
HDFS: Original file format • Standard block replication
Vector: Column-based blocks • Binary • Compressed • Partitioned
• Faster Loading • Faster SQL • Standard SQL • Better Scaling
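The slide contrasts files kept in their original format on HDFS with Vector's column-based, compressed blocks. As a rough illustration of why a column layout compresses well and lets a query read only the columns it needs, here is a minimal Python sketch. It is not Actian's storage format; the sample rows and the names `to_columns` and `run_length_encode` are hypothetical, chosen only for this example.

```python
from itertools import groupby

# Hypothetical sample rows (row-oriented layout): one tuple per record.
rows = [
    ("2014-07-16", "US", 10),
    ("2014-07-16", "US", 12),
    ("2014-07-16", "DE", 7),
    ("2014-07-17", "DE", 9),
]

def to_columns(records):
    """Pivot row tuples into one list per column (column-oriented layout)."""
    return [list(column) for column in zip(*records)]

def run_length_encode(values):
    """Compress consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

dates, countries, amounts = to_columns(rows)

# Repeated values sit next to each other within a column, so they compress well.
encoded_dates = run_length_encode(dates)
encoded_countries = run_length_encode(countries)

# A query such as SUM(amount) only needs to read the one column it uses.
total_amount = sum(amounts)
```

Run-length encoding is only one of several column compression schemes; it is used here because it is the simplest to read.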
Visual Data Science & Analytics Workbench • Drag/drop 1000+ analytic functions • Connect, blend, & enrich data • Perform discovery analytics & data science • Build and test predictive models
MapReduce Coding
• Comprehensive – covers full analytic process: data blending & enrichment, discovery & data science, analytics & operational BI
• Accessible – standard ANSI SQL-92 to support standard BI tools; plus key advanced analytics including cube, grouping sets and windowing functions
• Optimized – mature, proven planner and optimizer; optimal use of every node, CPU, memory, and cache
• Secure – native DBMS security including authentication, user and role-based security, data protection, and encryption
• Reliable – fully ACID-compliant with multi-version read consistency, plus system-wide failover protection
• Manageable – resources managed automatically in Hadoop via YARN
• Consumable – now usable by millions of users with every SQL tool and application on the planet
• Scalable – unlimited expansion to handle extreme #s of users, nodes, data
Most Industrialized SQL in Hadoop
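The "Accessible" bullet mentions grouping sets and windowing functions on top of ANSI SQL-92. For readers unfamiliar with them, the sketch below emulates what those two constructs compute, in plain Python rather than SQL. The sales rows and function names are hypothetical, and the code says nothing about how any particular engine implements these features.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, product, amount).
sales = [
    ("East", "A", 100),
    ("East", "B", 50),
    ("West", "A", 70),
    ("West", "B", 30),
]
COLUMNS = ("region", "product")

def grouping_sets(rows, sets):
    """Emulate SUM(amount) ... GROUP BY GROUPING SETS (...).

    Columns outside the current grouping set are reported as None,
    mirroring the NULLs SQL returns for rolled-up columns.
    """
    totals = defaultdict(int)
    for grouping in sets:
        for region, product, amount in rows:
            record = {"region": region, "product": product}
            key = tuple(record[c] if c in grouping else None for c in COLUMNS)
            totals[key] += amount
    return dict(totals)

def running_total(rows):
    """Emulate SUM(amount) OVER (PARTITION BY region ORDER BY product)."""
    seen = defaultdict(int)
    result = []
    for region, product, amount in sorted(rows):
        seen[region] += amount
        result.append((region, product, seen[region]))
    return result

# Like GROUPING SETS ((region), ()): per-region totals plus a grand total.
by_region_and_total = grouping_sets(sales, [("region",), ()])
windowed = running_total(sales)
```

One pass over the data yields several aggregation levels, which is what makes grouping sets useful to BI tools that show subtotals and totals together.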
Highest Performing SQL in Hadoop
Up to 30X Faster Than Impala
[Chart: “Impala Subset” of TPC-DS at Scale Factor 3000 (3TB), Actian vs Impala. Y-axis: Times Faster Than Impala (0 to 35). X-axis: Q3 Q7 Q19 Q27 Q34 Q42 Q43 Q46 Q52 Q53 Q55 Q59 Q63 Q65 Q68 Q73 Q79 Q89 Q98, plus Average]
Background to “Impala Subset” of TPC-DS benchmark can be found here: http://blog.cloudera.com/blog/2014/01/impala-performance-dbms-class-speed/
Both Executed on the Same Hardware and Software Environment: 5 Node Cluster with 64GB of RAM per node and 12x2TB Hard Disks.
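The chart's y-axis, "Times Faster Than Impala", is a ratio of elapsed times, and the hardware note gives enough detail to size the cluster against the 3TB data set. The sketch below works through that arithmetic. The cluster figures come from the slide; the two query timings are made-up placeholders, not benchmark results, and `speedup` is a hypothetical helper name.

```python
# Cluster description taken from the slide.
NODES = 5
RAM_PER_NODE_GB = 64
DISKS_PER_NODE = 12
DISK_SIZE_TB = 2
DATASET_TB = 3  # TPC-DS scale factor 3000

total_ram_gb = NODES * RAM_PER_NODE_GB
total_raw_disk_tb = NODES * DISKS_PER_NODE * DISK_SIZE_TB

# Assuming 1 TB = 1000 GB: the uncompressed data set is larger than the
# aggregate memory, so it cannot all sit in RAM at once.
dataset_exceeds_ram = DATASET_TB * 1000 > total_ram_gb

def speedup(baseline_seconds, candidate_seconds):
    """Return how many times faster the candidate is than the baseline."""
    if candidate_seconds <= 0:
        raise ValueError("candidate_seconds must be positive")
    return baseline_seconds / candidate_seconds

# Placeholder timings for illustration only; NOT measured results.
example_ratio = speedup(baseline_seconds=60.0, candidate_seconds=2.0)
```

Because the metric is a ratio, it says nothing about absolute response times; a query can be many times faster and still be slow.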
“SQL on Hadoop” Vendor Landscape
[Chart: vendors plotted by maturity against Hadoop integration]
Maturity (SQL support, ACID, reliability, security, connectivity, performance): Low to High
Hadoop Integration: Low to Native
Regions: “wrapped legacy” • “from scratch” • “connections” • Mature & Integrated + End-to-End
Actian Analytics Platform – Hadoop SQL Edition
Removes all barriers for business access to big data analytics
Data: Enterprise Data • Machine Data • Social Data • Data Warehouse • SaaS Data • Amazon Redshift
Components: Connections for Any Data • Libraries of Analytics • Hadoop • Visual Data Science and Analytic Workbench • High Performance Dataflow Engine • High Performance, Industrialized SQL Analytics Database
Capabilities: Expansive Connectivity • Data Blending & Enrichment • Discovery • Data Science • Analytics • Operational BI
Consumers: Business Processes • Users • Machines • Applications
Unleash millions of business-savvy SQL users with no constraints on Hadoop data
Ubiquitous Skills
■ 1 Million+ SQL Users
■ $ Inexpensive
■ Easy to find, in most companies
■ Embedded in the business
Specialty Skills
■ 150K MapReduce Programmers
■ $$$ Expensive
■ 170K Shortage, hard to find
■ Separate from the business
Actian Analytics Platform™: Analyze • Act • Connect
Accelerate time to value and turn Hadoop data into transformational value
Data Scientist
Discover new opportunities, build and test models. Come up with candidate models.
Data Miner
Validate models, apply data mining techniques. Choose and maintain contender models.
Business Analyst
Select model for deployment based on business impact.
Operational User
Use models for operational intelligence and embed analytics in real-time systems.
COMMON DATA & ANALYTICS ACCESS
COLLABORATIVE DATA SCIENCE ENVIRONMENT
Actian transforms Hadoop from a data lake into a high-performance analytics platform.
Actian Analytics Platform – Hadoop SQL Edition Industrialized, High-Performance SQL in Hadoop
• Only end-to-end analytic processing natively in Hadoop
• Highest performing, most industrialized SQL in Hadoop
• Removes all barriers for business access to big data analytics
• Unleashes millions of business-savvy SQL users on Hadoop data
• Speeds time to value for big data analytics projects
• Outperforms Cloudera’s Impala by up to 30x
What Big Data Analytics Pricing Was Meant to Be
All-In-One (1 SKU)
Right-to-Deploy (no limits)
www.actian.com
facebook.com/actiancorp
@actiancorp
Thank You
Vector in Hadoop Technical Overview
[Diagram: query flow between a leader node and worker nodes]
Client Application or BI Tools: SQL query to the leader node
Leader node (namenode):
• INGRES: SQL parser → parsed tree → Optimizer → query plan → Cross compiler → X100 algebra
• X100: Distributed rewriter → annotated query tree → Builder → operator tree → Execution engine
• Buffer manager (data, data request) • I/O • HDFS
Worker node [1..n] (datanodes):
• X100: Rewriter → annotated query tree → Builder → partial operator tree → Execution engine
• Buffer manager (data, data request) • I/O • HDFS
MPI inter-node communication: annotated tree • result • partial result set
Active Passive Fail-over for Leader Node
Actian Director for Management
Actian Director for Management
Vector in Hadoop Architecture
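The diagram shows a leader node that plans the query and worker nodes that each run a partial operator tree and return a partial result set over MPI. A minimal Python sketch of that scatter-gather pattern, applied to a grouped sum, follows. It uses plain function calls in place of MPI and in-memory lists in place of HDFS blocks; `worker_partial_sum`, `leader_merge`, and the partition data are hypothetical and do not reflect Actian's internals.

```python
from collections import Counter

# Hypothetical data partitions, one per worker node: (key, value) pairs.
partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
    [("b", 5)],
]

def worker_partial_sum(partition):
    """Worker step: aggregate only the locally stored rows."""
    partial = Counter()
    for key, value in partition:
        partial[key] += value
    return partial

def leader_merge(partials):
    """Leader step: combine partial result sets into the final result."""
    final = Counter()
    for partial in partials:
        final.update(partial)  # Counter.update adds the values per key
    return dict(final)

# In a real cluster the partial results would travel between nodes;
# here the "network" is just a list comprehension.
partial_results = [worker_partial_sum(p) for p in partitions]
result = leader_merge(partial_results)
```

Aggregating locally before sending keeps inter-node traffic proportional to the number of groups rather than the number of rows.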
The Archive Trifecta: • Inside Analysis www.insideanalysis.com • SlideShare www.slideshare.net/InsideAnalysis • YouTube www.youtube.com/user/BloorGroup
THANK YOU!