SQL on Hadoop: Defining the New Generation of Analytic SQL Databases
SQL and Machine Learning on Hadoop
![Page 1: SQL and Machine Learning on Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022021918/589ab9401a28abff4f8b65d5/html5/thumbnails/1.jpg)
1 1 Pivotal Confidential–Internal Use Only
SQL & Machine Learning on Hadoop
Mukund Babbar Pivotal Feb, 2015
![Page 2: SQL and Machine Learning on Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022021918/589ab9401a28abff4f8b65d5/html5/thumbnails/2.jpg)
Journey to Apache
[Timeline figure spanning 1986 to 2015, with the following milestones]
Michael Stonebraker develops Postgres at UCB
Postgres adds support for SQL
Open Source PostgreSQL
PostgreSQL 7.0 released
PostgreSQL 8.0 released
Greenplum forks PostgreSQL
Hadoop 1.0 Released
HAWQ & MADlib go Apache
HAWQ launched
Hadoop 2.0 Released
MADlib launched
Greenplum open sourced
![Page 3: SQL and Machine Learning on Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022021918/589ab9401a28abff4f8b65d5/html5/thumbnails/3.jpg)
Apache HAWQ Overview
![Page 4: SQL and Machine Learning on Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022021918/589ab9401a28abff4f8b65d5/html5/thumbnails/4.jpg)
HAWQ – SQL on Hadoop
![Page 5: SQL and Machine Learning on Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022021918/589ab9401a28abff4f8b65d5/html5/thumbnails/5.jpg)
Shared-Nothing Database Architecture
[Architecture diagram: SQL client, Master Host and Standby Master, Interconnect, and Segment Hosts node1, node2, node3 … nodeN, each drawn with four Segment Instances]
• Master Host and Standby Master Host: the master coordinates work with the Segment Hosts
• Interconnect: high-speed interconnect for continuous pipelining of data processing
• Segment Hosts: each has one or more Segment Instances, which process queries in parallel
• Segment Hosts have their own CPU, disk and memory (shared nothing)
![Page 6: SQL and Machine Learning on Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022021918/589ab9401a28abff4f8b65d5/html5/thumbnails/6.jpg)
5 Key Features of HAWQ
![Page 7: SQL and Machine Learning on Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022021918/589ab9401a28abff4f8b65d5/html5/thumbnails/7.jpg)
Key Features of HAWQ
1. Advanced Analytics Performance
• Up to 30x SQL-on-Hadoop performance advantage
• Faster time to insight
• Massive MPP scalability to petabytes
Benefits: Near real-time latency, complex queries and advanced analytics at scale
![Page 8: SQL and Machine Learning on Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022021918/589ab9401a28abff4f8b65d5/html5/thumbnails/8.jpg)
HAWQ Performance vs Impala
[Bar chart by TPC-DS query, split into "HAWQ Faster" and "Impala Faster"; axis labels shown: 2, 28, 46, 66, 73, 76, 79, 80, 88, 90, 96]
• HAWQ faster on 46 of 62 TPC-DS queries completed*
• 4.55x faster on average
• 12 hrs faster in total
* Impala supported 74 of 99 queries, 12 crashed mid-run
![Page 9: SQL and Machine Learning on Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022021918/589ab9401a28abff4f8b65d5/html5/thumbnails/9.jpg)
HAWQ vs Apache Hive w/Tez
[Bar chart by TPC-DS query, split into "HAWQ Faster" and "Hive Faster"; axis labels shown: 3, 7, 15, 25, 27, 34, 46, 48, 76, 79, 89, 90, 96]
• HAWQ faster on 45 of 60 TPC-DS queries completed*
• 3.44x faster on average
• 9 hrs faster in total
* Hive supported 65 of 99 queries, 5 crashed mid-run
![Page 10: SQL and Machine Learning on Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022021918/589ab9401a28abff4f8b65d5/html5/thumbnails/10.jpg)
Key Features of HAWQ
2. 100% ANSI SQL Compliant
• ANSI SQL-92, -99, -2003
• All 99 TPC-DS queries tested, no modifications
• Plus, OLAP extensions
• Complete ACID integrity and reliability
Benefits: 100% SQL compliant; no risk to SQL applications; all native on HDP via HAWQ
![Page 11: SQL and Machine Learning on Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022021918/589ab9401a28abff4f8b65d5/html5/thumbnails/11.jpg)
Key Features of HAWQ
3. Integrated Machine Learning
• Advanced machine learning for big data
• Local, in-database operation
• Exceptional MPP/parallel performance
• Open source, Postgres-based
Benefits: Advanced, highly scalable machine learning, directly on data in Hadoop
![Page 12: SQL and Machine Learning on Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022021918/589ab9401a28abff4f8b65d5/html5/thumbnails/12.jpg)
Key Features of HAWQ
4. Flexible Deployment
• HDP, PHD, other ODPi-derived distros
• Easily managed via Ambari
• On premises, in cloud, or PaaS
• HBase, Avro, Parquet and more
• Connectors to make HAWQ data available to other SQL query tools
Benefits: Flexibility, accessibility, portability
![Page 13: SQL and Machine Learning on Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022021918/589ab9401a28abff4f8b65d5/html5/thumbnails/13.jpg)
Key Features of HAWQ
5. Query Optimization Options
• Cost-based query optimization
• Robust query plan optimization
• Complex big data management
Benefits: Optimize performance and costs; maximize Hadoop cluster resources; offload EDW w/o compromise
![Page 14: SQL and Machine Learning on Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022021918/589ab9401a28abff4f8b65d5/html5/thumbnails/14.jpg)
Advanced MPP: Polymorphic Storage™
• Columnar storage is well suited to scanning a large percentage of the data
• Row storage excels at small lookups
• Most systems need to do both
• Row and column orientation can be mixed within a table or database
• Both types can be dramatically more efficient with compression
• Compression is definable column by column:
– Blockwise: Gzip1-9 & QuickLZ
– Streamwise: Run Length Encoding (RLE) (levels 1-4)
• Flexible indexing and partitioning enable more granular control and true ILM
[Figure: table 'SALES' partitioned by month (Mar, Apr, May, Jun, Jul, Aug, Sept, Oct, Nov); row-oriented for small scans, column-oriented for full scans]
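As a rough illustration of why the Run Length Encoding option listed above suits column-oriented data, here is a minimal Python sketch. It is not HAWQ's codec: the function names `rle_encode` and `rle_decode` and the sample `month_column` are hypothetical, chosen only to show how long runs of repeated values in a single column collapse into a few (value, count) pairs.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]


# Hypothetical 'month' column of a SALES table, stored column-wise,
# so identical values sit next to each other.
month_column = ["Mar"] * 4 + ["Apr"] * 3 + ["May"] * 5

encoded = rle_encode(month_column)

# The encoding is lossless: decoding restores the original column.
assert rle_decode(encoded) == month_column

print(encoded)  # [('Mar', 4), ('Apr', 3), ('May', 5)]
```

Twelve stored values shrink to three pairs here. The same column interleaved with other fields in row-oriented storage would not form such runs, which is one reason the compression choice is made per column.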
![Page 15: SQL and Machine Learning on Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022021918/589ab9401a28abff4f8b65d5/html5/thumbnails/15.jpg)
PL/X : X in {pgsql, R, Python, Java, Perl, C, etc.}
• Allows users to write HAWQ functions in R, Python, Java, Perl, pgsql or C
• The interpreter/VM of the language 'X' is installed on each node of the HAWQ cluster
• Data parallelism:
– PL/X piggybacks on HAWQ's MPP architecture
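The data-parallel idea behind PL/X can be sketched in plain Python, under loud assumptions: this does not use HAWQ or any PL/X API, the "segments" are just in-memory lists, and `user_function` and `run_on_segment` are hypothetical names. The sketch only shows the pattern of running one user function against each segment's local rows and then gathering the results.

```python
from concurrent.futures import ThreadPoolExecutor


def user_function(row):
    """Stand-in for a user-defined function applied to one row."""
    return row * row


def run_on_segment(rows):
    """Each segment applies the same function to its own local rows."""
    return [user_function(r) for r in rows]


# Hypothetical layout: one list of rows per segment.
segments = [[1, 2, 3], [4, 5], [6]]

# One worker per segment; map() returns results in segment order.
with ThreadPoolExecutor(max_workers=len(segments)) as pool:
    per_segment = list(pool.map(run_on_segment, segments))

# Gather step: concatenate the per-segment outputs.
results = [value for part in per_segment for value in part]

print(results)  # [1, 4, 9, 16, 25, 36]
```

The function body never needs to know how rows are distributed, which is the sense in which a user-defined function can piggyback on an existing parallel architecture.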
![Page 16: SQL and Machine Learning on Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022021918/589ab9401a28abff4f8b65d5/html5/thumbnails/16.jpg)
Apache HAWQ
● Discover New Relationships
● Enable Data Science
● Analyze External Sources
● Query All Data Types!
Multi-level Fault Tolerance
Granular Authorization
Resource Mgmt (+ YARN)
High multi-tenancy
ANSI SQL Standard
OLAP Extensions
JDBC ODBC Connectivity
Parallel Processing
Online Expansion
HDFS
Petabyte Scale
Cost Based Optimizer
Dynamic Pipelining
ACID + Transactional
Multi-Language UDF Support
Built-in Data Science Library
Extensible (PXF)
Query External Sources
Hardened, 10+ Years Investment, Production Proven
Accessibility + Usability
HDFS Native File Formats
● Manage Multiple Workloads
● Petabyte Scale Analytics
● Security controls
● Leverage Existing SQL Skills & BI Tools
● Easily Integrate with Other Tools
● Sub-second Performance
Compression + Partitioning
core
compliance
● Hadoop-Native
● Supports Pivotal HD and Hortonworks Data Platform
● Ambari-Integrated
![Page 17: SQL and Machine Learning on Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022021918/589ab9401a28abff4f8b65d5/html5/thumbnails/17.jpg)
Apache HAWQ 2.0 (new features)
Areas of Enhancement / New Features
Elastic & Scalable Architecture
Hadoop-Native Integrations
Simplified External Data Access/Queries
Performance & Optimizations
On-Demand Virtual Segments
Flexible Query Dispatch on subset nodes
3 Tier RM: YARN level > User > Query-Operator
Dynamic Cluster Expansion (no redistribute)
New Fault Tolerance Service
HCatalog integration - Read Access
HDFS Catalog Cache
Per Table Directory storage (user friendly)
Single physical segment per node
Easier Administration/Usage
Cloud-Ready
Simpler Management Commands
![Page 18: SQL and Machine Learning on Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022021918/589ab9401a28abff4f8b65d5/html5/thumbnails/18.jpg)
Apache HAWQ 2.0 Architecture
[Architecture diagram with the following labeled components]
• Client
• HAWQ Masters: Parser/Analyzer, Optimizer, Dispatcher, Resource Manager, Fault Tolerance Service, Catalog Service, HDFS Catalog Cache, Resource Broker (libYARN)
• YARN
• NameNode (two shown)
• HAWQ Segments: three nodes shown, each with a Physical Segment, two Virtual Segments, a DataNode and a NodeManager
• Interconnect
• External Data Stores via Xtension Framework (Hive/HBase/etc)
![Page 19: SQL and Machine Learning on Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022021918/589ab9401a28abff4f8b65d5/html5/thumbnails/19.jpg)
Apache MADlib Overview
![Page 20: SQL and Machine Learning on Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022021918/589ab9401a28abff4f8b65d5/html5/thumbnails/20.jpg)
Scalable, In-Database Machine Learning
• Open Source: https://github.com/apache/incubator-madlib
• Supports Greenplum DB, Apache HAWQ/HDB and PostgreSQL
• Downloads and Docs: http://madlib.incubator.apache.org/
Apache (incubating)
![Page 21: SQL and Machine Learning on Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022021918/589ab9401a28abff4f8b65d5/html5/thumbnails/21.jpg)
Functions
Predictive Modeling Library
Linear Systems
• Sparse and Dense Solvers
• Linear Algebra
Matrix Factorization
• Singular Value Decomposition (SVD)
• Low Rank
Generalized Linear Models
• Linear Regression
• Logistic Regression
• Multinomial Logistic Regression
• Cox Proportional Hazards Regression
• Elastic Net Regularization
• Robust Variance (Huber-White), Clustered Variance, Marginal Effects
Other Machine Learning Algorithms
• Principal Component Analysis (PCA)
• Association Rules (Apriori)
• Topic Modeling (Parallel LDA)
• Decision Trees
• Random Forest
• Support Vector Machines (SVM)
• Conditional Random Field (CRF)
• Clustering (K-means)
• Cross Validation
• Naïve Bayes
Descriptive Statistics
Sketch-Based Estimators
• CountMin (Cormode-Muthukrishnan)
• FM (Flajolet-Martin)
• MFV (Most Frequent Values)
Correlation
Summary
Support Modules
• Array Operations
• Sparse Vectors
• Random Sampling
• Probability Functions
• Data Preparation
• PMML Export
• Conjugate Gradient
Inferential Statistics
Hypothesis Tests
Time Series
• ARIMA
Oct 2014
![Page 22: SQL and Machine Learning on Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022021918/589ab9401a28abff4f8b65d5/html5/thumbnails/22.jpg)
MADlib Advantages
• Better parallelism
– Algorithms designed to leverage MPP and Hadoop architecture
• Better scalability
– Algorithms scale as your data set scales
• Better predictive accuracy
– Can use all data, not a sample
• ASF open source (incubating)
– Available for customization and optimization
![Page 23: SQL and Machine Learning on Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022021918/589ab9401a28abff4f8b65d5/html5/thumbnails/23.jpg)
Calling MADlib Functions: Fast Training & Scoring
• MADlib allows users to easily create models without moving data out of the system
– Model generation
– Model validation
– Scoring (evaluation) of new data
• All the data can be used in one model
• Built-in functionality to create multiple smaller models (e.g. classification grouped by feature)
• Open source lets you tweak and extend methods, or build your own
![Page 24: SQL and Machine Learning on Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022021918/589ab9401a28abff4f8b65d5/html5/thumbnails/24.jpg)
Challenges in computing OLS solution
$$X = \begin{pmatrix} a & b \\ c & d \\ e & f \\ g & h \end{pmatrix}$$
The rows of $X$ are distributed across Segment 1 and Segment 2.
![Page 25: SQL and Machine Learning on Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022021918/589ab9401a28abff4f8b65d5/html5/thumbnails/25.jpg)
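Challenges in computing OLS solution
$$X = \begin{pmatrix} a & b \\ c & d \\ e & f \\ g & h \end{pmatrix}, \qquad X^T = \begin{pmatrix} a & c & e & g \\ b & d & f & h \end{pmatrix}$$
The rows of $X$, and therefore the columns of $X^T$, are distributed across Segment 1 and Segment 2.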
![Page 26: SQL and Machine Learning on Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022021918/589ab9401a28abff4f8b65d5/html5/thumbnails/26.jpg)
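Challenges in computing OLS solution
$$X^T X = \begin{pmatrix} a & c & e & g \\ b & d & f & h \end{pmatrix} \begin{pmatrix} a & b \\ c & d \\ e & f \\ g & h \end{pmatrix} = \begin{pmatrix} a^2+c^2+e^2+g^2 & \cdot \\ \cdot & \cdot \end{pmatrix}$$
Data across nodes are multiplied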
![Page 27: SQL and Machine Learning on Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022021918/589ab9401a28abff4f8b65d5/html5/thumbnails/27.jpg)
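Challenges in computing OLS solution
$$X^T X = \begin{pmatrix} a & c & e & g \\ b & d & f & h \end{pmatrix} \begin{pmatrix} a & b \\ c & d \\ e & f \\ g & h \end{pmatrix} = \begin{pmatrix} a^2+c^2+e^2+g^2 & ab+cd+ef+gh \\ ab+cd+ef+gh & b^2+d^2+f^2+h^2 \end{pmatrix}$$
Looks like the result can be decomposed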
![Page 28: SQL and Machine Learning on Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022021918/589ab9401a28abff4f8b65d5/html5/thumbnails/28.jpg)
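Challenges in computing OLS solution
$$X^T X = \begin{pmatrix} a^2+c^2+e^2+g^2 & ab+cd+ef+gh \\ ab+cd+ef+gh & b^2+d^2+f^2+h^2 \end{pmatrix}$$
Data across nodes are multiplied!
$$X^T X = \begin{pmatrix} a \\ b \end{pmatrix} \begin{pmatrix} a & b \end{pmatrix} + \begin{pmatrix} c \\ d \end{pmatrix} \begin{pmatrix} c & d \end{pmatrix} + \begin{pmatrix} e \\ f \end{pmatrix} \begin{pmatrix} e & f \end{pmatrix} + \begin{pmatrix} g \\ h \end{pmatrix} \begin{pmatrix} g & h \end{pmatrix}$$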
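The decomposition above can be checked numerically with a small Python sketch. The assumptions are stated plainly: the rows (a, b) and (c, d) are taken to live on Segment 1 and (e, f) and (g, h) on Segment 2, the numeric values are made up, and the helper names `gram` and `add` are hypothetical. This is not MADlib code; it only shows that each segment can compute its partial sum of row outer products locally, and that adding the partials gives the same matrix as working on all rows at once.

```python
def gram(rows):
    """Return X^T X for a list of equal-length rows, as nested lists."""
    width = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) for j in range(width)]
            for i in range(width)]


def add(m1, m2):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(m1, m2)]


# Assumed split: two rows of X per segment, with made-up numeric values.
segment_1 = [[1.0, 2.0], [3.0, 4.0]]   # rows (a, b), (c, d)
segment_2 = [[5.0, 6.0], [7.0, 8.0]]   # rows (e, f), (g, h)

# Each segment computes its partial result from local rows only.
partial_1 = gram(segment_1)
partial_2 = gram(segment_2)

# Only the small 2x2 partials need to be combined.
combined = add(partial_1, partial_2)

# Same answer as computing X^T X over all four rows in one place.
assert combined == gram(segment_1 + segment_2)

print(combined)  # [[84.0, 100.0], [100.0, 120.0]]
```

The partial matrices have one row and column per feature, regardless of how many data rows a segment holds, so the amount of data exchanged between segments does not grow with the table size.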
![Page 29: SQL and Machine Learning on Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022021918/589ab9401a28abff4f8b65d5/html5/thumbnails/29.jpg)
Linear Regression on 10 Million Rows in Seconds
Hellerstein, Joseph M., et al. "The MADlib analytics library: or MAD skills, the SQL." Proceedings of the VLDB Endowment 5.12 (2012): 1700-1711.
![Page 30: SQL and Machine Learning on Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022021918/589ab9401a28abff4f8b65d5/html5/thumbnails/30.jpg)
Contributors Welcome!
• Web sites
– http://hawq.incubator.apache.org/
– http://madlib.incubator.apache.org/
– https://cran.r-project.org/web/packages/PivotalR/index.html
• GitHub
– https://github.com/apache/incubator-hawq
– https://github.com/apache/incubator-madlib
– https://github.com/pivotalsoftware/PivotalR
![Page 31: SQL and Machine Learning on Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022021918/589ab9401a28abff4f8b65d5/html5/thumbnails/31.jpg)
?