Posted on 20-May-2020
A NEW PLATFORM FOR A NEW ERA
Capgemini / Pivotal Alliance Confidential – Do Not Distribute. July 2014 New Hire Immersion Training. © Copyright 2014 Pivotal. All rights reserved.
Pivotal HD and HAWQ Immersion v5 John Funk
Course Outline
• PHD and HAWQ Introduction
• HAWQ Architecture
• HDFS Review
• HAWQ Distribution, Partitioning and Storage Options
• Query Execution in HAWQ
• Loading and Unloading Data in HAWQ
• PXF – Pivotal Xtension Framework Best Practices
• HAWQ, HBase and Hive Comparative Usage
• Securing HAWQ
Pivotal HD and HAWQ Introduction and Positioning
Pivotal HD and HAWQ is the…
enterprise platform that provides the fewest barriers and lowest risk, and the most cost-effective and fastest way to enter into big data analytics on Hadoop.
HAWQ Evolved From…
• Greenplum database re-platformed on Hadoop/HDFS
• Over a decade of proven Greenplum database performance
• HAWQ provides all major features found in Greenplum database:
  – SQL completeness: SQL:2003 extensions
  – Robust query optimizer
  – Row- or column-oriented table storage
  – Compression
  – Distributions
  – Multi-level partitioning
  – Parallel loading and unloading
  – High-speed data redistribution
  – Views
  – External tables
  – Resource management
  – Security
  – Authentication
  – Management and monitoring
  – ODBC/JDBC compliant
HAWQ Benefits…
• Out-of-the-box SQL for Hadoop
  – SQL adoption versus learning MapReduce programming
  – GPXF external tables providing SQL access to Hadoop (HDFS, HBase, Hive or any data types)
  – Broad data access, integration and portability
• Performance and scalability
  – Parallel everything
  – Dynamic Pipelining
  – High-speed interconnect
  – Optimized HDFS access with libhdfs3
  – Co-location
  – Partition elimination
  – Higher cluster utilization
  – Concurrency control
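The external-table access mentioned above can be sketched as follows. This is a minimal sketch, not from the deck: the host, port, path, table and column names are hypothetical, and the exact LOCATION syntax varies by PXF/GPXF version.

```sql
-- Hypothetical example: expose delimited text files in HDFS to HAWQ SQL.
-- Host, port, path and columns are placeholders.
CREATE EXTERNAL TABLE ext_page_views (
    user_id    int,
    url        text,
    view_time  timestamp
)
LOCATION ('pxf://pxf-host:51200/data/page_views/*.csv?Profile=HdfsTextSimple')
FORMAT 'TEXT' (DELIMITER ',');

-- Query it like any other table:
SELECT url, count(*) FROM ext_page_views GROUP BY url;
```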
Pivotal HD Architecture
[Diagram: the Pivotal HD Enterprise stack. Apache components: HDFS, HBase, Pig, Hive, Mahout, MapReduce, Sqoop, Flume, Oozie, Vaidya, with YARN and Zookeeper for resource management and workflow. Pivotal components: Command Center (configure, deploy, monitor, manage), Data Loader, Spring, Unified Storage Service, Hadoop Virtualization Extension; HAWQ – Advanced Database Services (ANSI SQL + analytics, query optimizer, Dynamic Pipelining, catalog services, Xtension Framework, MADlib algorithms); and GemFire XD – Real-Time Database Services (ANSI SQL + in-memory, distributed in-memory store, query transactions, ingestion processing, Hadoop driver – parallel with compaction).]
Flexible Deployment Model
[Diagram: the same bits deploy to public cloud, on premise, or private cloud – portable, elastic, promotable, hardware-abstracted and manageable.]
Pivotal HD
• World's first true SQL processing for enterprise-ready Hadoop
• 100% Apache Hadoop-based platform
• Virtualization and cloud ready with VMware and Isilon
• Scale tested in the 1000-node Pivotal Analytics Workbench
• Available as a software-only or appliance-based solution
• Backed by EMC's global, 24x7 support infrastructure
Introduction to Pivotal HD
Pivotal HD Architecture
[Diagram: the Pivotal HD Enterprise stack, identical to the earlier "Pivotal HD Architecture" slide.]
Pig
• Pig provides a high-level, data-flow-oriented abstraction for MapReduce
  – Much more concise than MapReduce code
  – Though not very intuitive
• Compiles to MapReduce programs, which it runs for you
• Output can be dumped to the terminal, or written as files in HDFS for access by HAWQ or other tools
• Useful operators, extensible through "Piggybank"
• Developed at Yahoo!
Hive
• Hive provides a SQL-like interface to data in HDFS
• To users who know SQL, Hive provides a much more intuitive interface than MapReduce or Pig
• Like Pig, Hive operates by translating the user's query into one or more MapReduce jobs, running these on potentially very large data sets, and finally printing the result
• Drawbacks – limited SQL, job latency and frequent I/O (slow)
• Developed at Facebook
HBase
• HBase provides random, real-time read/write access to data stored within HDFS
  – Sparse, wide tables
• Flexible schema
• Key/value store: given ('table', 'rowkey'), retrieve the row
  – Does not perform well if rows are not retrieved by key
• An update to a row adds new data with the current timestamp
  – Previous state can be recovered using a previous timestamp
• Using PXF external tables, HAWQ is able to incorporate HBase data into queries
  – Pushing predicates into HBase when possible
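A PXF external table over HBase might look like the following sketch. The table name, column family, and PXF host/port are hypothetical; HBase columns are mapped as "family:qualifier" and the row key as recordkey.

```sql
-- Hypothetical: map an HBase table 'orders' into HAWQ via PXF.
CREATE EXTERNAL TABLE ext_orders_hbase (
    recordkey    text,     -- the HBase row key
    "cf:amount"  float8,   -- column family 'cf', qualifier 'amount'
    "cf:status"  text
)
LOCATION ('pxf://pxf-host:51200/orders?Profile=HBase')
FORMAT 'CUSTOM' (formatter='pxfwritable_import');

-- Predicates like this can be pushed down into HBase when possible:
SELECT recordkey, "cf:amount"
FROM   ext_orders_hbase
WHERE  recordkey = 'order-42';
```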
Pivotal HAWQ
HAWQ: The Crown Jewels
• SQL compliant
• World-class query optimizer
• Interactive query
• Horizontal scalability
• Robust data management
• Common Hadoop formats
• Deep analytics
HAWQ: High-Performance Query Processing
• Interactive and true ANSI SQL support
• Multi-petabyte horizontal scalability
• Cost-based parallel query optimizer
• Programmable analytics
HAWQ: Enterprise-Class Database Services & Management
• Scatter-gather data loading
• Row and column storage
• Workload management
• Multi-level partitioning
• 3rd-party tool & open client interfaces
HAWQ: Pre-Integrated Deep Analytics
• Performance via fully parallelized implementation
• Consistent, user-friendly SQL interfaces
• Ease of data preparation
• Pre-integrated MADlib support
  – Linear Regression
  – Logistic Regression
  – Multinomial Logistic Regression
  – K-Means
  – Association Rules
  – PLDA (useful for topic modeling)
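One of the pre-integrated MADlib algorithms, linear regression, is invoked directly from SQL. A minimal sketch: the houses table and its columns are hypothetical, and the output-table column names follow MADlib's documented linregr_train interface.

```sql
-- Hypothetical training table: houses(price float8, size float8, bedrooms int)
-- Train: writes coefficients and statistics to the output table houses_linregr.
SELECT madlib.linregr_train(
    'houses',                     -- source table
    'houses_linregr',             -- output model table
    'price',                      -- dependent variable
    'ARRAY[1, size, bedrooms]'    -- independent variables (1 = intercept)
);

-- Inspect the fitted coefficients and goodness of fit:
SELECT coef, r2 FROM houses_linregr;
```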
PXF: Pivotal Xtension Framework
A fast, extensible framework connecting HAWQ to a data store of choice that exposes a parallel API
PXF
• An advanced version of GPDB external tables
• Enables combining HAWQ data and Hadoop data in a single query
• Supports connectors for HDFS (read and write), HBase and Hive
• Provides an extensible framework API to enable custom connector development for other data sources
  – GemFire XD, JSON format, Cassandra, Accumulo
[Diagram: the PXF Xtension Framework connecting HAWQ to HDFS, HBase and Hive]
PXF Features
• What is it?
  – A HAWQ feature to access data stored in other popular Hadoop modules (HDFS, HBase, Hive) using the full SQL interface of HAWQ
• Why is it important?
  – A customer may prefer to primarily manage certain data in HBase, but want to join it to other data sets stored in HAWQ for analytics purposes. Or a customer may need SQL access to data in HBase or HDFS.
• When/who to use with?
  – An important feature to discuss with data and application architects who are concerned about unifying data access patterns across the variety of Hadoop components
  – Also useful to address any concerns about HAWQ using a proprietary data format not currently readable by other Hadoop processes
[Diagram: PXF – transparent, optimized SQL access to non-HAWQ formats in HDFS: Text, HBase, Hive, Avro]
PXF Feature Summary
★ HBase (with filter pushdown)
★ Hive (with partition exclusion; various storage file types)
★ HDFS files: read (delimited text, CSV, Sequence, Avro)
★ HDFS files: write (delimited text, CSV, Sequence; various compression codecs and options)
★ GemFire XD, JSON format, Cassandra, Accumulo (currently beta)
★ Stats collection
★ Automatic data locality optimizations
★ Extensibility!
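The writable-HDFS capability in the summary above can be sketched like this; the names and the LOCATION URI are hypothetical placeholders.

```sql
-- Hypothetical: export HAWQ query results to HDFS as delimited text via PXF.
CREATE WRITABLE EXTERNAL TABLE ext_sales_export (
    sale_id  int,
    amount   float8
)
LOCATION ('pxf://pxf-host:51200/exports/sales?Profile=HdfsTextSimple')
FORMAT 'TEXT' (DELIMITER ',');

-- Anything inserted here lands in HDFS, readable by Pig, Hive, MapReduce, etc.
INSERT INTO ext_sales_export
SELECT sale_id, amount FROM sales WHERE amount > 1000;
```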
Pivotal HD and HAWQ Rapid Innovation A look at features released in 2014
What's New in PHD 1.1
• GemFire XD Beta
• Orca
• PXF: Writable HDFS Table Support
• HAWQ Format Reader
• UDF Support
• Oozie
• Vaidya
• Kerberos Support (HDFS, HAWQ, USS)
• pgcrypto for HAWQ
• Unified Storage Service: CDH4 as a data source
What's New in PHD 1.1.1
• Automatic HD configuration via ICM
  – Manual failover of HAWQ/PXF
• Manual NameNode HA
• Kerberos authentication support (includes HAWQ, PXF, HBase, Hive)
• Parameterized Hadoop environment variables
• Backup and restore scripts for the Admin node
• Rebalance HDFS using the web API
• Piggybank support in Pig 0.12
• HAWQ gp_toolkit support
What's New in PHD 2.0
• GemFire XD GA
• Pivotal HD stack
  – Hadoop 2.2 rebase; built with JDK 1.7
  – Hive 0.12, HBase 0.96
  – GraphLab 2.2 beta (via Hamster/OpenMPI)
• HAWQ
  – Automated NameNode and HAWQ Master failover
  – MADlib 1.5 as a separately deployable package; PL/Java (PL/R and PL/Python from 1.1.1)
  – Add segments (HAWQ expand)
  – Pluggable storage phase 1 – basic Parquet support
  – Error tables
• PCC and ICM
  – New 'Read Only' user role
  – Log management
  – DCA/Isilon enhancements
HAWQ 1.2 Deep Scalable Analytics
• Linear Regression
• Logistic Regression
• Multinomial Logistic Regression
• K-Means
• Association Rules
• Latent Dirichlet Allocation
• Naïve Bayes
• Elastic Net Regression
• Decision Trees / Random Forest
• Support Vector Machines
• Cox Proportional Hazards Regression
• Descriptive Statistics
• ARIMA
PivotalR vs. PL/R

PivotalR:
• Interface is an R client
• Execution is in the database
• Parallelism handled by PivotalR
• Supports a portion of R

PL/R:
• Interface is a SQL client
• Execution is in R
• Parallelism via SQL function invocation
• Supports all of R
More to Come…!
• PostGIS
• Enhanced optimizer
• Query 3rd-party remote clusters
• …and much more
Greenplum Database and HAWQ
HAWQ Evolved From…
• Greenplum database re-platformed on Hadoop/HDFS
• HAWQ provides all major features found in Greenplum database:
  – SQL completeness: SQL:2003 extensions
  – JDBC compliant
  – Robust query optimizer
  – Row- or column-oriented table storage
  – Parallel loading and unloading
  – Distributions
  – Multi-level partitioning
  – High-speed data redistribution
  – Views
  – External tables
  – Compression
  – Resource management
  – Security
  – Authentication
  – Management and monitoring
HAWQ
• GPDB on HDFS
• Not shared-nothing: built on a distributed file system (HDFS)
  – Nodes can access shards of data on other nodes
• Built for large I/O, append-only, write-once, read-many workloads
• Segments are stateless
  – HA is one of the main drivers towards HDFS
[Diagram: HDFS NameNode and DataNodes]
HAWQ Features
• HAWQ provides all major features found in Greenplum database that can be supported in Hadoop/HDFS, including:
  – Row- or column-oriented table storage
  – Distributions
  – Partitioning
  – Views
  – External tables
• Using some features without understanding the implications in HDFS may result in problems
  – We will discuss this in the modules on each specific topic
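The storage-related features listed above come together in table DDL. A sketch with hypothetical table and column names; the exact WITH options available depend on the HAWQ release.

```sql
-- Hypothetical: a column-oriented, compressed, hash-distributed,
-- range-partitioned append-only table.
CREATE TABLE sales (
    sale_id     int,
    customer_id int,
    amount      float8,
    sale_date   date
)
WITH (appendonly=true, orientation=column, compresstype=quicklz)
DISTRIBUTED BY (customer_id)
PARTITION BY RANGE (sale_date)
(
    START (date '2013-01-01') INCLUSIVE
    END   (date '2014-01-01') EXCLUSIVE
    EVERY (INTERVAL '1 month')
);
```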
Architectural Differences from GPDB
• Stateless segment hosts
  – Segments do not know what is visible or aborted in their physical data
  – Segments do not know what columns are in a table
• HA model deviates from a shared-nothing environment
  – If a segment is down, simply read from the replica in HDFS
  – No lengthy failover process
• HDFS design doesn't lend itself to local transaction management
  – Frequent, small bursts of I/O on HDFS perform poorly
Architectural Implications of Using HDFS
• To re-platform GPDB on HDFS, segment workers had to be simplified (or made dumber)
  – GPDB segment workers had their own copies of metadata, transaction management and local storage
• Heap storage in GPDB requires the database to make modifications to tuples on disk
  – HDFS is append-only, therefore heap storage cannot work on DataNodes
  – Catalog tables require 100% heap storage, so segment servers cannot have a local copy of the catalog
GPDB and HAWQ Differences at a Glance
Considering the architectural differences and implications of HDFS…
• No UPDATE and DELETE
  – TRUNCATE is supported
• No catalog on segment servers
• No local transaction management at the segment level
• No indexes
• Local storage exists on segments but is used for temporary purposes
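In practice, the missing UPDATE/DELETE shows up like this. A sketch against a hypothetical sales table; the CTAS rewrite at the end is a common workaround, not a feature named in the deck.

```sql
-- Not supported in HAWQ (no local transaction management on segments):
--   UPDATE sales SET amount = 0 WHERE sale_id = 42;
--   DELETE FROM sales WHERE sale_date < '2013-01-01';

-- Supported: drop all rows at once.
TRUNCATE TABLE sales;

-- Common workaround: rebuild the table, filtering out unwanted rows.
CREATE TABLE sales_new AS
SELECT * FROM sales WHERE sale_date >= '2013-01-01';
```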
HAWQ or Greenplum Database?

Feature                                    GPDB   HAWQ
Real-time random reads/writes               ✓
Large I/O; write once, read many                    ✓
Petabytes of data                                   ✓
Hadoop/HDFS platform                                ✓
Updates                                     ✓
Deletes                                     ✓
Indexes                                     ✓
Row- or column-oriented table storage       ✓      ✓
User-defined data distributions             ✓      ✓
User-defined partitioning                   ✓      ✓
Resource management                         ✓      ✓
User-defined functions (UDFs)               ✓      ✓
External tables                             ✓      ✓
GPText                                      ✓
MADlib algorithms                           ✓      ✓
Introduction to HAWQ Architecture
Basic HAWQ Architecture
[Diagram: a HAWQ Master (Parser, Query Optimizer, Dispatch, Local TM, Query Executor, PXF, local storage) and a HAWQ Standby Master connect over the interconnect to the HDFS NameNode and to multiple Segment Hosts. Each Segment Host runs a Query Executor, PXF, one or more Segments with local temp storage, and an HDFS DataNode.]
In production there will be other nodes – for example, the Pivotal CC/ICM admin node, the YARN Resource Manager node, the Secondary NameNode, etc.
HAWQ Master
• Located on a separate node from the NameNode in production
  – For a small POC cluster the HAWQ Master may run on the NameNode
• Does not contain any user data
• Contains the Global System Catalog
  – System tables that contain HAWQ metadata
• Authenticates client connections, processes SQL, distributes work between segments, coordinates results returned by segments, and presents the final results to the client
[Diagram: HAWQ Master with Parser, Query Optimizer, Dispatch, Local TM, Query Executor and Catalog on local storage]
HAWQ Metadata
• Metadata is stored only in the HAWQ Master, on the local file system
  – Catalog information makes use of heap storage
• No catalog/metadata on segment nodes (DataNodes)
  – Segment nodes are stateless
  – No heap storage
[Diagram: the HAWQ Master and its catalog, as on the previous slide]
HAWQ Segments
• A HAWQ segment within a Segment Host is an HDFS client that runs on a DataNode
• Multiple segments per Segment Host/DataNode
• A segment is the basic unit of parallelism
• Multiple segments work together to form a single parallel query processing system
• Operations (scans, joins, aggregations, sorts, etc.) execute in parallel across all segments simultaneously
[Diagram: Segment Host with Query Executor, PXF, Segments, local temp storage and an HDFS DataNode]
Segments Access Data Stored in HDFS
• Segments are stateless
  – They do not store database and table metadata
  – The HAWQ Master dispatches the query plan along with related metadata obtained from the NameNode
• Segments communicate with the NameNode to obtain the lists of blocks where data is located
• Segments access data stored in HDFS
[Diagram: Segment Host with Query Executor, PXF, Segments and local temp storage]
HAWQ Parser
• Enforces syntax and semantics
• Converts a SQL query into a parse tree data structure describing the details of the query
[Diagram: clients submit SQL over JDBC to the HAWQ Master; the Parser feeds the Query Optimizer and Dispatch, which communicate over the interconnect with the NameNode and the Segment Hosts.]
HAWQ Parallel Query Optimizer
[Diagram: the Query Optimizer on the HAWQ Master produces a parallel plan whose operators include Gather Motion, Sort, HashAggregate, HashJoin, Redistribute Motion and Broadcast Motion over Seq Scans on the lineitem, orders, customer and nation tables, dispatched to the Segment Hosts.]
HAWQ Dispatch and Query Executor
1. Dispatch communicates the query plan to the segments
2. The Query Executor executes the physical steps in the plan
[Diagram: every segment runs the same plan slice in parallel – Scan Bars b, Scan Sells s, Filter b.city = 'San Francisco', HashJoin b.name = s.bar, Project s.beer, s.price, Motion Redist(b.name), Motion Gather.]
HAWQ Transactions
• DataNodes in HDFS do not know what is visible
  – They have no idea what data they hold; visibility is defined by the NameNode
• Therefore, segment nodes do not know what is visible
  – Visibility is defined by the HAWQ Master
• No distributed transaction management
  – No UPDATE or DELETE
• TRUNCATE is implemented to support rollback of failed transactions
• Transaction logs are present only on the HAWQ Master
  – For inserts, a single-phase commit is performed on the HAWQ Master
HAWQ Interconnect Performance and Scalability
• Inter-process communication between segments
  – Standard Ethernet switching fabric
• Uses the UDP protocol (User Datagram Protocol)
  – Improved performance and scalability
• HAWQ adds the packet verification and checking not performed by UDP itself
  – Reliability equivalent to TCP
[Diagram: Segment Hosts connected by the interconnect]
HAWQ Dynamic Pipelining™
• Differentiating competitive advantage!
• Core execution technology from GPDB
• Parallel data flow using the high-speed UDP interconnect
• No materialization between stages, as is performed with MapReduce
[Diagram: Segment Hosts connected by the Dynamic Pipelining interconnect]
Dynamic Pipelining™
• A framework that enables parallel data flow
  – Combines the high-speed UDP interconnect with a run-time execution environment for big data workloads
  – Data from upstream components in the dynamic pipeline is transmitted to downstream components through the UDP interconnect
• The Dynamic Pipelining run-time layer ensures that queries complete, even very demanding queries under heavy cluster utilization
  – Provides a seamless data partitioning mechanism that groups together parts of a data set which are often used in any given query
  – Enables queries to run without materializing contents to disk
Lab
Lab: HAWQ_DDL – Create Tables for Lab Exercises
• Run the DDL script to create the HAWQ database and tables
• Review the HAWQ tables
HDFS Review
What is HDFS?
• HDFS is implemented as a Java file system
  – Uses libhdfs (JNI) to access the file system
• A scalable, distributed, fault-tolerant file system
• Designed to run well on commodity hardware
• Acknowledges that components frequently fail
  – An entire node may fail or, more commonly, one or more disks within a node will fail
  – HDFS gracefully continues to run in the presence of failures (an entire node or disks within a node)
HDFS Basic Architecture
[Diagram: a client ingests and egresses files over Ethernet; the NameNode holds the metadata; DataNodes store blocks on local data stores with 3x replication (default).]
HDFS Model
• Mostly a POSIX-"like" file system, with some caveats
  – Write once, read many
  – Doesn't support updates to files (simple consistency model)
  – Pivotal HD supports append and truncate on its HDFS layer
• Access patterns are well suited to SATA disk drives
  – Fewer seeks
  – Read large, contiguous blocks
• Prefer fewer, large files
  – Files are split into blocks (128MB default)
  – Blocks are evenly distributed across the cluster
HAWQ Data Storage and I/O
HAWQ Data Storage and I/O
• Segments are HDFS clients that run on DataNodes
• Each table's data is sharded on HDFS
• The DataNodes are responsible for serving read and write requests from HAWQ segments
  – Data stored in HAWQ database tables
• Data stored external to HAWQ but within the Hadoop cluster (HDFS, Hive, HBase) can be read using PXF external tables, and the framework is extensible
• Data stored in HAWQ can be written to HDFS for external consumption using writable HDFS table support
Segment Files
• Each table's data is sharded on HDFS, for example:
  /hawq_data/gpseg<ID>/<DB OID>/<schema OID>/<table OID>.1,2,3,4,…
• Data inserted into the same segment is always appended to the segment file
• The maximum file size in HDFS is governed by the dfs.namenode.fs-limits.max-blocks-per-file setting in the hdfs-site.xml configuration file
  – The default is 1048576 blocks, which is 64TB
Data Locality
• For tables using a hash distribution, data with the same hash key will always be handled by the same segment and is always written to the same DataNode as the segment host
• Data locality will always be maintained unless one of the following conditions occurs:
  – The DataNode on the segment host is at full file capacity
  – The DataNode on the segment host fails
  – The DataNode experiences more failed drives than the value specified by the dfs.datanode.failed.volumes.tolerated configuration parameter
• Data locality is lost permanently when a DataNode fails for long enough that the NameNode marks it down
Local Read Failures
• When a read from a local DataNode on a segment host fails, reads are performed from a remote DataNode (a replicated copy)
• Performance impact of approximately 70%
  – This number quickly decreases with subsequent reads as a result of caching the data
  – Decreases to 10% on subsequent reads when the cache is hit
HDFS I/O
• HDFS is a Java file system
  – libhdfs (JNI) is used to access HDFS
• Reading through the HDFS indirection layer is 1.75 to 2.5 times slower than reading directly from disk
  – The cost of simply reading, doing an IPC into a Java JVM, and Java reaching out to the file system
• Reading through libhdfs in Java is slow (garbage collection + overhead)
65 Capgemin / Pivotal Alliance Confidential–Do Not Distribute July 2014 New Hire Immersion Training © Copyright 2014 Pivotal. All rights reserved.
libhdfs3
• Pivotal rewrote libhdfs in C++, resulting in libhdfs3
– A C++ library exposing a C-style API
– Leverages protocol buffers to achieve greater performance
• libhdfs3 is used to access HDFS from HAWQ
• There is a GUC to disable libhdfs3, but it exists for internal testing and debugging by engineering
– It should never be turned off or disabled in the field
HAWQ Reads and HDFS
• In HAWQ, data is physically partitioned/sharded across the cluster
• HDFS is not designed for a single query to access a large number of small files
• For every DataNode running HAWQ, for every segment, for every partition, and for every column (if using column orientation), a substantial amount of metadata is needed from the NameNode
• By contrast, a typical MapReduce job reads one large contiguous file from HDFS, carves it into partitions at run time, and executes
HAWQ Reads
• The HAWQ master has a centralized catalog metadata store
• HDFS has a NameNode metadata store
• The HAWQ master must interrogate the NameNode to obtain metadata, then dispatch it along with the query plan to each of the HAWQ segments
• The segments then call back to the NameNode to obtain a block location array consisting of block IDs
– For any given shard, the actual block IDs that need to be read are not known in advance
HAWQ Data Storage Performance Considerations
• Data is still split per segment, so there is one file per object, per segment
• There can be a large number of partitions depending on the partition granularity
– Every partition is a file
• Columnar orientation on very wide tables
– Every column is a file
• This can result in:
– Many very small files
– A huge number of calls to the NameNode
– Errors (particularly when loading) and slowness (when running queries)
Solution
• You must consider #Segments × #Columns × #Partitions
– For example, 16 segments × 100 columns × 365 daily partitions is 584,000 HDFS files for a single table
• In general, determine the optimal number of segments on the DataNodes
• Use a coarser partition granularity
• Limit columnar orientation on very wide tables
– If the partition granularity requirement is low, use row-based table orientation
• NEVER use partitioning and column orientation together!
HAWQ Distributions, Partitioning and Storage Options
HAWQ Data Distributions
• Same functionality and behavior as GPDB
– Data locality/co-located joins, redistribution, broadcasts, etc.
– Most important is an even distribution of data!
• Loading randomly distributed tables is faster on larger tables since the data does not get hashed
• There is no difference in sequential scans on randomly distributed tables vs. hash-distributed tables
• Complex queries (joins, aggregates, sorts) on large randomly distributed tables take longer due to the re-hash of data for local joins and aggregates
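The two distribution choices above can be sketched in Greenplum-style DDL (table and column names are hypothetical):

```sql
-- Hash distribution: rows with the same customer_id always land on the
-- same segment, enabling co-located joins on that key
CREATE TABLE sales_hash (
    sale_id     bigint,
    customer_id bigint,
    amount      numeric
) DISTRIBUTED BY (customer_id);

-- Random distribution: faster to load (no hashing), but complex queries
-- must redistribute the data at run time
CREATE TABLE sales_rand (LIKE sales_hash) DISTRIBUTED RANDOMLY;
```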
Loading Varying Storage Options
• Loading columnar tables takes approximately 5-10x longer than loading the same row-based table
• Loading compressed row (or columnar) tables introduces only slight overhead, 20% or less on small tables/loads
– On larger tables it is actually 5-10% faster because less data (fewer blocks) is written to HDFS
• zlib compression reduced the storage footprint by 50% on very high cardinality data
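As a sketch, a zlib-compressed append-only row table might be declared like this (table name and compression level are illustrative):

```sql
-- Append-only row-oriented table with zlib compression: slight load
-- overhead on small loads, fewer HDFS blocks written on large ones
CREATE TABLE events_compressed (
    event_id bigint,
    payload  text
) WITH (appendonly=true, orientation=row,
        compresstype=zlib, compresslevel=5)
DISTRIBUTED BY (event_id);
```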
Compressed Row-Based vs Non-Compressed Row-Based
• Sequential scan operations (for example, SELECT count(x)) take 2-6x longer on compressed tables, depending on table size
– As the table size increases, the difference in query time shrinks
• On more complex queries with aggregates and sort operations, the difference in query time is almost unnoticeable
Querying Row vs Columnar-Based Tables
• A sequential scan selecting a few columns takes only marginally less time on small columnar tables than on the same row-based table
– As table size increases there is no perceptible performance difference
• Wide queries and joins that read all columns of a columnar table do not show a significant difference in query time versus the same row-based table
• For complex queries (sorts, aggregates, joins) that involve only a subset of columns, the difference between columnar and row-based is negligible
– The majority of time for these queries is spent on the sort/aggregation operations, not the HDFS read
Partitioned Row-Based vs Non-Partitioned Row-Based
• Load times on small partitioned tables are 5x slower than on non-partitioned tables
• Load times on large partitioned tables were 2-3x slower than on non-partitioned tables
• Sequential scans take 130-200% longer on partitioned tables
• For complex queries (aggregates, joins, sorts) without a WHERE clause eliminating partitions, query time was actually faster on partitioned tables for larger tables
– This is likely due to the increased parallelism achieved with partitioned tables
HAWQ Partitioning
• There can be a large number of partitions depending on the partition granularity
– Every partition is a file in HDFS
– This may result in many small files, which is not desired
• In general, use partitioning on very large tables, but use a coarser partition granularity so there are fewer, larger files
• Do not use partitioning if load performance is critical
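A minimal sketch of coarser partition granularity, using Greenplum-style range partitioning (names and dates are hypothetical):

```sql
-- Monthly rather than daily partitions keeps the file count down:
-- 12 files per segment per year instead of 365
CREATE TABLE fact_sales (
    sale_id   bigint,
    sale_date date,
    amount    numeric
) DISTRIBUTED BY (sale_id)
PARTITION BY RANGE (sale_date)
( START (date '2014-01-01') INCLUSIVE
  END   (date '2015-01-01') EXCLUSIVE
  EVERY (INTERVAL '1 month') );
```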
HAWQ Columnar Storage
• Do not use columnar orientation on very wide tables
– Every column is a file in HDFS
• If the partition granularity requirement is low, use row-based table orientation
• Optimally you want bigger files and fewer NameNode calls for scanning the same amount of data
• NEVER use columnar tables with partitioning!
– This is very different from GPDB
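For reference, a column-oriented table is declared with the orientation storage option (a sketch with hypothetical names):

```sql
-- Column-oriented storage: one HDFS file per column per segment,
-- so keep it to narrow tables and do NOT combine with partitioning
CREATE TABLE metrics_co (
    metric_id bigint,
    value     double precision
) WITH (appendonly=true, orientation=column)
DISTRIBUTED BY (metric_id);
```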
Running Queries in HAWQ
SQL Querying
• Uses the pipelined method of execution developed for Greenplum Database
– Efficient parallel execution
– No MapReduce used behind the scenes
– No intermediate materialization of data
• The only difference in operator-level execution compared to Greenplum Database is the scan node
– The scan node is the operator that reads data from HDFS, versus reading from the local file system in a Greenplum database
SQL Querying Caveats
• SQL query support is similar to Greenplum Database
– Support for advanced SQL such as OLAP and analytical functions (e.g., MADlib)
• No updates
• No deletes
• No support for indexes
• No GPText
Query Example
• SQL is submitted to the HAWQ master
– Validates SQL and parses the query
– The Query Optimizer produces the plan
– The HAWQ master obtains metadata from the NameNode and annotates the query plan with the metadata the segments need for execution
• The HAWQ master dispatches the plan to every segment
• Segments call back to the NameNode to obtain a block location array consisting of block IDs
• The libhdfs3 read operation begins, retrieving data from whichever DataNodes in the cluster it needs and returning data to upper-level operators
• Upper-level operators (e.g., hash-join, hash-agg) carry on the execution, using motion operators as needed
Query Using PXF External Tables
• Data can be queried from external data sources and joined with HAWQ data using the external table methodology
• Regular external tables can be used for data residing outside the Hadoop ecosystem
• For data residing in the Hadoop ecosystem, PXF external tables can be used
– Read HDFS, HBase, Hive and other formats using standard SQL
3rd Party Application Querying
• JDBC interface for HAWQ
– Used for queries, but should not be used for inserts
• JDBC DDL operations (CREATE TABLE, TRUNCATE TABLE) fall into a transaction block
– Meaning if you create a table in a transaction block and then roll back the transaction, you will never see that table
• Cannot perform updates, deletes, or create indexes
Loading and Unloading Data in HAWQ
Loading Data into HAWQ
• When the data sources are outside the Hadoop ecosystem
– Use regular gpfdist external tables
– Use the COPY command for loading small data sets only
• When the data sources are in the Hadoop ecosystem
– Use PXF external tables
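A minimal sketch of a gpfdist bulk load (server host names, ports, file patterns, and table names are hypothetical):

```sql
-- Readable external table pointing at two gpfdist servers
CREATE EXTERNAL TABLE ext_sales (
    sale_id bigint,
    amount  numeric
)
LOCATION ('gpfdist://etl1:8081/sales*.txt',
          'gpfdist://etl2:8081/sales*.txt')
FORMAT 'TEXT' (DELIMITER '|');

-- Segments pull from the gpfdist servers in parallel
INSERT INTO sales SELECT * FROM ext_sales;
```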
HAWQ Data Loading Options
HDFS DataNode
HAWQ Segment Host
HDFS DataNode
HAWQ Segment Host
HDFS DataNode
HAWQ Segment Host . . . Query Executor Query Executor Query Executor
Clients JDBC
SQL Console
insert into <hawq-target-table> select * from <regular external table>;
HDFS Namenode
HAWQ Master Host
Query Optimizer Query Parser
Interconnect
External Data Sources
insert into <hawq-target-table> select * from <pxf external table>;
COPY command
HAWQ Writes and Performance
• The fastest method to write data into HAWQ is gpfdist
• In testing, the gpfdist write process capped at 1 GB/sec (with 1 gpfdist server and 64 segment readers)
– This speed increases linearly with added gpfdist servers
• In testing, hadoop fs -put capped at about 130 MB/sec
• A PXF external table copy to a HAWQ table capped at 600 MB/sec (for 64 segments)
• In testing, a gpfdist external table copy is approximately 160% faster than PXF external tables
Write Paths
• Using gpfdist, HAWQ segments read chunks of data from the gpfdist servers in parallel, hash on the distribution key, and send the data to the correct segment server, where it is written to HDFS locally by the DataNode
• Using PXF external tables, HAWQ segments request chunks of data from the PXF fragmenter; PXF reads the data via a set of PXF accessors and returns it to the segment; the segment then hashes on the distribution key and sends the data to the correct segment (likely not on the same DataNode) to be written to HDFS by the DataNode
– This path shows the highest number of NameNode RPC calls, since both the PXF fragmenter and the segments make NameNode calls for block locations
Optimizing gpfdist for Performance
• In general, maximize the parallelism as the number of segments increases
• Spread the data evenly across as many nodes as possible
• Spread the data evenly across as many file systems as possible
– Run two gpfdists per file system
• Run gpfdist on as many interfaces (NICs) as possible
• Keep the work even across ALL of these resources
– In an MPP shared-nothing environment, loading is only as fast as the slowest node
gp_external_max_segs Optimization
• Controls the maximum number of segments each gpfdist serves
• Keep gp_external_max_segs and the number of gpfdist processes an even factor
– gp_external_max_segs / # of gpfdist processes should have a remainder of 0
• Default is 64
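A sketch of checking and tuning this GUC at the session level (the parameter name comes from the slide; the session-level SET is an assumption):

```sql
SHOW gp_external_max_segs;        -- default is 64
-- With 4 gpfdist processes, 64 / 4 = 16 segments per gpfdist
-- (remainder 0, so the work divides evenly)
SET gp_external_max_segs = 64;
```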
Error Handling
• Single-row error handling
– Supported in external tables and the COPY command
– Define a table to catch the 'unloadable' rows
– The load continues; it does not fail
• Reject limit
– Caps the number of rejects
– Once the limit is met, the load statement fails
– The limit can be an actual number or a percent
– Rejects are evaluated at the segment
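The two mechanisms above combine in Greenplum-style external table DDL, roughly like this (table names, location, and the limit of 100 rows are hypothetical):

```sql
-- Catch unloadable rows in an error table instead of failing the load,
-- but abort the statement once more than 100 rows are rejected
CREATE EXTERNAL TABLE ext_sales_safe (
    sale_id bigint,
    amount  numeric
)
LOCATION ('gpfdist://etl1:8081/sales*.txt')
FORMAT 'TEXT' (DELIMITER '|')
LOG ERRORS INTO sales_errs
SEGMENT REJECT LIMIT 100 ROWS;
```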
Loading Recommendations
• The default recommendation is to bulk load through gpfdist external tables
– Suitable from an HDFS perspective
• To load smaller amounts of data (example: <100,000 rows)
– The COPY command can be used
• Single-row inserts are not recommended
– Not suitable from an HDFS perspective
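For small data sets, COPY runs through the master rather than in parallel through the segments — a sketch (file path and table name are hypothetical):

```sql
-- Fine for small data sets (e.g. under ~100,000 rows); the data
-- streams through the master, not in parallel via gpfdist
COPY sales FROM '/data/sales_small.csv'
WITH DELIMITER ',' CSV HEADER;
```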
Unloading Data
• Regular writable external tables can be used for scalable unload
– Same as in GPDB
• The COPY command can be used for unloading small data sets
• Example of unloading to HDFS:

DROP EXTERNAL TABLE IF EXISTS foo_dump;
CREATE WRITABLE EXTERNAL WEB TABLE foo_dump ( LIKE foo )
EXECUTE 'hadoop fs -put - hdfs://pivhdsne:8020/dump/foo/${GP_SEGMENT_ID}.tsv'
FORMAT 'TEXT' (DELIMITER E'\t');
INSERT INTO foo_dump SELECT * FROM foo;
Lab
HBASE_HAWQ_LOAD Lab
Loading Data into HBase and HAWQ
• Load dimension tables into HBase using importtsv
– Data is in HDFS
• Load data into HAWQ tables using COPY
– Data is in DAS
• Load a HAWQ AO table using SELECT from one of the PXF external tables defined
PXF External Tables
PXF is...
A fast, extensible framework connecting HAWQ to a data store of choice that exposes a parallel API
HAWQ External Tables
• gpfdist
– Remote delimited text (or CSV) files
• file
– Text files on the segment file system
• execute
– Script execution and the data it produces
• pxf
– Text and binary data from the available PXF connectors
PXF
• Load data into HAWQ from Hadoop
• Query Hadoop data without materializing it into HAWQ
– HDFS: delimited text, CSV, Sequence, Avro
– HBase (w/filter pushdown)
– Hive (w/partition exclusion)
▪ Text, Sequence and RCFile formats
• Write HAWQ data to HDFS
– Delimited text, CSV, Sequence
– Various compression codecs and options
• Extensible!
– GemFireXD, JSON format, Cassandra, Accumulo, others
PXF Features
• Supports filtering through predicate pushdown in HBase
– <, >, <=, >=, =, != between a column and a constant
– These can be combined with AND (but not OR)
• Supports Hive table partitioning
• Ability to analyze data stored on HDFS
– The HAWQ optimizer uses the statistics to generate optimal plans for PXF external tables
• Extensible framework with a Java API to enable custom development for other data sources and custom formats
Key Use Cases
• Using analytics and SQL query functionality from HAWQ on HDFS, HBase, or Hive data without materialization into HAWQ
• Joining dimension tables stored in HAWQ with HBase fact tables
• Fast ingest/materialization of high-value processed data from HDFS, Hive or HBase into HAWQ
PXF Differentiators
• Utilizes the fast parallel HAWQ optimizer
• Applies data locality optimizations to reduce resources and network traffic
• Extensible framework
– Customers and partners can configure support for any new data store and automatically get fast, parallel data transfer
• JSON format, Cassandra, Accumulo in beta
• Supports ANALYZE for gathering HDFS file statistics and making them available to the query planner at run time
Feature Summary
★ HBase (w/filter pushdown)
★ Hive (w/partition exclusion, various storage file types)
★ HDFS files: read (delimited text, CSV, Sequence, Avro)
★ HDFS files: write (delimited text, CSV, Sequence, various compression codecs and options)
★ GemFireXD, JSON format, Cassandra, Accumulo (currently beta)
★ Statistics collection
★ Automatic data locality optimizations
★ Extensibility!
PXF Components
• Fragmenter
– Runs against the NameNode
– Passes metadata about the data source (blocks and locations) back to the HAWQ master
• Accessor
– Responsible for reading specific data fragments and passing them to the Resolver
• Resolver
– De-serializes the records and serializes them into a list of one-field objects
– The one-field objects are converted into GPDBWritable that can be read by HAWQ
• Analyzer
– Responsible for collecting statistics on external table data for use by the HAWQ optimizer
PXF Loading into HAWQ
• To load data into HAWQ, use a variation of:
– insert into <hawq-target-table> select * from <pxf-external-table>;
• Data can be transformed in-flight before loading
• Data from Hadoop can also be joined in-flight with HAWQ data while loading
• The number of segments responsible for connecting to Pivotal HD for concurrent reading of data can be tuned
– gp_external_max_segs GUC
– Default 64
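An in-flight transform and join during a PXF load might look like this (the PXF external table, target table, and column names are hypothetical):

```sql
-- Transform in-flight (normalize region to upper case) and join the
-- Hadoop-resident data with a HAWQ dimension table while loading
INSERT INTO sales_enriched
SELECT e.sale_id,
       upper(e.region),
       d.customer_name
FROM   ext_pxf_sales e          -- PXF external table
JOIN   customers_dim d          -- regular HAWQ table
       ON d.customer_id = e.customer_id;
```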
PXF Querying
• PXF external tables can be queried directly without materialization into HAWQ
• PXF data can be joined with HAWQ tables
• The ability to ANALYZE external tables helps the HAWQ optimizer choose optimal plans
• HBase predicate pushdown
• Hive partitioning
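Querying a PXF external table directly, joined with a HAWQ table, might look like this (table and column names are hypothetical; whether the filter is actually pushed down to HBase depends on the pxf_enable_filter_pushdown GUC covered in the lab):

```sql
-- No materialization: the HBase-backed external table is scanned in
-- place, and the constant filter is a pushdown candidate
SELECT d.customer_name, sum(f.amount) AS total
FROM   hbase_sales f            -- PXF external table over HBase
JOIN   customers_dim d          -- regular HAWQ table
       ON d.customer_id = f.customer_id
WHERE  f.amount > 1000
GROUP  BY d.customer_name;
```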
Profiles
• Improved user experience
• Informative error messages

Instead of spelling out each plugin:
LOCATION('pxf://<host:port>/sales?fragmenter=HiveFragmenter&accessor=HiveAccessor&resolver=HiveResolver')
a profile names the whole set:
LOCATION('pxf://<host:port>/sales?profile=Hive')
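A full external table definition using the Hive profile might look like this (host, port, and table/column names are hypothetical; the formatter name follows the HBase example later in this deck):

```sql
CREATE EXTERNAL TABLE hive_sales (
    id    int,
    total int
)
LOCATION ('pxf://namenode:50070/sales?profile=Hive')
FORMAT 'custom' (formatter='gpxfwritable_import');
```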
profiles.xml
<profile>
<name>HBase</name>
<description>Used for connecting to an HBase data store engine</description>
<plugins>
<fragmenter>HBaseDataFragmenter</fragmenter>
<accessor>HBaseAccessor</accessor>
<resolver>HBaseResolver</resolver>
<myidentifier>MyValue</myidentifier>
</plugins>
</profile>
HDFS Files Example
Analyze all text files inside the HDFS directory 'sales/2012/01':

CREATE EXTERNAL TABLE jan_2012_sales (
  id int,
  total int,
  comments varchar
)
LOCATION ('pxf://10.76.72.26:50070/sales/2012/01/items_*.csv?profile=HdfsTextSimple')
FORMAT 'TEXT' (delimiter ',');
HBase Table Example
Get data from an HBase table called 'sales'. In this example we are only interested in the row key, the qualifier 'saleid' inside column family 'cf1', and the qualifier 'comments' inside column family 'cf8' (a direct mapping):

CREATE EXTERNAL TABLE hbase_sales (
  recordkey bytea,
  "cf1:saleid" int,
  "cf8:comments" varchar
)
LOCATION ('pxf://10.76.72.26:50070/sales?profile=HBase')
FORMAT 'custom' (formatter='gpxfwritable_import');
Writable PXF – Export to HDFS
• gphdfs-like functionality, but extensible
– Supports text, CSV, SequenceFile
– Supports various Hadoop compression codecs

CREATE WRITABLE EXTERNAL TABLE ...
LOCATION ('pxf://<host:port>/sales?profile=HdfsTextSimple&COMPRESSION_CODEC=org.apache.hadoop.io.compress.GzipCodec')
FORMAT 'text' (delimiter ',');

Alternatively, you can create a new profile "HdfsTextSimpleGZipped" that includes the compression codec:
LOCATION ('pxf://<host:port>/sales?profile=HdfsTextSimpleGZipped')
Lab
PXF_PUSHDOWN Lab
PXF external tables predicate pushdown
• Review the HAWQ DDL
• Run the HBase query
– customers_dim table
• Using SHOW, check the value of the GUC pxf_enable_filter_pushdown
– Toggle the value to "off"
• Rerun the query
Lab
PXF_STATS Lab
PXF external tables statistics
• SELECT relpages, reltuples FROM pg_class WHERE relname = 'table_name';
• ANALYZE table_name;
• SELECT relpages, reltuples FROM pg_class WHERE relname = 'table_name';
HAWQ, Hive, HBase Comparative Usage
Hive
• Hive uses a SQL-based language called HiveQL, which is a subset of SQL with some additional MapReduce-specific syntax
• Hive interprets SQL into a series of native MapReduce jobs
– Materializes data to disk
• Hive can manage its own tables or use external tables
– No inherent performance difference, just ease of management
• Hive is typically used as the integration point for BI and ETL tools
• Hive has only a rudimentary query optimizer
HBase
• HBase is a scalable, sorted, columnar, key-value data store
– Linear scalability
– Keys are sorted and partitioned, so fetching by key is fast
▪ Can support range scans
– Data is split into column families, which are stored underneath as separate files
▪ Allows for column pruning
– Stores data as a "doubly-nested map" key-value
▪ Row key -> column family:label -> data value
▪ Allows for a very flexible schema, as the label is arbitrary
▪ Can support hierarchical data structures easily
HAWQ, HBase, and Hive Comparison

Item | HAWQ | HBase | Hive
Interface | ANSI SQL | Java API/Shell | HiveQL (SQL subset)
Client Connection | JDBC/PXF | Java/REST API | JDBC (limited)
Executes as MapReduce | Never | Yes | Yes
SQL Completeness | Yes | No | No
Nodes (Supported) | 1,000+ | 1,000+ | 1,000+
Restart SQL on failure | No | No | Yes
Performance | High | Low | Low
Rely on MapReduce | No | Yes | Yes
Open Source | No | Yes | Yes
DDL | Yes | No | Yes
ANSI Data Types | Yes | No | Yes
Indexes | No | Yes | No
When to use | Ad hoc analytics | Flexible schema/updates | Batch
HAWQ, HBase, and Hive Comparison (continued)

Item | HAWQ | HBase | Hive
User Defined Data Distributions | Yes | No | Yes (limited)
Advanced Partitioning | Yes | Yes (limited) | Yes (limited)
Robust SQL Optimizer | Yes | No | No
Store Data on HDFS | Yes | Yes | Yes
Has its own daemons | Yes | Yes | No
Relational Database | Yes | No | No
Manage own tables | Yes | Yes | Yes
UDFs | Yes | No | No
MADlib | Yes | No | No
Lab
HIVE_VS_HAWQ Lab
Query Performance Comparison
• Run the Hive DDL to create Hive external tables against the existing HDFS data
• Run queries against the HAWQ and Hive versions of these tables
Securing PHD Clusters
Security…Has Many Faces
• Data Protection
– Data access control
– Data-at-rest encryption
– Masking/tokenization for data load
– Data-in-motion encryption
• User Management/Authentication/Authorization
• GRC (Governance, Risk, Compliance) / System Security
Security Dashboard

Component | Supports secure cluster | Supports Kerberos for authentication | Supports LDAP for authentication
HDFS | Yes | Yes | Linux OS supports
MapReduce/Pig | Yes | N/A |
Hive | Yes (standalone mode) | N/A |
Hiveserver | No | No |
Hiveserver2 | Yes | Yes | Yes
HBase | Yes | Yes | Yes
HAWQ* | Yes | Yes | Yes
GemfireXD | Yes | Yes | Yes
Hadoop Security
• Pivotal HD follows the Hadoop community
– Today limited to Kerberos
– The intent is the ability to plug in mechanisms other than Kerberos, as well as open-source single sign-on gateways
• Requires a KDC to manage cluster authentication
• Once a user is authenticated
– Provides authorization by enforcing HDFS file permissions
– Ensures jobs run as that user in a Linux container
• We support everything Cloudera does in terms of securing a cluster
Data Access Control
• Kerberos for user authentication
• Jobs run in secure Linux containers
• Allows HDFS file permissions to be enforced
– Similar to Linux file permissions
• Prevents service and user spoofing
Challenges
• It is important to understand security expectations and requirements
• Many types of security are not addressed by Hadoop
– Data-at-rest protection
• Hadoop supports data-in-motion encryption, but the performance impact is high
– On-wire encryption is not recommended, especially 3DES
3rd Party Data Protection Solutions
Encryption, Masking, Tokenization, Token Management

Company | Masking | Tokenization | Encryption
Gazzang | No | No | Yes
Protegrity | Yes | Yes | Yes
DataGuise | Yes | Yes | No
Thank you jfunk@pivotal.io