Posted on 20-May-2020
A NEW PLATFORM FOR A NEW ERA
Capgemini / Pivotal Alliance Confidential – Do Not Distribute. July 2014 New Hire Immersion Training. © Copyright 2014 Pivotal. All rights reserved.
Pivotal HD and HAWQ Immersion v5 John Funk
Course Outline
• PHD and HAWQ Introduction
• HAWQ Architecture
• HDFS Review
• HAWQ Distribution, Partitioning and Storage Options
• Query Execution in HAWQ
• Loading and Unloading Data in HAWQ
• PXF – Pivotal Xtension Framework Best Practices
• HAWQ, HBase and Hive Comparative Usage
• Securing HAWQ
Pivotal HD and HAWQ Introduction and Positioning
Pivotal HD and HAWQ is the…
enterprise platform that provides the fewest barriers and lowest risk, and the most cost-effective and fastest way to enter into big data analytics on Hadoop.
HAWQ Evolved From…
• Greenplum database re-platformed on Hadoop/HDFS
• Over a decade of proven Greenplum database performance
• HAWQ provides all major features found in Greenplum database:
  – SQL completeness: SQL:2003 extensions
  – Robust query optimizer
  – Row- or column-oriented table storage
  – Compression
  – Distributions
  – Multi-level partitioning
  – Parallel loading and unloading
  – High-speed data redistribution
  – Views
  – External tables
  – Resource management
  – Security
  – Authentication
  – Management and monitoring
  – ODBC/JDBC compliant
HAWQ Benefits…
• Out-of-the-box SQL for Hadoop
  – SQL adoption versus learning MapReduce programming
  – GPXF external tables providing SQL access to Hadoop (HDFS, HBase, Hive or any data types)
  – Broad data access, integration and portability
• Performance and scalability
  – Parallel everything
  – Dynamic Pipelining
  – High-speed interconnect
  – Optimized HDFS access with libhdfs3
  – Co-location
  – Partition elimination
  – Higher cluster utilization
  – Concurrency control
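The external-table access mentioned above can be sketched as follows. This is a minimal sketch, not from the deck: the host, port, path, table and column names are hypothetical, and the exact LOCATION syntax varies by PXF/GPXF version.

```sql
-- Hypothetical example: expose delimited text files in HDFS to HAWQ SQL.
-- Host, port, path and columns are placeholders.
CREATE EXTERNAL TABLE ext_page_views (
    user_id    int,
    url        text,
    view_time  timestamp
)
LOCATION ('pxf://pxf-host:51200/data/page_views/*.csv?Profile=HdfsTextSimple')
FORMAT 'TEXT' (DELIMITER ',');

-- Query it like any other table:
SELECT url, count(*) FROM ext_page_views GROUP BY url;
```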
Pivotal HD Architecture
[Diagram: the Pivotal HD Enterprise stack. Apache components: HDFS, HBase, Pig, Hive, Mahout, MapReduce, Sqoop, Flume, Oozie, Vaidya, with YARN and Zookeeper for resource management and workflow. Pivotal components: Command Center (configure, deploy, monitor, manage), Data Loader, Spring, Unified Storage Service, Hadoop Virtualization Extension; HAWQ – Advanced Database Services (ANSI SQL + analytics, query optimizer, Dynamic Pipelining, catalog services, Xtension Framework, MADlib algorithms); and GemFire XD – Real-Time Database Services (ANSI SQL + in-memory, distributed in-memory store, query transactions, ingestion processing, Hadoop driver – parallel with compaction).]
Flexible Deployment Model
[Diagram: the same bits deploy to public cloud, on premise, or private cloud – portable, elastic, promotable, hardware-abstracted and manageable.]
Pivotal HD
• World's first true SQL processing for enterprise-ready Hadoop
• 100% Apache Hadoop-based platform
• Virtualization and cloud ready with VMware and Isilon
• Scale tested in the 1000-node Pivotal Analytics Workbench
• Available as a software-only or appliance-based solution
• Backed by EMC's global, 24x7 support infrastructure
Introduction to Pivotal HD
Pivotal HD Architecture
[Diagram: the Pivotal HD Enterprise stack, identical to the earlier "Pivotal HD Architecture" slide.]
Pig
• Pig provides a high-level, data-flow-oriented abstraction for MapReduce
  – Much more concise than MapReduce code
  – Though not very intuitive
• Compiles to MapReduce programs, which it runs for you
• Output can be dumped to the terminal, or written as files in HDFS for access by HAWQ or other tools
• Useful operators, extensible through "Piggybank"
• Developed at Yahoo!
Hive
• Hive provides a SQL-like interface to data in HDFS
• To users who know SQL, Hive provides a much more intuitive interface than MapReduce or Pig
• Like Pig, Hive operates by translating the user's query into one or more MapReduce jobs, running these on potentially very large data sets, and finally printing the result
• Drawbacks – limited SQL, job latency and frequent I/O (slow)
• Developed at Facebook
HBase
• HBase provides random, real-time read/write access to data stored within HDFS
  – Sparse, wide tables
• Flexible schema
• Key/value store: given ('table', 'rowkey'), retrieve the row
  – Does not perform well if rows are not retrieved by key
• An update to a row adds new data with the current timestamp
  – Previous state can be recovered using a previous timestamp
• Using PXF external tables, HAWQ is able to incorporate HBase data into queries
  – Pushing predicates into HBase when possible
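A PXF external table over HBase might look like the following sketch. The table name, column family, and PXF host/port are hypothetical; HBase columns are mapped as "family:qualifier" and the row key as recordkey.

```sql
-- Hypothetical: map an HBase table 'orders' into HAWQ via PXF.
CREATE EXTERNAL TABLE ext_orders_hbase (
    recordkey    text,     -- the HBase row key
    "cf:amount"  float8,   -- column family 'cf', qualifier 'amount'
    "cf:status"  text
)
LOCATION ('pxf://pxf-host:51200/orders?Profile=HBase')
FORMAT 'CUSTOM' (formatter='pxfwritable_import');

-- Predicates like this can be pushed down into HBase when possible:
SELECT recordkey, "cf:amount"
FROM   ext_orders_hbase
WHERE  recordkey = 'order-42';
```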
Pivotal HAWQ
HAWQ: The Crown Jewels
• SQL compliant
• World-class query optimizer
• Interactive query
• Horizontal scalability
• Robust data management
• Common Hadoop formats
• Deep analytics
HAWQ: High-Performance Query Processing
• Interactive and true ANSI SQL support
• Multi-petabyte horizontal scalability
• Cost-based parallel query optimizer
• Programmable analytics
HAWQ: Enterprise-Class Database Services & Management
• Scatter-gather data loading
• Row and column storage
• Workload management
• Multi-level partitioning
• 3rd-party tool & open client interfaces
HAWQ: Pre-Integrated Deep Analytics
• Performance via fully parallelized implementation
• Consistent, user-friendly SQL interfaces
• Ease of data preparation
• Pre-integrated MADlib support
  – Linear Regression
  – Logistic Regression
  – Multinomial Logistic Regression
  – K-Means
  – Association Rules
  – PLDA (useful for topic modeling)
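One of the pre-integrated MADlib algorithms, linear regression, is invoked directly from SQL. A minimal sketch: the houses table and its columns are hypothetical, and the output-table column names follow MADlib's documented linregr_train interface.

```sql
-- Hypothetical training table: houses(price float8, size float8, bedrooms int)
-- Train: writes coefficients and statistics to the output table houses_linregr.
SELECT madlib.linregr_train(
    'houses',                     -- source table
    'houses_linregr',             -- output model table
    'price',                      -- dependent variable
    'ARRAY[1, size, bedrooms]'    -- independent variables (1 = intercept)
);

-- Inspect the fitted coefficients and goodness of fit:
SELECT coef, r2 FROM houses_linregr;
```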
PXF: Pivotal Xtension Framework
A fast, extensible framework connecting HAWQ to a data store of choice that exposes a parallel API
PXF
• An advanced version of GPDB external tables
• Enables combining HAWQ data and Hadoop data in a single query
• Supports connectors for HDFS (read and write), HBase and Hive
• Provides an extensible framework API to enable custom connector development for other data sources
  – GemFire XD, JSON format, Cassandra, Accumulo
[Diagram: the PXF Xtension Framework connecting HAWQ to HDFS, HBase and Hive]
PXF Features
• What is it?
  – A HAWQ feature to access data stored in other popular Hadoop modules (HDFS, HBase, Hive) using the full SQL interface of HAWQ
• Why is it important?
  – A customer may prefer to primarily manage certain data in HBase, but want to join it to other data sets stored in HAWQ for analytics purposes. Or a customer may need SQL access to data in HBase or HDFS.
• When/who to use with?
  – An important feature to discuss with data and application architects who are concerned about unifying data access patterns across the variety of Hadoop components
  – Also useful to address any concerns about HAWQ using a proprietary data format not currently readable by other Hadoop processes
[Diagram: PXF – transparent, optimized SQL access to non-HAWQ formats in HDFS: Text, HBase, Hive, Avro]
PXF Feature Summary
★ HBase (with filter pushdown)
★ Hive (with partition exclusion; various storage file types)
★ HDFS files: read (delimited text, CSV, Sequence, Avro)
★ HDFS files: write (delimited text, CSV, Sequence; various compression codecs and options)
★ GemFire XD, JSON format, Cassandra, Accumulo (currently beta)
★ Stats collection
★ Automatic data locality optimizations
★ Extensibility!
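The writable-HDFS capability in the summary above can be sketched like this; the names and the LOCATION URI are hypothetical placeholders.

```sql
-- Hypothetical: export HAWQ query results to HDFS as delimited text via PXF.
CREATE WRITABLE EXTERNAL TABLE ext_sales_export (
    sale_id  int,
    amount   float8
)
LOCATION ('pxf://pxf-host:51200/exports/sales?Profile=HdfsTextSimple')
FORMAT 'TEXT' (DELIMITER ',');

-- Anything inserted here lands in HDFS, readable by Pig, Hive, MapReduce, etc.
INSERT INTO ext_sales_export
SELECT sale_id, amount FROM sales WHERE amount > 1000;
```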
Pivotal HD and HAWQ Rapid Innovation A look at features released in 2014
What's New in PHD 1.1
• GemFire XD Beta
• Orca
• PXF: Writable HDFS Table Support
• HAWQ Format Reader
• UDF Support
• Oozie
• Vaidya
• Kerberos Support (HDFS, HAWQ, USS)
• pgcrypto for HAWQ
• Unified Storage Service: CDH4 as a data source
What's New in PHD 1.1.1
• Automatic HD configuration via ICM
  – Manual failover of HAWQ/PXF
• Manual NameNode HA
• Kerberos authentication support (includes HAWQ, PXF, HBase, Hive)
• Parameterized Hadoop environment variables
• Backup and restore scripts for the Admin node
• Rebalance HDFS using the web API
• Piggybank support in Pig 0.12
• HAWQ gp_toolkit support
What's New in PHD 2.0
• GemFire XD GA
• Pivotal HD stack
  – Hadoop 2.2 rebase; built with JDK 1.7
  – Hive 0.12, HBase 0.96
  – GraphLab 2.2 beta (via Hamster/OpenMPI)
• HAWQ
  – Automated NameNode and HAWQ Master failover
  – MADlib 1.5 as a separately deployable package; PL/Java (PL/R and PL/Python from 1.1.1)
  – Add segments (HAWQ expand)
  – Pluggable storage phase 1 – basic Parquet support
  – Error tables
• PCC and ICM
  – New 'Read Only' user role
  – Log management
  – DCA/Isilon enhancements
HAWQ 1.2 Deep Scalable Analytics
• Linear Regression
• Logistic Regression
• Multinomial Logistic Regression
• K-Means
• Association Rules
• Latent Dirichlet Allocation
• Naïve Bayes
• Elastic Net Regression
• Decision Trees / Random Forest
• Support Vector Machines
• Cox Proportional Hazards Regression
• Descriptive Statistics
• ARIMA
PivotalR vs. PL/R

PivotalR:
• Interface is an R client
• Execution is in the database
• Parallelism handled by PivotalR
• Supports a portion of R

PL/R:
• Interface is a SQL client
• Execution is in R
• Parallelism via SQL function invocation
• Supports all of R
More to Come…!
• PostGIS
• Enhanced optimizer
• Query 3rd-party remote clusters
• …and much more
Greenplum Database and HAWQ
HAWQ Evolved From…
• Greenplum database re-platformed on Hadoop/HDFS
• HAWQ provides all major features found in Greenplum database:
  – SQL completeness: SQL:2003 extensions
  – JDBC compliant
  – Robust query optimizer
  – Row- or column-oriented table storage
  – Parallel loading and unloading
  – Distributions
  – Multi-level partitioning
  – High-speed data redistribution
  – Views
  – External tables
  – Compression
  – Resource management
  – Security
  – Authentication
  – Management and monitoring
HAWQ
• GPDB on HDFS
• Not shared-nothing: built on a distributed file system (HDFS)
  – Nodes can access shards of data on other nodes
• Built for large I/O, append-only, write-once, read-many workloads
• Segments are stateless
  – HA is one of the main drivers towards HDFS
[Diagram: HDFS NameNode and DataNodes]
HAWQ Features
• HAWQ provides all major features found in Greenplum database that can be supported in Hadoop/HDFS, including:
  – Row- or column-oriented table storage
  – Distributions
  – Partitioning
  – Views
  – External tables
• Using some features without understanding the implications in HDFS may result in problems
  – We will discuss this in the modules on each specific topic
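The storage-related features listed above come together in table DDL. A sketch with hypothetical table and column names; the exact WITH options available depend on the HAWQ release.

```sql
-- Hypothetical: a column-oriented, compressed, hash-distributed,
-- range-partitioned append-only table.
CREATE TABLE sales (
    sale_id     int,
    customer_id int,
    amount      float8,
    sale_date   date
)
WITH (appendonly=true, orientation=column, compresstype=quicklz)
DISTRIBUTED BY (customer_id)
PARTITION BY RANGE (sale_date)
(
    START (date '2013-01-01') INCLUSIVE
    END   (date '2014-01-01') EXCLUSIVE
    EVERY (INTERVAL '1 month')
);
```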
Architectural Differences from GPDB
• Stateless segment hosts
  – Segments do not know what is visible or aborted in their physical data
  – Segments do not know what columns are in a table
• HA model deviates from a shared-nothing environment
  – If a segment is down, simply read from the replica in HDFS
  – No lengthy failover process
• HDFS design doesn't lend itself to local transaction management
  – Frequent, small bursts of I/O on HDFS perform poorly
Architectural Implications of Using HDFS
• To re-platform GPDB on HDFS, segment workers had to be simplified (or made dumber)
  – GPDB segment workers had their own copies of metadata, transaction management and local storage
• Heap storage in GPDB requires the database to make modifications to tuples on disk
  – HDFS is append-only, therefore heap storage cannot work on DataNodes
  – Catalog tables require 100% heap storage, so segment servers cannot have a local copy of the catalog
GPDB and HAWQ Differences at a Glance
Considering the architectural differences and implications of HDFS…
• No UPDATE and DELETE
  – TRUNCATE is supported
• No catalog on segment servers
• No local transaction management at the segment level
• No indexes
• Local storage exists on segments but is used for temporary purposes
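In practice, the missing UPDATE/DELETE shows up like this. A sketch against a hypothetical sales table; the CTAS rewrite at the end is a common workaround, not a feature named in the deck.

```sql
-- Not supported in HAWQ (no local transaction management on segments):
--   UPDATE sales SET amount = 0 WHERE sale_id = 42;
--   DELETE FROM sales WHERE sale_date < '2013-01-01';

-- Supported: drop all rows at once.
TRUNCATE TABLE sales;

-- Common workaround: rebuild the table, filtering out unwanted rows.
CREATE TABLE sales_new AS
SELECT * FROM sales WHERE sale_date >= '2013-01-01';
```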
HAWQ or Greenplum Database?

Feature                                    GPDB   HAWQ
Real-time random reads/writes               ✓
Large I/O; write once, read many                    ✓
Petabytes of data                                   ✓
Hadoop/HDFS platform                                ✓
Updates                                     ✓
Deletes                                     ✓
Indexes                                     ✓
Row- or column-oriented table storage       ✓      ✓
User-defined data distributions             ✓      ✓
User-defined partitioning                   ✓      ✓
Resource management                         ✓      ✓
User-defined functions (UDFs)               ✓      ✓
External tables                             ✓      ✓
GPText                                      ✓
MADlib algorithms                           ✓      ✓
Introduction to HAWQ Architecture
Basic HAWQ Architecture
[Diagram: a HAWQ Master (Parser, Query Optimizer, Dispatch, Local TM, Query Executor, PXF, local storage) and a HAWQ Standby Master connect over the interconnect to the HDFS NameNode and to multiple Segment Hosts. Each Segment Host runs a Query Executor, PXF, one or more Segments with local temp storage, and an HDFS DataNode.]
In production there will be other nodes – for example, the Pivotal CC/ICM admin node, the YARN Resource Manager node, the Secondary NameNode, etc.
HAWQ Master
• Located on a separate node from the NameNode in production
  – For a small POC cluster the HAWQ Master may run on the NameNode
• Does not contain any user data
• Contains the Global System Catalog
  – System tables that contain HAWQ metadata
• Authenticates client connections, processes SQL, distributes work between segments, coordinates results returned by segments, and presents the final results to the client
[Diagram: HAWQ Master with Parser, Query Optimizer, Dispatch, Local TM, Query Executor and Catalog on local storage]
HAWQ Metadata
• Metadata is stored only in the HAWQ Master, on the local file system
  – Catalog information makes use of heap storage
• No catalog/metadata on segment nodes (DataNodes)
  – Segment nodes are stateless
  – No heap storage
[Diagram: the HAWQ Master and its catalog, as on the previous slide]
HAWQ Segments
• A HAWQ segment within a Segment Host is an HDFS client that runs on a DataNode
• Multiple segments per Segment Host/DataNode
• A segment is the basic unit of parallelism
• Multiple segments work together to form a single parallel query processing system
• Operations (scans, joins, aggregations, sorts, etc.) execute in parallel across all segments simultaneously
[Diagram: Segment Host with Query Executor, PXF, Segments, local temp storage and an HDFS DataNode]
Segments Access Data Stored in HDFS
• Segments are stateless
  – They do not store database and table metadata
  – The HAWQ Master dispatches the query plan along with related metadata obtained from the NameNode
• Segments communicate with the NameNode to obtain the lists of blocks where data is located
• Segments access data stored in HDFS
[Diagram: Segment Host with Query Executor, PXF, Segments and local temp storage]
HAWQ Parser
• Enforces syntax and semantics
• Converts a SQL query into a parse tree data structure describing the details of the query
[Diagram: clients submit SQL over JDBC to the HAWQ Master; the Parser feeds the Query Optimizer and Dispatch, which communicate over the interconnect with the NameNode and the Segment Hosts.]
HAWQ Parallel Query Optimizer
[Diagram: the Query Optimizer on the HAWQ Master produces a parallel plan whose operators include Gather Motion, Sort, HashAggregate, HashJoin, Redistribute Motion and Broadcast Motion over Seq Scans on the lineitem, orders, customer and nation tables, dispatched to the Segment Hosts.]
HAWQ Dispatch and Query Executor
1. Dispatch communicates the query plan to the segments
2. The Query Executor executes the physical steps in the plan
[Diagram: every segment runs the same plan slice in parallel – Scan Bars b, Scan Sells s, Filter b.city = 'San Francisco', HashJoin b.name = s.bar, Project s.beer, s.price, Motion Redist(b.name), Motion Gather.]
HAWQ Transactions
• DataNodes in HDFS do not know what is visible
  – They have no idea what data they hold; visibility is defined by the NameNode
• Therefore, segment nodes do not know what is visible
  – Visibility is defined by the HAWQ Master
• No distributed transaction management
  – No UPDATE or DELETE
• TRUNCATE is implemented to support rollback of failed transactions
• Transaction logs are present only on the HAWQ Master
  – For inserts, a single-phase commit is performed on the HAWQ Master
HAWQ Interconnect Performance and Scalability
• Inter-process communication between segments
  – Standard Ethernet switching fabric
• Uses the UDP protocol (User Datagram Protocol)
  – Improved performance and scalability
• HAWQ adds the packet verification and checking not performed by UDP itself
  – Reliability equivalent to TCP
[Diagram: Segment Hosts connected by the interconnect]
HAWQ Dynamic Pipelining™
• Differentiating competitive advantage!
• Core execution technology from GPDB
• Parallel data flow using the high-speed UDP interconnect
• No materialization between stages, as is performed with MapReduce
[Diagram: Segment Hosts connected by the Dynamic Pipelining interconnect]
Dynamic Pipelining™
• A framework that enables parallel data flow
  – Combines the high-speed UDP interconnect with a run-time execution environment for big data workloads
  – Data from upstream components in the dynamic pipeline is transmitted to downstream components through the UDP interconnect
• The Dynamic Pipelining run-time layer ensures that queries complete, even very demanding queries under heavy cluster utilization
  – Provides a seamless data partitioning mechanism that groups together parts of a data set which are often used in any given query
  – Enables queries to run without materializing contents to disk
Lab
Lab: HAWQ_DDL – Create Tables for Lab Exercises
• Run the DDL script to create the HAWQ database and tables
• Review the HAWQ tables
HDFS Review
What is HDFS?
• HDFS is implemented as a Java file system
  – Uses libhdfs (JNI) to access the file system
• A scalable, distributed, fault-tolerant file system
• Designed to run well on commodity hardware
• Acknowledges that components frequently fail
  – An entire node may fail or, more commonly, one or more disks within a node will fail
  – HDFS gracefully continues to run in the presence of failures (an entire node or disks within a node)
HDFS Basic Architecture
[Diagram: a client ingests and egresses files over Ethernet; the NameNode holds the metadata; DataNodes store blocks on local data stores with 3x replication (default).]
HDFS Model
• Mostly a POSIX-"like" file system, with some caveats
  – Write once, read many
  – Doesn't support updates to files (simple consistency model)
  – Pivotal HD supports append and truncate on its HDFS layer
• Access patterns are well suited to SATA disk drives
  – Fewer seeks
  – Read large, contiguous blocks
• Prefer fewer, large files
  – Files are split into blocks (128MB default)
  – Blocks are evenly distributed across the cluster
HAWQ Data Storage and I/O
HAWQ Data Storage and I/O
• Segments are HDFS clients that run on DataNodes
• Each table's data is sharded on HDFS
• The DataNodes are responsible for serving read and write requests from HAWQ segments
  – Data stored in HAWQ database tables
• Data stored external to HAWQ but within the Hadoop cluster (HDFS, Hive, HBase) can be read using PXF external tables, and the framework is extensible
• Data stored in HAWQ can be written to HDFS for external consumption using writable HDFS table support
Segment Files
• Each table's data is sharded on HDFS, for example:
  /hawq_data/gpseg<ID>/<DB OID>/<schema OID>/<table OID>.1,2,3,4,…
• Data inserted into the same segment is always appended to the segment file
• The maximum file size in HDFS is governed by the dfs.namenode.fs-limits.max-blocks-per-file setting in the hdfs-site.xml configuration file
  – The default is 1048576 blocks, which is 64TB
Data Locality
• For tables using a hash distribution, data with the same hash key will always be handled by the same segment and is always written to the same DataNode as the segment host
• Data locality will always be maintained unless one of the following conditions occurs:
  – The DataNode on the segment host is at full file capacity
  – The DataNode on the segment host fails
  – The DataNode experiences more failed drives than the value specified by the dfs.datanode.failed.volumes.tolerated configuration parameter
• Data locality is lost permanently when a DataNode fails for long enough that the NameNode marks it down
Local Read Failures
• When a read from a local DataNode on a segment host fails, reads are performed from a remote DataNode (a replicated copy)
• Performance impact of approximately 70%
  – This number quickly decreases with subsequent reads as a result of caching the data
  – Decreases to 10% on subsequent reads when the cache is hit
HDFS I/O
• HDFS is a Java file system
  – libhdfs (JNI) is used to access HDFS
• Reading through the HDFS indirection layer is 1.75 to 2.5 times slower than reading directly from disk
  – The cost of simply reading, doing an IPC into a Java JVM, and Java reaching out to the file system
• Reading through libhdfs in Java is slow (garbage collection + overhead)
65 Capgemin / Pivotal Alliance Confidential–Do Not Distribute July 2014 New Hire Immersion Training © Copyright 2014 Pivotal. All rights reserved.
libhdfs3
• Pivotal rewrote libhdfs in C++, resulting in libhdfs3
– A C++ library exposing a C-style API
– Leverages protocol buffers to achieve greater performance
• libhdfs3 is used to access HDFS from HAWQ
• There is a GUC to disable libhdfs3, but it exists for internal testing and debugging by engineering
– It should never be turned off or disabled in the field
HAWQ Reads and HDFS
• In HAWQ, data is physically partitioned/sharded across the cluster
• HDFS is not designed for a single query to access a large number of small files
• For every DataNode running HAWQ, for every segment, for every partition, and for every column (if using column orientation), a substantial amount of metadata is needed from the NameNode
• By contrast, a typical MapReduce job reads one large contiguous file from HDFS, carves it into partitions at run time, and executes
HAWQ Reads
• The HAWQ master has a centralized catalog metadata store
• HDFS has a NameNode metadata store
• The HAWQ master must interrogate the NameNode to obtain metadata, then dispatch it along with the query plan to each of the HAWQ segments
• The segments then call back to the NameNode to obtain a block location array consisting of block IDs
– For any given shard, the actual block IDs that need to be read are not known in advance
HAWQ Data Storage Performance Considerations
• Data is still split per segment, so there is one file per object, per segment
• There can be a large number of partitions depending on the partition granularity
– Every partition is a file
• Columnar orientation on very wide tables
– Every column is a file
• This can result in:
– Many very small files
– A huge number of calls to the NameNode
– Errors (particularly when loading) and slowness (when running queries)
Solution
• You must consider #Segments × #Columns × #Partitions
– For example, 16 segments × 100 columns × 365 daily partitions is 584,000 HDFS files for a single table
• In general, determine the optimal number of segments on the DataNodes
• Use a coarser partition granularity
• Limit columnar orientation on very wide tables
– If the partition granularity requirement is low, use row-based table orientation
• NEVER use partitioning and column orientation together!
HAWQ Distributions, Partitioning and Storage Options
HAWQ Data Distributions
• Same functionality and behavior as GPDB
– Data locality/co-located joins, redistribution, broadcasts, etc.
– Most important is an even distribution of data!
• Loading randomly distributed tables is faster on larger tables since the data does not get hashed
• There is no difference in sequential scans on randomly distributed tables vs. hash-distributed tables
• Complex queries (joins, aggregates, sorts) on large randomly distributed tables take longer due to the re-hash of data for local joins and aggregates
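The two distribution choices above can be sketched in Greenplum-style DDL (table and column names are hypothetical):

```sql
-- Hash distribution: rows with the same customer_id always land on the
-- same segment, enabling co-located joins on that key
CREATE TABLE sales_hash (
    sale_id     bigint,
    customer_id bigint,
    amount      numeric
) DISTRIBUTED BY (customer_id);

-- Random distribution: faster to load (no hashing), but complex queries
-- must redistribute the data at run time
CREATE TABLE sales_rand (LIKE sales_hash) DISTRIBUTED RANDOMLY;
```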
Loading Varying Storage Options
• Loading columnar tables takes approximately 5-10x longer than loading the same row-based table
• Loading compressed row (or columnar) tables introduces only slight overhead, 20% or less on small tables/loads
– On larger tables it is actually 5-10% faster because less data (fewer blocks) is written to HDFS
• zlib compression reduced the storage footprint by 50% on very high cardinality data
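As a sketch, a zlib-compressed append-only row table might be declared like this (table name and compression level are illustrative):

```sql
-- Append-only row-oriented table with zlib compression: slight load
-- overhead on small loads, fewer HDFS blocks written on large ones
CREATE TABLE events_compressed (
    event_id bigint,
    payload  text
) WITH (appendonly=true, orientation=row,
        compresstype=zlib, compresslevel=5)
DISTRIBUTED BY (event_id);
```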
Compressed Row-Based vs Non-Compressed Row-Based
• Sequential scan operations (for example, SELECT count(x)) take 2-6x longer on compressed tables, depending on table size
– As the table size increases, the difference in query time shrinks
• On more complex queries with aggregates and sort operations, the difference in query time is almost unnoticeable
Querying Row vs Columnar-Based Tables
• A sequential scan selecting a few columns takes only marginally less time on small columnar tables than on the same row-based table
– As table size increases there is no perceptible performance difference
• Wide queries and joins that read all columns of a columnar table do not show a significant difference in query time versus the same row-based table
• For complex queries (sorts, aggregates, joins) that involve only a subset of columns, the difference between columnar and row-based is negligible
– The majority of time for these queries is spent on the sort/aggregation operations, not the HDFS read
Partitioned Row-Based vs Non-Partitioned Row-Based
• Load times on small partitioned tables are 5x slower than on non-partitioned tables
• Load times on large partitioned tables were 2-3x slower than on non-partitioned tables
• Sequential scans take 130-200% longer on partitioned tables
• For complex queries (aggregates, joins, sorts) without a WHERE clause eliminating partitions, query time was actually faster on partitioned tables for larger tables
– This is likely due to the increased parallelism achieved with partitioned tables
HAWQ Partitioning
• There can be a large number of partitions depending on the partition granularity
– Every partition is a file in HDFS
– This may result in many small files, which is not desired
• In general, use partitioning on very large tables, but use a coarser partition granularity so there are fewer, larger files
• Do not use partitioning if load performance is critical
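A minimal sketch of coarser partition granularity, using Greenplum-style range partitioning (names and dates are hypothetical):

```sql
-- Monthly rather than daily partitions keeps the file count down:
-- 12 files per segment per year instead of 365
CREATE TABLE fact_sales (
    sale_id   bigint,
    sale_date date,
    amount    numeric
) DISTRIBUTED BY (sale_id)
PARTITION BY RANGE (sale_date)
( START (date '2014-01-01') INCLUSIVE
  END   (date '2015-01-01') EXCLUSIVE
  EVERY (INTERVAL '1 month') );
```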
HAWQ Columnar Storage
• Do not use columnar orientation on very wide tables
– Every column is a file in HDFS
• If the partition granularity requirement is low, use row-based table orientation
• Optimally you want bigger files and fewer NameNode calls for scanning the same amount of data
• NEVER use columnar tables with partitioning!
– This is very different from GPDB
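For reference, a column-oriented table is declared with the orientation storage option (a sketch with hypothetical names):

```sql
-- Column-oriented storage: one HDFS file per column per segment,
-- so keep it to narrow tables and do NOT combine with partitioning
CREATE TABLE metrics_co (
    metric_id bigint,
    value     double precision
) WITH (appendonly=true, orientation=column)
DISTRIBUTED BY (metric_id);
```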
Running Queries in HAWQ
SQL Querying
• Uses the pipelined method of execution developed for Greenplum Database
– Efficient parallel execution
– No MapReduce used behind the scenes
– No intermediate materialization of data
• The only difference in operator-level execution compared to Greenplum Database is the scan node
– The scan node is the operator that reads data from HDFS, versus reading from the local file system in a Greenplum database
SQL Querying Caveats
• SQL query support is similar to Greenplum Database
– Support for advanced SQL such as OLAP and analytical functions (e.g., MADlib)
• No updates
• No deletes
• No support for indexes
• No GPText
Query Example
• SQL is submitted to the HAWQ master
– Validates SQL and parses the query
– The Query Optimizer produces the plan
– The HAWQ master obtains metadata from the NameNode and annotates the query plan with the metadata the segments need for execution
• The HAWQ master dispatches the plan to every segment
• Segments call back to the NameNode to obtain a block location array consisting of block IDs
• The libhdfs3 read operation begins, retrieving data from whichever DataNodes in the cluster it needs and returning data to upper-level operators
• Upper-level operators (e.g., hash-join, hash-agg) carry on the execution, using motion operators as needed
Query Using PXF External Tables
• Data can be queried from external data sources and joined with HAWQ data using the external table methodology
• Regular external tables can be used for data residing outside the Hadoop ecosystem
• For data residing in the Hadoop ecosystem, PXF external tables can be used
– Read HDFS, HBase, Hive and other formats using standard SQL
3rd Party Application Querying
• JDBC interface for HAWQ
– Used for queries, but should not be used for inserts
• JDBC DDL operations (CREATE TABLE, TRUNCATE TABLE) fall into a transaction block
– Meaning if you create a table in a transaction block and then roll back the transaction, you will never see that table
• Cannot perform updates, deletes, or create indexes
Loading and Unloading Data in HAWQ
Loading Data into HAWQ
• When the data sources are outside the Hadoop ecosystem
– Use regular gpfdist external tables
– Use the COPY command for loading small data sets only
• When the data sources are in the Hadoop ecosystem
– Use PXF external tables
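A minimal sketch of a gpfdist bulk load (server host names, ports, file patterns, and table names are hypothetical):

```sql
-- Readable external table pointing at two gpfdist servers
CREATE EXTERNAL TABLE ext_sales (
    sale_id bigint,
    amount  numeric
)
LOCATION ('gpfdist://etl1:8081/sales*.txt',
          'gpfdist://etl2:8081/sales*.txt')
FORMAT 'TEXT' (DELIMITER '|');

-- Segments pull from the gpfdist servers in parallel
INSERT INTO sales SELECT * FROM ext_sales;
```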
HAWQ Data Loading Options
HDFS DataNode
HAWQ Segment Host
HDFS DataNode
HAWQ Segment Host
HDFS DataNode
HAWQ Segment Host . . . Query Executor Query Executor Query Executor
Clients JDBC
SQL Console
insert into <hawq-target-table> select * from <regular external table>;
HDFS Namenode
HAWQ Master Host
Query Optimizer Query Parser
Interconnect
External Data Sources
insert into <hawq-target-table> select * from <pxf external table>;
COPY command
HAWQ Writes and Performance
• The fastest method to write data into HAWQ is gpfdist
• In testing, the gpfdist write process capped at 1 GB/sec (with 1 gpfdist server and 64 segment readers)
– This speed increases linearly with added gpfdist servers
• In testing, hadoop fs -put capped at about 130 MB/sec
• A PXF external table copy to a HAWQ table capped at 600 MB/sec (for 64 segments)
• In testing, a gpfdist external table copy is approximately 160% faster than PXF external tables
Write Paths
• Using gpfdist, HAWQ segments read chunks of data from the gpfdist servers in parallel, hash on the distribution key, and send the data to the correct segment server, where it is written to HDFS locally by the DataNode
• Using PXF external tables, HAWQ segments request chunks of data from the PXF fragmenter; PXF reads the data via a set of PXF accessors and returns it to the segment; the segment then hashes on the distribution key and sends the data to the correct segment (likely not on the same DataNode) to be written to HDFS by the DataNode
– This path shows the highest number of NameNode RPC calls, since both the PXF fragmenter and the segments make NameNode calls for block locations
Optimizing gpfdist for Performance
• In general, maximize the parallelism as the number of segments increases
• Spread the data evenly across as many nodes as possible
• Spread the data evenly across as many file systems as possible
– Run two gpfdists per file system
• Run gpfdist on as many interfaces (NICs) as possible
• Keep the work even across ALL of these resources
– In an MPP shared-nothing environment, loading is only as fast as the slowest node
gp_external_max_segs Optimization
• Controls the maximum number of segments each gpfdist serves
• Keep gp_external_max_segs and the number of gpfdist processes an even factor
– gp_external_max_segs / # of gpfdist processes should have a remainder of 0
• Default is 64
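A sketch of checking and tuning this GUC at the session level (the parameter name comes from the slide; the session-level SET is an assumption):

```sql
SHOW gp_external_max_segs;        -- default is 64
-- With 4 gpfdist processes, 64 / 4 = 16 segments per gpfdist
-- (remainder 0, so the work divides evenly)
SET gp_external_max_segs = 64;
```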
Error Handling
• Single-row error handling
– Supported in external tables and the COPY command
– Define a table to catch the 'unloadable' rows
– The load continues; it does not fail
• Reject limit
– Caps the number of rejects
– Once the limit is met, the load statement fails
– The limit can be an actual number or a percent
– Rejects are evaluated at the segment
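The two mechanisms above combine in Greenplum-style external table DDL, roughly like this (table names, location, and the limit of 100 rows are hypothetical):

```sql
-- Catch unloadable rows in an error table instead of failing the load,
-- but abort the statement once more than 100 rows are rejected
CREATE EXTERNAL TABLE ext_sales_safe (
    sale_id bigint,
    amount  numeric
)
LOCATION ('gpfdist://etl1:8081/sales*.txt')
FORMAT 'TEXT' (DELIMITER '|')
LOG ERRORS INTO sales_errs
SEGMENT REJECT LIMIT 100 ROWS;
```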
Loading Recommendations
• The default recommendation is to bulk load through gpfdist external tables
– Suitable from an HDFS perspective
• To load smaller amounts of data (example: <100,000 rows)
– The COPY command can be used
• Single-row inserts are not recommended
– Not suitable from an HDFS perspective
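For small data sets, COPY runs through the master rather than in parallel through the segments — a sketch (file path and table name are hypothetical):

```sql
-- Fine for small data sets (e.g. under ~100,000 rows); the data
-- streams through the master, not in parallel via gpfdist
COPY sales FROM '/data/sales_small.csv'
WITH DELIMITER ',' CSV HEADER;
```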
Unloading Data
• Regular writable external tables can be used for scalable unload
– Same as in GPDB
• The COPY command can be used for unloading small data sets
• Example of unloading to HDFS:

DROP EXTERNAL TABLE IF EXISTS foo_dump;
CREATE WRITABLE EXTERNAL WEB TABLE foo_dump ( LIKE foo )
EXECUTE 'hadoop fs -put - hdfs://pivhdsne:8020/dump/foo/${GP_SEGMENT_ID}.tsv'
FORMAT 'TEXT' (DELIMITER E'\t');
INSERT INTO foo_dump SELECT * FROM foo;
Lab
HBASE_HAWQ_LOAD Lab
Loading Data into HBase and HAWQ
• Load dimension tables into HBase using importtsv
– Data is in HDFS
• Load data into HAWQ tables using COPY
– Data is in DAS
• Load a HAWQ AO table using SELECT from one of the PXF external tables defined
PXF External Tables
PXF is...
A fast, extensible framework connecting HAWQ to a data store of choice that exposes a parallel API
HAWQ External Tables
• gpfdist
– Remote delimited text (or CSV) files
• file
– Text files on the segment file system
• execute
– Script execution and the data it produces
• pxf
– Text and binary data from the available PXF connectors
PXF
• Load data into HAWQ from Hadoop
• Query Hadoop data without materializing it into HAWQ
– HDFS: delimited text, CSV, Sequence, Avro
– HBase (w/filter pushdown)
– Hive (w/partition exclusion)
▪ Text, Sequence and RCFile formats
• Write HAWQ data to HDFS
– Delimited text, CSV, Sequence
– Various compression codecs and options
• Extensible!
– GemFireXD, JSON format, Cassandra, Accumulo, others
PXF Features
• Supports filtering through predicate pushdown in HBase
– <, >, <=, >=, =, != between a column and a constant
– These can be combined with AND (but not OR)
• Supports Hive table partitioning
• Ability to analyze data stored on HDFS
– The HAWQ optimizer uses the statistics to generate optimal plans for PXF external tables
• Extensible framework with a Java API to enable custom development for other data sources and custom formats
Key Use Cases
• Using analytics and SQL query functionality from HAWQ on HDFS, HBase, or Hive data without materialization into HAWQ
• Joining dimension tables stored in HAWQ with HBase fact tables
• Fast ingest/materialization of high-value processed data from HDFS, Hive or HBase into HAWQ
PXF Differentiators
• Utilizes the fast parallel HAWQ optimizer
• Applies data locality optimizations to reduce resources and network traffic
• Extensible framework
– Customers and partners can configure support for any new data store and automatically get fast, parallel data transfer
• JSON format, Cassandra, Accumulo in beta
• Supports ANALYZE for gathering HDFS file statistics and making them available to the query planner at run time
Feature Summary
★ HBase (w/filter pushdown)
★ Hive (w/partition exclusion, various storage file types)
★ HDFS files: read (delimited text, CSV, Sequence, Avro)
★ HDFS files: write (delimited text, CSV, Sequence, various compression codecs and options)
★ GemFireXD, JSON format, Cassandra, Accumulo (currently beta)
★ Statistics collection
★ Automatic data locality optimizations
★ Extensibility!
PXF Components
• Fragmenter
– Runs against the NameNode
– Passes metadata about the data source (blocks and locations) back to the HAWQ master
• Accessor
– Responsible for reading specific data fragments and passing them to the Resolver
• Resolver
– De-serializes the records and serializes them into a list of one-field objects
– The one-field objects are converted into GPDBWritable that can be read by HAWQ
• Analyzer
– Responsible for collecting statistics on external table data for use by the HAWQ optimizer
PXF Loading into HAWQ
• To load data into HAWQ, use a variation of:
– insert into <hawq-target-table> select * from <pxf-external-table>;
• Data can be transformed in-flight before loading
• Data from Hadoop can also be joined in-flight with HAWQ data while loading
• The number of segments responsible for connecting to Pivotal HD for concurrent reading of data can be tuned
– gp_external_max_segs GUC
– Default 64
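An in-flight transform and join during a PXF load might look like this (the PXF external table, target table, and column names are hypothetical):

```sql
-- Transform in-flight (normalize region to upper case) and join the
-- Hadoop-resident data with a HAWQ dimension table while loading
INSERT INTO sales_enriched
SELECT e.sale_id,
       upper(e.region),
       d.customer_name
FROM   ext_pxf_sales e          -- PXF external table
JOIN   customers_dim d          -- regular HAWQ table
       ON d.customer_id = e.customer_id;
```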
PXF Querying
• PXF external tables can be queried directly without materialization into HAWQ
• PXF data can be joined with HAWQ tables
• The ability to ANALYZE external tables helps the HAWQ optimizer choose optimal plans
• HBase predicate pushdown
• Hive partitioning
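Querying a PXF external table directly, joined with a HAWQ table, might look like this (table and column names are hypothetical; whether the filter is actually pushed down to HBase depends on the pxf_enable_filter_pushdown GUC covered in the lab):

```sql
-- No materialization: the HBase-backed external table is scanned in
-- place, and the constant filter is a pushdown candidate
SELECT d.customer_name, sum(f.amount) AS total
FROM   hbase_sales f            -- PXF external table over HBase
JOIN   customers_dim d          -- regular HAWQ table
       ON d.customer_id = f.customer_id
WHERE  f.amount > 1000
GROUP  BY d.customer_name;
```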
Profiles
• Improved user experience
• Informative error messages

Instead of spelling out each plugin:
LOCATION('pxf://<host:port>/sales?fragmenter=HiveFragmenter&accessor=HiveAccessor&resolver=HiveResolver')
a profile names the whole set:
LOCATION('pxf://<host:port>/sales?profile=Hive')
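A full external table definition using the Hive profile might look like this (host, port, and table/column names are hypothetical; the formatter name follows the HBase example later in this deck):

```sql
CREATE EXTERNAL TABLE hive_sales (
    id    int,
    total int
)
LOCATION ('pxf://namenode:50070/sales?profile=Hive')
FORMAT 'custom' (formatter='gpxfwritable_import');
```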
profiles.xml
<profile>
<name>HBase</name>
<description>Used for connecting to an HBase data store engine</description>
<plugins>
<fragmenter>HBaseDataFragmenter</fragmenter>
<accessor>HBaseAccessor</accessor>
<resolver>HBaseResolver</resolver>
<myidentifier>MyValue</myidentifier>
</plugins>
</profile>
HDFS Files Example
Analyze all text files inside the HDFS directory 'sales/2012/01':

CREATE EXTERNAL TABLE jan_2012_sales (
  id int,
  total int,
  comments varchar
)
LOCATION ('pxf://10.76.72.26:50070/sales/2012/01/items_*.csv?profile=HdfsTextSimple')
FORMAT 'TEXT' (delimiter ',');
HBase Table Example
Get data from an HBase table called 'sales'. In this example we are only interested in the row key, the qualifier 'saleid' inside column family 'cf1', and the qualifier 'comments' inside column family 'cf8' (a direct mapping):

CREATE EXTERNAL TABLE hbase_sales (
  recordkey bytea,
  "cf1:saleid" int,
  "cf8:comments" varchar
)
LOCATION ('pxf://10.76.72.26:50070/sales?profile=HBase')
FORMAT 'custom' (formatter='gpxfwritable_import');
Writable PXF – Export to HDFS
• gphdfs-like functionality, but extensible
– Supports text, CSV, SequenceFile
– Supports various Hadoop compression codecs

CREATE WRITABLE EXTERNAL TABLE ...
LOCATION ('pxf://<host:port>/sales?profile=HdfsTextSimple&COMPRESSION_CODEC=org.apache.hadoop.io.compress.GzipCodec')
FORMAT 'text' (delimiter ',');

Alternatively, you can create a new profile "HdfsTextSimpleGZipped" that includes the compression codec:
LOCATION ('pxf://<host:port>/sales?profile=HdfsTextSimpleGZipped')
Lab
PXF_PUSHDOWN Lab
PXF external tables predicate pushdown
• Review the HAWQ DDL
• Run the HBase query
– customers_dim table
• Using SHOW, check the value of the GUC pxf_enable_filter_pushdown
– Toggle the value to "off"
• Rerun the query
Lab
PXF_STATS Lab
PXF external tables statistics
• SELECT relpages, reltuples FROM pg_class WHERE relname = 'table_name';
• ANALYZE table_name;
• SELECT relpages, reltuples FROM pg_class WHERE relname = 'table_name';
HAWQ, Hive, HBase Comparative Usage
Hive
• Hive uses a SQL-based language called HiveQL, which is a subset of SQL with some additional MapReduce-specific syntax
• Hive interprets SQL into a series of native MapReduce jobs
– Materializes data to disk
• Hive can manage its own tables or use external tables
– No inherent performance difference, just ease of management
• Hive is typically used as the integration point for BI and ETL tools
• Hive has only a rudimentary query optimizer
HBase
• HBase is a scalable, sorted, columnar, key-value data store
– Linear scalability
– Keys are sorted and partitioned, so fetching by key is fast
▪ Can support range scans
– Data is split into column families, which are stored underneath as separate files
▪ Allows for column pruning
– Stores data as a "doubly-nested map" key-value
▪ Row key -> column family:label -> data value
▪ Allows for a very flexible schema, as the label is arbitrary
▪ Can support hierarchical data structures easily
HAWQ, HBase, and Hive Comparison

Item | HAWQ | HBase | Hive
Interface | ANSI SQL | Java API/Shell | HiveQL (SQL subset)
Client Connection | JDBC/PXF | Java/REST API | JDBC (limited)
Executes as MapReduce | Never | Yes | Yes
SQL Completeness | Yes | No | No
Nodes (Supported) | 1,000+ | 1,000+ | 1,000+
Restart SQL on failure | No | No | Yes
Performance | High | Low | Low
Rely on MapReduce | No | Yes | Yes
Open Source | No | Yes | Yes
DDL | Yes | No | Yes
ANSI Data Types | Yes | No | Yes
Indexes | No | Yes | No
When to use | Ad hoc analytics | Flexible schema/updates | Batch
HAWQ, HBase, and Hive Comparison (continued)

Item | HAWQ | HBase | Hive
User Defined Data Distributions | Yes | No | Yes (limited)
Advanced Partitioning | Yes | Yes (limited) | Yes (limited)
Robust SQL Optimizer | Yes | No | No
Store Data on HDFS | Yes | Yes | Yes
Has its own daemons | Yes | Yes | No
Relational Database | Yes | No | No
Manage own tables | Yes | Yes | Yes
UDFs | Yes | No | No
MADlib | Yes | No | No
Lab
HIVE_VS_HAWQ Lab
Query Performance Comparison
• Run the Hive DDL to create Hive external tables against the existing HDFS data
• Run queries against the HAWQ and Hive versions of these tables
Securing PHD Clusters
Security…Has Many Faces
• Data Protection
– Data access control
– Data-at-rest encryption
– Masking/tokenization for data load
– Data-in-motion encryption
• User Management/Authentication/Authorization
• GRC (Governance, Risk, Compliance) / System Security
Security Dashboard

Component | Supports secure cluster | Supports Kerberos for authentication | Supports LDAP for authentication
HDFS | Yes | Yes | Linux OS supports
MapReduce/Pig | Yes | N/A |
Hive | Yes (standalone mode) | N/A |
Hiveserver | No | No |
Hiveserver2 | Yes | Yes | Yes
HBase | Yes | Yes | Yes
HAWQ* | Yes | Yes | Yes
GemfireXD | Yes | Yes | Yes
Hadoop Security
• Pivotal HD follows the Hadoop community
– Today limited to Kerberos
– The intent is the ability to plug in mechanisms other than Kerberos, as well as open-source single sign-on gateways
• Requires a KDC to manage cluster authentication
• Once a user is authenticated
– Provides authorization by enforcing HDFS file permissions
– Ensures jobs run as that user in a Linux container
• We support everything Cloudera does in terms of securing a cluster
Data Access Control
• Kerberos for user authentication
• Jobs run in secure Linux containers
• Allows HDFS file permissions to be enforced
– Similar to Linux file permissions
• Prevents service and user spoofing
Challenges
• It is important to understand security expectations and requirements
• Many types of security are not addressed by Hadoop
– Data-at-rest protection
• Hadoop supports data-in-motion encryption, but the performance impact is high
– On-wire encryption is not recommended, especially 3DES
3rd Party Data Protection Solutions
Encryption, Masking, Tokenization, Token Management

Company | Masking | Tokenization | Encryption
Gazzang | No | No | Yes
Protegrity | Yes | Yes | Yes
DataGuise | Yes | Yes | No
Thank you jfunk@pivotal.io