A NEW PLATFORM FOR A NEW ERA

Capgemini / Pivotal Alliance Confidential–Do Not Distribute Just Do It Immersion Program © Copyright 2014 Pivotal. All rights reserved.

Greenplum DB Technical Overview aka GPDB

John Funk


Business Data Lake Architecture


Pivotal HD Architecture

[Diagram: Pivotal HD stack]
•  Apache components: HDFS; HBase; Pig, Hive, Mahout; MapReduce; Sqoop; Flume; Resource Management & Workflow (YARN, ZooKeeper)
•  Pivotal HD Enterprise: Command Center (configure, deploy, monitor, manage); Data Loader; Spring; Unified Storage Service; Hadoop Virtualization Extension
•  HAWQ – Advanced Database Services: Xtension Framework; Catalog Services; Query Optimizer; Dynamic Pipelining; ANSI SQL + Analytics
•  GemFire XD – Real-Time Database Services: Distributed In-memory Store; Query Transactions; Ingestion Processing; Hadoop Driver – Parallel with Compaction; ANSI SQL + In-Memory
•  MADlib Algorithms


Where should we put Data?
•  When do I need it? Now / Later
•  What do I want to do with it? Singular event processing (OLTP analytics) / Transactions / Exploratory analytics / Structured, deep analytics
•  How do I need to store it? Temporarily / I want to, but am not required / I must and am required to
•  How will I query/search? Structured, regular / Using an alternative index (other source) / Unstructured, unknown / Ad hoc SQL
•  Where is it coming from? Events, stream, message stream / File / ETL

Legend: GemFire XD / Pivotal HD / GP / Hadoop + GP / All 3 solutions


Big Data: Industry Perspective

Retail • CRM – Customer Scoring • Store Siting and Layout • Fraud Detection / Prevention • Supply Chain Optimization

Advertising & Public Relations • Demand Signaling • Ad Targeting • Sentiment Analysis • Customer Acquisition

Financial Services • Algorithmic Trading • Risk Analysis • Fraud Detection • Portfolio Analysis

Media & Telecommunications • Network Optimization • Customer Scoring • Churn Prevention • Fraud Prevention

Manufacturing • Product Research • Engineering Analytics • Process & Quality Analysis • Distribution Optimization

Energy • Smart Grid • Exploration

Government • Market Governance • Counter-Terrorism • Econometrics • Health Informatics

Healthcare & Life Sciences • Pharmaco-Genomics • Bio-Informatics • Pharmaceutical Research • Clinical Outcomes Research


Internet of Things

Value of 1% efficiency improvement


Virtuous Cycle of Innovation

Key elements of Industrial Internet


Extreme Performance for Analytics

•  Performance through parallelism
   –  True performance through a shared-nothing MPP architecture
   –  In-place, incremental scaling
   –  Optimized for analytic workloads
   –  Parallel function execution
•  Simple and automatic
   –  Just load and query like any database
   –  Tables are automatically distributed across nodes
•  Flexibility and choice
   –  True column- and row-based storage
   –  Deep Hadoop integration
   –  Broad partner support
   –  Support for the deployment options right for you

GREENPLUM DATABASE


Architecture: Performance Via Parallelism

•  Scale-out architecture on standard commodity hardware
•  Automatic parallelization
   –  Load and query like any database
   –  Tables are automatically distributed across all nodes
   –  No need for manual partitioning or tuning
•  Extremely scalable MPP shared-nothing architecture
   –  All nodes can scan and process in parallel
   –  Linear scalability by adding nodes
   –  Online expansion when adding nodes

[Diagram: Greenplum Database layers: loading, interconnect, compute, storage]


Performance: Parallel Query Optimizer
•  Cost-based optimization looks for the most efficient plan
•  Physical plan contains scans, joins, sorts, aggregations, etc.
•  Global planning avoids sub-optimal ‘SQL pushing’ to segments
•  Directly inserts ‘motion’ nodes for inter-segment communication

[Diagram: physical execution plan from SQL or MapReduce. Operators, top to bottom: Gather Motion 4:1 (Slice 3); Sort; HashAggregate; HashJoin; Redistribute Motion 4:4 (Slice 1); HashJoin; Hash; Hash; HashJoin; Hash; Broadcast Motion 4:4 (Slice 2); Seq Scan on motion; Seq Scan on customer; Seq Scan on line item; Seq Scan on orders]
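The three motion types named in the plan can be illustrated with a toy model. This is a minimal sketch, not Greenplum's implementation: segments are plain Python lists, `key % n` stands in for the real distribution hash, and all table and function names are hypothetical.

```python
from collections import defaultdict

def redistribute(segments, key):
    """Redistribute Motion N:N: send each row to the segment that owns its key."""
    n = len(segments)
    out = [[] for _ in range(n)]
    for rows in segments:
        for row in rows:
            out[row[key] % n].append(row)  # stand-in for the real hash function
    return out

def broadcast(segments):
    """Broadcast Motion N:N: every segment receives a full copy of the table."""
    everything = [row for rows in segments for row in rows]
    return [list(everything) for _ in segments]

def gather(segments):
    """Gather Motion N:1: a single node collects the per-segment results."""
    return [row for rows in segments for row in rows]

def local_hash_join(left, right, key):
    """Hash join executed independently inside one segment."""
    index = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

# Four segments. customers is already placed by cust_id; orders is not.
customers = [
    [],
    [{"cust_id": 1, "name": "Ann"}],
    [{"cust_id": 2, "name": "Bob"}],
    [{"cust_id": 3, "name": "Cy"}],
]
orders = [
    [{"order_id": 1, "cust_id": 2}],
    [{"order_id": 2, "cust_id": 1}],
    [{"order_id": 3, "cust_id": 2}],
    [{"order_id": 4, "cust_id": 3}],
]

moved = redistribute(orders, "cust_id")           # co-locate orders with customers
joined = [local_hash_join(moved[i], customers[i], "cust_id") for i in range(4)]
result = sorted(gather(joined), key=lambda r: r["order_id"])
```

Redistributing only the mis-placed table keeps the join local to each segment; broadcasting is the alternative when one side is small enough to copy everywhere.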


Performance: Dynamic Pipelining
•  A supercomputing-based “soft-switch” responsible for
   –  Efficiently pumping streams of data between motion nodes during query-plan execution
   –  Delivering messages, moving data, collecting results, and coordinating work among the segments in the system

[Diagram: Dynamic Pipelining Software Interconnect]


Architecture: Scalability with Scale-Out

Advantages:
•  Scale In-Place
•  No Forklifting
•  Immediately Usable

Simple Process:
•  Connect New Hardware
•  Simple Restart
•  Schedule Redistribution of Existing Data

[Diagram: Greenplum Database master (query planning & dispatch) with new segment servers added alongside the existing ones]


Loading: Industry’s Fastest
•  Industry-leading performance at 10+ TB per hour per rack
•  Scatter-Gather Streaming™ provides true linear scaling
•  Support for both large-batch and continuous real-time loading strategies
•  Enable complex data transformations “in-flight”
•  Transparent interfaces to loading via support files, application, and services

[Chart: single-rack comparison of load rates for Greenplum, Oracle Exadata, Netezza, and Teradata]

Greenplum load rates scale linearly with the number of racks, others do not. For example, two racks = >20 TB/h.
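The linear-scaling claim is simple arithmetic. The sketch below takes the slide's 10 TB/h per rack as a nominal figure and assumes perfectly linear scaling; the function names are illustrative only.

```python
def aggregate_load_rate_tb_per_hour(racks, per_rack_tb_per_hour=10.0):
    """Aggregate load rate under the assumption of perfectly linear scaling."""
    if racks < 0:
        raise ValueError("racks must be non-negative")
    return racks * per_rack_tb_per_hour

def hours_to_load(dataset_tb, racks, per_rack_tb_per_hour=10.0):
    """Time to load a dataset of the given size at the aggregate rate."""
    return dataset_tb / aggregate_load_rate_tb_per_hour(racks, per_rack_tb_per_hour)

two_rack_rate = aggregate_load_rate_tb_per_hour(2)   # 20.0 TB/h
load_time = hours_to_load(100, 2)                    # 5.0 hours for 100 TB
```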


Loading: Massively-Parallel Ingest

•  Fast Parallel Load & Unload
   –  No Master Node bottleneck
   –  10+ TB/Hour per Rack
   –  Linear scalability
•  Low Latency
   –  Data immediately available
   –  No intermediate stores
   –  No data “reorganization”
•  Load/Unload To & From:
   –  File Systems
   –  ETL Products
   –  Hadoop Distributions

Extreme speed and immediate usability from files, ETL & Hadoop

[Diagram: external sources (SQL, ETL, file systems) load and stream over the gNet network interconnect; master servers handle query planning & dispatch; segment servers handle query processing & data storage]


Most Powerful Data Loading Capabilities
•  Industry-leading performance at 16+ TB per hour per rack
•  Scatter-Gather Streaming™ provides true linear scaling
•  Support for both large-batch and continuous real-time loading strategies
•  Enable complex data transformations “in-flight”
•  Transparent interfaces to loading via support files, application, and services

[Chart: single-rack comparison of load rates for Greenplum, Oracle Exadata, Netezza, and Teradata]

Greenplum load rates scale linearly with the number of racks, others do not. For example, two racks = >32 TB/h.


Multi-Level Partitioning
•  Hash Distribution to evenly spread data across all segment instances
•  Range Partition within a segment instance to minimize scan work

[Diagram: a data set hash-distributed across segments 1A–1D, 2A–2D, and 3A–3D; within each segment, data is range-partitioned by month from Jan 2007 to Dec 2007]
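The two levels can be sketched as follows: a hash of the distribution key picks the segment, and a monthly range partition inside each segment lets a query skip irrelevant months. This is a toy model under stated assumptions: `zlib.crc32` stands in for Greenplum's hash function, and the table, column, and class names are hypothetical.

```python
import zlib
from datetime import date

NUM_SEGMENTS = 12  # twelve segment instances, as in the diagram (1A through 3D)

def segment_for(distribution_key: str) -> int:
    """Level 1: hash distribution chooses the segment instance."""
    return zlib.crc32(distribution_key.encode()) % NUM_SEGMENTS

def partition_for(d: date) -> str:
    """Level 2: range partition by calendar month."""
    return d.strftime("%Y-%m")

class Segment:
    def __init__(self):
        self.partitions = {}  # month -> rows stored in that partition

    def insert(self, row):
        self.partitions.setdefault(partition_for(row["order_date"]), []).append(row)

    def scan(self, month):
        # Partition pruning: only the matching month is read
        return self.partitions.get(month, [])

segments = [Segment() for _ in range(NUM_SEGMENTS)]
rows = [
    {"customer": f"cust-{i}", "order_date": date(2007, 1 + i % 12, 1), "amount": i}
    for i in range(1200)
]
for row in rows:
    segments[segment_for(row["customer"])].insert(row)

# Every segment scans in parallel, but each touches one of its twelve partitions
march = [r for s in segments for r in s.scan("2007-03")]
```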


Architecture: Polymorphic Storage™
•  Enable Information Lifecycle Management (ILM)
•  Storage types can be mixed within a table or database
   –  Four table types: heap, row-oriented append, column-oriented append, and external
•  Rich compression functionality, definable column by column
   –  Blockwise: Gzip 1-9 & QuickLZ
   –  Streamwise: RLE (levels 1-4)
•  Flexible indexing, partitioning, and more

[Diagram: table ‘CUSTOMER’ partitioned by month, Mar ’11 to Nov ’11; row-oriented storage for hot data, column-oriented storage for cold data]
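One reason column-oriented storage pairs well with streamwise compression is that values within a column tend to repeat. The sketch below shows generic run-length encoding, not Greenplum's actual RLE format; the sample data is made up.

```python
from itertools import groupby

def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(v, len(list(g))) for v, g in groupby(values)]

def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [v for v, n in runs for _ in range(n)]

# 300 rows: (month, country, amount), ordered by month
rows = [(f"2011-{3 + i // 100:02d}", "US" if i % 2 else "DE", i) for i in range(300)]

# Column-oriented: one column's values are stored together, so runs are long
month_column = [r[0] for r in rows]
column_runs = rle_encode(month_column)

# Row-oriented: values of different columns interleave, so runs never form
row_stream = [v for r in rows for v in r]
row_runs = rle_encode(row_stream)
```

Here the month column shrinks from 300 values to 3 runs, while the interleaved row stream does not compress at all.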


MANAGEABILITY, EXTENSIONS


Administration: Simple Tools
•  Single console for both Database and Hadoop
•  Administration
   –  Start, Stop Database
   –  Recover, Rebalance Segments
•  Interactive view of System Metrics
   –  Real-time
   –  Historic (Configurable by time period)
•  In-depth view for System Health
   –  Hardware health
   –  Software (Database, Hadoop)
•  Query Monitoring
   –  Search, Prioritize, Cancel Queries
   –  View Query’s Execution Plan
•  Workload Management
   –  Configure Resource Queues
   –  Prioritize Users


Administration: Workload Management

Work smarter, not harder.

Connection Management
•  Control over how many users can be connected
•  Provides pooling (to allow large numbers) and caps (to restrict numbers if desired)
•  Intelligently frees and reacquires temporarily idle session resources

User-Based Resource Queues
•  Each user is assigned to a resource queue that performs ‘admission control’ of queries into the database
•  Allows DBAs to control the total number or total cost of queries allowed in at any point in time

Dynamic Query Prioritization
•  Patent-pending technique of dynamically balancing resources across running queries
•  Allows DBAs to control query priorities in real time, or determine default priorities by resource queue
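The admission control performed by a resource queue can be sketched as a gate with two limits: a cap on concurrently running statements and a cap on their total estimated cost. This is a simplified model, not Greenplum's scheduler; the class, parameter names, and FIFO policy are assumptions for illustration.

```python
from collections import deque

class ResourceQueue:
    def __init__(self, active_statements, max_cost=float("inf")):
        self.active_statements = active_statements  # max queries running at once
        self.max_cost = max_cost                    # max total planner cost running
        self.running = {}                           # query_id -> cost
        self.waiting = deque()                      # (query_id, cost), arrival order

    def _fits(self, cost):
        return (len(self.running) < self.active_statements
                and sum(self.running.values()) + cost <= self.max_cost)

    def submit(self, query_id, cost):
        """Admit the query if both limits allow it, otherwise make it wait."""
        if not self.waiting and self._fits(cost):
            self.running[query_id] = cost
            return "running"
        self.waiting.append((query_id, cost))
        return "waiting"

    def finish(self, query_id):
        """Release a slot, then admit waiting queries in FIFO order while they fit."""
        del self.running[query_id]
        while self.waiting and self._fits(self.waiting[0][1]):
            qid, cost = self.waiting.popleft()
            self.running[qid] = cost

queue = ResourceQueue(active_statements=2, max_cost=100)
states = [queue.submit("q1", 60), queue.submit("q2", 30), queue.submit("q3", 5)]
queue.finish("q1")   # frees a slot, so q3 is admitted
```

In this toy version a query whose cost alone exceeds `max_cost` would wait forever; a real system needs a policy for that case.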


Security: Authentication & Authorization

User Authentication: authenticate with
•  Database Passwords
•  LDAP
•  Active Directory
•  Kerberos/GSSAPI
•  RADIUS
•  Digital Certs.
•  Pluggable Auth. (PAM)

Role Management: manage roles
•  Identify Users and Groups
•  Grant/Revoke Access to: Databases, Tables, External Tables, Functions, Languages, Schemas, etc.
•  Grant Permissions: Select; Insert, Update, Delete; Rules; Connect; Execute; etc.

Connection Management: connections
•  Where to Listen
•  # of Connections
•  Pools
•  Encryption
•  Authentication Methods


Security: Standards & Certs.

Networks encrypted using SSL, TLS

Database encryption supported using pgcrypto
•  Algorithms: AES 128, 192, 256, DES, 3-DES and many others

Authentication
•  MD5 (default, set at install time)
•  SHA-256
•  SHA-256-FIPS

Local passwords encrypted
•  Superuser can change the password hashing algorithm
•  Using GUC: password_hash_algorithm
•  GUC can be set either system-wide or at the session level

Standards
•  Federal Standard FIPS 140-2 compliant


Data Load Options

SQL INSERT
•  Standard row-by-row insert – slowest method
   –  INSERT INTO tableX VALUES ('John', 'Doe', 'Manager')
•  All data is passed through the Master server

PostgreSQL COPY command
•  Inserts data from a file or stdin (another query) – faster than SQL INSERT
   –  COPY tableX FROM {file | STDIN}
•  All data is passed through the Master server

Parallel loading with gpfdist/gpload
•  Segment servers connect directly to external files served via gpfdist
•  Load bypasses the Master server
•  Segment servers load in parallel
•  External tables point to the streamed files
   –  CREATE EXTERNAL TABLE ext_table LOCATION ('gpfdist://dir/*')
   –  CREATE TABLE tableY AS SELECT * FROM ext_table
•  Integrated with Informatica PowerExchange and Pentaho
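The difference between the master-funneled paths (INSERT, COPY) and the parallel path (gpfdist/gpload) can be sketched with a thread pool. This is only a shape-of-the-dataflow model: the chunk lists stand in for files served by gpfdist, round-robin assignment is an assumption, and none of the names correspond to a real Greenplum API.

```python
from concurrent.futures import ThreadPoolExecutor

def load_via_master(chunks):
    """Single path: every row passes through one process, one chunk at a time."""
    table = []
    for chunk in chunks:
        table.extend(chunk)
    return table

def load_in_parallel(chunks, num_segments):
    """Parallel path: each segment pulls its own share of the chunks directly."""
    def segment_load(seg_id):
        # Round-robin assignment of chunks to segments (illustrative choice)
        return [row for chunk in chunks[seg_id::num_segments] for row in chunk]

    with ThreadPoolExecutor(max_workers=num_segments) as pool:
        return list(pool.map(segment_load, range(num_segments)))

chunks = [[f"row-{c}-{r}" for r in range(3)] for c in range(8)]

via_master = load_via_master(chunks)
per_segment = load_in_parallel(chunks, num_segments=4)
```

Both paths end with the same 24 rows; in the parallel case each of the four segments handled only a quarter of them.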


Parallelized ETL with Greenplum

One server, running Pentaho PDI and gpload, provides parallelized data loading.


Multiple ETL Servers (DIA Module)

Multiple ETL servers, each running Pentaho PDI and gpload. Even more parallelism for data loading.


HIGH AVAILABILITY, BACKUP, SUPPORT


Availability: Multi-Level Redundancy

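[Diagram: the client connects to a primary master, which is paired with a standby master through sync & failover processes; a redundant interconnect links the masters to the MP segment servers; the segment servers hold primary data (A1, B1, C1, A2, B2, C2) and mirror data (A1, B1, C1, A2, B2, C2) with RAID 5 protection]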


Backup in a nutshell
•  Option 1: custom external tables
   –  Good control over which tables/data to back up
   –  Enables incremental backup
   –  Doesn’t include other objects, such as roles, resource queues, etc.
•  Option 2: pg_dump
   –  Free utility
   –  Not parallelized
   –  Creates one dump file on the master
•  Option 3: gpcrondump
   –  Free utility
   –  Parallelized backup
   –  Creates SQL files on the master and segment hosts
   –  Must restore to same number of hosts/segments
   –  Incremental backup not supported
•  Option 4: EMC Data Domain


Backup/Restore with EMC Data Domain
•  Integration options
   –  NFS: Data Domain device mounted as NFS storage
   –  DD Boost: Native, client-side deduplication. Supported in GPDB 4.2 and higher
•  Drastic reduction in backup storage requirement
•  Back up all segment servers in parallel directly to Data Domain
•  Data Domain integrates seamlessly into standard Greenplum full backup, data export, and data restore procedures

[Diagram: full appliance connected to Data Domain via DD Boost or NFS over 2 x 10 Gbit IP]


Backup/Restore with EMC Data Domain

Data Domain Replication
•  Ideal for configurations with RPO and RTO requirements that can be specified in hours
•  Supports:
   –  Collection Replication for DD Boost backup
   –  Directory-level replication for NFS backup
   –  Encryption over the WAN

[Diagram: backup and restore between remote and primary sites; each site has a Greenplum DCA with a Data Domain system, linked over the LAN/WAN]


Customer Support Services
•  Remote Technical Support
   –  24x7 technical support and remote troubleshooting
   –  Customer-managed case severity level
   –  Four-hour response objective
•  Onsite Support (DCA Only)
   –  Installation of replacement parts
   –  Replacement parts shipped for next business day arrival
   –  GP SW upgrade included
•  Proactive Service
   –  Secure remote monitoring for hardware
   –  Notification of engineering technical advisories
   –  Built-in tools maximize stability and performance
•  Secure Self-Help
   –  24x7 access to eService support tools including knowledgebase, forums, and appropriately licensed software updates


Deployment Options


GREENPLUM DCA

Deployment Choice & Flexibility

Modular Appliances
•  Modular Flexibility
•  Database, Hadoop and ETL Modules
•  Future Partner-Specific Modules
•  Common Admin and Network Mgmt.
•  Incremental Scalability
•  Rapid Deployment

Software Editions
•  Deploy on your x86 hardware
•  Certified Configurations
•  Perpetual or Subscription Lic.
•  Community Editions


GREENPLUM DCA

Modular Options
•  Modules:
   –  Greenplum Database
   –  Greenplum Hadoop
   –  Greenplum Data Integration Accelerator
   –  Partner Modules
•  From ¼ to 12 Racks
•  Incremental Scale
•  Reduced Racking
•  Reduced Enterprise Networking

[Diagram: the 1st rack holds a required Greenplum Database module plus functional modules; additional racks hold functional modules. Each functional module is a Greenplum Database, Greenplum DIA, or Greenplum HD module, added in ¼-rack increments]


Greenplum DB Analytics


Extensible for Analytics: In-Database Analytical Algorithms

•  Bringing the power of parallelism to commonly used modeling and analytics functions
•  In-database analytics
   –  SAS – HPA, Access, and Scoring Accelerator
   –  MADlib – An open-source library of advanced analytics functions
   –  Analytics extensions supported, including
      •  Graphlib – Analytics for graph data
      •  PostGIS – Geospatial support, PL/R – Statistical Computing, PL/Java, PL/Perl, etc.
      •  GPText – massively parallel text processing


Stored Procedures Support
•  Extends SQL with user-defined logic
•  Written in SQL
•  Used for deploying reusable logic

[Diagram: Greenplum Database stack: ODBC and JDBC client access; data access & query layer with SQL, SQL 2003/2008 OLAP, stored procedures, MapReduce, in-database analytics, and SAS; polymorphic storage; Greenplum gNet]


SQL 2003/2008 OLAP Support
•  Simple aggregates
•  Window functions
•  Excellent for BI

[Diagram: Greenplum Database stack (ODBC/JDBC, data access & query layer, polymorphic storage, gNet)]
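For readers unfamiliar with window functions, the sketch below reproduces in plain Python what a query such as `RANK() OVER (PARTITION BY region ORDER BY sales DESC)` computes. The sample rows and the helper name `rank_over` are made up for illustration; in the database this logic is a single SQL expression.

```python
from itertools import groupby
from operator import itemgetter

def rank_over(rows, partition_by, order_by, descending=True):
    """Mimic RANK() OVER (PARTITION BY partition_by ORDER BY order_by)."""
    out = []
    keyed = sorted(rows, key=itemgetter(partition_by))
    for _, part in groupby(keyed, key=itemgetter(partition_by)):
        ordered = sorted(part, key=itemgetter(order_by), reverse=descending)
        rank, prev = 0, object()  # sentinel never equals a real value
        for pos, row in enumerate(ordered, start=1):
            if row[order_by] != prev:
                rank, prev = pos, row[order_by]  # ties share a rank, then skip
            out.append({**row, "rank": rank})
    return out

sales = [
    {"region": "east", "rep": "a", "sales": 100},
    {"region": "east", "rep": "b", "sales": 300},
    {"region": "east", "rep": "c", "sales": 300},
    {"region": "west", "rep": "d", "sales": 50},
    {"region": "west", "rep": "e", "sales": 70},
]
ranked = rank_over(sales, partition_by="region", order_by="sales")
ranks = {r["rep"]: r["rank"] for r in ranked}
```

Unlike a GROUP BY aggregate, every input row survives and simply gains a value computed over its partition, which is why window functions suit BI-style reports.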


MapReduce Support
•  Java-based programming
•  Command-line accessible
•  Run SQL and MapReduce against the same data

[Diagram: Greenplum Database stack (ODBC/JDBC, data access & query layer, polymorphic storage, gNet)]


In-Database Analytics Support
•  GPText for unstructured data
•  PostGIS for geospatial analysis
•  MADlib for scalable in-database analytics
•  User-written analytical algorithms

[Diagram: Greenplum Database stack (ODBC/JDBC, data access & query layer, polymorphic storage, gNet)]


Greenplum: A powerful platform for machine learning
•  Machine learning on relational data: Regressions, Classification, Clustering, High Dimensionality Reduction, Cross validation and many more…
•  Machine learning on graph data: Recommender Systems, Connected Components, PageRank, Triangle Counting, Subgraph Centrality, Spectral Clustering and many more…


Rich Machine Learning Library

•  Features:
   –  Rich set of SQL machine learning algorithms from MADlib 1.4 added
   –  Graphlab 2.2 supported (beta)
   –  UDF support in R, Python, and Java
•  Benefits:
   –  Analyze relational and graph data together, without needing data movement
   –  Scalable machine learning algorithms help run rapid data science experiments on big data
   –  Design custom algorithms using popular languages like R, Python and Java


User-Written Analytical Algorithms
•  Broad Choice of Development Language
   –  R, C, Java, Python, Perl
•  Multiple Execution Models
   –  User Defined Aggregate – Scalar Result
   –  User Defined Function – List Result
   –  User Defined Table Function – Tabular Result
•  Can Be Embedded Within:
   –  SQL, Stored Procedures, MapReduce Maps
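A user-defined aggregate parallelizes when it is written as three small functions: one that folds rows into a per-segment state, one that merges states from different segments, and one that turns the final state into the scalar result. The sketch below shows that shape for an average; the function names and the segment lists are illustrative, not a Greenplum API.

```python
from functools import reduce

def transition(state, value):
    """Runs on each segment, folding one row at a time into (count, total)."""
    count, total = state
    return (count + 1, total + value)

def merge(a, b):
    """Combines the partial states produced by two segments."""
    return (a[0] + b[0], a[1] + b[1])

def final(state):
    """Turns the merged state into the scalar result."""
    count, total = state
    return total / count if count else None

def parallel_avg(segments):
    # Each inner list stands in for the rows held by one segment
    partials = [reduce(transition, rows, (0, 0.0)) for rows in segments]
    return final(reduce(merge, partials, (0, 0.0)))

average = parallel_avg([[1.0, 2.0, 3.0], [4.0], [], [5.0, 6.0]])
```

Only the small `(count, total)` pairs cross segment boundaries, never the rows themselves.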


MADlib In-Database Analytic Library
•  Scalable, in-database analytic library
   –  Parallel implementations of mathematical, statistical and machine-learning methods for structured and unstructured data
•  Open-source, to enable extensibility and growth
•  Fully parallelized
•  Can be customized by users
•  Collaboration of developers from Greenplum, University of California at Berkeley and other commercial entities
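To make "parallel implementation of a statistical method" concrete, the sketch below fits a one-variable linear regression from per-segment sums that are merged before solving. It is not MADlib's code: it only shows why such a fit parallelizes, and the data and names are invented.

```python
def segment_stats(points):
    """Per-segment sufficient statistics: n, sum x, sum y, sum x*x, sum x*y."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    return (n, sx, sy, sxx, sxy)

def merge_stats(a, b):
    """Statistics from two segments combine by element-wise addition."""
    return tuple(u + v for u, v in zip(a, b))

def fit_line(stats):
    """Ordinary least squares for y = slope * x + intercept."""
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

# Points on y = 2x + 1, spread across three segments
data = [(x, 2 * x + 1) for x in range(10)]
segments = [data[0:3], data[3:7], data[7:10]]

total = (0, 0, 0, 0, 0)
for seg in segments:
    total = merge_stats(total, segment_stats(seg))
slope, intercept = fit_line(total)
```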


MADlib In-Database Analytical Functions

Descriptive Statistics
•  Quantile
•  Profile
•  CountMin (Cormode-Muthukrishnan) Sketch-based Estimator
•  FM (Flajolet-Martin) Sketch-based Estimator
•  MFV (Most Frequent Values) Sketch-based Estimator
•  Frequency
•  Histogram
•  Bar Chart
•  Box Plot Chart

Modeling
•  Correlation Matrix
•  Association Rule Mining
•  K-Means Clustering
•  Naïve Bayes Classification
•  Linear Regression
•  Logistic Regression
•  Support Vector Machines
•  SVD Matrix Factorisation
•  Decision Trees/CART
•  Latent Dirichlet Allocation Topic Modeling


GPText for Text Analytics
•  Full text indexing and search
•  Join structured and text in single query
•  Database security and availability features
•  Parallel, linearly scalable performance
•  No-Cost – bundled into GPDB


Spatial Analytics (PostGIS)
•  Integrated PostGIS 2.0 includes support for the Geography data type, the Geometry data type & previous PostGIS 1.4 features
   –  Enables polygons that cover the poles or cross the dateline
   –  Easily allows users to work with latitude/longitude data without having to know about projections
   –  No other map projection works for big organizations with truly global data
•  Open-GIS Compatible
•  GIS Data Types
•  OpenGIS Simple Feature Access

PostGIS 2.0 features are available with GPDB 4.2.6 via the Greenplum Package Manager
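The dateline point can be checked with a short calculation. The sketch below compares a great-circle distance, which treats coordinates as positions on a globe, with a naive planar difference in degrees. It uses the haversine formula on a sphere of mean Earth radius as a simplification; it is not how PostGIS itself computes geography distances, and the function names are illustrative.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; a sphere is an approximation

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance between two latitude/longitude points, in km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def naive_planar_degrees(lat1, lon1, lat2, lon2):
    """Treats latitude/longitude as flat x/y coordinates."""
    return math.hypot(lat2 - lat1, lon2 - lon1)

# Two points on the equator, one degree either side of the dateline
on_globe = great_circle_km(0.0, 179.0, 0.0, -179.0)      # about 222 km
on_plane = naive_planar_degrees(0.0, 179.0, 0.0, -179.0)  # 358 degrees apart
```

On the globe the points are two degrees of longitude apart; the flat calculation places them on opposite edges of the map.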


Integrated with Tools/Languages, incl. R

•  Load the PivotalR library
•  Create the “houses” object as a proxy object in R. The data is not loaded into R
•  List the columns in the table and preview the first 3 rows of data (the limit is passed through to the db)
•  Run a linear regression. This is executed in-database
•  Examine the resulting model
•  The model is stored in-database, greatly simplifying the development of scoring applications


SAS
•  SAS Scoring Accelerator
•  SAS High Performance Analytics (HPA)
•  SAS Access
•  SAS Grid

[Diagram: Greenplum Database stack (ODBC/JDBC, data access & query layer, polymorphic storage, gNet)]


Deep SAS Integration
•  SAS/Access for Greenplum
   –  Fast, transparent and secure access to Greenplum data from SAS
•  SAS High-Performance Analytics for Greenplum
   –  Closely-Integrated In-Memory Analytics
   –  Accelerates Computation
   –  Eliminates Most Data Movement
   –  Shares Segment Servers with Greenplum DB
•  SAS Scoring Accelerator for Greenplum
   –  Execute SAS Models in Parallel In-Database
•  SAS Grid for Greenplum
   –  Accelerate SAS Model Execution for Load and Run
   –  Integrated As Part of Greenplum DCA
   –  Leverages DCA’s High-Speed Interconnect
   –  Reduce Load on and Cost of Data Center Networks


A Mature Enterprise Platform

PRODUCT FEATURES
•  CORE MPP ARCHITECTURE: Shared-Nothing MPP; Parallel Query Optimizer; Polymorphic Data Storage™; Parallel Dataflow Engine; gNet™ Software Interconnect; Scatter/Gather Streaming™ Data Loading
•  GREENPLUM DATABASE ADAPTIVE SERVICES: Multi-Level Fault Tolerance (RAID, Mirroring, DR with Data Domain Boost); Online System Expansion; Workload Management
•  LOADING & EXT. ACCESS: Petabyte-Scale Loading; Trickle Micro-Batching; Anywhere Data Access
•  STORAGE & DATA ACCESS: Hybrid Storage & Execution (Row- & Column-Oriented); In-Database Compression; Multi-Level Partitioning; Indexes – Btree, Bitmap, etc.; External Table Support
•  LANGUAGE SUPPORT: Comprehensive SQL; Native MapReduce; SQL 2003 OLAP Extensions; Programmable Analytics; Analytics Extensions (GeoSpatial, PL/R, PL/Java, PL/Python, PL/Perl)

CLIENT ACCESS & TOOLS
•  CLIENT ACCESS: ODBC, JDBC, OLEDB, MapReduce, etc.
•  3rd PARTY TOOLS: BI Tools, ETL Tools, Data Mining, etc.
•  ADMIN TOOLS: Greenplum Command Center; Greenplum Package Manager


GPDB Delivers
•  Massively Parallel Analytics Performance
•  Industry-Leading Load Speed
•  No-Forklift Scalability
•  Rich SQL with Schema Agnosticism
•  In-Database Analytical Extensions
•  SAS Acceleration Options
•  Industry-Leading Workload Mgmt.
•  Parallel Co-Processing with Hadoop
•  Availability and Multi-Level Redundancy
•  Rich, Easy-to-Use Administration Tools
•  Big-Data-Capable Backup Facilities
•  Information and User Security


A NEW PLATFORM FOR A NEW ERA