Post on 16-Apr-2017
© 2016 MapR Technologies
Open Source Innovations in the MapR Ecosystem Pack 2.0
Before We Begin
• This webinar is being recorded. Later this week, you will receive
an email on how to get the recording and slide deck.
• If you have any audio problems, please let us know in the chat
window and we’ll try to resolve them quickly.
• If you have any questions during the webinar, please type them in
the chat window.
Introducing Our Speakers from MapR
Dale Kim
Sr. Director, Industry Solutions
Ankur Desai
Sr. Manager, Platform and Products
Rachel Silver
Technical Product Manager – Ecosystem Projects
+ Carol McDonald, Solutions Architect
Agenda
• Quick overview of the MapR Ecosystem Pack (MEP) program
• Drill 1.9
• Spark 2.0.1
• Kafka Connect and Kafka REST Proxy for MapR Streams
• MapR Installer Stanzas
• Other Key Additions
– Hue 3.10
– Teradata Connector for Sqoop
• Q&A
MEP Overview
MapR (< 8/2016): You’re In Charge of Upgrades
➢ Customers may encounter inter-project
compatibility issues
➢ Unwieldy documentation and support
burden slows innovation
➢ Wide-ranging support means less
nuanced support for configurations like
packaging default JARs in Oozie
What Is the MapR Ecosystem Pack (MEP)?
A way to decouple ecosystem installs and upgrades:
– A selected set of stable and popular components, connectors,
and interfaces from the open source ecosystem that we fully
support on the MapR platform.
– A single repository of selected versions of these components
fully tested to be interoperable.
– A delivery vehicle for connectors and developer APIs that allow
us to provide common ecosystem interfaces to MapR
components (e.g., Kafka Connect for MapR Streams).
MapR moved to Ecosystem Packs in Q3 '16.
Where Does MEP Fit into the Bigger Ecosystem Picture?
• Extended Ecosystem – Outside support: vendor or community.
• MapR Core – Fully supported; updates tied to MapR core.
• MEP Ecosystem – Fully supported; updates follow the MEP process.
What Is in a MapR Ecosystem Pack (MEP)?
A MEP contains a set of ecosystem projects, connectors, and APIs:
• Projects: A selected set of open source ecosystem projects that we ship, package, and fully support on the MapR Converged Data Platform.
• Connectors and APIs: Connectors and APIs that provide common Hadoop interfaces to core MapR products (e.g., Kafka Connect for MapR Streams).
Key Differentiator: Decoupled Ecosystem Upgrades
Competitor process: all-or-nothing
● Must upgrade the full stack to receive any updates
● Infrequent opportunities for upgrade: ~2/year
● Upgrades are disruptive and infrequent!
MapR Ecosystem Packs (MEP) process:
● Reduced upgrade effort – upgrade only at the level you need, instead of your entire stack
● Frequent (quarterly) opportunities for upgrade (MEP 1.0, MEP 2.0, …)
● Less disruption to production environments!
MEP 2.0 Contents
• Apache Spark 2.0.1
• Apache Drill 1.9
• Apache Hive 1.2.1
• Hue 3.10
• Apache Pig 0.16
• Apache Oozie 4.2.0
• Impala 2.5
• Apache Sqoop2 1.99.7
• Apache Sqoop 1.4.6
• Apache Flume 1.6
• Apache Storm 0.10.1
• Apache Mahout 0.12.2
• Apache Myriad 0.1.0
• Apache Sentry 1.6
★ Major Spark Upgrade!
★ Major feature updates to Drill!
★ MapR Installer Stanzas
★ Includes new connectors:
❖ Kafka Connect for MapR Streams
❖ Kafka REST Proxy for MapR Streams
❖ MapR Connector for Teradata (Powered by Teradata Connector for Hadoop)
Drill 1.9
Drill: Evolving Towards a Unified SQL Access Layer for the MapR Platform
(Diagram: global sources feed a big data store – MapR-FS files, MapR-DB database, MapR Streams event streaming – serving batch processing, stream processing, data exploration, BI/ad-hoc queries, and real-time dashboards.)
• Queries across files, tables, and streams
• Real-time/operational analytics
• Schema-less JSON flexibility
• Distributed in-memory SQL engine for high performance at scale
• Analytics from familiar BI/SQL tools
Drill Product Improvements over Releases
(Themes across releases: SQL window functions, enhanced Hive compatibility, query performance/scale, Drill on MapR-DB JSON tables, enterprise manageability.)

Drill 1.0
• Drill GA

Drill 1.1
• Automatic partitioning for Parquet files
• Window functions support
  – Aggregate functions: AVG, COUNT, MAX, MIN, SUM
  – Ranking functions: CUME_DIST, DENSE_RANK, PERCENT_RANK, RANK, and ROW_NUMBER
• Hive impersonation
• SQL UNION support
• Complex data enhancements, and more

Drill 1.2
• Native Parquet reader for Hive tables
• Hive partition pruning
• Multiple Hive versions support
• Hive 1.2.1 version support
• New analytical functions (LEAD, LAG, NTILE, etc.)
• Support for multiple window PARTITION BY clauses
• DROP TABLE syntax
• Metadata caching
• Security support for the web UI
• INT96 data type support
• UNION DISTINCT support

Drill 1.3/1.4
• Improved Tableau experience with faster LIMIT 0 queries
• Metadata (INFORMATION_SCHEMA) query speedups on Hive schemas/tables
• Robust partition pruning (more data types, large numbers of partitions)
• Optimized metadata cache
• Improved window function resource usage and performance
• New and improved JDBC driver

Drill 1.5/1.6
• Enhanced stability and scale
  – New memory allocator
  – Improved uniform query load distribution via connection pooling
• Enhanced query performance
  – Early application of partition pruning in query planning
  – Hive table query planning improvements
  – Row-count-based pruning for LIMIT N queries
  – Lazy reading of the Parquet metadata cache
  – LIMIT 0 performance
• Enhanced SQL window function frame syntax
• Client impersonation
• JDK 1.8 support

Drill 1.7/1.8
• Drill on YARN integration
• Access to Drill logs in the Web UI
• Addition of the JDBC/ODBC client IP in Drill audit logs
• Monitoring via JMX
• Hive CHAR data type support
• Partition pruning enhancements
• Ability to return file names as part of queries

Drill 1.9 product highlights
• Enhanced Parquet performance (Parquet filter pushdown, improved scans with the asynchronous Parquet reader, limit pushdown)
• Flexible and dynamic UDFs
• Null equality join support
• Efficient metadata queries
• HTTPD format plugin
• ~60 bug fixes and improvements in SQL, performance, and usability
Parquet Filter Pushdown
• Applied at planning time: the planner evaluates the filter condition before the scan and checks whether each Parquet row group can be eliminated
• Requires Parquet files to have min/max statistics
• If a row group's min/max values fall outside the range of the filter, the row group is dropped
• Supports only simple expressions

Example:

SELECT * FROM table_t1
WHERE date_column BETWEEN date '2016-01-01' AND date '2016-01-31'

Row group 1: date_column min = 2015-01-01, max = 2015-12-31
Row group 2: date_column min = 2016-01-01, max = 2016-12-31

Only row group 2 will be scanned.
Parquet Filter Pushdown (cont.)
The following are supported:
• Clauses: WHERE, HAVING (if the filter can be pushed past GROUP BY)
• Operators: AND, OR, IN (with an IN list of fewer than 10 items)
• Comparison operators: =, <>, <, >, <=, >=
• Data types: INT, BIGINT, FLOAT, DOUBLE, DATE, TIMESTAMP, TIME
• Functions: CAST (only to INT, BIGINT, FLOAT, DOUBLE)
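The min/max pruning rule described above can be sketched in a few lines of Python; the row-group statistics and the helper function below are illustrative, not Drill's actual implementation:

```python
from datetime import date

def prune_row_groups(row_groups, lo, hi):
    """Keep only row groups whose [min, max] value range can overlap the
    filter range [lo, hi]. row_groups is a list of (name, min, max)
    tuples, mimicking the statistics stored in Parquet footers."""
    kept = []
    for name, mn, mx in row_groups:
        # A row group is skipped only when its entire value range lies
        # outside the filter range.
        if mx < lo or mn > hi:
            continue
        kept.append(name)
    return kept

groups = [
    ("rg1", date(2015, 1, 1), date(2015, 12, 31)),
    ("rg2", date(2016, 1, 1), date(2016, 12, 31)),
]
# Filter: date_column BETWEEN 2016-01-01 AND 2016-01-31
print(prune_row_groups(groups, date(2016, 1, 1), date(2016, 1, 31)))  # ['rg2']
```

Because the decision uses only footer statistics, no row data is read for the eliminated groups, which is where the I/O savings come from.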
Parquet Filter Pushdown (cont.)
Execution without filter pushdown
Parquet Filter Pushdown (cont.)
Plan with filter pushdown:

00-00 Screen : rowType = RecordType(ANY *): rowcount = 2925.0, cumulative cost = {26617.5 rows, 108517.5 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 2890
00-01 Project(*=[$0]) : rowType = RecordType(ANY *): rowcount = 2925.0, cumulative cost = {26325.0 rows, 108225.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 2889
00-02 Project(T10¦¦*=[$0]) : rowType = RecordType(ANY T10¦¦*): rowcount = 2925.0, cumulative cost = {26325.0 rows, 108225.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 2888
00-03 SelectionVectorRemover : rowType = RecordType(ANY T10¦¦*, ANY orderdate): rowcount = 2925.0, cumulative cost = {26325.0 rows, 108225.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 2887
00-04 Filter(condition=[AND(>=($1, 1993-01-01), <=($1, 1994-01-01))]) : rowType = RecordType(ANY T10¦¦*, ANY orderdate): rowcount = 2925.0, cumulative cost = {23400.0 rows, 105300.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 2886
00-05 Project(T10¦¦*=[$0], orderdate=[$1]) : rowType = RecordType(ANY T10¦¦*, ANY orderdate): rowcount = 11700.0, cumulative cost = {11700.0 rows, 23400.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 2885
00-06 Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/Users/pchandra/work/data/test_filter_pushdown/0_0_186.parquet], …, ReadEntryWithPath [path=/Users/pchandra/work/data/test_filter_pushdown/0_0_125.parquet], ReadEntryWithPath [path=/Users/pchandra/work/data/test_filter_pushdown/0_0_194.parquet]], selectionRoot=file:/Users/pchandra/work/data/test_filter_pushdown, numFiles=117, usedMetadataFile=false, columns=[`*`]]]) : rowType = (DrillRecordRow[*, orderdate]): rowcount = 11700.0, cumulative cost = {11700.0 rows, 23400.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 2884
Parquet Filter Pushdown (cont.)
Execution with filter pushdown
Parquet Filter Pushdown (cont.)

TPCH Query | Selectivity | Without FltrPD (MB) | With FltrPD (MB) | I/O Reduction
TPCH 06    | 15%         | 5,779               | 1,707            | 70%
TPCH 07    | 30%         | 12,395              | 5,188            | 58%
TPCH 14    | 1%          | 7,915               | 5,254            | 34%
TPCH 20    | 15%         | 9,174               | 8,333            | 9%
Asynchronous Parquet Reader
• High performance queries for scan intensive analytics (~33% I/O reduction)
• Parquet reader improvements include
– Buffered reads
– Parallel reads from file system
– Parallel decompression and decoding
– Reading and decoding is pipelined
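The pipelining of reading and decoding can be sketched with a small producer/consumer pipeline; this is a toy model of the idea, not the actual Drill reader:

```python
import queue
import threading

def read_chunks(n):
    # Stand-in for buffered reads from the file system (the I/O stage).
    for i in range(n):
        yield bytes([i]) * 4

def pipelined_scan(n_chunks):
    """One thread reads ahead into a bounded buffer while the caller
    'decodes', so I/O and CPU work overlap instead of alternating."""
    buf = queue.Queue(maxsize=2)   # bounded read-ahead buffer

    def reader():
        for chunk in read_chunks(n_chunks):
            buf.put(chunk)
        buf.put(None)              # end-of-stream marker

    threading.Thread(target=reader, daemon=True).start()

    decoded = []
    chunk = buf.get()
    while chunk is not None:
        decoded.append(len(chunk))  # stand-in for decompression/decoding
        chunk = buf.get()
    return decoded

print(pipelined_scan(3))  # [4, 4, 4]
```

The bounded queue models buffered, parallel reads: the reader never runs unboundedly ahead of decoding, yet decoding never waits for a read that could have been issued earlier.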
Flexible & Dynamic UDFs
• Self-service ability for end users to deploy UDFs
• Simplified deployment without disruption
  – No admin permissions required on Drillbit nodes, and no Drillbit restarts
• Works in standalone and YARN-based Drill clusters
Refer to Drill Best Practices on the MapR Converge Community: https://community.mapr.com/docs/DOC-1497
Spark 2.0.1
The Trinity of Real Time
(Diagram: real-time producers publish to topics in a global messaging system; a transformational tier feeds a NoSQL database for real-time operational analytics.)
Spark 2.0.1: Whole-Stage Code-Gen: Planner
(Diagram: a physical plan – ParquetRelation → Filter → Project branches feeding a Broadcast Hash Join, then Project → TungstenAggregate → Exchange – with each contiguous chain of operators collapsed into a whole-stage codegen region.)
Whole-Stage Code-Gen: Spark as a Compiler

Query (logical plan: Scan → Filter → Project → Aggregate):

SELECT count(*)
FROM store_sales
WHERE ss_item_sk = 1000

Volcano iterator model (one virtual call per operator per row):

class Filter {
  def next(): Boolean = {
    var found = false
    while (!found && child.next()) {
      found = predicate(child.fetch())
    }
    found
  }
  def fetch(): InternalRow = child.fetch()
  ...
}

Whole-stage code-gen (operators fused into a single loop):

long count = 0;
for (ss_item_sk in store_sales) {
  if (ss_item_sk == 1000) {
    count += 1;
  }
}
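The difference between the two execution models can be illustrated in Python; this is a toy contrast only (Spark generates Java bytecode for the fused stage, not Python):

```python
# Volcano iterator model: each operator pulls rows from its child one at
# a time, paying a call per row per operator boundary.
def volcano_count(rows, key):
    scan = iter(rows)                         # Scan operator
    filtered = (r for r in scan if r == key)  # Filter pulls from Scan
    count = 0
    for _ in filtered:                        # Aggregate pulls from Filter
        count += 1
    return count

# Whole-stage code generation: the planner fuses Scan, Filter, and
# Aggregate into one tight loop with no per-row operator boundaries.
def fused_count(rows, key):
    count = 0
    for r in rows:
        if r == key:
            count += 1
    return count

store_sales = [1000, 2000, 1000, 3000]
print(volcano_count(store_sales, 1000))  # 2
print(fused_count(store_sales, 1000))    # 2
```

Both produce the same answer; the fused version simply removes the per-row indirection, which is the point of whole-stage code generation.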
Spark 2.0.1: In-Memory Columnar Format

In-memory row format (Spark 1.6):
  [1, John, 10] [2, Mike, 20] [3, Bob, 30]
In-memory column format (Spark 2.0+):
  [1, 2, 3] [John, Mike, Bob] [10, 20, 30]

• Efficient: dense storage, easy to index, vectorized processing.
• Compatible: with external systems that use a columnar format; no serialization/copy.
• Extensible: process encoded data, integrate with a columnar cache.
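The two layouts can be shown with plain Python lists; this is illustrative only (Spark's columnar format is an off-heap binary layout, not Python lists):

```python
# The same three records in the two layouts. The columnar layout keeps
# each column's values adjacent, so a scan of one column touches only
# that column's storage and can be processed a vector at a time.
row_store = [(1, "John", 10), (2, "Mike", 20), (3, "Bob", 30)]

ids, names, values = (list(col) for col in zip(*row_store))
# ids    -> [1, 2, 3]
# names  -> ["John", "Mike", "Bob"]
# values -> [10, 20, 30]

# A columnar aggregate reads only the one column it needs:
print(sum(values))  # 60
```

In the row layout, the same aggregate would have to step over every record's id and name just to reach the value field.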
Spark 2.0.1: Structured Streaming Preview
Structured Streaming:
• Lets you treat a stream as if it were a table
• Automatically appends new stream records to that "table"
• Coordinates output to an external sink
Structured Streaming in Spark 2.0 is an alpha release.
(Diagram: streaming data arriving over time yields queryable data at times n, n+1, n+2, n+3.)
Spark 2.0.1: Structured Streaming Preview (cont.)
(Diagram: at each processing time, the data received up to that trigger forms the input table, producing a result table and program output written to external storage.)

Output modes to the external sink:
• Complete: all results are sent to the external sink.
• Append: only new rows added since the last trigger (each "time" in the diagram) are sent to the external sink.
• Update (not yet available): only rows changed since the last trigger are sent to the external sink.
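The complete and append modes can be modeled in a few lines of Python; this is a toy simulation of the semantics, not the Spark API (the real mechanism is `writeStream` with `outputMode()`):

```python
def run_stream(batches, mode):
    """Toy model of Structured Streaming output modes: each trigger
    appends a batch of new records to the unbounded input "table", then
    emits results according to the chosen mode."""
    table, emitted = [], []
    for batch in batches:
        table.extend(batch)              # new records join the input table
        if mode == "complete":
            emitted.append(list(table))  # re-emit everything each trigger
        elif mode == "append":
            emitted.append(list(batch))  # emit only rows new since last trigger
    return emitted

batches = [[1, 2], [3], [4, 5]]
print(run_stream(batches, "complete"))  # [[1, 2], [1, 2, 3], [1, 2, 3, 4, 5]]
print(run_stream(batches, "append"))    # [[1, 2], [3], [4, 5]]
```

The not-yet-available update mode would emit only the rows whose values changed since the last trigger, which matters once aggregations are involved.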
Kafka APIs for MapR Streams
Big Data Is Continuously Generated One Event at a Time

{ "time": "6:01.103",
  "event": "RETWEET",
  "location": { "lat": 40.712784, "lon": -74.005941 } }

{ "time": "5:04.120",
  "severity": "CRITICAL",
  "msg": "Service down" }

{ "card_num": 1234,
  "merchant": "MERCH1",
  "amount": 50 }
Three Core Components of the Streaming Architecture
● Producer: A software-based system connected to the data source. Producers publish event data into a streaming system.
● Streaming/messaging system: A system that takes the data published by the producers, persists it, and reliably delivers it to consumers.
● Consumer: Subscribes to data from streams and manipulates or analyzes that data to look for alerts and insights. In the streaming context, consumers are typically stream processing engines.
Three Core Components of the Streaming Architecture
(Diagram: sources – social media, sensor data, database, data warehouse – feed Kafka producers via custom code and a data collector; Kafka delivers the data to Kafka consumers for stream processing and persistence.)
Simplifying the Streaming Architecture
● Making it easy to ingest data into the streaming system
- Connecting data sources using HTTP, making it simple for
any device to connect with Kafka
- Introducing a framework to connect most common data
systems with Kafka
● Converging the three core components on one platform
Simplifying the Streaming Architecture
(Diagram: social media and sensor data connect via the Kafka REST API; databases and data warehouses connect via Kafka Connect; Kafka feeds stream processing and persistence.)
Kafka Connect: Easy Connection to Data Systems
● Provides prebuilt connectors that allow most common data systems
to connect with Kafka
● Easily connect databases (such as Oracle), data warehouses (such
as Teradata) and Hadoop (HDFS) with Kafka
● Pull-based ingest of data, supporting sources that don't know how to
push into Kafka
● Push-based export of data from Kafka, supporting data systems that
don't know how to pull data from Kafka
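A pull-based source connector is configured with a small JSON document submitted to the Connect REST API. The sketch below is illustrative: the connector class (from the Confluent connector ecosystem), connection URL, and topic prefix are example values, not a specific MapR-shipped connector:

```python
import json

# Hypothetical JDBC source connector configuration in the standard
# Kafka Connect REST format; all values here are illustrative.
jdbc_source = {
    "name": "orders-db-source",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "tasks.max": "1",
        "connection.url": "jdbc:oracle:thin:@db-host:1521/ORCL",
        "mode": "incrementing",              # pull new rows by a growing key
        "incrementing.column.name": "order_id",
        "topic.prefix": "db-",               # one topic per table, e.g. db-orders
    },
}

# An HTTP client would POST this JSON to the Connect REST endpoint
# (e.g. /connectors) to start the connector.
print(json.dumps(jdbc_source, indent=2))
```

The same envelope shape works for push-based sink connectors (e.g. an HDFS sink), with the direction of data flow reversed.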
Kafka Connect: Easy Connection to Data Systems
(Diagram: databases and data warehouses connect through Kafka Connect to Kafka, which feeds stream processing and persistence.)
Kafka REST Proxy: Connect with Kafka using HTTP
● Any device that can communicate using HTTP can now
communicate directly with Kafka
● Any programming language in any runtime environment can now
connect with Kafka using HTTP
● The Kafka REST API eliminates intermediate data collectors
● Simplifying IoT architecture: any car, thermostat, machine sensor,
etc., can now directly communicate with Kafka
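A produce request to the REST proxy is just an HTTP POST with a JSON body. The payload builder below is a sketch: the topic, record fields, and the v1 content type are illustrative assumptions:

```python
import json

def build_records(values):
    """Build a produce request body in the REST proxy's JSON envelope:
    a "records" array whose entries wrap each message as a "value"
    object. The record fields here are illustrative."""
    return json.dumps({"records": [{"value": v} for v in values]})

body = build_records([{"card_num": 1234, "merchant": "MERCH1", "amount": 50}])
# An HTTP client (urllib.request, curl, an embedded device, ...) would
# POST this body to /topics/<topic-name> with a content type such as
# application/vnd.kafka.json.v1+json; no Kafka client library is needed
# on the producing side.
print(body)
```

This is why any HTTP-capable device or language can act as a producer: the entire contract is a URL, a header, and a JSON body.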
Kafka REST Proxy: Connect with Kafka Using HTTP
(Diagram: social media and sensor data connect through the Kafka REST API to Kafka, which feeds stream processing and persistence.)
Converging the Components of Streaming with MapR
(Diagram: on the MapR Converged Data Platform, social media and sensor data arrive via the Kafka REST API, and databases and data warehouses arrive via Kafka Connect; MapR Streams delivers the data to stream processing (Spark) and persistence (MapR-DB, MapR-FS).)
MapR Installer Stanzas
MapR Installer "Stanzas"
• Under the Spyglass initiative, today we are proud to announce MapR Installer Stanzas.
• MapR Installer Stanzas enable API-driven installation for the industry's only Converged Data Platform:
  – A stanza contains the layout and settings for the cluster to be installed
  – It can be programmatically invoked to provision clusters
  – Automates successive cluster creation with minimal changes
  – Designed for both on-premises and cloud deployments
Simple, Easy YAML
Lars Fredriksen
• Built directly on top of the installer REST API
• SDK models generated from swagger.json
• Installed in a virtual Python environment as the mapr_installer_cli module
• Connection management, error handling, YAML parsing, progress status
• Python app driven by YAML configuration
• Commands: install, uninstall, export, list

Example:

environment:
  mapr_core_version: 5.2.0
config:
  hosts:
    - demonode[1-3].example.com
  ssh_id: root
  license_type: enterprise
  mep_version: 2.0
  disks:
    - /dev/sdb
    - /dev/sdc
  services:
    template-05-converged:
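Once a stanza is loaded (e.g., with a YAML parser) it is just nested dictionaries, which makes pre-flight validation easy to script. The dict literal below mirrors the example stanza, and the required-key list is an assumption for illustration, not the installer's actual schema:

```python
# Parsed form of a stanza like the example above; hosts are expanded
# from the demonode[1-3] range for clarity.
stanza = {
    "environment": {"mapr_core_version": "5.2.0"},
    "config": {
        "hosts": ["demonode1.example.com", "demonode2.example.com",
                  "demonode3.example.com"],
        "ssh_id": "root",
        "license_type": "enterprise",
        "mep_version": "2.0",
        "disks": ["/dev/sdb", "/dev/sdc"],
    },
}

def validate_stanza(s):
    """Fail fast, before invoking the installer, if key settings are absent."""
    missing = [k for k in ("hosts", "ssh_id", "disks")
               if k not in s.get("config", {})]
    if missing:
        raise ValueError("stanza missing config keys: " + ", ".join(missing))
    return True

print(validate_stanza(stanza))  # True
```

Checks like this are what make stanza-driven provisioning repeatable: the same validated file can spin up successive clusters with minimal changes.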
How It Fits with the Current Installer Architecture
(Diagram: the GUI frontend (AngularJS + Bootstrap) and "Stanzas" (Python + YAML) talk over HTTPS to the Java REST backend's APIs (Jetty + Jersey + Jackson), which drive the installer core (Python + Ansible) with an embedded DB and deploy to the cluster nodes.)
Other Key Additions
Hue 3.10
Key improvements:
● Oozie improvements
  ○ External workflow graph
  ○ Single action execution
  ○ New ability: dry-run an Oozie job
● New SQL query editor that works over JDBC
  ○ Look for an upcoming Community post on how to use this with Apache Drill!
● Directory- and file-based document management
  ○ Users can create their own directories and subdirectories and drag and drop documents within the simple file browser interface
MapR Connector for Teradata, Powered by Teradata Connector for Hadoop
A Sqoop wrapper that facilitates bulk data transfer between Hadoop and external data storage.
"MapR and Teradata share a customer base that continually drives both of us to simplify and orchestrate their analytical ecosystem. This latest collaboration by our engineers is yet another example of helping leading data-driven organizations realize value from big data faster and easier." – Chad Meley, VP of Marketing at Teradata
As a Reminder…
https://community.mapr.com
• Q&A
• Discussions
• Code snippets
• Tutorials
Q & A
@mapr
Engage with us!
mapr-technologies
Thank You!