Post on 16-Apr-2017
© 2016 MapR Technologies
Open Source Innovations in the MapR Ecosystem Pack 2.0
Before We Begin
• This webinar is being recorded. Later this week, you will receive
an email on how to get the recording and slide deck.
• If you have any audio problems, please let us know in the chat
window and we’ll try to resolve them quickly.
• If you have any questions during the webinar, please type them in
the chat window.
Introducing Our Speakers from MapR
Dale Kim
Sr. Director, Industry Solutions
Ankur Desai
Sr. Manager, Platform and Products
Rachel Silver
Technical Product Manager – Ecosystem Projects
+ Carol McDonald, Solutions Architect
Agenda
• Quick overview of the MapR Ecosystem Pack (MEP) program
• Drill 1.9
• Spark 2.0.1
• Kafka Connect and Kafka REST Proxy for MapR Streams
• MapR Installer Stanzas
• Other Key Additions
– Hue 3.10
– Teradata Connector for Sqoop
• Q&A
MEP Overview
MapR (< 8/2016): You’re In Charge of Upgrades
➢ Customers may encounter inter-project
compatibility issues
➢ Unwieldy documentation and support
burden slows innovation
➢ Wide-ranging support means less
nuanced support for configurations like
packaging default JARs in Oozie
What Is the MapR Ecosystem Pack (MEP)?
A way to decouple ecosystem installs and upgrades:
– A selected set of stable and popular components, connectors,
and interfaces from the open source ecosystem that we fully
support on the MapR platform.
– A single repository of selected versions of these components
fully tested to be interoperable.
– A delivery vehicle for connectors and developer APIs that allow
us to provide common ecosystem interfaces to MapR
components (e.g., Kafka Connect for MapR Streams).
MapR moved to Ecosystem Packs in Q3 '16.
Where Does MEP Fit into the Bigger Ecosystem Picture?
• Extended Ecosystem – Outside support: vendor or community.
• MapR Core – Fully supported; updates tied to MapR core.
• MEP Ecosystem – Fully supported; updates follow the MEP process.
What Is in a MapR Ecosystem Pack (MEP)?
A MEP contains a set of ecosystem projects, connectors, and APIs:
• Projects: A selected set of open source ecosystem projects that we ship, package, and fully support on the MapR Converged Data Platform.
• Connectors and APIs: Connectors and APIs that provide common Hadoop interfaces to core MapR products (e.g., Kafka Connect for MapR Streams).
Key Differentiator: Decoupled Ecosystem Upgrades
Competitor process: all-or-nothing
● Must upgrade the full stack to receive any updates
● Infrequent opportunities for upgrade: ~2/year
● Upgrades are disruptive and infrequent!
MapR Ecosystem Packs (MEP) process:
● Reduced upgrade effort – upgrade only at the level you need, instead of your entire stack
● Frequent (quarterly) opportunities for upgrade (MEP 1.0, MEP 2.0, …)
● Less disruption to production environments!
MEP 2.0 Contents
• Apache Spark 2.0.1
• Apache Drill 1.9
• Apache Hive 1.2.1
• Hue 3.10
• Apache Pig 0.16
• Apache Oozie 4.2.0
• Impala 2.5
• Apache Sqoop2 1.99.7
• Apache Sqoop 1.4.6
• Apache Flume 1.6
• Apache Storm 0.10.1
• Apache Mahout 0.12.2
• Apache Myriad 0.1.0
• Apache Sentry 1.6
★ Major Spark Upgrade!
★ Major feature updates to Drill!
★ MapR Installer Stanzas
★ Includes new connectors:
❖ Kafka Connect for MapR Streams
❖ Kafka REST Proxy for MapR Streams
❖ MapR Connector for Teradata (Powered by Teradata Connector for Hadoop)
Drill 1.9
Drill: Evolving Towards a Unified SQL Access Layer for the MapR Platform
(Diagram: global sources feed a big data store – MapR-FS files, MapR-DB database, MapR Streams event streaming – serving batch processing, stream processing, data exploration, BI/ad-hoc queries, and real-time dashboards.)
• Queries across files, tables, and streams
• Real-time/operational analytics
• Schema-less JSON flexibility
• Distributed in-memory SQL engine for high performance at scale
• Analytics from familiar BI/SQL tools
Drill Product Improvements over Releases
(Themes across releases: SQL window functions, enhanced Hive compatibility, query performance/scale, Drill on MapR-DB JSON tables, enterprise manageability.)

Drill 1.0
• Drill GA

Drill 1.1
• Automatic partitioning for Parquet files
• Window functions support
  – Aggregate functions: AVG, COUNT, MAX, MIN, SUM
  – Ranking functions: CUME_DIST, DENSE_RANK, PERCENT_RANK, RANK, and ROW_NUMBER
• Hive impersonation
• SQL UNION support
• Complex data enhancements, and more

Drill 1.2
• Native Parquet reader for Hive tables
• Hive partition pruning
• Multiple Hive versions support
• Hive 1.2.1 version support
• New analytical functions (LEAD, LAG, NTILE, etc.)
• Support for multiple window PARTITION BY clauses
• DROP TABLE syntax
• Metadata caching
• Security support for the web UI
• INT96 data type support
• UNION DISTINCT support

Drill 1.3/1.4
• Improved Tableau experience with faster LIMIT 0 queries
• Metadata (INFORMATION_SCHEMA) query speedups on Hive schemas/tables
• Robust partition pruning (more data types, large numbers of partitions)
• Optimized metadata cache
• Improved window function resource usage and performance
• New and improved JDBC driver

Drill 1.5/1.6
• Enhanced stability and scale
  – New memory allocator
  – Improved uniform query load distribution via connection pooling
• Enhanced query performance
  – Early application of partition pruning in query planning
  – Hive table query planning improvements
  – Row-count-based pruning for LIMIT N queries
  – Lazy reading of the Parquet metadata cache
  – LIMIT 0 performance
• Enhanced SQL window function frame syntax
• Client impersonation
• JDK 1.8 support

Drill 1.7/1.8
• Drill on YARN integration
• Access to Drill logs in the Web UI
• Addition of the JDBC/ODBC client IP in Drill audit logs
• Monitoring via JMX
• Hive CHAR data type support
• Partition pruning enhancements
• Ability to return file names as part of queries

Drill 1.9 product highlights
• Enhanced Parquet performance (Parquet filter pushdown, improved scans with the asynchronous Parquet reader, limit pushdown)
• Flexible and dynamic UDFs
• Null equality join support
• Efficient metadata queries
• HTTPD format plugin
• ~60 bug fixes and improvements in SQL, performance, and usability
Parquet Filter Pushdown
• Applied at planning time: the planner evaluates the filter condition before the scan and checks whether each Parquet row group can be eliminated
• Requires Parquet files to have min/max statistics
• If a row group's min/max values fall outside the range of the filter, the row group is dropped
• Supports only simple expressions

Example:

SELECT * FROM table_t1
WHERE date_column BETWEEN date '2016-01-01' AND date '2016-01-31'

Row group 1: date_column min = 2015-01-01, max = 2015-12-31
Row group 2: date_column min = 2016-01-01, max = 2016-12-31

Only row group 2 will be scanned.
Parquet Filter Pushdown (cont.)
The following are supported:
• Clauses: WHERE, HAVING (if the filter can be pushed past GROUP BY)
• Operators: AND, OR, IN (with an IN list of fewer than 10 items)
• Comparison operators: =, <>, <, >, <=, >=
• Data types: INT, BIGINT, FLOAT, DOUBLE, DATE, TIMESTAMP, TIME
• Functions: CAST (only to INT, BIGINT, FLOAT, DOUBLE)
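The min/max pruning rule described above can be sketched in a few lines of Python; the row-group statistics and the helper function below are illustrative, not Drill's actual implementation:

```python
from datetime import date

def prune_row_groups(row_groups, lo, hi):
    """Keep only row groups whose [min, max] value range can overlap the
    filter range [lo, hi]. row_groups is a list of (name, min, max)
    tuples, mimicking the statistics stored in Parquet footers."""
    kept = []
    for name, mn, mx in row_groups:
        # A row group is skipped only when its entire value range lies
        # outside the filter range.
        if mx < lo or mn > hi:
            continue
        kept.append(name)
    return kept

groups = [
    ("rg1", date(2015, 1, 1), date(2015, 12, 31)),
    ("rg2", date(2016, 1, 1), date(2016, 12, 31)),
]
# Filter: date_column BETWEEN 2016-01-01 AND 2016-01-31
print(prune_row_groups(groups, date(2016, 1, 1), date(2016, 1, 31)))  # ['rg2']
```

Because the decision uses only footer statistics, no row data is read for the eliminated groups, which is where the I/O savings come from.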
Parquet Filter Pushdown (cont.)
Execution without filter pushdown
Parquet Filter Pushdown (cont.)
Plan with filter pushdown:

00-00 Screen : rowType = RecordType(ANY *): rowcount = 2925.0, cumulative cost = {26617.5 rows, 108517.5 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 2890
00-01 Project(*=[$0]) : rowType = RecordType(ANY *): rowcount = 2925.0, cumulative cost = {26325.0 rows, 108225.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 2889
00-02 Project(T10¦¦*=[$0]) : rowType = RecordType(ANY T10¦¦*): rowcount = 2925.0, cumulative cost = {26325.0 rows, 108225.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 2888
00-03 SelectionVectorRemover : rowType = RecordType(ANY T10¦¦*, ANY orderdate): rowcount = 2925.0, cumulative cost = {26325.0 rows, 108225.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 2887
00-04 Filter(condition=[AND(>=($1, 1993-01-01), <=($1, 1994-01-01))]) : rowType = RecordType(ANY T10¦¦*, ANY orderdate): rowcount = 2925.0, cumulative cost = {23400.0 rows, 105300.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 2886
00-05 Project(T10¦¦*=[$0], orderdate=[$1]) : rowType = RecordType(ANY T10¦¦*, ANY orderdate): rowcount = 11700.0, cumulative cost = {11700.0 rows, 23400.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 2885
00-06 Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/Users/pchandra/work/data/test_filter_pushdown/0_0_186.parquet], …, ReadEntryWithPath [path=/Users/pchandra/work/data/test_filter_pushdown/0_0_125.parquet], ReadEntryWithPath [path=/Users/pchandra/work/data/test_filter_pushdown/0_0_194.parquet]], selectionRoot=file:/Users/pchandra/work/data/test_filter_pushdown, numFiles=117, usedMetadataFile=false, columns=[`*`]]]) : rowType = (DrillRecordRow[*, orderdate]): rowcount = 11700.0, cumulative cost = {11700.0 rows, 23400.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 2884
Parquet Filter Pushdown (cont.)
Execution with filter pushdown
Parquet Filter Pushdown (cont.)

TPCH Query | Selectivity | Without FltrPD (MB) | With FltrPD (MB) | I/O Reduction
TPCH 06    | 15%         | 5,779               | 1,707            | 70%
TPCH 07    | 30%         | 12,395              | 5,188            | 58%
TPCH 14    | 1%          | 7,915               | 5,254            | 34%
TPCH 20    | 15%         | 9,174               | 8,333            | 9%
Asynchronous Parquet Reader
• High performance queries for scan intensive analytics (~33% I/O reduction)
• Parquet reader improvements include
– Buffered reads
– Parallel reads from file system
– Parallel decompression and decoding
– Reading and decoding is pipelined
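The pipelining of reading and decoding can be sketched with a small producer/consumer pipeline; this is a toy model of the idea, not the actual Drill reader:

```python
import queue
import threading

def read_chunks(n):
    # Stand-in for buffered reads from the file system (the I/O stage).
    for i in range(n):
        yield bytes([i]) * 4

def pipelined_scan(n_chunks):
    """One thread reads ahead into a bounded buffer while the caller
    'decodes', so I/O and CPU work overlap instead of alternating."""
    buf = queue.Queue(maxsize=2)   # bounded read-ahead buffer

    def reader():
        for chunk in read_chunks(n_chunks):
            buf.put(chunk)
        buf.put(None)              # end-of-stream marker

    threading.Thread(target=reader, daemon=True).start()

    decoded = []
    chunk = buf.get()
    while chunk is not None:
        decoded.append(len(chunk))  # stand-in for decompression/decoding
        chunk = buf.get()
    return decoded

print(pipelined_scan(3))  # [4, 4, 4]
```

The bounded queue models buffered, parallel reads: the reader never runs unboundedly ahead of decoding, yet decoding never waits for a read that could have been issued earlier.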
Flexible & Dynamic UDFs
• Self-service ability for end users to deploy UDFs
• Simplified deployment without disruption
  – No admin permissions required on Drillbit nodes, and no Drillbit restarts
• Works in standalone and YARN-based Drill clusters
Refer to Drill Best Practices on the MapR Converge Community: https://community.mapr.com/docs/DOC-1497
Spark 2.0.1
The Trinity of Real Time
(Diagram: real-time producers publish to topics in a global messaging system; a transformational tier feeds a NoSQL database for real-time operational analytics.)
Spark 2.0.1: Whole-Stage Code-Gen: Planner
(Diagram: a physical plan – ParquetRelation → Filter → Project branches feeding a Broadcast Hash Join, then Project → TungstenAggregate → Exchange – with each contiguous chain of operators collapsed into a whole-stage codegen region.)
Whole-Stage Code-Gen: Spark as a Compiler

Query (logical plan: Scan → Filter → Project → Aggregate):

SELECT count(*)
FROM store_sales
WHERE ss_item_sk = 1000

Volcano iterator model (one virtual call per operator per row):

class Filter {
  def next(): Boolean = {
    var found = false
    while (!found && child.next()) {
      found = predicate(child.fetch())
    }
    found
  }
  def fetch(): InternalRow = child.fetch()
  ...
}

Whole-stage code-gen (operators fused into a single loop):

long count = 0;
for (ss_item_sk in store_sales) {
  if (ss_item_sk == 1000) {
    count += 1;
  }
}
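The difference between the two execution models can be illustrated in Python; this is a toy contrast only (Spark generates Java bytecode for the fused stage, not Python):

```python
# Volcano iterator model: each operator pulls rows from its child one at
# a time, paying a call per row per operator boundary.
def volcano_count(rows, key):
    scan = iter(rows)                         # Scan operator
    filtered = (r for r in scan if r == key)  # Filter pulls from Scan
    count = 0
    for _ in filtered:                        # Aggregate pulls from Filter
        count += 1
    return count

# Whole-stage code generation: the planner fuses Scan, Filter, and
# Aggregate into one tight loop with no per-row operator boundaries.
def fused_count(rows, key):
    count = 0
    for r in rows:
        if r == key:
            count += 1
    return count

store_sales = [1000, 2000, 1000, 3000]
print(volcano_count(store_sales, 1000))  # 2
print(fused_count(store_sales, 1000))    # 2
```

Both produce the same answer; the fused version simply removes the per-row indirection, which is the point of whole-stage code generation.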
Spark 2.0.1: In-Memory Columnar Format

In-memory row format (Spark 1.6):
  [1, John, 10] [2, Mike, 20] [3, Bob, 30]
In-memory column format (Spark 2.0+):
  [1, 2, 3] [John, Mike, Bob] [10, 20, 30]

• Efficient: dense storage, easy to index, vectorized processing.
• Compatible: with external systems that use a columnar format; no serialization/copy.
• Extensible: process encoded data, integrate with a columnar cache.
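The two layouts can be shown with plain Python lists; this is illustrative only (Spark's columnar format is an off-heap binary layout, not Python lists):

```python
# The same three records in the two layouts. The columnar layout keeps
# each column's values adjacent, so a scan of one column touches only
# that column's storage and can be processed a vector at a time.
row_store = [(1, "John", 10), (2, "Mike", 20), (3, "Bob", 30)]

ids, names, values = (list(col) for col in zip(*row_store))
# ids    -> [1, 2, 3]
# names  -> ["John", "Mike", "Bob"]
# values -> [10, 20, 30]

# A columnar aggregate reads only the one column it needs:
print(sum(values))  # 60
```

In the row layout, the same aggregate would have to step over every record's id and name just to reach the value field.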
Spark 2.0.1: Structured Streaming Preview
Structured Streaming:
• Lets you treat a stream as if it were a table
• Automatically appends new stream records to that "table"
• Coordinates output to an external sink
Structured Streaming in Spark 2.0 is an alpha release.
(Diagram: streaming data arriving over time yields queryable data at times n, n+1, n+2, n+3.)
Spark 2.0.1: Structured Streaming Preview (cont.)
(Diagram: at each processing time, the data received up to that trigger forms the input table, producing a result table and program output written to external storage.)

Output modes to the external sink:
• Complete: all results are sent to the external sink.
• Append: only new rows added since the last trigger (each "time" in the diagram) are sent to the external sink.
• Update (not yet available): only rows changed since the last trigger are sent to the external sink.
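The complete and append modes can be modeled in a few lines of Python; this is a toy simulation of the semantics, not the Spark API (the real mechanism is `writeStream` with `outputMode()`):

```python
def run_stream(batches, mode):
    """Toy model of Structured Streaming output modes: each trigger
    appends a batch of new records to the unbounded input "table", then
    emits results according to the chosen mode."""
    table, emitted = [], []
    for batch in batches:
        table.extend(batch)              # new records join the input table
        if mode == "complete":
            emitted.append(list(table))  # re-emit everything each trigger
        elif mode == "append":
            emitted.append(list(batch))  # emit only rows new since last trigger
    return emitted

batches = [[1, 2], [3], [4, 5]]
print(run_stream(batches, "complete"))  # [[1, 2], [1, 2, 3], [1, 2, 3, 4, 5]]
print(run_stream(batches, "append"))    # [[1, 2], [3], [4, 5]]
```

The not-yet-available update mode would emit only the rows whose values changed since the last trigger, which matters once aggregations are involved.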
Kafka APIs for MapR Streams
Big Data Is Continuously Generated One Event at a Time

{ "time": "6:01.103",
  "event": "RETWEET",
  "location": { "lat": 40.712784, "lon": -74.005941 } }

{ "time": "5:04.120",
  "severity": "CRITICAL",
  "msg": "Service down" }

{ "card_num": 1234,
  "merchant": "MERCH1",
  "amount": 50 }
Three Core Components of the Streaming Architecture
● Producer: A software-based system connected to the data source. Producers publish event data into a streaming system.
● Streaming/messaging system: A system that takes the data published by the producers, persists it, and reliably delivers it to consumers.
● Consumer: Subscribes to data from streams and manipulates or analyzes that data to look for alerts and insights. In the streaming context, consumers are typically stream processing engines.
Three Core Components of the Streaming Architecture
(Diagram: sources – social media, sensor data, database, data warehouse – feed Kafka producers via custom code and a data collector; Kafka delivers the data to Kafka consumers for stream processing and persistence.)
Simplifying the Streaming Architecture
● Making it easy to ingest data into the streaming system
- Connecting data sources using HTTP, making it simple for
any device to connect with Kafka
- Introducing a framework to connect most common data
systems with Kafka
● Converging the three core components on one platform
Simplifying the Streaming Architecture
(Diagram: social media and sensor data connect via the Kafka REST API; databases and data warehouses connect via Kafka Connect; Kafka feeds stream processing and persistence.)
Kafka Connect: Easy Connection to Data Systems
● Provides prebuilt connectors that allow most common data systems
to connect with Kafka
● Easily connect databases (such as Oracle), data warehouses (such
as Teradata) and Hadoop (HDFS) with Kafka
● Pull-based ingest of data, supporting sources that don't know how to
push into Kafka
● Push-based export of data from Kafka, supporting data systems that
don't know how to pull data from Kafka
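A pull-based source connector is configured with a small JSON document submitted to the Connect REST API. The sketch below is illustrative: the connector class (from the Confluent connector ecosystem), connection URL, and topic prefix are example values, not a specific MapR-shipped connector:

```python
import json

# Hypothetical JDBC source connector configuration in the standard
# Kafka Connect REST format; all values here are illustrative.
jdbc_source = {
    "name": "orders-db-source",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "tasks.max": "1",
        "connection.url": "jdbc:oracle:thin:@db-host:1521/ORCL",
        "mode": "incrementing",              # pull new rows by a growing key
        "incrementing.column.name": "order_id",
        "topic.prefix": "db-",               # one topic per table, e.g. db-orders
    },
}

# An HTTP client would POST this JSON to the Connect REST endpoint
# (e.g. /connectors) to start the connector.
print(json.dumps(jdbc_source, indent=2))
```

The same envelope shape works for push-based sink connectors (e.g. an HDFS sink), with the direction of data flow reversed.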
Kafka Connect: Easy Connection to Data Systems
(Diagram: databases and data warehouses connect through Kafka Connect to Kafka, which feeds stream processing and persistence.)
Kafka REST Proxy: Connect with Kafka using HTTP
● Any device that can communicate using HTTP can now
communicate directly with Kafka
● Any programming language in any runtime environment can now
connect with Kafka using HTTP
● The Kafka REST API eliminates intermediate data collectors
● Simplifying IoT architecture: any car, thermostat, machine sensor,
etc., can now directly communicate with Kafka
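A produce request to the REST proxy is just an HTTP POST with a JSON body. The payload builder below is a sketch: the topic, record fields, and the v1 content type are illustrative assumptions:

```python
import json

def build_records(values):
    """Build a produce request body in the REST proxy's JSON envelope:
    a "records" array whose entries wrap each message as a "value"
    object. The record fields here are illustrative."""
    return json.dumps({"records": [{"value": v} for v in values]})

body = build_records([{"card_num": 1234, "merchant": "MERCH1", "amount": 50}])
# An HTTP client (urllib.request, curl, an embedded device, ...) would
# POST this body to /topics/<topic-name> with a content type such as
# application/vnd.kafka.json.v1+json; no Kafka client library is needed
# on the producing side.
print(body)
```

This is why any HTTP-capable device or language can act as a producer: the entire contract is a URL, a header, and a JSON body.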
Kafka REST Proxy: Connect with Kafka Using HTTP
(Diagram: social media and sensor data connect through the Kafka REST API to Kafka, which feeds stream processing and persistence.)
Converging the Components of Streaming with MapR
(Diagram: on the MapR Converged Data Platform, social media and sensor data arrive via the Kafka REST API, and databases and data warehouses arrive via Kafka Connect; MapR Streams delivers the data to stream processing (Spark) and persistence (MapR-DB, MapR-FS).)
MapR Installer Stanzas
MapR Installer "Stanzas"
• Under the Spyglass initiative, today we are proud to announce MapR Installer Stanzas.
• MapR Installer Stanzas enable API-driven installation for the industry's only Converged Data Platform:
  – A stanza contains the layout and settings for the cluster to be installed
  – It can be programmatically invoked to provision clusters
  – Automates successive cluster creation with minimal changes
  – Designed for both on-premises and cloud deployments
Simple, Easy YAML
Lars Fredriksen
• Built directly on top of the installer REST API
• SDK models generated from swagger.json
• Installed in a virtual Python environment as the mapr_installer_cli module
• Connection management, error handling, YAML parsing, progress status
• Python app driven by YAML configuration
• Commands: install, uninstall, export, list

Example:

environment:
  mapr_core_version: 5.2.0
config:
  hosts:
    - demonode[1-3].example.com
  ssh_id: root
  license_type: enterprise
  mep_version: 2.0
  disks:
    - /dev/sdb
    - /dev/sdc
  services:
    template-05-converged:
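Once a stanza is loaded (e.g., with a YAML parser) it is just nested dictionaries, which makes pre-flight validation easy to script. The dict literal below mirrors the example stanza, and the required-key list is an assumption for illustration, not the installer's actual schema:

```python
# Parsed form of a stanza like the example above; hosts are expanded
# from the demonode[1-3] range for clarity.
stanza = {
    "environment": {"mapr_core_version": "5.2.0"},
    "config": {
        "hosts": ["demonode1.example.com", "demonode2.example.com",
                  "demonode3.example.com"],
        "ssh_id": "root",
        "license_type": "enterprise",
        "mep_version": "2.0",
        "disks": ["/dev/sdb", "/dev/sdc"],
    },
}

def validate_stanza(s):
    """Fail fast, before invoking the installer, if key settings are absent."""
    missing = [k for k in ("hosts", "ssh_id", "disks")
               if k not in s.get("config", {})]
    if missing:
        raise ValueError("stanza missing config keys: " + ", ".join(missing))
    return True

print(validate_stanza(stanza))  # True
```

Checks like this are what make stanza-driven provisioning repeatable: the same validated file can spin up successive clusters with minimal changes.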
How It Fits with the Current Installer Architecture
(Diagram: the GUI frontend (AngularJS + Bootstrap) and "Stanzas" (Python + YAML) talk over HTTPS to the Java REST backend's APIs (Jetty + Jersey + Jackson), which drive the installer core (Python + Ansible) with an embedded DB and deploy to the cluster nodes.)
Other Key Additions
Hue 3.10
Key improvements:
● Oozie improvements
  ○ External workflow graph
  ○ Single action execution
  ○ New ability: dry-run an Oozie job
● New SQL query editor that works over JDBC
  ○ Look for an upcoming Community post on how to use this with Apache Drill!
● Directory- and file-based document management
  ○ Users can create their own directories and subdirectories and drag and drop documents within the simple file browser interface
MapR Connector for Teradata, Powered by Teradata Connector for Hadoop
A Sqoop wrapper that facilitates bulk data transfer between Hadoop and external data storage.
"MapR and Teradata share a customer base that continually drives both of us to simplify and orchestrate their analytical ecosystem. This latest collaboration by our engineers is yet another example of helping leading data-driven organizations realize value from big data faster and easier." – Chad Meley, VP of Marketing at Teradata
As a Reminder…
https://community.mapr.com
• Q&A
• Discussions
• Code snippets
• Tutorials
Q & A
@mapr
Engage with us!
mapr-technologies
Thank You!