Enabling Real-Time Business with Change Data Capture

Transcript of Enabling Real-Time Business with Change Data Capture

Page 1: Enabling Real-Time Business with Change Data Capture

Enabling Real-Time Business with Change Data Capture

Rupal Shah, Solutions Architect at StreamSets
Audrey Egan, Solutions Engineer at MapR

© 2017 MapR Technologies

Page 2: Enabling Real-Time Business with Change Data Capture

Agenda

• MapR 6.0 for Real-Time Business
• Utilizing change data capture (CDC)
• How StreamSets enables working with CDC
• Demo with MapR and StreamSets
• Q&A

Page 3: Enabling Real-Time Business with Change Data Capture

Complexity: Multiple Data Sources and Pipelines

[Diagram labels: User Interaction, Message Bus, Operational Database, ETL, Analytics/Big Data Cluster, Analytics. Annotation: Batch analysis (e.g., 24 hours).]

Page 4: Enabling Real-Time Business with Change Data Capture

Challenges with Traditional Data Warehouse Based Architectures

Ever-increasing cost
Expensive licensing models combined with the massive volumes of data being created mean the cost of data warehousing is rising significantly.

Inability to scale out
Data warehouses cannot scale out linearly using commodity hardware. Buying new, expensive hardware is straining IT budgets.

Unused data driving cost up
70% of data in the DW is unused, i.e., never queried in the past year.

Misuse of CPU capacity
Almost 60% of CPU capacity is used for ETL/ELT. 15% of CPU is consumed by ETL to load unused data. This affects the performance of queries.

Inability to support non-relational data
Designed for relational data, data warehouses are not suitable for unstructured data coming from sensors, logs, devices, social media, etc.

Inability to support modern analytics
DWs do not support modern analytics technologies such as machine learning and stream processing.

Key figures: >$10K cost/TB; 70% data unused; 60% CPU used for ETL

Page 5: Enabling Real-Time Business with Change Data Capture

Real-Time Analytics and Dashboards with Stream Processing

[Diagram labels: Operational Database (×3), Change Data Capture (×3), Event Streaming with MapR-ES, Stream Processing, Transformations, Data Exploration Using Drill, Real-time Business Intelligence. Annotations: Static Data – Inserts only; Frequently updated data.]

Page 6: Enabling Real-Time Business with Change Data Capture

Powerful Distributed NoSQL Database in MapR Platform

[Diagram: Converged Data Platform]
Platform capabilities: High Availability, Real Time, Unified Security, Multi-tenancy, Disaster Recovery, Global Namespace
Other labels: Existing Enterprise Applications; Batch & Interactive Analytics; Intelligent Applications; Analytics & ML Engines; Cloud-Scale Data Store; Operational Database; Global Event Streams; IoT & Edge; On-Premise, Multi-Cloud, Edge

Page 7: Enabling Real-Time Business with Change Data Capture

MapR-DB 6.0

• Distributed database on commodity hardware with horizontal scale
• Store data that is hierarchical and nested, and that evolves over time.
• Read and write individual document fields, subsets of fields, or whole documents from and to disk.
• Build Java applications with the MapR-DB JSON API library
• Control read and write access to single fields and subsets of fields within a JSON table by using access-control expressions (ACEs).

[Diagram labels: OJAI, HBase, JDBC/ODBC, Hadoop/Spark, CDC; Document, Wide-Column, Key-Value, Custom data models (Graph, Time series); Application Development, Interactive SQL/BI, Batch/Real-time data processing]
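
The idea of reading and writing individual fields or subsets of fields in a nested document can be illustrated with a small, self-contained Python sketch. This is not the MapR-DB JSON (OJAI) API; the helper names `get_field`, `set_field`, and `project` are hypothetical, and a plain dictionary stands in for a stored document.

```python
from typing import Any


def get_field(doc: dict, path: str) -> Any:
    """Return the value at a dotted field path, or None if a segment is missing."""
    node: Any = doc
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def set_field(doc: dict, path: str, value: Any) -> None:
    """Set the value at a dotted field path, creating intermediate maps as needed."""
    *parents, leaf = path.split(".")
    node = doc
    for key in parents:
        child = node.get(key)
        if not isinstance(child, dict):
            child = {}
            node[key] = child
        node = child
    node[leaf] = value


def project(doc: dict, paths: list[str]) -> dict:
    """Return a new document that holds only the requested field paths."""
    out: dict = {}
    for path in paths:
        value = get_field(doc, path)
        if value is not None:
            set_field(out, path, value)
    return out


# Hypothetical document, shaped like the sample row used later in the deck.
doc = {"Name": "John", "Age": 10, "Address": {"State": "CA", "Zip": 94086}}
set_field(doc, "Address.Zip", 94087)            # write a single nested field
subset = project(doc, ["Name", "Address.State"])  # read a subset of fields
```

A real document database applies the same addressing on the server side, so that only the requested fields travel over the network.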

Page 8: Enabling Real-Time Business with Change Data Capture

Powerful and Efficient Data Access w/Native Secondary Indexes

• Built for MapR-DB JSON
• Flexible & efficient queries on any fields
• Scalable & enterprise-grade indexing
  – Auto-propagation
  – Auto-scale
  – Auto-management
• Seamless queries across primary, secondary index tables
• Rich indexing functionality
  – Unlimited indexes, composite indexes with large # of columns, 13 datatypes support, hashed indexes, covering/non-covering query support
• Optimized index-based access for application development & BI/Analytics

Sample row
{ "Name" : "John", "Age" : 10, "Address" : { "Apt" : 802, "Street" : "Holger way", "State" : "CA", "Zip" : 94086 }, "Activity" : "Medium" }

Example index
Create index on t fields "Address.State DESC, Age ASC" covered fields Activity

Example query
Select Age, Address.State, Name, Activity From t Where Age > 10 && Age < 20 Order by Age desc

Primary table
_id    Age  State  Activity
ROW1   10   CA     Medium
ROW2   20   NJ     Low
ROW3   12   FL     Low
ROW4   10   NJ     High
ROW5   20   CA     High

Index table
_id          Activity
20_CA_Row5   High
20_NJ_Row2   Low
12_FL_Row3   Low
10_CA_Row1   Medium
10_NJ_Row4   High
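
To make the index idea concrete, here is a minimal Python sketch that builds a composite index with a covered field over the five rows of the primary table and answers the age-range filter from the index alone. It is an illustration under assumptions, not MapR-DB's implementation: the key order (Age, then State) follows the index table shown on the slide, the index is kept in ascending order and reversed at query time, and `build_index` and `query_age_range` are hypothetical names.

```python
from bisect import bisect_left, bisect_right

# Rows copied from the primary table on the slide.
primary = {
    "ROW1": {"Age": 10, "State": "CA", "Activity": "Medium"},
    "ROW2": {"Age": 20, "State": "NJ", "Activity": "Low"},
    "ROW3": {"Age": 12, "State": "FL", "Activity": "Low"},
    "ROW4": {"Age": 10, "State": "NJ", "Activity": "High"},
    "ROW5": {"Age": 20, "State": "CA", "Activity": "High"},
}


def build_index(table):
    """Build a sorted list of (Age, State, row_id, Activity) entries.

    Age and State form the composite key; Activity is the covered field
    stored next to the key so queries on it need no primary-table lookup.
    """
    return sorted(
        (row["Age"], row["State"], row_id, row["Activity"])
        for row_id, row in table.items()
    )


def query_age_range(index, low, high):
    """Return entries with low < Age < high in descending Age order."""
    ages = [entry[0] for entry in index]
    start = bisect_right(ages, low)   # first entry with Age > low
    stop = bisect_left(ages, high)    # first entry with Age >= high
    return list(reversed(index[start:stop]))


index = build_index(primary)
matches = query_age_range(index, 10, 20)  # Age > 10 and Age < 20
```

Fields that are neither indexed nor covered (Name in the example query) would still require fetching the row from the primary table by its row id, which is the non-covering case.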

Page 9: Enabling Real-Time Business with Change Data Capture

In-place and Continuous Analytics/ML w/Native Spark Connectivity

1. Real-time pipelines: MapR-DB as a sink/destination for Spark Streaming
2. Data transformation: MapR-DB as a source and sink/destination for Spark/SparkSQL

[Diagram labels: MapR-DB, Native Spark Connector, XD]

Page 10: Enabling Real-Time Business with Change Data Capture

Real-Time Data Integration and Microservices w/Global Change Data Capture

• Allows arbitrary external systems to consume changes in MapR-DB tables globally
• Build scalable real-time data hubs for fast ingestion of big data
• Enables real-time, event-driven microservices app fabrics to create rich experiences

[Diagram labels: Change Data Capture, Stream Processing, Remote MapR-DB, Elasticsearch, Solr]
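
How an external system consumes table changes can be sketched in a few lines of Python. The record shape below is an assumption for illustration only (it is not the MapR-DB CDC record format), and `ChangeRecord`, `apply_change`, and `replay` are hypothetical names; a dictionary stands in for the downstream store, such as a search index or a remote table.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeRecord:
    """Hypothetical change record: one insert, update, or delete on a row."""
    op: str                                      # "insert", "update", or "delete"
    row_id: str
    fields: dict = field(default_factory=dict)   # values carried by insert/update


def apply_change(replica: dict, change: ChangeRecord) -> None:
    """Apply a single change record to the downstream copy."""
    if change.op == "insert":
        replica[change.row_id] = dict(change.fields)
    elif change.op == "update":
        # Merge only the changed fields into the existing row.
        replica.setdefault(change.row_id, {}).update(change.fields)
    elif change.op == "delete":
        replica.pop(change.row_id, None)
    else:
        raise ValueError(f"unknown operation: {change.op}")


def replay(changes, replica=None):
    """Apply change records in order and return the resulting copy."""
    replica = {} if replica is None else replica
    for change in changes:
        apply_change(replica, change)
    return replica


changes = [
    ChangeRecord("insert", "ROW1", {"Age": 10, "State": "CA"}),
    ChangeRecord("insert", "ROW2", {"Age": 20, "State": "NJ"}),
    ChangeRecord("update", "ROW1", {"Age": 11}),
    ChangeRecord("delete", "ROW2"),
]
replica = replay(changes)
```

Because each record describes one change rather than a full snapshot, the consumer stays current by applying records in order instead of reloading the table.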

Page 11: Enabling Real-Time Business with Change Data Capture

Optimized Drill integration for Data Exploration & Operational BI

• Flexible schema-less queries on JSON data w/Native Drill integration
• Real-time access to data
• Familiar BI/SQL tool access
• Distributed in-memory query execution on large-scale datasets
• Powerful optimizations for using secondary indexes. Filter/Sort support for indexes
• Cost-based optimization using tablet distribution & data statistics
• Improved parallelization
• Brings together historical and operational data in a unified interface

Faster data access via indexes & distributed execution

Page 12: Enabling Real-Time Business with Change Data Capture

New standards for data warehousing

Past (ETL)
➢ Fixed schema ETL for Data warehouses
➢ Source Data structured and rigid transaction data

Emerging (Ingest)
➢ Explosion of Data Stores – fluid infrastructure
➢ Source Data predominantly multi-structured interaction data
➢ Data Drift: Structure, Semantic, Infrastructure

[Diagram labels: Data Sources, Data Stores, Data Consumers, ETL, ETL, Ingest, Process, Analyze]
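
Structural data drift, the first of the three kinds listed above, can be illustrated with a short Python sketch that compares an incoming record against a baseline schema. This is a simplified illustration and not how StreamSets detects drift; `infer_schema` and `detect_drift` are hypothetical names, and only top-level fields are inspected. Semantic and infrastructure drift are outside its scope.

```python
def infer_schema(record: dict) -> dict:
    """Map each top-level field name to the name of its Python type."""
    return {name: type(value).__name__ for name, value in record.items()}


def detect_drift(baseline: dict, record: dict) -> dict:
    """Report fields added, removed, or retyped relative to the baseline schema."""
    current = infer_schema(record)
    shared = set(current) & set(baseline)
    return {
        "added": sorted(set(current) - set(baseline)),
        "removed": sorted(set(baseline) - set(current)),
        "retyped": sorted(n for n in shared if current[n] != baseline[n]),
    }


# Hypothetical records: the source drops "name", adds "email",
# and starts sending "zip" as a string instead of an integer.
baseline = infer_schema({"id": 1, "name": "Ann", "zip": 94086})
drift = detect_drift(baseline, {"id": 2, "zip": "94086", "email": "ann@example.com"})
```

A fixed-schema ETL job would typically fail or silently drop data on such a record, whereas a drift-aware pipeline can flag the change and keep flowing.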

Page 13: Enabling Real-Time Business with Change Data Capture

© 2017 StreamSets, Inc. All rights reserved. May not be copied, modified, or distributed in whole or part without written consent of StreamSets, Inc. StreamSets Confidential

Full Dataflow Lifecycle Management

Develop Dataflows: StreamSets Data Collector (SDC), Development Environment
● Developers
● Scientists
● Architects

Operate Platform: StreamSets Dataflow Performance Manager (DPM), Operational Environment
● Operators
● Stewards
● Architects

[Diagram labels: Develop and Evolve, Deploy, Operate and Remediate, Execute; Data Plane, Control Plane, Governance]

Page 14: Enabling Real-Time Business with Change Data Capture

Optimize the EDW

Page 15: Enabling Real-Time Business with Change Data Capture

Enable real-time streams

[Diagram labels: Operational Database (×3), Change Data Capture (×3), Real-time Business Intelligence, Microservices, CDC]

Page 16: Enabling Real-Time Business with Change Data Capture

Demo Time

Page 17: Enabling Real-Time Business with Change Data Capture

StreamSets tools

➢ StreamSets Data Collector
  ○ Multiple CDC Origins
  ○ MapR-DB CDC Origin
➢ Convert Apache Sqoop™ Commands Into StreamSets Data Collector Pipelines
  * Continuous and incremental pulls
  * Data Drift Aware
  * Removes scaffolding
  * Automated Monitoring

https://streamsets.com/blog/using-streamsets-data-collector-modernize-apache-sqoop/
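
The notion of continuous, incremental pulls can be sketched with Python's built-in `sqlite3` module: each pull reads only rows past the last offset it saw. This is a generic illustration of offset-based incremental reads, not the behavior of Sqoop or of any StreamSets origin; the `customers` table and the `pull_increment` and `demo` functions are hypothetical.

```python
import sqlite3


def pull_increment(conn, last_offset, batch_size=100):
    """Fetch rows whose id is greater than last_offset; return (rows, new_offset)."""
    rows = conn.execute(
        "SELECT id, name FROM customers WHERE id > ? ORDER BY id LIMIT ?",
        (last_offset, batch_size),
    ).fetchall()
    # Keep the old offset when nothing new arrived.
    new_offset = rows[-1][0] if rows else last_offset
    return rows, new_offset


def demo():
    """Run two pulls against an in-memory table, with a new row arriving in between."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO customers (name) VALUES (?)", [("Ann",), ("Bob",)])
    first, offset = pull_increment(conn, 0)        # initial pull reads both rows
    conn.execute("INSERT INTO customers (name) VALUES ('Cy')")
    second, offset = pull_increment(conn, offset)  # next pull reads only the new row
    conn.close()
    return first, second, offset


first, second, offset = demo()
```

An offset on an incrementing key only notices inserts. Tables that are frequently updated or deleted from need a change data capture origin to surface those operations.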

Page 18: Enabling Real-Time Business with Change Data Capture

Summary

Enable Real-Time Analytic Applications

• Multiple Data Models
• Data Access with Change Data Capture
• Secondary Index Queries
• Full integration with MCP
• Plethora of Connectors
• Enabling Microservices

Page 19: Enabling Real-Time Business with Change Data Capture

Q&A

ENGAGE WITH US

Contact us at:
855-NOW-MAPR
Or
[email protected]

Follow us at:
https://twitter.com/mapr
https://www.linkedin.com/company/mapr-technologies

Page 20: Enabling Real-Time Business with Change Data Capture

Additional Resources

• Upcoming webinars:
  • 11/16 – Self-Service Data Science for Leveraging ML & AI on All of Your Data
  • 11/28 – ML Workshop 2: Machine Learning Model Comparison & Evaluation
• eBooks:
  • O'Reilly Book: "Machine Learning Logistics: Model Management in the Real World"
  • O'Reilly Report: "Data Where You Want It: Geo-Distribution of Big Data and Analytics"
  • O'Reilly Book: "Streaming Architecture: New Designs Using Apache Kafka and MapR Streams"