Enabling Real-Time Business with Change Data Capture
Transcript of Enabling Real-Time Business with Change Data Capture
© 2017 MapR Technologies 1
Rupal Shah, Solutions Architect at StreamSets
Audrey Egan, Solutions Engineer at MapR
Enabling Real-Time Business with
Change Data Capture
Agenda
• MapR 6.0 for Real-Time Business
• Utilizing change data capture (CDC)
• How StreamSets enables working with CDC
• Demo with MapR and StreamSets
• Q&A
Complexity: Multiple Data Sources and Pipelines
[Diagram labels: User Interaction; Message Bus; Operational Database; ETL; Analytics/Big Data Cluster; Analytics; Batch analysis (e.g., 24 hours)]
Challenges with Traditional Data Warehouse Based Architectures
Ever increasing cost
Expensive licensing models combined with the
massive volumes of data being created mean the
cost of data warehousing is rising significantly.
Inability to scale-out
Data warehouses cannot scale-out linearly using
commodity hardware. Buying new expensive
hardware is straining IT budgets.
Unused data driving cost up
70% of data in the DW is unused, i.e., never queried in
the past year.
Misuse of CPU capacity
Almost 60% of CPU capacity is used for ETL/ELT.
15% of CPU consumed by ETL to load unused data.
This affects performance of queries.
Inability to support non-relational data
Designed for relational data, data warehouses are
not suitable for unstructured data coming from
sensors, logs, devices, social media etc.
Inability to support modern analytics
DWs do not support modern analytics technologies
such as machine learning and stream processing.
>$10K Cost/TB
70% Data Unused
60% CPU used for ETL
Real-Time Analytics and Dashboards with Stream Processing
[Diagram labels: Operational Database (x3); Change Data Capture (x3); Event Streaming with MapR-ES; Stream Processing; Transformations; Data Exploration Using Drill; Real-time Business Intelligence; Static Data – Inserts only; Frequently updated data]
Powerful Distributed NoSQL Database in MapR Platform
[Diagram labels: CONVERGED DATA PLATFORM (High Availability, Real Time, Unified Security, Multi-tenancy, Disaster Recovery, Global Namespace); EXISTING ENTERPRISE APPLICATIONS; BATCH & INTERACTIVE ANALYTICS; INTELLIGENT APPLICATIONS; ANALYTICS & ML ENGINES; ON-PREMISE, MULTI-CLOUD, EDGE; IOT & EDGE; CLOUD-SCALE DATA STORE; OPERATIONAL DATABASE; GLOBAL EVENT STREAMS]
MapR-DB 6.0
• Distributed database on commodity hardware with horizontal scale
• Store data that is hierarchical and nested, and that evolves over time
• Read and write individual document fields, subsets of fields, or whole documents from and to disk
• Build Java applications with the MapR-DB JSON API library
• Control read and write access to single fields and subsets of fields within a JSON table by using access-control expressions (ACEs)
[Diagram labels: OJAI; HBase; JDBC/ODBC; Document; Wide-Column; Key-Value; Custom data models (Graph, Time series); Hadoop/Spark; CDC; Application Development; Interactive SQL/BI; Batch/Real-time data processing]
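The idea of reading and writing individual fields or subsets of fields in a nested document can be sketched in plain Python. This is only an illustration of dotted-path access on a dictionary, not the MapR-DB JSON (OJAI) API; the helper names `get_field`, `set_field`, and `project` are hypothetical.

```python
from copy import deepcopy


def get_field(doc, path):
    """Return the value at a dotted path such as "Address.State", or None if absent."""
    node = doc
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def set_field(doc, path, value):
    """Return a copy of doc with the dotted path set, creating parent documents as needed."""
    result = deepcopy(doc)
    node = result
    keys = path.split(".")
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value
    return result


def project(doc, paths):
    """Return a new document that holds only the requested dotted paths."""
    out = {}
    for path in paths:
        value = get_field(doc, path)
        if value is not None:
            out = set_field(out, path, value)
    return out


# Hypothetical document used only for this illustration.
row = {
    "Name": "John",
    "Age": 10,
    "Address": {"Apt": 802, "Street": "Holger way", "State": "CA", "Zip": 94086},
    "Activity": "Medium",
}
subset = project(row, ["Name", "Address.State"])
updated = set_field(row, "Address.Zip", 94087)
```

Here `project` stands in for reading a subset of fields and `set_field` for updating one field without rewriting the rest of the document by hand.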
Powerful and Efficient Data Access w/Native Secondary indexes
• Built for MapR-DB JSON
• Flexible & Efficient queries on any fields
• Scalable & Enterprise grade indexing
– Auto-propagation
– Auto-scale
– Auto-management
• Seamless queries across primary,
secondary index tables
• Rich indexing functionality
– Unlimited indexes, composite indexes with large
# of columns, 13 datatypes support, hashed
indexes, covering/non-covering query support
• Optimized index based access for
application development & BI/Analytics
Sample row
{ "Name" : "John", "Age" : 10, "Address" : { "Apt" : 802, "Street" : "Holger way", "State" : "CA", "Zip" : 94086 }, "Activity" : "Medium" }
Example index
Create index on t fields "Address.State DESC, Age ASC" covered fields Activity
Example query
Select Age, Address.State, Name, Activity From t Where Age > 10 && Age < 20 Order by Age desc
Primary table
_id    Age   State   Activity
ROW1   10    CA      Medium
ROW2   20    NJ      Low
ROW3   12    FL      Low
ROW4   10    NJ      High
ROW5   20    CA      High
Index table
_id          Activity
20_CA_Row5   High
20_NJ_Row2   Low
12_FL_Row3   Low
10_CA_Row1   Medium
10_NJ_Row4   High
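A rough way to see how a composite index with a covered field can answer the range query is the plain-Python sketch below. It is an illustration only and does not use any MapR-DB API. It assumes the index key is (Age, State, row id), matching the layout of the index table shown on the slide, that ages are integers, and that Activity is the covered field; `build_index` and `query_age_range` are hypothetical names. Name is not in the index, so fetching it would still require a lookup in the primary table.

```python
import bisect

# Primary table from the slide, keyed by row id.
PRIMARY = {
    "ROW1": {"Age": 10, "State": "CA", "Activity": "Medium"},
    "ROW2": {"Age": 20, "State": "NJ", "Activity": "Low"},
    "ROW3": {"Age": 12, "State": "FL", "Activity": "Low"},
    "ROW4": {"Age": 10, "State": "NJ", "Activity": "High"},
    "ROW5": {"Age": 20, "State": "CA", "Activity": "High"},
}


def build_index(table):
    """Build a sorted list of ((Age, State, row id), covered Activity) entries."""
    entries = [
        ((row["Age"], row["State"], rid), row["Activity"])
        for rid, row in table.items()
    ]
    entries.sort()
    return entries


def query_age_range(index, low, high):
    """Return rows with low < Age < high, ordered by Age descending.

    Only the index is read: Age and State come from the key and Activity
    from the covered field, so the primary table is never touched.
    """
    keys = [key for key, _ in index]
    start = bisect.bisect_left(keys, (low + 1,))  # first key with Age > low (integer ages)
    end = bisect.bisect_left(keys, (high,))       # first key with Age >= high
    hits = index[start:end]
    return [
        {"_id": rid, "Age": age, "State": state, "Activity": activity}
        for (age, state, rid), activity in reversed(hits)
    ]


index = build_index(PRIMARY)
result = query_age_range(index, 10, 20)
```

With the slide's data, only ROW3 (Age 12) falls strictly between 10 and 20, so `result` holds that single row.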
In-place and Continuous Analytics/ML
w/Native Spark Connectivity
1. Real-time pipelines - MapR-DB as a sink/destination for Spark Streaming
2. Data transformation - MapR-DB as a source and sink/destination for Spark/SparkSQL
[Diagram labels: MapR-DB; Native Spark Connector (shown three times)]
Real-Time Data Integration and Micro-Services w/Global
Change Data Capture
• Allows arbitrary external systems to consume changes in MapR-DB tables globally
• Build scalable real-time data hubs for fast ingestion of big data
• Enables real-time, event-driven microservice app fabrics to create rich experiences
[Diagram labels: Change Data Capture; Stream Processing; Remote MapR-DB; Elasticsearch; Solr]
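How an external system might consume table changes can be sketched as a small replay loop. This is a plain-Python illustration, not the MapR-DB CDC API: the event shape (`op`, `id`, `fields`) is an assumption made for this example, and the `replica` dictionary stands in for a downstream store such as a search index.

```python
def apply_change(replica, event):
    """Apply one change event to the replica in place."""
    op, doc_id = event["op"], event["id"]
    if op == "insert":
        replica[doc_id] = dict(event["fields"])
    elif op == "update":
        # Merge only the changed fields into the existing document.
        replica.setdefault(doc_id, {}).update(event["fields"])
    elif op == "delete":
        replica.pop(doc_id, None)
    else:
        raise ValueError(f"unknown operation: {op!r}")


def replay(replica, events):
    """Apply events in order; ordering matters for updates and deletes."""
    for event in events:
        apply_change(replica, event)
    return replica


# Hypothetical change events in the order they were captured.
events = [
    {"op": "insert", "id": "ROW1", "fields": {"Age": 10, "Activity": "Medium"}},
    {"op": "insert", "id": "ROW2", "fields": {"Age": 20, "Activity": "Low"}},
    {"op": "update", "id": "ROW1", "fields": {"Activity": "High"}},
    {"op": "delete", "id": "ROW2"},
]
replica = replay({}, events)
```

Because each event carries only the change, the downstream copy stays current without re-reading the whole source table.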
Optimized Drill integration for Data Exploration & Operational BI
• Flexible schema-less queries on JSON data
w/Native Drill integration
• Real-time access to data
• Familiar BI/SQL tool access
• Distributed in-memory query execution on
large scale datasets
• Powerful optimizations for using Secondary
indexes. Filter/Sort support for indexes
• Cost based optimization using tablet distribution
& data statistics
• Improved parallelization
• Brings together historical and operational data
in a unified interface
Faster data access via indexes & distributed execution
New standards for data warehousing
Past (ETL)
➢ Fixed schema ETL for Data warehouses
➢ Source Data structured and rigid transaction data
Emerging (Ingest)
➢ Explosion of Data Stores – fluid infrastructure
➢ Source Data predominantly multi-structured interaction data
➢ Data Drift: Structure, Semantic, Infrastructure
[Diagram labels: Data Sources; Data Stores; Data Consumers; ETL, ETL; Ingest, Process, Analyze]
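Structure drift, the first of the three drift types listed, can be illustrated with a small check that compares an incoming record against the fields a pipeline expects. This is a minimal sketch under stated assumptions, not how StreamSets detects drift; the function name `detect_structure_drift` and the expected schema are hypothetical.

```python
def detect_structure_drift(expected, record):
    """Compare a record with an expected {field name: type} mapping.

    Returns the fields that were added, are missing, or changed type.
    """
    added = sorted(set(record) - set(expected))
    missing = sorted(set(expected) - set(record))
    retyped = sorted(
        name
        for name in set(expected) & set(record)
        if not isinstance(record[name], expected[name])
    )
    return {"added": added, "missing": missing, "retyped": retyped}


# Hypothetical schema the pipeline was built against.
EXPECTED = {"id": int, "name": str, "zip": int}

# The source started sending zip as a string, dropped name, and added email.
incoming = {"id": 7, "zip": "94086", "email": "john@example.com"}
drift = detect_structure_drift(EXPECTED, incoming)
```

A fixed-schema ETL job would typically fail or silently drop data on such a record; a drift-aware pipeline can report the difference and keep running.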
© 2017 StreamSets, Inc. All rights reserved. May not be copied, modified, or distributed in whole or part without written consent of StreamSets, Inc. StreamSets Confidential
Full Dataflow Lifecycle Management
Develop Dataflows
StreamSets Data Collector (SDC)
Development Environment
● Developers
● Scientists
● Architects
Operate Platform
StreamSets Dataflow Performance Manager (DPM)
Operational Environment
● Operators
● Stewards
● Architects
[Diagram labels: Develop and Evolve; Deploy; Execute; Operate and Remediate; DATA PLANE; CONTROL PLANE; GOVERNANCE]
Optimize the EDW
Enable real-time streams
[Diagram labels: Operational Database (x3); Change Data Capture (x3); Real-time Business Intelligence; Microservices; CDC]
Demo Time
StreamSets tools
➢ StreamSets Data Collector
○ Multiple CDC Origins
○ MapR-DB CDC Origin
➢ Convert Apache Sqoop™ Commands Into StreamSets Data Collector Pipelines
* Continuous and incremental pulls
* Data Drift Aware
* Removes scaffolding
* Automated Monitoring
https://streamsets.com/blog/using-streamsets-data-collector-modernize-apache-sqoop/
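The "continuous and incremental pulls" idea can be sketched as an offset-based query that fetches only rows newer than the last one seen. This is an illustration with Python's built-in `sqlite3` module, not a StreamSets Data Collector pipeline; the `customers` table, its columns, and the function `pull_increment` are hypothetical.

```python
import sqlite3


def pull_increment(conn, last_offset, batch_size=100):
    """Return (rows, new_offset) for rows whose id is greater than last_offset."""
    rows = conn.execute(
        "SELECT id, name FROM customers WHERE id > ? ORDER BY id LIMIT ?",
        (last_offset, batch_size),
    ).fetchall()
    # Keep the old offset when nothing new arrived.
    new_offset = rows[-1][0] if rows else last_offset
    return rows, new_offset


# In-memory database standing in for the operational source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO customers (id, name) VALUES (?, ?)",
    [(1, "Ann"), (2, "Bob"), (3, "Cy")],
)

first_batch, offset = pull_increment(conn, 0)

# A new row arrives; the next pull returns only that row.
conn.execute("INSERT INTO customers (id, name) VALUES (?, ?)", (4, "Di"))
second_batch, offset = pull_increment(conn, offset)
third_batch, offset = pull_increment(conn, offset)
```

An offset on an increasing key picks up new inserts only; updates and deletes to existing rows are not seen this way, which is one reason to read changes from a CDC feed instead.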
Summary
Multiple Data Models
Data Access with
Change Data Capture
Secondary Index Queries
Full integration with MCP
Plethora of Connectors
Enabling Microservices
Enable Real-Time Analytic Applications
Q&A
ENGAGE WITH US
Contact us at:
855-NOW-MAPR
Or follow us at:
https://twitter.com/mapr
https://www.linkedin.com/company/mapr-technologies
Additional Resources
• Upcoming webinars:
– 11/16 – Self-Service Data Science for Leveraging ML & AI on All of Your Data
– 11/28 – ML Workshop 2: Machine Learning Model Comparison & Evaluation
• eBooks:
– O'Reilly Book: "Machine Learning Logistics: Model Management in the Real World"
– O'Reilly Report: "Data Where You Want It: Geo-Distribution of Big Data and Analytics"
– O'Reilly Book: "Streaming Architecture: New Designs Using Apache Kafka and MapR Streams"