Geo-Distributed Big Data and Analytics
Transcript of Geo-Distributed Big Data and Analytics
© 2017 MapR Technologies 1
Geo-Distributed Big Data and Analytics:
Data where you need it
Computation where you want it
© 2017 MapR Technologies 2
Contact Information
Ted Dunning, PhD
Chief Application Architect, MapR Technologies
Board member, Apache Software Foundation
O’Reilly author
Twitter @ted_dunning
© 2017 MapR Technologies 3
Contact Information
Ellen Friedman, PhD
Principal Technologist, MapR Technologies
Committer Apache Drill & Apache Mahout projects
O’Reilly author
Twitter @Ellen_Friedman
© 2017 MapR Technologies 4
Imagine a future where …
You easily collect, access & analyze big data wherever you need it across
the globe, in a seamless system under the same security & administration
© 2017 MapR Technologies 5
The future is here: Global Data Fabric
• Global data fabric lets you update business globally
• Local activities coordinate with global analyses
• Do this without requiring huge teams at each site
• Do this affordably, reliably, with low latency and at large scale
We’re here to explain how, but first: a real-world case study
© 2017 MapR Technologies 6
MapR customer: “A year in the bank”
© 2017 MapR Technologies 7
A Year of Technical Credit, not Debt
• Streaming media delivery is inherently geo-distributed
• Must measure if you want to manage
– metrics need to be collected and processed locally
– and distributed back to central facility
• How?
Streams!
© 2017 MapR Technologies 8
Collect Data
[Diagram: one data center containing two web servers, each with a web-server log and a log-stash agent, plus a log consolidator and a log_events stream]
© 2017 MapR Technologies 9
And Transport to Global Analytics
[Diagram: a data center (two web servers with web-server logs and log-stash agents, a log consolidator, a log_events stream) alongside GHQ (log_events stream, Elaborate, events stream, Aggregate, Signal detection)]
© 2017 MapR Technologies 10
With Many Sources
[Diagram: a data center (two web servers with web-server logs and log-stash agents, a log consolidator, a log_events stream) alongside GHQ (log_events stream, Elaborate, events stream, Aggregate, Signal detection)]
© 2017 MapR Technologies 11
With Many Sources
[Diagram: two data centers (each with two web servers, web-server logs, log-stash agents, a log consolidator, and a log_events stream) alongside GHQ (log_events stream, Elaborate, events stream, Aggregate, Signal detection)]
© 2017 MapR Technologies 12
With Many Sources
[Diagram: three data centers (each with two web servers, web-server logs, log-stash agents, a log consolidator, and a log_events stream) alongside GHQ (log_events stream, Elaborate, events stream, Aggregate, Signal detection)]
© 2017 MapR Technologies 13
Analytics Doesn’t Care About Location
[Diagram: GHQ only, with a log_events stream, Elaborate, an events stream, Aggregate, and Signal detection]
© 2017 MapR Technologies 14
MapR Streams: Geo-distributed data
plus separation of concerns
© 2017 MapR Technologies 15
Why stream?
© 2017 MapR Technologies 16
Mechanism for a Global Data Fabric: Streaming
• Streaming data is becoming mainstream
• Innovative technologies are emerging to handle and process streaming data
• Stream-1st architecture is a powerful approach with surprisingly widespread advantages
© 2017 MapR Technologies 17
IoT Data: Sensors & Smart Parts
Images © E. Friedman; ©WesAbrams
© 2017 MapR Technologies 18
Revolution in Manufacturing: Smart Tools
• Respond quickly to new requirements
• IoT-enabled “smart tools” on the manufacturing floor
• Reconfigurable factory
• Removes barriers to communication: engineering, design, analysis, manufacture in one space
Image credit Bond Bryan Architecture, used with permission
Factory 2050: Boeing AMRC at University of Sheffield
© 2017 MapR Technologies 19
Streaming data has value beyond real-time insights
© 2017 MapR Technologies 20
Predictive Maintenance
• Streaming sensor data + long term maintenance histories
• Machine learning model detects anomalous pattern
• “Failure signature” warns of need for maintenance before damage occurs
Image courtesy of Mtell, used with permission; from Real World Hadoop by Dunning & Friedman © 2015
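As a rough illustration of the idea on this slide, the sketch below flags sensor readings that deviate strongly from a trailing window of recent values. It is a simple rolling z-score stand-in, not the machine learning model the slide refers to; the function name, window size, threshold, and sample readings are all assumptions made for illustration.

```python
from collections import deque
from statistics import mean, pstdev

def find_anomalies(readings, window=5, threshold=3.0):
    """Return indexes of readings that deviate strongly from the trailing window."""
    history = deque(maxlen=window)
    anomalies = []
    for i, value in enumerate(readings):
        if len(history) == window:
            mu = mean(history)
            sigma = pstdev(history)
            # Skip perfectly flat windows to avoid dividing by zero.
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                anomalies.append(i)
        history.append(value)
    return anomalies

# Hypothetical vibration readings; the spike at index 7 is made up.
vibration = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 1.0, 5.0, 1.0, 1.1]
print(find_anomalies(vibration))
```

A real deployment would combine such streaming signals with long-term maintenance histories, as the slide notes; the sketch only shows the streaming half.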
© 2017 MapR Technologies 21
Stream-1st Architecture
[Diagram: Medical tests feed a Medical test results stream; consumers labeled A, B, and C include Real-time analytics, EMR, Patient Facilities management, and Insurance audit]
People often start with event data streamed to a real-time application (A)
Image © 2016 Ted Dunning & Ellen Friedman, from Chap 1 of O’Reilly book Streaming Architecture, used with permission
© 2017 MapR Technologies 22
Heart of Stream-1st Architecture: Message Transport
[Diagram: Medical tests feed a Medical test results stream; consumers labeled A, B, and C include Real-time analytics, EMR, Patient Facilities management, and Insurance audit]
The right messaging tool supports other classes of use cases (B & C)
Image © 2016 Ted Dunning & Ellen Friedman, from Chap 1 of O’Reilly book Streaming Architecture, used with permission
© 2017 MapR Technologies 23
Message Transport: Apache Kafka & MapR Streams
Key capabilities:
• Highly scalable
• High throughput, low latency
• Multiple producers & consumers: decoupled
• Durable messages
• Geo-distributed replication preserves offsets (unique to MapR Streams)
[Diagram: two producers write messages to a stream read by three consumer groups]
Image © 2016 Ted Dunning & Ellen Friedman, from Chap 2 of O’Reilly book Streaming Architecture, used with permission
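The decoupling of producers and consumers listed above can be sketched with a toy in-memory log. This is not the Kafka or MapR Streams API; the `Topic` class and its methods are hypothetical names used only to show how each consumer group tracks its own offset into the same durable sequence of messages.

```python
class Topic:
    """Toy append-only log; an offset is a message's position in the list."""

    def __init__(self):
        self.messages = []
        self.committed = {}  # consumer group name -> next offset to read

    def produce(self, message):
        self.messages.append(message)
        return len(self.messages) - 1  # offset of the new message

    def poll(self, group, max_messages=10):
        start = self.committed.get(group, 0)
        batch = self.messages[start:start + max_messages]
        self.committed[group] = start + len(batch)
        return batch

topic = Topic()
for event in ["a", "b", "c"]:
    topic.produce(event)

print(topic.poll("analytics", max_messages=2))  # first two messages
print(topic.poll("audit"))                      # all three, independently
print(topic.poll("analytics"))                  # the remaining message
```

Because reading does not remove messages, a new consumer group can be added later and still see the full history, which is what lets producers stay unaware of who consumes their data.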
© 2017 MapR Technologies 24
Stream transport supports micro services
© 2017 MapR Technologies 25
Stream-1st Architecture: Basis for Micro-Services
Stream instead of database as the shared “truth”
[Diagram: POS 1..n, Fraud detector, card activity stream, Updater, Last card use, Card analytics, and Other consumers]
Image © 2016 Ted Dunning & Ellen Friedman from Chap 6 of O’Reilly book Streaming Architecture used with permission
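A minimal sketch of the "stream as shared truth" idea, under the assumption that the fraud detector writes card activity to a stream and the other services derive their own views from it. The roles follow the labels in the diagram, but the function names, event fields, and the placeholder fraud rule are hypothetical.

```python
from collections import Counter

# The shared stream: an append-only list of card activity events.
card_activity = []

def fraud_detector(card, location, time):
    """Record a decision in the stream; the rule here is a placeholder."""
    event = {"card": card, "location": location, "time": time, "fraud": False}
    card_activity.append(event)
    return event

def update_last_card_use(events):
    """Updater service: derive a 'last card use' table from the stream."""
    last_use = {}
    for e in events:
        last_use[e["card"]] = (e["location"], e["time"])
    return last_use

def card_analytics(events):
    """A second, independent consumer of the same stream."""
    return Counter(e["location"] for e in events)

fraud_detector("card-1", "SF", 100)
fraud_detector("card-2", "NY", 105)
fraud_detector("card-1", "NY", 110)

print(update_last_card_use(card_activity))
print(card_analytics(card_activity))
```

Each service owns its derived view and can rebuild it by replaying the stream, so no service needs direct access to another service's database.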
© 2017 MapR Technologies 26
MapR Streams: efficient bi-directional, multi-master replication
© 2017 MapR Technologies 27
With MapR, Geo-Distributed Data Appears Local
[Diagram: Data source, stream, Consumer]
© 2017 MapR Technologies 28
With MapR, Geo-Distributed Data Appears Local
[Diagram: Data source, two streams, Consumer]
© 2017 MapR Technologies 29
With MapR, Geo-Distributed Data Appears Local
[Diagram: Data source, two streams, Consumer; Regional Data Center and Global Data Center]
© 2017 MapR Technologies 30
Unique to MapR: Manage Topics at Stream Level
• Many more topics on MapR cluster
• Topics are grouped together in Stream (different from Kafka)
• Policies set at the Stream level, such as time-to-live and ACEs (controlled access at this level is different from Kafka)
• Geo-distributed stream replication (different from Kafka)
[Diagram: a Stream containing Topic 1, Topic 2, and Topic 3]
Image © 2016 Ted Dunning & Ellen Friedman from Chap 5 of O’Reilly book Streaming Architecture used with permission
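To make the stream-level policy idea concrete, here is a toy sketch in which a time-to-live is set once on a stream and applied to every topic inside it. The `Stream` class, its methods, and the path are hypothetical and do not represent the MapR Streams API.

```python
class Stream:
    """Toy container of topics; the time-to-live policy is set once, here."""

    def __init__(self, path, ttl_seconds):
        self.path = path
        self.ttl_seconds = ttl_seconds
        self.topics = {}  # topic name -> list of (timestamp, message)

    def produce(self, topic, message, now):
        self.topics.setdefault(topic, []).append((now, message))

    def expire(self, now):
        """Apply the stream's single TTL to every topic it contains."""
        cutoff = now - self.ttl_seconds
        for name, records in self.topics.items():
            self.topics[name] = [(t, m) for t, m in records if t >= cutoff]

    def read(self, topic):
        return [m for _, m in self.topics.get(topic, [])]

stream = Stream("/demo/sensors", ttl_seconds=60)  # hypothetical path
stream.produce("temperature", "t1", now=0)
stream.produce("humidity", "h1", now=50)
stream.produce("temperature", "t2", now=90)
stream.expire(now=100)
print(stream.read("temperature"), stream.read("humidity"))
```

Grouping topics this way means an administrator sets one policy per stream rather than one per topic, which matters when a cluster holds very many topics.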
© 2017 MapR Technologies 31
Multi-Master Matters: MapR Table Replication
[Diagram: two configurations, A and B; each shows SF and NY sites, both with a DB and a Source]
Better: Bi-directional table replication
Multi-Master Replication Matters
From O’Reilly report “Data Where You Want It: Geo-Distribution of Big Data and
Analytics” © 2017 by Ted Dunning & Ellen Friedman, used with permission
© 2017 MapR Technologies 33
MapR: Files, tables, streams in one technology
[Diagram: within a Data Center, Legacy Applications, Big Data 1.0 Applications, and Next-Gen Applications run on the Converged Data Platform]
Platform services: Web-Scale Storage, Real-Time Database, Stream Transport
Platform properties: High Availability, Real Time, Unified Security, Multi-tenancy, Disaster Recovery, Global Namespace
© 2017 MapR Technologies 34
Example
[Diagram: pathname tree labeled Cluster, Volume mount point, Directories, Files, Table, and Streams]
© 2017 MapR Technologies 35
Geo-distributed data with ease of management
© 2017 MapR Technologies 36
Remember this? Universal Pathname
[Diagram: pathname tree labeled Cluster, Volume mount point, Directories, Files, Table, and Streams]
© 2017 MapR Technologies 37
Global Namespace: Advantage for Geo-Distribution
• Enables program to refer to data anywhere
– On premise, in cloud, around the world
• Easier to manage: Supports separation of concerns
• Unique to MapR
© 2017 MapR Technologies 38
The Need for Cloud
• Fewer machines to purchase, flexibility regarding hardware
• Eases pressure on IT
• Cloud-bursting possible for short-term increase in computing
• Lets you optimize cluster usage (especially if you maintain
cloud neutrality)
© 2017 MapR Technologies 39
Cloud Neutrality for Optimization
[Diagram: Core load in a private on-premise data center (4x cheaper for base load); Burst (4x cheaper for peak loads)]
© 2017 MapR Technologies 40
Hybrid Cloud On-Premise Architecture
• Optimization of resources, but…
• Too complicated unless you have a good platform for geo-distributed data
• MapR is such a platform
© 2017 MapR Technologies 41
The Need for Containers
• To deploy exactly the same thing in many data-centers
• To get predictable behavior in production compared with testing
• Result: Docker-style containers becoming ubiquitous
© 2017 MapR Technologies 42
Key to Lightweight Containers: Persist State to MapR
[Diagram: a container management system runs Stateful and Stateless Applications on top of the Data platform]
• Works for files, tables, streams
• Run stateful applications in stateless containers
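A small sketch of running a stateful application in a stateless container: the application keeps nothing in memory between calls and writes its state to external storage, so a replacement instance picks up where the old one stopped. A temporary directory stands in for a path on the data platform; the `CounterApp` class and file name are hypothetical.

```python
import json
import os
import tempfile

class CounterApp:
    """Keeps no state in memory between calls; every update goes to the store."""

    def __init__(self, state_path):
        self.state_path = state_path

    def increment(self):
        count = 0
        if os.path.exists(self.state_path):
            with open(self.state_path) as f:
                count = json.load(f)["count"]
        count += 1
        with open(self.state_path, "w") as f:
            json.dump({"count": count}, f)
        return count

# A temporary directory stands in for a path on the shared data platform.
platform_dir = tempfile.mkdtemp()
state_file = os.path.join(platform_dir, "counter.json")

first_container = CounterApp(state_file)
first_container.increment()
first_container.increment()
del first_container  # simulate the container being destroyed

replacement = CounterApp(state_file)
print(replacement.increment())
```

The same pattern applies whether the persisted state is a file, a table, or a position in a stream; the container image itself stays identical across data centers.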
© 2017 MapR Technologies 43
Data where you want it, compute where you need it…
© 2017 MapR Technologies 44
... including processing at the IoT edge
© 2017 MapR Technologies 45
MapR Edge
• Small footprint cluster for remote processing
• Intended to run right next to the data-producing device
– Unified end-to-end security policy
– Reliable replication to cloud and data center, even with occasional connections
• Small but full MapR data services (files, tables, streams)
– Normal data protection (snapshots, mirroring, replication)
– Normal management capabilities (volumes, fine-grained access control, monitoring)
© 2017 MapR Technologies 46
Why MapR Edge is Useful
[Diagram: four Data sources and a Report]
• Designed to sit at IoT edge
• Intended to run right next to data-producing device
© 2017 MapR Technologies 47
Who needs MapR Edge?
• Connected car industry
• Telecommunications industry
• Hospitals and medical testing facilities
• Anyone who benefits from global learning but needs local action
Images © Ellen Friedman; ©WesAbrams
© 2017 MapR Technologies 48
MapR Edge: Improves Time to Insight
Use case          Before MapR Edge    After MapR Edge
Oil & gas         48 hours            < 2 hours
Medical device    12 hours            < 15 minutes
Car test & dev    24 hours            < 5 minutes
© 2017 MapR Technologies 49
Use Case: Telecommunications
[Diagram: Callers, Towers, cdr data]
© 2017 MapR Technologies 50
Streaming in Telecom
• Data collection & handling happens at different levels (tower, local data center, central data center)
• Batch: Can take 30 minutes per level
• Streaming: Latency drops to seconds or sub-seconds per level
• Ability to respond as events occur
• MapR Streams enables stream replication with offsets across data centers
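Why replication "with offsets" matters can be shown with a toy model: if the copy in another data center keeps each message at the same offset, a consumer that moves between sites can resume from the position it already reached. The `StreamCopy` class, the `replicate` function, and the record names are hypothetical and only illustrate the concept.

```python
class StreamCopy:
    """Toy copy of a stream in one data center."""

    def __init__(self):
        self.log = []  # list index == message offset

    def append(self, message):
        self.log.append(message)

    def read_from(self, offset):
        return self.log[offset:]

def replicate(source, target):
    """Copy missing messages in order, so offsets match on both sides."""
    target.log.extend(source.log[len(target.log):])

tower_dc, central_dc = StreamCopy(), StreamCopy()
for record in ["cdr-0", "cdr-1", "cdr-2", "cdr-3"]:
    tower_dc.append(record)
replicate(tower_dc, central_dc)

# A consumer read offsets 0 and 1 at the tower site, then moves sites.
next_offset = 2
print(central_dc.read_from(next_offset))
```

If offsets differed between copies, the consumer would have to search for its place or risk re-reading or skipping records after the move.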
© 2017 MapR Technologies 51
Telecom Reporting and Logging
[Diagram: Tower 1 and Tower 2, each with a Data source; HQ with Aggregate]
© 2017 MapR Technologies 52
Use Case: Automotive IoT
[Diagram: Car with CAN Bus; REST over https to a REST GW in the Data Center; Raw stream, Dispatcher, Data stream, Metrics, and two components marked μ]
© 2017 MapR Technologies 53
Global Machine Learning Foundation
[Diagram: GHQ (Global analytics, metrics, models) alongside data center 1 and data center 2, each showing metrics, models, and m1 to m4]
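A minimal sketch of the pattern in this diagram, assuming that sites send metrics to headquarters, headquarters fits one model on the pooled data, and the model is shipped back so each site can act on its own. The "model" here is just a mean-plus-three-standard-deviations threshold; the function names and metric values are hypothetical.

```python
from statistics import mean, pstdev

def fit_global_model(metrics_by_site):
    """At HQ: pool metrics from every site and derive one alert threshold."""
    pooled = [x for values in metrics_by_site.values() for x in values]
    return {"threshold": mean(pooled) + 3 * pstdev(pooled)}

def act_locally(model, reading):
    """At a site: apply the shared model without a round trip to HQ."""
    return "alert" if reading > model["threshold"] else "ok"

# Hypothetical metrics reported by two data centers.
metrics = {
    "data center 1": [10.0, 11.0, 9.0, 10.0],
    "data center 2": [12.0, 10.0, 11.0, 9.0],
}
model = fit_global_model(metrics)  # in practice shipped back via a stream
print(act_locally(model, 10.5), act_locally(model, 25.0))
```

Metrics flowing up and models flowing down are both just replicated streams in this picture, which is why the same transport serves both directions.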
© 2017 MapR Technologies 54
Learn globally, act locally
© 2017 MapR Technologies 55
Use Case: Container Shipping
Over 20% of world’s shipping containers pass through Singapore’s port.
Image © Ellen Friedman 2015, used with permission. From Chap 7 “Streaming Architecture” book. Read free online: http://bit.ly/streams-ebook-ch7
© 2017 MapR Technologies 56
IoT Data for Container Shipping
Tokyo:
Sensors stream data to on-board cluster that reports to onshore cluster while in port
En route to Singapore:
MapR Streams geo-replication sends data to next port before ship arrives.
Problem in Sydney:
Real-time insights alert to “high humidity” in some containers
[Map: Tokyo, Singapore, Sydney, and Corporate HQ, with points labeled A, B, and C]
Details in Chapter 7 “Streaming Architecture” book. Read free online here: http://bit.ly/streams-ebook-ch7
Figure used with permission.
© 2017 MapR Technologies 57
Additional Resources
O’Reilly report by Ted Dunning & Ellen Friedman © March 2017
Download free pdf courtesy of MapR:
http://bit.ly/mapr-geo-distribution-ebook-pdf
O’Reilly book by Ted Dunning & Ellen Friedman © March 2016
Read free courtesy of MapR:
https://mapr.com/streaming-architecture-using-apache-kafka-mapr-streams/
© 2017 MapR Technologies 58
Please support women in tech – help build
girls’ dreams of what they can accomplish
© Ellen Friedman 2015
#womenintech #datawomen
© 2017 MapR Technologies 59
Q&A
@mapr
Maprtechnologies
ENGAGE WITH US
@ted_dunning
@Ellen_Friedman