Geo-Distributed Big Data and Analytics

Page 1: Geo-Distributed Big Data and Analytics

© 2017 MapR Technologies 1

Geo-Distributed Big Data and Analytics:

Data where you need it

Computation where you want it

Page 2: Geo-Distributed Big Data and Analytics

Contact Information

Ted Dunning, PhD

Chief Application Architect, MapR Technologies

Board member, Apache Software Foundation

O’Reilly author

Email [email protected] [email protected]

Twitter @ted_dunning

Page 3: Geo-Distributed Big Data and Analytics

Contact Information

Ellen Friedman, PhD

Principal Technologist, MapR Technologies

Committer Apache Drill & Apache Mahout projects

O’Reilly author

Email [email protected] [email protected]

Twitter @Ellen_Friedman

Page 4: Geo-Distributed Big Data and Analytics

Imagine a future where …

You easily collect, access & analyze big data wherever you need it across

the globe, in a seamless system under the same security & administration

Page 5: Geo-Distributed Big Data and Analytics

The future is here: Global Data Fabric

• Global data fabric lets you update business globally

• Local activities coordinate with global analyses

• Do this without requiring huge teams at each site

• Do this affordably, reliably, with low latency and at large scale

We’re here to explain how, but first: a real-world case study

Page 6: Geo-Distributed Big Data and Analytics

MapR customer: “A year in the bank”

Page 7: Geo-Distributed Big Data and Analytics

A Year of Technical Credit, not Debt

• Streaming media delivery is inherently geo-distributed

• Must measure if you want to manage

– metrics need to be collected and processed locally

– and distributed back to central facility

• How?

Streams!

Page 8: Geo-Distributed Big Data and Analytics

Collect Data

[Diagram labels: data center (web servers, web-server logs, log-stash, log consolidator, log_events)]

Page 9: Geo-Distributed Big Data and Analytics

And Transport to Global Analytics

[Diagram labels: data center (web servers, web-server logs, log-stash, log consolidator, log_events); GHQ (log_events, Elaborate, events, Aggregate, Signal detection)]

Page 10: Geo-Distributed Big Data and Analytics

With Many Sources

[Diagram labels: data center (web servers, web-server logs, log-stash, log consolidator, log_events); GHQ (log_events, Elaborate, events, Aggregate, Signal detection)]

Page 11: Geo-Distributed Big Data and Analytics

With Many Sources

[Diagram labels: two data centers (each with web servers, web-server logs, log-stash, log consolidator, log_events); GHQ (log_events, Elaborate, events, Aggregate, Signal detection)]

Page 12: Geo-Distributed Big Data and Analytics

With Many Sources

[Diagram labels: three data centers (each with web servers, web-server logs, log-stash, log consolidator, log_events); GHQ (log_events, Elaborate, events, Aggregate, Signal detection)]

Page 13: Geo-Distributed Big Data and Analytics

Analytics Doesn’t Care About Location

[Diagram labels: GHQ (log_events, Elaborate, events, Aggregate, Signal detection)]

Page 14: Geo-Distributed Big Data and Analytics

MapR Streams: Geo-distributed data

plus separation of concerns

Page 15: Geo-Distributed Big Data and Analytics

Why stream?

Page 16: Geo-Distributed Big Data and Analytics

Mechanism for a Global Data Fabric: Streaming

• Streaming data is becoming mainstream

• Innovative technologies are emerging to handle and process

streaming data

• Stream-1st architecture is a powerful approach with surprisingly

widespread advantages

Page 17: Geo-Distributed Big Data and Analytics

Images © E. Friedman

IoT Data: Sensors & Smart Parts

©WesAbrams

Page 18: Geo-Distributed Big Data and Analytics

Revolution in Manufacturing: Smart Tools

• Respond quickly to new requirements

• IoT enabled “smart tools” on

manufacturing floor

• Reconfigurable factory

• Removes barriers to communication:

engineering, design, analysis,

manufacture in one space

Image credit Bond Bryan Architecture, used with permission

Factory 2050: Boeing AMRC at University of Sheffield

Page 19: Geo-Distributed Big Data and Analytics

Streaming data has value beyond real-time insights

Page 20: Geo-Distributed Big Data and Analytics

Predictive Maintenance

• Streaming sensor data + long term maintenance histories

• Machine learning model detects anomalous pattern

• “Failure signature” warns of need for maintenance before

damage occurs

Image courtesy Mtell used with permission.

in Real World Hadoop by Dunning & Friedman © 2015
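The "failure signature" idea can be illustrated with a small sketch. The slide does not say which model is used, so a simple rolling z-score stands in for the machine learning model here; the function name `find_anomalies`, the window size, the threshold, and the sample readings are all assumptions made for illustration.

```python
from collections import deque
from statistics import mean, pstdev


def find_anomalies(readings, window=5, threshold=3.0):
    """Return indexes of readings that deviate from the trailing-window mean
    by more than `threshold` standard deviations (a stand-in for a real model)."""
    recent = deque(maxlen=window)
    anomalies = []
    for i, value in enumerate(readings):
        if len(recent) == window:
            mu, sigma = mean(recent), pstdev(recent)
            # Skip perfectly flat windows to avoid dividing the signal by zero spread.
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                anomalies.append(i)
        recent.append(value)
    return anomalies


# Hypothetical vibration readings streamed from a sensor; index 6 is a spike.
vibration = [1.0, 1.1, 0.9, 1.0, 1.1, 1.0, 5.0, 1.0, 0.9]
flagged = find_anomalies(vibration)
```

In a real deployment the trailing window would be replaced by a model trained on long-term maintenance histories, as the slide describes; the point of the sketch is only that detection runs on the stream as readings arrive.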

Page 21: Geo-Distributed Big Data and Analytics

Stream-1st Architecture

Real-time analytics

EMR

Patient

Facilities management

Insurance audit

A

B

Medical tests

C

Medical test results

People often start with event

data streamed to a real-time

application (A)

Image © 2016 Ted Dunning & Ellen Friedman from Chap 1 O’Reilly book Streaming Architecture

used with permission

Page 22: Geo-Distributed Big Data and Analytics

Heart of Stream-1st Architecture: Message Transport

Real-time analytics

EMR

Patient

Facilities management

Insurance audit

A

B

Medical tests

C

Medical test results

The right messaging tool

supports other classes of use

cases (B & C)

Image © 2016 Ted Dunning & Ellen Friedman from Chap 1 O’Reilly book Streaming Architecture

used with permission

Page 23: Geo-Distributed Big Data and Analytics

Key capabilities:

Message Transport: Apache Kafka & MapR Streams

• Highly scalable

• High throughput, low latency

• Multiple producers & consumers: decoupled

• Durable messages

• Geo-distributed replication preserves offsets (unique to MapR Streams)

Consumer group

Messages

Producer

Consumer group

Consumer group

Producer

Image © 2016 Ted Dunning & Ellen Friedman from Chap 2 of O’Reilly book Streaming Architecture

used with permission
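Why preserved offsets matter can be shown with a toy in-memory log. This is a sketch under stated assumptions, not the Kafka or MapR Streams API: `TopicLog` and `replicate` are hypothetical names, and a message's offset is simply its position in a Python list.

```python
from dataclasses import dataclass, field


@dataclass
class TopicLog:
    """Toy append-only log; a message's offset is its position in the list."""
    messages: list = field(default_factory=list)

    def append(self, message):
        self.messages.append(message)
        return len(self.messages) - 1  # offset assigned to this message

    def read_from(self, offset):
        return self.messages[offset:]


def replicate(source, replica):
    """Copy the messages the replica lacks, in order, so offsets stay identical."""
    for message in source.read_from(len(replica.messages)):
        replica.append(message)


# Events are produced at a regional site (hypothetical sample data).
regional = TopicLog()
for event in ["play", "pause", "stop", "play"]:
    regional.append(event)

# The log is replicated to a central site.
central = TopicLog()
replicate(regional, central)

# A consumer that committed offset 2 at the regional site can resume
# at the central site from the same offset and see the same messages.
committed_offset = 2
resumed = central.read_from(committed_offset)
```

If replication renumbered messages, the committed offset would point at different data on each cluster and a consumer could not fail over without reprocessing or skipping events.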

Page 24: Geo-Distributed Big Data and Analytics

Stream transport supports micro services

Page 25: Geo-Distributed Big Data and Analytics

Stream-1st Architecture: Basis for Micro-Services

Stream instead of database as the shared “truth”

POS 1..n

Fraud detector

Last card use

Updater

Card analytics

Other

card activity

Image © 2016 Ted Dunning & Ellen Friedman from Chap 6 of O’Reilly book Streaming Architecture used with permission
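The idea of a stream as the shared "truth" can be sketched with a toy log that several services read independently. The `Stream` class, the consumer names, and the card events below are hypothetical and only illustrate the decoupling; they are not the Kafka or MapR Streams API.

```python
class Stream:
    """Toy shared log: producers append, each consumer tracks its own offset."""

    def __init__(self):
        self.events = []
        self.offsets = {}  # consumer name -> next offset to read

    def publish(self, event):
        self.events.append(event)

    def poll(self, consumer):
        """Return the events this consumer has not seen yet and advance its offset."""
        start = self.offsets.get(consumer, 0)
        batch = self.events[start:]
        self.offsets[consumer] = len(self.events)
        return batch


# Point-of-sale terminals publish card activity (hypothetical sample data).
card_activity = Stream()
card_activity.publish({"card": "c1", "city": "SF", "amount": 20})
card_activity.publish({"card": "c1", "city": "NY", "amount": 900})

# Service 1: the updater keeps the last place each card was used.
last_use = {}
for e in card_activity.poll("updater"):
    last_use[e["card"]] = e["city"]

# Service 2: card analytics totals spend per card, reading the same
# events independently of the updater.
totals = {}
for e in card_activity.poll("analytics"):
    totals[e["card"]] = totals.get(e["card"], 0) + e["amount"]
```

Neither service knows the other exists; adding a third consumer later requires no change to the producers or to the existing services.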

Page 26: Geo-Distributed Big Data and Analytics

MapR Streams: efficient bi-directional, multi-master replication

Page 27: Geo-Distributed Big Data and Analytics

With MapR, Geo-Distributed Data Appears Local

stream

Data source

Consumer

Page 28: Geo-Distributed Big Data and Analytics

With MapR, Geo-Distributed Data Appears Local

stream

stream

Data source

Consumer

Page 29: Geo-Distributed Big Data and Analytics

With MapR, Geo-distributed Data Appears Local

stream

stream

Data source

Consumer

Global Data Center

Regional Data Center

Page 30: Geo-Distributed Big Data and Analytics

Unique to MapR: Manage Topics at Stream Level

• Many more topics on MapR cluster

• Topics are grouped together in Stream (different from Kafka)

• Policies set at the Stream level such as time-to-live, ACEs (controlled

access at this level is different than Kafka)

• Geo-distributed stream replication (different from Kafka)

Stream

Topic 1

Topic 3

Topic 2

Image © 2016 Ted Dunning & Ellen Friedman from Chap 5 of O’Reilly book Streaming Architecture used with permission
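Setting policy once at the stream level, rather than per topic, can be sketched as follows. This is a toy model and not the MapR Streams API: `ToyStream`, `StreamPolicy`, the path `/apps/logs`, the user names, and the integer timestamps are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class StreamPolicy:
    ttl_seconds: int    # time-to-live applied to every topic in the stream
    readers: frozenset  # users allowed to read any topic in the stream


@dataclass
class ToyStream:
    path: str
    policy: StreamPolicy
    topics: dict = field(default_factory=dict)  # topic -> list of (timestamp, message)

    def publish(self, topic, timestamp, message):
        self.topics.setdefault(topic, []).append((timestamp, message))

    def read(self, user, topic, now):
        """Apply the stream-wide access and time-to-live policy to one topic."""
        if user not in self.policy.readers:
            raise PermissionError(f"{user} may not read {self.path}")
        cutoff = now - self.policy.ttl_seconds
        return [m for ts, m in self.topics.get(topic, []) if ts >= cutoff]


stream = ToyStream("/apps/logs", StreamPolicy(ttl_seconds=60, readers=frozenset({"analyst"})))
stream.publish("web", 0, "old web event")
stream.publish("web", 100, "new web event")
stream.publish("db", 100, "new db event")

# One policy covers both topics: the expired message is hidden in "web",
# and the same reader list governs "db".
web_events = stream.read("analyst", "web", now=120)
db_events = stream.read("analyst", "db", now=120)
```

Because topics inherit the policy of the stream they belong to, adding a thousand topics adds no new policy objects to administer.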

Page 31: Geo-Distributed Big Data and Analytics

Multi-Master Matters: MapR Table Replication

SF

DB

NY

DB

Source Source

SF

DB

NY

DB

Source Source

A

B

Better:

Bi-directional table replication

Page 32: Geo-Distributed Big Data and Analytics

Multi-Master Replication Matters

From O’Reilly report “Data Where You Want It: Geo-Distribution of Big Data and

Analytics” © 2017 by Ted Dunning & Ellen Friedman, used with permission

Page 33: Geo-Distributed Big Data and Analytics

Legacy Applications

MapR: Files, tables, streams in one technology

Data Center

Big Data 1.0 Applications, Next-Gen Applications

Converged Data Platform

High Availability, Real Time, Unified Security, Multi-tenancy, Disaster Recovery, Global Namespace

Real-Time Database, Stream Transport, Web-Scale Storage

Page 34: Geo-Distributed Big Data and Analytics

Example

Files

Table

Streams

Directories

Cluster

Volume mount point

Page 35: Geo-Distributed Big Data and Analytics

Geo-distributed data with ease of management

Page 36: Geo-Distributed Big Data and Analytics

Remember this? Universal Pathname

Files

Table

Streams

Directories

Cluster

Volume mount point

Page 37: Geo-Distributed Big Data and Analytics

Global Namespace: Advantage for Geo-Distribution

• Enables program to refer to data anywhere

– On premise, in cloud, around the world

• Easier to manage: Supports separation of concerns

• Unique to MapR

Page 38: Geo-Distributed Big Data and Analytics

The Need for Cloud

• Fewer machines to purchase, flexibility regarding hardware

• Eases pressure on IT

• Cloud-bursting possible for short-term increase in computing

• Lets you optimize cluster usage (especially if you maintain

cloud neutrality)

Page 39: Geo-Distributed Big Data and Analytics

Cloud Neutrality for Optimization

Burst

Private

On-premise

data center

Core

4x cheaper for base load

4x cheaper for

peak loads

Page 40: Geo-Distributed Big Data and Analytics

Hybrid Cloud On-Premise Architecture

• Optimization of resources, but…

• Too complicated unless you have a good platform for geo-distributed

data

• MapR is such a platform

Page 41: Geo-Distributed Big Data and Analytics

The Need for Containers

• To deploy exactly the same thing in many data-centers

• To get predictable behavior in production compared with testing

• Result: Docker-style containers becoming ubiquitous

Page 42: Geo-Distributed Big Data and Analytics

Key to Lightweight Containers: Persist State to MapR

Data platform

Stateful

Application

Stateful

Application

Stateless

Application

Container management system

• Works for files, tables, streams

• Run stateful applications in stateless containers
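Running a stateful application in a stateless container can be sketched as below: the application keeps nothing in the container and writes its state to a path on the data platform. A temporary directory stands in for a platform volume mounted into the container, and the `Counter` class and file name are hypothetical.

```python
import json
import tempfile
from pathlib import Path


class Counter:
    """Counts events; all state lives in a file outside the container."""

    def __init__(self, state_path):
        self.state_path = Path(state_path)
        if self.state_path.exists():
            # A previous container left state behind: resume from it.
            self.count = json.loads(self.state_path.read_text())["count"]
        else:
            self.count = 0

    def handle(self, event):
        self.count += 1
        # Persist after every event so the container can be discarded at any time.
        self.state_path.write_text(json.dumps({"count": self.count}))


# Assumption: this temporary directory plays the role of the mounted platform volume.
with tempfile.TemporaryDirectory() as volume:
    state_file = Path(volume) / "counter.json"

    first = Counter(state_file)
    for event in ["a", "b", "c"]:
        first.handle(event)
    del first  # the first container is destroyed

    second = Counter(state_file)  # a fresh container picks up the state
    second.handle("d")
    resumed_count = second.count
```

The container image itself stays identical across data centers; only the mounted data differs, which is what makes the same deployment repeatable everywhere.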

Page 43: Geo-Distributed Big Data and Analytics

Data where you want it, compute where you need it…

Page 44: Geo-Distributed Big Data and Analytics

... including processing at the IoT edge

Page 45: Geo-Distributed Big Data and Analytics

MapR Edge

• Small footprint cluster for remote processing

• Intended to run right next to the data-producing device

– Unified end-to-end security policy

– Reliable replication to cloud and data center, even with occasional connections

• Small but full MapR data services (files, tables, streams)

– Normal data protection (snapshots, mirroring, replication)

– Normal management capabilities (volumes, fine-grained access control, monitoring)

Page 46: Geo-Distributed Big Data and Analytics

Why MapR Edge is Useful

Data

source

Data

source

Data

source

Report

Data

source

• Designed to sit at IoT edge

• Intended to run right next to data-producing device

Page 47: Geo-Distributed Big Data and Analytics

Who needs MapR Edge?

• Connected car industry

• Telecommunications industry

• Hospitals and medical testing facilities

• Anyone who benefits from global learning but needs local

action

© Ellen Friedman

© Ellen Friedman

©WesAbrams

Page 48: Geo-Distributed Big Data and Analytics

MapR Edge: Improves Time to Insight

Use case          Before MapR Edge    After MapR Edge
Oil & gas         48 hours            < 2 hours
Medical device    12 hours            < 15 minutes
Car test & dev    24 hours            < 5 minutes

Page 49: Geo-Distributed Big Data and Analytics

Use Case: Telecommunications

Callers

Towers

cdr data

Page 50: Geo-Distributed Big Data and Analytics

Streaming in Telecom

• Data collection & handling happens at different levels (tower, local data center, central data center)

• Batch: Can take 30 minutes per level

• Streaming: Latency drops to seconds or sub-seconds per level

• Ability to respond as events occur

• MapR Streams enables stream replication with offsets across data centers
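The latency figures above can be turned into a rough end-to-end estimate. This is back-of-the-envelope arithmetic under two assumptions that are not in the slides: delays at each level simply add up, and "seconds or sub-seconds" is taken as 1 second per level.

```python
# The three levels named on the slide.
LEVELS = ["tower", "local data center", "central data center"]


def end_to_end_latency(per_level_seconds, levels=LEVELS):
    """Total delay if each level adds the same latency (a simplifying assumption)."""
    return per_level_seconds * len(levels)


batch_seconds = end_to_end_latency(30 * 60)  # 30 minutes per level, from the slide
streaming_seconds = end_to_end_latency(1)    # assumed 1 second per level

batch_minutes = batch_seconds / 60
improvement = batch_seconds / streaming_seconds
```

Under these assumptions, batch handling takes about 90 minutes to reach the central data center while streaming takes about 3 seconds, which is the difference between reviewing events afterwards and responding as they occur.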

Page 51: Geo-Distributed Big Data and Analytics

Telecom Reporting and Logging

Tower 2

Tower 1

Data

source

HQ

Aggregate

Data

source

Page 52: Geo-Distributed Big Data and Analytics

Use Case: Automotive IoT

[Diagram labels: Car, CAN Bus, REST, https, REST GW, Data Center, μ, Raw stream, Dispatcher, Data stream, Metrics, μ]

Page 53: Geo-Distributed Big Data and Analytics

Global Machine Learning Foundation

[Diagram labels: GHQ (Global analytics, metrics, models); data center 1 and data center 2 (each with metrics, models, m1, m2, m3, m4)]

Page 54: Geo-Distributed Big Data and Analytics

Learn globally, act locally

Page 55: Geo-Distributed Big Data and Analytics

Image © Ellen Friedman 2015, used with permission. From Chap 7 “Streaming

Architecture” book. Read free online: http://bit.ly/streams-ebook-ch7

Over 20% of world’s

shipping containers pass

through Singapore’s port.

Use Case: Container Shipping

Page 56: Geo-Distributed Big Data and Analytics

IoT Data for Container Shipping

Tokyo:

Sensors stream data to on-board cluster that reports to onshore cluster while in port

En route to Singapore:

MapR Streams geo-replication sends data to next port before ship arrives.

Problem in Sydney:

Real-time insights alert to “high humidity” in some containers

Singapore

Tokyo

Sydney

Corporate

HQ

A

B

C

Details in Chapter 7 “Streaming Architecture” book. Read free online here: http://bit.ly/streams-ebook-ch7

Figure used with permission.

Page 57: Geo-Distributed Big Data and Analytics

Additional Resources

O’Reilly report by Ted Dunning & Ellen Friedman © March 2017

Download free pdf courtesy of MapR:

http://bit.ly/mapr-geo-distribution-ebook-pdf

O’Reilly book by Ted Dunning & Ellen Friedman

© March 2016

Read free courtesy of MapR

https://mapr.com/streaming-architecture-using-apache-kafka-mapr-streams/

Page 58: Geo-Distributed Big Data and Analytics

Please support women in tech – help build

girls’ dreams of what they can accomplish

© Ellen Friedman 2015

#womenintech #datawomen

Page 59: Geo-Distributed Big Data and Analytics

Q&A

@mapr

Maprtechnologies

[email protected]

ENGAGE WITH US

@ted_dunning

@Ellen_Friedman