Big Data Applications Made Easy: Fact Or Fiction?

Pivotal Confidential–Internal Use Only

Spring XD

Glenn Renfro, grenfro@pivotal.io, @CPPWFS

Volume

Variety

Velocity

Veracity

420 Million Wearables

90% of enterprise data is unstructured

60-100 sensors in each car

22 Billion sensors by 2020

86% suspect data inaccuracy

30% revenue loss due to bad data quality

500 million tweets each day

2.3 Trillion GBs of Data each day

Data Points: McKinsey, Twitter, Gartner, IBM


Batch and Streaming often handled by multiple platforms

Fragmented Big Data Ecosystem

Not all data Hadoop bound

“One stop shop for developing and deploying Big Data Applications”

SPRING XD: EXTREME DATA


Portable on-prem, YARN, EC2, PCF, Mesos, Docker etc.

Easy to Use, Extend and Integrate with other Technologies

Built on proven Spring EAI and Batch projects

(Volume, Velocity, Veracity, and Variety)

Unified Stream and Batch Operations

Hadoop Batch Workflow Orchestration

Predictive Analytics and Model Scoring

Spring XD to the Rescue


[Diagram: Spring IO platform]

IO EXECUTION: BOOT (Bootable, Minimal, Ops-Ready); GRAILS (Full-stack, Web); XD (Stream, Taps, Jobs)

IO FOUNDATION: WEB (Controllers, REST, WebSocket); INTEGRATION (Channels, Adapters, Filters, Transformers); BATCH (Jobs, Steps, Readers, Writers); BIG DATA (Ingestion, Export, Orchestration, Hadoop); DATA (RELATIONAL DATA ACCESS, NON-RELATIONAL DATA ACCESS); SPRING CORE (FRAMEWORK, SECURITY, GROOVY, REACTOR)

IO COORDINATION: SPRING CLOUD


Spring XD - 10,000 Foot View


Streams

Sources: HTTP, Tail, File, Mail, Twitter, Gemfire, Syslog, TCP, UDP, JMS, RabbitMQ, MQTT, Trigger, Reactor TCP/UDP

Processors: Filter, Transformer, Object-to-JSON, JSON-to-Tuple, Splitter, Aggregator, HTTP Client, JPMML Evaluator, Shell, Groovy, Python, Java

Sinks: File, HDFS, JDBC, TCP, Log, Mail, RabbitMQ, Gemfire, Splunk, MQTT, Dynamic Router, Counters


Create a stream with http as a source and hdfs as a sink. The hdfs --rollover option is set to a small value so that we can read the file on hdfs.
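
As a rough illustration of the stream described above, the sketch below splits a pipe-delimited definition such as http | hdfs --rollover=64 into its source and sink. It is plain Python, not Spring XD's own parser. The function name parse_stream and the rollover value 64 are assumptions made for this example, and the naive split does not handle a | character inside quoted option values.

```python
import shlex


def parse_stream(definition):
    """Split a pipe-delimited stream definition into (module, options) pairs."""
    modules = []
    for segment in definition.split("|"):
        tokens = shlex.split(segment)
        if not tokens:
            raise ValueError("empty module in definition")
        name, options = tokens[0], {}
        for token in tokens[1:]:
            # Options are expected in the form --key=value.
            if not token.startswith("--") or "=" not in token:
                raise ValueError(f"unrecognised option: {token!r}")
            key, value = token[2:].split("=", 1)
            options[key] = value
        modules.append((name, options))
    return modules


# Hypothetical rollover value, kept small as the slide suggests.
stream = parse_stream("http | hdfs --rollover=64")
source, sink = stream[0], stream[-1]
print("source:", source)
print("sink:", sink)
```

The first module in the list plays the role of the source and the last one the sink; anything in between would be a processor.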


Spring XD - Distributed Runtime

[Diagram: the XD Shell sends HTTP POST /streams/aStream “M1 | M2” to the XD Admin (leader). Also shown: two further XD Admin instances, ZooKeeper (Container State), the Message Bus, two XD Containers, and a Spring App Context holding modules M1 and M2.]


Spring XD - Analytics

• Counters and Gauges

• Simple & Field Value Counter (how many tweets for #java)

• Aggregate Counter (how many tweets for #java in the week/day/hr)

• Gauge & Rich Gauge (how many requests / minute?)

• Abstract API implemented in Redis and in-memory

• Predictive Model Evaluation

• JPMML

• Is this transaction fraudulent?

• What group does this user belong to?

• Interoperable with R, Rattle, KNIME, RapidMiner, MADLib
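
To make the counter types above more concrete, here is a minimal in-memory sketch in Python. It is not Spring XD's analytics API: the class names mirror the slide's terms, but the methods, the hourly bucket size and the sample tweets are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from datetime import datetime


class FieldValueCounter:
    """Counts how often each distinct value of a field is seen."""

    def __init__(self):
        self.counts = Counter()

    def record(self, value):
        self.counts[value] += 1


class AggregateCounter:
    """Keeps one running total per hour so counts can be rolled up by day."""

    def __init__(self):
        self.by_hour = defaultdict(int)

    def record(self, when, amount=1):
        # Truncate the timestamp to the start of its hour.
        bucket = when.replace(minute=0, second=0, microsecond=0)
        self.by_hour[bucket] += amount

    def total_for_day(self, day):
        return sum(n for bucket, n in self.by_hour.items() if bucket.date() == day)


# Made-up sample data: (hashtag, time of tweet).
tweets = [
    ("#java", datetime(2015, 3, 2, 9, 15)),
    ("#java", datetime(2015, 3, 2, 9, 45)),
    ("#spring", datetime(2015, 3, 2, 10, 5)),
    ("#java", datetime(2015, 3, 3, 8, 0)),
]

hashtags = FieldValueCounter()
java_over_time = AggregateCounter()
for tag, when in tweets:
    hashtags.record(tag)
    if tag == "#java":
        java_over_time.record(when)

print(hashtags.counts["#java"])
print(java_over_time.by_hour[datetime(2015, 3, 2, 9)])
print(java_over_time.total_for_day(datetime(2015, 3, 2).date()))
```

The field value counter answers "how many tweets for #java", while the aggregate counter answers the same question per hour or per day.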


Jobs

CSV to JDBC

FTP to HDFS

JDBC to HDFS

HDFS to JDBC

HDFS to MongoDB


[Diagram: Spring XD across the SPEED LAYER, BATCH LAYER and SERVING LAYER]

Labels: MOBILE, SENSORS, SOCIAL, FILES; Ingest, Stream Processing, Analytics, Workflow Orchestration, Predictive Modeling, Export; REALTIME VIEWS, BATCH VIEWS, MASTER DATASET; Spring XD, GemFire XD, Spring BOOT; PCF - BOSH Service, PCF - Apps


Unified runtime for both Real-time and Batch use cases

Scalable, Distributed and Fault Tolerant Runtime

Increased Productivity through out-of-the-box components

Closed Loop Analytics through online (stream) and offline (batch) data

Swiss-army knife of data movement and data pipelines

Repeatable ‘turnkey’ solution for next generation data-centric use cases


Agility: Easy to Setup and Run

Writing HTTP Data to HDFS

…that simple!


Spring XD on YARN

Spring XD Running on YARN!

Copies Files to HDFS

Creates manifest.yml

Spring Boot App

‘xd-yarn start admin’

Spring Boot App

‘xd-yarn start container’

Spring Boot App


Even easier with PCF


Natural Fit: Reactive Streaming Pipelines

Moving Average: ‘collect values every 500ms’

Non-Blocking Backpressure

OLD: “take all these items I have whether you can handle them or not”

NEW: “give me the next N available items”

Microbatching: ‘either 1024b or 350ms; trigger downstream processing’
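
The two ideas on this slide can be sketched in a few lines of Python. This is only an illustration, not Reactor's or Spring XD's API: request stands in for the "give me the next N available items" style of demand, and MicroBatcher flushes a buffer when either 1024 bytes or 350 ms is reached. Both names are hypothetical, and time is passed in explicitly so the example stays deterministic.

```python
from itertools import islice


def request(source, n):
    """Pull-style demand: take at most n items and leave the rest unread."""
    return list(islice(source, n))


class MicroBatcher:
    """Buffers items and flushes on a byte limit or an age limit, whichever comes first."""

    def __init__(self, downstream, max_bytes=1024, max_age_ms=350):
        self.downstream = downstream
        self.max_bytes = max_bytes
        self.max_age_ms = max_age_ms
        self.buffer, self.size, self.started_ms = [], 0, None

    def offer(self, item, now_ms):
        self.tick(now_ms)  # flush a stale batch before accepting the new item
        if not self.buffer:
            self.started_ms = now_ms
        self.buffer.append(item)
        self.size += len(item)
        if self.size >= self.max_bytes:
            self.flush()

    def tick(self, now_ms):
        """Flush if the current batch was started max_age_ms or more ago."""
        if self.buffer and now_ms - self.started_ms >= self.max_age_ms:
            self.flush()

    def flush(self):
        if self.buffer:
            self.downstream(self.buffer)
            self.buffer, self.size, self.started_ms = [], 0, None


readings = iter(range(10))
first = request(readings, 3)   # the consumer asks for three items
second = request(readings, 3)  # and later for three more
print(first, second)

batches = []
batcher = MicroBatcher(batches.append)
batcher.offer(b"x" * 600, now_ms=0)
batcher.offer(b"y" * 600, now_ms=10)  # 1200 bytes buffered: size trigger fires
batcher.offer(b"z" * 10, now_ms=20)
batcher.tick(now_ms=400)              # 380 ms since this batch started: time trigger fires
print([len(batch) for batch in batches])
```

With pull-style demand the consumer decides how much it receives, which is what keeps a slow stage from being flooded by a fast one.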


Deployment Manifest – Module Count

• http | doWork | hdfs

[Diagram: 2 http instances, 4 doWork instances, 3 hdfs instances]

stream deploy --name s1 --properties "module.http.count=2,module.doWork.count=4,module.hdfs.count=3"


Deployment Manifest – Module Placement


[Diagram: 2 http instances, 4 doWork instances, 3 hdfs instances; the http instances sit on containers labelled WEB]

--properties "module.http.count=2,module.doWork.count=4,module.hdfs.count=3,module.http.criteria=groups.contains('WEB')"
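
The effect of a criteria expression such as groups.contains('WEB') can be pictured with the Python sketch below. It is a simplification, not Spring XD's deployment logic: the place function, the container ids and their groups are invented for the example, and it assumes at most one instance of a module per matching container.

```python
def place(count, containers, criteria=lambda container: True):
    """Return the ids of up to `count` containers that satisfy the criteria."""
    eligible = [c["id"] for c in containers if criteria(c)]
    return eligible[:count]


# Hypothetical cluster: four containers with different group memberships.
containers = [
    {"id": "c1", "groups": {"WEB"}},
    {"id": "c2", "groups": {"WEB", "HDFS"}},
    {"id": "c3", "groups": {"HDFS"}},
    {"id": "c4", "groups": set()},
]

# Counterpart of module.http.count=2 with criteria groups.contains('WEB').
http_hosts = place(2, containers, lambda c: "WEB" in c["groups"])
print(http_hosts)
```

Modules without a criteria expression would be eligible for any container in this sketch.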


Deployment Manifest – Data Partitioning


[Diagram: 2 http instances, 4 doWork instances, 3 hdfs instances; WEB]

--properties ... module.http.producer.partitionKeyExpression=payload.customerId

Each doWork module instance will always process the same set of customer IDs
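
One way to see why partitioning pins each customer ID to a single doWork instance is the sketch below. It hashes the partition key and takes the result modulo the number of instances. This illustrates the general technique only: the partition_for helper, the use of CRC32 as the hash and the sample payloads are assumptions, not Spring XD's actual implementation.

```python
import zlib
from collections import defaultdict


def partition_for(key, partition_count):
    """Map a partition key to a stable index in [0, partition_count)."""
    # CRC32 is used because Python's built-in hash() for strings varies between runs.
    return zlib.crc32(str(key).encode("utf-8")) % partition_count


DOWORK_INSTANCES = 4  # matches module.doWork.count=4 on the earlier slide

# Made-up payloads for illustration.
payloads = [
    {"customerId": "c-17", "amount": 10},
    {"customerId": "c-42", "amount": 5},
    {"customerId": "c-17", "amount": 7},
    {"customerId": "c-99", "amount": 1},
    {"customerId": "c-42", "amount": 3},
]

seen_by_instance = defaultdict(set)
for payload in payloads:
    index = partition_for(payload["customerId"], DOWORK_INSTANCES)
    seen_by_instance[index].add(payload["customerId"])

for index in sorted(seen_by_instance):
    print(index, sorted(seen_by_instance[index]))
```

Because the index depends only on the key and the instance count, every message for a given customer lands on the same instance for as long as the count stays fixed.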


Learn More

• Project: http://projects.spring.io/spring-xd/

• GitHub: https://github.com/spring-projects/spring-xd/

• Wiki: https://github.com/spring-projects/spring-xd/wiki

• Samples: https://github.com/spring-projects/spring-xd-samples


A NEW PLATFORM FOR A NEW ERA
