Implementing the Lambda Architecture with Spring XD

52
© 2014 SpringOne 2GX. All rights reserved. Do not distribute without permission. Implementing the Lambda Architecture with Spring XD By Carlos Queiroz

description

Recorded at SpringOne2GX 2014. Speaker: Carlos Queiroz Big Data Track The lambda architecture has been proposed as a general purpose data system that aims to solve the problem of computing arbitrary functions on an arbitrary dataset in (near) real-time. In this talk we introduce the lambda architecture and show how it can be implemented with SpringXD, GemFireXD and Hadoop (HDFS and MapReduce) as the foundation of the architecture implementation. To validate the architecture we introduce a CDR (Call Detail Record) mining application as use case of the lambda architecture. We finalise by showing a demo of the CDR (Call Detail Record) mining application.

Transcript of Implementing the Lambda Architecture with Spring XD

Page 1: Implementing the Lambda Architecture with Spring XD

© 2014 SpringOne 2GX. All rights reserved. Do not distribute without permission.

Implementing the Lambda Architecture with Spring XD

By Carlos Queiroz

Page 2: Implementing the Lambda Architecture with Spring XD

Me

Advanced Field Engineer

2

Carlos Queiroz

Page 3: Implementing the Lambda Architecture with Spring XD

Agenda

!

• Introduction to the Lambda Architecture • Applying Lambda architecture to a business case • Implementation details

3

Page 4: Implementing the Lambda Architecture with Spring XD

What is Lambda Architecture?

Implementation of a set of desired properties on general purpose big data systems. !

Generic architecture addressing common requirements in big data applications. !

Set of design patterns for dealing with historical and operational data.

4

Page 5: Implementing the Lambda Architecture with Spring XD

Approach is new, ideas not so…

5

Traditional / Relational

Data Sources

The Big Data Ecosystem

Stre

amin

g D

ata

Trad

ition

al

War

ehou

se

Analytics on Data at Rest

Data Warehouse

Analytics on Structured Data

Analytics onData in Motion

MapReduce like modelsNon-Traditional /Non-RelationalData Sources

Non-Traditional / Non-Relational

Data Feeds

Traditional / Relational

Data SourcesIn

tern

et-

Scal

e D

ata

Sets

Stream computing

Hadoop

Page 6: Implementing the Lambda Architecture with Spring XD

Why Lambda Architecture?

To build a data system that can answer questions by running functions that take at the entire dataset as input. A general purpose data system

6

Page 7: Implementing the Lambda Architecture with Spring XD

Desired Properties of such systems

Fault-tolerance

Generic

Scale linearly, horizontally.

Extensible

Be able to achieve low latency updates when necessary.

Be able to “ask” arbitrary questions to the system

7

Page 8: Implementing the Lambda Architecture with Spring XD

How to support those properties?

Decompose the problem into pieces

8

Speed Layer

Batch Layer

Serving Layer

Page 9: Implementing the Lambda Architecture with Spring XD

Batch Layer

9

Master dataset Batch Layer

Batch view

Batch view

Batch view

runBatchLayer() { while(true) recomputeFunctions()}

Page 10: Implementing the Lambda Architecture with Spring XD

The Master dataset

10

Master dataset

Sourceof

Truth

Recompute

Writeonce

Protect

Readmany

Page 11: Implementing the Lambda Architecture with Spring XD

Master dataset requirements

11

Operation Requisite Comments

Writes Efficient appends of new data

Basically add new pieces of data. Easy to append new data

Writes Scalable storage Need to handle possibly, petabytes of data

Reads Support for parallel processing

Functions usually work on the entire dataset. Need to support handling large amounts of data

Reads Vertically partition data Not necessary to look all the data all the time. Some functions may need to look at only relevant data (e.g. 1 week of calls)

Writes/Reads Costs for processing. Flexible storage

Storage costs money (a lot). Need flexibility on how to store and compress data.

Page 12: Implementing the Lambda Architecture with Spring XD

The Master dataset

12

HDFS - Hadoop Distributed File System

Page 13: Implementing the Lambda Architecture with Spring XD

Serving Layer

Makes the batch views “queryable”

13

Batch view

Batch view

Batch updates and random reads

Distrib

uted data

base

Page 14: Implementing the Lambda Architecture with Spring XD

Desired properties on the Serving layer

14

Batch view

Batch view

High ThroughputLow Latency

Page 15: Implementing the Lambda Architecture with Spring XD

Serving layer database requirements

15

Batch writable

Scalable

Random reads

Fault-tolerant

Page 16: Implementing the Lambda Architecture with Spring XD

Batch and serving layers

16

Master dataset Batch Layer

Batch view

Batch view

Batch view

analytical functions,immutable storage

Serving Layer

pre-computed analytical functions

(Views)

Only property missing - low latency updates

Page 17: Implementing the Lambda Architecture with Spring XD

Speed layer

Allows arbitrary functions computed on arbitrary data on (near) real-time.

17

Feed

Feed

Feed

Feed

Feed

Feed

Page 18: Implementing the Lambda Architecture with Spring XD

Speed layer

18

RealTime FunctionRealTime FunctionRealTime FunctionRealTime FunctionRealTime FunctionRealTime FunctionRealTime ViewRealTime ViewRealTime ViewRealTime ViewRealTime ViewRealTime ViewRealTime View

Narrow but more up to date view

Depending of functions complexity incremental computation approach is recommended

Page 19: Implementing the Lambda Architecture with Spring XD

Incremental computation

realtime view = function(new data, realtime view)

19

Page 20: Implementing the Lambda Architecture with Spring XD

Eventual Accuracy

!

Some computations are harder to compute !

For such cases approximations are used. Results are approximate to the correct answer !

Sophisticated queries such as realtime machine learning are usually done with eventual accuracy. Too complex to be computed exactly.

20

Page 21: Implementing the Lambda Architecture with Spring XD

Approximation & Randomisation

!

!

Approximation find an answer correct within some factor

(answer that is within 10% of correct result)

!

Randomisation allow a small probability of failure

(1 in 10,000 give the wrong answer)

!

!

21

Page 22: Implementing the Lambda Architecture with Spring XD

Synopses structures

• Sampling • Sketches • Histograms • Wavelets

!

Library implementations • algebird (https://github.com/twitter/algebird) • samoa (https://github.com/yahoo/samoa/wiki)

22

http://charuaggarwal.net/synopsis.pdf

Page 23: Implementing the Lambda Architecture with Spring XD

Stream processing (real-time function)

Run the realtime functions to update the realtime views

23

Data Ingestion Stream Processing (realtime function)

Page 24: Implementing the Lambda Architecture with Spring XD

One-at-a-time

Divide your processing into worker processes, and put queues between the worker processes

24Queues and workers paradigm

Page 25: Implementing the Lambda Architecture with Spring XD

Generalised one-at-a-time approach

• Works on a higher level • Stream computation defined as a graph (usually).

• Storm, InfoSphere Streams models • Filters and pipes

• Spring XD • At least once in case of failures.

25

cut -d" " -f1 < access.log | sort | uniq -c | sort -rn | less

Page 26: Implementing the Lambda Architecture with Spring XD

Micro-batching

• Small batches of events processed one at a time !

• Exactly one processing !

• Implementations: • Spark Streaming

26

batch #1

batch #2Data Ingestion

Page 27: Implementing the Lambda Architecture with Spring XD

Lambda Architecture

27

Ad-hoc queries

analytical functions

analytical functions,immutable storage

raw data

Data Ingestion

Speed Layer

Batch Layer Serving Layer

raw data

raw data

raw data

raw data

pre-computed analytical functions

Page 28: Implementing the Lambda Architecture with Spring XD

Principles of Lambda Architecture

Store data in it’s rawest form

Immutability and perpetuity !

Re-computation !

Query = function(all data)

28

Page 29: Implementing the Lambda Architecture with Spring XD

Implementing the Lambda Architecture

The ACM DEBS 2014 Grand Challenge1

29

To demonstrate the applicability of event-based systems to provide scalable, real-time analytics over high volume sensor data

1 http://www.cse.iitb.ac.in/debs2014/

Page 30: Implementing the Lambda Architecture with Spring XD

ACM DEBS 2014 Challenge Application

Scenario !

Analysis of energy consumption measurement

30

Short-term load forecasting makes load forecasts based on current load and what was learned over historical data

Load statistics for real-time demand management finds outliers based on the energy consumption

plug

plug

house_id

dev_id

plug_id

Page 31: Implementing the Lambda Architecture with Spring XD

Data model

31

Data source

Field Comments

ID UNIQUE IDENTIFIER

TIMESTAMP TIMESTAMP OF MEASUREMENT

VALUE MEASUREMENT

PROPERTY TYPE: 0 - WORK, 1 - LOAD

PLUG_ID UNIQUE IDENTIFIER OF A PLUG

HOUSEHOLD_ID UNIQUE IDENTIFIER OF A HOUSEHOLD

HOUSE_ID UNIQUE IDENTIFIER OF A HOUSE

Field CommentsTS TIMESTAMP OF STARTING TIME PREDICTION

HOUSE_ID HOUSE ID

PREDICTED_LOAD PREDICTED LOAD

PL - House

Field CommentsTS_START TIMESTAMP OF START TIME WINDOW

TS_STOP TIMESTAMP OF STO PTIME WINDOW

HOUSE_ID HOUSE ID

PERCENTAGE % PLUGS LOAD HIGHER THAN NORMAL

Outliers

Field CommentsTS TIMESTAMP OF STARTING TIME PREDICTION

HOUSE_ID HOUSE ID

HOUSEHOLD_ID

PLUG_ID

PREDICTED_LOAD PREDICTED LOAD

PL - Plug

Page 32: Implementing the Lambda Architecture with Spring XD

Analytical models

32

L(si+2) = (avgLoad(si) +median(avgLoad(sj)))/2

Load prediction per house, per plug

OutliersFor each house calculate the percentage of plugs which have a median load during the last hour greater than the median load of all plugs (in all households of all houses) during the last hour

sj = s(i+2–n⇤k)

It is based on the average load of the current window and the median of the average loads of windows covering the same time of all past days

Page 33: Implementing the Lambda Architecture with Spring XD

How others did

33

Storm-based implementation

Page 34: Implementing the Lambda Architecture with Spring XD

How others did

34

Oracle DB + Oracle Event Processing (OEP)

Page 35: Implementing the Lambda Architecture with Spring XD

How others did

35

http://lsds.doc.ic.ac.uk/projects/seep/debs14-gc

SEEP: Real-Time Stateful Big Data Processing

Page 36: Implementing the Lambda Architecture with Spring XD

Implementation details

Fork from https://github.com/pivotalsoftware/gfxd-demo

36

Thanks to Jens Deppe and William Markito

Page 37: Implementing the Lambda Architecture with Spring XD

Our approach - ACM DEBS 2014 system architecture

37

RabbitMQ Spring XD

Hadoop

GemFire XD Web App(Spring boot)

Batch layer

Serving layer

Speed layerSensorEventSensor

EventSensorEventSensor

EventSensorEvent

Page 38: Implementing the Lambda Architecture with Spring XD

Data ingestion

38

Sensor Events

RabbitMQ

publishes

Stream App

Spring XD

subscribe

s

S

S

S

SS

SS

S

Page 39: Implementing the Lambda Architecture with Spring XD

Speed layer

39

Application flow

Removeduplicates

Filter

Fix missingvaluestransform

Store toGFXD

Sink

Findoutliers

Tap

Load Forecast

Tap

Loadpred. model

Processor

Store toGFXD

Sink

OutliersModelProcessor

not a flow

Store toGFXD

Sink

RT viewRT view

RT view

RT viewRT view

RT view

Load Forecast

Tap

Loadpred. model

Processor

Store toGFXD

Sink RT viewRT view

RT view

P

H

Page 40: Implementing the Lambda Architecture with Spring XD

Stream actions

40

Duplicatesfiltering

enrichment(compute time slice)

computemodel Produce RT View

Page 41: Implementing the Lambda Architecture with Spring XD

Spring XD streams

41

pumpin - sends all data to a master queue

sensoreventenricher - Consumes the data from the queue, filter and transform the data before store on HDFS

findoutliers - Taps from master_ds stream to compute outlier model

loadpred{h,p} - Taps from master_ds stream to compute load prediction for house and plug (2 streams)

5 streams

Page 42: Implementing the Lambda Architecture with Spring XD

Batch Layer

42

Batch job that starts from Spring XD Uses cron to run every X hour/min/day?

Load pred. model

Job

Hadoop

Batch viewBatch view

Batch view

Hadoop based system

Store an immutable and constantly growing master dataset (of all datasets) !

Compute arbitrary functions (models) on the existing datasets. !

Essentially, runs the MR models.

Page 43: Implementing the Lambda Architecture with Spring XD

Spring XD jobs

• A unique job to run the MR model to compute the historical aspect of the models.

• MR job runs every “day/hour/min” • Job is completely independent from the other streams

43

1 job

Load raw_sensortable

computemodel

updateavg, outliers

table

Page 44: Implementing the Lambda Architecture with Spring XD

Serving layer (RT and Batch views)

44

Field CommentsTS TIMESTAMP OF STARTING TIME PREDICTION

HOUSE_ID HOUSE ID

PREDICTED_LOAD PREDICTED LOAD

Field CommentsTS_START TIMESTAMP OF START TIME WINDOW

TS_STOP TIMESTAMP OF STO PTIME WINDOW

HOUSE_ID HOUSE ID

PERCENTAGE % PLUGS LOAD HIGHER THAN NORMAL

Gemfire XD

Get updates from MR and SP models

Holds both batch and real-time views

Field CommentsTS TIMESTAMP OF STARTING TIME PREDICTION

HOUSE_ID HOUSE ID

HOUSEHOLD_ID

PLUG_ID

PREDICTED_LOAD PREDICTED LOAD

Page 45: Implementing the Lambda Architecture with Spring XD

Web UI

45

Web App(Spring boot)

view view

view

Page 46: Implementing the Lambda Architecture with Spring XD

Why Gemfire XD?

‣ Distributed database !

‣ Tightly integrated with Pivotal Hadoop !

‣ SQL support !

‣ Fault-tolerant !

‣ In-memory (fast access)

46

GemFire XD

Hadoop

column_name data_type

column_name data_type

Model: Datahas_many_and_belongs_to_many :teams

Table1

column_name data_type

column_name data_type

Model: Datahas_many_and_belongs_to_many :teams

Table1

column_name data_type

column_name data_type

Model: Datahas_many_and_belongs_to_many :teams

Table1

column_name data_type

column_name data_type

Model: Datahas_many_and_belongs_to_many :teams

Table1

column_name data_type

column_name data_type

Model: Datahas_many_and_belongs_to_many :teams

Table1

column_name data_type

column_name data_type

Model: Datahas_many_and_belongs_to_many :teams

Table1

Page 47: Implementing the Lambda Architecture with Spring XD

Why Pivotal HD (PHD)

47

‣ Hadoop based !

‣ Tightly integrated with Gemfire XD

GemFire XD

Hadoop

column_name data_type

column_name data_type

Model: Datahas_many_and_belongs_to_many :teams

Table1

column_name data_type

column_name data_type

Model: Datahas_many_and_belongs_to_many :teams

Table1

column_name data_type

column_name data_type

Model: Datahas_many_and_belongs_to_many :teams

Table1

column_name data_type

column_name data_type

Model: Datahas_many_and_belongs_to_many :teams

Table1

column_name data_type

column_name data_type

Model: Datahas_many_and_belongs_to_many :teams

Table1

column_name data_type

column_name data_type

Model: Datahas_many_and_belongs_to_many :teams

Table1

Page 48: Implementing the Lambda Architecture with Spring XD

Why Spring XD?

‣ Data Ingestion !

‣ Real-time Analytics !

‣ Workflow Orchestration !

‣ Integration

48

Page 49: Implementing the Lambda Architecture with Spring XD

Full Open source implementation

49

HBase??

Hadoop

SpringXD

Page 50: Implementing the Lambda Architecture with Spring XD

Is Lambda Architecture perfect?

• Suitable for specific use cases • Doesn't contemplate reference data access. • Not always possible to use same model for Real-time and Batch • Other ideas to improve the lambda architecture

• Multiple batch layers • incremental batch layers

50

Page 51: Implementing the Lambda Architecture with Spring XD

Source code

51

https://bitbucket.org/caxqueiroz/debs2014-challenge

Page 52: Implementing the Lambda Architecture with Spring XD

Thank you!!

52