Implementing the Lambda Architecture with Spring XD
description
Transcript of Implementing the Lambda Architecture with Spring XD
© 2014 SpringOne 2GX. All rights reserved. Do not distribute without permission.
Implementing the Lambda Architecture with Spring XD
By Carlos Queiroz
Me
Advanced Field Engineer
2
Carlos Queiroz
Agenda
!
• Introduction to the Lambda Architecture • Applying Lambda architecture to a business case • Implementation details
3
What is Lambda Architecture?
Implementation of a set of desired properties on general purpose big data systems. !
Generic architecture addressing common requirements in big data applications. !
Set of design patterns for dealing with historical and operational data.
4
Approach is new, ideas not so…
5
Traditional / Relational
Data Sources
The Big Data Ecosystem
Stre
amin
g D
ata
Trad
ition
al
War
ehou
se
Analytics on Data at Rest
Data Warehouse
Analytics on Structured Data
Analytics onData in Motion
MapReduce like modelsNon-Traditional /Non-RelationalData Sources
Non-Traditional / Non-Relational
Data Feeds
Traditional / Relational
Data SourcesIn
tern
et-
Scal
e D
ata
Sets
Stream computing
Hadoop
Why Lambda Architecture?
To build a data system that can answer questions by running functions that take at the entire dataset as input. A general purpose data system
6
Desired Properties of such systems
Fault-tolerance
Generic
Scale linearly, horizontally.
Extensible
Be able to achieve low latency updates when necessary.
Be able to “ask” arbitrary questions to the system
7
How to support those properties?
Decompose the problem into pieces
8
Speed Layer
Batch Layer
Serving Layer
Batch Layer
9
Master dataset Batch Layer
Batch view
Batch view
Batch view
runBatchLayer() { while(true) recomputeFunctions()}
The Master dataset
10
Master dataset
Sourceof
Truth
Recompute
Writeonce
Protect
Readmany
Master dataset requirements
11
Operation Requisite Comments
Writes Efficient appends of new data
Basically add new pieces of data. Easy to append new data
Writes Scalable storage Need to handle possibly, petabytes of data
Reads Support for parallel processing
Functions usually work on the entire dataset. Need to support handling large amounts of data
Reads Vertically partition data Not necessary to look all the data all the time. Some functions may need to look at only relevant data (e.g. 1 week of calls)
Writes/Reads Costs for processing. Flexible storage
Storage costs money (a lot). Need flexibility on how to store and compress data.
The Master dataset
12
HDFS - Hadoop Distributed File System
Serving Layer
Makes the batch views “queryable”
13
Batch view
Batch view
Batch updates and random reads
Distrib
uted data
base
Desired properties on the Serving layer
14
Batch view
Batch view
High ThroughputLow Latency
Serving layer database requirements
15
Batch writable
Scalable
Random reads
Fault-tolerant
Batch and serving layers
16
Master dataset Batch Layer
Batch view
Batch view
Batch view
analytical functions,immutable storage
Serving Layer
pre-computed analytical functions
(Views)
Only property missing - low latency updates
Speed layer
Allows arbitrary functions computed on arbitrary data on (near) real-time.
17
Feed
Feed
Feed
Feed
Feed
Feed
Speed layer
18
RealTime FunctionRealTime FunctionRealTime FunctionRealTime FunctionRealTime FunctionRealTime FunctionRealTime ViewRealTime ViewRealTime ViewRealTime ViewRealTime ViewRealTime ViewRealTime View
Narrow but more up to date view
Depending of functions complexity incremental computation approach is recommended
Incremental computation
realtime view = function(new data, realtime view)
19
Eventual Accuracy
!
Some computations are harder to compute !
For such cases approximations are used. Results are approximate to the correct answer !
Sophisticated queries such as realtime machine learning are usually done with eventual accuracy. Too complex to be computed exactly.
20
Approximation & Randomisation
!
!
Approximation find an answer correct within some factor
(answer that is within 10% of correct result)
!
Randomisation allow a small probability of failure
(1 in 10,000 give the wrong answer)
!
!
21
Synopses structures
• Sampling • Sketches • Histograms • Wavelets
!
Library implementations • algebird (https://github.com/twitter/algebird) • samoa (https://github.com/yahoo/samoa/wiki)
22
http://charuaggarwal.net/synopsis.pdf
Stream processing (real-time function)
Run the realtime functions to update the realtime views
23
Data Ingestion Stream Processing (realtime function)
One-at-a-time
Divide your processing into worker processes, and put queues between the worker processes
24Queues and workers paradigm
Generalised one-at-a-time approach
• Works on a higher level • Stream computation defined as a graph (usually).
• Storm, InfoSphere Streams models • Filters and pipes
• Spring XD • At least once in case of failures.
25
cut -d" " -f1 < access.log | sort | uniq -c | sort -rn | less
Micro-batching
• Small batches of events processed one at a time !
• Exactly one processing !
• Implementations: • Spark Streaming
26
batch #1
batch #2Data Ingestion
Lambda Architecture
27
Ad-hoc queries
analytical functions
analytical functions,immutable storage
raw data
Data Ingestion
Speed Layer
Batch Layer Serving Layer
raw data
raw data
raw data
raw data
pre-computed analytical functions
Principles of Lambda Architecture
Store data in it’s rawest form
Immutability and perpetuity !
Re-computation !
Query = function(all data)
28
Implementing the Lambda Architecture
The ACM DEBS 2014 Grand Challenge1
29
To demonstrate the applicability of event-based systems to provide scalable, real-time analytics over high volume sensor data
1 http://www.cse.iitb.ac.in/debs2014/
ACM DEBS 2014 Challenge Application
Scenario !
Analysis of energy consumption measurement
30
Short-term load forecasting makes load forecasts based on current load and what was learned over historical data
Load statistics for real-time demand management finds outliers based on the energy consumption
plug
plug
house_id
dev_id
plug_id
Data model
31
Data source
Field Comments
ID UNIQUE IDENTIFIER
TIMESTAMP TIMESTAMP OF MEASUREMENT
VALUE MEASUREMENT
PROPERTY TYPE: 0 - WORK, 1 - LOAD
PLUG_ID UNIQUE IDENTIFIER OF A PLUG
HOUSEHOLD_ID UNIQUE IDENTIFIER OF A HOUSEHOLD
HOUSE_ID UNIQUE IDENTIFIER OF A HOUSE
Field CommentsTS TIMESTAMP OF STARTING TIME PREDICTION
HOUSE_ID HOUSE ID
PREDICTED_LOAD PREDICTED LOAD
PL - House
Field CommentsTS_START TIMESTAMP OF START TIME WINDOW
TS_STOP TIMESTAMP OF STO PTIME WINDOW
HOUSE_ID HOUSE ID
PERCENTAGE % PLUGS LOAD HIGHER THAN NORMAL
Outliers
Field CommentsTS TIMESTAMP OF STARTING TIME PREDICTION
HOUSE_ID HOUSE ID
HOUSEHOLD_ID
PLUG_ID
PREDICTED_LOAD PREDICTED LOAD
PL - Plug
Analytical models
32
L(si+2) = (avgLoad(si) +median(avgLoad(sj)))/2
Load prediction per house, per plug
OutliersFor each house calculate the percentage of plugs which have a median load during the last hour greater than the median load of all plugs (in all households of all houses) during the last hour
sj = s(i+2–n⇤k)
It is based on the average load of the current window and the median of the average loads of windows covering the same time of all past days
How others did
33
Storm-based implementation
How others did
34
Oracle DB + Oracle Event Processing (OEP)
How others did
35
http://lsds.doc.ic.ac.uk/projects/seep/debs14-gc
SEEP: Real-Time Stateful Big Data Processing
Implementation details
Fork from https://github.com/pivotalsoftware/gfxd-demo
36
Thanks to Jens Deppe and William Markito
Our approach - ACM DEBS 2014 system architecture
37
RabbitMQ Spring XD
Hadoop
GemFire XD Web App(Spring boot)
Batch layer
Serving layer
Speed layerSensorEventSensor
EventSensorEventSensor
EventSensorEvent
Data ingestion
38
Sensor Events
RabbitMQ
publishes
Stream App
Spring XD
subscribe
s
S
S
S
SS
SS
S
Speed layer
39
Application flow
Removeduplicates
Filter
Fix missingvaluestransform
Store toGFXD
Sink
Findoutliers
Tap
Load Forecast
Tap
Loadpred. model
Processor
Store toGFXD
Sink
OutliersModelProcessor
not a flow
Store toGFXD
Sink
RT viewRT view
RT view
RT viewRT view
RT view
Load Forecast
Tap
Loadpred. model
Processor
Store toGFXD
Sink RT viewRT view
RT view
P
H
Stream actions
40
Duplicatesfiltering
enrichment(compute time slice)
computemodel Produce RT View
Spring XD streams
41
pumpin - sends all data to a master queue
sensoreventenricher - Consumes the data from the queue, filter and transform the data before store on HDFS
findoutliers - Taps from master_ds stream to compute outlier model
loadpred{h,p} - Taps from master_ds stream to compute load prediction for house and plug (2 streams)
5 streams
Batch Layer
42
Batch job that starts from Spring XD Uses cron to run every X hour/min/day?
Load pred. model
Job
Hadoop
Batch viewBatch view
Batch view
Hadoop based system
Store an immutable and constantly growing master dataset (of all datasets) !
Compute arbitrary functions (models) on the existing datasets. !
Essentially, runs the MR models.
Spring XD jobs
• A unique job to run the MR model to compute the historical aspect of the models.
• MR job runs every “day/hour/min” • Job is completely independent from the other streams
43
1 job
Load raw_sensortable
computemodel
updateavg, outliers
table
Serving layer (RT and Batch views)
44
Field CommentsTS TIMESTAMP OF STARTING TIME PREDICTION
HOUSE_ID HOUSE ID
PREDICTED_LOAD PREDICTED LOAD
Field CommentsTS_START TIMESTAMP OF START TIME WINDOW
TS_STOP TIMESTAMP OF STO PTIME WINDOW
HOUSE_ID HOUSE ID
PERCENTAGE % PLUGS LOAD HIGHER THAN NORMAL
Gemfire XD
Get updates from MR and SP models
Holds both batch and real-time views
Field CommentsTS TIMESTAMP OF STARTING TIME PREDICTION
HOUSE_ID HOUSE ID
HOUSEHOLD_ID
PLUG_ID
PREDICTED_LOAD PREDICTED LOAD
Web UI
45
Web App(Spring boot)
view view
view
Why Gemfire XD?
‣ Distributed database !
‣ Tightly integrated with Pivotal Hadoop !
‣ SQL support !
‣ Fault-tolerant !
‣ In-memory (fast access)
46
GemFire XD
Hadoop
column_name data_type
column_name data_type
Model: Datahas_many_and_belongs_to_many :teams
Table1
column_name data_type
column_name data_type
Model: Datahas_many_and_belongs_to_many :teams
Table1
column_name data_type
column_name data_type
Model: Datahas_many_and_belongs_to_many :teams
Table1
column_name data_type
column_name data_type
Model: Datahas_many_and_belongs_to_many :teams
Table1
column_name data_type
column_name data_type
Model: Datahas_many_and_belongs_to_many :teams
Table1
column_name data_type
column_name data_type
Model: Datahas_many_and_belongs_to_many :teams
Table1
Why Pivotal HD (PHD)
47
‣ Hadoop based !
‣ Tightly integrated with Gemfire XD
GemFire XD
Hadoop
column_name data_type
column_name data_type
Model: Datahas_many_and_belongs_to_many :teams
Table1
column_name data_type
column_name data_type
Model: Datahas_many_and_belongs_to_many :teams
Table1
column_name data_type
column_name data_type
Model: Datahas_many_and_belongs_to_many :teams
Table1
column_name data_type
column_name data_type
Model: Datahas_many_and_belongs_to_many :teams
Table1
column_name data_type
column_name data_type
Model: Datahas_many_and_belongs_to_many :teams
Table1
column_name data_type
column_name data_type
Model: Datahas_many_and_belongs_to_many :teams
Table1
Why Spring XD?
‣ Data Ingestion !
‣ Real-time Analytics !
‣ Workflow Orchestration !
‣ Integration
48
Full Open source implementation
49
HBase??
Hadoop
SpringXD
Is Lambda Architecture perfect?
• Suitable for specific use cases • Doesn't contemplate reference data access. • Not always possible to use same model for Real-time and Batch • Other ideas to improve the lambda architecture
• Multiple batch layers • incremental batch layers
50
Source code
51
https://bitbucket.org/caxqueiroz/debs2014-challenge
Thank you!!
52