Distributed Near Real-Time Processing of Sensor Network Data Flows for Smart Grids

Otávio Moraes de Carvalho
Advisor: Prof. Dr. Philippe O. A. Navaux
Co-advisor: Prof. M.Sc. Eduardo Roloff
January 16, 2016
Institute of Informatics | Federal University of Rio Grande do Sul



Table of contents.

1. Introduction

2. Background

3. Design

4. Implementation

5. Evaluation

6. Conclusion and Future work


Introduction.

• Motivation
  • Internet ubiquity
  • Ubiquity of sensors
  • Data velocity
  • Smart Grids

• Objective
  • Provide a scalable platform for distributed near real-time processing of sensor network data flows, focused on the data profiles of Smart Grids

1. How to scale a distributed platform for IoT?
2. How to provide insights in near real-time?
3. How to test a platform like this?


Internet of Things.

• Pervasive presence of sensors that are able to interact with each other through unique addressing schemes and cooperate with their neighbours to reach common goals. [?]

Figure 1: Total units of connected devices - Gartner Inc. 2013 Forecast [?]


Internet of Things.

Figure 2: IoT paradigm as the convergence of different visions [?]


Distributed Stream Processing Systems.

• Online applications that require real-time or near real-time processing are the main motivation.

• Low-latency alternatives to the Hadoop processing approach (MapReduce) are needed [?].

• Common requirements:
  1. Input streams with high to very high data rates (> 10000 events/s).
  2. Relaxed latency constraints (up to a few seconds).
  3. Use cases that require correlating historical and live data.
  4. Systems that scale elastically and support diverse workloads.
  5. Low-overhead fault tolerance that supports out-of-order events and exactly-once semantics.


Distributed Stream Processing Systems.

• The most prominent frameworks in the state of the art:

1. Apache Storm

2. Apache Spark Streaming

3. Apache Flink


Cloud Computing.

• According to the NIST definition [?], Cloud Computing is a model that conveniently provides on-demand network access to a shared pool of configurable computing resources that can be provisioned and released quickly, with little management effort or interaction with the service provider.

Figure 3: Cloud Computing service models stack and their relationships


Big Data.

• NIST defines big data as "Big data shall mean the data of which the data volume, acquisition speed, or data representation limits the capacity of using traditional relational methods to conduct effective analysis or the data which may be effectively processed with important horizontal zoom technologies". [?]

• "3Vs" model: [?]
  1. Volume: as the generation and collection of data keeps growing, the scale of the data becomes increasingly large.
  2. Variety: the various types of data, which include semi-structured and unstructured data such as audio, video, webpages, and text, as well as traditional structured data.
  3. Velocity: the timeliness of big data. Data collection and analysis must be conducted rapidly and in a timely manner to make the most of the commercial value of big data.


Smart Grids.

• For 100 years, there has been no change in the basic structure of the electrical power grid. Experience has shown that the hierarchical, centrally controlled grid of the 20th century is ill-suited to the needs of the 21st century.

• Advanced Metering Infrastructure (AMI): Infrastructure for gathering information through smart meters. Drives the need for high throughput when a large number of IoT meters is used.

• Demand Side Management (DSM): Management of energy generation peaks and reduction of the need for investments in power generation sources.

• Energy Consumption Forecasts: Predict the amount of electricity consumed at a certain point in time. The purpose of electricity load forecasting is efficient, economical, and high-quality planning of energy generation. Drives the need for low processing latency.


Architecture.

• We found a few architectural patterns in the state of the art:

1. Lambda Architecture

2. Kappa Architecture

3. Liquid Architecture


Cyclic Architecture.

• We propose the Cyclic Architecture, a hybrid solution that mixes architectural elements from the Kappa Architecture and the Liquid Architecture.

Figure 4: An overview of the proposed Cyclic Architecture


Dataset.

1. The dataset used to evaluate the platform originates from the 8th ACM International Conference on Distributed Event-Based Systems (DEBS 2014).

2. The synthesized data file contains over 4055 million measurements for 2125 plugs distributed across 40 houses, for a total of 136 GB.

3. The generated measurements cover a period of one month, from Sept. 1st, 2013, 00:00:00, to Sept. 30th, 2013, 23:59:59. For our tests, we used a subset of this file with 100 million measurements, covering the same number of plugs and houses, for a total of 3.6 GB.
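As a rough back-of-the-envelope check of the figures above, the following sketch derives the average event rate and record size implied by the dataset. It assumes decimal gigabytes (1 GB = 10^9 bytes) and treats "over 4055 million" as exactly 4055 million, so the results are approximations; all constant and function names are illustrative.

```python
# Dataset figures as stated on the slide (approximate).
FULL_MEASUREMENTS = 4_055_000_000      # "over 4055 million" measurements
FULL_BYTES = 136 * 10**9               # 136 GB, assuming decimal gigabytes
SUBSET_MEASUREMENTS = 100_000_000      # subset used for the tests
SUBSET_BYTES = int(3.6 * 10**9)        # 3.6 GB, assuming decimal gigabytes
PLUGS = 2125
SECONDS_IN_PERIOD = 30 * 24 * 60 * 60  # Sept. 1st to Sept. 30th


def average_rate(measurements: int, seconds: int) -> float:
    """Average number of measurements per second over the period."""
    return measurements / seconds


def bytes_per_measurement(total_bytes: int, measurements: int) -> float:
    """Average size of one measurement record in bytes."""
    return total_bytes / measurements


full_rate = average_rate(FULL_MEASUREMENTS, SECONDS_IN_PERIOD)
per_plug_rate = full_rate / PLUGS
full_record_size = bytes_per_measurement(FULL_BYTES, FULL_MEASUREMENTS)
subset_record_size = bytes_per_measurement(SUBSET_BYTES, SUBSET_MEASUREMENTS)
```

Under these assumptions the full file averages roughly 1,564 measurements per second across all plugs (under one per second per plug), and each record takes about 34 to 36 bytes.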


Forecasting Method.

• The forecasting method was selected because the algorithm needs to fit the processing capabilities of a distributed stream processing framework. It represents a mixed approach between MLP (Multilayer Perceptron) and Autoregressive Integrated Moving Average (ARIMA) [?].

• More specifically, the set of queries provides a forecast of the load for (1) each house, i.e., house-based, and (2) each individual plug, i.e., plug-based. The forecast for each house and plug is based on the current load of the connected plugs and a plug-specific prediction model.

• The aim of these queries is not to provide the best prediction model, but to stress the interplay between modules for model learning that operate on long-term (historic) data and components that apply the model on top of live, high-velocity data.


Forecasting Method.

$$L(s_{i+2}) = \frac{avgL(s_i) + \mathrm{median}(\{avgL(s_j)\})}{2} \qquad (1)$$

In formula (1), $avgL(s_i)$ represents the current average load for the slice $s_i$. In the case of plug-based prediction, $avgL(s_i)$ is calculated as the average of all load values reported by the given plug with timestamps $\in s_i$. In the case of house-based prediction, $avgL(s_i)$ is calculated as the sum of the average values for each plug within the house.

$\{avgL(s_j)\}$ is the set of average load values for all slices $s_j$ such that:

$$s_j = s_{i+2-n \cdot k} \qquad (2)$$

where $k$ is the number of slices in a 24-hour period and $n$ is a natural number with values between 1 and $\lfloor (i+2)/k \rfloor$. The value of $avgL(s_j)$ is calculated analogously to $avgL(s_i)$ for the plug-based and house-based (sum of averages) variants.
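Formulas (1) and (2) can be sketched in plain Python as follows. This is a minimal illustration, not the platform's actual implementation: the function names, the list-based representation of slice averages, and the fallback used when no historical slice exists yet are all assumptions made for the example.

```python
from statistics import median
from typing import List, Sequence


def forecast_load(avg_loads: Sequence[float], i: int, k: int) -> float:
    """Forecast L(s_{i+2}) from the slice averages avgL(s_0) .. avgL(s_i).

    avg_loads[m] holds avgL(s_m); k is the number of slices per 24 hours.
    """
    if k < 2:
        raise ValueError("k must be at least 2 so that s_{i+2-k} is already known")
    if not 0 <= i < len(avg_loads):
        raise ValueError("slice index i must refer to an observed slice")

    # Formula (2): s_j = s_{i+2-n*k} for n = 1 .. floor((i+2)/k),
    # i.e. the same time of day as the forecast slice on previous days.
    n_max = (i + 2) // k
    history = [avg_loads[i + 2 - n * k] for n in range(1, n_max + 1)]

    current = avg_loads[i]
    if not history:
        # Assumption: with no historical slice available yet, fall back
        # to the current average. The slides do not define this case.
        return current

    # Formula (1): mean of the current average and the historical median.
    return (current + median(history)) / 2


def house_average_loads(plug_avg_loads: Sequence[Sequence[float]]) -> List[float]:
    """House-based avgL(s_m): sum of the per-plug averages for each slice."""
    return [sum(per_slice) for per_slice in zip(*plug_avg_loads)]
```

For example, with `k = 4` and slice averages `[10, 20, 30, 40, 12, 22, 32, 42, 14, 24]`, forecasting from `i = 9` uses the historical slices $s_7$ and $s_3$ (values 42 and 40, median 41) and the current average 24, giving a forecast of 32.5. A house-based forecast would first combine per-plug series with `house_average_loads` and then call `forecast_load` on the result.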


Implementation.

Figure 5: An overview of the stack used to implement the Cyclic Architecture


Processing flow.

Figure 6: An overview of the data processing flow


Platform.

• To evaluate the system, we needed a platform on which to run our tests. The platform was built on Microsoft Azure, which hosts our application, and it was configured using the following settings:


Latency.

Figure 7: Best case scenario - Large batches with 8 processing nodes


Latency.

Figure 8: Worst case scenario - Small batches with 1 processing node


Throughput.

Figure 9: Average message throughput, by number of nodes, with 30-second batches


Throughput.

Figure 10: Average message throughput, by batch size, with 8 processing nodes


Conclusion.

• A system for distributed near real-time processing of data flows, focused on Smart Grid data profiles, was successfully designed and implemented.

• The system scales linearly up to 8 processing nodes, which is important for processing data from large numbers of smart meters.

• The system provides the desired latencies, which is important for delivering load forecasts in time to be used. However, very small batch sizes were found to make processing unstable.

• Larger batch sizes were found to improve throughput at the expense of latency, which increases proportionally.


Future work.

• Improving throughput by increasing the number of parallel data input feeds into Apache Kafka.

• Deeper research on forecasting methods and results on forecast accuracy.

• Studies on fault tolerance and system availability.

• An abstraction layer for machine deployment and management, using Apache YARN or Apache Mesos with Docker containers.


Questions?


References I.

L. Atzori et al. The Internet of Things: A Survey. Computer Networks, 54(15):2787–2805, 2010.

T. Bylander and B. Rosen. A Perceptron-like Online Algorithm for Tracking the Median. In International Conference on Neural Networks, volume 4, pages 2219–2224. IEEE, 1997.

D. Laney. 3-D Data Management: Controlling Data Volume, Velocity and Variety. META Group Original Research Note, 2001.


References II.

I. Lee et al. The Internet of Things (IoT): Applications, Investments and Challenges for Enterprises. Business Horizons, 2015.

P. Mell and T. Grance. The NIST Definition of Cloud Computing. 2011.

A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker. A Comparison of Approaches to Large-Scale Data Analysis. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 165–178. ACM, 2009.


References III.

NIST Big Data PWG. NIST Big Data Interoperability Framework. Reference Architecture, 2014.


D-Streams.

• Treat streaming computation as a series of deterministic batch computations on small time intervals.

• D-Streams bring traditional functional transformation operators and introduce new stateful operators that work over multiple intervals. These include:
  • Windowing
  • Incremental aggregation over sliding windows
  • Time-skewed joins

Figure 11: Comparison between a simple and a windowed DStream
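The D-Stream idea can be illustrated with a small, framework-independent sketch. It is plain Python rather than the Spark Streaming API: `discretize` turns a stream of `(timestamp, value)` events into consecutive fixed-interval batches, and `windowed_sums` performs an incremental aggregation over a sliding window by adding the newest batch and subtracting the one that leaves the window. Both function names and the event format are assumptions made for the example.

```python
from collections import deque
from typing import Iterable, List, Tuple


def discretize(events: Iterable[Tuple[int, float]], batch_seconds: int) -> List[List[float]]:
    """Group (timestamp, value) events into consecutive batch intervals."""
    batches = {}
    for timestamp, value in events:
        batches.setdefault(timestamp // batch_seconds, []).append(value)
    if not batches:
        return []
    first, last = min(batches), max(batches)
    # Intervals without events become empty batches so time stays contiguous.
    return [batches.get(index, []) for index in range(first, last + 1)]


def windowed_sums(batches: Iterable[List[float]], window_batches: int) -> List[float]:
    """Sliding-window sum over the last `window_batches` batches.

    The running total is updated incrementally: the newest batch sum is
    added and the batch sum that falls out of the window is subtracted.
    """
    window = deque()
    total = 0.0
    results = []
    for batch in batches:
        batch_sum = sum(batch)
        window.append(batch_sum)
        total += batch_sum
        if len(window) > window_batches:
            total -= window.popleft()
        results.append(total)
    return results
```

With events `[(0, 1), (1, 2), (5, 3), (12, 4), (17, 5)]` and 5-second batches, `discretize` yields `[[1, 2], [3], [4], [5]]`, and a window of two batches gives the sums `[3, 6, 7, 9]`. Each batch is processed deterministically, and only the window state carries over between intervals.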
