High-Volume Data Collection and Real Time Analytics Using Redis

Post on 09-May-2015


description

In this talk, we describe using Redis, an open-source, in-memory key-value store, to capture large volumes of data from numerous remote sources while also allowing real-time monitoring and analytics. With this approach, we were able to capture a high volume of continuous data from numerous remote environmental sensors while consistently querying our database for real-time monitoring and analytics. See more of my work at http://www.codehenge.net

Transcript of High-Volume Data Collection and Real Time Analytics Using Redis

Large-Scale Data Collection Using Redis

C. Aaron Cois, Ph.D. -- Tim Palko
CMU Software Engineering Institute

© 2011 Carnegie Mellon University

Us

C. Aaron Cois, Ph.D.

Software Architect, Team Lead
CMU Software Engineering Institute
Digital Intelligence and Investigations Directorate

Tim Palko

Senior Software Engineer
CMU Software Engineering Institute
Digital Intelligence and Investigations Directorate


@aaroncois

Overview

• Problem Statement
• Sensor Hardware & System Requirements
• System Overview
– Data Collection
– Data Modeling
– Data Access
– Event Monitoring and Notification
• Conclusions and Future Work

The Goal

Critical infrastructure/facility protection

via

Environmental Monitoring

Why?

Stuxnet

• Two major components:
1) Send centrifuges spinning wildly out of control
2) Record ‘normal operations’ and play them back to operators during the attack [1]

• Environmental monitoring provides secondary indicators, such as abnormal heat/motion/sound

[1] http://www.nytimes.com/2011/01/16/world/middleeast/16stuxnet.html?_r=2&

The Broader Vision

Quick, flexible out-of-band monitoring

• Set up monitoring in minutes
• Versatile sensors, easily repurposed
• Data communication is secure (P2P VPN) and requires no existing systems other than outbound networking

A CMU research project called Sensor Andrew

• Features:
– Open-source sensor platform
– Scalable and generalist system supporting a wide variety of applications
– Extensible architecture
• Can integrate diverse sensor types

The Platform

Sensor Andrew

Sensor Andrew Overview

[Diagram: Nodes, Gateway, Gateway, Server, End Users]

What is a Node?

A node collects data and sends it to a collector, or gateway

Environment Node Sensors
• Light
• Audio
• Humidity
• Pressure
• Motion
• Temperature
• Acceleration

Power Node Sensors
• Current
• Voltage
• True Power
• Energy

Radiation Node Sensors
• Alpha particle count per minute

Particulate Node Sensors
• Small Part. Count
• Large Part. Count

What is a Gateway?

• A gateway receives UDP data from all nodes registered to it
• An internal service:
– Receives data continuously
– Opens a server on a specified port
– Continually transmits UDP data over this port

Requirements

We need to:

1. Collect data from nodes once per second
2. Scale to 100 gateways each with 64 nodes
3. Detect events in real-time
4. Notify users about events in real-time
5. Retain all data collected for years, at least
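As a back-of-the-envelope check on requirements 1 and 2, the target ingest rate follows directly from the numbers in the list. The constant names below are ours, and the daily and yearly totals assume every node reports every second without gaps.

```python
# Figures taken from the requirements list.
GATEWAYS = 100
NODES_PER_GATEWAY = 64
SAMPLES_PER_NODE_PER_SECOND = 1

# Every node on every gateway contributes one datapoint per second.
datapoints_per_second = GATEWAYS * NODES_PER_GATEWAY * SAMPLES_PER_NODE_PER_SECOND

# Assumes uninterrupted reporting (no dropped samples or offline nodes).
datapoints_per_day = datapoints_per_second * 60 * 60 * 24
datapoints_per_year = datapoints_per_day * 365

print(datapoints_per_second)  # 6400
print(datapoints_per_day)     # 552960000
print(datapoints_per_year)    # 201830400000
```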

What Is Big Data?


“When your data sets become so large that you have to start innovating around how to collect, store, organize, analyze and share it.”

Problems

• Size
• Transmission
• Storage
• Rate
• Retrieval

Collecting Data

Problem:
Data cannot remain on the nodes or gateways due to security concerns. Limited infrastructure.

Constraints:
Store and retrieve immense amounts of data at a high rate.

[Diagram: Gateway, 8 GB / hour, ?, Complex Analytics]

We Tried PostgreSQL…

• Advantages:
– Reliable, tested and scalable
– Relational => complex queries => analytics
• Problems:
– Performance problems reading while writing at a high rate; real-time event detection suffers
– ‘COPY FROM’ doesn’t permit horizontal scaling

Q: How can we decrease I/O load?


A: Read and write collected data directly from memory

Enter Redis

Commonly used as a web application cache or pub/sub server

Redis is an in-memory NoSQL database

Redis

• Created in 2009
• Fully in-memory key-value store
– Fast I/O: R/W operations are equally fast
– Advanced data structures
• Publish/Subscribe functionality
– In addition to data store functions
– Separate from stored key-value data

Persistence

• Snapshotting
– Data is asynchronously transferred from memory to disk
• AOF (Append Only File)
– Each modifying operation is written to a file
– Can recreate data store by replaying operations
– Without interrupting service, will rebuild AOF as the shortest sequence of commands needed to rebuild the current dataset in memory
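The append-only idea can be illustrated with a toy model: log every modifying operation, replay the log to rebuild state, and compact the log into the shortest command sequence that yields the same state. This is a sketch of the concept only; the `replay` and `rewrite` helpers and the tuple-based log format are ours and are not how Redis stores or rewrites its AOF.

```python
def replay(log):
    """Rebuild the key-value state by applying every logged operation in order."""
    state = {}
    for op, key, *args in log:
        if op == "SET":
            state[key] = args[0]
        elif op == "DEL":
            state.pop(key, None)
    return state


def rewrite(log):
    """Compact the log: one SET per surviving key reproduces the same state."""
    return [("SET", key, value) for key, value in replay(log).items()]


# Five logged operations, two of which are superseded and one pair cancels out.
log = [("SET", "a", 1), ("SET", "a", 2), ("SET", "b", 3), ("DEL", "b"), ("SET", "c", 4)]
compact = rewrite(log)

print(len(log), "->", len(compact))  # 5 -> 2
print(replay(compact) == replay(log))  # True
```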

Replication

• Redis supports master-slave replication
• Master-slave replication can be chained
• Be careful:
– Slaves are writeable!
– Potential for data inconsistency
• Fully compatible with Pub/Sub features

Redis Features: Advanced Data Structures

List: [A, B, C, D]

Set: {A, B, C, D}

Sorted Set: {C:1, D:2, A:3, B:4}, stored as {value:score}

Hash: {field1:“A”, field2:“B”…}, stored as {key:value}

Our Data Model

Constraints

Our data store must:

– Hold time-series data

– Be flexible in querying (by time, node, sensor)

– Allow efficient querying of many records

– Accept data out of order

Tradeoffs: Efficiency vs. Flexibility

[Diagram: one record per timestamp (Motion, Audio, Light, Pressure, Humidity, Acceleration, Temperature in a single record) VS one record per sensor data type (Motion, Light, Audio, Pressure, Temperature, Humidity, Acceleration as separate records)]

Our Solution: Sorted Set

Datapoint
Key: sensor:env:101
Score: 1357542004000
Value: {"bat": 192, "temp": 523, "digital_temp": 216, "mac_address": "20f", "humidity": 22, "motion": 203, "pressure": 99007, "node_type": "env", "timestamp": 1357542004000, "audio_p2p": 460, "light": 820, "acc_z": 464, "acc_y": 351, "acc_x": 311}

Sorted Set

1357542004000: {"temp":523,..}
1357542005000: {"temp":523,..}

1357542007000: {"temp":530,..}
1357542008000: {"temp":531,..}
1357542009000: {"temp":540,..}
1357542010000: {"temp":545,..}
…

Sorted Set

1357542004000: {"temp":523,..}
1357542005000: {"temp":523,..}
1357542006000: {"temp":527,..} <- fits nicely
1357542007000: {"temp":530,..}
1357542008000: {"temp":531,..}
1357542009000: {"temp":540,..}
1357542010000: {"temp":545,..}
…
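The "fits nicely" behavior can be mimicked in plain Python: if entries are kept ordered by score (the timestamp), a late-arriving datapoint lands in its correct position without re-sorting anything by hand. This is a toy stand-in built on `bisect`, not Redis itself; the `readings` list and `add_reading` helper are ours.

```python
import bisect
import json

# Toy stand-in for one sorted set: (score, member) tuples in ascending score order.
readings = []


def add_reading(timestamp_ms, payload):
    """Insert a datapoint at the position its timestamp dictates."""
    bisect.insort(readings, (timestamp_ms, json.dumps(payload)))


# Datapoints arrive in order, except that 1357542006000 is missing.
for ts, temp in [(1357542004000, 523), (1357542005000, 523), (1357542007000, 530)]:
    add_reading(ts, {"temp": temp})

# The missing datapoint shows up late and still ends up between its neighbors.
add_reading(1357542006000, {"temp": 527})

timestamps = [score for score, _ in readings]
print(timestamps)
```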

Know your data structure!
A set is still a set…

Datapoint
Score: 1357542004000
Value: {"bat": 192, "temp": 523, "digital_temp": 216, "mac_address": "20f", "humidity": 22, "motion": 203, "pressure": 99007, "node_type": "env", "timestamp": 1357542004000, "audio_p2p": 460, "light": 820, "acc_z": 464, "acc_y": 351, "acc_x": 311}
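One reading of this slide: sorted-set members are unique, so adding a member that is already present only updates its score. Two readings with identical payloads would therefore collapse into one entry, and keeping the timestamp inside the value, as the datapoint above does, avoids that. The toy model below mimics this with a plain dict; the `zadd` helper is ours and only imitates the uniqueness rule, it is not Redis code.

```python
import json


def zadd(sorted_set, member, score):
    """Mimic the uniqueness rule: a member appears once; re-adding it updates its score."""
    sorted_set[member] = score


# Payloads without a timestamp: two identical readings collide.
without_ts = {}
zadd(without_ts, json.dumps({"temp": 523}), 1357542004000)
zadd(without_ts, json.dumps({"temp": 523}), 1357542005000)

# Payloads that carry their timestamp: every member is distinct.
with_ts = {}
zadd(with_ts, json.dumps({"temp": 523, "timestamp": 1357542004000}), 1357542004000)
zadd(with_ts, json.dumps({"temp": 523, "timestamp": 1357542005000}), 1357542005000)

print(len(without_ts))  # 1: the first reading was overwritten
print(len(with_ts))     # 2: both readings survive
```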

Requirement Satisfied

[Diagram: Gateway, Redis]

There is a disturbance in the Force..

Collecting Data

[Diagram: Gateway, Redis]

“In Memory” Means Many Things

• The data store capacity is aggressively capped
– Redis can only store as much data as the server has RAM

Collecting Big Data

[Diagram: Gateway, Redis]

We could throw away data…

• If we only cared about current values• However, our data– Must be stored for 1+ years for compliance– Must be able to be queried for historical/trend

analysis

We Still Need Long-term Data Storage

Solution? Migrate data to an archive with expansive storage capacity

Winning

[Diagram: Gateway, Redis, Archiver, PostgreSQL]

Winning?

[Diagram: Gateway, Redis, Archiver, PostgreSQL, and “Some Poor Client” marked with question marks]

Yes, Winning

[Diagram: Gateway, Redis, Archiver, PostgreSQL, API, “Some Happy Client”]

Best of both worlds

[Diagram: Gateway, Redis, Archiver, PostgreSQL, API]

Redis allows quick access to real-time data, for monitoring and event detection

PostgreSQL allows complex queries and scalable storage for deep and historical analysis
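A minimal sketch of what an archiver pass could look like, assuming the sorted-set model described earlier (score = timestamp in ms). The slides do not show the archiver's code, so `archive_older_than`, the `write_rows` callback, and the cutoff policy are all hypothetical. The two client calls, `zrangebyscore` and `zremrangebyscore`, are real redis-py methods; `FakeSortedSets` is an in-memory stand-in that supports only what this sketch needs, so the example runs without a server.

```python
def archive_older_than(client, key, cutoff_ms, write_rows):
    """Copy datapoints with score < cutoff_ms to the archive, then drop them from Redis."""
    upper = "(%d" % cutoff_ms  # a leading "(" makes the bound exclusive in Redis range syntax
    old = client.zrangebyscore(key, "-inf", upper, withscores=True)
    if old:
        write_rows(key, old)  # hypothetical archive writer, e.g. a PostgreSQL bulk insert
        client.zremrangebyscore(key, "-inf", upper)
    return len(old)


class FakeSortedSets:
    """In-memory stand-in exposing the two calls used above (exclusive upper bound only)."""

    def __init__(self):
        self.data = {}  # key -> {member: score}

    @staticmethod
    def _below(score, upper):
        return score < float(upper[1:])

    def zrangebyscore(self, key, lo, hi, withscores=False):
        items = sorted((s, m) for m, s in self.data.get(key, {}).items() if self._below(s, hi))
        return [(m, s) for s, m in items] if withscores else [m for _, m in items]

    def zremrangebyscore(self, key, lo, hi):
        members = self.data.get(key, {})
        doomed = [m for m, s in members.items() if self._below(s, hi)]
        for m in doomed:
            del members[m]
        return len(doomed)


client = FakeSortedSets()
client.data["sensor:env:101"] = {'{"temp": 523}': 1357542004000, '{"temp": 530}': 1357542007000}
archived = []
moved = archive_older_than(client, "sensor:env:101", 1357542005000,
                           lambda key, rows: archived.extend(rows))
print(moved, archived)
```

Because datapoints may arrive out of order, a real archiver would probably choose a cutoff far enough in the past that late arrivals are no longer expected; otherwise a datapoint inserted between the read and the delete could be removed without being archived.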

We Have the Data, Now What?

Incoming data must be monitored and analyzed, to detect significant events

What is “significant”?

What about new data types?

[Diagram: Gateway, Redis, Archiver, PostgreSQL, API, Django App, App DB]

New guy: provide a way to read the data and create rules

motion > x && pressure < y && audio > z

[Diagram: Gateway, Redis, Archiver, PostgreSQL, API, Django App, App DB, Event Monitors]

New guy: read the rules and data, trigger alarms

motion > x, pressure < y, audio > z: all true?

[Diagram: Gateway, Redis, Archiver, PostgreSQL, API, Django App, App DB, Event Monitors]

Event monitor services can be scaled independently
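The rule shown on the earlier slides (motion > x && pressure < y && audio > z, "all true?") could be evaluated along these lines. This is a sketch under our own assumptions: the rule representation as (field, operator, threshold) triples, the `rule_fires` helper, and the threshold values are hypothetical, since the slides leave x, y, and z unspecified. The sample field values come from the datapoint shown in the data model slides.

```python
import operator

# Comparison operators a rule condition may use.
OPS = {">": operator.gt, "<": operator.lt}


def rule_fires(rule, datapoint):
    """Return True only if every condition in the rule holds for this datapoint."""
    return all(OPS[op](datapoint[field], threshold) for field, op, threshold in rule)


# Hypothetical thresholds standing in for x, y and z.
rule = [("motion", ">", 200), ("pressure", "<", 100000), ("audio_p2p", ">", 400)]

datapoint = {"motion": 203, "pressure": 99007, "audio_p2p": 460}
quiet = dict(datapoint, motion=150)

print(rule_fires(rule, datapoint))  # True: all three conditions hold
print(rule_fires(rule, quiet))      # False: motion is below its threshold
```

Since each evaluation depends only on the rule and the datapoint, several monitor processes could split the nodes or rules between them.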

Getting The Message Out

Considerations

• Event monitor already has a job, avoid re-tasking as a notification engine

• Notifications should be a “push” rather than something clients must poll for, which is more efficient

• Notification system should be generalized, e.g. SMTP, SMS

If only…

[Diagram: Gateway, Redis Data, Redis Pub/Sub, Archiver, PostgreSQL, API, Django App, App DB, Event Monitors, Notification Workers, SMTP]

Pub/Sub with synchronized workers is an optimal solution to real-time event notifications.

No need to add another system, Redis offers pub/sub services as well!
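A minimal sketch of the publish and subscribe sides, assuming a redis-py style client. `publish`, `subscribe`, and `listen` are real redis-py method names, but the channel name, the `publish_event` and `run_worker` helpers, and the event fields are hypothetical. `FakeBroker` is a single-threaded stand-in so the example runs without a server; unlike real Redis pub/sub, which delivers only to clients that are already subscribed, it buffers messages until they are read.

```python
import json

CHANNEL = "events"  # hypothetical channel name


def publish_event(client, event):
    """Event monitor side: push the event and return immediately."""
    return client.publish(CHANNEL, json.dumps(event))


def run_worker(pubsub, send):
    """Notification worker side: wait on the subscription and forward each event."""
    pubsub.subscribe(CHANNEL)
    for message in pubsub.listen():
        if message["type"] == "message":  # skip subscribe confirmations
            send(json.loads(message["data"]))


class FakeBroker:
    """Stand-in that plays both the client and the pubsub object for this demo."""

    def __init__(self):
        self.queue = []

    def publish(self, channel, data):
        self.queue.append({"type": "message", "channel": channel, "data": data})
        return 1

    def subscribe(self, channel):
        self.queue.insert(0, {"type": "subscribe", "channel": channel, "data": 1})

    def listen(self):
        while self.queue:
            yield self.queue.pop(0)


broker = FakeBroker()
publish_event(broker, {"node": "f85", "rule": "motion"})

sent = []
run_worker(broker, sent.append)  # a real worker would send mail or SMS here
print(sent)
```

With a live server, the worker would be started as something like `run_worker(r.pubsub(), send_email)`, where `send_email` is whatever delivery function the deployment provides; in that case `listen()` blocks indefinitely instead of returning.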

Conclusions

• Redis is a powerful tool for collecting large amounts of data in real-time

• In addition to maintaining a rapid pace of data insertion, we were able to concurrently query, monitor, and detect events on our Redis data collection system

• Bonus: Redis also enabled a robust, scalable real-time notification system using pub/sub

Things to watch

• Data persistence
– If Redis needs to restart, it takes 10-20 seconds per gigabyte to re-load all data into memory [1]
– Redis is unresponsive during startup

[1] http://oldblog.antirez.com/post/redis-persistence-demystified.html
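To get a feel for what 10-20 seconds per gigabyte means in practice, the estimate can be scaled to a dataset size. The helper name and the 64 GB example are ours; the slides do not state how much data was held in Redis at any one time.

```python
def reload_time_range_seconds(dataset_gb, low=10, high=20):
    """Estimated restart window, using the 10-20 seconds per gigabyte figure above."""
    return dataset_gb * low, dataset_gb * high


# Hypothetical 64 GB in-memory dataset.
fastest, slowest = reload_time_range_seconds(64)
print(fastest, slowest)            # 640 1280 (seconds)
print(fastest / 60, slowest / 60)  # roughly 10.7 to 21.3 minutes of downtime
```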

Future Work

• Improve scalability through:
– Data encoding
– Data compression
– Parallel batch inserts for all nodes on a gateway

• Deep historical data analytics

Acknowledgements

• Project engineers Chris Taschner and Jeff Hamed @ CMU SEI

• Prof. Anthony Rowe & CMU ECE WiSE Lab http://wise.ece.cmu.edu/

• Our organizations
CMU https://www.cmu.edu
CERT http://www.cert.org
SEI http://www.sei.cmu.edu
Cylab https://www.cylab.cmu.edu

Thank You


Questions?

Slides of Live Redis Demo

A Closer Look at Redis Data

redis> keys *

1) "sensor:environment:f80"
2) "sensor:environment:f81"
3) "sensor:environment:f82"
4) "sensor:environment:f83"
5) "sensor:environment:f84"
6) "sensor:power:f85"
7) "sensor:power:f86"
8) "sensor:radiation:f87"
9) "sensor:particulate:f88"

A Closer Look at Redis Data

redis> keys sensor:power:*

1) "sensor:power:f85"
2) "sensor:power:f86"

A Closer Look at Redis Data

redis> zcount sensor:power:f85 -inf +inf

(integer) 3565958
(45.38s)

A Closer Look at Redis Data

redis> zcount sensor:power:f85 1359728113000 +inf

(integer) 47

A Closer Look at Redis Data

redis> zrange sensor:power:f85 -1000 -1

1) "{\"long_energy1\": 73692453, \"total_secs\": 6784, \"energy\": [49, 175, 62, 0, 0, 0], \"c2_center\": 485, \"socket_state\": 1, \"node_type\": \"power\", \"c_p2p_low2\": 437, \"socket_state1\": 0, \"mac_address\": \"103\", \"c_p2p_low\": 494, \"rms_current\": 6, \"true_power\": 1158, \"timestamp\": 1359728143000, \"v_p2p_low\": 170, \"c_p2p_high\": 511, \"rms_current1\": 113, \"freq\": 60, \"long_energy\": 4108081, \"v_center\": 530, \"c_p2p_high2\": 719, \"energy1\": [37, 117, 100, 4, 0, 0], \"v_p2p_high\": 883, \"c_center\": 509, \"rms_voltage\": 255, \"true_power1\": 23235}"

2)…

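Redis Python API

import redis

# Connect to a local Redis server through a shared connection pool
pool = redis.ConnectionPool(host="127.0.0.1", port=6379, db=0)
r = redis.Redis(connection_pool=pool)

# Last 50 datapoints, selected by index (rank)
byindex = r.zrange("sensor:env:f85", -50, -1)
# ['{"acc_z":663,"bat":0,"gpio_state":1,"temp":663,"light":…

# Datapoints whose score (timestamp in ms) falls inside a one-second window
byscore = r.zrangebyscore("sensor:env:f85", 1361423071000, 1361423072000)
# ['{"acc_z":734,"bat":0,"gpio_state":1,"temp":734,"light":…

# Total number of datapoints stored for this node
size = r.zcount("sensor:env:f85", "-inf", "+inf")
# 237327L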
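The calls above only read data. For completeness, here is a hedged sketch of what the write side might look like with the same sorted-set model. `zadd` with a {member: score} mapping is the redis-py 3.0+ signature (older releases, as used when these slides were written, took arguments differently). The `store_datapoint` helper and the key pattern are our assumptions, modeled on the keys shown in the demo, and `RecordingClient` is a stand-in so the example runs without a server.

```python
import json


def store_datapoint(client, datapoint):
    """Add one reading to its node's sorted set, scored by timestamp (ms)."""
    # Assumed key pattern, modeled on keys such as "sensor:power:f85".
    key = "sensor:%s:%s" % (datapoint["node_type"], datapoint["mac_address"])
    member = json.dumps(datapoint, sort_keys=True)
    # redis-py 3.0+ takes a {member: score} mapping.
    client.zadd(key, {member: datapoint["timestamp"]})
    return key


class RecordingClient:
    """Stand-in that records zadd calls, so the sketch runs without a server."""

    def __init__(self):
        self.sets = {}  # key -> {member: score}

    def zadd(self, name, mapping):
        self.sets.setdefault(name, {}).update(mapping)
        return len(mapping)


# With a live server this would be: redis.Redis(host="127.0.0.1", port=6379, db=0)
client = RecordingClient()
key = store_datapoint(client, {"node_type": "power", "mac_address": "f85",
                               "timestamp": 1359728143000, "true_power": 1158})
print(key)
```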