Monitoring Distributed Data Streams

06/18/22 1 Monitoring Distributed Data Streams Assaf Schuster


Transcript of Monitoring Distributed Data Streams

Page 1: Monitoring Distributed Data Streams

04/20/23 1

Monitoring Distributed Data Streams

Assaf Schuster

Page 2: Monitoring Distributed Data Streams

Distributed Stream Networks

Large-scale, widespread networked systems
Continuous production of data
High volume
Dynamic nature
Required to detect a global property
Often in (near) real time

Page 3: Monitoring Distributed Data Streams

Web Page Frequency Counts

Mirrored web site
Mirrors record the frequency of requests for pages
Detect when the global frequency of requests for a page exceeds a predetermined threshold

[Figure: request counts for Req #1, Req #2, and Req #3 at the mirrors in America, Europe, Africa, and Asia; scale marks 0, 25, 50, 100]
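The frequency-count task is a linear one: the global count is just the sum of the mirrors' counts. A minimal sketch of that observation follows. The function names, the equal split of the threshold among mirrors, and the sample counts are assumptions made for illustration, not details from the talk. If every mirror stays within its share of the threshold, the global sum cannot exceed the threshold, so no communication is needed.

```python
from typing import Sequence


def global_count_exceeds(local_counts: Sequence[int], threshold: int) -> bool:
    """Centralized check: collect every mirror's count and compare the sum."""
    return sum(local_counts) > threshold


def local_slack_ok(local_count: int, threshold: int, n_mirrors: int) -> bool:
    """Local check: this mirror is within an equal share of the threshold.

    If this holds at every mirror, the sum of counts is at most the
    threshold, so the global condition cannot have been triggered.
    """
    return local_count * n_mirrors <= threshold


# Hypothetical counts for one page at four mirrors.
counts = [30, 20, 10, 5]
threshold = 200

quiet = all(local_slack_ok(c, threshold, len(counts)) for c in counts)
print(quiet)                                    # every mirror is within its share
print(global_count_exceeds(counts, threshold))  # so the global check is False
```

When one mirror breaks its local share, the sum may still be below the threshold, so a violation only means the counts have to be collected and checked.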

Page 4: Monitoring Distributed Data Streams

Air Quality Monitoring

Sensors monitoring the concentration of air pollutants.
Each sensor holds a data vector comprising measured concentration of various pollutants (CO₂, SO₂, O₃, etc.).
A function on the average readings determines the Air Quality Index (AQI)
Issue an alert in case the AQI exceeds a given threshold.

Page 5: Monitoring Distributed Data Streams

Sensor Networks

Sensors monitoring the temperature in a server room (machine room, conference room, etc.)
Ensure uniform temp.: monitor variance of readings
Alert in case variance exceeds a threshold

Temperature readings by $n$ sensors: $x_1, \dots, x_n$

Each sensor holds a data vector $v_i = (x_i^2, x_i)^T$

The average data vector is $v = \left( \frac{1}{n}\sum_{i=1}^{n} x_i^2,\ \frac{1}{n}\sum_{i=1}^{n} x_i \right)^T$

$\mathrm{Var}(\text{all sensors}) = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \left( \frac{1}{n}\sum_{i=1}^{n} x_i \right)^2$
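A short sketch of how the variance can be read off the average data vector: each sensor contributes the pair (x², x), and the variance is the first component of the average minus the square of the second. The function names and the temperature readings are assumptions for illustration. The result is checked against the standard library's population variance.

```python
from statistics import pvariance
from typing import List, Sequence, Tuple

Vector = Tuple[float, float]


def local_vector(x: float) -> Vector:
    """Data vector held by one sensor: (x squared, x)."""
    return (x * x, x)


def average_vector(vectors: Sequence[Vector]) -> Vector:
    """Component-wise average of the sensors' data vectors."""
    n = len(vectors)
    return (sum(v[0] for v in vectors) / n, sum(v[1] for v in vectors) / n)


def variance_from_average(v: Vector) -> float:
    """Variance as a non-linear function of the average vector."""
    return v[0] - v[1] ** 2


# Hypothetical temperature readings from four sensors.
readings: List[float] = [20.0, 22.0, 21.0, 25.0]
avg = average_vector([local_vector(x) for x in readings])

print(variance_from_average(avg))  # 3.5
print(pvariance(readings))         # same value from the standard library
```

The monitored quantity is therefore a function of an average of local vectors, which is the setting the rest of the talk addresses.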

Page 6: Monitoring Distributed Data Streams

Search Engine

Distributed datacenter/warehouse
Tens of thousands of horizontal partitions
“Our logs are larger than any other data by orders of magnitude. They are our source of truth.” Sridhar Ramaswamy, SIGMOD’08 keynote on “Extreme Data Mining”

Mining the logs: compute pairs of keywords for which the correlation index is high
Thousands of simultaneous tasks
“Network bandwidth is a relatively scarce resource in our computing environment.” Dean and Ghemawat, MapReduce paper, OSDI’04

Page 7: Monitoring Distributed Data Streams

Cloud Health MonitoringCloud Health Monitoring

Amazon Web Services » Service Health Dashboard
Amazon S3 Availability Event: July 20, 2008

“At 8:40am PDT, error rates in all Amazon S3 datacenters began to quickly climb and our alarms went off. By 8:50am PDT, error rates were significantly elevated and very few requests were completing successfully. By 8:55am PDT, we had multiple engineers engaged and investigating the issue. Our alarms pointed at problems processing customer requests in multiple places within the system and across multiple data centers. While we began investigating several possible causes, we tried to restore system health... At 9:41am PDT, we determined that servers within Amazon S3 were having problems… By 11:05am PDT, all server-to-server communication was stopped, request processing components shut down, and the system's state cleared….”

Page 8: Monitoring Distributed Data Streams

Ad-Hoc Mobile P2P Networks

Peer-to-peer network invites drivers to get connected
CarTorrent could smarten up our daily commute, reducing accidents and bringing multimedia journey data to our fingertips

Laura Parker, The Guardian, Thursday January 17 2008

“The name BitTorrent has become part of most people's day-to-day vernacular, synonymous with downloading every kind of content via the internet's peer-to-peer networks. But if a team of US researchers have their way, we may all be talking about CarTorrent in the not too distant future…..

Researchers from the University of California Los Angeles are working on a wireless communication network that will allow cars to talk to each other, simultaneously downloading information in the shape of road safety warnings, entertainment content and navigational tools….”

Page 9: Monitoring Distributed Data Streams


Page 10: Monitoring Distributed Data Streams

Distributed Monitoring – State of the Art

Periodically send all data to a central location
High communication
High latency
A tradeoff
Expensive central resources
Power inefficient

Can we do better?
Linear systems
Non-linear systems

Page 11: Monitoring Distributed Data Streams

Monitoring Distributed Non-Linear Functions

[Figure: a non-linear function f(x) plotted against a threshold T, relating f(x) and f(y) at two local values x and y to f((x + y)/2) at their average]
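A small numeric sketch of why the local values of a non-linear function say little about its value at the average. The two quadratic functions, the threshold T = 1, and the sample points are illustrative assumptions, not values taken from the slide.

```python
def convex(x: float) -> float:
    """An upward parabola (illustrative choice)."""
    return x * x


def concave(x: float) -> float:
    """A downward parabola (illustrative choice)."""
    return 4.0 - x * x


T = 1.0             # hypothetical threshold
x, y = -2.0, 2.0    # hypothetical local values at two nodes
mid = (x + y) / 2   # the global (average) input

# Both local values are above T, yet the value at the average is below T.
print(convex(x) > T, convex(y) > T, convex(mid) > T)     # True True False

# Both local values are below T, yet the value at the average is above T.
print(concave(x) > T, concave(y) > T, concave(mid) > T)  # False False True
```

Either direction of error is possible, so comparing each local value with the threshold is not a sound test for the global condition.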

Page 12: Monitoring Distributed Data Streams

Mutual Information

Given a 2×2 table $P = (p_{ij})$, the mutual information is defined as

$$MI(P) = p_{11}\log\frac{p_{11}}{(p_{11}+p_{12})(p_{11}+p_{21})} + p_{12}\log\frac{p_{12}}{(p_{11}+p_{12})(p_{12}+p_{22})} + p_{21}\log\frac{p_{21}}{(p_{21}+p_{22})(p_{11}+p_{21})} + p_{22}\log\frac{p_{22}}{(p_{21}+p_{22})(p_{12}+p_{22})}$$

$$P_1 = \begin{pmatrix} 0.9 & 0.03 \\ 0.02 & 0.05 \end{pmatrix}, \quad P_2 = \begin{pmatrix} 0.04 & 0.02 \\ 0.03 & 0.91 \end{pmatrix}$$

$$MI(P_1) = 0.104, \quad MI(P_2) = 0.082, \quad MI\left(\frac{P_1 + P_2}{2}\right) = 0.494$$

The mutual information of the global table is much larger than the local values. As in the parabola case, the global MI cannot be inferred from the local ones.
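The slide's numbers can be checked with a few lines of code. This is a sketch under stated assumptions: the names `mutual_information`, `table_1`, and `table_2` are hypothetical, the tables are read row by row from the slide, and the natural logarithm is assumed because it gives values close to the ones quoted.

```python
import math
from typing import Sequence

Table = Sequence[Sequence[float]]


def mutual_information(table: Table) -> float:
    """Mutual information of a 2x2 table of joint probabilities (natural log)."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = 0.0
    for i, row in enumerate(table):
        for j, p in enumerate(row):
            if p > 0:  # a zero cell contributes nothing
                total += p * math.log(p / (row_sums[i] * col_sums[j]))
    return total


# The two local tables shown on the slide.
table_1 = [[0.9, 0.03], [0.02, 0.05]]
table_2 = [[0.04, 0.02], [0.03, 0.91]]

# The global table is the cell-wise average of the local ones.
average = [[(a + b) / 2 for a, b in zip(r1, r2)]
           for r1, r2 in zip(table_1, table_2)]

print(round(mutual_information(table_1), 3))
print(round(mutual_information(table_2), 3))
print(round(mutual_information(average), 3))
```

The value for the averaged table comes out several times larger than either local value, which is the point the slide makes.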

Page 13: Monitoring Distributed Data Streams

Non-Linear Functions

“…The link function is, of course, nonlinear. So we agonize over trading off optimization performance with ability to use the massive infrastructure.…”

Sridhar Ramaswamy. SIGMOD’08 keynote talk on “Extreme Data Mining”
Slide title: “10 top reasons why googlers do not sleep at night”
(Coffee is reason #5)

Page 14: Monitoring Distributed Data Streams

Geometric Method – Idea

The behavior of a general function over distributed data may be hard to see
Local indications may be misleading
Non-linear
Looking at the *domain* of the function may be easier
For long periods, the local inputs are stationary, or do not change much
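One way to picture the domain viewpoint in one dimension, offered as a hedged sketch rather than as the geometric method itself: the average of the local inputs always lies between their minimum and maximum, so if that whole interval sits inside the region where f stays below the threshold, the global value must be below it too. The function f(x) = x², the helper name `average_is_safe`, and the sample values are assumptions chosen for illustration.

```python
import math
from typing import Sequence


def average_is_safe(local_values: Sequence[float], threshold: float) -> bool:
    """Sufficient check that f(x) = x**2 is below threshold at the average.

    For this f and a positive threshold, the region where f(x) < threshold
    is the open interval (-sqrt(threshold), sqrt(threshold)). The average of
    the local values lies in [min, max], so it is enough that this interval
    fits inside that region.
    """
    bound = math.sqrt(threshold)
    return -bound < min(local_values) and max(local_values) < bound


# All inputs sit inside the safe region: the average is guaranteed safe.
print(average_is_safe([0.2, -0.5, 0.7], threshold=1.0))  # True

# The check is conservative: here the average is 0 and f(0) < 1,
# but the inputs leave the safe region, so no guarantee is given.
print(average_is_safe([-2.0, 2.0], threshold=1.0))       # False
```

The check looks only at where the inputs are, not at the local function values, and it keeps holding for as long as the inputs do not drift out of the safe region.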