Integrating Distributed Data Streams

Alasdair J G Gray. Supervisors: M Howard Williams, Werner Nutt. 7th June 2007


Page 1: Integrating Distributed Data Streams

Integrating Distributed Data Streams

Alasdair J G Gray

Supervisors: M Howard Williams, Werner Nutt

7th June 2007

Page 2: Integrating Distributed Data Streams

Overview

- The problem
- Limits of current technology
- Proposed system
  - Architecture
  - Query Answering
  - Performance
- Conclusions

Page 3: Integrating Distributed Data Streams

Streams of Data

Main sources: sensors

Characteristics:
- Unbounded
- Append only
- Frequency

Managed by:
- Sensor networks
- Network/Grid monitoring
- Ubiquitous/Pervasive computing environments

Reading

Page 4: Integrating Distributed Data Streams

Streams are everywhere

[Figure label: Internet]

Page 5: Integrating Distributed Data Streams

Grid Monitoring

[Figure labels: Grid, Job progress, Bookkeeping, Monitoring data]

Resources supplied by various institutions

Resources publish status information

Scheduler must allocate jobs to resources

Bookkeeping tracks resource usage

Users track job progress

Page 6: Integrating Distributed Data Streams

Requirements

Ability to:
- Publish distributed streams of data
- Query multiple streams with no knowledge of
  - Existence of source streams
  - Location of individual streams
  - Access methods to individual streams
- Scale to large numbers of users and sources

Page 7: Integrating Distributed Data Streams

Data Integration System

- Several distributed data sources
- Users send query to Mediator
- Mediator
  - Translates user query into sub-queries
  - Combines results of sub-queries
- Only for stored data sources

[Figure: a Mediator placed over the sources DB1, DB2 and DB3]

Page 8: Integrating Distributed Data Streams

Stream Management System

Data streams into the server

Server applies long-standing queries to the streams

Answers streamed out

Users need to know which streams exist

Centralised server

Page 9: Integrating Distributed Data Streams

Solution

Need a system that combines:
- Ability to access multiple sources without specific source knowledge (data integration)
- Ability to process streams of data (stream processing)

A Stream Integration System!

Page 10: Integrating Distributed Data Streams

Stream Integration System

- Producers publish streams of data
- Consumers query for streams of data
- Registry matches consumer requests with publications

[Figure: three Producers and two Consumers, connected through a Registry]

Page 11: Integrating Distributed Data Streams

Publishing Monitoring Data

Stream data can be represented in terms of relations with
- Keys: "what" and "where"
- Measurements: the "value"
- Timestamps: "when"

For example, Network ThroughPut:

NTP (from, to, tool, psize, latency, timestamp)

One reading is a tuple in the relation:

('hw', 'ral', 'ping', 32, 11.1, 2005-06-24-15:05:34)
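The split of an NTP reading into keys, measurement and timestamp can be sketched in Python as below. This is only an illustration: the slide does not say which attributes are keys, so treating `from`, `to`, `tool` and `psize` as the key and `latency` as the measurement is an assumption, and the names `NTP`, `KEY_ATTRIBUTES` and `key_of` are hypothetical.

```python
from datetime import datetime
from typing import NamedTuple


class NTP(NamedTuple):
    # Key attributes (assumed): the "what" and "where" of a reading.
    # 'from' is a Python keyword, so the field is called from_ here.
    from_: str
    to: str
    tool: str
    psize: int
    # Measurement attribute (assumed): the "value".
    latency: float
    # Timestamp attribute: the "when".
    timestamp: datetime


KEY_ATTRIBUTES = ("from_", "to", "tool", "psize")


def key_of(reading: NTP) -> tuple:
    """Return the key values of a reading, i.e. the channel it belongs to."""
    return tuple(getattr(reading, name) for name in KEY_ATTRIBUTES)


# The example reading from the slide.
reading = NTP(
    "hw", "ral", "ping", 32, 11.1,
    datetime.strptime("2005-06-24-15:05:34", "%Y-%m-%d-%H:%M:%S"),
)
```

Two readings with the same `key_of` value belong to the same channel and differ only in measurement and timestamp.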

Page 12: Integrating Distributed Data Streams

Consuming Monitoring Data

Users are interested in how the grid changes over time. For example,

1. Latency for large packets sent from hw
2. Links with a low latency as recorded by the PingER tool

These can be expressed as SQL selection queries

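$q_1 : \sigma_{\mathit{from}=\text{'hw'} \,\wedge\, \mathit{psize} \geq 1024}(\mathit{NTP})$

$q_2 : \sigma_{\mathit{tool}=\text{'ping'} \,\wedge\, \mathit{latency} \leq 10.0}(\mathit{NTP})$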

Page 13: Integrating Distributed Data Streams

What is an Answer to a Query?

- Global relations contain no tuples (virtual relation)
- Need to translate into query over sources
- An answer stream should be
  - Sound
  - Complete
  - Duplicate free
  - Weakly ordered: all tuples that share the same key value will be in timestamp order
- Order in general is difficult in a distributed setting
- Weak order sufficient for more complex queries such as aggregates
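Two of these properties, duplicate freeness and weak order, can be checked on a finite prefix of an answer stream. The sketch below assumes each tuple is reduced to a `(key, timestamp)` pair; the function names are hypothetical. Soundness and completeness depend on the sources and are not covered.

```python
from typing import Dict, Hashable, Iterable, Tuple

StreamTuple = Tuple[Hashable, int]  # (key value, timestamp)


def is_duplicate_free(prefix: Iterable[StreamTuple]) -> bool:
    """True if no tuple occurs twice in the prefix."""
    seen = set()
    for item in prefix:
        if item in seen:
            return False
        seen.add(item)
    return True


def is_weakly_ordered(prefix: Iterable[StreamTuple]) -> bool:
    """True if tuples sharing a key value appear in timestamp order."""
    latest: Dict[Hashable, int] = {}
    for key, timestamp in prefix:
        if key in latest and timestamp < latest[key]:
            return False
        latest[key] = timestamp
    return True


# Weakly ordered, although not globally ordered: channel "b" lags behind "a".
interleaved = [("a", 5), ("b", 1), ("a", 6), ("b", 2)]
# Not weakly ordered: channel "a" goes back in time.
reordered = [("a", 8), ("a", 3)]
```

The `interleaved` example shows why weak order is cheaper than total order: only tuples of the same channel need to be compared.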

Page 14: Integrating Distributed Data Streams

Query Planning: Consumer Query

Satisfiability used to find relevant producers

Producers:
S1: from='hw' ∧ tool='udp'
S2: from='hw' ∧ tool='ping'
S3: from='ral' ∧ tool='ping'
S4: from='ral' ∧ tool='udp'
S5: from='an' ∧ tool='ping'

Consumer queries:
q1: from='hw' ∧ psize≥1024
q2: tool='ping' ∧ latency≤10.0
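A minimal sketch of the satisfiability test, assuming every condition is a conjunction of equalities and non-strict bounds, so that each attribute is constrained by a closed interval. The representation and the helper names (`eq`, `ge`, `le`, `satisfiable`, `relevant_producers`) are hypothetical and not taken from the system itself.

```python
from typing import Dict, Optional, Tuple

# Closed interval [low, high]; None means unbounded on that side.
Interval = Tuple[Optional[object], Optional[object]]
Condition = Dict[str, Interval]


def eq(value) -> Interval:
    return (value, value)


def ge(value) -> Interval:
    return (value, None)


def le(value) -> Interval:
    return (None, value)


def intersect(a: Interval, b: Interval) -> Optional[Interval]:
    """Intersect two closed intervals; return None if the result is empty."""
    lows = [x for x in (a[0], b[0]) if x is not None]
    highs = [x for x in (a[1], b[1]) if x is not None]
    low = max(lows) if lows else None
    high = min(highs) if highs else None
    if low is not None and high is not None and low > high:
        return None
    return (low, high)


def satisfiable(c: Condition, d: Condition) -> bool:
    """True if the conjunction of c and d can hold for some tuple."""
    return all(intersect(c[a], d[a]) is not None for a in c.keys() & d.keys())


producers = {
    "S1": {"from": eq("hw"), "tool": eq("udp")},
    "S2": {"from": eq("hw"), "tool": eq("ping")},
    "S3": {"from": eq("ral"), "tool": eq("ping")},
    "S4": {"from": eq("ral"), "tool": eq("udp")},
    "S5": {"from": eq("an"), "tool": eq("ping")},
}
q1 = {"from": eq("hw"), "psize": ge(1024)}
q2 = {"tool": eq("ping"), "latency": le(10.0)}


def relevant_producers(query: Condition) -> list:
    return sorted(name for name, cond in producers.items()
                  if satisfiable(query, cond))
```

Under these assumptions `relevant_producers(q1)` gives S1 and S2, and `relevant_producers(q2)` gives S2, S3 and S5.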

Page 15: Integrating Distributed Data Streams

Scalability is an Issue

Problem: Every consumer contacting every producer of interest does not scale

Even a small Grid of fewer than a dozen sites has problems

Grids may contain thousands of resources

For example, Large Hadron Collider Computing Grid (LCG)

Page 16: Integrating Distributed Data Streams

Republishers Allow the System to Scale

A republisher
- Consumes answers to a selection query
- Merges "trickles" into streams
- Publishes
  - Answer stream
  - Latest-state answer
  - History

Problem: Choice in where to obtain information

[Figure: Producer S1 and Producer S2 feed a Republisher]
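The three kinds of output a republisher offers can be sketched with a small in-memory class. Everything here is an assumption made for illustration: the class and method names, the callback-based answer stream, and the choice of `from`, `to`, `tool` and `psize` as key attributes.

```python
from typing import Callable, Dict, List, Tuple

Reading = Dict[str, object]
KEY_ATTRIBUTES = ("from", "to", "tool", "psize")  # assumed key of NTP


class Republisher:
    def __init__(self, condition: Callable[[Reading], bool]) -> None:
        self.condition = condition                  # the selection query
        self.history: List[Reading] = []            # every tuple accepted so far
        self.latest: Dict[Tuple, Reading] = {}      # latest-state: newest tuple per key
        self.subscribers: List[Callable[[Reading], None]] = []

    def subscribe(self, callback: Callable[[Reading], None]) -> None:
        """Register a consumer of the merged answer stream."""
        self.subscribers.append(callback)

    def consume(self, reading: Reading) -> None:
        """Accept one tuple from any producer and republish it if it matches."""
        if not self.condition(reading):
            return
        self.history.append(reading)
        key = tuple(reading[a] for a in KEY_ATTRIBUTES)
        current = self.latest.get(key)
        if current is None or reading["timestamp"] >= current["timestamp"]:
            self.latest[key] = reading
        for callback in self.subscribers:
            callback(reading)


def make(frm: str, tool: str, latency: float, timestamp: int) -> Reading:
    return {"from": frm, "to": "ral", "tool": tool, "psize": 32,
            "latency": latency, "timestamp": timestamp}


r1 = Republisher(lambda r: r["from"] == "hw")
received: List[Reading] = []
r1.subscribe(received.append)
for reading in (make("hw", "ping", 11.1, 1), make("an", "ping", 3.0, 2),
                make("hw", "ping", 9.8, 3)):
    r1.consume(reading)
```

After the loop, `r1.history` holds the two readings from hw, `r1.latest` holds only the newer of them, and the subscriber has received both.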

Page 17: Integrating Distributed Data Streams

Query Planning in the Presence of Republishers

- Find all relevant publishers
- Rank according to data provided
- Meta query plan contains choice
- Query plan uses one of R1 or R3

Producers:
S1: from='hw' ∧ tool='udp'
S2: from='hw' ∧ tool='ping'
S3: from='ral' ∧ tool='ping'
S4: from='ral' ∧ tool='udp'
S5: from='an' ∧ tool='ping'

Republishers:
R1: from='hw'
R2: from='ral'
R3: from='hw' ∧ tool='ping'

Consumer queries:
q1: from='hw' ∧ psize≥1024
q2: tool='ping' ∧ latency≤10.0

Page 18: Integrating Distributed Data Streams

Weak Order is not Guaranteed

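- Tuples for the same channel
- (3) published before (8)
- Arrive at consumer in wrong order

[Figure: the stream of producer S2: from='hw' ∧ tool='ping' is split between two publishers with conditions latency≤5.0 and latency>5.0, and both feed consumer q2: tool='ping' ∧ latency≤10.0. One path contains a slow link. The tuples leave S2 in the order (3) (8) and reach the consumer in the order (8) (3).]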
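The reordering can be reproduced with a toy simulation. The slide gives only the timestamps 3 and 8, so the latency values, the path delays and the choice of which path is slow are all assumptions, and `arrival_order` is a hypothetical helper.

```python
from typing import Dict, List, Tuple


def route(latency: float) -> str:
    """Pick the path a tuple takes, based on its measurement value."""
    return "low" if latency <= 5.0 else "high"


def arrival_order(tuples: List[Tuple[int, float]],
                  delay: Dict[str, int]) -> List[int]:
    """Return publication timestamps in the order the consumer receives them.

    tuples holds (publication timestamp, latency value) pairs of one channel;
    delay maps each path name to its transfer time.
    """
    arrivals = [(ts + delay[route(latency)], ts) for ts, latency in tuples]
    return [ts for _, ts in sorted(arrivals)]


# Same channel, published at times 3 and 8 (latency values are made up).
channel = [(3, 7.2), (8, 4.1)]

with_slow_link = arrival_order(channel, {"low": 1, "high": 10})
with_equal_links = arrival_order(channel, {"low": 1, "high": 1})
```

With the slow link on the `high` path the consumer sees 8 before 3. With equal delays the order is preserved, so the problem arises because one channel was split over two paths.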

Page 19: Integrating Distributed Data Streams

Generating Well Formed Query Plans

A publisher is relevant for a global query if

1. Conditions are satisfiable, and
2. All measurements that agree on their key values come from the same publisher

The measurement condition can be checked using entailment.

Previous example was well formed.
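One possible reading of the measurement condition is sketched below: a publisher passes if the query's restriction on each measurement attribute entails the publisher's restriction, so the publisher never withholds a measurement the query asks for. This interpretation, the closed-interval representation and all names are assumptions. The satisfiability test of condition 1 is not repeated here.

```python
from typing import Dict, Optional, Tuple

# Closed interval [low, high]; None means unbounded on that side.
Interval = Tuple[Optional[float], Optional[float]]
Condition = Dict[str, Interval]

UNBOUNDED: Interval = (None, None)
MEASUREMENT_ATTRIBUTES = ("latency",)  # assumed measurement attribute of NTP


def interval_entails(inner: Interval, outer: Interval) -> bool:
    """True if every value allowed by inner is also allowed by outer."""
    in_low, in_high = inner
    out_low, out_high = outer
    low_ok = out_low is None or (in_low is not None and in_low >= out_low)
    high_ok = out_high is None or (in_high is not None and in_high <= out_high)
    return low_ok and high_ok


def measurement_condition_holds(query: Condition, publisher: Condition) -> bool:
    """Check that the publisher keeps all measurements the query wants."""
    return all(
        interval_entails(query.get(a, UNBOUNDED), publisher.get(a, UNBOUNDED))
        for a in MEASUREMENT_ATTRIBUTES
    )


q2 = {"latency": (None, 10.0)}         # latency <= 10.0
unrestricted = {}                      # e.g. S2, no measurement restriction
low_only = {"latency": (None, 5.0)}    # latency <= 5.0
generous = {"latency": (None, 20.0)}   # latency <= 20.0
```

Under this reading the publisher restricted to latency≤5.0 is rejected for q2, because it would supply only part of each channel.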

Page 20: Integrating Distributed Data Streams

Query Re-Planning

- Queries are long-lived
- Set of publishers can change
- Query plans should reflect changes

Page 21: Integrating Distributed Data Streams

How does a new Republisher affect our Consumers?

- Find consumers for which R4 is relevant
- Compare R4 to publishers in Meta Query Plan

Producers:
S1: from='hw' ∧ tool='udp'
S2: from='hw' ∧ tool='ping'
S3: from='ral' ∧ tool='ping'
S4: from='ral' ∧ tool='udp'
S5: from='an' ∧ tool='ping'

Republishers:
R1: from='hw'
R2: from='ral'
R3: from='hw' ∧ tool='ping'
R4: TRUE

Consumer queries:
q1: from='hw' ∧ psize≥1024
q2: tool='ping' ∧ latency≤10.0

Page 22: Integrating Distributed Data Streams

Planning a Republisher Query

Applying Consumer planning techniques results in a problem

Producers:
S1: from='hw' ∧ tool='udp'
S2: from='hw' ∧ tool='ping'
S3: from='ral' ∧ tool='ping'
S4: from='ral' ∧ tool='udp'
S5: from='an' ∧ tool='ping'

Republishers:
R1: from='hw'
R2: from='ral'
R3: from='hw' ∧ tool='ping'
R4: TRUE

Problem:
- Hierarchy contains cycles
- Republishers disconnected from Producers

Page 23: Integrating Distributed Data Streams

Desirable Properties for a Hierarchy

- Correctness: streams answer queries
- Cycle freeness: loops can lead to duplicates
- Uniqueness: hierarchy defined for a set of publishers
- Local planning: Publishers and Consumers only need to communicate with the Registry

Page 24: Integrating Distributed Data Streams

Generating Well Formed Hierarchies

Need a stricter relevance criterion

R1 can consume from R2 iff
1. Everything R2 offers is relevant to R1, and
2. R1 offers something R2 does not.

Can be checked by entailment

Ensures
- No loops in the hierarchy
- Republishers connected to the Producers

Page 25: Integrating Distributed Data Streams

Re-Planning, Re-Visited!

- Stricter relevance criterion
- Republishers only consume from publishers below them
- R4 is not relevant for R1

Producers:
S1: from='hw' ∧ tool='udp'
S2: from='hw' ∧ tool='ping'
S3: from='ral' ∧ tool='ping'
S4: from='ral' ∧ tool='udp'
S5: from='an' ∧ tool='ping'

Republishers:
R1: from='hw'
R2: from='ral'
R3: from='hw' ∧ tool='ping'
R4: TRUE

Page 26: Integrating Distributed Data Streams

Republishers' Effect on Latency

- Tuple published by producer
- Tuple passes through some number of republishers
- Tuple arrives at consumer

Republishers add to the time taken!

Page 27: Integrating Distributed Data Streams

Performance Measure

Number of Republishers    Average delivery time (ms)
0                         26
1                         49
2                         67
3                         82
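The cost of each additional republisher can be read off the table with a little arithmetic. The sketch below uses only the four measurements above; the variable names are hypothetical.

```python
# Average delivery time in ms, indexed by number of republishers (from the table).
delivery_ms = {0: 26, 1: 49, 2: 67, 3: 82}

# Extra time attributable to the n-th republisher on the path.
added_ms = {n: delivery_ms[n] - delivery_ms[n - 1] for n in range(1, 4)}

# Mean extra time per republisher over the whole range.
mean_added_ms = (delivery_ms[3] - delivery_ms[0]) / 3

print(added_ms)                 # {1: 23, 2: 18, 3: 15}
print(round(mean_added_ms, 1))  # 18.7
```

In these measurements each republisher adds between 15 and 23 ms, about 18.7 ms on average, on top of the 26 ms direct delivery time.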

Page 28: Integrating Distributed Data Streams

Conclusions

- Distributed streams of data are increasingly being made available
- Distributed users are interested in multiple streams
- Developed a system for
  - Publishing distributed data streams
  - Querying multiple stream sources without source knowledge
- Republishers required to allow system to scale

Page 29: Integrating Distributed Data Streams

Future Work

Increase complexity of query language

Integrate stored and stream sources