Towards Efficient Stream Processing in the Wide Area

Matvey Arye, Siddhartha Sen, Ariel Rabkin, Michael J. Freedman (Princeton University)


Transcript of Towards Efficient Stream Processing in the Wide Area

Page 1: Towards Efficient Stream Processing in the Wide Area

Towards Efficient Stream Processing in the Wide Area

Matvey Arye, Siddhartha Sen, Ariel Rabkin,

Michael J. Freedman

Princeton University

Page 2: Towards Efficient Stream Processing in the Wide Area

Our Problem Domain

Page 3: Towards Efficient Stream Processing in the Wide Area

Also Our Problem Domain

Page 4: Towards Efficient Stream Processing in the Wide Area

Use Cases

• Network Monitoring
• Internet Service Monitoring
• Military Intelligence
• Smart Grid
• Environmental Sensing
• Internet of Things

Page 5: Towards Efficient Stream Processing in the Wide Area

The World of Analytical Processing

Real-Time: Streaming
Historical: OLAP Databases

Page 6: Towards Efficient Stream Processing in the Wide Area

The World of Analytical Processing

Real-Time (Streaming)
• Simpler queries
• Standing queries
• Real-time answers
• Examples: Borealis/Streambase, System-S, Storm

Historical (OLAP Databases)
• High ingest time
• Fast query time
• Examples: Oracle, SAP, IBM

Both: Single Datacenter

Page 7: Towards Efficient Stream Processing in the Wide Area

Data Transfer

Page 8: Towards Efficient Stream Processing in the Wide Area

Trends in Cost/Performance, 2003-2008 [Above the Clouds, Armbrust et al.]

• CPU: 16x
• Storage: 10x
• Bandwidth: 2.7x

Page 9: Towards Efficient Stream Processing in the Wide Area

Aggregate At Local Datacenters

Page 10: Towards Efficient Stream Processing in the Wide Area

The World of Analytical Processing

Real-Time: Streaming
Historical: OLAP Databases

Single Datacenter: Streaming, OLAP Databases
Wide area: JetStream

JetStream = Real-time + Historical + Wide Area

Page 11: Towards Efficient Stream Processing in the Wide Area

Large Caveat

• Preliminary work

• We want feedback and suggestions

Page 12: Towards Efficient Stream Processing in the Wide Area

Challenges

• Query placement and scheduling
• Approximation of answers
• Supporting User Defined Functions (UDFs)
• Queries on historical data
• Adaptation to network changes
• Handling node failures

Page 13: Towards Efficient Stream Processing in the Wide Area

Motivating Example

• “Top-K domains served by a CDN”
  – Recall CDN is globally distributed
  – Services many domains

• Main Challenge: Minimize backhaul of data

Page 14: Towards Efficient Stream Processing in the Wide Area

How Is the Query Specified

Union Count Sort Limit
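
The four operators on this slide can be read as a simple pipeline. The sketch below is only an illustration in Python, not JetStream's query language: the function names, the sample domains, and the tie-breaking rule are all assumptions made for the example.

```python
from collections import Counter
from itertools import chain

def union(*streams):
    """Merge several request logs into one stream of domain names."""
    return chain.from_iterable(streams)

def count(domains):
    """Count requests per domain."""
    return Counter(domains)

def sort_desc(counts):
    """Order (domain, count) pairs by descending count, ties broken by name."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

def limit(rows, k):
    """Keep only the first k rows."""
    return rows[:k]

def top_k(streams, k):
    """Union -> Count -> Sort -> Limit, all on a single node."""
    return limit(sort_desc(count(union(*streams))), k)

# Hypothetical request logs from two datacenters.
dc1 = ["google.com", "yahoo.com", "google.com"]
dc2 = ["google.com", "bing.com", "yahoo.com"]
result = top_k([dc1, dc2], k=2)
print(result)
```

Every raw request is shipped to the node that runs this pipeline, which is the backhaul cost the following slides try to reduce.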

Page 15: Towards Efficient Stream Processing in the Wide Area

Problems

Single aggregation point

Runs on a single node

Union Count Sort Limit

Page 16: Towards Efficient Stream Processing in the Wide Area

Aggregate at local DC

[Diagram: DC1 and DC2 each run CountPartial and send less data to DC3, which runs Union, Count, Sort, Limit.]

Page 17: Towards Efficient Stream Processing in the Wide Area

Count Partials

CountPartial Union Count

Input tuples: (Google,1) (Google,1) (Google,1) (Google,1) (Google,1)

Output: (Google,5)
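
A minimal sketch of the partial-count idea, in Python. The split of three tuples at one datacenter and two at the other is an assumption for illustration, as are the function names.

```python
from collections import Counter

def count_partial(tuples):
    """Pre-aggregate (key, n) tuples inside one datacenter."""
    partial = Counter()
    for key, n in tuples:
        partial[key] += n
    return partial

def count_final(partials):
    """Add up the partial counts received from all datacenters."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts, it does not replace them
    return total

# Assumed placement of the five input tuples.
dc1_tuples = [("Google", 1)] * 3
dc2_tuples = [("Google", 1)] * 2

partials = [count_partial(dc1_tuples), count_partial(dc2_tuples)]
final = count_final(partials)
tuples_sent = sum(len(p) for p in partials)
print(final, tuples_sent)
```

Under this assumed split, two tuples cross the wide area instead of five, and the final count is unchanged.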

Page 18: Towards Efficient Stream Processing in the Wide Area

Non-Distributed Computation

[Diagram: DC1 and DC2 send to DC3, where Union, Count, Sort, Limit run as a non-distributed computation.]

Page 19: Towards Efficient Stream Processing in the Wide Area

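Split Count

[Diagram: DC1 and DC2 send to DC3. Inside DC3, Count is split by key range into Count A-H, Count I-M, Count N-Z; the other operators are Union, Sort, Limit.]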
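
Splitting Count by key range can be sketched as a routing function plus one counter per range. This is an illustrative Python sketch; routing on the first letter of the domain is an assumption drawn from the A-H, I-M, N-Z labels, and the function names are hypothetical.

```python
from collections import Counter

# Key ranges taken from the slide labels.
RANGES = [("A", "H"), ("I", "M"), ("N", "Z")]

def partition_for(domain):
    """Return the index of the Count instance responsible for this domain."""
    first = domain[0].upper()
    for index, (low, high) in enumerate(RANGES):
        if low <= first <= high:
            return index
    raise ValueError(f"no range covers {domain!r}")

def split_count(domains):
    """Route each domain to exactly one counter, so counters never overlap."""
    counters = [Counter() for _ in RANGES]
    for domain in domains:
        counters[partition_for(domain)][domain] += 1
    return counters

counters = split_count(["google.com", "yahoo.com", "imdb.com", "google.com"])
print(counters)
```

Because each domain lands in exactly one range, every counter holds final counts for its keys. Domains that do not start with a letter raise an error in this sketch.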

Page 20: Towards Efficient Stream Processing in the Wide Area

Split Union

[Diagram: DC1 and DC2 send to DC3. Inside DC3, Union is replaced by two Load Bal. operators feeding Count A-H, Count I-M, Count N-Z, followed by Sort, Limit.]

Page 21: Towards Efficient Stream Processing in the Wide Area

Do Partial Sort

[Diagram: DC1 and DC2 send to DC3. Inside DC3, two Load Bal. operators feed Count A-H, Count I-M, Count N-Z; each Count has its own SortPartial, followed by the final Sort, Limit.]

Page 22: Towards Efficient Stream Processing in the Wide Area

Push Limit Back

[Diagram: DC1 and DC2 send to DC3. Inside DC3, two Load Bal. operators feed Count A-H, Count I-M, Count N-Z; each Count has its own SortPartial and Limit, followed by the final Sort, Limit.]
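
One way to see why the limit can be pushed back: when counts are range-partitioned, each domain lives in exactly one partition, so any domain in the global top k must also be in its own partition's top k. The Python sketch below illustrates this with hypothetical counts and function names; it is not taken from the JetStream implementation.

```python
import heapq
from collections import Counter

def sort_partial_limit(counts, k):
    """SortPartial + Limit: the k largest (domain, count) rows of one partition."""
    return heapq.nlargest(k, counts.items(), key=lambda kv: kv[1])

def final_sort_limit(partial_lists, k):
    """Final Sort + Limit over the already-trimmed partial results."""
    candidates = [row for rows in partial_lists for row in rows]
    return heapq.nlargest(k, candidates, key=lambda kv: kv[1])

# Hypothetical per-range counts (A-H, I-M, N-Z).
partitions = [
    Counter({"amazon.com": 7, "google.com": 9, "bing.com": 1}),
    Counter({"imdb.com": 4}),
    Counter({"yahoo.com": 8, "netflix.com": 3}),
]

k = 2
partials = [sort_partial_limit(p, k) for p in partitions]
top = final_sort_limit(partials, k)
rows_forwarded = sum(len(rows) for rows in partials)
print(top, rows_forwarded)
```

The final operator sees at most k rows per partition rather than every distinct domain, and the answer stays exact under the disjoint-partition assumption.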

Page 23: Towards Efficient Stream Processing in the Wide Area

Single Host / Distributed Version

[Diagram: DC1 and DC2 send to DC3. Inside DC3, two Load Bal. operators feed Count A-H, Count I-M, Count N-Z; each Count has its own SortPartial and Limit, followed by the final Sort, Limit.]

Page 24: Towards Efficient Stream Processing in the Wide Area

What Is New

• Previous streaming systems
  – User guided transformations (System-S, Storm)
  – Simple transforms (Aurora)

• JetStream
  – More complex transforms
  – Transformation is network aware
  – Annotations for user defined functions

Page 25: Towards Efficient Stream Processing in the Wide Area

Joint Problems

• Transformations
  – Choosing which ones

• Placement
  – Network constrained
  – Heterogeneous nodes
  – Resource availability

• Decision has to be made at run-time

Page 26: Towards Efficient Stream Processing in the Wide Area

Tackling the Joint Problems

• Using heuristics

• Split into increasingly more local decisions
  – Global decisions are coarse grained
    • Example: Assign operators to DCs
  – Localized decisions
    • Operate only on local part of subgraph
    • Have more current view of available resources
    • Do not affect other parts of query graph placement
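
The split between coarse global decisions and localized ones can be sketched as two separate steps. Everything in this Python sketch is hypothetical: the operator and node names, the free-CPU numbers, and the greedy rule are assumptions chosen only to make the two levels concrete, not the heuristics JetStream uses.

```python
def place_globally(operators, sink_dc):
    """Coarse step: operators pinned to a source DC stay there, the rest go to the sink DC."""
    return {name: (dc if dc is not None else sink_dc) for name, dc in operators}

def place_locally(op_names, free_cpu):
    """Fine step inside one DC: greedily pick the node with the most free CPU."""
    free = dict(free_cpu)  # local, up-to-date view; other DCs are not consulted
    placement = {}
    for name in op_names:
        node = max(free, key=free.get)
        placement[name] = node
        free[node] -= 1
    return placement

# (operator, pinned DC or None) pairs for the top-k query.
operators = [
    ("CountPartial@DC1", "DC1"),
    ("CountPartial@DC2", "DC2"),
    ("Union", None), ("Count", None), ("Sort", None), ("Limit", None),
]

by_dc = place_globally(operators, sink_dc="DC3")
dc3_ops = [name for name, dc in by_dc.items() if dc == "DC3"]
dc3_nodes = place_locally(dc3_ops, {"node-a": 3, "node-b": 2})
print(by_dc, dc3_nodes)
```

The local step only reads and changes state for its own datacenter, which matches the point that localized decisions do not affect placement elsewhere in the query graph.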

Page 27: Towards Efficient Stream Processing in the Wide Area

Bottlenecks Still Possible

[Diagram: DC1 and DC2 each run CountPartial and send to DC3, which runs Union, Count, Sort, Limit. Label: "Possible Bottleneck".]

Use approximations when necessary

Page 28: Towards Efficient Stream Processing in the Wide Area

Adjusting Amount of Approximation

As a reaction to network dynamism

[Diagram: DC1 and DC2 each run CountPartial and send to DC3, which runs Union, Count, Sort, Limit.]

If bottleneck goes away, return to exact answers
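
A minimal sketch of adjusting approximation to the available bandwidth, in Python. The approximation method (forwarding only the heaviest partial counts) and the notion of a per-window tuple budget are assumptions for illustration; the slides do not commit to a specific technique.

```python
def tuples_to_send(partial_counts, link_budget):
    """Return (rows, exact): all rows if they fit the budget, else the heaviest ones."""
    rows = sorted(partial_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    if len(rows) <= link_budget:
        return rows, True          # no bottleneck: exact partial counts
    return rows[:link_budget], False  # bottleneck: truncated, approximate

# Hypothetical partial counts at one datacenter.
counts = {"google.com": 9, "yahoo.com": 8, "bing.com": 1}

congested_rows, congested_exact = tuples_to_send(counts, link_budget=2)
clear_rows, clear_exact = tuples_to_send(counts, link_budget=5)
print(congested_rows, congested_exact)
print(clear_rows, clear_exact)
```

Re-evaluating the budget each window gives the behavior on the slide: once the budget is large enough again, the sender returns to exact answers.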

Page 29: Towards Efficient Stream Processing in the Wide Area

Approximation Challenges

• How to quantify error for approximations?
  – Uniform across approximation methods
  – Easy to understand
  – Integrates well with metrics for source/node failures

• How do we allow UDF approximation algorithms?
  – Which exact operators can they replace
  – Quantifying the tradeoffs
  – Placement & Scheduling

Page 30: Towards Efficient Stream Processing in the Wide Area

Approximation Composition

[Diagram: DC1 and DC2 each run CountPartial and send to DC3, which runs Union, Count, Sort, Limit. Labels: "Error=e", "Error=?".]

If we approximate count, how does that error affect sort & final answer?
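
One simple case can be worked out by hand. It assumes an absolute per-key error bound, which is only one possible error model and not necessarily the one JetStream would adopt; the symbols c_i (true count) and \hat{c}_i (approximate count) are introduced here for the example.

```latex
% Assumption: every approximate count is within e of the true count.
|\hat{c}_i - c_i| \le e \quad \text{for all domains } i
% Then two domains whose true counts differ by more than 2e keep their order:
c_i - c_j > 2e \;\Longrightarrow\; \hat{c}_i \ge c_i - e > c_j + e \ge \hat{c}_j
```

Under that assumption, Sort can only misorder domains whose true counts are within 2e of each other, so the top-k answer is at risk only near the cutoff. How such a bound composes with other approximation methods remains the open question on the slide.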

Page 31: Towards Efficient Stream Processing in the Wide Area

Approximations in Uneven Networks

[Diagram: DC1 and DC2 each run CountPartial and send to DC3, which runs Union, Count, Sort, Limit. One link is labeled "Low Bandwidth Link; Needs Approximation", the other "High Bandwidth Link".]

Do we need to approximate link DC1-DC3 if we approximate link DC2-DC3?

Page 32: Towards Efficient Stream Processing in the Wide Area

Discovering data trends?

• How has top-k changed over past hour?

• Current streaming systems don’t answer this
  – Except by using centralized DBs.

• JetStream proposes using storage at the edges

Page 33: Towards Efficient Stream Processing in the Wide Area

Hypercube Data Structure

             Google        Yahoo
<5Kb         (10, 5ms)     (100, 20ms)
50Kb-1Mb     (0, 0ms)      (1, 4ms)
>1Mb         (5, 10ms)     (5, 30ms)

Minute: 1 … 60

Page 34: Towards Efficient Stream Processing in the Wide Area

Hypercube Data Structure

All
Month: 01 … 12
Day: 1 … 31
Hour: 1 … 24
Minute: 1 … 60

Page 35: Towards Efficient Stream Processing in the Wide Area

Hypercube Data Structure

All
Month: 01 … 12
Day: 1 … 31
Hour: 1 … 24
Minute: 1 … 60

Aggregate

             Google        Yahoo
<5Kb         (90, 9ms)     (500, 20ms)
50Kb-1Mb     (0, 0ms)      (5, 9ms)
>1Mb         (5, 10ms)     (10, 30ms)
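
A minimal sketch of such a cube in Python, with cells keyed by (minute, size bucket, domain). The slides do not say how the latency value is combined when cells are aggregated; this sketch assumes it is a mean and stores a latency sum so that roll-ups stay exact. The class and method names are hypothetical.

```python
from collections import defaultdict

class Cube:
    """Cells keyed by (minute, size_bucket, domain) -> [request count, latency sum in ms]."""

    def __init__(self):
        self.cells = defaultdict(lambda: [0, 0.0])

    def add(self, minute, size_bucket, domain, latency_ms):
        """Record one request in its minute-level cell."""
        cell = self.cells[(minute, size_bucket, domain)]
        cell[0] += 1
        cell[1] += latency_ms

    def rollup(self, minutes, size_bucket, domain):
        """Aggregate minute cells into one (count, mean latency) pair."""
        count, total = 0, 0.0
        for minute in minutes:
            c, t = self.cells.get((minute, size_bucket, domain), (0, 0.0))
            count += c
            total += t
        mean = total / count if count else 0.0
        return count, mean

cube = Cube()
cube.add(1, "<5Kb", "Google", 4.0)
cube.add(1, "<5Kb", "Google", 6.0)
cube.add(2, "<5Kb", "Google", 8.0)

hour_cell = cube.rollup(range(1, 61), "<5Kb", "Google")
print(hour_cell)
```

Keeping sums rather than means in each cell is a design choice of this sketch: sums can be added across minutes without needing the original requests.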

Page 36: Towards Efficient Stream Processing in the Wide Area

Query: “Last Hour and a half” (without materializing intermediate nodes)

All
Month: 01 … 12
Day: 1 … 31
Hour: 1 … 2
Minute: 1 … 60 (first hour), 1 … 30 … (second hour)

Page 37: Towards Efficient Stream Processing in the Wide Area

Query: “Last Hour and a half” (by materializing intermediate nodes)

All
Month: 01 … 12
Day: 1 … 31
Hour: 1 … 2
Minute: 1 … 60 (first hour), 1 … 30 … (second hour)
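
The difference between the two query plans can be counted in cells read. This Python sketch assumes the window is aligned to hour boundaries and counts cells for a single (domain, size bucket) slice; the function name is hypothetical.

```python
def cells_scanned(window_minutes, materialize_hours):
    """Cells read for one slice of the cube over an hour-aligned time window."""
    full_hours, leftover_minutes = divmod(window_minutes, 60)
    if materialize_hours:
        # One pre-aggregated cell per full hour, plus the remaining minute cells.
        return full_hours + leftover_minutes
    # Without materialized hour nodes, every minute cell is read.
    return window_minutes

without_hours = cells_scanned(90, materialize_hours=False)
with_hours = cells_scanned(90, materialize_hours=True)
print(without_hours, with_hours)
```

For the hour-and-a-half query this is 90 minute cells versus one hour cell plus 30 minute cells, at the cost of storing and maintaining the hour-level aggregates.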

Page 38: Towards Efficient Stream Processing in the Wide Area

Historical Queries

• Hypercubes have been used before
  – In the database literature

• What’s Novel
  – Storage at the edges (and in the network)
  – Time hierarchy

Page 39: Towards Efficient Stream Processing in the Wide Area

Challenges we talked about

• Query placement and scheduling
• Approximation of answers
• Supporting User Defined Functions (UDFs)
• Queries on historical data
• Adaptation to network changes
• Handling node failures

Page 40: Towards Efficient Stream Processing in the Wide Area

Conclusion

JetStream Explores…
+ Stream Processing
+ Historical Data / Trend Analysis
+ Wide Area
