Chapter 10: Stream-based Data Management
• Title: Design, Implementation, and Evaluation of the Linear Road Benchmark on the Stream Processing Core
• Authors: Navendu Jain, Lisa Amini, et al.
Design, Implementation, and Evaluation of the Linear Road Benchmark on the Stream Processing Core
• Problem
– Problem Statement
– Why is this problem important?
– Why is this problem hard?
• Approaches
– Approach description, key concepts
– Contributions (novelty, improved)
– Assumptions
Problem Statement
• Given
– Stream data and continuous queries in large-scale distributed environments
– Streaming data application (Linear Road)
– Stream processing middleware (Stream Processing Core, SPC)
• Find
– Performance bottlenecks of streaming data applications
• Objectives
– Understand the performance characteristics of the stream data application
• Constraints
– SPC is constantly overloaded with respect to the available resources.
– Processing elements are a mix of I/O-bound and CPU-bound.
– Applications cannot realistically store the full history of a stream in memory, so they are also memory-bound.
Why is this problem important?
• High-volume, continuous data are ubiquitous.
– Text and transactional data
– Digital audio, video, and images
– Instant messages, network packet traces
– Sensor data
• Stream processing applications have become important in the networking and database communities.
Why is this problem hard?
• Stream data are
– Large volume
– High data rates
– Generated by multiple distributed data sources
– Rapidly updated
• Processing stream data requires
– Filtering
– Aggregation
– Correlation
• A system supporting stream data processing applications should consider
– Scalability
– Latency
– Resource utilization
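The three processing steps named above (filtering, aggregation, correlation) can be sketched as a chain of Python generators. This is only an illustration, not code from the paper: the sensor readings, field names, and the `locations` lookup table are hypothetical. The aggregation step keeps a constant amount of state per key, so the pipeline never has to hold the stream history.

```python
from collections import defaultdict


def filter_stream(tuples, predicate):
    """Filtering: pass through only the tuples that satisfy the predicate."""
    for t in tuples:
        if predicate(t):
            yield t


def running_mean(tuples, key, value):
    """Aggregation: running mean per key, using only a count and a sum per key."""
    count = defaultdict(int)
    total = defaultdict(float)
    for t in tuples:
        k = t[key]
        count[k] += 1
        total[k] += t[value]
        yield {key: k, "mean": total[k] / count[k]}


def correlate(tuples, reference, key):
    """Correlation: join each stream tuple with a lookup table on `key`."""
    for t in tuples:
        if t[key] in reference:
            yield {**t, **reference[t[key]]}


# Hypothetical input: sensor readings, one of which is a failed reading.
readings = [
    {"sensor": "a", "temp": 20.0},
    {"sensor": "b", "temp": -999.0},  # sentinel value for a failed reading
    {"sensor": "a", "temp": 22.0},
    {"sensor": "a", "temp": 24.0},
]
locations = {"a": {"site": "north"}, "b": {"site": "south"}}

valid = filter_stream(readings, lambda t: t["temp"] > -100.0)
averaged = running_mean(valid, key="sensor", value="temp")
results = list(correlate(averaged, locations, key="sensor"))
```

Because each stage is a generator, tuples flow through one at a time. Scalability, latency, and resource utilization then depend on how the stages are distributed and scheduled, which this single-process sketch does not model.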
Novelty of Contribution
• Related Work
– DataCutter, StreamIt: Connections between applications are statically determined.
– TelegraphCQ, Aurora, Borealis, STREAM: Provide support for stream data manipulation from a database-centric perspective, but process streams of tuples individually (i.e., small-scale).
– Benchmarks: Previous work on Linear Road did not report any performance numbers.
• Contributions
– SPC supports dynamic application composition.
– Evaluates the SPC using the Linear Road application in multiple distributed configurations.
– Presents a highly scalable implementation of the Linear Road application.
– Studies the behavior of the streaming infrastructure's support for large-scale continuous and historical queries, identifying performance bottlenecks and tuning them.
SPC Architecture
• Publish-subscribe model
– Each processing element (PE) that consumes and produces stream data specifies the characteristics of the streams.
– SPC dynamically determines the stream connections by matching stream descriptors as new applications and new data sources join and leave the system.
• Reusing streams
– Results in significant resource savings.
– Discovers useful information over an ever-changing set of data sources.
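The descriptor matching described above can be sketched as a small registry that recomputes connections whenever a PE joins or leaves. This is a toy model under stated assumptions, not the SPC API: the `PE` and `Registry` classes, the attribute-equality matching rule, and the PE names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PE:
    name: str
    produces: dict = field(default_factory=dict)  # descriptor of the output stream
    consumes: dict = field(default_factory=dict)  # descriptor the PE subscribes to


def matches(wanted, offered):
    """Assumed rule: every wanted attribute must equal the offered attribute."""
    return all(offered.get(k) == v for k, v in wanted.items())


class Registry:
    """Hypothetical stand-in for the middleware's connection manager."""

    def __init__(self):
        self.pes = []
        self.connections = set()  # pairs of (producer name, consumer name)

    def _rematch(self):
        # Recompute all connections from the current set of descriptors.
        self.connections = {
            (p.name, c.name)
            for p in self.pes
            for c in self.pes
            if p is not c and p.produces and c.consumes
            and matches(c.consumes, p.produces)
        }

    def join(self, pe):
        self.pes.append(pe)
        self._rematch()

    def leave(self, name):
        self.pes = [pe for pe in self.pes if pe.name != name]
        self._rematch()


registry = Registry()
registry.join(PE("source", produces={"type": "position_report", "format": "csv"}))
registry.join(PE("segment_stats", consumes={"type": "position_report"},
                 produces={"type": "segment_stats"}))
registry.join(PE("toll", consumes={"type": "segment_stats"}))
# A later application reuses the existing source stream instead of creating its own.
registry.join(PE("audit", consumes={"type": "position_report"}))
```

In this sketch `audit` is connected to `source` without any change to the PEs that joined earlier, which is the stream reuse the slide refers to. Calling `registry.leave("source")` removes both of its connections and leaves only the `segment_stats` to `toll` link.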
Performance Challenges and Optimizations in SPC
• Challenges
– PEs perform either a small amount of processing on large volumes of data or a large amount of processing on lower volumes of data, so the workload is a mix of I/O-bound and CPU-bound PEs.
– It is impossible to store the stream history in memory, so PEs are also memory-bound.
• Optimizations
– SDO (stream data object) filtering: SPC can filter out unwanted objects, saving resources.
– Events: PEs can subscribe to system events and adapt their algorithms accordingly.
– Dynamic copies of PEs
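The first two optimizations can be illustrated with a toy runtime. It applies each subscriber's filter before delivery, so unwanted objects never reach the PE, and it forwards system events so a PE can change its behaviour. Everything here is an assumption made for illustration: the `Runtime` and `SamplingPE` classes, the `"overload"` and `"normal"` event names, and the sampling reaction are hypothetical and do not reflect the actual SPC interfaces. Dynamic PE copies are not modelled.

```python
class Runtime:
    """Toy stand-in for the middleware; not the SPC API."""

    def __init__(self):
        self.subscriptions = []    # (predicate, pe) pairs
        self.event_listeners = []  # callables that receive an event name
        self.delivered = 0
        self.dropped = 0

    def subscribe(self, pe, predicate):
        self.subscriptions.append((predicate, pe))

    def on_event(self, listener):
        self.event_listeners.append(listener)

    def publish(self, sdo):
        # Filtering happens in the runtime, before the PE spends any work.
        for predicate, pe in self.subscriptions:
            if predicate(sdo):
                pe.process(sdo)
                self.delivered += 1
            else:
                self.dropped += 1

    def raise_event(self, name):
        for listener in self.event_listeners:
            listener(name)


class SamplingPE:
    """Hypothetical PE that processes every object, or every 2nd one when overloaded."""

    def __init__(self):
        self.stride = 1
        self.seen = 0
        self.processed = []

    def handle_event(self, name):
        if name == "overload":
            self.stride = 2
        elif name == "normal":
            self.stride = 1

    def process(self, sdo):
        self.seen += 1
        if self.seen % self.stride == 0:
            self.processed.append(sdo)


runtime = Runtime()
pe = SamplingPE()
runtime.subscribe(pe, lambda sdo: sdo["kind"] == "position")
runtime.on_event(pe.handle_event)

# Mixed input: only "position" objects are delivered to the PE.
for i in range(4):
    runtime.publish({"kind": "position" if i % 2 == 0 else "query", "id": i})

# After an overload event the PE switches to processing every second object.
runtime.raise_event("overload")
for i in range(4, 8):
    runtime.publish({"kind": "position", "id": i})
```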
Linear Road Benchmark
• Simulates the traffic characteristics of a simple urban expressway system.
• The input to the Linear Road benchmark is in stream data format.
• Requires a stream-based data management system (SDMS) to process a set of continuous and historical queries.
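The difference between the two query types can be sketched as follows. A continuous query is re-evaluated on every arriving tuple. A historical query is answered on demand, here from a compact summary rather than by replaying the stream. The report fields (`vehicle`, `segment`, `speed`), the window size, and the charge ledger are hypothetical simplifications, not the benchmark's actual schema or query definitions.

```python
from collections import defaultdict, deque


class SegmentSpeed:
    """Continuous query: average of the most recent speeds reported per segment."""

    def __init__(self, window=3):
        # Each segment keeps at most `window` speeds, so memory stays bounded.
        self.recent = defaultdict(lambda: deque(maxlen=window))

    def update(self, report):
        buf = self.recent[report["segment"]]
        buf.append(report["speed"])
        return sum(buf) / len(buf)


class ChargeLedger:
    """Historical query: answered from a running total per vehicle."""

    def __init__(self):
        self.total = defaultdict(int)

    def charge(self, vehicle, amount):
        self.total[vehicle] += amount

    def balance(self, vehicle):
        return self.total[vehicle]


speeds = SegmentSpeed(window=2)
reports = [
    {"vehicle": 1, "segment": 7, "speed": 60},
    {"vehicle": 2, "segment": 7, "speed": 40},
    {"vehicle": 1, "segment": 7, "speed": 20},
]
# The continuous query emits a fresh answer for every incoming report.
averages = [speeds.update(r) for r in reports]

ledger = ChargeLedger()
ledger.charge(1, 5)
ledger.charge(1, 3)
ledger.charge(2, 4)
```

Keeping a summary per vehicle instead of the raw stream is one way to answer historical queries when the full history cannot be held in memory. Whether the SPC implementation does this is not stated in the slides.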
Prototype Implementation
• Design principles
– Modularity
– Data Aggregation
– Network and Data Locality
– Flexible Programming Environment
• Linear Road in SPC
– The figure shows the query network infrastructure comprising 15 PEs.
Experiments
• The input data increases over time to stress-test the system
• Scalability
Experiments
• Analyzing Bottleneck PEs
• PE Placement Policy
Summary
• Paper’s focus
– Understanding the performance characteristics of stream processing applications in a distributed setup
• Ideas
– Design and implementation of the Linear Road benchmark on the SPC middleware
– Identify the main performance bottlenecks to achieve scalability and low query response latency
• Contributions
– Demonstrate a scalable distributed implementation of Linear Road
– Highlight the importance of addressing performance bottlenecks
• Analytical Validation
– Experiments
– Prototyping
Assumptions, Rewrite today
• Assumptions
– The evaluation is restricted to SPC support for the Linear Road application, assuming that the design decisions and performance results are applicable to other streaming applications.
– The system is constantly overloaded with respect to the available resources.
– PEs are I/O-, CPU-, and memory-bound.
• Rewrite today
– Apply the ideas to other types of streaming applications.
– Run more extensive experiments on performance tuning.