Effective Variation Management for Pseudo Periodical Streams


Page 1: Effective Variation Management for Pseudo Periodical Streams

23/4/22

ACM SIGMOD 2007

Effective Variation Management for Pseudo Periodical Streams

Lv-an Tang, Bin Cui, Hongyan Li, Gaoshan Miao, Dongqing Yang, Xinbiao Zhou

School of EECS

Peking University

Page 2: Effective Variation Management for Pseudo Periodical Streams

Summary

Introduction

Related Work

Variation Management for Pseudo Periodical Stream

Experiments

Conclusion

Page 3: Effective Variation Management for Pseudo Periodical Streams

Pseudo Periodical Stream

Pseudo periodical stream: data seems to repeat with a certain period

Tiny variations exist between different periods

Common in domains such as medicine and seismology

Typical stream variations: gradual evolutions rather than burst changes

Page 4: Effective Variation Management for Pseudo Periodical Streams

An Example of Pseudo Periodical Stream

The respiratory data repeats about every 3.2 seconds

Reflects the evolution of the patient’s illness over five hours

Page 5: Effective Variation Management for Pseudo Periodical Streams

Variation Management on Data Stream

Data streams are widely applied in many domains:

Stock market analysis

Road traffic control

Medical signal processing

Online variation management is an important task:

When did the variation occur? (Detect variations)

What is the variation? How does it change? (Describe variations)

Why does it change in this way? (Help understand variations)

Page 6: Effective Variation Management for Pseudo Periodical Streams

Major Technical Challenges

Value Type

Traditional algorithms: discrete values (enumerative) or time series (equidistant intervals)

Data stream: continuous real values with variable sampling frequencies

Training Sets or Models

Traditional algorithms rely on several training sets or predefined models

The data stream evolves, so the models may soon stop working

On the contrary, the system is required to generate such models as output

Page 7: Effective Variation Management for Pseudo Periodical Streams

Major Technical Challenges II

Variation Type

Variations appear not only in abnormal values and distributions

But also in the structure within a period (shape)

Noise: unpredictable, random

In many applications, the variations are monitored manually

Our contribution: proposing a new method named Pattern Growth Graph (PGG) to detect and store variations over pseudo periodical streams

Page 8: Effective Variation Management for Pseudo Periodical Streams

Summary

Introduction

Related Work

Variation Management for Pseudo Periodical Stream

Experiments

Conclusion

Page 9: Effective Variation Management for Pseudo Periodical Streams

Data Stream Management Systems

Data stream work can be loosely classified into two categories: DSMS and Online Data Mining

Data Stream Management Systems (DSMS)

Such as STREAM, Aurora, TelegraphCQ…

Mainly focus on completing predefined SQL queries

Do not try to find data features or to monitor variations

Page 10: Effective Variation Management for Pseudo Periodical Streams

Online data mining

Variation management is an important part of online data mining

Three classes according to the algorithms:

Symbolic Approaches

Mathematic Transformation

Predefined Models

Symbolic Approaches: Tarzan and SAX

Space: put the entire time series/data stream in memory

Precision is not good for SAX

Page 11: Effective Variation Management for Pseudo Periodical Streams

Mathematic Transformation

Mathematic Transformation: Discrete Wavelet Transform (DWT) and Fast Fourier Transform (FFT)

Require a fixed data length and a fixed sampling frequency (equidistant intervals)

The Haar wavelet transform can only be performed on 2^n data items, e.g., the data length must be 1024 or 2048

Predefined Models: Using Zigzag to detect events in financial streams (SIGMOD 04)

Too domain specific

Users cannot provide such models in advance; actually they would like them as the output

Page 12: Effective Variation Management for Pseudo Periodical Streams

Summary

Introduction

Related Work

Variation Management for Pseudo Periodical Stream

Experiments

Conclusion

Page 13: Effective Variation Management for Pseudo Periodical Streams

Task Specification by Respiration Stream

Variation: detect stream variations online in one pass

Wave: the smallest unit of concern is not a single point, but the values in a certain period, represented as a wave

Alarms: F is actually noise caused by body movements

Summary: a summary with an acceptable error bound is very helpful

[Figure: respiration stream plotted over Time (s), with waves labeled A to H]

Page 14: Effective Variation Management for Pseudo Periodical Streams

System Framework

[Figure: system framework. Wave Splitting turns the data stream into a wave stream. Wave-Pattern Matching compares each wave with the Pattern Growth Graph (full match: increase frequency; partial match: grow pattern; unmatched: new pattern). Online Variation Management, via online update, outputs the stream view, the pattern evolutions, and alarms.]

Page 15: Effective Variation Management for Pseudo Periodical Streams

Wave Splitting I

Variation: the difference from old data

Detected by comparing the old data with the incoming stream

Comparing at every incoming item wastes too many resources

Comparing only once per wave is much more efficient

How to divide the stream according to the data features?

Page 16: Effective Variation Management for Pseudo Periodical Streams

Wave Splitting II

A fixed-length window accumulates error

Observation: The waves start and end at valley points that are smaller than a certain value

Page 17: Effective Variation Management for Pseudo Periodical Streams

Upper Bound of Valley Points

User defined

Updated with the average value of past valley points:

$$U_b = \Big(\sum_{i=1}^{N} V_i\Big) / N$$

where $V_i$ is the $i$-th past valley point and $N$ is the number of valley points seen so far

Page 18: Effective Variation Management for Pseudo Periodical Streams

Valley Sections

Valley section: an approximately flat section that represents the time interval between two events

It is also worth studying as part of the wave

Take the last point of the section as the cut point
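The splitting rule from the last three slides can be sketched as follows. This is only an illustration under assumptions, not the authors' implementation: a valley section is taken to be a maximal run of points whose value is at or below the upper bound, and the names `split_waves` and `upper_bound` are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (time, value)


def split_waves(stream: List[Point], upper_bound: float) -> List[List[Point]]:
    """Cut the stream at the last point of each valley section.

    Assumption for this sketch: a valley section is a maximal run of
    points whose value is <= upper_bound.
    """
    waves: List[List[Point]] = []
    current: List[Point] = []
    for i, point in enumerate(stream):
        current.append(point)
        in_valley = point[1] <= upper_bound
        # The valley section ends here if the next point rises above the bound.
        next_leaves_valley = (
            i + 1 < len(stream) and stream[i + 1][1] > upper_bound
        )
        if in_valley and next_leaves_valley and len(current) > 1:
            waves.append(current)
            current = [point]  # the cut point also starts the next wave
    if len(current) > 1:
        waves.append(current)  # trailing, possibly incomplete wave
    return waves
```

Because the cut depends on the data rather than on a fixed window length, a slightly longer or shorter period does not shift the boundaries of the following waves.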

Page 19: Effective Variation Management for Pseudo Periodical Streams

Two Problems in Online Matching I

Problem 1: The data stream’s sampling frequency is usually high (>100Hz), so waves should be simplified

Problem 2: How to compare two waves that have different time lengths and may not have data at the same time points?

A: {(10,0.5), (20, 1.0), (25, 1.3), …(90, 50.5)} 22 data items

B: {(11,0.5), (25, 1.2), (30, 1.7) … (87, 50)} 20 data items

Page 20: Effective Variation Management for Pseudo Periodical Streams

Two Problems in Online Matching II

Solution 1: Piecewise Linear Representation (PLR)

This makes Problem 2 more difficult: patterns are simplified into segments, so how to compare segments with points?
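The slides do not say which PLR algorithm is used. As one possible illustration, the sketch below uses a greedy sliding-window variant; the function names and the `max_error` threshold are assumptions made for this example.

```python
from typing import List, Tuple

Point = Tuple[float, float]    # (time, value)
Segment = Tuple[Point, Point]  # (start point, end point)


def interpolate(seg: Segment, t: float) -> float:
    """Value of the straight line through seg at time t."""
    (t0, v0), (t1, v1) = seg
    if t1 == t0:
        return v0
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


def plr(points: List[Point], max_error: float) -> List[Segment]:
    """Greedy sliding-window PLR: extend a segment while every covered
    point stays within max_error of the straight line."""
    segments: List[Segment] = []
    start = 0
    while start < len(points) - 1:
        end = start + 1
        while end + 1 < len(points):
            candidate = (points[start], points[end + 1])
            covered = points[start:end + 2]
            if all(abs(interpolate(candidate, t) - v) <= max_error
                   for t, v in covered):
                end += 1
            else:
                break
        segments.append((points[start], points[end]))
        start = end
    return segments
```

A wave sampled at a high frequency then shrinks to a handful of segments, which addresses Problem 1 at the cost of the approximation error allowed by `max_error`.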

Page 21: Effective Variation Management for Pseudo Periodical Streams

Wave-pattern Matching

In real applications, two sequences are assumed to match if their paths roughly coincide

PLR segments record paths of old data

Testing whether the incoming stream items are on the paths

The intensity of variations can be determined by the number of matching items
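A minimal sketch of the path test described above, assuming that wave times are measured relative to the start of the wave and that "on the path" means within a fixed `tolerance` of the segment's line. Both assumptions, and the names `on_path` and `match_ratio`, are illustrative and not taken from the slides.

```python
from typing import List, Tuple

Point = Tuple[float, float]    # (time, value)
Segment = Tuple[Point, Point]  # PLR segment (start point, end point)


def on_path(point: Point, segments: List[Segment], tolerance: float) -> bool:
    """True if the point lies within tolerance of a segment covering its time."""
    t, v = point
    for (t0, v0), (t1, v1) in segments:
        if t0 <= t <= t1:
            expected = v0 if t1 == t0 else v0 + (v1 - v0) * (t - t0) / (t1 - t0)
            if abs(expected - v) <= tolerance:
                return True
    return False


def match_ratio(wave: List[Point], segments: List[Segment],
                tolerance: float) -> float:
    """Fraction of the wave's items that fall on the pattern's path."""
    if not wave:
        return 0.0
    matched = sum(on_path(p, segments, tolerance) for p in wave)
    return matched / len(wave)
```

A ratio near 1 would correspond to a full match, a lower ratio to a partial match or no match; where to draw those lines is left open here.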

[Figure: two ECG plots over Time (s), 0 to 1.5]

Page 22: Effective Variation Management for Pseudo Periodical Streams

Record the Patterns

Observation: Many patterns have only a few segments changed

Most stream variations are gradual evolutions rather than burst mutations

Recording by a simple list not only ignores their relationship but also causes storage redundancy

Utilize the similarity among patterns and reuse the unchanged parts

Pattern Growth Graph (PGG) is designed to store patterns and the variation history

Page 23: Effective Variation Management for Pseudo Periodical Streams

Pattern Growth Graph

Implemented as a bidirectional linked list

Only generate new segments for the unmatched data

New patterns seem to grow from the old ones
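The growth step can be sketched with doubly linked segment nodes. The class and function names (`SegmentNode`, `link`, `build_chain`, `grow`) are hypothetical and the layout is an assumption; the slides only state that the graph is a bidirectional linked list and that new segments are created for the unmatched data alone.

```python
from typing import List


class SegmentNode:
    """One PLR segment in the graph, linked in both directions."""

    def __init__(self, label: str) -> None:
        self.label = label
        self.prev_nodes: List["SegmentNode"] = []
        self.next_nodes: List["SegmentNode"] = []


def link(left: SegmentNode, right: SegmentNode) -> None:
    """Connect two nodes in both directions."""
    left.next_nodes.append(right)
    right.prev_nodes.append(left)


def build_chain(labels: List[str]) -> List[SegmentNode]:
    """Create a base pattern as a simple chain of segments."""
    nodes = [SegmentNode(label) for label in labels]
    for left, right in zip(nodes, nodes[1:]):
        link(left, right)
    return nodes


def grow(base: List[SegmentNode], first: int, last: int,
         new_labels: List[str]) -> List[SegmentNode]:
    """Grow a new pattern that replaces base[first..last] (inclusive).

    Only the changed span gets new nodes; the unchanged segments of the
    base pattern are reused through the links.
    """
    new_nodes = build_chain(new_labels)
    if first > 0:
        link(base[first - 1], new_nodes[0])
    if last + 1 < len(base):
        link(new_nodes[-1], base[last + 1])
    return new_nodes
```

For a base pattern of eight segments where two change, only two nodes are added instead of a second copy of all eight.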

[Figure: a wave with Pattern 1 (base pattern, segments 1 to 8) and Pattern 2 (growth pattern, new segments 1' to 4'), both running from Start to End]

Page 24: Effective Variation Management for Pseudo Periodical Streams

Construct Full Wave-pattern

New Problem: Wave-Pattern matching needs full pattern to compare, while PGG only stores the new parts

Fortunately we can construct the full pattern by propagating the pointers
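The slides rebuild the full pattern by propagating pointers through the linked list. The sketch below swaps in a simpler index-based stand-in to show the idea: each growth pattern stores only its new segments plus a reference to the pattern it grew from. The `Pattern` record and its fields are assumptions made for this example.

```python
from typing import Dict, List, NamedTuple, Optional


class Pattern(NamedTuple):
    parent: Optional[str]  # pattern this one grew from; None for the base
    start: int             # first replaced index in the parent's full pattern
    end: int               # one past the last replaced index
    segments: List[str]    # only the newly generated segments


def full_pattern(name: str, graph: Dict[str, Pattern]) -> List[str]:
    """Rebuild the complete segment list of a pattern from its ancestors."""
    pattern = graph[name]
    if pattern.parent is None:
        return list(pattern.segments)
    base = full_pattern(pattern.parent, graph)
    return base[:pattern.start] + pattern.segments + base[pattern.end:]
```

Usage, with segment labels chosen for illustration:

```python
graph = {
    "P1": Pattern(None, 0, 0, ["1", "2", "3", "4", "5", "6", "7", "8"]),
    "P2": Pattern("P1", 2, 4, ["1'", "2'"]),
    "P3": Pattern("P2", 6, 7, ['1"']),
}
print(full_pattern("P3", graph))
```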

[Figure: constructing the full pattern for Pattern 3 (new segments 1", 2") from Pattern 2 (1', 2', 3') and Pattern 1 (segments 1 to 9), with a step table (Step 0, Step 1, Step 2, Final; column labels translated from Chinese: Pattern, Left 1, Right 1, Left 2, Right 2); a collision is marked in Step 2]

Page 25: Effective Variation Management for Pseudo Periodical Streams

Problems for PGG size

Number of waves in the data stream: n; PGG size: k

Time complexity of the PGG-based matching algorithm is O(k*n)

In the worst case, each incoming wave introduces a new pattern: the overall time cost is O(n^2)

When the PGG becomes larger, the algorithm is time-consuming

PGG is not allowed to use “forgetting functions”

Hard to delete in PGG

Some uncommon patterns may have higher domain significance
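As a rough worked bound for the worst case on this slide, assume each wave-pattern comparison costs one unit and the t-th wave is compared against every pattern stored so far, which is at most t-1 patterns when every earlier wave introduced a new one. The total number of comparisons over n waves is then

$$\sum_{t=1}^{n} (t-1) = \frac{n(n-1)}{2} = O(n^2)$$

which agrees with the quadratic worst-case cost stated above.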

Page 26: Effective Variation Management for Pseudo Periodical Streams

Rank the Patterns

Observation: The most frequent pattern and its similar patterns have the highest possibility to match the incoming wave

Matching probability factor

The patterns with smaller probability are not deleted, but have lower priority to be compared

When a pattern gets a match, the system increases not only its own rank but also the ranks of its “family”
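The formula of the matching probability factor is not shown in this transcript, so the sketch below only illustrates the ranking idea with made-up weights: a match adds 1 to the matched pattern and a smaller, hypothetical `family_bonus` to its related patterns, and patterns are compared in descending rank order without ever being deleted.

```python
from typing import Dict, List


def record_match(ranks: Dict[str, float], family: Dict[str, List[str]],
                 matched: str, family_bonus: float = 0.5) -> None:
    """Raise the rank of the matched pattern and, by less, of its family."""
    ranks[matched] = ranks.get(matched, 0.0) + 1.0
    for relative in family.get(matched, []):
        ranks[relative] = ranks.get(relative, 0.0) + family_bonus


def comparison_order(ranks: Dict[str, float]) -> List[str]:
    """Patterns sorted so that the most promising ones are compared first."""
    return sorted(ranks, key=lambda name: ranks[name], reverse=True)
```

Low-ranked patterns stay in the graph and are simply tried last, which keeps rare but possibly significant patterns available.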

Page 27: Effective Variation Management for Pseudo Periodical Streams

Reconstruct the Stream View with PGG

Queries on traditional DSMS: predefined, hard to conduct once the data items have passed by

Answer “the patient's ECG in the past five hours”

Record all patterns’ occurrence time in PGG

Reconstruct the stream view with PGG patterns

Only consumes about 4% storage space of the original stream, but can provide an approximate stream view within 5% relative error bound
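One way to picture the reconstruction, as a sketch under assumptions rather than the authors' method: keep a time-sorted list of (start time, pattern id) occurrences and, for a queried time, interpolate on the PLR vertices of the pattern that was active. The function `value_at` and the data layout are hypothetical; pattern vertices are assumed to have strictly increasing times relative to the wave start.

```python
from bisect import bisect_right
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]  # (time relative to wave start, value)


def value_at(t: float,
             occurrences: List[Tuple[float, str]],
             patterns: Dict[str, List[Point]]) -> Optional[float]:
    """Approximate stream value at time t, or None if no pattern covers t.

    occurrences must be sorted by start time.
    """
    starts = [start for start, _ in occurrences]
    idx = bisect_right(starts, t) - 1  # last occurrence starting at or before t
    if idx < 0:
        return None
    start, name = occurrences[idx]
    rel = t - start
    vertices = patterns[name]
    for (t0, v0), (t1, v1) in zip(vertices, vertices[1:]):
        if t0 <= rel <= t1:
            return v0 + (v1 - v0) * (rel - t0) / (t1 - t0)
    return None
```

Only the patterns and their occurrence times are stored, which is where the storage saving reported on this slide would come from.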

Page 28: Effective Variation Management for Pseudo Periodical Streams

Track Pattern Evolution

To answer “Why does it change in this way?”

When the user selects a pattern of interest, PGG can trace its source

Page 29: Effective Variation Management for Pseudo Periodical Streams

False Alarm

A successful system needs to reduce the false alarms introduced by noise

The major problem: noise comes from many sources, takes various forms, and is hard to model

Page 30: Effective Variation Management for Pseudo Periodical Streams

Noise Recognition

A shortcut: consider the pattern’s evolution history

Some strategies to reduce false alarms on medical streams:

Unusual values in growth patterns: the patient’s condition has worsened -- Warning

A new pattern that matches successive waves: the underlying pathological mechanism might have changed fundamentally -- Warning

A series of new patterns that match neither the previous nor the following waves -- suspected noise
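The last two strategies can be illustrated with a small sketch. It assumes each wave has already been assigned a pattern id and treats "matches successive waves" as "an adjacent wave carries the same id"; the function name and labels are hypothetical, and the first strategy (unusual values in growth patterns) is not covered.

```python
from typing import Dict, List


def label_new_patterns(wave_patterns: List[str],
                       known: List[str]) -> Dict[str, str]:
    """Label every pattern id that is not in `known`.

    wave_patterns[i] is the id of the pattern assigned to wave i.
    A new pattern repeated by an adjacent wave is a 'warning';
    one that matches neither neighbor is 'suspected noise'.
    """
    labels: Dict[str, str] = {}
    for i, pid in enumerate(wave_patterns):
        if pid in known:
            continue
        prev_same = i > 0 and wave_patterns[i - 1] == pid
        next_same = i + 1 < len(wave_patterns) and wave_patterns[i + 1] == pid
        if prev_same or next_same:
            labels[pid] = "warning"
        else:
            labels.setdefault(pid, "suspected noise")
    return labels
```

For the wave sequence A, A, N1, N2, A, B, B, A with A known, N1 and N2 come out as suspected noise and B as a warning.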

Page 31: Effective Variation Management for Pseudo Periodical Streams

System Framework

[Figure: system framework. Wave Splitting turns the data stream into a wave stream. Wave-Pattern Matching compares each wave with the Pattern Growth Graph (full match: increase frequency; partial match: grow pattern; unmatched: new pattern). Online Variation Management, via online update, outputs the stream view, the pattern evolutions, and alarms.]

Page 32: Effective Variation Management for Pseudo Periodical Streams

Summary

Introduction

Related Work

Variation Management for Pseudo Periodical Stream

Experiments

Conclusion

Page 33: Effective Variation Management for Pseudo Periodical Streams

Experimental Setup

Data Set

Medical streams: six real pathology signals including ECG, respiration... (over 25,000,000 data points)

Earthquake waves: the Pacific earthquake wave data from the NGA project (100,000 data points)

Sunspot data: all the sunspot records between the years 1850 and 2001 (55,000 data points)

Environment: Intel Pentium 4 3.0GHz CPU with 1GB RAM, Windows XP Professional, JDK 1.5.0…

Page 34: Effective Variation Management for Pseudo Periodical Streams

Effect of Rank Function

At the beginning, the effect is insignificant.

After three million data points, the naive algorithm’s performance decreases rapidly

In the end, the rank algorithm outperforms by about 300%

Page 35: Effective Variation Management for Pseudo Periodical Streams

Reconstruct the Stream View

The ECG data stream (more than 10M data items) can be represented with only 420 patterns

The compression result is achieved due to two factors:

PLR simplification reduces the size of the patterns to about 20%

PGG further reduces it to about 3.31% by compressing the repeating and similar patterns (the patterns need only 0.3%; the remaining 3% stores the occurrence times of the patterns)

Page 36: Effective Variation Management for Pseudo Periodical Streams

Compared with Other Methods

Compared PGG with SAX (symbolic approaches), Discrete Haar Wavelet Transformation (mathematic transformation) and Zigzag (predefined models)

The processing efficiency is 60K to 70K items/sec on average

Much higher than real application needs

Page 37: Effective Variation Management for Pseudo Periodical Streams

Variation Detection & Noise Recognition

Two important measurements:

Sensitivity (High Positive Rate): the algorithm sends alarms at meaningful variations

Selectivity (Low Negative Rate): the algorithm does not send false alarms on noise

The two measurements conflict: increasing sensitivity to find more variations will inevitably cause more false alarms

In a medical environment, sensitivity is much more important -- missing a meaningful variation may cost the patient’s life
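The slides do not give formulas for the two measurements. A common way to quantify them, assumed here for illustration, is the share of meaningful variations that raised an alarm and the share of noise sections that did not; the function and parameter names are hypothetical.

```python
def sensitivity(true_alarms: int, missed_variations: int) -> float:
    """Share of meaningful variations for which an alarm was sent."""
    return true_alarms / (true_alarms + missed_variations)


def selectivity(quiet_noise_sections: int, false_alarms: int) -> float:
    """Share of noise sections on which no alarm was sent."""
    return quiet_noise_sections / (quiet_noise_sections + false_alarms)
```

For example, 9 alarms on 10 meaningful variations give a sensitivity of 0.9, and 1 false alarm on 4 noise sections gives a selectivity of 0.75.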

Page 38: Effective Variation Management for Pseudo Periodical Streams

Best Results of Sensitivity on Respiration Stream

Zigzag sends a false alarm at almost every noise section

DWT and SAX can hardly distinguish real variations from noise

Page 39: Effective Variation Management for Pseudo Periodical Streams

Results of Noise Recognition on Other Streams

For the other streams, we take precision as the main measurement

PGG performs accurately and stably

Zigzag is volatile across datasets:

Good on three blood pressure signals (ABP, CVP and ICP; meaningful variations are outliers)

Poor on PLETH (meaningful variations are in the inner structure)

Page 40: Effective Variation Management for Pseudo Periodical Streams

Discussion

Zigzag: focuses on extreme data points, strongly influenced by outliers

SAX: good at finding variations over a long period using frequency statistics -- more suitable for time series

DWT: only effective for signals with strict periods

With the effective data structure, PGG discovers and records as many features of the data stream as possible

The recorded information helps distinguish between meaningful variations and noises

Page 41: Effective Variation Management for Pseudo Periodical Streams

Summary

Introduction

Related Work

Variation Management for Pseudo Periodical Stream

Experiments

Conclusion

Page 42: Effective Variation Management for Pseudo Periodical Streams

Conclusion

Streams are split as waves and represented by PLR patterns

Detect variations by online wave-pattern matching

Pattern Growth Graph stores the variation history

Reconstruct the stream view with high accuracy

Effectively distinguish meaningful variations from noises

Page 43: Effective Variation Management for Pseudo Periodical Streams

The System Interface

Page 44: Effective Variation Management for Pseudo Periodical Streams

Future Work

Extend PGG to multiple streams

Implement the PGG method in other application domains such as weather forecasting and financial analysis

Combine with other methods, like Zigzag…

Page 45: Effective Variation Management for Pseudo Periodical Streams

Thank You Very Much!

Questions and suggestions are welcome