Time Series - Georgia Institute of Technology
poloclub.gatech.edu/cse6242/2014spring/lectures/...

Post on 23-Mar-2021


Time Series Nonlinear Forecasting; Visualization; Applications

CSE 6242 / CX 4242 Mar 27, 2014

Duen Horng (Polo) Chau, Georgia Tech

Some lectures are partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos, Le Song

Last Time

Similarity search
• Euclidean distance
• Time-warping

Linear Forecasting
• AR (Auto Regression) methodology
• RLS (Recursive Least Squares) = fast, incremental least squares

This Time

Linear Forecasting
• Co-evolving time sequences

Non-linear forecasting
• Lag-plots + k-NN

Visualization and Applications

Co-Evolving Time Sequences

• Given: A set of correlated time sequences
• Forecast ‘Repeated(t)’

[Figure: number of packets vs. time tick for the sequences sent, lost, and repeated; the latest value of ‘repeated’ is unknown (??)]

Solution:

Q: what should we do?

Least Squares, with
• Dep. Variable: Repeated(t)
• Indep. Variables:
  • Sent(t-1) … Sent(t-w)
  • Lost(t-1) … Lost(t-w)
  • Repeated(t-1) … Repeated(t-w)
• (named: ‘MUSCLES’ [Yi+00])

[Figure: number of packets vs. time tick for the sequences sent, lost, and repeated]
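The least-squares setup above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the MUSCLES implementation from [Yi+00]: it fits the coefficients in one batch through the normal equations, the helper names (`lagged_row`, `fit_least_squares`, `forecast`) are hypothetical, and the toy data and its generating rule are made up.

```python
import random


def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy [A | b] so the inputs are not modified.
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def lagged_row(series_list, t, w):
    """Independent variables for time t: each series at t-1 ... t-w."""
    return [s[t - lag] for s in series_list for lag in range(1, w + 1)]


def fit_least_squares(series_list, target, w):
    """Coefficients from the normal equations (X^T X) beta = X^T y."""
    X = [lagged_row(series_list, t, w) for t in range(w, len(target))]
    y = [target[t] for t in range(w, len(target))]
    p = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve_linear_system(XtX, Xty)


def forecast(series_list, coeffs, t, w):
    """Predict the target at time t from the w most recent values."""
    return sum(c * v for c, v in zip(coeffs, lagged_row(series_list, t, w)))


# Toy data (made up): repeated(t) = 0.05 * sent(t-1) + 0.8 * lost(t-1).
random.seed(0)
n = 200
sent = [random.uniform(50, 100) for _ in range(n)]
lost = [random.uniform(0, 10) for _ in range(n)]
repeated = [0.0] * n
for t in range(1, n):
    repeated[t] = 0.05 * sent[t - 1] + 0.8 * lost[t - 1]

coeffs = fit_least_squares([sent, lost, repeated], repeated, w=1)
next_repeated = forecast([sent, lost, repeated], coeffs, t=n, w=1)
print(coeffs, next_repeated)
```

With window w = 1 the coefficient order is Sent(t-1), Lost(t-1), Repeated(t-1). A larger w simply adds more lagged columns per sequence.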

Forecasting - Outline

• Auto-regression
• Least Squares; recursive least squares
• Co-evolving time sequences
• Examples
• Conclusions

Examples - Experiments

• Datasets
  – Modem pool traffic (14 modems, 1500 time-ticks; #packets per time unit)
  – AT&T WorldNet internet usage (several data streams; 980 time-ticks)
• Measures of success
  – Accuracy: Root Mean Square Error (RMSE)
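As a reminder of how the accuracy measure is computed, here is a small sketch of RMSE together with a “yesterday” baseline. Reading “yesterday” as “predict x(t) = x(t-1)” is an assumption here, and the sample series is made up.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equally long sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def yesterday_forecast(series):
    """Baseline: the prediction for time t is the value at t-1 (t = 1 .. n-1)."""
    return series[:-1]


series = [3.0, 4.0, 6.0, 5.0, 7.0]  # made-up packet counts
baseline_error = rmse(series[1:], yesterday_forecast(series))
print(baseline_error)
```

Any forecaster can be scored the same way: align its predictions with `series[1:]` and compare the resulting RMSE against `baseline_error`.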

Accuracy - “Modem”

MUSCLES outperforms AR & “yesterday”

[Figure: RMSE (axis from 0 to 4) for each of modems 1-14, comparing AR, “yesterday”, and MUSCLES]

Accuracy - “Internet”

MUSCLES consistently outperforms AR & “yesterday”

[Figure: RMSE (axis from 0 to 1.4) for each of streams 1-15, comparing AR, “yesterday”, and MUSCLES]

Linear forecasting - Outline

• Auto-regression
• Least Squares; recursive least squares
• Co-evolving time sequences
• Examples
• Conclusions

Conclusions - Practitioner’s guide

• AR(IMA) methodology: prevailing method for linear forecasting

• Brilliant method of Recursive Least Squares for fast, incremental estimation.
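To make “fast, incremental estimation” concrete, the following is a minimal sketch of a textbook Recursive Least Squares update, not code from the lecture. The class name, the `delta` initialisation of the matrix P, and the toy data stream are assumptions, and no forgetting factor is used.

```python
import random


class RecursiveLeastSquares:
    """Maintains least-squares weights, updated one sample at a time."""

    def __init__(self, n_features, delta=1e4):
        self.w = [0.0] * n_features
        # P approximates the inverse of X^T X; a large delta means a weak prior.
        self.P = [[delta if i == j else 0.0 for j in range(n_features)]
                  for i in range(n_features)]

    def predict(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x))

    def update(self, x, y):
        """Fold in one (x, y) pair in O(n^2) time, with no matrix inversion."""
        n = len(self.w)
        Px = [sum(self.P[i][j] * x[j] for j in range(n)) for i in range(n)]
        denom = 1.0 + sum(x[i] * Px[i] for i in range(n))
        gain = [v / denom for v in Px]
        error = y - self.predict(x)
        self.w = [wi + gi * error for wi, gi in zip(self.w, gain)]
        # P <- P - gain * (P x)^T, valid because P stays symmetric.
        for i in range(n):
            for j in range(n):
                self.P[i][j] -= gain[i] * Px[j]
        return error


# Toy stream (made up): y = 2 * x1 - 3 * x2, observed one sample at a time.
rng = random.Random(1)
model = RecursiveLeastSquares(n_features=2)
for _ in range(200):
    x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    model.update(x, 2.0 * x[0] - 3.0 * x[1])
print(model.w)
```

Each update touches only the n-by-n matrix P, so the cost per new time tick does not grow with the length of the history.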

Resources: software and urls

• MUSCLES: Prof. Byoung-Kee Yi: http://www.postech.ac.kr/~bkyi/ or christos@cs.cmu.edu

• R http://cran.r-project.org/

Books

• George E.P. Box and Gwilym M. Jenkins and Gregory C. Reinsel, Time Series Analysis: Forecasting and Control, Prentice Hall, 1994 (the classic book on ARIMA, 3rd ed.)

• Brockwell, P. J. and R. A. Davis (1987). Time Series: Theory and Methods. New York, Springer Verlag.

Additional Reading

• [Papadimitriou+ vldb2003] Spiros Papadimitriou, Anthony Brockwell and Christos Faloutsos: Adaptive, Hands-Off Stream Mining, VLDB 2003, Berlin, Germany, Sept. 2003.

• [Yi+00] Byoung-Kee Yi et al.: Online Data Mining for Co-Evolving Time Sequences, ICDE 2000. (Describes MUSCLES and Recursive Least Squares)

Outline

• Motivation
• ...
• Linear Forecasting
• Non-linear forecasting
• Conclusions

Chaos & non-linear forecasting

Reference:

[Deepay Chakrabarti and Christos Faloutsos: F4: Large-Scale Automated Forecasting using Fractals, CIKM 2002, Washington DC, Nov. 2002.]

Detailed Outline

• Non-linear forecasting
  – Problem
  – Idea
  – How-to
  – Experiments
  – Conclusions

Recall: Problem #1

Given a time series $\{x_t\}$, predict its future course, that is, $x_{t+1}, x_{t+2}, \ldots$

[Figure: value vs. time]

Datasets

Logistic Parabola: $x_t = a\,x_{t-1}(1 - x_{t-1}) + \text{noise}$
Models population of flies [R. May/1976]

[Figure: x(t) vs. time, and the corresponding lag-plot. ARIMA: fails]
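The logistic parabola is easy to simulate, which helps when experimenting with lag plots. The sketch below is illustrative only: the parameter values a = 3.8 and x0 = 0.3, the Gaussian noise model, and the clipping to [0, 1] are assumptions, not values given in the lecture.

```python
import random


def logistic_parabola(n, a=3.8, x0=0.3, noise_std=0.0, seed=0):
    """Generate n values of x_t = a * x_{t-1} * (1 - x_{t-1}) + noise."""
    rng = random.Random(seed)
    xs = [x0]
    for _ in range(n - 1):
        nxt = a * xs[-1] * (1.0 - xs[-1]) + rng.gauss(0.0, noise_std)
        # Clipping keeps noisy values inside [0, 1]; this is an assumption.
        xs.append(min(max(nxt, 0.0), 1.0))
    return xs


def lag_plot_points(xs, lag=1):
    """Pairs (x_{t-lag}, x_t): the points drawn in a lag plot."""
    return [(xs[t - lag], xs[t]) for t in range(lag, len(xs))]


series = logistic_parabola(100)
points = lag_plot_points(series)
print(points[:3])
```

Plotted against time the series looks erratic, but the lag-plot points all fall on a single parabola, which is the structure a linear model cannot capture.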

How to forecast?

• ARIMA - but: linearity assumption
• ANSWER: ‘Delayed Coordinate Embedding’ = Lag Plots [Sauer92] ~ nearest-neighbor search, for past incidents

General Intuition (Lag Plot)

Lag = 1, k = 4 NN

[Figure: lag plot of $x_t$ vs. $x_{t-1}$, built up in steps]
• New Point: its $x_{t-1}$ value is known, its $x_t$ value is to be predicted
• 4-NN: find the 4 nearest neighbors of the new point
• Interpolate these…
• To get the final prediction
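The steps on the lag plot can be sketched as follows. This is a minimal illustration that interpolates with a plain average. The function name `knn_forecast` is hypothetical, and the toy series (a noise-free logistic parabola with a = 3.8, x0 = 0.3) is an assumption.

```python
def knn_forecast(history, k=4, lag=1):
    """One-step forecast: average the successors of the k nearest delay vectors."""
    # Every past delay vector (x[t-lag], ..., x[t-1]) paired with what followed it.
    pairs = [(history[t - lag:t], history[t]) for t in range(lag, len(history))]
    if not pairs:
        raise ValueError("history is too short for the chosen lag")
    query = history[-lag:]  # the "new point" whose successor is unknown

    def squared_distance(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    nearest = sorted(pairs, key=lambda p: squared_distance(p[0], query))[:k]
    return sum(successor for _, successor in nearest) / len(nearest)


# Toy data: noise-free logistic parabola (assumed parameters).
series = [0.3]
for _ in range(2000):
    series.append(3.8 * series[-1] * (1.0 - series[-1]))

history, truth = series[:-1], series[-1]
prediction = knn_forecast(history, k=4, lag=1)
print(prediction, truth)
```

The sort makes each forecast O(n log n) in the history length. A spatial index would be the natural replacement for long histories.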

Questions:

• Q1: How to choose lag L?
• Q2: How to choose k (the # of NN)?
• Q3: How to interpolate?
• Q4: Why should this work at all?

Q1: Choosing lag L

• Manually (16, in award-winning system by [Sauer94])

Q2: Choosing number of neighbors k

• Manually (typically ~ 1-10)

Q3: How to interpolate?

How do we interpolate between the k nearest neighbors?
A3.1: Average
A3.2: Weighted average (weights drop with distance - how?)
A3.3: Using SVD - seems to perform best ([Sauer94] - first place in the Santa Fe forecasting competition)

[Figure: lag plot of $x_t$ vs. $x_{t-1}$]
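Options A3.1 and A3.2 can be compared on a small example. The slide leaves open how the weights should drop with distance. Inverse-distance weighting, used below, is one common choice and is an assumption here, as are the function names and the sample neighbors. The SVD-based variant (A3.3) is not shown.

```python
def plain_average(neighbors):
    """A3.1: unweighted mean of the neighbors' next values."""
    return sum(value for _, value in neighbors) / len(neighbors)


def inverse_distance_average(neighbors, eps=1e-12):
    """A3.2 with one possible weighting: weight = 1 / (distance + eps)."""
    weights = [1.0 / (dist + eps) for dist, _ in neighbors]
    weighted_sum = sum(w * value for w, (_, value) in zip(weights, neighbors))
    return weighted_sum / sum(weights)


# Each neighbor is (distance to the new point, value that followed it); made-up numbers.
neighbors = [(0.01, 0.62), (0.02, 0.60), (0.05, 0.55), (0.20, 0.30)]
print(plain_average(neighbors), inverse_distance_average(neighbors))
```

The weighted version is pulled toward the closest neighbors, so a single distant neighbor (here the one at distance 0.20) has much less influence on the forecast.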

Q4: Any theory behind it?

A4: YES!

Theoretical foundation

• Based on the ‘Takens theorem’ [Takens81]
• which says that long enough delay vectors can do prediction, even if there are unobserved variables in the dynamical system (= diff. equations)
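For reference, a delay vector is just a window of past values of the one observed variable. The helper below is a hypothetical sketch: the names `dim` (vector length) and `tau` (spacing between entries) are assumptions introduced for illustration, and with `tau = 1` the vectors are plain sliding windows.

```python
def delay_vectors(series, dim, tau=1):
    """Delay vectors (x[t-(dim-1)*tau], ..., x[t-tau], x[t]), oldest value first."""
    start = (dim - 1) * tau
    return [tuple(series[t - i * tau] for i in reversed(range(dim)))
            for t in range(start, len(series))]


observed = [1, 2, 3, 4, 5]  # made-up scalar measurements
print(delay_vectors(observed, dim=3))
print(delay_vectors(observed, dim=2, tau=2))
```

Nearest-neighbor search is then run over these vectors instead of over single values.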

Detailed Outline

• Non-linear forecasting
  – Problem
  – Idea
  – How-to
  – Experiments
  – Conclusions

Datasets

Logistic Parabola: $x_t = a\,x_{t-1}(1 - x_{t-1}) + \text{noise}$
Models population of flies [R. May/1976]

[Figure: x(t) vs. time, and the corresponding lag-plot. ARIMA: fails]

Logistic Parabola

[Figure: value vs. timesteps, with the point marked “Our Prediction from here”]

[Figure: value vs. timesteps; comparison of prediction to correct values]

Datasets

LORENZ: Models convection currents in the air
$dx/dt = a\,(y - x)$
$dy/dt = x\,(b - z) - y$
$dz/dt = x y - c z$

[Figure: value vs. time for the LORENZ series]

LORENZ

[Figure: value vs. timesteps; comparison of prediction to correct values]
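A LORENZ series like the one above can be generated by integrating the three equations numerically. The sketch below uses a fixed-step fourth-order Runge-Kutta scheme. The parameter values a = 10, b = 28, c = 8/3, the step size, and the starting state are the commonly used textbook choices and are assumptions here, not values from the lecture.

```python
def lorenz_derivatives(state, a=10.0, b=28.0, c=8.0 / 3.0):
    """Right-hand side of the Lorenz equations in the slide's notation."""
    x, y, z = state
    return (a * (y - x), x * (b - z) - y, x * y - c * z)


def rk4_step(f, state, dt):
    """Advance the state by one classical Runge-Kutta (RK4) step."""
    def shifted(s, k, h):
        return tuple(si + h * ki for si, ki in zip(s, k))

    k1 = f(state)
    k2 = f(shifted(state, k1, dt / 2.0))
    k3 = f(shifted(state, k2, dt / 2.0))
    k4 = f(shifted(state, k3, dt))
    return tuple(s + dt / 6.0 * (p + 2.0 * q + 2.0 * r + u)
                 for s, p, q, r, u in zip(state, k1, k2, k3, k4))


def simulate_lorenz(n_steps, dt=0.01, state=(1.0, 1.0, 1.0)):
    """Return the list of (x, y, z) states, including the initial one."""
    trajectory = [state]
    for _ in range(n_steps):
        state = rk4_step(lorenz_derivatives, state, dt)
        trajectory.append(state)
    return trajectory


trajectory = simulate_lorenz(1000)
x_series = [s[0] for s in trajectory]  # a single observed variable
print(x_series[:3])
```

Forecasting from `x_series` alone, without y and z, is the situation Takens' theorem addresses: only one variable of the system is observed.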

Datasets

• LASER: fluctuations in a Laser over time (used in Santa Fe competition)

[Figure: value vs. time for the LASER series]

Laser

[Figure: value vs. timesteps; comparison of prediction to correct values]

Conclusions

• Lag plots for non-linear forecasting (Takens’ theorem)

• suitable for ‘chaotic’ signals

References

• Deepay Chakrabarti and Christos Faloutsos: F4: Large-Scale Automated Forecasting using Fractals, CIKM 2002, Washington DC, Nov. 2002.

• Sauer, T. (1994). Time series prediction using delay coordinate embedding. (in book by Weigend and Gershenfeld, below) Addison-Wesley.

• Takens, F. (1981). Detecting strange attractors in fluid turbulence. Dynamical Systems and Turbulence. Berlin: Springer-Verlag.

• Weigend, A. S. and N. A. Gershenfeld (1994). Time Series Prediction: Forecasting the Future and Understanding the Past, Addison-Wesley. (Excellent collection of papers on chaotic/non-linear forecasting, describing the algorithms behind the winners of the Santa Fe competition.)

Overall conclusions

• Similarity search: Euclidean/time-warping; feature extraction and SAMs

• Linear Forecasting: AR (Box-Jenkins) methodology

• Non-linear forecasting: lag-plots (Takens)

Must-Read Material

• Byoung-Kee Yi, Nikolaos D. Sidiropoulos, Theodore Johnson, H.V. Jagadish, Christos Faloutsos and Alex Biliris, Online Data Mining for Co-Evolving Time Sequences, ICDE, Feb 2000.

• Chungmin Melvin Chen and Nick Roussopoulos, Adaptive Selectivity Estimation Using Query Feedbacks, SIGMOD 1994

Thanks

Deepay Chakrabarti (CMU)

Spiros Papadimitriou (CMU)

Prof. Byoung-Kee Yi (Pohang U.)

Prof. Dimitris Gunopulos (UCR)

Mengzhi Wang (CMU)

Time Series Visualization + Applications


Why Time Series Visualization?

Time series is the most common data type
• But why is time series so common?

How to build time series visualization?

Easy way: use existing tools, libraries

• Google Public Data Explorer (Gapminder) http://goo.gl/HmrH

• Google acquired Gapminder http://goo.gl/43avY (Hans Rosling’s TED talk http://goo.gl/tKV7)

• Google Annotated Time Line http://goo.gl/Upm5W

• Timeline, from MIT’s SIMILE project http://simile-widgets.org/timeline/

• Timeplot, also from SIMILE http://simile-widgets.org/timeplot/

• Excel, of course

How to build time series visualization?

The harder way:
• R (ggplot2)
• Matlab
• gnuplot
• ...

The even harder way:
• D3, for web
• JFreeChart (Java)
• ...

Time Series Visualization

Why is it useful?

When is visualization useful?

(Why not automate everything? Like using the forecasting techniques you learned last time.)

Time Series User Tasks

• When was something greatest/least?

• Is there a pattern?

• Are two series similar?

• Do any of the series match a pattern?

• Provide simpler, faster access to the series

• Does a data element exist at time t?

• When does a data element exist?

• How long does a data element exist?

• How often does a data element occur?

• How fast are data elements changing?

• In what order do data elements appear?

• Do data elements exist together?

Muller & Schumann 03 citing MacEachren 95

horizontal axis is time

http://www.patspapers.com/blog/item/what_if_everybody_flushed_at_once_Edmonton_water_gold_medal_hockey_game/


Gantt Chart Useful for project

How to create in Excel: http://www.youtube.com/watch?v=sA67g6zaKOE

ThemeRiver Stacked graph Streamgraph

http://www.nytimes.com/interactive/2008/02/23/movies/20080223_REVENUE_GRAPHIC.html

http://bl.ocks.org/mbostock/3943967

TimeSearcher support queries

http://hcil2.cs.umd.edu/video/2005/2005_timesearcher2.mpg

GeoTime http://www.youtube.com/watch?v=inkF86QJBdA