Data analysis03 timeasa-variable

http://publicationslist.org/junio Data Analysis Time as a variable: time-series analysis Prof. Dr. Jose Fernando Rodrigues Junior ICMC-USP


Revised version - spell checked.


Page 1: Data analysis03 timeasa-variable

Data Analysis
Time as a variable: time-series analysis

Prof. Dr. Jose Fernando Rodrigues Junior
ICMC-USP

Page 2: Data analysis03 timeasa-variable

What is it about?

Time series are an incredibly common kind of data:
• Stock market
• CPU utilization
• Meteorology: daily rainfall, wind speed, and temperature
• Sociology: crime figures, employment figures
• Software engineering: number of errors
• Networks: number of nodes and edges

Page 3: Data analysis03 timeasa-variable

First examples

Consider a data set with the concentration (ppm) of carbon dioxide (CO2) in the atmosphere, as measured by the observatory on Mauna Loa in Hawaii, recorded at monthly intervals since 1959.

The plot shows two common features in time series:
• Trend: a steady, long-term, nearly linear growth
• Seasonality: a regular periodic pattern, here with a 12-month cycle

Page 4: Data analysis03 timeasa-variable

First examples

Consider the data set with the price of long-distance phone calls in the US over the last century.

• The plot shows a strong nonlinear trend
• The single-log plot (inset) shows that the data fall on a nearly straight line, that is, they follow an exponential decay (linear on a logarithmic vertical axis), a usual behavior of growth/decay processes

Page 5: Data analysis03 timeasa-variable

First examples

This example asks for closer inspection:
• Has the long-distance call service changed over time?
• Were the prices adjusted for inflation?
• What explains the uncharacteristically low prices for a couple of years in the late 1970s? Did the breakup of the AT&T system have anything to do with it?

Page 6: Data analysis03 timeasa-variable

First examples

Consider the data set with the development of the Japanese stock market as represented by the Nikkei Stock Index over the last 40 years, shown with a 31-point Gaussian smoothing filter.

• The plot shows a change in behavior after 1990 (the big Japanese bubble), after which a long-term increasing trend turned into an oscillatory decreasing trend
• The seasonality also changed significantly after that point

Page 7: Data analysis03 timeasa-variable

First examples

Consider a data set with the number of daily calls placed to a call center over a period slightly longer than two years.

• This example is far more challenging because of its complex structure
• It is not clear whether the high-frequency variation in the plot is noise or has some form of regularity
• In an initial analysis, not many conclusions can be drawn from the plot: apparently there is no trend, no seasonality, and no change in behavior

Page 8: Data analysis03 timeasa-variable

First examples

As time series commonly rely on long-term data, it is important to verify that the data acquisition was homogeneous throughout the period; otherwise the series may change its behavior in ways that are hard to make sense of.

Page 9: Data analysis03 timeasa-variable

Main components

As we have seen, the main components observed are:
• Trend: linear or nonlinear, with a characteristic magnitude
• Seasonality: additive, for example, every 12 months the sales increase by 3 million; or multiplicative, for example, every 12 months the sales increase by 1.4 times what was observed in the last cycle
• Noise: some form of random variation, quite common
• Other: change in behavior, special outliers, missing data, and anything remarkable

Page 10: Data analysis03 timeasa-variable

Assumptions

Standard methods of time-series analysis make a number of assumptions, which are commonly violated in real-world scenarios:
• Data points have been taken at equally spaced time steps, with no missing data points. When violated, this demands interpolation (for missing points) or re-sampling (for insufficient sampling)
• The time series is sufficiently long (at least 50 points). When violated, this requires smoothing methods to define a continuous curve, even where there are no points
• The series is stationary: it has no trend, no seasonality, and the character (amplitude and frequency) of any noise does not change with time. When violated, this may require breaking the series into multiple segments to be analyzed separately
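The first assumption can be illustrated with a minimal sketch of linear interpolation over missing points. The function name `fill_missing` and the use of `None` to mark a missing observation are assumptions made for this example only; real data sets may mark gaps differently and may call for other interpolation schemes.

```python
def fill_missing(values):
    """Linearly interpolate interior gaps (marked as None) in an equally spaced series.

    Leading or trailing gaps are left as None, since there is no neighbor
    on one side to interpolate from.
    """
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    # Walk over consecutive pairs of known points and fill what lies between them.
    for left, right in zip(known, known[1:]):
        gap = right - left
        for i in range(left + 1, right):
            fraction = (i - left) / gap
            filled[i] = filled[left] + fraction * (filled[right] - filled[left])
    return filled


# Synthetic series with three missing observations.
series = [10.0, None, 14.0, None, None, 20.0]
print(fill_missing(series))
```

Linear interpolation is only the simplest choice; it keeps the series equally spaced so that the smoothing and correlation methods discussed later can be applied.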

Page 11: Data analysis03 timeasa-variable

Smoothing

Just as with two-variable data, it is useful to fit a curve to the available data (actually, a time series is a special case of two-variable data).

Smoothing helps in:
• Reducing noise
• Interpolating missing/insufficient values

Page 12: Data analysis03 timeasa-variable

Running averages

The method known as running (moving, rolling, or floating) average is straightforward: for any odd number of consecutive points, replace the centermost value with the average of all the points in the window, the centermost one included.

The smoothed point si is given by:

$$s_i = \frac{1}{2k+1} \sum_{j=-k}^{k} x_{i+j}$$

where xi are the data points and 2k+1 is the number of points in the window.

For example, for a 5-point (k=2) moving average, consider point x10 = 4, and points x8 = 4, x9 = 7, x11 = 2, x12 = 9, so s10 = 1/5*(4+7+4+2+9) = 1/5*26 = 5.2

And so forth for every point.
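The running average can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `running_average` is hypothetical, and leaving the first and last k points as `None` (because they lack a full window) is one possible edge policy among several.

```python
def running_average(x, k):
    """Centered (2k+1)-point running average.

    Points closer than k to either end have no full window and are left as None.
    """
    n = len(x)
    width = 2 * k + 1
    smoothed = [None] * n
    for i in range(k, n - k):
        # Average of the window x[i-k] .. x[i+k], center point included.
        smoothed[i] = sum(x[i - k:i + k + 1]) / width
    return smoothed


# The five values x8..x12 from the slide example; the center value is x10.
window = [4, 7, 4, 2, 9]
print(running_average(window, k=2)[2])  # 5.2
```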

Page 13: Data analysis03 timeasa-variable

Weighted running averages

Running averages do not work well in the presence of outliers, which may distort the curve.

The weighted running average technique lessens this problem by using weights that give more importance to points at the center of the moving window:

$$s_i = \sum_{j=-k}^{k} w_j \, x_{i+j}$$

• The weights wj can be defined manually; for instance, for a 5-point window they could be (1/9, 2/9, 1/3, 2/9, 1/9)
• Or they can be defined by a function, in which case the Gaussian is the first choice

Page 14: Data analysis03 timeasa-variable

Weighted running averages

In either case, the weights must peak at the center, drop toward the edges, and add up to 1.
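A minimal sketch of the weighted running average follows, using both the manual 5-point weights from the slide and Gaussian weights. The function names and the `sigma` parameter (the width of the Gaussian, measured in time steps) are assumptions introduced for the example; the slides do not prescribe a particular width.

```python
import math


def gaussian_weights(k, sigma):
    """Weights for a (2k+1)-point window: peaked at the center, normalized to sum to 1."""
    raw = [math.exp(-0.5 * (j / sigma) ** 2) for j in range(-k, k + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_running_average(x, weights):
    """Centered weighted average; points without a full window are left as None."""
    k = len(weights) // 2
    n = len(x)
    smoothed = [None] * n
    for i in range(k, n - k):
        window = x[i - k:i + k + 1]
        smoothed[i] = sum(w * v for w, v in zip(weights, window))
    return smoothed


manual = [1 / 9, 2 / 9, 1 / 3, 2 / 9, 1 / 9]  # weights from the slide
data = [1.0, 1.0, 1.0, 1.0, 10.0, 1.0, 1.0, 1.0, 1.0]  # synthetic, one spike at index 4

print(weighted_running_average(data, manual))
print(weighted_running_average(data, gaussian_weights(k=2, sigma=1.0)))
```

With these weights the spike enters and leaves the window gradually, so the smoothed curve shows a rounded peak rather than a flat plateau that starts and ends abruptly.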

Page 15: Data analysis03 timeasa-variable

Running averages

For example: consider synthetic data (filled line) and an 11-point moving average.

• The plot shows that the simple technique could reasonably represent the data, but whenever an outlier (spike) appears, the curve is abruptly distorted until the outlier leaves the window
• The weighted version of the technique presented better results: instead of abrupt distortions, it shows smoothed peaks that point out the original outliers

Page 16: Data analysis03 timeasa-variable

Single exponential smoothing

Running averages are intrinsically local and may not capture the global behavior of the series.

An improved method is exponential smoothing, which, in its single form, starts from a simple recursive definition:

$$s_i = \alpha x_i + (1 - \alpha) s_{i-1}$$

with $0 \le \alpha \le 1$, and $s_0 = x_0$, or $s_0 = \frac{1}{n}\sum_{i=0}^{n-1} x_i$ for n initial values.

That is, the i-th smoothed point is a mix between the actual point xi and the previous smoothed point si-1, where α can be chosen by trial and error.

By mathematical induction, this recursion leads to the exponential expression:

$$s_i = \alpha \sum_{j=0}^{i-1} (1 - \alpha)^j x_{i-j} + (1 - \alpha)^i s_0$$

which gives any smoothed value si as a function of all the previous values of x, with weights that decay exponentially with their age.
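The recursion and its unrolled form can be checked against each other with a short sketch. The function names are hypothetical, and the initialization s0 = x0 is assumed (the first of the two options mentioned above).

```python
def exponential_smoothing(x, alpha):
    """Recursive form: s_0 = x_0, then s_i = alpha*x_i + (1 - alpha)*s_{i-1}."""
    s = [x[0]]
    for value in x[1:]:
        s.append(alpha * value + (1 - alpha) * s[-1])
    return s


def closed_form(x, alpha, i):
    """Unrolled recursion: exponentially decaying weights on x_i .. x_1, plus the s_0 term."""
    total = sum(alpha * (1 - alpha) ** j * x[i - j] for j in range(i))
    return total + (1 - alpha) ** i * x[0]


x = [4, 7, 4, 2, 9]
s = exponential_smoothing(x, alpha=0.5)
print(s)  # [4, 5.5, 4.75, 3.375, 6.1875]
print(closed_form(x, 0.5, 4))  # same as s[4]
```

A larger α follows the data more closely; a smaller α gives a smoother curve that reacts more slowly.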

Page 17: Data analysis03 timeasa-variable

Single exponential smoothing

Single exponential smoothing provides good smoothing curves and, in some cases, forecasting.

It is limited, though, for series that present trend or seasonality, situations in which the technique cannot be accurately used for prediction.

There are two more advanced exponential smoothing techniques:
• Double exponential smoothing, for series with trend but without seasonality
• Triple exponential smoothing, for series with trend and seasonality; this technique is called the Holt–Winters method

The Holt–Winters method is a powerful technique able to reproduce the full behavior of additive or multiplicative time series.

Page 18: Data analysis03 timeasa-variable

Double and triple exponential smoothing

[Slide equations not preserved in this transcript. Panels: double exponential smoothing (trend factor, forecasting); additive triple exponential smoothing (trend factor, seasonality factor, forecasting); multiplicative triple exponential smoothing (trend factor, seasonality factor)]

Page 19: Data analysis03 timeasa-variable

Double and triple exponential smoothing

Exponential smoothing depends on mixing parameters, which are required by software packages:
• Single exponential smoothing: α
• Double exponential smoothing: α, β
• Triple exponential smoothing: α, β, γ

More on time-series analysis: http://www.statsoft.com/textbook/time-series-analysis/
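As an illustration of how the mixing parameters α and β enter the computation, here is a minimal sketch of double exponential smoothing. It assumes the common textbook formulation (Holt's linear method), which may differ in detail from the equations shown on the original slides; the function names and the initialization s0 = x0, b0 = x1 - x0 are also assumptions.

```python
def double_exponential_smoothing(x, alpha, beta):
    """Assumed formulation (Holt's linear method):

    s_i = alpha*x_i + (1 - alpha)*(s_{i-1} + b_{i-1})   # smoothed level
    b_i = beta*(s_i - s_{i-1}) + (1 - beta)*b_{i-1}     # smoothed trend factor
    with s_0 = x_0 and b_0 = x_1 - x_0.
    """
    level, trend = x[0], x[1] - x[0]
    levels, trends = [level], [trend]
    for value in x[1:]:
        previous_level = level
        level = alpha * value + (1 - alpha) * (level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
        levels.append(level)
        trends.append(trend)
    return levels, trends


def forecast(levels, trends, h):
    """Forecast h steps past the last observation: last level plus h times last trend."""
    return levels[-1] + h * trends[-1]


# Synthetic series with a pure linear trend: x_i = 2 + 3*i.
x = [2.0 + 3.0 * i for i in range(10)]
levels, trends = double_exponential_smoothing(x, alpha=0.5, beta=0.3)
print(forecast(levels, trends, h=4))  # close to 2 + 3*13 = 41
```

On a perfectly linear series this formulation reproduces the data and extrapolates the line; triple exponential smoothing adds a third recursion, weighted by γ, for the seasonality factor.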

Page 20: Data analysis03 timeasa-variable

Triple exponential smoothing

For example, consider the additive Holt–Winters plot for a data set with the number of US monthly international flight passengers.

• The years 1949 through 1957 were used to "train" the algorithm, and the years 1958 through 1960 were forecast
• Note how well the forecast agrees with the actual data

Page 21: Data analysis03 timeasa-variable

Autocorrelation and correlogram

• As mentioned, time series are mainly characterized by trends and seasonality
• Trend is analyzed by means of smoothing, function fitting (modeling), and plotting
• Seasonality can be analyzed with two related techniques: autocorrelation and the correlogram

Page 22: Data analysis03 timeasa-variable

Autocorrelation and correlogram

The correlation between two time series is obtained as follows:
1. For each point xi, take the response values (yi) of the two series as deviations from their respective means, and multiply them
2. Sum up all the products
3. Normalize

The correlation of two identical series is 1, and it is -1 for series that are exact inversions of each other.
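The three steps can be sketched as follows. The slide only says "normalize"; this example assumes the usual choice of dividing by the square root of the product of the two sums of squared deviations, which is what makes identical series yield exactly 1. The function name `correlation` is hypothetical.

```python
import math


def correlation(a, b):
    """Correlation of two equally long series, following the three steps above."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Steps 1 and 2: multiply the deviations from the mean and sum the products.
    products = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    # Step 3 (assumed normalization): scale so that the result lies in [-1, 1].
    norm = math.sqrt(
        sum((u - mean_a) ** 2 for u in a) * sum((v - mean_b) ** 2 for v in b)
    )
    return products / norm


series = [3.0, 5.0, 4.0, 8.0, 6.0]  # synthetic values
inverted = [-v for v in series]
print(correlation(series, series))    # close to 1
print(correlation(series, inverted))  # close to -1
```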

Page 23: Data analysis03 timeasa-variable

Autocorrelation and correlogram

Seasonality:
• Formally defined as the correlation between each i-th element and the (i+k)-th element; k is usually called the lag
• Measured by the Autocorrelation Function (ACF), i.e., the correlation between the two terms xi and xi+k

If the measurement error is not too large, seasonality can be visually identified as a pattern that repeats every k moments in time.

Page 24: Data analysis03 timeasa-variable

Autocorrelation and correlogram

If seasonality is present, then the behavior of the series should repeat every k time units, where k is named the lag.

The problem, hence, is: how to identify analytically the lag of the series?
• The answer is: compare the time series with itself, shifted by increasing values (lags) of k; for each value, calculate the correlation

Hence, the autocorrelation of a given series at lag k is given by

$$c(k) = \frac{\sum_{i=1}^{N-k} (x_i - \mu)(x_{i+k} - \mu)}{\sum_{i=1}^{N} (x_i - \mu)^2}, \qquad \mu = \frac{1}{N} \sum_{i=1}^{N} x_i$$

where the denominator is the normalization according to lag 0, that is, to the correlation of the series with itself.

Page 25: Data analysis03 timeasa-variable

Autocorrelation and correlogram

Autocorrelation basic algorithm:
1. Let k = 0
2. Start with two copies of the series (original and copy)
3. Subtract the mean from all values in both series
4. Multiply the values at corresponding time steps with each other
5. Sum up the results for all time steps
6. Normalize by the lag-0 sum, which is proportional to the variance of the original series; this is the correlation for lag k, that is, c(k)
7. Shift the copy by 1 time step
8. Let k ← k+1
9. Continue from step 4 while k < kmax

Page 26: Data analysis03 timeasa-variable

Autocorrelation and correlogram

According to this algorithm:
• Initially (lag 0), the two signals are perfectly aligned and the correlation is 1
• Then, as we shift the signals, they slowly move out of phase and the correlation drops

How quickly it drops tells us how much "memory" there is in the data:
• If quickly, we know that, after a few steps, the signal has lost all memory of its recent past
• If slowly, then we know that we are dealing with a process that is relatively steady over longer periods of time
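The basic autocorrelation algorithm can be sketched compactly; shifting the copy by k steps is expressed here by indexing `centered[i + k]`. The function name `autocorrelation`, the parameter `max_lag`, and the synthetic 7-step periodic signal are assumptions made for the example.

```python
import math


def autocorrelation(x, max_lag):
    """Return [c(0), c(1), ..., c(max_lag)] for the series x."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]           # subtract the mean
    norm = sum(d * d for d in centered)        # lag-0 sum, used for normalization
    return [
        # Multiply the series with its copy shifted by k steps and sum the products.
        sum(centered[i] * centered[i + k] for i in range(n - k)) / norm
        for k in range(max_lag + 1)
    ]


# Synthetic signal that repeats every 7 time steps (for example, a weekly pattern).
weekly = [math.sin(2 * math.pi * t / 7) for t in range(70)]
c = autocorrelation(weekly, max_lag=14)
for lag, value in enumerate(c):
    print(lag, round(value, 3))
```

Plotting these values against the lag gives the correlogram; for this signal the values peak again at lags 7 and 14, which is how a seasonal lag shows up.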

Page 27: Data analysis03 timeasa-variable

Autocorrelation and correlogram

The correlogram is the plot of correlation against lag for a given time series.

For example: consider the data set with the number of daily calls placed to a call center over a period slightly longer than two years, as presented earlier.

[Figures: the time series and its (auto)correlogram; horizontal axis of the correlogram: 0 <= lag <= 500]

Page 28: Data analysis03 timeasa-variable

Autocorrelation and correlogram

From the correlogram we can observe that:
• The series has a long "memory" (long cycles): it takes the correlation almost 100 days to fall to zero, indicating that the frequency of calls changes more or less once per quarter but not more frequently
• There is a pronounced secondary peak at a lag of 365 days: the call center data is highly seasonal and repeats its behavior on a yearly basis (high correlation)
• There is a small but regular sawtooth structure; if we look closely, we will find that the first peak of the sawtooth is at a lag of 7 days and that all the following ones occur at multiples of 7. This is the signature of the high-frequency component that we see in the plot of the series; that is, the traffic to the call center exhibits a secondary seasonal component with 7-day periodicity: the traffic depends on the day of the week

Page 29: Data analysis03 timeasa-variable


Example

Page 30: Data analysis03 timeasa-variable

CO2 measurements above Mauna Loa in Hawaii

Consider again the data set with the concentration (ppm) of carbon dioxide (CO2) in the atmosphere, as measured by the observatory on Mauna Loa in Hawaii, recorded at monthly intervals since 1959.

Page 31: Data analysis03 timeasa-variable

CO2 measurements above Mauna Loa in Hawaii

The series is easier to analyze numerically if the horizontal axis is expressed as an incremental monthly index, and if the graph goes through the origin (vertical translation of -315).

This can be achieved in Gnuplot with:

plot "data" using 0:($2-315) with lines

Page 32: Data analysis03 timeasa-variable

CO2 measurements above Mauna Loa in Hawaii

The series has a trend that seems to be a power law of the form b(x/a)^k, with k bigger than 1 because the curve is convex downward. A first guess is k=2, b=35, and a=350 (read from the upper rightmost part of the series).

This can be achieved in Gnuplot with:

plot "data" using 0:($2-315) with lines, 35*(x/350)**2

Page 33: Data analysis03 timeasa-variable

CO2 measurements above Mauna Loa in Hawaii

By trial and error, a better guess for k is 1.35.

This can be achieved in Gnuplot with:

plot "data" using 0:($2-315) with lines, 35*(x/350)**1.35

Page 34: Data analysis03 timeasa-variable

CO2 measurements above Mauna Loa in Hawaii

To verify the accuracy of the model function, we can plot the residual by subtracting the trend from the data.

This can be achieved in Gnuplot with:

plot "data" using 0:($2-315 - 35*($0/350)**1.35) with lines

Page 35: Data analysis03 timeasa-variable

CO2 measurements above Mauna Loa in Hawaii

The model seems fine except for the seasonality, which consists of regular oscillations that can be captured by a sine, as the series starts at (0,0). The series is monthly-based with a cycle of one year, so a guess is that the data repeats every 12 points; the amplitude is around 3, as we can observe in the former plots.

We can compare the residual and our seasonality model in Gnuplot with:

# trend model found so far, including the vertical offset of 315
f(x) = 315 + 35*(x/350)**1.35
plot "data" using 0:($2-f($0)) with lines, 3*sin(2*pi*x/12) with lines

Page 36: Data analysis03 timeasa-variable

CO2 measurements above Mauna Loa in Hawaii

At this point the model is given by the power-law function plus the sine function:

f(x) = 315 + 35*(x/350)**1.35 + 3*sin(2*pi*x/12)
plot "data" using 0:2 with lines, f(x)

which is pretty close to the actual phenomenon.

Page 37: Data analysis03 timeasa-variable

CO2 measurements above Mauna Loa in Hawaii

With the final model, it becomes possible to predict future values for the series.
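As a rough sketch of such a prediction, the final model can be evaluated at any monthly index, including indexes beyond the recorded data. The function name `co2_model` and the sample index 600 are hypothetical; the coefficients are the ones guessed by trial and error in the previous slides, so the values are only as good as that fit.

```python
import math


def co2_model(month_index):
    """Slide model: 315 + 35*(x/350)**1.35 + 3*sin(2*pi*x/12).

    month_index counts months from the start of the series (index 0).
    """
    trend = 35 * (month_index / 350) ** 1.35
    seasonality = 3 * math.sin(2 * math.pi * month_index / 12)
    return 315 + trend + seasonality


# Index 0 is the start of the series; 600 is an arbitrary index used for illustration.
for month_index in (0, 350, 600):
    print(month_index, round(co2_model(month_index), 2))
```

Extrapolating far beyond the fitted range assumes that both the power-law trend and the 12-month cycle keep their shape, which should be checked against new data as it becomes available.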

Page 38: Data analysis03 timeasa-variable

References

• Philipp K. Janert, Data Analysis with Open Source Tools, O'Reilly, 2010.
• Wikipedia, http://en.wikipedia.org
• Wolfram MathWorld, http://mathworld.wolfram.com/