Data analysis03 timeasa-variable

http://publicationslist.org/junio

Data Analysis
Time as a variable: time-series analysis

Prof. Dr. Jose Fernando Rodrigues Junior
ICMC-USP


What is it about?
Time series are an incredibly common kind of data:
• Stock market
• CPU utilization
• Meteorology: daily rainfall, wind speed, and temperature
• Sociology: crime figures, employment figures
• Software engineering: number of errors
• Networks: number of nodes and edges


First examples
Consider a data set with the concentration (ppm) of carbon dioxide (CO2) in the atmosphere, as measured by the observatory on Mauna Loa in Hawaii, recorded at monthly intervals since 1959.

The plot shows two common features in time series:
• Trend: a steady, long-term linear growth
• Seasonality: a regular periodic pattern with a 12-month cycle


First examples
Consider the data set with the price of long-distance phone calls in the US over the last century.

The plot shows a strong nonlinear trend.

The semi-log plot (inset) shows the data falling close to a straight line, which indicates exponential decay, a usual behavior of growth/decay processes.

This example asks for closer inspection:
• Has the long-distance call service changed over time?
• Were the prices adjusted for inflation?
• What explains the uncharacteristically low prices for a couple of years in the late 1970s? Did the breakup of the AT&T system have anything to do with it?


First examples
Consider the data set with the development of the Japanese stock market, as represented by the Nikkei Stock Index over the last 40 years, shown with a 31-point Gaussian smoothing filter.

The plot shows a change in behavior after 1990 (the big Japanese bubble), after which a long-term increasing trend turned into an oscillatory decreasing trend.

The seasonality also changed significantly after that point.


First examples
Consider a data set with the number of daily calls placed to a call center over a period slightly longer than two years.

This example is far more challenging because of its complex structure.

It is not even clear whether the high-frequency variation in the plot is noise or has some form of regularity.

In an initial analysis, not many conclusions can be drawn from the plot: apparently, there is no trend, no seasonality, and no change in behavior.

Because time series commonly span long periods, it is important to verify that the data acquisition was homogeneous throughout the period; otherwise, the series may change its behavior in ways that are hard to make sense of.


Main components
As we have seen, the main components observed are:
• Trend: linear or nonlinear, with a characteristic magnitude
• Seasonality: additive (for example, every 12 months the sales increase by 3 million) or multiplicative (for example, every 12 months the sales increase by 1.4 times what was observed in the last cycle)
• Noise: some form of random variation, quite common
• Other: change in behavior, special outliers, missing data, and anything else remarkable


Assumptions
Standard methods of time-series analysis make a number of assumptions, all of which are often violated in real-world scenarios:
• Data points have been taken at equally spaced time steps, with no missing data points: otherwise, interpolation is needed for missing points, or resampling for insufficient sampling
• The time series is sufficiently long (at least 50 points): otherwise, smoothing methods are required to define a continuous curve, even where there are no points
• The series is stationary, that is, it has no trend, no seasonality, and the character (amplitude and frequency) of any noise does not change with time: otherwise, the series may have to be broken into multiple segments to be analyzed separately


Smoothing
Just as with two-variable data, it is useful to fit a curve to the available data (a time series is, in fact, a special case of two-variable data).

Smoothing helps in:
• Reducing noise
• Interpolating missing/insufficient values


Running averages
The method known as running (moving, rolling, or floating) average is straightforward: for any odd number 2k+1 of consecutive points, replace the centermost value with the average of all the points in the window, the centermost one included.

The smoothed point s_i is given by:

$$s_i = \frac{1}{2k+1} \sum_{j=-k}^{k} x_{i+j}$$

where x_i are the data points.

For example, for a 5-point (k=2) moving average, consider point x_10 = 4 and its neighbors x_8 = 4, x_9 = 7, x_11 = 2, x_12 = 9, so s_10 = 1/5*(4+7+4+2+9) = 1/5*26 = 5.2

And so forth for every point.
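The running average above can be sketched in a few lines of Python. This is only an illustration: the function name `moving_average` and the choice to leave the first and last k positions as `None` (where the window does not fit) are assumptions of this sketch, not something prescribed by the slides.

```python
def moving_average(x, k):
    """Centered (2k+1)-point moving average.

    Returns a list with the same length as x. The first and last k
    positions, where the window does not fit, are set to None
    (one possible edge treatment among several).
    """
    n = len(x)
    width = 2 * k + 1
    smoothed = [None] * n
    for i in range(k, n - k):
        window = x[i - k : i + k + 1]   # x[i-k] ... x[i+k], center included
        smoothed[i] = sum(window) / width
    return smoothed


# Worked example from the text: x8..x12 = 4, 7, 4, 2, 9
values = [4, 7, 4, 2, 9]
print(moving_average(values, k=2)[2])  # 5.2
```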


Weighted running averages
Running averages do not work well in the presence of outliers, which may distort the curve.

The weighted running average technique lessens this problem by using weights that give more importance to points at the center of the moving window.

The weights w_j can be defined manually; for instance, for a 5-point window they could be (1/9, 2/9, 1/3, 2/9, 1/9).

Or they can be defined by a function; in this case, the Gaussian is the first choice.

In either case, the weights must be peaked at the center, drop toward the edges, and add up to 1.
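A minimal Python sketch of the weighted version follows. The helper names `gaussian_weights` and `weighted_moving_average`, the width parameter `sigma`, and the spike series are assumptions made for illustration; the hand-picked weights are the ones quoted above.

```python
import math


def gaussian_weights(k, sigma):
    """Return 2k+1 Gaussian weights, peaked at the center and summing to 1."""
    raw = [math.exp(-0.5 * (j / sigma) ** 2) for j in range(-k, k + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_moving_average(x, weights):
    """Centered weighted moving average; len(weights) must be odd.

    Positions where the window does not fit are left as None.
    """
    k = len(weights) // 2
    smoothed = [None] * len(x)
    for i in range(k, len(x) - k):
        # weights[j] multiplies x[i - k + j], so weights[k] hits the center
        smoothed[i] = sum(w * x[i + j - k] for j, w in enumerate(weights))
    return smoothed


manual = [1/9, 2/9, 1/3, 2/9, 1/9]   # hand-picked weights from the text
series = [4, 4, 4, 40, 4, 4, 4]      # made-up flat series with one spike
print(weighted_moving_average(series, manual))
print(weighted_moving_average(series, gaussian_weights(k=2, sigma=1.0)))
```

With either set of weights the spike is spread into a smooth peak centered on the outlier, instead of a flat plateau that starts and ends abruptly.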


Running averages
For example, consider synthetic data (filled line) smoothed with an 11-point moving average.

The plot shows that the simple technique represents the data reasonably well, but whenever an outlier (spike) appears, the curve is abruptly distorted until the outlier leaves the window.

The weighted version of the technique gives better results: instead of abrupt distortions, it shows smoothed peaks that point out the original outliers.


Single exponential smoothing
Running averages are intrinsically local and may not capture the global behavior of the series.

An improved method is exponential smoothing, which, in its single form, starts from a simple recursive definition:

$$s_i = \alpha x_i + (1 - \alpha) s_{i-1}$$

with $0 \le \alpha \le 1$, and $s_0 = x_0$ or $s_0 = \frac{1}{n} \sum_{i=0}^{n-1} x_i$, for n initial values.

That is, the i-th smoothed point is a mix between the actual point x_i and the previous smoothed point s_{i-1}, where α can be chosen by trial and error.

By mathematical induction, this recursion leads to the exponential expression:

$$s_i = \alpha \sum_{j=0}^{i} (1 - \alpha)^j x_{i-j}$$

which gives any smoothed value s_i as a function of all the previous values x, with weights that decay exponentially with their distance from i.
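The recursion can be sketched directly in Python. The function name and the sample values are illustrative assumptions; the initialization s_0 = x_0 is the first of the two options mentioned above.

```python
def single_exponential_smoothing(x, alpha):
    """Apply s_i = alpha * x_i + (1 - alpha) * s_{i-1}, with s_0 = x_0."""
    if not 0 <= alpha <= 1:
        raise ValueError("alpha must be in [0, 1]")
    smoothed = [x[0]]
    for value in x[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


print(single_exponential_smoothing([10, 20, 20, 20], alpha=0.5))
# [10, 15.0, 17.5, 18.75]
```

A value of alpha close to 1 follows the data closely, while a value close to 0 gives a heavily smoothed curve that reacts slowly to changes.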


Single exponential smoothing
Single exponential smoothing provides good smoothing curves and, in some cases, forecasts.

It is limited, though, for series that present trend or seasonality, situations in which the technique cannot be used for accurate prediction.

There are two more advanced exponential smoothing techniques:
• Double exponential smoothing, for series with trend but without seasonality
• Triple exponential smoothing, for series with trend and seasonality; this technique is called the Holt–Winters method

The Holt–Winters method is a powerful technique, able to reproduce the full behavior of additive or multiplicative time series.


Double and triple exponential smoothing
[Equations: double exponential smoothing, additive triple exponential smoothing, and multiplicative triple exponential smoothing, annotated with their trend factor, seasonality factor, and forecasting terms]
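A common textbook form of the three sets of update equations named above is sketched below. The notation is an assumption of this sketch and may differ from the one on the original slide: s_i is the smoothed level, t_i the trend factor, p_i the seasonality factor, k the length of the seasonal period, and h the number of steps ahead being forecast (with 1 ≤ h ≤ k in the seasonal cases).

```latex
% Double exponential smoothing
\begin{align*}
s_i &= \alpha x_i + (1-\alpha)(s_{i-1} + t_{i-1}) \\
t_i &= \beta (s_i - s_{i-1}) + (1-\beta)\, t_{i-1} && \text{(trend factor)} \\
x_{i+h} &\approx s_i + h\, t_i && \text{(forecasting)}
\end{align*}

% Additive triple exponential smoothing
\begin{align*}
s_i &= \alpha (x_i - p_{i-k}) + (1-\alpha)(s_{i-1} + t_{i-1}) \\
t_i &= \beta (s_i - s_{i-1}) + (1-\beta)\, t_{i-1} && \text{(trend factor)} \\
p_i &= \gamma (x_i - s_i) + (1-\gamma)\, p_{i-k} && \text{(seasonality factor)} \\
x_{i+h} &\approx s_i + h\, t_i + p_{i-k+h} && \text{(forecasting)}
\end{align*}

% Multiplicative triple exponential smoothing
\begin{align*}
s_i &= \alpha \frac{x_i}{p_{i-k}} + (1-\alpha)(s_{i-1} + t_{i-1}) \\
t_i &= \beta (s_i - s_{i-1}) + (1-\beta)\, t_{i-1} && \text{(trend factor)} \\
p_i &= \gamma \frac{x_i}{s_i} + (1-\gamma)\, p_{i-k} && \text{(seasonality factor)} \\
x_{i+h} &\approx (s_i + h\, t_i)\, p_{i-k+h} && \text{(forecasting)}
\end{align*}
```

Setting the trend and seasonality factors aside (t_i = 0 and, respectively, p_i = 0 or p_i = 1) reduces each set to the single exponential smoothing recursion s_i = αx_i + (1 − α)s_{i-1}.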

Exponential smoothing depends on mixing parameters, which are required by software packages:
• Single exponential smoothing: α
• Double exponential smoothing: α, β
• Triple exponential smoothing: α, β, γ

More on time-series analysis: http://www.statsoft.com/textbook/time-series-analysis/


Triple exponential smoothing
For example, consider the additive Holt–Winters plot for a data set with the monthly number of US international flight passengers.

The years 1949 through 1957 were used to “train” the algorithm, and the years 1958 through 1960 were forecast.

Note how well the forecast agrees with the actual data.
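The train-then-forecast procedure can be sketched in Python for the additive case. This is a minimal illustration, not the code behind the plot: the function name, the initialization scheme (level and seasonal terms from the first season, trend from the first two seasons), and the synthetic data are all assumptions of this sketch; software packages typically offer more careful initialization and parameter fitting.

```python
def holt_winters_additive(x, period, alpha, beta, gamma, horizon):
    """Additive triple exponential smoothing; returns `horizon` forecasts.

    Assumed initialization: the level is the mean of the first season,
    the trend is the average per-step change between the first two
    seasons, and the seasonal terms are the first season's deviations
    from the initial level.
    """
    if len(x) < 2 * period:
        raise ValueError("need at least two full seasons of data")
    first = x[:period]
    second = x[period:2 * period]
    level = sum(first) / period
    trend = (sum(second) - sum(first)) / period ** 2
    seasonal = [value - level for value in first]

    # "Training": run the update equations over the remaining data
    for i in range(period, len(x)):
        prev_level = level
        p_old = seasonal[i - period]
        level = alpha * (x[i] - p_old) + (1 - alpha) * (prev_level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        seasonal.append(gamma * (x[i] - level) + (1 - gamma) * p_old)

    n = len(x)
    # h steps ahead: last level, h times the trend, and the most recent
    # seasonal term for the corresponding phase of the cycle
    return [
        level + h * trend + seasonal[n - period + (h - 1) % period]
        for h in range(1, horizon + 1)
    ]


# Made-up monthly series: upward trend plus a repeating 12-month pattern
pattern = [0, 2, 5, 7, 9, 12, 14, 12, 8, 4, 1, -2]
data = [100 + 0.5 * i + pattern[i % 12] for i in range(108)]  # 9 "years"
forecast = holt_winters_additive(data, 12, 0.5, 0.1, 0.3, horizon=36)
print([round(v, 1) for v in forecast[:12]])
```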


Autocorrelation and correlogram

As mentioned, time series are mainly characterized by trend and seasonality.

Trend is analyzed by means of smoothing, function fitting (modeling), and plotting.

Seasonality can be analyzed with techniques such as correlation and the correlogram.



The correlation between two time series is obtained as follows:
• For each point x_i, multiply the response values (y_i) of the two series, each taken as its deviation from the mean of its series
• Sum up all the products
• Normalize

The correlation of two identical series is 1, and it is -1 for series that are exactly inverted in relation to each other.
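The three steps can be sketched in Python as follows. The text only says "normalize"; this sketch assumes the usual choice of dividing by the square root of the product of the two sums of squared deviations, which is what makes identical series give 1 and inverted series give -1. The function name and sample values are illustrative.

```python
import math


def correlation(a, b):
    """Correlation of two equally long series, following the steps above."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    dev_a = [v - mean_a for v in a]          # deviations from the mean
    dev_b = [v - mean_b for v in b]
    products = sum(da * db for da, db in zip(dev_a, dev_b))
    # Assumed normalization; undefined (division by zero) for a constant series
    norm = math.sqrt(sum(d * d for d in dev_a) * sum(d * d for d in dev_b))
    return products / norm


series = [1, 2, 4, 7, 3]
print(correlation(series, series))                 # identical series
print(correlation(series, [-v for v in series]))   # exactly inverted series
```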



Seasonality:
• Formally defined as the correlation between each i-th element and the (i+k)-th element, where k is usually called the lag
• Measured by the autocorrelation function (ACF), i.e., the correlation between the terms x_i and x_{i+k}

If the measurement error is not too large, seasonality can be visually identified as a pattern that repeats every k time steps.



If seasonality is present, then the behavior of the series should repeat every k time units, where k is called the lag.

The problem, hence, is: how can the lag of the series be identified analytically?

The answer is: compare the time series with itself, shifted by increasing values (lags) of k; for each value, calculate the correlation.

Hence, the autocorrelation of a given series at lag k is given by

$$c(k) = \frac{\sum_{i=1}^{N-k} (x_i - \mu)(x_{i+k} - \mu)}{\sum_{i=1}^{N} (x_i - \mu)^2}$$

where N is the number of points and μ is the mean of the series. The denominator normalizes according to lag 0, that is, to the correlation of the series with itself.



Autocorrelation basic algorithm:
1. Let k = 0
2. Start with two copies of the series (original and copy)
3. Subtract the mean from all values in both series
4. Multiply the values at corresponding time steps with each other
5. Sum up the results for all time steps
6. Normalize by the sum obtained at lag 0 (proportional to the variance of the original series); this is the correlation for lag k, that is, c(k)
7. Shift the copy by 1 time step
8. Let k ← k+1
9. Repeat from step 4 while k < kmax

According to this algorithm:
• Initially (lag 0), the two signals are perfectly aligned and the correlation is 1
• Then, as we shift the signals, they slowly move out of phase and the correlation drops

How quickly it drops tells us how much “memory” there is in the data:
• If quickly, we know that, after a few steps, the signal has lost all memory of its recent past
• If slowly, we know that we are dealing with a process that is relatively steady over longer periods of time
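The basic algorithm can be sketched in Python as below. Instead of physically shifting a copy, the sketch indexes the same list at positions i and i+k, which is equivalent; the function name and the `max_lag` parameter (playing the role of kmax) are assumptions of this illustration.

```python
def autocorrelation(x, max_lag):
    """Return [c(0), c(1), ..., c(max_lag)] for the series x."""
    n = len(x)
    if not 0 <= max_lag < n:
        raise ValueError("max_lag must be between 0 and len(x) - 1")
    mean = sum(x) / n
    dev = [v - mean for v in x]            # subtract the mean
    lag0 = sum(d * d for d in dev)         # unnormalized value at lag 0
    result = []
    for k in range(max_lag + 1):
        # multiply the overlapping values of the series and its shifted
        # copy, then sum the products
        total = sum(dev[i] * dev[i + k] for i in range(n - k))
        result.append(total / lag0)        # normalize by lag 0
    return result


# A strictly alternating series repeats every 2 steps
print(autocorrelation([1, -1] * 10, max_lag=2))  # [1.0, -0.95, 0.9]
```

The values shrink slightly with the lag even for a perfectly periodic series because fewer terms overlap as the copy is shifted, while the normalization stays fixed at the lag-0 sum.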


Autocorrelation and correlogram
The correlogram is the plot of correlation versus lag for a given time series.

For example, consider the data set with the number of daily calls placed to a call center over a period slightly longer than two years, as presented earlier.

[Plots: time series; (auto)correlogram, with lag on the x axis, 0 <= lag <= 500]

From the correlogram we can observe that:
• The series has a long “memory” (long cycles): it takes the correlation almost 100 days to fall to zero, indicating that the frequency of calls changes more or less once per quarter, but not more frequently
• There is a pronounced secondary peak at a lag of 365 days: the call center data is highly seasonal and repeats itself on a yearly basis (high correlation)
• There is a small but regular sawtooth structure: looking closely, the first peak of the sawtooth is at a lag of 7 days, and all the following ones occur at multiples of 7. This is the signature of the high-frequency component seen in the plot of the series; that is, the traffic to the call center exhibits a secondary seasonal component with 7-day periodicity, since the traffic depends on the day of the week
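Reading peaks off a correlogram can also be done programmatically. The sketch below uses a made-up series with a pure weekly cycle (it is not the call center data), computes its correlogram, and lists the lags at which local maxima occur; the helper `correlogram_peaks` is a hypothetical name introduced here.

```python
import math


def autocorrelation(x, max_lag):
    """Return [c(0), ..., c(max_lag)], normalized by the lag-0 value."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    lag0 = sum(d * d for d in dev)
    return [
        sum(dev[i] * dev[i + k] for i in range(n - k)) / lag0
        for k in range(max_lag + 1)
    ]


def correlogram_peaks(acf):
    """Lags (excluding 0) at which the correlogram has a local maximum."""
    return [
        k for k in range(1, len(acf) - 1)
        if acf[k - 1] < acf[k] > acf[k + 1]
    ]


# Made-up daily series: 104 weeks of a pure 7-day cycle
calls = [100 + 20 * math.sin(2 * math.pi * day / 7) for day in range(728)]
peaks = correlogram_peaks(autocorrelation(calls, max_lag=30))
print(peaks)  # [7, 14, 21, 28]
```

On real data, noise produces many small local maxima, so in practice one would look only at peaks that stand out clearly, as done visually in the observations above.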


Example


CO2 measurements above Mauna Loa in Hawaii
Consider again the data set with the concentration (ppm) of carbon dioxide (CO2) in the atmosphere, as measured by the observatory on Mauna Loa in Hawaii, recorded at monthly intervals since 1959.


CO2 measurements above Mauna Loa in Hawaii
The series can be analyzed numerically more easily if the horizontal axis is expressed as an incremental monthly index and if the graph goes through the origin (vertical translation of -315).

This can be achieved in Gnuplot with:

plot "data" using 0:($2-315) with lines


CO2 measurements above Mauna Loa in Hawaii
The series has a trend that seems to be a power law of the form b(x/a)^k, with k greater than 1 because the curve is convex downward. A first guess is k=2, with b=35 and a=350 read from the upper rightmost part of the series: at x=350 the shifted data reaches about 35.

plot "data" using 0:($2-315) with lines, 35*(x/350)**2


CO2 measurements above Mauna Loa in Hawaii
By trial and error, a better guess for k is 1.35. This can be plotted in Gnuplot with:

plot "data" using 0:($2-315) with lines, 35*(x/350)**1.35


CO2 measurements above Mauna Loa in Hawaii
To verify the accuracy of the model function, we can plot the residual obtained by subtracting the trend from the data:

plot "data" using 0:($2-315 - 35*($0/350)**1.35) with lines


CO2 measurements above Mauna Loa in Hawaii
The model seems fine except for the seasonality, which consists of regular oscillations. These can be captured by a sine, since the series starts at (0,0). The series is monthly, with a cycle of one year, so a guess is that the pattern repeats every 12 points. The amplitude is around 3, as we can observe in the former plots.

We can compare the residual and our seasonality model in Gnuplot, with f(x) defined as the trend model:

f(x) = 315 + 35*(x/350)**1.35
plot "data" u 0:($2-f($0)) w l, 3*sin(2*pi*x/12) w l

At this point, the model is given by the power-law function plus the sine function:

f(x) = 315 + 35*(x/350)**1.35 + 3*sin(2*pi*x/12)
plot "data" using 0:2 with lines, f(x)

which is pretty close to the actual phenomenon.


CO2 measurements above Mauna Loa in Hawaii
With the final model, it becomes possible to predict future values for the series.
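As an illustration of such a prediction, the final model can be evaluated for any monthly index. The Python sketch below simply transcribes the Gnuplot function f(x); the function name `co2_model` and the sample indexes are assumptions, and values far beyond the fitted range should be read with caution, since the constants were found by trial and error on the observed period only.

```python
import math


def co2_model(month_index):
    """Trend plus seasonality, using the constants found by trial and error."""
    trend = 315 + 35 * (month_index / 350) ** 1.35
    seasonality = 3 * math.sin(2 * math.pi * month_index / 12)
    return trend + seasonality


# Month index 0 is the first record; index 480 would be 40 years later
for month in (0, 350, 480):
    print(month, round(co2_model(month), 1))
```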


References
• Philipp K. Janert, Data Analysis with Open Source Tools, O’Reilly, 2010.
• Wikipedia, http://en.wikipedia.org
• Wolfram MathWorld, http://mathworld.wolfram.com/
