http://publicationslist.org/junio
Data Analysis
Time as a variable: time-series analysis
Prof. Dr. Jose Fernando Rodrigues Junior, ICMC-USP
What is it about?
Time series are an incredibly common kind of data:
• Stock market
• CPU utilization
• Meteorology: daily rainfall, wind speed, and temperature
• Sociology: crime figures, employment figures
• Software engineering: number of errors
• Networks: number of nodes and edges
First examples
Consider a data set with the concentration (ppm) of carbon dioxide (CO2) in the atmosphere, as measured by the observatory on Mauna Loa in Hawaii, recorded at monthly intervals since 1959.
The plot shows two common features in time series:
• Trend: a steady, long-term growth
• Seasonality: a regular periodic pattern, on a 12-month cycle
First examples
Consider the data set with the price of long-distance phone calls in the US over the last century.
The plot shows a strong nonlinear trend.
The single-log plot (inset) shows that the data follow an exponential decay (linear on a logarithmic scale), a usual behavior of growth/decay processes.
This example asks for closer inspection:
• Has the long-distance call service changed over time?
• Were the prices adjusted for inflation?
• What explains the uncharacteristically low prices for a couple of years in the late 1970s? Did the breakup of the AT&T system have anything to do with it?
First examples
Consider the data set with the development of the Japanese stock market, as represented by the Nikkei Stock Index over the last 40 years, shown with a 31-point Gaussian smoothing filter.
The plot shows a change in behavior after 1990 (the big Japanese bubble), after which a long-term increasing trend turned into an oscillatory decreasing trend.
The seasonality also changed significantly after then.
First examples
Consider a data set with the number of daily calls placed in a call center over a period slightly longer than two years.
This example is far more challenging, with its complex structure.
Actually, it is not clear whether the high-frequency variation in the plot is noise or has some form of regularity.
In an initial analysis, not many conclusions can be drawn from the plot: apparently no trend, no seasonality, and no change in behavior.
Since time series commonly rely on long-term data, it is important to verify that data acquisition was homogeneous over the whole period; otherwise the series may change its behavior in ways that are hard to make sense of.
Main components
As we have seen, the main components observed are:
• Trend: linear or nonlinear, with a characteristic magnitude
• Seasonality: additive, for example, every 12 months the sales increase by 3 million; or multiplicative, for example, every 12 months the sales increase by 1.4 times what was observed in the last cycle
• Noise: some form of random variation, quite common
• Other: changes in behavior, special outliers, missing data, and anything remarkable
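The difference between additive and multiplicative seasonality can be made concrete with a toy calculation. This is only a sketch using the slide's example figures (+3 million versus ×1.4 per cycle); the helper names and the starting value of 10 are ours:

```python
# Additive seasonality: each cycle adds a fixed amount (e.g. +3 million).
# Multiplicative seasonality: each cycle multiplies by a fixed factor (e.g. x1.4).

def additive_step(value, bump=3.0):
    return value + bump

def multiplicative_step(value, factor=1.4):
    return value * factor

# Starting from sales of 10 (million), follow two yearly cycles:
add_path = [10.0]
mul_path = [10.0]
for _ in range(2):
    add_path.append(additive_step(add_path[-1]))
    mul_path.append(multiplicative_step(mul_path[-1]))

# add_path grows by a constant step (10 -> 13 -> 16),
# mul_path by a growing step (10 -> 14 -> roughly 19.6).
```

Additive seasonality keeps the size of the seasonal swing constant; multiplicative seasonality makes it grow with the level of the series.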
Assumptions
Standard methods of time-series analysis make a number of assumptions, all of which are often violated in real-world scenarios:
• Data points have been taken at equally spaced time steps, with no missing data points: if violated, demands interpolation in case of missing points, or resampling in case of insufficient sampling
• The time series is sufficiently long (at least 50 points): if violated, requires smoothing methods to define a continuous curve, even where there are no points
• The series is stationary, i.e., it has no trend, no seasonality, and the character (amplitude and frequency) of any noise does not change with time: if violated, may require breaking the series into multiple segments to be analyzed separately
Smoothing
Just as with two-variable data, it is useful to fit a curve to the available data (actually, a time series is a special case of two-variable data).
Smoothing helps in:
• Reducing noise
• Interpolating missing/insufficient values
Running averages
The method known as running (moving, rolling, or floating) average is straightforward: for any odd number 2k+1 of consecutive points, replace the centermost value with the average of all the points in the window.
The smoothed point s_i is given by:
s_i = 1/(2k+1) · (x_(i−k) + … + x_i + … + x_(i+k))
where the x_i are the data points.
For example, for a 5-point (k=2) moving average, consider point x_10 = 4 and its neighbors x_8 = 4, x_9 = 7, x_11 = 2, x_12 = 9, so
s_10 = 1/5·(4+7+4+2+9) = 26/5 = 5.2
And so forth for any point.
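The procedure above can be sketched in a few lines (the function name and the edge-handling choice of leaving the first and last k points untouched are ours):

```python
# A minimal sketch of a (2k+1)-point running average.

def running_average(x, k):
    """Replace each interior point by the mean of its 2k+1-point window.
    Edge points without a full window are left untouched."""
    s = list(x)
    for i in range(k, len(x) - k):
        window = x[i - k : i + k + 1]
        s[i] = sum(window) / len(window)
    return s

# The slide's 5-point (k=2) example around x_10:
x = [0, 0, 0, 0, 0, 0, 0, 0, 4, 7, 4, 2, 9, 0, 0]  # x_8=4, x_9=7, x_10=4, x_11=2, x_12=9
s = running_average(x, 2)
# s[10] = (4 + 7 + 4 + 2 + 9) / 5 = 5.2
```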
Weighted running averages
Running averages do not work well in the presence of outliers, which may distort the curve.
The weighted running average technique lessens this problem by using weights to give more importance to points at the center of the moving window.
The weights w_j can be defined manually; for instance, for a 5-point window they could be (1/9, 2/9, 1/3, 2/9, 1/9).
Or they can be defined by a function, in which case the Gaussian is the first choice.
In either case, the weights must peak at the center, drop toward the edges, and add up to 1.
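A sketch of the weighted variant, with both the slide's hand-picked weights and Gaussian weights (the helper names and the sigma default are ours, not from any library):

```python
import math

# Weighted running average: weights peak at the center, fall off toward
# the edges, and are normalized to sum to 1.

def gaussian_weights(k, sigma=1.0):
    w = [math.exp(-0.5 * (j / sigma) ** 2) for j in range(-k, k + 1)]
    total = sum(w)
    return [v / total for v in w]            # normalize so the weights add up to 1

def weighted_running_average(x, weights):
    k = len(weights) // 2
    s = list(x)                              # edge points are left untouched
    for i in range(k, len(x) - k):
        s[i] = sum(w * x[i + j - k] for j, w in enumerate(weights))
    return s

# Either hand-picked weights (as on the slide) or Gaussian weights work:
manual_w = [1/9, 2/9, 1/3, 2/9, 1/9]
gauss_w = gaussian_weights(2)
smoothed = weighted_running_average([0.0] * 8 + [4.0, 7.0, 4.0, 2.0, 9.0, 0.0, 0.0], gauss_w)
```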
Running averages
For example: consider synthetic data (filled line) and an 11-point moving average.
The plot shows that the simple technique can represent the data reasonably well, but whenever an outlier (spike) appears, the curve is abruptly distorted until the outlier leaves the window.
The weighted version of the technique presents better results: instead of abrupt distortions, it shows smoothed peaks that point out the original outliers.
Single exponential smoothing
Running averages are intrinsically local and may not capture the global behavior of the series.
An improved method is exponential smoothing, which, in its single form, departs from a simple recursive definition:
s_i = α·x_i + (1 − α)·s_(i−1)
with 0 ≤ α ≤ 1, and s_0 = x_0, or s_0 = (1/n)·(x_0 + … + x_(n−1)) for n initial values.
That is, the i-th smoothed point is a mix between the actual point x_i and the previous smoothed point s_(i−1), where α can be defined by trial and error.
By mathematical induction, this recursion leads to the exponential expression:
s_i = α · Σ_j (1 − α)^j · x_(i−j)
which can provide any smoothed value s_i as a function of all the previous values x.
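The recursion above can be sketched directly (the function name is ours; s_0 is seeded with x_0, the simplest of the two initializations mentioned above):

```python
def single_exponential_smoothing(x, alpha):
    """s[i] = alpha * x[i] + (1 - alpha) * s[i-1], seeded with s[0] = x[0]."""
    s = [x[0]]
    for xi in x[1:]:
        s.append(alpha * xi + (1 - alpha) * s[-1])
    return s

# With alpha = 0.5, each smoothed point is the midpoint between the new
# observation and the previous smoothed value:
smoothed = single_exponential_smoothing([3.0, 5.0, 9.0, 20.0], alpha=0.5)
# -> [3.0, 4.0, 6.5, 13.25]
```

Note the two extremes: alpha = 1 reproduces the original series, while small alpha smooths heavily and reacts slowly to new points.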
Single exponential smoothing
Single exponential smoothing provides good smoothing curves and, in some cases, forecasting.
It is limited, though, for series that present trend or seasonality, situations in which the technique cannot be accurately used for prediction.
There are two more advanced exponential smoothing techniques:
• Double exponential smoothing, for series with trend but without seasonality
• Triple exponential smoothing, for series with trend and seasonality; this technique is called the Holt–Winters method
The Holt–Winters method is a powerful technique able to reproduce the full behavior of additive or multiplicative time series.
Double and triple exponential smoothing
Double exponential smoothing:
s_i = α·x_i + (1 − α)·(s_(i−1) + t_(i−1))
t_i = β·(s_i − s_(i−1)) + (1 − β)·t_(i−1)   (trend factor)
Additive triple exponential smoothing (season length k):
s_i = α·(x_i − p_(i−k)) + (1 − α)·(s_(i−1) + t_(i−1))
t_i = β·(s_i − s_(i−1)) + (1 − β)·t_(i−1)   (trend factor)
p_i = γ·(x_i − s_i) + (1 − γ)·p_(i−k)   (seasonality factor)
x_(i+h) = s_i + h·t_i + p_(i−k+h)   (forecasting)
Multiplicative triple exponential smoothing (season length k):
s_i = α·x_i/p_(i−k) + (1 − α)·(s_(i−1) + t_(i−1))
t_i = β·(s_i − s_(i−1)) + (1 − β)·t_(i−1)   (trend factor)
p_i = γ·x_i/s_i + (1 − γ)·p_(i−k)   (seasonality factor)
x_(i+h) = (s_i + h·t_i)·p_(i−k+h)   (forecasting)
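A minimal sketch of additive Holt–Winters along these lines; the initialization from the first season of data is one simple choice among several, and all names are ours:

```python
def holt_winters_additive(x, season_len, alpha, beta, gamma):
    """Additive triple exponential smoothing; returns the in-sample
    smoothed values and a one-step-ahead forecast."""
    k = season_len
    # Crude initialization from the first season of data.
    level = x[0]
    trend = (x[k] - x[0]) / k
    seasonal = [x[i] - x[0] for i in range(k)]
    out = []
    for i in range(len(x)):
        prev_level = level
        # Level, trend, and seasonal updates, in that order.
        level = alpha * (x[i] - seasonal[i % k]) + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        seasonal[i % k] = gamma * (x[i] - level) + (1 - gamma) * seasonal[i % k]
        out.append(level + seasonal[i % k])
    forecast = level + trend + seasonal[len(x) % k]   # horizon h = 1
    return out, forecast

# On a flat series the method should reproduce the series and its next value:
out, forecast = holt_winters_additive([5.0] * 12, season_len=4,
                                      alpha=0.3, beta=0.1, gamma=0.2)
```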
Exponential smoothing depends on mixing parameters, which are required by software packages:
• Single exponential smoothing: α
• Double exponential smoothing: α, β
• Triple exponential smoothing: α, β, γ
More on time-series analysis: http://www.statsoft.com/textbook/time-series-analysis/
Triple exponential smoothing
For example, the additive Holt–Winters plot for a data set with the number of US monthly international flight passengers.
The years 1949 through 1957 were used to "train" the algorithm, and the years 1958 through 1960 were forecast.
Note how well the forecast agrees with the actual data.
Autocorrelation and correlogram
As mentioned, time series are mainly characterized by trend and seasonality.
Trend is analyzed by means of smoothing, function fitting (modeling), and plotting.
Seasonality can benefit from the techniques of autocorrelation and the correlogram.
Autocorrelation and correlogram
The correlation between two time series is obtained as follows:
• For each point x_i in the two series, multiply their response values (y_i), considering their deviation from the mean
• Sum up all the products
• Normalize
The correlation of two identical series is 1, and it is −1 for series that are exact inversions of each other.
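The three-step recipe above amounts to the usual (Pearson) correlation; a sketch, with names of our choosing:

```python
import math

def correlation(y1, y2):
    """Multiply mean-centered values pairwise, sum, and normalize."""
    n = len(y1)
    m1, m2 = sum(y1) / n, sum(y2) / n
    num = sum((a - m1) * (b - m2) for a, b in zip(y1, y2))
    den = math.sqrt(sum((a - m1) ** 2 for a in y1)
                    * sum((b - m2) ** 2 for b in y2))
    return num / den

y = [1.0, 3.0, 2.0, 5.0, 4.0]
same = correlation(y, y)                    # identical series -> 1
inverted = correlation(y, [-v for v in y])  # exactly inverted series -> -1
```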
Autocorrelation and correlogram
Seasonality:
• Formally defined via the correlation between each i-th element and the (i+k)-th element, where k is usually called the lag
• Measured by the autocorrelation function (ACF), i.e., the correlation between the two terms x_i and x_(i+k)
If the measurement error is not too large, seasonality can be visually identified as a pattern that repeats every k moments in time.
Autocorrelation and correlogram
If seasonality is present, then the behavior of the series should repeat every k time units, where k is named the lag.
The problem, hence, is: how to identify analytically the lag of the series?
The answer is: compare the time series with itself, shifted by increasing lags k; for each lag, calculate the correlation.
Hence, the autocorrelation of a given series at lag k is given by:
c(k) = Σ_i (x_i − x̄)·(x_(i+k) − x̄) / Σ_i (x_i − x̄)²
where the denominator is the normalization according to lag 0, that is, to the correlation of the series with itself.
Autocorrelation and correlogram
Autocorrelation basic algorithm:
1. Let k = 0
2. Start with two copies of the series (original and copy)
3. Subtract the mean from all values in both series
4. Multiply the values at corresponding time steps with each other
5. Sum up the results for all time steps
6. Normalize with the variance of the original series: this is the correlation for lag k, that is, c(k)
7. Shift the copy by 1 time step
8. Let k = k + 1
9. Continue at step 4 while k < kmax
According to this algorithm:
• Initially (lag 0), the two signals are perfectly aligned and the correlation is 1
• Then, as we shift the signals, they slowly move out of phase and the correlation drops
How quickly it drops tells us how much "memory" there is in the data:
• If quickly, we know that, after a few steps, the signal has lost all memory of its recent past
• If slowly, then we know that we are dealing with a process that is relatively steady over longer periods of time
Autocorrelation and correlogram
The correlogram is the plot of correlation versus lag for a given time series.
For example: consider the data set with the number of daily calls placed in a call center over a period slightly longer than two years, as presented earlier.
[Figure: the time series and its (auto)correlogram, for lags 0 ≤ k ≤ 500]
From the correlogram we can observe that:
• The series has a long "memory" (long cycles): it takes the correlation almost 100 days to fall to zero, indicating that the frequency of calls changes more or less once per quarter but not more frequently
• There is a pronounced secondary peak at a lag of 365 days: the call-center data is highly seasonal and repeats itself on a yearly basis, when the series repeats its response behavior (high correlation)
• There is a small but regular sawtooth structure; if we look closely, we will find that the first peak of the sawtooth is at a lag of 7 days and that all repeating ones occur at multiples of 7. This is the signature of the high-frequency component that we see in the plot of the series; that is, the traffic to the call center exhibits a secondary seasonal component with a 7-day periodicity: the traffic depends on the day of the week
Example
CO2 measurements above Mauna Loa in Hawaii
Consider again the data set with the concentration (ppm) of carbon dioxide (CO2) in the atmosphere, as measured by the observatory on Mauna Loa in Hawaii, recorded at monthly intervals since 1959.
CO2 measurements above Mauna Loa in Hawaii
The series can be better analyzed numerically if the horizontal axis is expressed as incremental monthly indexes, and if the graph goes through the origin (vertical translation of −315).
This can be achieved in Gnuplot with:
plot "data" using 0:($2-315) with lines
CO2 measurements above Mauna Loa in Hawaii
The series has a trend that seems to be a power law of the form b·(x/a)^k with k greater than 1, as the curve is convex downward; a first guess is k=2, b=35, and a=350 (taken from the upper rightmost part of the series).
This can be achieved in Gnuplot with:
plot "data" using 0:($2-315) with lines, 35*(x/350)**2
CO2 measurements above Mauna Loa in Hawaii
By trial and error, a better guess for k is 1.35. This can be achieved in Gnuplot with:
plot "data" using 0:($2-315) with lines, 35*(x/350)**1.35
CO2 measurements above Mauna Loa in Hawaii
To verify the accuracy of the model function, we can plot the residual by subtracting the trend from the data.
This can be achieved in Gnuplot with:
plot "data" using 0:($2-315 - 35*($0/350)**1.35) with lines
CO2 measurements above Mauna Loa in Hawaii
The model seems fine except for the seasonality, which consists of regular oscillations that can be captured by sines, as the series starts at (0,0); also, the series is monthly-based with a cycle of one year, so a guess is that the pattern repeats every 12 points; the amplitude is around 3, as we can observe in the former plots.
We can compare the residual and our seasonality model in Gnuplot, defining f(x) as the trend fitted so far:
f(x) = 315 + 35*(x/350)**1.35
plot "data" using 0:($2-f($0)) with lines, 3*sin(2*pi*x/12) with lines
At this point, the model is given by the power-law function plus the sine function:
f(x) = 315 + 35*(x/350)**1.35 + 3*sin(2*pi*x/12)
plot "data" using 0:2 with lines, f(x)
which is pretty close to the actual phenomenon.
CO2 measurements above Mauna Loa in Hawaii
With the final model, it becomes possible to predict future values for the series.
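As a sketch of such a prediction, the fitted model can be evaluated at future month indexes; here it is translated from Gnuplot to Python, with the constants taken from the slides (the starting month 600 is an arbitrary illustration of ours, not from the slides):

```python
import math

# The model fitted in the previous slides:
# f(x) = 315 + 35*(x/350)**1.35 + 3*sin(2*pi*x/12), x = month index since 1959.

def f(x):
    trend = 315.0 + 35.0 * (x / 350.0) ** 1.35
    seasonality = 3.0 * math.sin(2.0 * math.pi * x / 12.0)
    return trend + seasonality

# Predicted CO2 concentration (ppm) for twelve months starting at month 600:
forecast = [f(x) for x in range(600, 612)]
```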
References
• Philipp K. Janert, Data Analysis with Open Source Tools, O'Reilly, 2010.
• Wikipedia, http://en.wikipedia.org
• Wolfram MathWorld, http://mathworld.wolfram.com/