
Annual IL Tornado Count
Katie Ruben
April 22, 2016

As stated by the NOAA [1], “a tornado is a violently rotating column of air, suspended from a cumuliform cloud or underneath a cumuliform cloud, and often visible as a funnel cloud.” In addition, for this violently rotating column of air to be considered a tornado, the column must make contact with the ground. When forecasting tornados, meteorologists look for four ingredients in predicting such severe weather. These ingredients are present when the “temperature and wind flow patterns in the atmosphere cause enough moisture, instability, lift, and wind shear for a tornadic thunderstorm to occur [1].” Tornados are an important aspect of living in certain areas of the country. They can cause death, injury, property damage, and also high anxiety in many people who choose to live in areas prone to tornados. In particular, my project deals with the number of annual tornados that have occurred in Illinois since 1950. Meteorologists are interested in improving their understanding of the causes of tornados as well as when they are likely to occur. The data used in this analysis comes from the National Oceanic and Atmospheric Administration [2]. The data I am looking at contains a tornado count from 1950 to 2015 for every state in the United States. I chose to look strictly at Illinois in this date range; doing so yielded 2,406 tornados over those 66 years. In particular, I am interested in forecasting the number of tornados that will occur in subsequent years based on this time series data.

In order to analyze the Illinois tornado count time series data, I will first look to see whether the data is stationary or non-stationary by applying the Dickey-Fuller test. The outcome will help determine which set of time series models to continue with. Depending on the original data set, I may need to perform transformations that reduce large variance and take care of any explosive behavior in the data. Upon doing so, I will generate preliminary models using the ACF, PACF, EACF, and ARMA subset selection. Once several potential models have been chosen, I will fit these models by estimating parameters using the maximum likelihood method. In addition, I will perform a residual analysis on the fitted models to check, to the best of my ability, that the residuals are normally distributed, independent, and have constant variance. To do this, I will look at the KS test, SW test, and QQ plot for normality, the runs test and sample autocorrelation function for independence, and the BP test for constant variance. I will continue building an appropriate model by looking for outliers and adjusting the models based on the residual analysis. The final step is to forecast the data set into the future. I will compare my forecast with the actual data to see how accurate the model has become.


1   Background

Tornados are an important aspect of living in certain areas of the country. They can cause death, injury, property damage, and also high anxiety in many people who choose to live in areas prone to tornados. In particular, this project deals with the number of annual tornados that have occurred in Illinois since 1950. Meteorologists are interested in improving their understanding of the causes of tornados as well as when they are likely to occur. The data used in this analysis comes from the National Oceanic and Atmospheric Administration [2]. The data being investigated contains a tornado count from 1950 to 2015 for every state in the United States. I chose to look strictly at Illinois in this date range; doing so yielded 2,406 tornados over those 66 years. In particular, I am interested in forecasting the number of tornados that will occur in subsequent years based on this time series data. As stated by the NOAA [1], “a tornado is a violently rotating column of air, suspended from a cumuliform cloud or underneath a cumuliform cloud, and often visible as a funnel cloud.” In addition, for this violently rotating column of air to be considered a tornado, the column must make contact with the ground. When forecasting tornados, meteorologists look for four ingredients in predicting such severe weather. These ingredients are present when the “temperature and wind flow patterns in the atmosphere cause enough moisture, instability, lift, and wind shear for a tornadic thunderstorm to occur [1].”

Prior to starting this time series analysis, we split our 66 years of observations into two sets: training data and validation data. The training data set contains 60 years (1950-2009), while the validation data set contains 6 years (2010-2015); the validation set therefore holds about 9% of the observed years. Keep in mind that over these 66 years there were 2,406 tornado sightings in Illinois. In this paper, we begin by performing preliminary transformations on the training set to ensure stationarity. If the data set shows non-stationary behavior, I will go through several different transformations in Section 2 of this paper. Section 2 also contains the model identification process for several time series models, as well as estimation and residual analysis. Since the training data contains 60 observations, the maximum lag recommended for the autocorrelation-of-residuals checks is k = ln(60) ≈ 4. This will become important as we work through this data set. In Section 3, we focus on model validation, choosing which of our models is most accurate, and forecasting. In Section 4, we discuss our results from this project on IL tornado counts between 1950 and 2015.
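As a minimal sketch of this setup (mirroring the Appendix A code, and assuming as there that the first column of IL Total Data.csv holds the annual counts), the split and the lag rule can be computed as:

data1 <- read.csv(file = "IL Total Data.csv", header = FALSE, sep = ",")
y1 <- data1[, 1]                    # annual IL tornado counts, 1950-2015
y_train <- y1[1:60]                 # training years: 1950-2009
y_test  <- y1[61:66]                # validation years: 2010-2015
t.data <- ts(y_train, frequency = 1, start = 1950)
floor(log(length(y_train)))         # ln(60) is about 4.09, so a maximum lag of k = 4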


2   Training Data Transformations

2.1.1   Training Data

To begin our model building process, we start by examining the training data set. The time series plot of our Illinois annual tornado count is shown in Figure 1. This plot shows very large variance as well as explosive behavior as time passes, which suggests that our time series is non-stationary. However, we need to conduct some formal testing.

Figure 1: Training Data Time Series & Scatter Plot

The Dickey-Fuller test is used to determine whether the data set is stationary or non-stationary. The null hypothesis states that α = 1, in which case there is a unit root and the time series is non-stationary. The alternative hypothesis states that α < 1, in which case the time series is stationary. If the time series is non-stationary, it is suggested to take the difference. Throughout this paper, we will work with a significance level of .05.

(Figure 1 panels: a time series plot of the annual tornado counts in IL and a scatterplot of this year's count against the previous year's count.)

2.1.2   Transformed Training Data

2.1.2.1   Log(Training Data)

Before we check for stationarity, we need to try to eliminate the large variance seen in our data. To do so, we take the natural logarithm of the training time series data; the plot is shown in Figure 2. As seen in Figure 2, there still exists large variance as well as explosive behavior. It is suggested now to look at a Box-Cox transformation of the data to see if we can come up with a better representation of our data set. However, when using the Box-Cox transformation in R, we get λ = 0; since λ = 0 corresponds to the logarithmic transformation, this suggests that no transformation beyond the log is needed.
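A hedged sketch of this check, assuming it is run with BoxCox.ar from the TSA package on the positive count series t.data (the exact call is not shown in Appendix A):

library(TSA)
bc <- BoxCox.ar(t.data)   # plots the log-likelihood over a grid of lambda values
bc$mle                    # maximum likelihood estimate of lambda
bc$ci                     # approximate 95% confidence interval for lambda
# A value of lambda near 0 supports the logarithmic transformation.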

Figure 2: Natural Logarithm Transformation Training Data Plot

2.1.2.3  Difference on Logarithm of Preliminary Data

The final transformation used to attempt to remove the explosive behavior is to difference the logged training data. As seen in Figure 3, the time series plot of this transformation looks much better: the explosive behavior has dissipated. Looking at the scatter plots of the transformed data in Figure 4, we see that Y_t vs. Y_{t-1} shows a negative correlation, Y_t vs. Y_{t-2} shows either a slight negative correlation or no correlation, and Y_t vs. Y_{t-3} shows no correlation. These plots suggest that we may have a time series model of order 1. We will conduct formal model selection next.
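A brief sketch of these lagged scatterplot diagnostics, mirroring the Appendix A code (zlag is the lag operator from the TSA package):

library(TSA)
w <- diff(log(t.data))                             # differenced log counts
par(mfrow = c(1, 3), pty = "s")
plot(y = w, x = zlag(w),        main = 'Lag 1')    # Y_t vs. Y_{t-1}
plot(y = w, x = zlag(w, d = 2), main = 'Lag 2')    # Y_t vs. Y_{t-2}
plot(y = w, x = zlag(w, d = 3), main = 'Lag 3')    # Y_t vs. Y_{t-3}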


Figure 3: Difference Training Data Time Series Plots

Figure 4: Difference Training Data Scatter Plots

There no longer seems to be any apparent explosive behavior in the time series plot after taking the difference of the log transform. This suggests stationarity in our transformed training data. However, a formal Dickey-Fuller test must be applied. In doing so, we get a p-value of .01 < .05 = α for lags 1 and 2 under the non-constant, constant, and linear trend specifications. Therefore, since the p-value is less than the significance level, we reject the null hypothesis and conclude that the transformed series is stationary. A sample R output is shown in Appendix A.
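A minimal sketch of this test, using adfTest from the fUnitRoots package as in Appendix A (the three type values give the no-constant, constant, and constant-plus-trend forms of the test):

library(fUnitRoots)
w <- diff(log(t.data))
adfTest(w, lags = 1, type = "nc")   # no constant
adfTest(w, lags = 1, type = "c")    # constant
adfTest(w, lags = 1, type = "ct")   # constant and linear trend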

(Figure 3 panel: time series plot of diff(log(t.data)); Figure 4 panels: scatterplots of the differenced series against its values one, two, and three years earlier.)

2.1.3   Preliminary Model Building

Now that we have figured out how to eliminate the explosive behavior in our data set, we can begin looking for preliminary models from which to build an appropriate time series model. To do this, we will look at the ACF plot for potential Moving Average models, the PACF plot for potential Autoregressive models, the EACF chart for potential mixed-process models, and ARMA subset selection. Figure 5 shows the ACF, PACF, and EACF plots for the differenced series chosen above.

Figure 5: ACF, PACF, & EACF

The ACF plot suggests that a Moving Average model of order 1 may be a potential model for our data set. The PACF plot suggests that an Autoregressive model of order 1 may be a potential model. One aspect to keep in mind is that, theoretically, the PACF should show lags that decay exponentially; the PACF plot for this data set does not follow this exponentially decaying pattern, so an Autoregressive model may not be the best suited model. The EACF chart again suggests an AR(1). We can also search for potential models by looking at the ARMA subset selection based on BIC or AIC values; this output is displayed in Figure 6. The maximum number of lags allowed is k = ln(60) ≈ 4, based on the autocorrelation-of-residuals recommendation from the literature on the topic. This output suggests that the best model for the data would be an MA(1) with an intercept term, which is the same suggestion made by the ACF plot. The second best suggestion would be an ARMA(1,1) process. Throughout the rest of this project, I will work with the following processes: ARI(1,1), IMA(1,1), and ARIMA(1,1,1).
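A short sketch of these identification tools, following the Appendix A code (acf and pacf are in base R; eacf and armasubsets come from the TSA package):

library(TSA)
w <- diff(log(t.data))
acf(w)     # a single significant spike at lag 1 points toward an MA(1)
pacf(w)    # a single significant spike at lag 1 points toward an AR(1)
eacf(w)    # table of x's and o's used to suggest low-order mixed models
# Best-subset search over AR and MA lags up to 4, ranked by BIC
sub1 <- armasubsets(w, nar = 4, nma = 4, y.name = 'test', ar.method = 'ols')
plot(sub1)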


Figure 6: ARMA Subset BIC

2.1.3.1   Estimations

Using maximum likelihood estimation, we were able to come up with suitable models for ARI(1,1), IMA(1,1), and ARIMA(1,1,1). When working with time series models, it is best to choose a simple model to explain the data. Figure 7 displays the estimates for each of these models. Note that the intercept terms were not significant at the .05 level, so they do not need to be included in the models. We determined significance by looking at the R output for the estimates: the ratio of the intercept coefficient to its standard error was well below the critical value of 1.96, indicating that the intercept was not significantly different from zero. Note that the models were estimated using the log-transformed data.

ARI(1,1): Y_t = -0.4447 Y_{t-1} + e_t
IMA(1,1): Y_t = e_t - (-0.5491) e_{t-1}
ARIMA(1,1,1): Y_t = 0.3370 Y_{t-1} + e_t - (-0.8658) e_{t-1}

(Here Y_t denotes the differenced log series, since each model was fit to log(t.data) with one difference.)

Figure 7: Model Estimates
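A condensed sketch of how these fits were obtained, following the arima calls in Appendix A; the last line illustrates the estimate-to-standard-error ratio that was compared with 1.96:

AR1    <- arima(log(t.data), order = c(1, 1, 0), method = 'ML')   # ARI(1,1)
MA1    <- arima(log(t.data), order = c(0, 1, 1), method = 'ML')   # IMA(1,1)
ARMA11 <- arima(log(t.data), order = c(1, 1, 1), method = 'ML')   # ARIMA(1,1,1)
# Approximate t-ratios: each estimate divided by its standard error
round(MA1$coef / sqrt(diag(MA1$var.coef)), 3)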

2.1.3.1.1   Outliers

Before proceeding further, we must determine whether there exist outliers for each of our potential models. In R, we ran the additive outlier and innovational outlier commands (detectAO and detectIO). For each model, both commands confirmed that no outliers exist. Therefore, we can continue with our residual analysis.
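A brief sketch of this check using the TSA functions named above, applied to all three fits as described (Appendix A shows the calls for the ARIMA(1,1,1) fit):

detectAO(AR1);    detectIO(AR1)       # additive / innovational outliers, ARI(1,1)
detectAO(MA1);    detectIO(MA1)       # IMA(1,1)
detectAO(ARMA11); detectIO(ARMA11)    # ARIMA(1,1,1)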


2.1.3.2   Residual Analysis

The next step is to look at the residuals of our three models. From the residuals, we can assess normality, constant error variance, and independence. Recall that the original training data showed large variance; the transformations addressed stationarity, but we anticipate that the residuals may still show large variance, and possibly non-normality and dependence. We will therefore conduct formal tests of each characteristic for all three models. As seen in Figure 8, the QQ plots do not suggest strong normality: in all three models there appear to be heavy tails, and the QQ normal line does not align with our data points as well as we would wish. In our opinion, the ARI(1,1) model has the best-looking QQ plot for displaying normality. To verify this, we conduct a KS test and Shapiro-Wilk test, which can be found in Appendix A. With a significance level of .05, we fail to reject the null hypothesis in every normality test (KS and Shapiro-Wilk) for each of our three models; in each case, the p-value is greater than the significance level. This means we can assume that the residuals are from the normal distribution.

Figure 8: QQ Plots of Models

Next we will look at constant error variance, shown in Figure 9. In all three plots there appears to be large variance around the horizontal line y = 0; however, the plot for each model does resemble white noise, so we can assume possible constant variance for each model. I was unable to perform a BP or BF test on this data because I did not have the necessary x-variable to regress the residuals on, and hence R would not produce these tests.
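A compact sketch of the normality and variance diagnostics for one fit (MA1, the IMA(1,1) model from Appendix A; the other two fits are checked the same way, and rstandard for arima fits is supplied by the TSA package):

qqnorm(residuals(MA1), main = 'IMA(1,1) QQ Plot')
qqline(residuals(MA1), col = 'red')
ks.test(residuals(MA1), "pnorm")      # Kolmogorov-Smirnov test for normality
shapiro.test(residuals(MA1))          # Shapiro-Wilk test for normality
# Constant variance: standardized residuals plotted against time
plot(rstandard(MA1), ylab = 'Standardized residuals', main = 'IMA(1,1)', type = 'o')
abline(h = 0, col = "red", lwd = 2)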

(Figure 8 panels: normal QQ plots of the residuals from the ARI(1,1), IMA(1,1), and ARIMA(1,1,1) fits.)

Figure 9: Error Variance Analysis of Models

Finally, we will check whether the residuals of our three models are independent. To do this, we use a runs test for each model. Based on the runs tests shown in Appendix A, we conclude that the residuals can be assumed independent because each p-value is greater than the significance level of .05; therefore, we fail to reject the null hypothesis that the data are independent. Another way to test for independence is the Ljung-Box test, whose null hypothesis is that the data are independently distributed; in other words, it tests whether the autocorrelations of the series are different from zero. Based on the results in Appendix A, we fail to reject the null hypothesis: the p-value is greater than the significance level for each of our three models. Finally, to confirm independence once more, we can look at the ACF plot of the residuals for each model, as seen in Figure 10. Since the lags are all within the blue cut-off lines, we conclude that the residuals resemble white noise and are therefore independent.
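A short sketch of these independence checks, mirroring the Appendix A calls (runs comes from the TSA package; fitdf counts the ARMA parameters estimated in each model):

runs(residuals(AR1)); runs(residuals(MA1)); runs(residuals(ARMA11))   # runs tests
# Ljung-Box tests up to lag 4, the recommended maximum lag
Box.test(residuals(AR1),    lag = 4, type = "Ljung-Box", fitdf = 1)
Box.test(residuals(MA1),    lag = 4, type = "Ljung-Box", fitdf = 1)
Box.test(residuals(ARMA11), lag = 4, type = "Ljung-Box", fitdf = 2)
acf(residuals(MA1), main = 'Sample ACF of Residuals from IMA(1,1) Model')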

Figure 10: ACF Residuals

(Figure 9 panels: standardized residuals over time for the ARI(1,1), IMA(1,1), and ARIMA(1,1,1) models; Figure 10 panels: sample ACF of the residuals from each of the three models.)

Therefore, we have been able to transform our training data into a series whose model residuals show normality, constant variance, and independence. This is normally not an easy task; however, since the data set consists of only 66 years, it was achievable for such a small data set.

3   Model Validation

3.1.1   Confirmation of Models (Overfitting & Parameter Redundancy)

Now we will confirm that the three suggested models are good models for our data set by extending the parameters of each. If the estimate of the additional parameter is not significantly different from zero, and the estimates of the original model do not change significantly from their original values, then we can confirm that the model is a good fit. We will again use a significance level of .05: if the ratio of an estimated coefficient to its standard error is less than the critical value of 1.96, we will conclude that the coefficient is not significantly different from zero.

Model          ARI(1,1)            ARI(2,1)
φ1 (s.e.)      -0.4447 (0.1155)    -0.4970 (0.1298)
φ2 (s.e.)      --                  -0.1124 (0.1298)
AIC            125.67              126.93

ARI(2,1): φ2 / s.e. = -0.1124 / 0.1298 = -0.8659; |-0.8659| < 1.96, therefore non-significant.

Model          IMA(1,1)            IMA(1,2)
θ1 (s.e.)      -0.5491 (0.1358)    -0.5527 (0.1352)
θ2 (s.e.)      --                  -0.0475 (0.1728)
AIC            124.04              125.97

IMA(1,2): θ2 / s.e. = -0.0475 / 0.1728 = -0.2748; |-0.2748| < 1.96, therefore non-significant.

Model          ARIMA(1,1,1)        ARIMA(1,1,2)        ARIMA(2,1,1)
φ1 (s.e.)      0.3370 (0.1698)     -0.9009 (0.2298)    0.3189 (0.1461)
φ2 (s.e.)      --                  --                  0.2174 (0.1388)
θ1 (s.e.)      -0.8658 (0.1027)    0.3474 (0.2712)     -0.9114 (0.0714)
θ2 (s.e.)      --                  -0.4464 (0.2057)    --
AIC            124.97              127.71              124.52

ARIMA(1,1,2): θ2 / s.e. = -0.4464 / 0.2057 = -2.1701; |-2.1701| > 1.96, therefore significant.
ARIMA(2,1,1): φ2 / s.e. = 0.2174 / 0.1388 = 1.566; 1.566 < 1.96, therefore non-significant.

If we continue to increase the order of the ARIMA model, the AIC value continues to get larger. Generally, a smaller AIC value indicates a better model, so ARIMA(1,1,1) appears to be the best mixed-process model based on the AIC values.
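A minimal sketch of the over-fitting fits behind the table above, following Appendix A; the added coefficient's t-ratio is what gets compared with 1.96:

AR2    <- arima(log(t.data), order = c(2, 1, 0), method = 'ML')   # ARI(2,1)
MA2    <- arima(log(t.data), order = c(0, 1, 2), method = 'ML')   # IMA(1,2)
ARMA12 <- arima(log(t.data), order = c(1, 1, 2), method = 'ML')   # ARIMA(1,1,2)
ARMA21 <- arima(log(t.data), order = c(2, 1, 1), method = 'ML')   # ARIMA(2,1,1)
MA2$coef / sqrt(diag(MA2$var.coef))              # t-ratios, e.g. theta2 for IMA(1,2)
c(AR2$aic, MA2$aic, ARMA12$aic, ARMA21$aic)      # AIC values of the over-fitted models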


In addition, when looking at our three suggested models with the over-fitting procedure, it appears that ARI(1,1) and IMA(1,1) are confirmed to be good models for this time series. With ARIMA(1,1,1), we were unable to confirm the process as a good model because ARIMA(1,1,2) has a significant coefficient for θ2, and the over-fitted ARIMA estimates were not close to the original estimates for this model. Therefore, we cannot confirm this model as a good fit for our data. As we continue with forecasting, I will work with the ARI(1,1) and IMA(1,1) models for the log data.

3.1.2   Forecasting

The final procedure is to identify which of the remaining models is the best predictor of annual tornados. We will forecast values for the testing data set; recall that we initially set aside the last 6 years of data. Now we can test whether our models are accurate. As seen in Figure 11, we made our predictions using R; they are displayed with the red dots. These predictions were made using the one-step-ahead forecasting procedure discussed in class. As you can tell, the fit is not perfect. This is partly because the data set contains only 66 points, and these points may need to be modeled with a different technique; however, we are using the techniques demonstrated in class.

Figure 11: Log data Forecast

In order to determine which model has better prediction capabilities, we will look at the MSE, MAP, and PMAD. The smaller the values, the better the prediction ability. From the table below, we can determine that IMA(1,1) has the best predicting capability based on this criterion.
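A condensed sketch of the one-step-ahead loop and the accuracy measures, following the Appendix A code for the IMA(1,1) case (y1 and y_test are the full series and the held-out years defined earlier):

pred <- rep(NA, 6)
for (i in 1:6) {
  y_sub <- y1[1:(61 + i - 1)]                        # data available before forecast i
  fit   <- arima(log(y_sub), order = c(0, 1, 1), method = 'ML')
  pred[i] <- predict(fit, n.ahead = 1)$pred          # one-step-ahead forecast (log scale)
}
MSE  <- mean((log(y_test) - pred)^2)
MAP  <- mean(abs((log(y_test) - pred) / log(y_test)))
PMAD <- sum(abs(log(y_test) - pred)) / sum(abs(log(y_test)))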

(Figure 11 panels: one-step-ahead forecasts of log(t.data), "Forecasting with ARI(1,1)" and "Forecasting with IMA(1,1)", with forecasts marked in red.)

         ARI(1,1)       IMA(1,1)
MSE      0.0434644      0.03533288
MAP      0.0450561      0.03748878
PMAD     0.04383421     0.03675122

(IMA(1,1) gives the smaller value in every case.)

As you can see, the values are not as small as we would like; however, with the data at hand, this seems reasonably good. All code used to produce these predictions is shown in Appendix A. Now that we have chosen IMA(1,1) as our best model, we transform the data back to its original scale. Figure 12 shows the predictions made on the log data, and Figure 13 shows the predictions after transforming back to the original scale. Both figures also contain the prediction intervals for the corresponding data sets.

Figure 12: Log data Prediction Values & Interval

Figure 13: Original data Prediction Values & Interval

Figure 13 shows that three of our 6 values were predicted fairly close to the actual value. However, there is still a lot of variability in our model when predicting. Looking at the prediction intervals for the original data set after transforming back, we see that they are very large, which means the predicting capability of IMA(1,1) is not especially good. The complete list of 95% prediction intervals for the original data set is given in Appendix A. Figure 14 displays the final graph of our time series data, comparing the original data to the 6 predicted values.
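A short sketch of the back-transformation and interval construction, assuming the point forecasts pred from the loop above and one-step forecast standard errors se taken from the same IMA(1,1) fits (Appendix A computes its standard errors, pred4, from the ARI(1,1) loop); exp(pred + se^2/2) is the lognormal mean correction for a forecast made on the log scale:

se <- rep(NA, 6)
for (i in 1:6) {
  y_sub <- y1[1:(61 + i - 1)]
  fit   <- arima(log(y_sub), order = c(0, 1, 1), method = 'ML')
  se[i] <- predict(fit, n.ahead = 1)$se
}
lower <- pred - qnorm(0.975) * se            # 95% interval on the log scale
upper <- pred + qnorm(0.975) * se
counts.pred <- exp(pred + 0.5 * se^2)        # back-transformed point forecasts (counts)
data.frame(Year = 2010:2015, fit = counts.pred,
           lower = exp(lower), upper = exp(upper))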


Figure 14: Time Series Plot with Predictions (Original Data)

IMA(1,1): Y_t = e_t - (-0.5491) e_{t-1}

4   Discussion

The goal of this analysis was to use knowledge of time series models to predict the future annual count of tornados in IL. We started with 2,406 tornado sightings in Illinois from 1950 to 2015 and aggregated them into yearly counts, giving 66 data points. This is a fairly small data set for a time series analysis; keep this in mind as we discuss our results. We began our analysis with a training data set containing the first 60 years, setting aside the last 6 years as a testing data set that became important when we performed our forecasts. Our first goal was to ensure that our data was stationary. To do this, we performed the log transformation and then took the difference of the data: the log transformation helped reduce the variability seen in the original data set, while the differencing removed the explosive behavior seen in the original time series plot. We confirmed the stationarity of the transformed data with the Dickey-Fuller test. Once we had a stationary data set, we began the estimation process. We looked at the ACF, PACF, EACF, and best subset selection chart in order to determine which models


would be best. We came to the conclusion that ARI(1,1), IMA(1,1), and ARIMA(1,1,1) would all be suitable models at that point. Next, we performed a residual analysis of all three models. As discussed in the paper, all three models were shown to have normal, constant-variance, and independent residuals. This is primarily due to the small sample size of our data set; when a data set is small or extremely large, these three characteristics are much easier to achieve. However, when we performed over-fitting on all three of these models, ARIMA(1,1,1) proved to be insufficient for this data set. Therefore, as we continued with the project, we focused only on ARI(1,1) and IMA(1,1). Finally, we forecasted values for our testing data set using both ARI(1,1) and IMA(1,1) and calculated the MSE, MAP, and PMAD for each model. We found that the IMA(1,1) model had the smallest value on all three criteria, which means that for our data set IMA(1,1) was the best model. However, note that the values of these criteria are not as small as we would have wished; the smaller the value, the better the predictions. As seen in the final time series plot in Figure 14, our predictions are far from perfect. In nearly all cases our predictions overestimate the actual values, and in some cases, for example 2012, the overestimation is drastic. To further improve our models, we may need to try time series models other than those discussed in class. In addition, to better predict tornados in Illinois we may have wanted to break the data down into quarters of the year, since tornados are clearly more frequent in the spring and summer months; a different division of time would have given us a larger number of data points from which to create different time series models. Looking at the data set we did use, the large variance over time in the tornado counts could be due to the number of people actually out in Illinois counting them: in the early years of this data set, counts may be skewed downward because tornados were not tracked as thoroughly as they are today. In addition, the increase in the number of tornados over time could be due to global warming or other environmental effects. In the end, this analysis shows that the model chosen to represent this data was relevant, but could have been better. As stated before, the predictions were consistently overestimated. In the future, we would like to go back and test other potential models that were not discussed in this course in order to better predict the annual tornado count in Illinois.


Appendix A Reference for Model Building

A.1 Training Data Transformation Codes

* All code used for this project is appended at the very end of this paper.

A.2 Model Selection

Dickey-Fuller Test on diff(log(t.data))


Model Estimations

A.2.1 Residual Analysis Codes

Normality Test:
H0: data is normal
H1: data is not normal
Since the p-value in all three cases is > .05, we fail to reject H0.


Independence Test:
H0: data is independent
H1: data is not independent
Since the p-value in all three cases is > .05, we fail to reject H0.

Ljung-Box Test:
H0: data is independent (r1 = r2 = ... = rk = 0)
H1: data is not independent
Since the p-value in all three cases is > .05, we fail to reject H0.


A.3 Model Validation

Over fitting Models


A.4 Forecasting


Code:

### Illinois Tornado Annual Count Updated (Left out 6 data points)
library(TSA)
library(fUnitRoots)
data1 <- read.csv(file="IL Total Data.csv", header=FALSE, sep=",")
x1 <- data1[,2]
y1 <- data1[,1]
y_train <- y1[1:60]
y_test <- y1[61:66]
t.data <- ts(y_train, freq=1, start=c(1950,1))
t.data1 <- ts(y_test, freq=1, start=c(2010,1))
k.data <- ts(y1, freq=1, start=c(1950,1))

# Original time series plot
plot(t.data, ylab='Annual Tornados in IL', xlab='Time',
     main='Time Series Plot of Annual Tornados in IL', type='o')
plot(y=t.data, x=zlag(t.data), ylab='Tornado Count this Year',
     xlab='Previous Year Tornado Count',
     main='Scatterplot of # of IL Tornados vs Last Years # of IL Tornados')

# Log transform
plot(log(t.data), ylab='log(Annual Tornados in IL)', xlab='Time',
     main='Time Series Plot of Annual Tornados in IL', type='o')
acf(log(t.data))
pacf(log(t.data))
eacf(log(t.data))
# eacf output:
AR/MA 0 1 2 3 4 5 6 7 8 9 10 11 12 13
0     x x o o o o o o o o o  o  o  o
1     x o o o o o o o o o o  o  o  o
2     o o o o o o o o o o o  o  o  o
3     x o o o o o o o o o o  o  o  o
4     o o o o o o o o o o o  o  o  o
5     x o o o o o o o o o o  o  o  o
6     x o o o o o o o o o o  o  o  o
7     o o o o o o o o o o o  o  o  o

# First difference of the log
plot(diff(log(t.data)), ylab='Annual Tornados in IL', xlab='Time',
     main='Time Series Plot of Annual Tornados in IL Diff(log(t.data))', type='o')

# Test for stationarity
adfTest(diff(log(t.data)), lags=1, type=c("nc"))
adfTest(diff(log(t.data)), lags=1, type=c("c"))
adfTest(diff(log(t.data)), lags=1, type=c("ct"))


# Model building
acf(diff(log(t.data)))    # suggests MA(1)
pacf(diff(log(t.data)))   # suggests AR(1)
eacf(diff(log(t.data)))   # suggests AR(1)
# eacf output:
AR/MA 0 1 2 3 4 5 6 7 8 9 10 11 12 13
0     x o o o o o o o o o o  o  o  o
1     o o o o o o o o o o o  o  o  o
2     x o o o o o o o o o o  o  o  o
3     x o o o o o o o o o o  o  o  o
4     x o o o o o o o o o o  o  o  o
5     x x o o o o o o o o o  o  o  o
6     x o o o o o o o o o o  o  o  o
7     o o x x x o o o o o o  o  o  o

# Best subset suggests MA(1) as best, then ARMA(1,1)
sub1 <- armasubsets(diff(log(t.data)), nar=4, nma=4, y.name='test', ar.method='ols')
plot(sub1)

# Scatter plot comparison
par(mfrow = c(1, 3), pty = "s")
plot(y=diff(log(t.data)), x=zlag(diff(log(t.data))), ylab='Tornado Count this Year',
     xlab='Previous Year Tornado Count', main='Scatterplot of # IL Tornados')
abline(0,0)
plot(y=diff(log(t.data)), x=zlag(diff(log(t.data)), d=2), ylab='Tornado Count this Year',
     xlab='2 Years ago Tornado Count', main='Scatterplot of # of IL Tornados')
abline(0,0)
plot(y=diff(log(t.data)), x=zlag(diff(log(t.data)), d=3), ylab='Tornado Count this Year',
     xlab='3 Years ago Tornado Count', main='Scatterplot of # of IL Tornados')
abline(0,0)

# Fitting models
AR1 <- arima(log(t.data), order = c(1, 1, 0), xreg = NULL, include.mean = TRUE,
             init = NULL, method = 'ML')
MA1 <- arima(log(t.data), order = c(0, 1, 1), xreg = NULL, include.mean = TRUE,
             init = NULL, method = 'ML')
ARMA11 <- arima(log(t.data), order = c(1, 1, 1), xreg = NULL, include.mean = TRUE,
                init = NULL, method = 'ML')

# No outliers were detected.
detectAO(ARMA11)
detectIO(ARMA11)

# Residual analysis
tsdiag(AR1, gof=4, omit.initial=F)
tsdiag(MA1, gof=4, omit.initial=F)
tsdiag(ARMA11, gof=4, omit.initial=F)

# Normality
op <- par(mfrow = c(1, 3), pty = "s")
qqnorm(residuals(AR1), main='ARI(1,1) QQ Plot')
qqline(residuals(AR1), col='red')
qqnorm(residuals(MA1), main='IMA(1,1) QQ Plot')
qqline(residuals(MA1), col='red')
qqnorm(residuals(ARMA11), main='ARIMA(1,1,1) QQ Plot')
qqline(residuals(ARMA11), col='red')


# Formal testing
ks.test(residuals(AR1), "pnorm")
shapiro.test(residuals(AR1))
ks.test(residuals(MA1), "pnorm")
shapiro.test(residuals(MA1))
ks.test(residuals(ARMA11), "pnorm")
shapiro.test(residuals(ARMA11))

# Constant variance
op <- par(mfrow = c(1, 3), pty = "s")
plot(rstandard(AR1), ylab='Standardized residuals', main='ARI(1,1)', type='o')
abline(0, 0, col="red", lwd=2)
plot(rstandard(MA1), ylab='Standardized residuals', main='IMA(1,1)', type='o')
abline(0, 0, col="red", lwd=2)
plot(rstandard(ARMA11), ylab='Standardized residuals', main='ARIMA(1,1,1)', type='o')
abline(0, 0, col="red", lwd=2)

# Independence
# ACF plots
op <- par(mfrow = c(1, 3), pty = "s")
acf(residuals(AR1), main='Sample ACF of Residuals from ARI(1,1) Model')
acf(residuals(MA1), main='Sample ACF of Residuals from IMA(1,1) Model')
acf(residuals(ARMA11), main='Sample ACF of Residuals from ARIMA(1,1,1) Model')

# Ljung-Box
Box.test(residuals(AR1), lag=4, type="Ljung-Box", fitdf=1)
Box.test(residuals(MA1), lag=4, type="Ljung-Box", fitdf=1)
Box.test(residuals(ARMA11), lag=4, type="Ljung-Box", fitdf=2)

# Runs
runs(residuals(AR1))
runs(residuals(MA1))
runs(residuals(ARMA11))

# Over fitting / parameter redundancy
AR2 <- arima(log(t.data), order = c(2, 1, 0), xreg = NULL, include.mean = TRUE,
             init = NULL, method = 'ML')
MA2 <- arima(log(t.data), order = c(0, 1, 2), xreg = NULL, include.mean = TRUE,
             init = NULL, method = 'ML')
ARMA12 <- arima(log(t.data), order = c(1, 1, 2), xreg = NULL, include.mean = TRUE,
                init = NULL, method = 'ML')
ARMA21 <- arima(log(t.data), order = c(2, 1, 1), xreg = NULL, include.mean = TRUE,
                init = NULL, method = 'ML')

# Predictions / forecasting
# ARI(1,1) predictions
pred1 <- rep(NA, 6)
for (i in 1:6) {
  y_train <- y1[1:(61+i-1)]
  est_1 <- arima(log(y_train), order=c(1,1,0), method='ML')
  pred1[i] <- predict(est_1, n.ahead=1)$pred
}
t.pred1 <- ts(pred1, freq=1, start=c(2010,1))
t.pred1
Time Series:
Start = 2010
End = 2015
Frequency = 1
[1] 3.926734 4.112738 3.839367 3.759854 3.944391 4.077988


log(y_test)
[1] 3.891820 4.290459 3.465736 4.007333 3.891820 4.234107

plot(ts(log(y1)), type="o", main="Forecasting with ARI(1,1)", ylab="log(t.data)")
points(ts(pred1, start=c(61), frequency=1), col="red")
MSE = mean((log(y_test)-pred1)^2)
MAP = mean(abs((log(y_test)-pred1)/(log(y_test))))
PMAD = sum(abs(log(y_test)-pred1))/sum(abs(log(y_test)))

# IMA(1,1) predictions
pred1 <- rep(NA, 6)
for (i in 1:6) {
  y_train <- y1[1:(61+i-1)]
  est_1 <- arima(log(y_train), order=c(0,1,1), method='ML')
  pred1[i] <- predict(est_1, n.ahead=1)$pred
}
pred4 <- rep(NA, 6)
for (i in 1:6) {
  y_train <- y1[1:(61+i-1)]
  est_1 <- arima(log(y_train), order=c(1,1,0), method='ML')
  pred4[i] <- predict(est_1, n.ahead=1)$se
}
t.pred2 <- ts(pred1, freq=1, start=c(2010,1))
t.pred2
Time Series:
Start = 2010
End = 2015
Frequency = 1
[1] 3.884078 4.067417 3.799722 3.891222 3.891483 4.041335

log(y_test)
[1] 3.891820 4.290459 3.465736 4.007333 3.891820 4.234107

plot(ts(log(y1)), type="o", main="Forecasting with IMA(1,1)", ylab="log(t.data)")
points(ts(pred1, start=c(61), frequency=1), col="red")
MSE = mean((log(y_test)-pred1)^2)
MAP = mean(abs((log(y_test)-pred1)/(log(y_test))))
PMAD = sum(abs(log(y_test)-pred1))/sum(abs(log(y_test)))

# Based off of our predictions for ARI(1,1) and IMA(1,1), IMA(1,1) was the best model.

# Prediction intervals for IMA(1,1)
lower <- pred1 - qnorm(0.975,0,1)*pred4
upper <- pred1 + qnorm(0.975,0,1)*pred4
data.frame(Year=c(2010:2015), lower, upper)

# Transform model back
y_test
[1] 49 73 32 55 49 69
kk <- exp(pred1 + (1/2)*(pred4)^2)
kk
[1] 61.39905 73.55173 56.25562 61.43353 61.24097 70.94283


# 100(1-alpha)% prediction intervals
# Create lower and upper prediction interval bounds
lower1 <- pred1 - qnorm(0.975,0,1)*pred4
upper1 <- pred1 + qnorm(0.975,0,1)*pred4
data.frame(Years=c(1:66), lower1, upper1)

# Original 95% prediction intervals
data.frame(Years=c(1:66), exp(lower1), exp(upper1))
   Years exp.lower1. exp.upper1.
1      1    12.74597    185.4787
2      2    15.43213    221.0484
3      3    11.82101    168.9437
4      4    13.08390    183.2885
5      5    13.21798    181.5241
6      6    15.48158    209.1434
7      7    12.74597    185.4787
8      8    15.43213    221.0484
9      9    11.82101    168.9437
10    10    13.08390    183.2885
11    11    13.21798    181.5241
12    12    15.48158    209.1434
13    13    12.74597    185.4787
14    14    15.43213    221.0484
15    15    11.82101    168.9437
16    16    13.08390    183.2885
17    17    13.21798    181.5241
18    18    15.48158    209.1434
19    19    12.74597    185.4787
20    20    15.43213    221.0484
21    21    11.82101    168.9437
22    22    13.08390    183.2885
23    23    13.21798    181.5241
24    24    15.48158    209.1434
25    25    12.74597    185.4787
26    26    15.43213    221.0484
27    27    11.82101    168.9437
28    28    13.08390    183.2885
29    29    13.21798    181.5241
30    30    15.48158    209.1434
31    31    12.74597    185.4787
32    32    15.43213    221.0484
33    33    11.82101    168.9437
34    34    13.08390    183.2885
35    35    13.21798    181.5241
36    36    15.48158    209.1434
37    37    12.74597    185.4787
38    38    15.43213    221.0484
39    39    11.82101    168.9437
40    40    13.08390    183.2885
41    41    13.21798    181.5241
42    42    15.48158    209.1434
43    43    12.74597    185.4787
44    44    15.43213    221.0484
45    45    11.82101    168.9437
46    46    13.08390    183.2885
47    47    13.21798    181.5241
48    48    15.48158    209.1434
49    49    12.74597    185.4787
50    50    15.43213    221.0484
51    51    11.82101    168.9437
52    52    13.08390    183.2885
53    53    13.21798    181.5241
54    54    15.48158    209.1434
55    55    12.74597    185.4787
56    56    15.43213    221.0484
57    57    11.82101    168.9437
58    58    13.08390    183.2885
59    59    13.21798    181.5241
60    60    15.48158    209.1434
61    61    12.74597    185.4787
62    62    15.43213    221.0484
63    63    11.82101    168.9437
64    64    13.08390    183.2885
65    65    13.21798    181.5241
66    66    15.48158    209.1434

# Convert back to original TS plot, IMA(1,1)
plot(y1, ylab='Annual Tornados in IL', xlab='1950 - 2015',
     main='Time Series Plot of Annual Tornados in IL', type='o')
points(ts(kk, start=c(61), frequency=1), col="red", type='o')


References

[1] Edwards, R. (n.d.). The Online Tornado FAQ. Retrieved March 29, 2016, from http://www.spc.noaa.gov/faq/tornado/

[2] Storm Prediction Center WCM Page. (n.d.). Retrieved March 29, 2016, from http://www.spc.noaa.gov/wcm/#data, Severe Weather Database Files (1950-2015)