Panel Data


Three types of data are generally available for empirical analysis: time series, cross-section, and panel data.

In time series data we observe the values of one or more variables over a period of time (e.g., GDP for several quarters or years).

In cross-section data, values of one or more variables are collected for several sample units, or entities, at the same point in time.

In panel data the same cross-sectional unit (say a family or a firm or a state) is surveyed over time. In short, panel data have space as well as time dimensions.

There are other names for panel data, such as pooled data (pooling of time series and cross-sectional observations), combination of time series and cross-section data, micropanel data, longitudinal data (a study over time of a variable or group of subjects), event history analysis (e.g., studying the movement over time of subjects through successive states or conditions), and cohort analysis.

Panel data offer the following advantages:

1. Since panel data relate to individuals, firms, states, countries, etc., over time, there is bound to be heterogeneity in these units. The techniques of panel data estimation can take such heterogeneity explicitly into account by allowing for subject-specific variables, where “subject” is used generically to include micro units such as individuals, firms, states, and countries.

2. By combining time series of cross-section observations, panel data give “more informative data, more variability, less collinearity among variables, more degrees of freedom and more efficiency.”

3. By studying repeated cross sections of observations, panel data are better suited to study the dynamics of change. Spells of unemployment, job turnover, and labor mobility are better studied with panel data.

4. Panel data can better detect and measure effects that simply cannot be observed in pure cross-section or pure time series data. For example, the effects of minimum wage laws on employment and earnings can be better studied if we include successive waves of increases in the federal and/or state minimum wages.

5. Panel data enable us to study more complicated behavioral models. For example, phenomena such as economies of scale and technological change can be better handled by panel data than by pure cross-section or pure time series data.


6. By making data available for several thousand units, panel data can minimize the bias that might result if we aggregate individuals or firms into broad aggregates.

BALANCED PANEL: If each cross-sectional unit has the same number of time series observations, such a panel is called a balanced panel.

If the number of observations differs among panel members, we call such a panel an unbalanced panel.

In a short panel the number of cross-sectional subjects, N, is greater than the number of time periods, T.

In a long panel, it is T that is greater than N.
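As a concrete illustration, here is a minimal sketch in Python of how a balanced panel is typically arranged, with one row per (unit, time) pair; the firms, years, and variable names are hypothetical:

```python
import pandas as pd

# A balanced panel: 3 firms x 4 years = 12 rows, no gaps.
data = pd.DataFrame({
    "firm":   ["A"] * 4 + ["B"] * 4 + ["C"] * 4,
    "year":   [2018, 2019, 2020, 2021] * 3,
    "output": [10, 12, 13, 15, 8, 9, 11, 12, 20, 21, 23, 26],
    "labor":  [5, 6, 6, 7, 4, 4, 5, 5, 9, 9, 10, 11],
})

# Index by (entity, time): the standard layout for panel estimators.
panel = data.set_index(["firm", "year"])

# Balanced check: every firm has the same number of periods.
print(panel.groupby("firm").size())  # 4, 4, 4 -> balanced
```

Given data arranged this way, the main estimation options are the following: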

1. Pooled OLS model. We simply pool all of the observations and estimate a “grand” regression, neglecting the cross-section and time series nature of our data.

The simplest way is to pool all the observations together and run an OLS regression. However, the problem with this approach is that pooled OLS ignores the heterogeneity or individuality that exists among the different cross-sectional units.

2. The fixed effects least squares dummy variable (LSDV) model. Here we pool all of the observations, but allow each cross-sectional unit to have its own (intercept) dummy variable.

3. The fixed effects within-group model. Here also we pool all of the observations, but for each cross-sectional unit we express each variable as a deviation from its mean value and then estimate an OLS regression on such mean-corrected or “de-meaned” values.

4. The random effects model (REM). Unlike the LSDV model, in which we allow each cross-sectional unit to have its own (fixed) intercept value, here we assume that the intercept values are a random drawing from a much bigger population of units.
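To make these approaches concrete, here is a minimal sketch using the third-party linearmodels package on the hypothetical panel built above (the package choice is an assumption, not part of the original text; the within-group estimator is what PanelOLS computes internally for entity effects):

```python
import statsmodels.api as sm
from linearmodels.panel import PooledOLS, PanelOLS, RandomEffects

y = panel["output"]
X = sm.add_constant(panel[["labor"]])

# 1. Pooled OLS: one "grand" regression, ignoring the panel structure.
pooled = PooledOLS(y, X).fit()

# 2./3. Fixed effects: entity_effects=True gives each firm its own
# intercept (equivalent to LSDV; estimated via the within transformation).
fixed = PanelOLS(y, panel[["labor"]], entity_effects=True).fit()

# 4. Random effects: firm intercepts treated as random draws.
random = RandomEffects(y, X).fit()

print(pooled.params, fixed.params, random.params, sep="\n")
```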

Stationarity


A stochastic process is said to be stationary if its mean and variance are constant over time and the value of the covariance between the two time periods depends only on the distance or gap or lag between the two time periods and not the actual time at which the covariance is computed.
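In symbols, letting Yt denote the series, these conditions can be written as:

Mean: E(Yt) = μ
Variance: var(Yt) = E(Yt − μ)² = σ²
Covariance: γk = E[(Yt − μ)(Yt+k − μ)]

where the autocovariance γk depends only on the lag k and not on the time t at which it is measured.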

Such a stochastic process is known as a weakly stationary, covariance stationary, second-order stationary, or wide-sense stationary stochastic process.

In short, if a time series is stationary, its mean, variance, and autocovariance (at various lags) remain the same no matter at what point we measure them; that is, they are time invariant. Such a time series will tend to return to its mean (called mean reversion), and fluctuations around this mean (measured by its variance) will have a broadly constant amplitude.

If a time series is not stationary in the sense just defined, it is called a nonstationary time series (keep in mind we are talking only about weak stationarity). In other words, a nonstationary time series will have a time-varying mean or a time-varying variance, or both.

A special case is a purely random, or white noise, process. We call a stochastic process purely random if it has zero mean, constant variance σ², and is serially uncorrelated.

RANDOM WALK MODEL: Although our interest is in stationary time series, one often encounters nonstationary time series, the classic example being the random walk model (RWM). It is often said that asset prices, such as stock prices or exchange rates, follow a random walk; that is, they are nonstationary. We distinguish two types of random walks: (1) random walk without drift (i.e., no constant or intercept term) and (2) random walk with drift (i.e., a constant term is present).
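A short simulation makes the contrast visible; a minimal sketch comparing a white noise process with random walks with and without drift (parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
T = 500
u = rng.normal(0.0, 1.0, T)       # white noise: zero mean, constant variance

white_noise = u                   # stationary
rw = np.cumsum(u)                 # random walk without drift: Y_t = Y_{t-1} + u_t
rw_drift = np.cumsum(0.5 + u)     # random walk with drift: Y_t = 0.5 + Y_{t-1} + u_t

# The variance of a random walk grows with t (nonstationary),
# while the white noise variance stays roughly constant.
print(white_noise[:250].var(), white_noise[250:].var())
print(rw[:250].var(), rw[250:].var())
```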

UNIT ROOT TEST: A unit root test tests whether a time series variable is nonstationary using an autoregressive model. A well-known test that is valid in large samples is the augmented Dickey–Fuller (ADF) test. Optimal finite-sample tests for a unit root in autoregressive models were developed by Denis Sargan and Alok Bhargava. Another test is the Phillips–Perron test. These tests use the existence of a unit root as the null hypothesis.
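As an illustration, the ADF test is available in statsmodels; a minimal sketch applied to the simulated series above (null hypothesis: a unit root is present):

```python
from statsmodels.tsa.stattools import adfuller

for name, series in [("white noise", white_noise), ("random walk", rw)]:
    stat, pvalue, *_ = adfuller(series)
    print(f"{name}: ADF stat = {stat:.2f}, p-value = {pvalue:.3f}")
# Expect a tiny p-value for white noise (reject the unit root) and a
# large one for the random walk (cannot reject the unit root).
```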


1. Regression analysis based on time series data implicitly assumes that the underlying time series are stationary. The classical t tests, F tests, etc. are based on this assumption.

2. In practice most economic time series are nonstationary.

3. A stochastic process is said to be weakly stationary if its mean, variance, and autocovariances are constant over time (i.e., they are time invariant).

4. At the informal level, weak stationarity can be tested by the correlogram of a time series, which is a graph of autocorrelation at various lags. For stationary time series, the correlogram tapers off quickly, whereas for nonstationary time series it dies off gradually. For a purely random series, the autocorrelations at all lags 1 and greater are zero (see the sketch after this list).

5. At the formal level, stationarity can be checked by finding out if the time series contains a unit root. The Dickey–Fuller (DF) and augmented Dickey–Fuller (ADF) tests can be used for this purpose.

6. An economic time series can be trend stationary (TS) or difference stationary (DS). A TS time series has a deterministic trend, whereas a DS time series has a variable, or stochastic, trend. The common practice of including the time or trend variable in a regression model to detrend the data is justifiable only for TS time series. The DF and ADF tests can be applied to determine whether a time series is TS or DS.

7. Regression of one time series variable on one or more time series variables often can give nonsensical or spurious results. This phenomenon is known as spurious regression. One way to guard against it is to find out if the time series are cointegrated.

8. Cointegration means that despite being individually nonstationary, a linear combination of two or more time series can be stationary. The EG, AEG, and CRDW tests can be used to find out if two or more time series are cointegrated (also illustrated in the sketch after this list).

9. Cointegration of two (or more) time series suggests that there is a long-run, or equilibrium, relationship between them.

10. The error correction mechanism (ECM) developed by Engle and Granger is a means of reconciling the short-run behavior of an economic variable with its long-run behavior.

11. The field of time series econometrics is evolving. The established results and tests are in some cases tentative and a lot more work remains. An important question that needs an answer is why some economic time series are stationary and some are nonstationary.
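As referenced in points 4 and 8, a minimal sketch of the correlogram and of the Engle–Granger (EG) cointegration test with statsmodels, reusing the simulated series from the random walk sketch above:

```python
from statsmodels.tsa.stattools import acf, coint

# Point 4: the correlogram. Stationary series: autocorrelations taper
# off quickly; random walk: they die off only gradually.
print(acf(white_noise, nlags=10).round(2))
print(acf(rw, nlags=10).round(2))

# Point 8: Engle-Granger test. Null hypothesis: no cointegration.
# x and y are individually nonstationary but share the same random walk,
# so a linear combination of them is stationary.
x = rw
y = 2.0 + 0.5 * rw + rng.normal(0.0, 0.5, T)
stat, pvalue, _ = coint(y, x)
print(f"Engle-Granger stat = {stat:.2f}, p-value = {pvalue:.3f}")
```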

Forecasting

• In general, forecasting is the act of predicting the future

• In econometrics, forecasting is the estimation of the expected value of a dependent variable for observations that are not part of the same data set

• In most forecasts, the values being predicted are for time periods in the future, but cross-sectional predictions of values for countries or people not in the sample are also common

• To simplify terminology, the words prediction and forecast will be used interchangeably in this chapter

– Some authors limit the use of the word forecast to out-of-sample prediction for a time series


• Econometric forecasting generally uses a single linear equation to predict or forecast

• Our use of such an equation to make a forecast can be summarized into two steps:

1. Specify and estimate an equation that has as its dependent variable the item that we wish to forecast, for example Ŷt = β̂0 + β̂1·Xt

2. Obtain values for each of the independent variables for the observations for which we want a forecast and substitute them into our forecasting equation, giving ŶT+1 = β̂0 + β̂1·XT+1
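A minimal sketch of these two steps with statsmodels; the numbers are simulated placeholders, not real data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(10, 2, 40)
y = 3.0 + 1.5 * x + rng.normal(0, 1, 40)

# Step 1: specify and estimate the forecasting equation.
X = sm.add_constant(x)
model = sm.OLS(y, X).fit()

# Step 2: substitute the out-of-sample X value into the equation.
x_new = 12.0
y_hat = model.params[0] + model.params[1] * x_new
print(f"point forecast: {y_hat:.2f}")
```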

• The forecasts generated in the previous section are quite simple, however, and most actual forecasting involves one or more additional questions—for example:

1. Unknown Xs: It is unrealistic to expect to know the values for the independent variables outside the sample

• What happens when we don’t know the values of the independent variables for the forecast period?

2. Serial Correlation: If there is serial correlation involved, the forecasting equation may be estimated with GLS

• How should predictions be adjusted when forecasting equations are estimated with GLS?

3. Confidence Intervals: All the previous forecasts were single values, but such single values are almost never exactly right, so it may be more helpful to forecast a confidence interval instead

• How can we develop these confidence intervals?

4. Simultaneous Equations Models: Many economic and business equations are part of simultaneous models

• How can we use an independent variable to forecast a dependent variable when we know that a change in the value of the dependent variable will change, in turn, the value of the independent variable that we used to make the forecast?

• Conditional Forecasting (Unknown X Values for the Forecast Period)

• Unconditional forecast: all values of the independent variables are known with certainty

– This is rare in practice


• Conditional forecast: actual values of one or more of the independent variables are not known

– This is the more common type of forecast

• The careful selection of independent variables can sometimes help avoid the need for conditional forecasting

• This opportunity can arise when the dependent variable can be expressed as a function of leading indicators:

– A leading indicator is an independent variable the movements of which anticipate movements in the dependent variable

– The best known leading indicator, the Index of Leading Economic Indicators, is produced each month

• The techniques we use to test hypotheses can also be adapted to create forecasting confidence intervals

• Given a point forecast, all we need to generate a confidence interval around that forecast are tc, the critical t-value (for the desired level of confidence), and SF, the estimated standard error of the forecast; the interval is then ŶT+1 ± tc·SF

• The critical t-value, tc, can be found in the statistical tables (for a two-tailed test with T − K − 1 degrees of freedom)

• Lastly, the standard error of the forecast, SF, for an equation with just one independent variable, equals the square root of the forecast error variance:

SF = √( s²·[ 1 + 1/T + (XT+1 − X̄)² / Σ(Xt − X̄)² ] )

where:
s² = the estimated variance of the error term
T = the number of observations in the sample
XT+1 = the forecasted value of the single independent variable
X̄ = the arithmetic mean of the observed X’s in the sample
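Continuing the simulated example above, a minimal sketch that computes SF and the resulting interval (the 95% level is an arbitrary choice):

```python
from scipy import stats

# Forecast error variance for one regressor:
# s^2 * [1 + 1/T + (X_{T+1} - Xbar)^2 / sum((X_t - Xbar)^2)]
T = len(x)
s2 = model.mse_resid                       # estimated error variance
xbar = x.mean()
var_f = s2 * (1 + 1/T + (x_new - xbar)**2 / ((x - xbar)**2).sum())
se_f = var_f ** 0.5

t_c = stats.t.ppf(0.975, df=T - 2)         # two-tailed, T - K - 1 = T - 2
print(f"95% interval: {y_hat - t_c*se_f:.2f} to {y_hat + t_c*se_f:.2f}")
```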

ARIMA

• ARIMA is a highly refined curve-fitting device that uses current and past values of the dependent variable to produce often accurate short-term forecasts of that variable


– Examples of such forecasts are stock market price predictions created by brokerage analysts (called “chartists” or “technicians”) based entirely on past patterns of movement of the stock prices

• If ARIMA models thus essentially ignore economic theory (by ignoring “traditional” explanatory variables), why use them?

• The use of ARIMA is appropriate when:

– little or nothing is known about the dependent variable being forecasted,

– the independent variables known to be important cannot be forecasted effectively, or

– all that is needed is a one- or two-period forecast

– The ARIMA approach combines two different specifications (called processes) into one equation:

1. An autoregressive process (AR): expresses a dependent variable as a function of past values of the dependent variable. This is similar to the serial correlation error term function and to the dynamic model.

2. A moving average process (MA): expresses a dependent variable as a function of past values of the error term. Such a function is a moving average of past error-term observations that can be added to the mean of Y to obtain a moving average of past values of Y.
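A minimal sketch of fitting an ARIMA model and producing the one- or two-period forecast mentioned above, using statsmodels on the simulated drifting random walk; the order (1, 1, 1) is an arbitrary illustration, not a recommendation:

```python
from statsmodels.tsa.arima.model import ARIMA

# ARIMA(p=1, d=1, q=1): AR(1) and MA(1) terms on the first-differenced series.
arima = ARIMA(rw_drift, order=(1, 1, 1)).fit()

# One- and two-period-ahead forecasts.
print(arima.forecast(steps=2))
```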


An ARCH (AUTOREGRESSIVE CONDITIONALLY HETEROSCEDASTIC) model is a model for the variance of a time series.  ARCH models are used to describe a changing, possibly volatile variance.  Although an ARCH model could possibly be used to describe a gradually increasing variance over time, most often it is used in situations in which there may be short periods of increased variation.  (Gradually increasing variance connected to a gradually increasing mean level might be better handled by transforming the variable.)

ARCH models were created in the context of econometric and finance problems having to do with the amount that investments or stocks increase (or decrease) per time period, so there’s a tendency to describe them as models for that type of variable.


An ARCH model could be used for any series that has periods of increased or decreased variance. 

A GARCH (GENERALIZED AUTOREGRESSIVE CONDITIONALLY HETEROSCEDASTIC) model uses values of the past squared observations and past variances to model the variance at time t.  As an example, a GARCH(1,1) is

σ²ₜ = α₀ + α₁·y²ₜ₋₁ + β₁·σ²ₜ₋₁

In the GARCH notation, the first subscript refers to the order of the y2 terms on the right side, and the second subscript refers to the order of the σ2 terms.
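A minimal sketch of fitting a GARCH(1,1) with the third-party arch package (the package choice and the simulated returns are assumptions, not part of the original text):

```python
import numpy as np
from arch import arch_model

# Simulated "returns" with a short burst of higher variance in the middle.
rng = np.random.default_rng(1)
returns = rng.normal(0, 1, 1000)
returns[400:500] *= 3            # short period of increased variation

# GARCH(1,1): variance driven by last squared shock and last variance.
res = arch_model(returns, vol="GARCH", p=1, q=1).fit(disp="off")
print(res.params)                # omega (alpha_0), alpha[1], beta[1]
```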

VAR models (VECTOR AUTOREGRESSIVE MODELS) are used for multivariate time series. The structure is that each variable is a linear function of past lags of itself and past lags of the other variables.
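A minimal sketch of a two-variable VAR with statsmodels, where each variable is a linear function of past lags of itself and of the other variable (the coefficients and lag order are arbitrary illustrations):

```python
import numpy as np
from statsmodels.tsa.api import VAR

# Two stationary series where each depends on past lags of both.
rng = np.random.default_rng(2)
e = rng.normal(0, 1, (300, 2))
data = np.zeros((300, 2))
for t in range(1, 300):
    data[t, 0] = 0.5 * data[t-1, 0] + 0.2 * data[t-1, 1] + e[t, 0]
    data[t, 1] = 0.1 * data[t-1, 0] + 0.4 * data[t-1, 1] + e[t, 1]

results = VAR(data).fit(2)           # each equation: lags of both variables
print(results.forecast(data[-2:], steps=3))
```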

Uses of Dummy Variable