Econometrica, Vol. 70, No. 1 (January, 2002), 191–221

DETERMINING THE NUMBER OF FACTORS IN APPROXIMATE FACTOR MODELS

By Jushan Bai and Serena Ng1

In this paper we develop some econometric theory for factor models of large dimensions. The focus is the determination of the number of factors (r), which is an unresolved issue in the rapidly growing literature on multifactor models. We first establish the convergence rate for the factor estimates that will allow for consistent estimation of r. We then propose some panel criteria and show that the number of factors can be consistently estimated using the criteria. The theory is developed under the framework of large cross-sections (N) and large time dimensions (T). No restriction is imposed on the relation between N and T. Simulations show that the proposed criteria have good finite sample properties in many configurations of the panel data encountered in practice.

Keywords: Factor analysis, asset pricing, principal components, model selection.

1. Introduction

The idea that variations in a large number of economic variables can be modeled by a small number of reference variables is appealing and is used in many economic analyses. For example, asset returns are often modeled as a function of a small number of factors. Stock and Watson (1989) used one reference variable to model the comovements of four main macroeconomic aggregates. Cross-country variations are also found to have common components; see Gregory and Head (1999) and Forni, Hallin, Lippi, and Reichlin (2000b). More recently, Stock and Watson (1999) showed that the forecast error of a large number of macroeconomic variables can be reduced by including diffusion indexes, or factors, in structural as well as nonstructural forecasting models. In demand analysis, Engel curves can be expressed in terms of a finite number of factors. Lewbel (1991) showed that if a demand system has one common factor, budget shares should be independent of the level of income. In such a case, the number of factors is an object of economic interest since, if more than one factor is found, homothetic preferences can be rejected. Factor analysis also provides a convenient way to study the aggregate implications of microeconomic behavior, as shown in Forni and Lippi (1997).

Central to both the theoretical and the empirical validity of factor models is the correct specification of the number of factors. To date, this crucial parameter

1 We thank three anonymous referees for their very constructive comments, which led to a much improved presentation. The first author acknowledges financial support from the National Science Foundation under Grant SBR-9709508. We would like to thank participants in the econometrics seminars at Harvard-MIT, Cornell University, the University of Rochester, and the University of Pennsylvania for helpful suggestions and comments. Remaining errors are our own.


is often assumed rather than determined by the data.2 This paper develops a formal statistical procedure that can consistently estimate the number of factors from observed data. We demonstrate that the penalty for overfitting must be a function of both N and T (the cross-section dimension and the time dimension, respectively) in order to consistently estimate the number of factors. Consequently the usual AIC and BIC, which are functions of N or T alone, do not work when both dimensions of the panel are large. Our theory is developed under the assumption that both N and T converge to infinity. This flexibility is of empirical relevance because the time dimension of datasets relevant to factor analysis, although small relative to the cross-section dimension, is too large to justify the assumption of a fixed T.

A small number of papers in the literature have also considered the problem of determining the number of factors, but the present analysis differs from these works in important ways. Lewbel (1991) and Donald (1997) used the rank of a matrix to test for the number of factors, but these theories assume that either N or T is fixed. Cragg and Donald (1997) considered the use of information criteria when the factors are functions of a set of observable explanatory variables, but the data still have a fixed dimension. For large dimensional panels, Connor and Korajczyk (1993) developed a test for the number of factors in asset returns, but their test is derived under sequential limit asymptotics, i.e., N converges to infinity with a fixed T and then T converges to infinity. Furthermore, because their test is based on a comparison of variances over different time periods, covariance stationarity and homoskedasticity are not only technical assumptions, but are crucial for the validity of their test. Under the assumption that N → ∞ for fixed T, Forni and Reichlin (1998) suggested a graphical approach to identify the number of factors, but no theory is available. Assuming N, T → ∞ with √N/T → ∞, Stock and Watson (1998) showed that a modification to the BIC can be used to select the number of factors optimal for forecasting a single series. Their criterion is restrictive not only because it requires N ≫ T, but also because there can be factors that are pervasive for a set of data and yet have no predictive ability for an individual data series. Thus, their rule may not be appropriate outside of the forecasting framework. Forni, Hallin, Lippi, and Reichlin (2000a) suggested a multivariate variant of the AIC, but neither the theoretical nor the empirical properties of the criterion are known.

We set up the determination of factors as a model selection problem. In consequence, the proposed criteria depend on the usual trade-off between good fit and parsimony. However, the problem is nonstandard not only because account needs to be taken of the sample size in both the cross-section and the time-series dimensions, but also because the factors are not observed. The theory we developed does not rely on sequential limits, nor does it impose any restrictions between N and T. The results hold under heteroskedasticity in both the time and

2 Lehmann and Modest (1988), for example, tested the APT for 5, 10, and 15 factors. Stock and Watson (1989) assumed there is one factor underlying the coincident index. Ghysels and Ng (1998) tested the affine term structure model assuming two factors.


the cross-section dimensions. The results also hold under weak serial and cross-section dependence. Simulations show that the criteria have good finite sample properties.

The rest of the paper is organized as follows. Section 2 sets up the preliminaries and introduces notation and assumptions. Estimation of the factors is considered in Section 3, and the estimation of the number of factors is studied in Section 4. Specific criteria are considered in Section 5, and their finite sample properties are considered in Section 6, along with an empirical application to asset returns. Concluding remarks are provided in Section 7. All the proofs are given in the Appendix.

2. Factor Models

Let X_it be the observed data for the ith cross-section unit at time t, for i = 1, ..., N and t = 1, ..., T. Consider the following model:

    X_it = λ_i′ F_t + e_it,    (1)

where F_t is a vector of common factors, λ_i is a vector of factor loadings associated with F_t, and e_it is the idiosyncratic component of X_it. The product λ_i′ F_t is called the common component of X_it. Equation (1) is then the factor representation of the data. Note that the factors, their loadings, and the idiosyncratic errors are not observable.

Factor analysis allows for dimension reduction and is a useful statistical tool. Many economic analyses fit naturally into the framework given by (1).

1. Arbitrage pricing theory. In the finance literature, the arbitrage pricing theory (APT) of Ross (1976) assumes that a small number of factors can be used to explain a large number of asset returns. In this case, X_it represents the return of asset i at time t, F_t represents the vector of factor returns, and e_it is the idiosyncratic component of returns. Although analytical convenience makes it appealing to assume one factor, there is growing evidence against the adequacy of a single factor in explaining asset returns.3 The shifting interest towards use of multifactor models inevitably calls for a formal procedure to determine the number of factors. The analysis to follow allows the number of factors to be determined even when N and T are both large. This is especially suited for financial applications where data are widely available for a large number of assets over an increasingly long span. Once the number of factors is determined, the factor returns F_t can also be consistently estimated (up to an invertible transformation).

2. The rank of a demand system. Let p be a price vector for J goods and services, and let e_h be total spending on the J goods by household h. Consumer theory postulates that the Marshallian demand for good j by consumer h is X_jh = g_j(p, e_h). Let w_jh = X_jh/e_h be the budget share for household h on the jth good. The

3 Cochrane (1999) stressed that financial economists now recognize that there are multiple sources of risk, or factors, that give rise to high returns. Backus, Foresi, Mozumdar, and Wu (1997) made similar conclusions in the context of the market for foreign assets.


rank of a demand system, holding prices fixed, is the smallest integer r such that w_j(e) = λ_j1 G_1(e) + ··· + λ_jr G_r(e). Demand systems are of the form (1), where the r factors, common across goods, are F_h = (G_1(e_h), ..., G_r(e_h))′. When the number of households, H, converges to infinity with a fixed J, the functions G_1(e), ..., G_r(e) can be estimated simultaneously, such as by the nonparametric methods developed in Donald (1997). This approach will not work when the number of goods, J, also converges to infinity. However, the theory to be developed in this paper will still provide a consistent estimate of r, without the need for nonparametric estimation of the G(·) functions. Once the rank of the demand system is determined, the nonparametric functions evaluated at e_h allow F_h to be consistently estimated (up to a transformation). The functions G_1(e), ..., G_r(e) may then be recovered (also up to a matrix transformation) from F_h, h = 1, ..., H, via nonparametric estimation.

3. Forecasting with diffusion indices. Stock and Watson (1998, 1999) considered forecasting inflation with diffusion indices ("factors") constructed from a large number of macroeconomic series. The underlying premise is that these series may be driven by a small number of unobservable factors. Consider the forecasting equation for a scalar series:

    y_{t+1} = α′F_t + β′W_t + ε_{t+1}.

The variables W_t are observable. Although we do not observe F_t, we observe X_it, i = 1, ..., N. Suppose X_it is related to F_t as in (1). In the present context, we interpret (1) as the reduced-form representation of X_it in terms of the unobservable factors. We can first estimate F_t from (1); denote the estimate by F̂_t. We can then regress y_t on F̂_{t−1} and W_{t−1} to obtain the coefficient estimates α̂ and β̂, from which the forecast

    ŷ_{T+1|T} = α̂′F̂_T + β̂′W_T

can be formed. Stock and Watson (1998, 1999) showed that this approach to forecasting outperforms many competing forecasting methods. But as pointed out earlier, the dimension of F̂ in Stock and Watson (1998, 1999) was determined using a criterion that minimizes the mean squared forecast errors of y. This may not be the same as the number of factors underlying X_it, which is the focus of this paper.
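To make the two-step procedure concrete, here is a minimal sketch in Python (our illustration, not code from the paper): it extracts k principal-component factors from the panel, consistent with the estimator of Section 3 below, and forms the forecast ŷ_{T+1|T}. The function name and data layout are our assumptions.

```python
import numpy as np

def diffusion_index_forecast(X, y, W, k):
    """Two-step diffusion-index forecast (illustrative sketch).
    X: T x N panel, y: length-T target series, W: T x m observables, k: factors.
    Step 1: estimate factors; Step 2: regress y_{t+1} on (F_t, W_t); forecast y_{T+1}."""
    T, N = X.shape
    # sqrt(T) times the top-k eigenvectors of the T x T matrix XX' (see Section 3)
    eigval, eigvec = np.linalg.eigh(X @ X.T)
    F = np.sqrt(T) * eigvec[:, ::-1][:, :k]      # T x k estimated factors
    # Regress y_{t+1} on (F_t, W_t) for t = 1, ..., T-1
    Z = np.hstack([F[:-1], W[:-1]])
    coef, *_ = np.linalg.lstsq(Z, y[1:], rcond=None)
    alpha, beta = coef[:k], coef[k:]
    # Forecast from the period-T regressors
    return F[-1] @ alpha + W[-1] @ beta
```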

2.1. Notation and Preliminaries

Let F_t^0, λ_i^0, and r denote the true common factors, the factor loadings, and the true number of factors, respectively. Note that F_t^0 is r-dimensional. We assume that r does not depend on N or T. At a given t, we have

     X_t   =   Λ^0 F_t^0   +   e_t,    (2)
   (N×1)     (N×r)(r×1)      (N×1)

where X_t = (X_1t, X_2t, ..., X_Nt)′, Λ^0 = (λ_1^0, λ_2^0, ..., λ_N^0)′, and e_t = (e_1t, e_2t, ..., e_Nt)′. Our objective is to determine the true number of factors, r. In classical


factor analysis (e.g., Anderson (1984)), N is assumed fixed, the factors are independent of the errors e_t, and the covariance of e_t is diagonal. Normalizing the covariance matrix of F_t to be an identity matrix, we have Σ = Λ^0 Λ^0′ + Ω, where Σ and Ω are the covariance matrices of X_t and e_t, respectively. Under these assumptions, a root-T consistent and asymptotically normal estimator of Σ, say, the sample covariance matrix Σ̂ = (1/T) Σ_{t=1}^T (X_t − X̄)(X_t − X̄)′, can be obtained. The essentials of classical factor analysis carry over to the case of large N but fixed T, since the N×N problem can be turned into a T×T problem, as noted by Connor and Korajczyk (1993) and others.

Inference on r under classical assumptions can, in theory, be based on the eigenvalues of Σ̂, since a characteristic of a panel of data that has an r-factor representation is that the first r largest population eigenvalues of the N×N covariance of X_t diverge as N increases to infinity, but the (r+1)th eigenvalue is bounded; see Chamberlain and Rothschild (1983). But it can be shown that all nonzero sample eigenvalues (not just the first r) of the matrix Σ̂ increase with N, and a test based on the sample eigenvalues is thus not feasible. A likelihood ratio test can also, in theory, be used to select the number of factors if, in addition, normality of e_t is assumed. But as found by Dhrymes, Friend, and Gultekin (1984), the number of statistically significant factors determined by the likelihood ratio test increases with N even if the true number of factors is fixed. Other methods have also been developed to estimate the number of factors assuming the size of one dimension is fixed. But Monte Carlo simulations in Cragg and Donald (1997) show that these methods tend to perform poorly for moderately large N and T. The fundamental problem is that the theory developed for classical factor models does not apply when both N and T → ∞. This is because consistent estimation of Σ (whether it is an N×N or a T×T matrix) is not a well defined problem. For example, when N > T, the rank of Σ̂ is no more than T, whereas the rank of Σ can always be N. New theories are thus required to analyze large dimensional factor models.

In this paper, we develop asymptotic results for consistent estimation of the number of factors when N and T → ∞. Our results complement the sparse but growing literature on large dimensional factor analysis. Forni and Lippi (2000) and Forni et al. (2000a) obtained general results for dynamic factor models, while Stock and Watson (1998) provided some asymptotic results in the context of forecasting. As in these papers, we allow for cross-section and serial dependence. In addition, we also allow for heteroskedasticity in e_t and some weak dependence between the factors and the errors. These latter generalizations are new in our analysis. Evidently, our assumptions are more general than those used when the sample size is fixed in one dimension.

Let X_i be a T×1 vector of time-series observations for the ith cross-section unit. For a given i, we have

     X_i   =   F^0 λ_i^0   +   e_i,    (3)
   (T×1)     (T×r)(r×1)      (T×1)


where X_i = (X_i1, X_i2, ..., X_iT)′, F^0 = (F_1^0, F_2^0, ..., F_T^0)′, and e_i = (e_i1, e_i2, ..., e_iT)′. For the panel of data X = (X_1, ..., X_N), we have

      X   =   F^0 Λ^0′   +   e,    (4)
   (T×N)    (T×r)(r×N)     (T×N)

with e = (e_1, ..., e_N).

Let tr(A) denote the trace of A. The norm of the matrix A is then ‖A‖ = [tr(A′A)]^{1/2}. The following assumptions are made:

Assumption A—Factors: E‖F_t^0‖^4 < ∞ and T^{-1} Σ_{t=1}^T F_t^0 F_t^0′ → Σ_F as T → ∞ for some positive definite matrix Σ_F.

Assumption B—Factor Loadings: ‖λ_i‖ ≤ λ̄ < ∞, and ‖Λ^0′Λ^0/N − D‖ → 0 as N → ∞ for some r×r positive definite matrix D.

Assumption C—Time and Cross-Section Dependence and Heteroskedasticity: There exists a positive constant M < ∞, such that for all N and T:
1. E(e_it) = 0, E|e_it|^8 ≤ M;
2. E(e_s′e_t/N) = E(N^{-1} Σ_{i=1}^N e_is e_it) = γ_N(s,t), |γ_N(s,s)| ≤ M for all s, and T^{-1} Σ_{s=1}^T Σ_{t=1}^T |γ_N(s,t)| ≤ M;
3. E(e_it e_jt) = τ_ij,t with |τ_ij,t| ≤ |τ_ij| for some τ_ij and for all t; in addition, N^{-1} Σ_{i=1}^N Σ_{j=1}^N |τ_ij| ≤ M;
4. E(e_it e_js) = τ_ij,ts and (NT)^{-1} Σ_{i=1}^N Σ_{j=1}^N Σ_{t=1}^T Σ_{s=1}^T |τ_ij,ts| ≤ M;
5. for every (t,s), E|N^{-1/2} Σ_{i=1}^N [e_is e_it − E(e_is e_it)]|^4 ≤ M.

Assumption D—Weak Dependence between Factors and Idiosyncratic Errors:

    E( N^{-1} Σ_{i=1}^N ‖ T^{-1/2} Σ_{t=1}^T F_t^0 e_it ‖² ) ≤ M.

Assumption A is standard for factor models. Assumption B ensures that each factor has a nontrivial contribution to the variance of X_t. We only consider nonrandom factor loadings for simplicity. Our results still hold when the λ_i are random, provided they are independent of the factors and idiosyncratic errors, and E‖λ_i‖^4 ≤ M. Assumption C allows for limited time-series and cross-section dependence in the idiosyncratic component. Heteroskedasticity in both the time and cross-section dimensions is also allowed. Under stationarity in the time dimension, γ_N(s,t) = γ_N(s−t), though the condition is not necessary. Given Assumption C1, the remaining assumptions in C are easily satisfied if the e_it are independent for all i and t. The allowance for some correlation in the idiosyncratic components sets up the model to have an approximate factor structure. It is more general than a strict factor model, which assumes e_it is uncorrelated across i, the framework on which the APT theory of Ross (1976) is based. Thus, the results to be developed will also apply to strict factor models. When the factors


and idiosyncratic errors are independent (a standard assumption for conventional factor models), Assumption D is implied by Assumptions A and C. Independence is not required for D to be true. For example, suppose that e_it = ε_it‖F_t‖, with ε_it independent of F_t and satisfying Assumption C; then Assumption D holds. Finally, the developments proceed assuming that the panel is balanced. We also note that the model being analyzed is static, in the sense that X_it has a contemporaneous relationship with the factors. The analysis of dynamic models is beyond the scope of this paper.

For a factor model to be an approximate factor model in the sense of Chamberlain and Rothschild (1983), the largest eigenvalue (and hence all of the eigenvalues) of the N×N covariance matrix Ω = E(e_t e_t′) must be bounded. Note that Chamberlain and Rothschild focused on the cross-section behavior of the model and did not make explicit assumptions about the time-series behavior of the model. Our framework allows for serial correlation and heteroskedasticity and is more general than their setup. But if we assume e_t is stationary with E(e_it e_jt) = τ_ij, then from matrix theory, the largest eigenvalue of Ω is bounded by max_i Σ_{j=1}^N |τ_ij|. Thus if we assume Σ_{j=1}^N |τ_ij| ≤ M for all i and all N, which implies Assumption C3, then (2) will be an approximate factor model in the sense of Chamberlain and Rothschild.

3. Estimation of the Common Factors

When N is small, factor models are often expressed in state space form, normality is assumed, and the parameters are estimated by maximum likelihood. For example, Stock and Watson (1989) used N = 4 variables to estimate one factor, the coincident index. The drawback is that, because the number of parameters increases with N,4 computational difficulties make it necessary to abandon information on many series even though they are available.

We estimate common factors in large panels by the method of asymptotic principal components.5 The number of factors that can be estimated by this (nonparametric) method is min{N,T}, much larger than permitted by estimation of state space models. But to determine which of these factors are statistically important, it is necessary to first establish consistency of all the estimated common factors when both N and T are large. We start with an arbitrary number k (k < min{N,T}). The superscript in λ_i^k and F_t^k signifies the allowance of k factors in the estimation. Estimates of Λ^k and F^k are obtained by solving the optimization problem

    V(k) = min_{Λ,F^k} (NT)^{-1} Σ_{i=1}^N Σ_{t=1}^T (X_it − λ_i^k′ F_t^k)²

4 Gregory, Head, and Raynauld (1997) estimated a world factor and seven country-specific factors from output, consumption, and investment for each of the G7 countries. The exercise involved estimation of 92 parameters and perhaps stretched the state-space model to its limit.
5 The method of asymptotic principal components was studied by Connor and Korajczyk (1986) and Connor and Korajczyk (1988) for fixed T. Forni et al. (2000a) and Stock and Watson (1998) considered the method for large T.


subject to the normalization of either Λ^k′Λ^k/N = I_k or F^k′F^k/T = I_k. If we concentrate out Λ^k and use the normalization that F^k′F^k/T = I_k, the optimization problem is identical to maximizing tr(F^k′(XX′)F^k). The estimated factor matrix, denoted by F̃^k, is √T times the eigenvectors corresponding to the k largest eigenvalues of the T×T matrix XX′. Given F̃^k, Λ̃^k′ = (F̃^k′F̃^k)^{-1}F̃^k′X = F̃^k′X/T is the corresponding matrix of factor loadings.

The solution to the above minimization problem is not unique, even though the sum of squared residuals V(k) is unique. Another solution is given by (F̄^k, Λ̄^k), where Λ̄^k is constructed as √N times the eigenvectors corresponding to the k largest eigenvalues of the N×N matrix X′X. The normalization that Λ̄^k′Λ̄^k/N = I_k implies F̄^k = XΛ̄^k/N. The second set of calculations is computationally less costly when T > N, while the first is less intensive when T < N.6

Define

    F̂^k = F̄^k (F̄^k′F̄^k/T)^{1/2},

a rescaled estimator of the factors. The following theorem summarizes the asymptotic properties of the estimated factors.

Theorem 1: For any fixed k ≥ 1, there exists an (r×k) matrix H^k with rank(H^k) = min{k,r}, and C_NT = min{√N, √T}, such that

    C²_NT ( T^{-1} Σ_{t=1}^T ‖F̂_t^k − H^k′F_t^0‖² ) = O_p(1).    (5)

Because the true factors (F^0) can only be identified up to scale, what is being considered is a rotation of F^0. The theorem establishes that the time average of the squared deviations between the estimated factors and those that lie in the true factor space vanishes as N, T → ∞. The rate of convergence is determined by the smaller of N and T, and thus depends on the panel structure.

Under the additional assumption that Σ_{s=1}^T γ_N(s,t)² ≤ M for all t and T, the result7

    C²_NT ‖F̂_t^k − H^k′F_t^0‖² = O_p(1) for each t    (6)

can be obtained. Neither Theorem 1 nor (6) implies uniform convergence in t. Uniform convergence is considered by Stock and Watson (1998). These authors obtained a much slower convergence rate than C²_NT, and their result requires √N ≫ T. An important insight of this paper is that, to consistently estimate the number of factors, neither (6) nor uniform convergence is required. It is the average convergence rate of Theorem 1 that is essential. However, (6) could be useful for statistical analysis on the estimated factors and is thus a result of independent interest.

6 A more detailed account of computational issues, including how to deal with unbalanced panels, is given in Stock and Watson (1998).
7 The proof is actually simpler than that of Theorem 1 and is thus omitted to avoid repetition.


4. Estimating the Number of Factors

Suppose for the moment that we observe all potentially informative factors but not the factor loadings. Then the problem is simply to choose k factors that best capture the variations in X and estimate the corresponding factor loadings. Since the model is linear and the factors are observed, λ_i can be estimated by applying ordinary least squares to each equation. This is then a classical model selection problem. A model with k+1 factors can fit no worse than a model with k factors, but efficiency is lost as more factor loadings are being estimated. Let F^k be a matrix of k factors, and let

    V(k, F^k) = min_Λ (NT)^{-1} Σ_{i=1}^N Σ_{t=1}^T (X_it − λ_i^k′ F_t^k)²

be the sum of squared residuals (divided by NT) from time-series regressions of X_i on the k factors for all i. Then a loss function V(k, F^k) + k g(N,T), where g(N,T) is the penalty for overfitting, can be used to determine k. Because the estimation of λ_i is classical, it can be shown that the BIC with g(N,T) = ln(T)/T can consistently estimate r. On the other hand, the AIC with g(N,T) = 2/T may choose k > r even in large samples. The result is the same as in Geweke and Meese (1981), derived for N = 1, because when the factors are observed, the penalty factor does not need to take into account the sample size in the cross-section dimension. Our main result is to show that this will no longer be true when the factors have to be estimated, and even the BIC will not always consistently estimate r.

Without loss of generality, we let

    V(k, F̂^k) = min_Λ (NT)^{-1} Σ_{i=1}^N Σ_{t=1}^T (X_it − λ_i^k′ F̂_t^k)²    (7)

denote the sum of squared residuals (divided by NT) when k factors are estimated. This sum of squared residuals does not depend on which estimate of F is used, because they span the same vector space; that is, V(k, F̃^k) = V(k, F̄^k) = V(k, F̂^k). We want to find penalty functions, g(N,T), such that criteria of the form

    PC(k) = V(k, F̂^k) + k g(N,T)

can consistently estimate r. Let kmax be a bounded integer such that r ≤ kmax.

Theorem 2: Suppose that Assumptions A–D hold and that the k factors are estimated by principal components. Let k̂ = argmin_{0≤k≤kmax} PC(k). Then lim_{N,T→∞} Prob(k̂ = r) = 1 if (i) g(N,T) → 0 and (ii) C²_NT · g(N,T) → ∞ as N, T → ∞, where C_NT = min{√N, √T}.

Conditions (i) and (ii) are necessary in the sense that if one of the conditions is violated, then there will exist a factor model satisfying Assumptions A–D, and


yet the number of factors cannot be consistently estimated. However, conditions (i) and (ii) are not always required to obtain a consistent estimate of r.

A formal proof of Theorem 2 is provided in the Appendix. The crucial element in consistent estimation of r is a penalty factor that vanishes at an appropriate rate, such that under- and overparameterized models will not be chosen. An implication of Theorem 2 is the following:

Corollary 1: Under the assumptions of Theorem 2, the class of criteria defined by

    IC(k) = ln(V(k, F̂^k)) + k g(N,T)

will also consistently estimate r.

Note that V(k, F̂^k) is simply the average residual variance when k factors are assumed for each cross-section unit. The IC criteria thus resemble information criteria frequently used in time-series analysis, with the important difference that the penalty here depends on both N and T.

Thus far, it has been assumed that the common factors are estimated by the method of principal components. Forni and Reichlin (1998) and Forni et al. (2000a) studied alternative estimation methods. However, the proof of Theorem 2 mainly uses the fact that F̂_t satisfies Theorem 1, and does not rely on principal components per se. We have the following corollary:

Corollary 2: Let Ĝ^k be an arbitrary estimator of F^0. Suppose there exists a matrix H̃^k such that rank(H̃^k) = min{k,r}, and for some C̃²_NT ≤ C²_NT,

    C̃²_NT ( T^{-1} Σ_{t=1}^T ‖Ĝ_t^k − H̃^k′F_t^0‖² ) = O_p(1).    (8)

Then Theorem 2 still holds with F̂^k replaced by Ĝ^k and C_NT replaced by C̃_NT.

The sequence of constants C̃²_NT does not need to equal C²_NT = min{N,T}. Theorem 2 holds for any estimation method that yields estimators Ĝ_t satisfying (8).8 Naturally, the penalty would then depend on C̃²_NT, the convergence rate for Ĝ_t.

5. The PC_p and the IC_p

In this section, we assume that the method of principal components is used to estimate the factors and propose specific formulations of g(N,T) to be used in

8 We are grateful to a referee whose question led to the results reported here.


practice. Let σ̂² be a consistent estimate of (NT)^{-1} Σ_{i=1}^N Σ_{t=1}^T E(e_it)². Consider the following criteria:

    PC_p1(k) = V(k, F̂^k) + k σ̂² ((N+T)/(NT)) ln(NT/(N+T));

    PC_p2(k) = V(k, F̂^k) + k σ̂² ((N+T)/(NT)) ln C²_NT;

    PC_p3(k) = V(k, F̂^k) + k σ̂² (ln C²_NT / C²_NT).

Since V(k, F̂^k) = N^{-1} Σ_{i=1}^N σ̂_i², where σ̂_i² = ê_i′ê_i/T, the criteria generalize the C_p criterion of Mallows (1973), developed for the selection of models in strict time-series or cross-section contexts, to a panel data setting. For this reason, we refer to these statistics as Panel C_p (PC_p) criteria. Like the C_p criterion, σ̂² provides the proper scaling to the penalty term. In applications, it can be replaced by V(kmax, F̂^{kmax}). The proposed penalty functions are based on the sample size in the smaller of the two dimensions. All three criteria satisfy conditions (i) and (ii) of Theorem 2 since C⁻²_NT ≈ (N+T)/(NT) → 0 as N, T → ∞. However, in finite samples, C⁻²_NT ≤ (N+T)/(NT). Hence, the three criteria, although asymptotically equivalent, will have different properties in finite samples.9

Corollary 1 leads to consideration of the following three criteria:

    IC_p1(k) = ln(V(k, F̂^k)) + k ((N+T)/(NT)) ln(NT/(N+T));

    IC_p2(k) = ln(V(k, F̂^k)) + k ((N+T)/(NT)) ln C²_NT;    (9)

    IC_p3(k) = ln(V(k, F̂^k)) + k (ln C²_NT / C²_NT).

The main advantage of these three panel information criteria (IC_p) is that they do not depend on the choice of kmax through σ̂², which could be desirable in practice. The scaling by σ̂² is implicitly performed by the logarithmic transformation of V(k, F̂^k) and is thus not required in the penalty term.
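For concreteness, the criteria of this section can be computed as in the following sketch (our illustration, not the authors' code). It obtains V(k, F̂^k) from the T×T eigendecomposition under the normalization F̃^k′F̃^k/T = I_k, and uses σ̂² = V(kmax, F̂^{kmax}) as suggested above:

```python
import numpy as np

def select_num_factors(X, kmax):
    """Estimate r by minimizing the PCp and ICp criteria over k = 0, ..., kmax.
    X is a T x N panel (each series should be demeaned); returns k-hat per criterion."""
    T, N = X.shape
    C2 = min(N, T)                           # C_NT^2 = min(N, T)
    g1 = (N + T) / (N * T) * np.log(N * T / (N + T))
    g2 = (N + T) / (N * T) * np.log(C2)
    g3 = np.log(C2) / C2

    _, vec = np.linalg.eigh(X @ X.T)
    vec = vec[:, ::-1]                       # eigenvectors, largest eigenvalue first
    V = np.empty(kmax + 1)
    for k in range(kmax + 1):
        F = np.sqrt(T) * vec[:, :k]          # T x k factors (k = 0 gives no factors)
        resid = X - F @ (F.T @ X) / T        # X - F(F'X/T), since F'F/T = I_k
        V[k] = (resid ** 2).sum() / (N * T)  # V(k, F-hat^k)
    s2 = V[kmax]                             # sigma2-hat = V(kmax, F-hat^{kmax})
    ks = np.arange(kmax + 1)
    crit = {
        "PCp1": V + ks * s2 * g1, "PCp2": V + ks * s2 * g2, "PCp3": V + ks * s2 * g3,
        "ICp1": np.log(V) + ks * g1, "ICp2": np.log(V) + ks * g2, "ICp3": np.log(V) + ks * g3,
    }
    return {name: int(np.argmin(c)) for name, c in crit.items()}
```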

The proposed criteria differ from the conventional C_p and information criteria used in time-series analysis in that g(N,T) is a function of both N and T. To understand why the penalty must be specified as a function of the sample size in

9 Note that PC_p1 and PC_p2, and likewise IC_p1 and IC_p2, apply specifically to the principal components estimator because C²_NT = min{N,T} is used in deriving them. For alternative estimators satisfying Corollary 2, criteria PC_p3 and IC_p3 are still applicable with C_NT replaced by C̃_NT.


both dimensions, consider the following:

    AIC1(k) = V(k, F̂^k) + k σ̂² (2/T);

    BIC1(k) = V(k, F̂^k) + k σ̂² (ln T / T);

    AIC2(k) = V(k, F̂^k) + k σ̂² (2/N);

    BIC2(k) = V(k, F̂^k) + k σ̂² (ln N / N);

    AIC3(k) = V(k, F̂^k) + k σ̂² (2(N+T−k)/(NT));

    BIC3(k) = V(k, F̂^k) + k σ̂² ((N+T−k) ln(NT)/(NT)).

The penalty factors in AIC1 and BIC1 are standard in time-series applications. Although g(N,T) → 0 as T → ∞, AIC1 fails the second condition of Theorem 2 for all N and T. When N ≪ T and N(ln T)/T → 0, the BIC1 also fails condition (ii) of Theorem 2. Thus we expect that the AIC1 will not work for any N and T, while the BIC1 will not work for small N relative to T. By analogy, AIC2 also fails the conditions of Theorem 2, while BIC2 will work only if N ≪ T.

The next two criteria, AIC3 and BIC3, take into account the panel nature of the problem. The two specifications of g(N,T) reflect, first, that the effective number of observations is N·T, and second, that the total number of parameters being estimated is k(N+T−k). It is easy to see that AIC3 fails the second condition of Theorem 2. While the BIC3 satisfies this condition, g(N,T) does not always vanish. For example, if N = exp(T), then g(N,T) → 1 and the first condition of Theorem 2 will not be satisfied. Similarly, g(N,T) does not vanish when T = exp(N). Therefore BIC3 may perform well for some but not all configurations of the data. In contrast, the proposed criteria satisfy both conditions stated in Theorem 2.
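As a worked check of these claims (our addition), the elementary bound (1/2)min{N,T} ≤ NT/(N+T) ≤ min{N,T} verifies both conditions of Theorem 2 for the first proposed penalty and shows exactly where AIC1 fails:

```latex
g_1(N,T)=\frac{N+T}{NT}\,\ln\!\frac{NT}{N+T},\qquad C_{NT}^2=\min(N,T),
\qquad \tfrac12\,C_{NT}^2 \;\le\; \frac{NT}{N+T} \;\le\; C_{NT}^2 .

\text{(i)}\quad g_1(N,T)\;\le\;\frac{2\ln C_{NT}^2}{C_{NT}^2}\;\to\;0;
\qquad
\text{(ii)}\quad C_{NT}^2\,g_1(N,T)\;\ge\;\ln\!\bigl(C_{NT}^2/2\bigr)\;\to\;\infty .

\text{By contrast, for } AIC_1:\quad
C_{NT}^2\cdot\frac{2}{T}=\frac{2\min(N,T)}{T}\;\le\;2,
\ \text{which is bounded, so condition (ii) fails for all } N,\,T.
```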

6. Simulations and an Empirical Application

We first simulate data from the following model:

    X_it = Σ_{j=1}^r λ_ij F_tj + √θ e_it = c_it + √θ e_it,

where the factors are T×r matrices of N(0,1) variables, and the factor loadings are N(0,1) variates. Hence, the common component of X_it, denoted by c_it, has variance r. Results with λ_ij uniformly distributed are similar and will not be reported.
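A minimal sketch of this baseline data-generating process (our illustration; the function name and seed are arbitrary):

```python
import numpy as np

def simulate_panel(N, T, r, theta, rng=np.random.default_rng(0)):
    """Simulate X_it = sum_j lambda_ij F_tj + sqrt(theta) e_it with N(0,1)
    factors, loadings, and idiosyncratic errors (the baseline DGP above).
    The common component c_it has variance r; theta scales the noise."""
    F = rng.standard_normal((T, r))          # T x r factors
    Lam = rng.standard_normal((N, r))        # N x r loadings
    e = rng.standard_normal((T, N))          # idiosyncratic errors
    return F @ Lam.T + np.sqrt(theta) * e    # T x N panel
```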


TABLE I
DGP: X_it = Σ_{j=1}^r λ_ij F_tj + √θ e_it; r = 1; θ = 1.

    N     T  PCp1 PCp2 PCp3 ICp1 ICp2 ICp3 AIC1 BIC1 AIC2 BIC2 AIC3 BIC3
  100    40  1.02 1.00 2.97 1.00 1.00 1.00 8.00 2.97 8.00 8.00 7.57 1.00
  100    60  1.00 1.00 2.41 1.00 1.00 1.00 8.00 2.41 8.00 8.00 7.11 1.00
  200    60  1.00 1.00 1.00 1.00 1.00 1.00 8.00 1.00 8.00 8.00 5.51 1.00
  500    60  1.00 1.00 1.00 1.00 1.00 1.00 5.21 1.00 8.00 8.00 1.57 1.00
 1000    60  1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 8.00 8.00 1.00 1.00
 2000    60  1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 8.00 8.00 1.00 1.00
  100   100  1.00 1.00 3.24 1.00 1.00 1.00 8.00 3.24 8.00 3.24 6.68 1.00
  200   100  1.00 1.00 1.00 1.00 1.00 1.00 8.00 1.00 8.00 8.00 5.43 1.00
  500   100  1.00 1.00 1.00 1.00 1.00 1.00 8.00 1.00 8.00 8.00 1.55 1.00
 1000   100  1.00 1.00 1.00 1.00 1.00 1.00 1.08 1.00 8.00 8.00 1.00 1.00
 2000   100  1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 8.00 8.00 1.00 1.00
   40   100  1.01 1.00 2.69 1.00 1.00 1.00 8.00 8.00 8.00 2.69 7.33 1.00
   60   100  1.00 1.00 2.25 1.00 1.00 1.00 8.00 8.00 8.00 2.25 6.99 1.00
   60   200  1.00 1.00 1.00 1.00 1.00 1.00 8.00 8.00 8.00 1.00 5.14 1.00
   60   500  1.00 1.00 1.00 1.00 1.00 1.00 8.00 8.00 4.67 1.00 1.32 1.00
   60  1000  1.00 1.00 1.00 1.00 1.00 1.00 8.00 8.00 1.00 1.00 1.00 1.00
   60  2000  1.00 1.00 1.00 1.00 1.00 1.00 8.00 8.00 1.00 1.00 1.00 1.00
 4000    60  1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 8.00 8.00 1.00 1.00
 4000   100  1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 8.00 8.00 1.00 1.00
 8000    60  1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 8.00 8.00 1.00 1.00
 8000   100  1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 8.00 8.00 1.00 1.00
   60  4000  1.00 1.00 1.00 1.00 1.00 1.00 8.00 8.00 1.00 1.00 1.00 1.00
  100  4000  1.00 1.00 1.00 1.00 1.00 1.00 8.00 8.00 1.00 1.00 1.00 1.00
   60  8000  1.00 1.00 1.00 1.00 1.00 1.00 8.00 8.00 1.00 1.00 1.00 1.00
  100  8000  1.00 1.00 1.00 1.00 1.00 1.00 8.00 8.00 1.00 1.00 1.00 1.00
   10    50  8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 7.18
   10   100  8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 5.88
   20   100  4.73 3.94 6.29 1.00 1.00 1.00 8.00 8.00 8.00 6.29 8.00 1.00
  100    10  8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00
  100    20  5.62 4.81 7.16 1.00 1.00 1.00 8.00 7.16 8.00 8.00 8.00 1.00

Notes: Tables I–VIII report the estimated number of factors (k̂) averaged over 1000 simulations. The true number of factors is r and kmax = 8. When the average of k̂ is an integer, the corresponding standard error is zero. In the few cases in which the average of k̂ over replications is not an integer, the standard errors are no larger than .6. In view of the precision of the estimates in the majority of cases, the standard errors in the simulations are not reported. The last five rows of each table are for models of small dimensions (either N or T is small).

common component, one might expect the common factors to be estimated with less precision. Indeed, IC_p1 and IC_p2 underestimate r when min{N,T} < 60, but the criteria still select values of k that are very close to r for other configurations of the data.

The models considered thus far have idiosyncratic errors that are uncorrelated across units and across time. For these strict factor models, the preferred criteria are PC_p1, PC_p2, IC_p1, and IC_p2. It should be emphasized that the results reported are the averages of k̂ over 1000 simulations. We do not report the standard deviations of these averages because they are identically zero except for a few


TABLE II
DGP: X_it = Σ_{j=1}^r λ_ij F_tj + √θ e_it; r = 3; θ = 3.

    N     T  PCp1 PCp2 PCp3 ICp1 ICp2 ICp3 AIC1 BIC1 AIC2 BIC2 AIC3 BIC3
  100    40  3.00 3.00 3.90 3.00 3.00 3.00 8.00 3.90 8.00 8.00 7.82 2.90
  100    60  3.00 3.00 3.54 3.00 3.00 3.00 8.00 3.54 8.00 8.00 7.53 2.98
  200    60  3.00 3.00 3.00 3.00 3.00 3.00 8.00 3.00 8.00 8.00 6.14 3.00
  500    60  3.00 3.00 3.00 3.00 3.00 3.00 5.95 3.00 8.00 8.00 3.13 3.00
 1000    60  3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 8.00 8.00 3.00 3.00
 2000    60  3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 8.00 8.00 3.00 3.00
  100   100  3.00 3.00 4.23 3.00 3.00 3.00 8.00 4.23 8.00 4.23 7.20 3.00
  200   100  3.00 3.00 3.00 3.00 3.00 3.00 8.00 3.00 8.00 8.00 6.21 3.00
  500   100  3.00 3.00 3.00 3.00 3.00 3.00 8.00 3.00 8.00 8.00 3.15 3.00
 1000   100  3.00 3.00 3.00 3.00 3.00 3.00 3.01 3.00 8.00 8.00 3.00 3.00
 2000   100  3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 8.00 8.00 3.00 3.00
   40   100  3.00 3.00 3.70 3.00 3.00 3.00 8.00 8.00 8.00 3.70 7.63 2.92
   60   100  3.00 3.00 3.42 3.00 3.00 3.00 8.00 8.00 8.00 3.42 7.39 2.99
   60   200  3.00 3.00 3.00 3.00 3.00 3.00 8.00 8.00 8.00 3.00 5.83 3.00
   60   500  3.00 3.00 3.00 3.00 3.00 3.00 8.00 8.00 5.44 3.00 3.03 3.00
   60  1000  3.00 3.00 3.00 3.00 3.00 3.00 8.00 8.00 3.00 3.00 3.00 3.00
   60  2000  3.00 3.00 3.00 3.00 3.00 3.00 8.00 8.00 3.00 3.00 3.00 3.00
 4000    60  3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 8.00 8.00 3.00 2.98
 4000   100  3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 8.00 8.00 3.00 3.00
 8000    60  3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 8.00 8.00 3.00 2.97
 8000   100  3.00 3.00 3.00 3.00 3.00 3.00 3.00 3.00 8.00 8.00 3.00 3.00
   60  4000  3.00 3.00 3.00 3.00 3.00 3.00 8.00 8.00 3.00 3.00 3.00 2.99
  100  4000  3.00 3.00 3.00 3.00 3.00 3.00 8.00 8.00 3.00 3.00 3.00 3.00
   60  8000  3.00 3.00 3.00 3.00 3.00 3.00 8.00 8.00 3.00 3.00 3.00 2.98
  100  8000  3.00 3.00 3.00 3.00 3.00 3.00 8.00 8.00 3.00 3.00 3.00 3.00
   10    50  8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 7.21
   10   100  8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 6.01
   20   100  5.22 4.57 6.62 2.95 2.92 2.98 8.00 8.00 8.00 6.62 8.00 2.68
  100    10  8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00
  100    20  6.00 5.29 7.39 2.95 2.91 2.99 8.00 7.39 8.00 8.00 8.00 2.72

cases for which the average itself is not an integer. Even for these latter cases,the standard deviations do not exceed 0.6.

We next modify the assumption on the idiosyncratic errors to allow for serial and cross-section correlation. These errors are generated from the process

    e_it = ρ e_{i,t−1} + v_it + β Σ_{j=−J, j≠0}^{J} v_{i−j,t}.
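The error process can be simulated as in the following sketch (our illustration; wrapping the cross-section index at the boundary is our assumption, since the paper does not specify how units near the edges are treated):

```python
import numpy as np

def correlated_errors(N, T, rho, beta, J, rng=np.random.default_rng(0)):
    """e_it = rho*e_{i,t-1} + v_it + beta * sum_{0 < |j| <= J} v_{i-j,t},
    with v_it ~ N(0,1). Cross-section indices wrap around (our assumption)."""
    v = rng.standard_normal((T, N))
    u = v.copy()
    for j in range(1, J + 1):                 # add the 2J cross-correlated terms
        u += beta * (np.roll(v, j, axis=1) + np.roll(v, -j, axis=1))
    e = np.zeros((T, N))
    e[0] = u[0]
    for t in range(1, T):                     # AR(1) recursion in t
        e[t] = rho * e[t - 1] + u[t]
    return e
```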

The case of pure serial correlation obtains when the cross-section correlation parameter β is zero. Since for each i the unconditional variance of e_it is 1/(1−ρ²), the more persistent are the idiosyncratic errors, the larger are their variances relative to the common factors, and the precision of the estimates can be expected to fall. However, even with ρ = .5, Table VI shows that the estimates provided by the proposed criteria are still very good. The case of pure cross-


TABLE III
DGP: X_it = Σ_{j=1}^r λ_ij F_tj + √θ e_it; r = 5; θ = 5.

    N     T  PCp1 PCp2 PCp3 ICp1 ICp2 ICp3 AIC1 BIC1 AIC2 BIC2 AIC3 BIC3
  100    40  4.99 4.98 5.17 4.88 4.68 4.99 8.00 5.17 8.00 8.00 7.94 3.05
  100    60  5.00 5.00 5.07 4.99 4.94 5.00 8.00 5.07 8.00 8.00 7.87 3.50
  200    60  5.00 5.00 5.00 5.00 5.00 5.00 8.00 5.00 8.00 8.00 6.91 3.80
  500    60  5.00 5.00 5.00 5.00 5.00 5.00 6.88 5.00 8.00 8.00 5.01 3.88
 1000    60  5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 8.00 8.00 5.00 3.82
 2000    60  5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 8.00 8.00 5.00 3.59
  100   100  5.00 5.00 5.42 5.00 5.00 5.01 8.00 5.42 8.00 5.42 7.75 4.16
  200   100  5.00 5.00 5.00 5.00 5.00 5.00 8.00 5.00 8.00 8.00 7.06 4.80
  500   100  5.00 5.00 5.00 5.00 5.00 5.00 8.00 5.00 8.00 8.00 5.02 4.97
 1000   100  5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 8.00 8.00 5.00 4.98
 2000   100  5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 8.00 8.00 5.00 4.98
   40   100  5.00 4.99 5.09 4.86 4.69 5.00 8.00 8.00 8.00 5.09 7.86 2.96
   60   100  5.00 5.00 5.05 4.99 4.94 5.00 8.00 8.00 8.00 5.05 7.81 3.46
   60   200  5.00 5.00 5.00 5.00 5.00 5.00 8.00 8.00 8.00 5.00 6.71 3.83
   60   500  5.00 5.00 5.00 5.00 5.00 5.00 8.00 8.00 6.44 5.00 5.00 3.91
   60  1000  5.00 5.00 5.00 5.00 5.00 5.00 8.00 8.00 5.00 5.00 5.00 3.79
   60  2000  5.00 5.00 5.00 5.00 5.00 5.00 8.00 8.00 5.00 5.00 5.00 3.58
 4000    60  5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 8.00 8.00 5.00 3.37
 4000   100  5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 8.00 8.00 5.00 4.96
 8000    60  5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 8.00 8.00 5.00 3.10
 8000   100  5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 8.00 8.00 5.00 4.93
   60  4000  5.00 5.00 5.00 5.00 5.00 5.00 8.00 8.00 5.00 5.00 5.00 3.35
  100  4000  5.00 5.00 5.00 5.00 5.00 5.00 8.00 8.00 5.00 5.00 5.00 4.96
   60  8000  5.00 5.00 5.00 5.00 5.00 5.00 8.00 8.00 5.00 5.00 5.00 3.12
  100  8000  5.00 5.00 5.00 5.00 5.00 5.00 8.00 8.00 5.00 5.00 5.00 4.93
   10    50  8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 7.28
   10   100  8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 6.30
   20   100  5.88 5.41 6.99 4.17 3.79 4.68 8.00 8.00 8.00 6.99 8.00 2.79
  100    10  8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00
  100    20  6.49 5.94 7.62 4.24 3.87 4.81 8.00 7.62 8.00 8.00 8.00 2.93

section dependence obtains with ρ = 0. As in Chamberlain and Rothschild (1983), our theory permits some degree of cross-section correlation. Given the assumed process for e_it, the amount of cross correlation depends on the number of units that are cross correlated (2J), as well as the magnitude of the pairwise correlation (β). We set β to .2 and J to max{N/20, 10}. Effectively, when N ≤ 200, 10 percent of the units are cross correlated, and when N > 200, 20/N of the sample is cross correlated. As the results in Table VII indicate, the proposed criteria still give very good estimates of r and continue to do so for small variations in β and J. Table VIII reports results that allow for both serial and cross-section correlation. The variance of the idiosyncratic errors is now (1+2Jβ²)/(1−ρ²) times larger than the variance of the common component (a derivation is given after this paragraph). While this reduces the precision of the estimates somewhat, the results generally confirm that a small degree of correlation in the idiosyncratic errors will not affect the properties of


TABLE IV
DGP: X_it = Σ_{j=1}^r λ_ij F_tj + √θ e_it; e_it = e¹_it + σ_t e²_it (σ_t = 1 for t even, σ_t = 0 for t odd); r = 5; θ = 5.

    N     T  PCp1 PCp2 PCp3 ICp1 ICp2 ICp3 AIC1 BIC1 AIC2 BIC2 AIC3 BIC3
  100    40  4.96 4.86 6.09 4.09 3.37 4.93 8.00 6.09 8.00 8.00 8.00 1.81
  100    60  4.99 4.90 5.85 4.69 4.18 5.01 8.00 5.85 8.00 8.00 8.00 2.08
  200    60  5.00 4.99 5.00 4.93 4.87 5.00 8.00 5.00 8.00 8.00 8.00 2.22
  500    60  5.00 5.00 5.00 4.99 4.98 5.00 8.00 5.00 8.00 8.00 7.91 2.23
 1000    60  5.00 5.00 5.00 5.00 5.00 5.00 7.97 5.00 8.00 8.00 6.47 2.02
 2000    60  5.00 5.00 5.00 5.00 5.00 5.00 5.51 5.00 8.00 8.00 5.03 1.72
  100   100  5.00 4.98 6.60 4.98 4.79 5.24 8.00 6.60 8.00 6.60 8.00 2.56
  200   100  5.00 5.00 5.00 5.00 5.00 5.00 8.00 5.00 8.00 8.00 8.00 3.33
  500   100  5.00 5.00 5.00 5.00 5.00 5.00 8.00 5.00 8.00 8.00 7.94 3.93
 1000   100  5.00 5.00 5.00 5.00 5.00 5.00 8.00 5.00 8.00 8.00 6.13 3.98
 2000   100  5.00 5.00 5.00 5.00 5.00 5.00 5.36 5.00 8.00 8.00 5.00 3.85
   40   100  4.94 4.80 5.39 4.04 3.30 4.90 8.00 8.00 8.00 5.39 7.99 1.68
   60   100  4.98 4.88 5.41 4.66 4.14 5.00 8.00 8.00 8.00 5.41 7.99 2.04
   60   200  5.00 4.99 5.00 4.95 4.87 5.00 8.00 8.00 8.00 5.00 7.56 2.14
   60   500  5.00 5.00 5.00 4.99 4.98 5.00 8.00 8.00 7.29 5.00 5.07 2.13
   60  1000  5.00 5.00 5.00 5.00 5.00 5.00 8.00 8.00 5.00 5.00 5.00 1.90
   60  2000  5.00 5.00 5.00 5.00 5.00 5.00 8.00 8.00 5.00 5.00 5.00 1.59
 4000    60  5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 8.00 8.00 5.00 1.46
 4000   100  5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 8.00 8.00 5.00 3.67
 8000    60  5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 8.00 8.00 5.00 1.16
 8000   100  5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 8.00 8.00 5.00 3.37
   60  4000  5.00 5.00 5.00 5.00 5.00 5.00 8.00 8.00 5.00 5.00 5.00 1.30
  100  4000  5.00 5.00 5.00 5.00 5.00 5.00 8.00 8.00 5.00 5.00 5.00 3.62
   60  8000  5.00 5.00 5.00 5.00 5.00 5.00 8.00 8.00 5.00 5.00 5.00 1.08
  100  8000  5.00 5.00 5.00 5.00 5.00 5.00 8.00 8.00 5.00 5.00 5.00 3.29
   10    50  8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 7.27
   10   100  8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 6.34
   20   100  6.13 5.62 7.23 2.85 2.23 3.93 8.00 8.00 8.00 7.23 8.00 1.86
  100    10  8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00
  100    20  7.52 6.99 7.99 3.31 2.64 6.17 8.00 7.99 8.00 8.00 8.00 2.30

the estimates. However, it will generally be true that, for the proposed criteria to be as precise in approximate as in strict factor models, N has to be fairly large relative to J, β cannot be too large, and the errors cannot be too persistent, as required by theory. It is also noteworthy that the BIC3 has very good properties in the presence of cross-section correlations (see Tables VII and VIII), and the criterion can be useful in practice even though it does not satisfy all the conditions of Theorem 2.
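The variance factor quoted above follows from the AR(1) structure of the error process (our worked step). Writing u_it = v_it + β Σ_{0<|j|≤J} v_{i−j,t} for the innovation, with the v_it i.i.d. N(0,1):

```latex
\operatorname{Var}(u_{it}) = 1 + 2J\beta^2
\quad\Longrightarrow\quad
\operatorname{Var}(e_{it}) = \frac{\operatorname{Var}(u_{it})}{1-\rho^2}
= \frac{1+2J\beta^2}{1-\rho^2},
```

so that, with θ = r, the idiosyncratic term √θ e_it has variance (1+2Jβ²)/(1−ρ²) times the variance r of the common component.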

6.1. Application to Asset Returns

Factor models for asset returns are extensively studied in the finance literature. An excellent summary of multifactor asset pricing models can be found in Campbell, Lo, and MacKinlay (1997). Two basic approaches are employed. One

8/7/2019 bai and ng 2002

http://slidepdf.com/reader/full/bai-and-ng-2002 18/31

208 jushan bai and serena ng

TABLE V

DGP: Xit =r j =1 ij F tj +

√ eit ; r = 5; = r × 2.

N T PC  p1

PC p2

PC p3

IC p1

IC p2

IC p3

AIC 1

BIC 1

AIC 2

BIC 2

AIC 3

BIC 3

100 40 463 429 514 279 191 447 800 800 800 514 793 082

100 60 478 441 506 373 261 496 800 800 800 506 786 092200 60 490 480 500 442 403 494 800 800 800 500 692 093

500 60 496 494 499 477 468 492 800 800 688 499 501 0771000 60 497 497 498 488 486 493 800 800 500 498 500 056

2000 60 498 498 499 491 489 492 800 800 500 499 500 034100 100 496 467 542 464 361 501 800 542 800 542 774 123

200 100 500 499 500 498 490 500 800 800 800 500 705 180500 100 500 500 500 500 500 500 800 800 800 500 502 219

1000 100 500 500 500 500 500 500 800 800 500 500 500 217

2000 100 500 500 500 500 500 500 800 800 500 500 500 206

40 100 461 425 507 265 184 448 800 507 800 800 783 07460 100 476 438 505 366 260 497 800 505 800 800 781 09260 200 490 478 500 443 407 495 800 500 800 800 670 088

60 500 497 495 499 478 471 493 644 499 800 800 500 07460 1000 498 497 499 487 484 492 500 499 800 800 500 051

60 2000 499 498 499 489 488 492 500 499 800 800 500 0324000 60 499 499 499 492 492 493 800 800 500 499 500 0184000 100 500 500 500 500 500 500 800 800 500 500 500 172

8000 60 499 499 499 492 492 493 800 800 500 499 500 0088000 100 500 500 500 500 500 500 800 800 500 500 500 140

60 4000 499 499 499 493 492 495 500 499 800 800 500 015100 4000 500 500 500 500 500 500 500 500 800 800 500 170

60 8000 499 499 499 492 492 493 500 499 800 800 500 008100 8000 500 500 500 500 500 500 500 500 800 800 500 140

100 10 800 800 800 800 800 800 800 800 800 800 800 724

100 20 800 800 800 800 800 800 800 800 800 800 800 61810 50 573 522 690 167 133 279 800 690 800 800 800 112

10 100 800 800 800 800 800 800 800 800 800 800 800 80020 100 639 579 757 185 144 304 800 800 800 757 800 131

is statistical factor analysis of unobservable factors, and the other is regression analysis on observable factors. For the first approach, most studies use grouped data (portfolios) in order to satisfy the small-N restriction imposed by classical factor analysis, with exceptions such as Connor and Korajczyk (1993). The second approach uses macroeconomic and financial market variables that are thought to capture systematic risks as observable factors. With the method developed in this paper, we can estimate the number of factors for the broad U.S. stock market without the need to group the data, and without being specific about which observed series are good proxies for systematic risks.

Monthly data between 1994:1 and 1998:12 are available for the returns of 8436 stocks traded on the New York Stock Exchange, AMEX, and NASDAQ. The data include all stocks listed on the last trading day of 1998 and are obtained from the CRSP database. Of these, returns for 4883 firms are available for each


TABLE VI
DGP: X_it = Σ_{j=1}^r λ_ij F_tj + √θ e_it; e_it = ρ e_{i,t−1} + v_it + β Σ_{j=−J, j≠0}^{J} v_{i−j,t}; r = 5; θ = 5, ρ = .5, β = 0, J = 0.

    N     T  PCp1 PCp2 PCp3 ICp1 ICp2 ICp3 AIC1 BIC1 AIC2 BIC2 AIC3 BIC3
  100    40  7.31 6.59 8.00 5.52 4.53 8.00 8.00 8.00 8.00 8.00 8.00 2.97
  100    60  6.11 5.27 8.00 5.00 4.76 8.00 8.00 8.00 8.00 8.00 8.00 3.09
  200    60  5.94 5.38 7.88 5.01 4.99 7.39 8.00 7.88 8.00 8.00 8.00 3.31
  500    60  5.68 5.39 6.79 5.00 5.00 5.11 8.00 6.79 8.00 8.00 8.00 3.41
 1000    60  5.41 5.27 6.02 5.00 5.00 5.00 8.00 6.02 8.00 8.00 8.00 3.27
 2000    60  5.21 5.14 5.50 5.00 5.00 5.00 8.00 5.50 8.00 8.00 8.00 3.06
  100   100  5.04 5.00 8.00 5.00 4.97 8.00 8.00 8.00 8.00 8.00 8.00 3.45
  200   100  5.00 5.00 7.75 5.00 5.00 7.12 8.00 7.75 8.00 8.00 8.00 4.26
  500   100  5.00 5.00 5.21 5.00 5.00 5.00 8.00 5.21 8.00 8.00 8.00 4.68
 1000   100  5.00 5.00 5.00 5.00 5.00 5.00 8.00 5.00 8.00 8.00 8.00 4.73
 2000   100  5.00 5.00 5.00 5.00 5.00 5.00 8.00 5.00 8.00 8.00 8.00 4.69
   40   100  5.37 5.05 7.30 4.58 4.08 5.82 8.00 8.00 8.00 7.30 8.00 2.45
   60   100  5.13 4.99 7.88 4.93 4.67 7.40 8.00 8.00 8.00 7.88 8.00 2.80
   60   200  5.00 5.00 5.02 4.99 4.96 5.00 8.00 8.00 8.00 5.02 8.00 2.84
   60   500  5.00 5.00 5.00 5.00 5.00 5.00 8.00 8.00 8.00 5.00 7.53 2.72
   60  1000  5.00 5.00 5.00 5.00 5.00 5.00 8.00 8.00 5.72 5.00 5.04 2.54
   60  2000  5.00 5.00 5.00 5.00 5.00 5.00 8.00 8.00 5.00 5.00 5.00 2.28
 4000    60  5.11 5.08 5.22 5.00 5.00 5.00 8.00 5.22 8.00 8.00 8.00 2.81
 4000   100  5.00 5.00 5.00 5.00 5.00 5.00 8.00 5.00 8.00 8.00 8.00 4.62
 8000    60  5.05 5.05 5.08 5.00 5.00 5.00 8.00 5.08 8.00 8.00 8.00 2.55
 8000   100  5.00 5.00 5.00 5.00 5.00 5.00 8.00 5.00 8.00 8.00 8.00 4.37
   60  4000  5.00 5.00 5.00 5.00 5.00 5.00 8.00 8.00 5.00 5.00 5.00 1.92
  100  4000  5.00 5.00 5.00 5.00 5.00 5.00 8.00 8.00 5.00 5.00 5.00 4.21
   60  8000  5.00 5.00 5.00 5.00 5.00 5.00 8.00 8.00 5.00 5.00 5.00 1.64
  100  8000  5.00 5.00 5.00 5.00 5.00 5.00 8.00 8.00 5.00 5.00 5.00 3.97
  100    10  8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 7.47
  100    20  8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 6.69
   10    50  7.16 6.68 7.89 3.57 2.92 5.70 8.00 8.00 8.00 7.89 8.00 2.42
   10   100  8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00
   20   100  8.00 7.99 8.00 7.93 7.58 8.00 8.00 8.00 8.00 8.00 8.00 3.92

of the 60 months. We use the proposed criteria to determine the number of factors. We transform the data so that each series is mean zero. For this balanced panel with T = 60, N = 4883, and kmax = 15, the recommended criteria, namely PC_p1, PC_p2, IC_p1, and IC_p2, all suggest the presence of two factors.

7. Concluding Remarks

In this paper, we propose criteria for the selection of factors in large dimensional panels. The main appeal of our results is that they are developed under the assumption that N, T → ∞ and are thus appropriate for many datasets typically used in macroeconomic analysis. Some degree of correlation in the errors is also allowed. The criteria should be useful in applications in which the number of factors has traditionally been assumed rather than determined by the data.


TABLE VII
DGP: X_it = Σ_{j=1}^r λ_ij F_tj + √θ e_it; e_it = ρ e_{i,t−1} + v_it + β Σ_{j=−J, j≠0}^{J} v_{i−j,t}; r = 5; θ = 5, ρ = 0, β = .20, J = max{N/20, 10}.

    N     T  PCp1 PCp2 PCp3 ICp1 ICp2 ICp3 AIC1 BIC1 AIC2 BIC2 AIC3 BIC3
  100    40  5.50 5.27 6.02 5.09 5.01 5.63 8.00 6.02 8.00 8.00 7.98 4.24
  100    60  5.57 5.24 6.03 5.15 5.02 5.96 8.00 6.03 8.00 8.00 7.96 4.72
  200    60  5.97 5.94 6.00 5.88 5.76 5.99 8.00 6.00 8.00 8.00 7.63 4.89
  500    60  5.01 5.01 5.10 5.00 5.00 5.01 7.44 5.10 8.00 8.00 6.00 4.93
 1000    60  5.00 5.00 5.00 5.00 5.00 5.00 5.98 5.00 8.00 8.00 5.93 4.93
 2000    60  5.00 5.00 5.00 5.00 5.00 5.00 5.05 5.00 8.00 8.00 5.01 4.88
  100   100  5.79 5.30 6.31 5.43 5.04 6.03 8.00 6.31 8.00 6.31 7.95 4.98
  200   100  6.00 6.00 6.00 6.00 5.98 6.00 8.00 6.00 8.00 8.00 7.84 5.00
  500   100  5.21 5.11 5.64 5.06 5.03 5.41 8.00 5.64 8.00 8.00 6.02 5.00
 1000   100  5.00 5.00 5.00 5.00 5.00 5.00 6.00 5.00 8.00 8.00 6.00 5.00
 2000   100  5.00 5.00 5.00 5.00 5.00 5.00 5.72 5.00 8.00 8.00 5.41 5.00
   40   100  5.17 5.06 5.95 5.00 4.98 5.30 8.00 8.00 8.00 5.95 7.96 4.22
   60   100  5.30 5.06 6.01 5.03 5.00 5.87 8.00 8.00 8.00 6.01 7.94 4.69
   60   200  5.35 5.16 5.95 5.04 5.01 5.65 8.00 8.00 8.00 5.95 7.39 4.89
   60   500  5.43 5.29 5.83 5.05 5.02 5.35 8.00 8.00 7.49 5.83 6.04 4.94
   60  1000  5.55 5.45 5.79 5.08 5.05 5.25 8.00 8.00 6.01 5.79 6.00 4.93
   60  2000  5.64 5.59 5.76 5.07 5.04 5.17 8.00 8.00 6.00 5.76 6.00 4.91
 4000    60  5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 8.00 8.00 5.00 4.84
 4000   100  5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 8.00 8.00 5.00 5.00
 8000    60  5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 8.00 8.00 5.00 4.72
 8000   100  5.00 5.00 5.00 5.00 5.00 5.00 5.00 5.00 8.00 8.00 5.00 5.00
   60  4000  5.65 5.63 5.72 5.05 5.04 5.09 8.00 8.00 6.00 5.72 6.00 4.85
  100  4000  6.00 6.00 6.00 6.00 6.00 6.00 8.00 8.00 6.14 6.00 6.02 5.00
   60  8000  5.67 5.66 5.71 5.04 5.04 5.05 8.00 8.00 6.00 5.71 6.00 4.77
  100  8000  6.00 6.00 6.00 6.00 6.00 6.00 8.00 8.00 6.00 6.00 6.00 5.00
  100    10  8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 7.34
  100    20  8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 6.49
   10    50  6.23 5.84 7.18 4.82 4.67 5.14 8.00 8.00 8.00 7.18 8.00 3.72
   10   100  8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00
   20   100  6.75 6.27 7.75 4.97 4.73 5.71 8.00 7.75 8.00 8.00 8.00 3.81

Our discussion has focused on balanced panels. However, as discussed in Rubin and Thayer (1982) and Stock and Watson (1998), an iterative EM algorithm can be used to handle missing data. The idea is to replace X_it by its value as predicted by the parameters obtained from the last iteration when X_it is not observed. Thus, if λ̂_i(j) and F̂_t(j) are estimated values of λ_i and F_t from the jth iteration, let X*_it(j−1) = X_it if X_it is observed, and X*_it(j−1) = λ̂_i(j−1)′F̂_t(j−1) otherwise. We then minimize V*(k) with respect to F^k(j) and Λ^k(j), where V*(k) = (NT)^{-1} Σ_{i=1}^N Σ_{t=1}^T (X*_it(j−1) − λ_i^k(j)′F_t^k(j))². Essentially, eigenvalues are computed for the T×T matrix X*(j−1)X*(j−1)′. This process is iterated until convergence is achieved.
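A compact sketch of this EM-style iteration (our illustration; the convergence check on V*(k) and the zero initialization of missing entries are our choices, not specified in the text):

```python
import numpy as np

def em_factors(X, mask, k, n_iter=100, tol=1e-8):
    """EM-style factor estimation for an unbalanced panel (illustrative sketch).
    X: T x N array; mask: boolean, True where X_it is observed; k: factors.
    Missing entries are replaced by the fitted common component from the
    previous iteration, then the factors are re-estimated."""
    T, N = X.shape
    Xs = np.where(mask, X, 0.0)               # initial fill (series assumed demeaned)
    prev = np.inf
    for _ in range(n_iter):
        _, vec = np.linalg.eigh(Xs @ Xs.T)    # T x T eigenproblem on X*(j-1)X*(j-1)'
        F = np.sqrt(T) * vec[:, ::-1][:, :k]  # updated factors
        common = F @ (F.T @ Xs) / T           # fitted lambda_i' F_t for all (i, t)
        Xs = np.where(mask, X, common)        # impute missing entries
        Vk = ((Xs - common) ** 2).sum() / (N * T)
        if abs(prev - Vk) < tol:              # stop when V*(k) stabilizes
            break
        prev = Vk
    return F, common
```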

Many issues in factor analysis await further research. Except for some results derived for classical factor models, little is known about the limiting distribution


TABLE VIII
DGP: X_it = Σ_{j=1}^r λ_ij F_tj + √θ e_it; e_it = ρ e_{i,t−1} + v_it + β Σ_{j=−J, j≠0}^{J} v_{i−j,t}; r = 5; θ = 5, ρ = .50, β = .20, J = max{N/20, 10}.

    N     T  PCp1 PCp2 PCp3 ICp1 ICp2 ICp3 AIC1 BIC1 AIC2 BIC2 AIC3 BIC3
  100    40  7.54 6.92 8.00 6.43 5.52 8.00 8.00 8.00 8.00 8.00 8.00 4.14
  100    60  6.57 5.93 8.00 5.68 5.28 8.00 8.00 8.00 8.00 8.00 8.00 4.39
  200    60  6.52 6.15 7.97 6.00 5.91 7.84 8.00 7.97 8.00 8.00 8.00 4.68
  500    60  6.16 5.97 7.12 5.40 5.30 5.92 8.00 7.12 8.00 8.00 8.00 4.76
 1000    60  5.71 5.56 6.20 5.03 5.02 5.08 8.00 6.20 8.00 8.00 8.00 4.76
 2000    60  5.33 5.26 5.61 5.00 5.00 5.00 8.00 5.61 8.00 8.00 8.00 4.69
  100   100  5.98 5.71 8.00 5.72 5.27 8.00 8.00 8.00 8.00 8.00 8.00 4.80
  200   100  6.01 6.00 7.95 6.00 5.99 7.78 8.00 7.95 8.00 8.00 8.00 5.03
  500   100  5.89 5.81 6.06 5.59 5.46 5.94 8.00 6.06 8.00 8.00 8.00 5.00
 1000   100  5.13 5.09 5.37 5.01 5.01 5.09 8.00 5.37 8.00 8.00 8.00 5.00
 2000   100  5.00 5.00 5.00 5.00 5.00 5.00 8.00 5.00 8.00 8.00 8.00 5.00
   40   100  5.88 5.46 7.55 5.07 4.93 6.57 8.00 8.00 8.00 7.55 8.00 3.76
   60   100  5.84 5.45 7.96 5.24 5.05 7.79 8.00 8.00 8.00 7.96 8.00 4.25
   60   200  5.67 5.44 5.99 5.20 5.07 5.83 8.00 8.00 8.00 5.99 8.00 4.42
   60   500  5.59 5.47 5.88 5.13 5.08 5.48 8.00 8.00 8.00 5.88 7.91 4.50
   60  1000  5.61 5.54 5.81 5.13 5.08 5.34 8.00 8.00 6.91 5.81 6.15 4.40
   60  2000  5.64 5.60 5.74 5.11 5.08 5.22 8.00 8.00 6.00 5.74 6.00 4.27
 4000    60  5.12 5.10 5.24 5.00 5.00 5.00 8.00 5.24 8.00 8.00 8.00 4.56
 4000   100  5.00 5.00 5.00 5.00 5.00 5.00 8.00 5.00 8.00 8.00 8.00 5.00
 8000    60  5.05 5.05 5.08 5.00 5.00 5.00 8.00 5.08 8.00 8.00 8.00 4.37
 8000   100  5.00 5.00 5.00 5.00 5.00 5.00 8.00 5.00 8.00 8.00 8.00 5.00
   60  4000  5.63 5.61 5.70 5.07 5.06 5.12 8.00 8.00 6.00 5.70 6.00 4.04
  100  4000  6.00 6.00 6.00 6.00 6.00 6.00 8.00 8.00 6.44 6.00 6.17 5.00
   60  8000  5.63 5.62 5.68 5.06 5.05 5.07 8.00 8.00 6.00 5.68 6.00 3.83
  100  8000  6.00 6.00 6.00 6.00 6.00 6.00 8.00 8.00 6.08 6.00 6.02 5.00
  100    10  8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 7.54
  100    20  8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 6.85
   10    50  7.34 6.87 7.93 4.84 4.37 6.82 8.00 8.00 8.00 7.93 8.00 3.41
   10   100  8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00
   20   100  8.00 8.00 8.00 7.99 7.84 8.00 8.00 8.00 8.00 8.00 8.00 4.54

of the estimated common factors and common components (i.e., $\hat\lambda_i'\hat F_t$). But using Theorem 1, it may be possible to obtain these limiting distributions. For example, the rate of convergence of $\tilde F_t$ derived in this paper could be used to examine the statistical properties of the forecast $\hat y_{T+1|T}$ in Stock and Watson's framework. It would be useful to show that $\hat y_{T+1|T}$ is not only a consistent but a $\sqrt{T}$-consistent estimator of $y_{T+1}$, conditional on the information up to time $T$ (provided that $N$ is of no smaller order of magnitude than $T$). Additional asymptotic results are currently being investigated by the authors.

The foregoing analysis has assumed a static relationship between the observed data and the factors. Our model allows $F_t$ to be a dependent process, e.g., $A(L)F_t = \varepsilon_t$, where $A(L)$ is a polynomial matrix in the lag operator. However, we do not consider the case in which the dynamics enter into $X_t$ directly. If the method developed in this paper is applied to such a dynamic model, the estimated number of factors gives an upper bound on the true number of factors. Consider the data generating process $X_{it} = a_iF_t + b_iF_{t-1} + e_{it}$. From the dynamic point of view, there is only one factor. The static approach treats the model as having two factors, unless the factor loading matrix has rank one.
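A toy simulation makes the point (a sketch of ours, not the authors'; it reuses the hypothetical `select_r` function from the sketch near the start of this section):

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 200, 200
F = rng.standard_normal(T + 1)                  # a single dynamic factor
a = rng.standard_normal(N)                      # loadings on F_t
b = rng.standard_normal(N)                      # loadings on F_{t-1}
# Static form: two columns (F_t, F_{t-1}) with loading matrix [a, b] of rank 2.
X = np.outer(F[1:], a) + np.outer(F[:-1], b) + rng.standard_normal((T, N))
print(select_r(X, kmax=8))                      # requires select_r from the earlier sketch
```

Because the $N \times 2$ loading matrix $[a, b]$ has rank two unless $b_i$ is proportional to $a_i$, one would expect the criteria to select $\hat k = 2$ here even though there is a single dynamic factor.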

The literature on dynamic factor models is growing. Assuming $N$ is fixed, Sargent and Sims (1977) and Geweke (1977) extended the static strict factor model to allow for dynamics. Stock and Watson (1998) suggested how dynamics can be introduced into factor models when both $N$ and $T$ are large, although their empirical applications assumed a static factor structure. Forni et al. (2000a) further allowed $X_{it}$ to also depend on the leads of the factors and proposed a graphical approach for estimating the number of factors. However, determining the number of factors in a dynamic setting is a complex issue. We hope that the ideas and methodology introduced in this paper will shed light on a formal treatment of this problem.

Dept. of Economics, Boston College, Chestnut Hill, MA 02467, U.S.A.; [email protected]

and

Dept. of Economics, Johns Hopkins University, Baltimore, MD 21218, U.S.A.; [email protected]

Manuscript received April, 2000; final revision received December, 2000.

APPENDIX

To prove the main results we need the following lemma.

Lemma 1: Under Assumptions A–C, we have, for some $M_1 < \infty$ and for all $N$ and $T$:

(i) $T^{-1}\sum_{s=1}^{T}\sum_{t=1}^{T}\gamma_N(s,t)^2 \le M_1$;

(ii) $E\Big(T^{-1}\sum_{t=1}^{T}\big\|N^{-1/2}e_t'\Lambda^0\big\|^2\Big) = E\Big(T^{-1}\sum_{t=1}^{T}\big\|N^{-1/2}\sum_{i=1}^{N}e_{it}\lambda_i^0\big\|^2\Big) \le M_1$;

(iii) $E\Big(T^{-2}\sum_{t=1}^{T}\sum_{s=1}^{T}\big(N^{-1}\sum_{i=1}^{N}X_{it}X_{is}\big)^2\Big) \le M_1$;

(iv) $E\big\|(NT)^{-1/2}\sum_{i=1}^{N}\sum_{t=1}^{T}e_{it}\lambda_i^0\big\|^2 \le M_1$.

Proof: Consider (i). Let $\rho_{st} = \gamma_N(s,t)/[\gamma_N(s,s)\gamma_N(t,t)]^{1/2}$. Then $|\rho_{st}| \le 1$. From $\gamma_N(s,s) \le M$,

$$T^{-1}\sum_{s=1}^{T}\sum_{t=1}^{T}\gamma_N(s,t)^2 = T^{-1}\sum_{s=1}^{T}\sum_{t=1}^{T}\gamma_N(s,s)\gamma_N(t,t)\rho_{st}^2 \le MT^{-1}\sum_{s=1}^{T}\sum_{t=1}^{T}[\gamma_N(s,s)\gamma_N(t,t)]^{1/2}|\rho_{st}| = MT^{-1}\sum_{s=1}^{T}\sum_{t=1}^{T}|\gamma_N(s,t)| \le M^2$$

by Assumption C2. Consider (ii).

$$E\Big\|N^{-1/2}\sum_{i=1}^{N}e_{it}\lambda_i^0\Big\|^2 = \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{N}E(e_{it}e_{jt}\lambda_i^{0\prime}\lambda_j^0) \le \bar\lambda^2\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{N}|\tau_{ij}| \le \bar\lambda^2M$$

by Assumptions B and C3. For (iii), it is sufficient to prove $E|X_{it}|^4 \le M_1$ for all $(i,t)$. Now $E|X_{it}|^4 \le 8E(\|\lambda_i^0\|^4\|F_t^0\|^4) + 8E|e_{it}|^4 \le 8\bar\lambda^4E\|F_t^0\|^4 + 8E|e_{it}|^4 \le M_1$ for some $M_1$ by Assumptions A, B, and C1. Finally, for (iv),

$$E\Big\|(NT)^{-1/2}\sum_{i=1}^{N}\sum_{t=1}^{T}e_{it}\lambda_i^0\Big\|^2 = \frac{1}{NT}\sum_{i=1}^{N}\sum_{j=1}^{N}\sum_{t=1}^{T}\sum_{s=1}^{T}E(e_{it}e_{js}\lambda_i^{0\prime}\lambda_j^0) \le \bar\lambda^2\frac{1}{NT}\sum_{i=1}^{N}\sum_{j=1}^{N}\sum_{t=1}^{T}\sum_{s=1}^{T}|\tau_{ij,ts}| \le \bar\lambda^2M$$

by Assumption C4.

Proof of Theorem 1: We use the mathematical identity $\tilde F^k = N^{-1}X\tilde\Lambda^k$ and $\tilde\Lambda^k = T^{-1}X'\tilde F^k$. From the normalization $\tilde F^{k\prime}\tilde F^k/T = I_k$, we also have $T^{-1}\sum_{t=1}^{T}\|\tilde F_t^k\|^2 = O_p(1)$. For $H^{k\prime} = (\tilde F^{k\prime}F^0/T)(\Lambda^{0\prime}\Lambda^0/N)$, we have

$$\tilde F_t^k - H^{k\prime}F_t^0 = T^{-1}\sum_{s=1}^{T}\tilde F_s^k\gamma_N(s,t) + T^{-1}\sum_{s=1}^{T}\tilde F_s^k\zeta_{st} + T^{-1}\sum_{s=1}^{T}\tilde F_s^k\eta_{st} + T^{-1}\sum_{s=1}^{T}\tilde F_s^k\xi_{st},$$

where

$$\zeta_{st} = \frac{e_s'e_t}{N} - \gamma_N(s,t), \qquad \eta_{st} = F_s^{0\prime}\Lambda^{0\prime}e_t/N, \qquad \xi_{st} = F_t^{0\prime}\Lambda^{0\prime}e_s/N = \eta_{ts}.$$

Note that $H^k$ depends on $N$ and $T$. Throughout, we will suppress this dependence to simplify the notation. We also note that $\|H^k\| = O_p(1)$ because $\|H^k\| \le \|\tilde F^{k\prime}\tilde F^k/T\|^{1/2}\|F^{0\prime}F^0/T\|^{1/2}\|\Lambda^{0\prime}\Lambda^0/N\|$ and each of the matrix norms is stochastically bounded by Assumptions A and B. Because $(x+y+z+u)^2 \le 4(x^2+y^2+z^2+u^2)$, $\|\tilde F_t^k - H^{k\prime}F_t^0\|^2 \le 4(a_t + b_t + c_t + d_t)$, where

$$a_t = T^{-2}\Big\|\sum_{s=1}^{T}\tilde F_s^k\gamma_N(s,t)\Big\|^2, \quad b_t = T^{-2}\Big\|\sum_{s=1}^{T}\tilde F_s^k\zeta_{st}\Big\|^2, \quad c_t = T^{-2}\Big\|\sum_{s=1}^{T}\tilde F_s^k\eta_{st}\Big\|^2, \quad d_t = T^{-2}\Big\|\sum_{s=1}^{T}\tilde F_s^k\xi_{st}\Big\|^2.$$

It follows that $(1/T)\sum_{t=1}^{T}\|\tilde F_t^k - H^{k\prime}F_t^0\|^2 \le (4/T)\sum_{t=1}^{T}(a_t + b_t + c_t + d_t)$.

Now $\big\|\sum_{s=1}^{T}\tilde F_s^k\gamma_N(s,t)\big\|^2 \le \sum_{s=1}^{T}\|\tilde F_s^k\|^2\cdot\sum_{s=1}^{T}\gamma_N(s,t)^2$. Thus,

$$T^{-1}\sum_{t=1}^{T}a_t \le T^{-1}\bigg(T^{-1}\sum_{s=1}^{T}\|\tilde F_s^k\|^2\bigg)\bigg(T^{-1}\sum_{t=1}^{T}\sum_{s=1}^{T}\gamma_N(s,t)^2\bigg) = O_p(T^{-1})$$

by Lemma 1(i). For $b_t$, we have that

$$\sum_{t=1}^{T}b_t = T^{-2}\sum_{t=1}^{T}\Big\|\sum_{s=1}^{T}\tilde F_s^k\zeta_{st}\Big\|^2 = T^{-2}\sum_{t=1}^{T}\sum_{s=1}^{T}\sum_{u=1}^{T}\tilde F_s^{k\prime}\tilde F_u^k\zeta_{st}\zeta_{ut}$$
$$\le \bigg(T^{-2}\sum_{s=1}^{T}\sum_{u=1}^{T}(\tilde F_s^{k\prime}\tilde F_u^k)^2\bigg)^{1/2}\bigg(T^{-2}\sum_{s=1}^{T}\sum_{u=1}^{T}\Big(\sum_{t=1}^{T}\zeta_{st}\zeta_{ut}\Big)^2\bigg)^{1/2} \le \bigg(T^{-1}\sum_{s=1}^{T}\|\tilde F_s^k\|^2\bigg)\cdot\bigg(T^{-2}\sum_{s=1}^{T}\sum_{u=1}^{T}\Big(\sum_{t=1}^{T}\zeta_{st}\zeta_{ut}\Big)^2\bigg)^{1/2}.$$

From $E\big(\sum_{t=1}^{T}\zeta_{st}\zeta_{ut}\big)^2 = E\sum_{t=1}^{T}\sum_{v=1}^{T}\zeta_{st}\zeta_{ut}\zeta_{sv}\zeta_{uv} \le T^2\max_{s,t}E|\zeta_{st}|^4$ and

$$E|\zeta_{st}|^4 = \frac{1}{N^2}E\Big|N^{-1/2}\sum_{i=1}^{N}\big(e_{it}e_{is} - E(e_{it}e_{is})\big)\Big|^4 \le N^{-2}M$$

by Assumption C5, we have $\sum_{t=1}^{T}b_t \le O_p(1)\cdot\sqrt{T^2/N^2} = O_p(T/N)$, so that $T^{-1}\sum_{t=1}^{T}b_t = O_p(N^{-1})$. For $c_t$, we have

$$c_t = T^{-2}\Big\|\sum_{s=1}^{T}\tilde F_s^k\eta_{st}\Big\|^2 = T^{-2}\Big\|\sum_{s=1}^{T}\tilde F_s^kF_s^{0\prime}\Lambda^{0\prime}e_t/N\Big\|^2 \le N^{-2}\|e_t'\Lambda^0\|^2\bigg(T^{-1}\sum_{s=1}^{T}\|\tilde F_s^k\|^2\bigg)\bigg(T^{-1}\sum_{s=1}^{T}\|F_s^0\|^2\bigg) = N^{-2}\|e_t'\Lambda^0\|^2O_p(1).$$

It follows that

$$T^{-1}\sum_{t=1}^{T}c_t = O_p(1)N^{-1}T^{-1}\sum_{t=1}^{T}\Big\|\frac{e_t'\Lambda^0}{\sqrt N}\Big\|^2 = O_p(N^{-1})$$

by Lemma 1(ii). The term $d_t = O_p(N^{-1})$ can be proved similarly. Combining these results, we have $T^{-1}\sum_{t=1}^{T}(a_t + b_t + c_t + d_t) = O_p(N^{-1}) + O_p(T^{-1})$.
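As an informal numerical companion to Theorem 1 (our sketch, not part of the original text), one can simulate a one-factor panel, form the rotation $H = (\Lambda^{0\prime}\Lambda^0/N)(F^{0\prime}\tilde F/T)$, and check that $T^{-1}\sum_t\|\tilde F_t - H'F_t^0\|^2$ shrinks at roughly the rate $C_{NT}^{-2} = \min(N,T)^{-1}$; the function name and DGP below are assumptions of ours:

```python
import numpy as np

def factor_fit_error(N, T, r=1, reps=20, seed=0):
    """Monte Carlo average of T^{-1} * sum_t ||F~_t - H'F0_t||^2
    for the model X = F0 @ L0' + e with i.i.d. N(0,1) errors."""
    rng = np.random.default_rng(seed)
    err = 0.0
    for _ in range(reps):
        F0 = rng.standard_normal((T, r))
        L0 = rng.standard_normal((N, r))
        X = F0 @ L0.T + rng.standard_normal((T, N))
        _, eigvec = np.linalg.eigh(X @ X.T)        # T x T eigenproblem
        Ft = np.sqrt(T) * eigvec[:, -r:]           # F~ with F~'F~/T = I_r
        H = (L0.T @ L0 / N) @ (F0.T @ Ft / T)      # rotation matrix of Theorem 1
        err += np.mean(np.sum((Ft - F0 @ H) ** 2, axis=1))
    return err / reps

# Doubling N = T should roughly halve the error if the rate is C_NT^{-2}:
for n in (50, 100, 200, 400):
    print(n, factor_fit_error(N=n, T=n))
```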

To prove Theorem 2, we need additional results.

Lemma 2: For any $k$ with $1 \le k \le r$, and $H^k$ defined as the matrix in Theorem 1,

$$V(k,\tilde F^k) - V(k,F^0H^k) = O_p(C_{NT}^{-1}).$$

Proof: For the true factor matrix with $r$ factors and $H^k$ defined in Theorem 1, let $M_{FH}^0 = I - P_{FH}^0$ denote the idempotent matrix spanned by the null space of $F^0H^k$. Correspondingly, let $M_F^k = I_T - \tilde F^k(\tilde F^{k\prime}\tilde F^k)^{-1}\tilde F^{k\prime} = I - P_F^k$. Then

$$V(k,\tilde F^k) = N^{-1}T^{-1}\sum_{i=1}^{N}X_i'M_F^kX_i, \qquad V(k,F^0H^k) = N^{-1}T^{-1}\sum_{i=1}^{N}X_i'M_{FH}^0X_i,$$

$$V(k,\tilde F^k) - V(k,F^0H^k) = N^{-1}T^{-1}\sum_{i=1}^{N}X_i'(P_{FH}^0 - P_F^k)X_i.$$

Let $D_k = \tilde F^{k\prime}\tilde F^k/T$ and $D_0 = H^{k\prime}F^{0\prime}F^0H^k/T$. Then

$$P_F^k - P_{FH}^0 = T^{-1}\bigg[\tilde F^k\Big(\frac{\tilde F^{k\prime}\tilde F^k}{T}\Big)^{-1}\tilde F^{k\prime} - F^0H^k\Big(\frac{H^{k\prime}F^{0\prime}F^0H^k}{T}\Big)^{-1}H^{k\prime}F^{0\prime}\bigg]$$
$$= T^{-1}\big[\tilde F^kD_k^{-1}\tilde F^{k\prime} - F^0H^kD_0^{-1}H^{k\prime}F^{0\prime}\big]$$
$$= T^{-1}\big[(\tilde F^k - F^0H^k + F^0H^k)D_k^{-1}(\tilde F^k - F^0H^k + F^0H^k)' - F^0H^kD_0^{-1}H^{k\prime}F^{0\prime}\big]$$
$$= T^{-1}\big[(\tilde F^k - F^0H^k)D_k^{-1}(\tilde F^k - F^0H^k)' + (\tilde F^k - F^0H^k)D_k^{-1}H^{k\prime}F^{0\prime} + F^0H^kD_k^{-1}(\tilde F^k - F^0H^k)' + F^0H^k(D_k^{-1} - D_0^{-1})H^{k\prime}F^{0\prime}\big].$$

Thus, $N^{-1}T^{-1}\sum_{i=1}^{N}X_i'(P_F^k - P_{FH}^0)X_i = I + II + III + IV$. We consider each term in turn.

$$I = N^{-1}T^{-2}\sum_{i=1}^{N}\sum_{t=1}^{T}\sum_{s=1}^{T}(\tilde F_t^k - H^{k\prime}F_t^0)'D_k^{-1}(\tilde F_s^k - H^{k\prime}F_s^0)X_{it}X_{is}$$
$$\le \bigg(T^{-2}\sum_{t=1}^{T}\sum_{s=1}^{T}\big[(\tilde F_t^k - H^{k\prime}F_t^0)'D_k^{-1}(\tilde F_s^k - H^{k\prime}F_s^0)\big]^2\bigg)^{1/2}\cdot\bigg(T^{-2}\sum_{t=1}^{T}\sum_{s=1}^{T}\Big(N^{-1}\sum_{i=1}^{N}X_{it}X_{is}\Big)^2\bigg)^{1/2}$$
$$\le \bigg(T^{-1}\sum_{t=1}^{T}\|\tilde F_t^k - H^{k\prime}F_t^0\|^2\bigg)\cdot\|D_k^{-1}\|\cdot O_p(1) = O_p(C_{NT}^{-2})$$

by Theorem 1 and Lemma 1(iii). We used the fact that $\|D_k^{-1}\| = O_p(1)$, which is proved below.

$$II = N^{-1}T^{-2}\sum_{i=1}^{N}\sum_{t=1}^{T}\sum_{s=1}^{T}(\tilde F_t^k - H^{k\prime}F_t^0)'D_k^{-1}H^{k\prime}F_s^0X_{it}X_{is}$$
$$\le \bigg(T^{-2}\sum_{t=1}^{T}\sum_{s=1}^{T}\|\tilde F_t^k - H^{k\prime}F_t^0\|^2\cdot\|H^{k\prime}F_s^0\|^2\cdot\|D_k^{-1}\|^2\bigg)^{1/2}\cdot\bigg(T^{-2}\sum_{t=1}^{T}\sum_{s=1}^{T}\Big(N^{-1}\sum_{i=1}^{N}X_{it}X_{is}\Big)^2\bigg)^{1/2}$$
$$\le \bigg(T^{-1}\sum_{t=1}^{T}\|\tilde F_t^k - H^{k\prime}F_t^0\|^2\bigg)^{1/2}\cdot\|D_k^{-1}\|\cdot\bigg(T^{-1}\sum_{s=1}^{T}\|H^{k\prime}F_s^0\|^2\bigg)^{1/2}\cdot O_p(1) = \bigg(T^{-1}\sum_{t=1}^{T}\|\tilde F_t^k - H^{k\prime}F_t^0\|^2\bigg)^{1/2}\cdot O_p(1) = O_p(C_{NT}^{-1}).$$

It can be verified that $III$ is also $O_p(C_{NT}^{-1})$.

$$IV = N^{-1}T^{-2}\sum_{i=1}^{N}\sum_{t=1}^{T}\sum_{s=1}^{T}F_t^{0\prime}H^k(D_k^{-1} - D_0^{-1})H^{k\prime}F_s^0X_{it}X_{is} \le \|D_k^{-1} - D_0^{-1}\|\cdot N^{-1}\sum_{i=1}^{N}\bigg(T^{-1}\sum_{t=1}^{T}\|H^{k\prime}F_t^0\|\cdot|X_{it}|\bigg)^2 = \|D_k^{-1} - D_0^{-1}\|\cdot O_p(1),$$

where the $O_p(1)$ is obtained because the term is bounded by $\|H^k\|^2\big(T^{-1}\sum_{t=1}^{T}\|F_t^0\|^2\big)\big((NT)^{-1}\sum_{i=1}^{N}\sum_{t=1}^{T}X_{it}^2\big)$, which is $O_p(1)$ by Assumption A and $E(X_{it}^2) \le M$. Next, we prove that $D_k - D_0 = O_p(C_{NT}^{-1})$. From

$$D_k - D_0 = \frac{\tilde F^{k\prime}\tilde F^k}{T} - \frac{H^{k\prime}F^{0\prime}F^0H^k}{T} = T^{-1}\sum_{t=1}^{T}\big(\tilde F_t^k\tilde F_t^{k\prime} - H^{k\prime}F_t^0F_t^{0\prime}H^k\big)$$
$$= T^{-1}\sum_{t=1}^{T}(\tilde F_t^k - H^{k\prime}F_t^0)(\tilde F_t^k - H^{k\prime}F_t^0)' + T^{-1}\sum_{t=1}^{T}(\tilde F_t^k - H^{k\prime}F_t^0)F_t^{0\prime}H^k + T^{-1}\sum_{t=1}^{T}H^{k\prime}F_t^0(\tilde F_t^k - H^{k\prime}F_t^0)',$$

we obtain

$$\|D_k - D_0\| \le T^{-1}\sum_{t=1}^{T}\|\tilde F_t^k - H^{k\prime}F_t^0\|^2 + 2\bigg(T^{-1}\sum_{t=1}^{T}\|\tilde F_t^k - H^{k\prime}F_t^0\|^2\bigg)^{1/2}\bigg(T^{-1}\sum_{t=1}^{T}\|H^{k\prime}F_t^0\|^2\bigg)^{1/2} = O_p(C_{NT}^{-2}) + O_p(C_{NT}^{-1}) = O_p(C_{NT}^{-1}).$$

Because $F^{0\prime}F^0/T$ converges to a positive definite matrix, and because $\operatorname{rank}(H^k) = k \le r$, $D_0$ ($k\times k$) converges to a positive definite matrix. From $D_k - D_0 = O_p(C_{NT}^{-1})$, $D_k$ also converges to a positive definite matrix. This implies that $\|D_k^{-1}\| = O_p(1)$. Moreover, from $D_k^{-1} - D_0^{-1} = D_k^{-1}(D_0 - D_k)D_0^{-1}$, we have $\|D_k^{-1} - D_0^{-1}\| = \|D_k - D_0\|\cdot O_p(1) = O_p(C_{NT}^{-1})$. Thus $IV = O_p(C_{NT}^{-1})$.

Lemma 3: For the matrix $H^k$ defined in Theorem 1, and for each $k$ with $k < r$, there exists a $\tau_k > 0$ such that

$$\operatorname*{plim\,inf}_{N,T\to\infty}\;\big[V(k,F^0H^k) - V(r,F^0)\big] = \tau_k.$$

Proof:

$$V(k,F^0H^k) - V(r,F^0) = N^{-1}T^{-1}\sum_{i=1}^{N}X_i'(P_F^0 - P_{FH}^0)X_i$$
$$= N^{-1}T^{-1}\sum_{i=1}^{N}(F^0\lambda_i^0 + e_i)'(P_F^0 - P_{FH}^0)(F^0\lambda_i^0 + e_i)$$
$$= N^{-1}T^{-1}\sum_{i=1}^{N}\lambda_i^{0\prime}F^{0\prime}(P_F^0 - P_{FH}^0)F^0\lambda_i^0 + 2N^{-1}T^{-1}\sum_{i=1}^{N}e_i'(P_F^0 - P_{FH}^0)F^0\lambda_i^0 + N^{-1}T^{-1}\sum_{i=1}^{N}e_i'(P_F^0 - P_{FH}^0)e_i$$
$$= I + II + III.$$

First, note that $P_F^0 - P_{FH}^0 \ge 0$. Hence, $III \ge 0$. For the first two terms,

$$I = \operatorname{tr}\bigg[T^{-1}F^{0\prime}(P_F^0 - P_{FH}^0)F^0\Big(N^{-1}\sum_{i=1}^{N}\lambda_i^0\lambda_i^{0\prime}\Big)\bigg]$$
$$= \operatorname{tr}\bigg[\bigg(\frac{F^{0\prime}F^0}{T} - \frac{F^{0\prime}F^0H^k}{T}\Big(\frac{H^{k\prime}F^{0\prime}F^0H^k}{T}\Big)^{-1}\frac{H^{k\prime}F^{0\prime}F^0}{T}\bigg)\cdot N^{-1}\sum_{i=1}^{N}\lambda_i^0\lambda_i^{0\prime}\bigg]$$
$$\to \operatorname{tr}\big[\big(\Sigma_F - \Sigma_FH_0^k(H_0^{k\prime}\Sigma_FH_0^k)^{-1}H_0^{k\prime}\Sigma_F\big)\cdot D\big] = \operatorname{tr}(A\cdot D),$$

where $A = \Sigma_F - \Sigma_FH_0^k(H_0^{k\prime}\Sigma_FH_0^k)^{-1}H_0^{k\prime}\Sigma_F$ and $H_0^k$ is the limit of $H^k$, with $\operatorname{rank}(H_0^k) = k < r$. Now $A \ne 0$ because $\operatorname{rank}(\Sigma_F) = r$ (Assumption A). Also, $A$ is positive semi-definite and $D > 0$ (Assumption B). This implies that $\operatorname{tr}(A\cdot D) > 0$.

Remark: Stock and Watson (1998) studied the limit of $H^k$. The convergence of $H^k$ to $H_0^k$ holds jointly in $T$ and $N$ and does not require any restriction between $T$ and $N$.

Now

$$II = 2N^{-1}T^{-1}\sum_{i=1}^{N}e_i'P_F^0F^0\lambda_i^0 - 2N^{-1}T^{-1}\sum_{i=1}^{N}e_i'P_{FH}^0F^0\lambda_i^0.$$

Consider the first term:

$$N^{-1}T^{-1}\sum_{i=1}^{N}e_i'P_F^0F^0\lambda_i^0 = N^{-1}T^{-1}\sum_{i=1}^{N}\sum_{t=1}^{T}e_{it}F_t^{0\prime}\lambda_i^0 \le \bigg(T^{-1}\sum_{t=1}^{T}\|F_t^0\|^2\bigg)^{1/2}\cdot\frac{1}{\sqrt N}\bigg(T^{-1}\sum_{t=1}^{T}\Big\|\frac{1}{\sqrt N}\sum_{i=1}^{N}e_{it}\lambda_i^0\Big\|^2\bigg)^{1/2} = O_p\Big(\frac{1}{\sqrt N}\Big).$$

The last equality follows from Lemma 1(ii). The second term is also $O_p(1/\sqrt N)$, and hence $II = O_p(1/\sqrt N) \to 0$.

Lemma 4: For any fixed $k$ with $k \ge r$, $V(k,\tilde F^k) - V(r,\tilde F^r) = O_p(C_{NT}^{-2})$.

Proof:

$$\big|V(k,\tilde F^k) - V(r,\tilde F^r)\big| \le \big|V(k,\tilde F^k) - V(r,F^0)\big| + \big|V(r,F^0) - V(r,\tilde F^r)\big| \le 2\max_{r\le k\le k_{\max}}\big|V(k,\tilde F^k) - V(r,F^0)\big|.$$

Thus, it is sufficient to prove, for each $k$ with $k \ge r$,

$$V(k,\tilde F^k) - V(r,F^0) = O_p(C_{NT}^{-2}).\tag{10}$$

Let $H^k$ be as defined in Theorem 1, now with rank $r$ because $k \ge r$. Let $H^{k+}$ be the generalized inverse of $H^k$ such that $H^kH^{k+} = I_r$. From $X_i = F^0\lambda_i^0 + e_i$, we have $X_i = F^0H^kH^{k+}\lambda_i^0 + e_i$. This implies

$$X_i = \tilde F^kH^{k+}\lambda_i^0 + e_i - (\tilde F^k - F^0H^k)H^{k+}\lambda_i^0 = \tilde F^kH^{k+}\lambda_i^0 + u_i,$$

where $u_i = e_i - (\tilde F^k - F^0H^k)H^{k+}\lambda_i^0$. Note that

$$V(k,\tilde F^k) = N^{-1}T^{-1}\sum_{i=1}^{N}u_i'M_F^ku_i, \qquad V(r,F^0) = N^{-1}T^{-1}\sum_{i=1}^{N}e_i'M_F^0e_i,$$

$$V(k,\tilde F^k) = N^{-1}T^{-1}\sum_{i=1}^{N}\big[e_i - (\tilde F^k - F^0H^k)H^{k+}\lambda_i^0\big]'M_F^k\big[e_i - (\tilde F^k - F^0H^k)H^{k+}\lambda_i^0\big]$$
$$= N^{-1}T^{-1}\sum_{i=1}^{N}e_i'M_F^ke_i - 2N^{-1}T^{-1}\sum_{i=1}^{N}\lambda_i^{0\prime}H^{k+\prime}(\tilde F^k - F^0H^k)'M_F^ke_i + N^{-1}T^{-1}\sum_{i=1}^{N}\lambda_i^{0\prime}H^{k+\prime}(\tilde F^k - F^0H^k)'M_F^k(\tilde F^k - F^0H^k)H^{k+}\lambda_i^0$$
$$= a + b + c.$$

Because $I - M_F^k$ is positive semi-definite, $x'M_F^kx \le x'x$. Thus,

$$c \le N^{-1}T^{-1}\sum_{i=1}^{N}\lambda_i^{0\prime}H^{k+\prime}(\tilde F^k - F^0H^k)'(\tilde F^k - F^0H^k)H^{k+}\lambda_i^0 \le \bigg(T^{-1}\sum_{t=1}^{T}\|\tilde F_t^k - H^{k\prime}F_t^0\|^2\bigg)\cdot\bigg(N^{-1}\sum_{i=1}^{N}\|\lambda_i^0\|^2\bigg)\|H^{k+}\|^2 = O_p(C_{NT}^{-2})\cdot O_p(1)$$

by Theorem 1. For term $b$, we use the fact that $\operatorname{tr}(A) \le r\|A\|$ for any $r\times r$ matrix $A$. Thus

$$|b| = 2T^{-1}\Big|\operatorname{tr}\Big(H^{k+\prime}(\tilde F^k - F^0H^k)'M_F^kN^{-1}\sum_{i=1}^{N}e_i\lambda_i^{0\prime}\Big)\Big| \le 2r\|H^{k+}\|\cdot\bigg(T^{-1}\sum_{t=1}^{T}\|\tilde F_t^k - H^{k\prime}F_t^0\|^2\bigg)^{1/2}\cdot\frac{1}{\sqrt N}\bigg(\frac{1}{T}\sum_{t=1}^{T}\Big\|\frac{1}{\sqrt N}\sum_{i=1}^{N}e_{it}\lambda_i^0\Big\|^2\bigg)^{1/2} = O_p(C_{NT}^{-1})\cdot\frac{1}{\sqrt N} = O_p(C_{NT}^{-2})$$

by Theorem 1 and Lemma 1(ii). Therefore,

$$V(k,\tilde F^k) = N^{-1}T^{-1}\sum_{i=1}^{N}e_i'M_F^ke_i + O_p(C_{NT}^{-2}).$$

Using the fact that $V(k,\tilde F^k) - V(r,F^0) \le 0$ for $k \ge r$,

$$0 \ge V(k,\tilde F^k) - V(r,F^0) = \frac{1}{NT}\sum_{i=1}^{N}e_i'P_F^ke_i - \frac{1}{NT}\sum_{i=1}^{N}e_i'P_F^0e_i + O_p(C_{NT}^{-2}).\tag{11}$$

Note that

$$\frac{1}{NT}\sum_{i=1}^{N}e_i'P_F^0e_i \le \big\|(F^{0\prime}F^0/T)^{-1}\big\|\cdot N^{-1}T^{-2}\sum_{i=1}^{N}e_i'F^0F^{0\prime}e_i = O_p(1)T^{-1}N^{-1}\sum_{i=1}^{N}\Big\|T^{-1/2}\sum_{t=1}^{T}F_t^0e_{it}\Big\|^2 = O_p(T^{-1}) \le O_p(C_{NT}^{-2})$$

by Assumption D. Thus

$$0 \ge \frac{1}{NT}\sum_{i=1}^{N}e_i'P_F^ke_i + O_p(C_{NT}^{-2}).$$

This implies that $0 \le \frac{1}{NT}\sum_{i=1}^{N}e_i'P_F^ke_i = O_p(C_{NT}^{-2})$. In summary,

$$V(k,\tilde F^k) - V(r,F^0) = O_p(C_{NT}^{-2}).$$

Proof of Theorem 2: We shall prove that $\lim_{N,T\to\infty}P\big(PC(k) < PC(r)\big) = 0$ for all $k \ne r$ and $k \le k_{\max}$. Since

$$PC(k) - PC(r) = V(k,\tilde F^k) - V(r,\tilde F^r) - (r-k)g(N,T),$$

it is sufficient to prove $P\big[V(k,\tilde F^k) - V(r,\tilde F^r) < (r-k)g(N,T)\big] \to 0$ as $N,T\to\infty$. Consider $k < r$. We have the identity

$$V(k,\tilde F^k) - V(r,\tilde F^r) = \big[V(k,\tilde F^k) - V(k,F^0H^k)\big] + \big[V(k,F^0H^k) - V(r,F^0H^r)\big] + \big[V(r,F^0H^r) - V(r,\tilde F^r)\big].$$

Lemma 2 implies that the first and the third terms are both $O_p(C_{NT}^{-1})$. Next, consider the second term. Because $F^0H^r$ and $F^0$ span the same column space, $V(r,F^0H^r) = V(r,F^0)$. Thus the second term can be rewritten as $V(k,F^0H^k) - V(r,F^0)$, which has a positive limit by Lemma 3. Hence, $P\big(PC(k) < PC(r)\big) \to 0$ if $g(N,T) \to 0$ as $N,T\to\infty$. Next, for $k > r$,

$$P\big(PC(k) - PC(r) < 0\big) = P\big(V(r,\tilde F^r) - V(k,\tilde F^k) > (k-r)g(N,T)\big).$$

By Lemma 4, $V(r,\tilde F^r) - V(k,\tilde F^k) = O_p(C_{NT}^{-2})$. For $k > r$, $(k-r)g(N,T) \ge g(N,T)$, which converges to zero at a slower rate than $C_{NT}^{-2}$. Thus, for $k > r$, $P\big(PC(k) < PC(r)\big) \to 0$ as $N,T\to\infty$.

Proof of Corollary 1: Denote $V(k,\tilde F^k)$ by $V(k)$ for all $k$. Then

$$IC(k) - IC(r) = \ln\big[V(k)/V(r)\big] + (k-r)g(N,T).$$

For $k < r$, Lemmas 2 and 3 imply that $V(k)/V(r) > 1 + \varepsilon_0$ for some $\varepsilon_0 > 0$ with large probability for all large $N$ and $T$. Thus $\ln\big[V(k)/V(r)\big] \ge \varepsilon_0/2$ for large $N$ and $T$. Because $g(N,T) \to 0$, we have $IC(k) - IC(r) \ge \varepsilon_0/2 - (r-k)g(N,T) \ge \varepsilon_0/3$ for large $N$ and $T$ with large probability. Thus, $P\big(IC(k) - IC(r) < 0\big) \to 0$. Next, consider $k > r$. Lemma 4 implies that $V(k)/V(r) = 1 + O_p(C_{NT}^{-2})$. Thus $\ln\big[V(k)/V(r)\big] = O_p(C_{NT}^{-2})$. Because $(k-r)g(N,T) \ge g(N,T)$, which converges to zero at a slower rate than $C_{NT}^{-2}$, it follows that

$$P\big(IC(k) - IC(r) < 0\big) \le P\big(O_p(C_{NT}^{-2}) + g(N,T) < 0\big) \to 0.$$

Proof of Corollary 2: Theorem 2 is based on Lemmas 2, 3, and 4. Lemmas 2 and 3 are still valid with $\tilde F^k$ replaced by $\hat G^k$ and $C_{NT}$ replaced by $\bar C_{NT}$. This is because their proof only uses the convergence rate of $\tilde F_t$ given in (5), which is replaced by (8). But the proof of Lemma 4 does make use of the principal components property of $\tilde F^k$, namely that $V(k,\tilde F^k) - V(r,F^0) \le 0$ for $k \ge r$, which is not necessarily true for $\hat G^k$. We shall prove that Lemma 4 still holds when $\tilde F^k$ is replaced by $\hat G^k$ and $C_{NT}$ is replaced by $\bar C_{NT}$. That is, for $k \ge r$,

$$V(k,\hat G^k) - V(r,\hat G^r) = O_p(\bar C_{NT}^{-2}).\tag{12}$$

Using arguments similar to those leading to (10), it is sufficient to show that

$$V(k,\hat G^k) - V(r,F^0) = O_p(\bar C_{NT}^{-2}).\tag{13}$$

Note that, for $k \ge r$,

$$V(k,\tilde F^k) \le V(k,\hat G^k) \le V(r,\hat G^r).\tag{14}$$

The first inequality follows from the definition that the principal components estimator gives the smallest sum of squared residuals, and the second inequality follows from the least squares property that adding more regressors does not increase the sum of squared residuals. Because $\bar C_{NT}^2 \le C_{NT}^2$, we can rewrite (10) as

$$V(k,\tilde F^k) - V(r,F^0) = O_p(\bar C_{NT}^{-2}).\tag{15}$$

It follows that if we can prove

$$V(r,\hat G^r) - V(r,F^0) = O_p(\bar C_{NT}^{-2}),\tag{16}$$

then (14), (15), and (16) imply (13). To prove (16), we follow the same arguments as in the proof of Lemma 4 to obtain

$$V(r,\hat G^r) - V(r,F^0) = \frac{1}{NT}\sum_{i=1}^{N}e_i'P_G^re_i - \frac{1}{NT}\sum_{i=1}^{N}e_i'P_F^0e_i + O_p(\bar C_{NT}^{-2}),$$

where $P_G^r = \hat G^r(\hat G^{r\prime}\hat G^r)^{-1}\hat G^{r\prime}$; see (11). Because the second term on the right-hand side is shown in Lemma 4 to be $O_p(T^{-1})$, it suffices to prove that the first term is $O_p(\bar C_{NT}^{-2})$. Now,

$$\frac{1}{NT}\sum_{i=1}^{N}e_i'P_G^re_i \le \big\|(\hat G^{r\prime}\hat G^r/T)^{-1}\big\|\cdot\frac{1}{N}\sum_{i=1}^{N}\big\|e_i'\hat G^r/T\big\|^2.$$

Because $\hat H^r$ is of full rank, we have $\|(\hat G^{r\prime}\hat G^r/T)^{-1}\| = O_p(1)$ (this follows from the same arguments used in proving $\|D_k^{-1}\| = O_p(1)$). Next,

$$\frac{1}{N}\sum_{i=1}^{N}\big\|e_i'\hat G^r/T\big\|^2 \le \frac{2}{NT}\sum_{i=1}^{N}\Big\|\frac{1}{\sqrt T}\sum_{t=1}^{T}F_t^0e_{it}\Big\|^2\|\hat H^r\|^2 + \frac{2}{N}\sum_{i=1}^{N}\bigg(\frac{1}{T}\sum_{t=1}^{T}e_{it}^2\bigg)\bigg(\frac{1}{T}\sum_{t=1}^{T}\big\|\hat G_t^r - \hat H^{r\prime}F_t^0\big\|^2\bigg) = O_p(T^{-1})O_p(1) + O_p(1)O_p(\bar C_{NT}^{-2}) = O_p(\bar C_{NT}^{-2})$$

by Assumption D and (8). This completes the proof of (16), and hence of Corollary 2.

REFERENCES

Anderson, T. W. (1984): An Introduction to Multivariate Statistical Analysis. New York: Wiley.
Backus, D., S. Foresi, A. Mozumdar, and L. Wu (1997): "Predictable Changes in Yields and Forward Rates," Mimeo, Stern School of Business.
Campbell, J., A. W. Lo, and A. C. MacKinlay (1997): The Econometrics of Financial Markets. Princeton, New Jersey: Princeton University Press.
Chamberlain, G., and M. Rothschild (1983): "Arbitrage, Factor Structure and Mean-Variance Analysis in Large Asset Markets," Econometrica, 51, 1305–1324.
Cochrane, J. (1999): "New Facts in Finance, and Portfolio Advice for a Multifactor World," NBER Working Paper 7170.
Connor, G., and R. Korajczyk (1986): "Performance Measurement with the Arbitrage Pricing Theory: A New Framework for Analysis," Journal of Financial Economics, 15, 373–394.
(1988): "Risk and Return in an Equilibrium APT: Application to a New Test Methodology," Journal of Financial Economics, 21, 255–289.
(1993): "A Test for the Number of Factors in an Approximate Factor Model," Journal of Finance, 48, 1263–1291.
Cragg, J., and S. Donald (1997): "Inferring the Rank of a Matrix," Journal of Econometrics, 76, 223–250.
Dhrymes, P. J., I. Friend, and N. B. Gultekin (1984): "A Critical Reexamination of the Empirical Evidence on the Arbitrage Pricing Theory," Journal of Finance, 39, 323–346.
Donald, S. (1997): "Inference Concerning the Number of Factors in a Multivariate Nonparametric Relationship," Econometrica, 65, 103–132.
Forni, M., M. Hallin, M. Lippi, and L. Reichlin (2000a): "The Generalized Dynamic Factor Model: Identification and Estimation," Review of Economics and Statistics, 82, 540–554.
(2000b): "Reference Cycles: The NBER Methodology Revisited," CEPR Discussion Paper 2400.
Forni, M., and M. Lippi (1997): Aggregation and the Microfoundations of Dynamic Macroeconomics. Oxford, U.K.: Oxford University Press.
(2000): "The Generalized Dynamic Factor Model: Representation Theory," Mimeo, Università di Modena.
Forni, M., and L. Reichlin (1998): "Let's Get Real: A Factor-Analytic Approach to Disaggregated Business Cycle Dynamics," Review of Economic Studies, 65, 453–473.
Geweke, J. (1977): "The Dynamic Factor Analysis of Economic Time Series," in Latent Variables in Socio-Economic Models, ed. by D. J. Aigner and A. S. Goldberger. Amsterdam: North-Holland.
Geweke, J., and R. Meese (1981): "Estimating Regression Models of Finite but Unknown Order," International Economic Review, 23, 55–70.
Ghysels, E., and S. Ng (1998): "A Semi-parametric Factor Model for Interest Rates and Spreads," Review of Economics and Statistics, 80, 489–502.
Gregory, A., and A. Head (1999): "Common and Country-Specific Fluctuations in Productivity, Investment, and the Current Account," Journal of Monetary Economics, 44, 423–452.
Gregory, A., A. Head, and J. Raynauld (1997): "Measuring World Business Cycles," International Economic Review, 38, 677–701.
Lehmann, B. N., and D. Modest (1988): "The Empirical Foundations of the Arbitrage Pricing Theory," Journal of Financial Economics, 21, 213–254.
Lewbel, A. (1991): "The Rank of Demand Systems: Theory and Nonparametric Estimation," Econometrica, 59, 711–730.
Mallows, C. L. (1973): "Some Comments on $C_p$," Technometrics, 15, 661–675.
Ross, S. (1976): "The Arbitrage Theory of Capital Asset Pricing," Journal of Economic Theory, 13, 341–360.
Rubin, D. B., and D. T. Thayer (1982): "EM Algorithms for ML Factor Analysis," Psychometrika, 47, 69–76.
Sargent, T., and C. Sims (1977): "Business Cycle Modelling without Pretending to Have Too Much a Priori Economic Theory," in New Methods in Business Cycle Research, ed. by C. Sims. Minneapolis: Federal Reserve Bank of Minneapolis.
Schwert, G. W. (1989): "Tests for Unit Roots: A Monte Carlo Investigation," Journal of Business and Economic Statistics, 7, 147–160.
Stock, J. H., and M. Watson (1989): "New Indexes of Coincident and Leading Economic Indicators," in NBER Macroeconomics Annual 1989, ed. by O. J. Blanchard and S. Fischer. Cambridge: M.I.T. Press.
(1998): "Diffusion Indexes," NBER Working Paper 6702.
(1999): "Forecasting Inflation," Journal of Monetary Economics, 44, 293–335.