

Subspace Regression: Predicting a Subspace from one Sample

Anonymous CVPR submission

Paper ID 1369

Abstract

Subspace methods have been extensively used to solve a variety of problems in computer vision, including object detection, recognition, and tracking. Typically the subspaces are learned from a training set that contains different configurations or states of a particular object (e.g., variations in shape or appearance). However, in some situations it is not possible to have access to data with multiple configurations of the object. An example is the problem of building a person-specific subspace of pose variation from only a frontal face image. In this paper we propose methods to predict a subspace from only one sample.

Predicting a subspace from only one sample is a challenging problem for several reasons: (i) it involves a mapping between very high-dimensional spaces, and often there are far fewer training samples than features; (ii) it is unclear how to parameterize the subspace. This paper proposes four methods that learn a mapping from one sample to a subspace: Individual Mapping on Images, Direct Mapping to Subspaces, Regression on Subspaces, and Direct Subspace Alignment. We show the validity of our approaches in predicting a person-specific face subspace of pose or illumination and in applications to face tracking. To the best of our knowledge, this is the first paper that addresses the problem of learning a subspace directly from one sample.

1. Introduction

Since the early work of Sirovich and Kirby [21] parameterizing the human face using Principal Component Analysis (PCA) [13] and the successful eigenfaces of Turk and Pentland [23], many computer vision researchers have used subspace techniques to construct linear models of optical flow, shape, or gray level for tracking [4, 5], detection [23], and recognition [3]. The modeling power of subspace techniques is especially useful when applied to visual data, because dimensionality reduction is needed given the large number of features (i.e., pixels). Typically, the subspaces are learned from a set of registered training samples. An example is the problem of building a person-specific image subspace that models the variation across pose. Building this subspace often requires a large number of training images sampled from the underlying person manifold across viewpoints, typically a set of images covering all possible pose variations. Once the data are collected, we can compute PCA (Fig. 1 (Top)) to learn the person-specific pose subspace. This procedure incurs a costly data collection step: for every new test subject, one has to gather images of that same subject for all possible poses (and often illuminations and expressions).

Figure 1. Illustration of the proposed subspace regression approach. (Top) Traditional subspace learning: a person-specific pose subspace is learned by performing PCA on a set of training samples. (Bottom) Subspace regression predicts a person-specific pose subspace from only one sample (e.g., a frontal image).

In this paper we propose to learn a subspace from a single sample (i.e., one image), which lets us circumvent the expensive data collection step. We cast the problem as one of predicting a subspace from a single image, which we refer to as subspace regression. Fig. 1 (Bottom) illustrates the main idea.

Predicting a subspace from only one sample is a challenging problem, mainly due to the high dimensionality of the data sets, which typically have fewer training samples than features. Moreover, it is not straightforward how one can effectively and efficiently parameterize the mapping and the subspace. We first suggest two relatively straightforward solutions based on learning regressors from a query sample (e.g., a frontal image) to either the samples of all other states (recall that a state denotes a possible variation, e.g., a pose) or to the subspace itself. In the face case, for instance, the former approach amounts to learning a set of regressors, one per pose, that morph the frontal-pose face into the other poses; we call this method Individual Mapping on Images (IMI). In the latter approach, the output responses of the regressor are the subspace mean and basis vectors (e.g., eigenfaces); we refer to it as Direct Mapping to Subspaces (DMS). Despite their conceptual simplicity, IMI and DMS suffer from unreliable function estimation caused by the very high-dimensional input/output spaces (e.g., an image dimension of several thousands).

To address the issue of estimating a large number of parameters, we next propose Regression on Subspaces (ROS), a novel generative-discriminative approach that performs regression on the subspace coordinates. ROS can yield a reliable estimator with a significantly reduced number of parameters. Interestingly, we derive that ROS can be seen as a tri-factor reduced-rank regression which aggressively reduces the number of parameters in a sensible manner. Furthermore, we incorporate the parametrization of ROS into a more discriminative objective function that directly reflects the goodness of alignment between the target person subspace and the predicted person subspace. We have observed that this approach, named Direct Subspace Alignment (DSA), can achieve further improvement over ROS in subspace regression. For the four subspace regression approaches we propose, we demonstrate their validity in predicting a person-specific subspace of pose or illumination and in applications to face tracking.

2. Previous Work

This section reviews the notation and previous work on subspace learning and linear/kernel regressors.

2.1. PCA Subspace Learning

PCA is one of the most popular dimensionality reduction techniques [16, 10, 13]. The basic ideas behind PCA date back to Pearson in 1901 [16], and a more general procedure was described by Hotelling [10] in 1933.

Let X ∈ R^{p×n} be a data matrix whose columns are vectorized data samples, where p denotes the number of features and n the number of samples. (Notation: for a matrix D, d_j denotes its jth column vector and d_{ij} its (i, j) element; 1_k ∈ R^{k×1} is a vector of ones; I_k ∈ R^{k×k} is the identity matrix; ||d||_2^2 denotes the squared Euclidean norm of the vector d; tr(A) = Σ_i a_{ii} is the trace of the matrix A; vec(A) is the linear operator that converts a matrix A ∈ R^{m×n} into a column vector a ∈ R^{mn×1}; and ||A||_F^2 = tr(A^T A) is the squared Frobenius norm.) Let μ = (1/n) Σ_{i=1}^n x_i be the mean of the data samples. PCA finds an orthogonal subspace B that best preserves the total covariance S_t = (1/n) Σ_{i=1}^n (x_i − μ)(x_i − μ)^T, namely:

    max_B  tr(B^T S_t B)   s.t.  B^T B = I_q,                                  (1)

where B ∈ R^{p×q} and q is the dimension of the subspace, typically q ≤ min(n, p). The columns of B form an orthonormal basis that spans the principal subspace of X. PCA can be computed in closed form by taking the leading eigenvectors of the covariance matrix S_t [8, 13]. The PCA projection Z = B^T X (I_n − (1/n) 1_n 1_n^T) ∈ R^{q×n} is decorrelated, that is, Z Z^T = Λ, where Λ ∈ R^{q×q} is a diagonal matrix containing the eigenvalues of S_t. This yields the embedding points z_i ∈ R^{q×1} for x_i, the column vectors of Z. For large data sets of high-dimensional data (i.e., when p and n are large), it is often more efficient (in both space and time) to minimize a least-squares error function [15, 18, 9, 7]:

    min_{μ,B}  (1/2) Σ_{i=1}^n || x_i − μ − B B^T (x_i − μ) ||_2^2.            (2)
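The closed-form PCA of Sec. 2.1 is easy to prototype. The following minimal NumPy sketch (our illustration, not the authors' code) computes the mean, the top-q eigenvectors of S_t, and the decorrelated projections; for very large p one would instead work from the SVD of the centered data, in the spirit of the least-squares form (2).

```python
# Minimal NumPy sketch of closed-form PCA (Sec. 2.1); illustrative only.
import numpy as np

def pca_subspace(X, q):
    """X: (p, n) data matrix with samples in columns. Returns (mu, B, Z)."""
    p, n = X.shape
    mu = X.mean(axis=1, keepdims=True)        # (p, 1) sample mean
    Xc = X - mu                               # centered data
    St = (Xc @ Xc.T) / n                      # (p, p) total covariance S_t
    evals, evecs = np.linalg.eigh(St)         # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:q]       # indices of the q largest eigenvalues
    B = evecs[:, order]                       # (p, q) orthonormal basis solving (1)
    Z = B.T @ Xc                              # (q, n) decorrelated projections
    return mu, B, Z
```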

2.2. Regression

This section reviews linear/kernel ridge regression and reduced-rank regression.

2.2.1 Linear Regression and Ridge Regression

Let {(x_i, y_i)}_{i=1}^n be n training samples, where x_i ∈ R^{p×1} is a p-dimensional input sample and y_i ∈ R^{d×1} a d-dimensional output sample. In linear regression the prediction function has the linear form f(x) = W^T x, where W ∈ R^{p×d} contains the parameters to be learned. We use matrix notation for the inputs, X = [x_1, ..., x_n], of dimension (p × n); similarly, Y = [y_1, ..., y_n] is of dimension (d × n). Following the regularized empirical risk minimization framework, we minimize:

    min_W  ||Y − W^T X||_F^2 + λ ||W||_F^2,                                    (3)

where λ controls the degree of regularization. Problem (3) is often called ridge regression, and it has the closed-form solution W = (X X^T + λ I)^{−1} X Y^T.
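As a quick illustration of (3), a small NumPy sketch of the ridge-regression closed form follows; the value of λ is a placeholder.

```python
# Sketch of the ridge-regression solution of (3): W = (X X^T + lambda I)^{-1} X Y^T.
import numpy as np

def ridge_fit(X, Y, lam=1e-2):
    """X: (p, n) inputs, Y: (d, n) outputs. Returns W of shape (p, d)."""
    p = X.shape[0]
    return np.linalg.solve(X @ X.T + lam * np.eye(p), X @ Y.T)

def ridge_predict(W, x):
    """Prediction f(x) = W^T x for a single (p,) or (p, 1) input."""
    return W.T @ x
```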

2.2.2 Kernel Regression

The restriction to linear functions is often relaxed by admitting a family of smooth nonlinear functions, typically represented by the Reproducing Kernel Hilbert Space (RKHS) H induced by some kernel function k(·,·) on the input space. Kernel regression optimizes:

    min_f  Σ_{i=1}^n ||y_i − f(x_i)||_2^2 + λ ||f||_H^2.                       (4)


From the well-known representer theorem [14], an optimization problem like (4) admits a solution of the form f(x) = Σ_{i=1}^n α_i k(x, x_i), where α_i ∈ R^{d×1} is a d-dimensional dual vector. Collecting the dual vectors in a matrix α = [α_1, ..., α_n] ∈ R^{d×n}, it is not difficult to see that (4) can be written as:

    min_α  ||Y − α K||_F^2 + λ tr(α K α^T),                                    (5)

where K is the (n × n) kernel matrix with K_{ij} = k(x_i, x_j). The solution to (5) is α = Y (K + λ I)^{−1}. For a new data point x, the prediction is:

    f(x) = Y (K + λ I)^{−1} k(x),                                              (6)

where k(x) = [k(x, x_1), ..., k(x, x_n)]^T.
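A small sketch of the dual solution (5)-(6) is given below. The RBF kernel and its bandwidth are our choices for illustration; the paper does not prescribe a particular kernel.

```python
# Sketch of kernel ridge regression, eqs. (5)-(6): alpha = Y (K + lambda I)^{-1}.
import numpy as np

def rbf_kernel(A, B, gamma=1e-3):
    """A: (p, m), B: (p, n). Returns the (m, n) Gram matrix of an RBF kernel."""
    d2 = (A**2).sum(0)[:, None] + (B**2).sum(0)[None, :] - 2.0 * A.T @ B
    return np.exp(-gamma * d2)

def kernel_ridge_fit(X, Y, lam=1e-2, gamma=1e-3):
    """X: (p, n) inputs, Y: (d, n) outputs. Returns dual coefficients alpha of shape (d, n)."""
    K = rbf_kernel(X, X, gamma)
    return Y @ np.linalg.inv(K + lam * np.eye(K.shape[0]))

def kernel_ridge_predict(alpha, X_train, x_new, gamma=1e-3):
    """Prediction f(x) = alpha k(x), eq. (6), for one new sample x_new of shape (p,)."""
    k = rbf_kernel(X_train, x_new.reshape(-1, 1), gamma)   # (n, 1) kernel vector k(x)
    return alpha @ k                                       # (d, 1) prediction
```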

2.2.3 Reduced-Rank Regression

Since its introduction in the early 1950s by Anderson [1, 2], the reduced-rank regression (RRR) model has inspired a wealth of applications in fields such as signal processing [19], neural networks [8], time series analysis [1], and computer vision [6]. RRR learns a mapping between input x and output y by minimizing:

    min_{A,C}  Σ_{i=1}^n ||y_i − A C^T x_i||_2^2,                              (7)

where A ∈ R^{d×q} and C ∈ R^{p×q}. One might guess that A and C can be obtained from the singular value decomposition (SVD) of the W learned in (3); however, as shown in [22], this often yields inferior results compared to optimizing A and C simultaneously in the least-squares problem. Moreover, this least-squares problem has no local minima, as it reduces to learning a canonical correlation analysis (CCA) [11] embedding of x (which gives C), followed by a least-squares regression from the embedding C^T x to y (which gives A). A nonlinear version of RRR can be derived similarly via the kernel trick.
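For concreteness, the sketch below shows one classical way to compute the RRR factors A and C: fit the full (ridge-stabilized) least-squares map, then restrict its fitted outputs to their top-q principal directions. This is our illustrative construction consistent with (7), not necessarily the procedure used in the paper.

```python
# Sketch of a classical reduced-rank regression solver for (7).
import numpy as np

def reduced_rank_regression(X, Y, q, lam=1e-6):
    """X: (p, n) inputs, Y: (d, n) outputs, q: target rank. Returns (A, C)."""
    p = X.shape[0]
    W = np.linalg.solve(X @ X.T + lam * np.eye(p), X @ Y.T)  # (p, d) full-rank solution
    Yhat = W.T @ X                                           # (d, n) fitted outputs
    U, _, _ = np.linalg.svd(Yhat, full_matrices=False)
    A = U[:, :q]                                             # (d, q) output factor
    C = W @ A                                                # (p, q) input factor
    return A, C                                              # prediction: A @ (C.T @ x)
```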

3. Subspace Regression

This section describes four methods that learn the relationship between one sample and a subspace. The four techniques that we propose in this paper are: Individual Mapping on Images (IMI), Direct Mapping to Subspaces (DMS), Regression on Subspaces (ROS), and Direct Subspace Alignment (DSA).

For all the methods we assume that there are n instances (i = 1, ..., n), where each instance i has (K+1) samples, denoted x_i^s, corresponding to different states s ∈ {0, 1, ..., K}. These samples can be regarded as the (K+1) realizable variations of instance i. The state s = 0 is reserved for the reference state; hence the reference sample for instance i is x_i^0.

For instance, in the problem of predicting a person-specific subspace from a frontal image, the training data consist of n subjects, where for each subject i we have (K+1) images of different states (e.g., facial poses, illuminations, expressions). When pose is the only state we consider (i.e., the person-specific subspace accounts only for pose variability), one can let s = 0 correspond to the frontal pose. We presume that each training image x_i^s is labeled with the subject identity i and the state s. In the testing phase, our goal is to predict the subspace of a new subject from a given frontal image.

3.1. Individual Mapping on Images (IMI)

This section describes the Individual Mapping on Images (IMI) method. Given a query sample x_*^0 of a new, unseen instance (subject) *, we synthesize samples of the other states, x_*^s for s = 1, ..., K, and then immediately learn a subspace for * by applying PCA to the generated images. To obtain the image x_*^s for each state s, IMI learns a regression function that maps a reference sample x^0 to an s-state sample x^s. The training data set for the regression of state s is thus {(x_i^0, x_i^s)}_{i=1}^n, for s = 1, ..., K.

IMI learns K linear or nonlinear regressors between x^0 and each of the states x^s. We use the reduced-rank regression (RRR) model that minimizes:

    min_{A_s},{C_s}  Σ_{s=1}^K Σ_{i=1}^n ||x_i^s − A_s C_s^T x_i^0||_2^2.      (8)

Here, A_s ∈ R^{p×l} and C_s ∈ R^{p×l} for s = 1, ..., K are the parameters of the model, where l is the reduced-rank dimension, which is far smaller than the input/output dimension (l << p). Once we solve (8), given a query sample x_*^0 of a new, unseen instance (subject) *, the synthesized sample at state s is obtained as:

    x_*^s = A_s C_s^T x_*^0.                                                   (9)

After generating samples at all other states, x_*^s for s = 1, ..., K, we build a person-specific subspace for subject * via PCA. The nonlinear IMI, namely a nonlinear mapping from image to image, can be achieved straightforwardly by using kernel regression.
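A compact sketch of the IMI pipeline follows, reusing the pca_subspace and reduced_rank_regression helpers sketched in Sec. 2; these function names and the particular rank-l regressor are our illustrative assumptions.

```python
# Sketch of IMI (Sec. 3.1): one rank-l regressor per state, then PCA on the synthesized set.
import numpy as np

def imi_train(X0, X_states, l):
    """X0: (p, n) reference samples; X_states: list of K arrays (p, n); l: reduced rank."""
    return [reduced_rank_regression(X0, Xs, l) for Xs in X_states]   # K pairs (A_s, C_s)

def imi_predict_subspace(x0, regressors, q):
    """Synthesize x^s = A_s C_s^T x^0 for all states (eq. (9)), then PCA-learn the subspace."""
    synth = [A @ (C.T @ x0) for (A, C) in regressors]
    samples = np.column_stack([x0] + synth)          # (p, K+1), reference image included
    mu, B, _ = pca_subspace(samples, q)              # person-specific subspace
    return mu, B
```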

3.2. Direct Mapping to Subspaces (DMS)

Figure 2. Regression on Subspaces (ROS) approach.

This section describes a method called Direct Mapping to Subspaces (DMS). DMS learns a direct regression between the reference sample x^0 and the subspace built from all samples at the other states (x^s, for s = 1, ..., K).

In this setting, the training data can be formed as {(x_i^0, (μ_i, B_i))}_{i=1}^n, where the output point (μ_i, B_i) is the subspace of instance (subject) i, typically learned via PCA from the training samples {x_i^s}_{s=0}^K. We regard the output as the concatenation of the mean μ_i and the vectorized basis matrix B_i. As in the IMI approach, since the output consists of many variables (e.g., the mean and the eigenvectors), we use reduced-rank regressors.

In the face case, the sample images are assumed to be well normalized (i.e., faces are aligned to an upright frontal view, scaled to the same size, and vectorized to p dimensions, x ∈ R^{p×1}). The orthonormal embedding matrix B = [b_1, ..., b_q] of the person-specific subspace has dimension (p × q), with column basis vectors b_j ∈ R^{p×1} for j = 1, ..., q. The subspace dimension q is chosen so as to preserve 90% of the energy.

3.3. Regression on Subspaces (ROS)

The above two approaches tend to perform poorly due to the high dimensionality of the output space (either images or subspaces) to be predicted. Reduced-rank regression helps, but it still requires many parameters to achieve the desired accuracy and, in practice, often yields prediction performance similar to ordinary least-squares regression. This section proposes a new method called Regression on Subspaces (ROS) that learns a mapping between two high-dimensional data sets with significantly fewer parameters.

The overall concept of ROS is illustrated in Fig. 2. We first learn state-specific PCA subspaces, each learned from the samples (images) of the same state across different instances (subjects). For state s, we denote the r-dimensional state-specific subspace by S_s = (m_s, A_s), where m_s ∈ R^p is the mean and A_s ∈ R^{p×r} contains the basis vectors in its columns. A_s is learned via PCA (ideally CCA could be optimal; however, in our experiments, due to the small number of training data points, PCA often outperformed CCA). Notice that S_s captures the variability solely in the styles (i.e., subjects), not in the contents (i.e., states). Once we have learned the state-specific subspaces, one can project x_i^s, the image of subject i in state s, onto S_s, yielding the subspace coordinate z_i^s ∈ R^r:

    z_i^s = A_s^T (x_i^s − m_s).                                               (14)

Algorithm 1: Regression on Subspaces (Training)
  Input: samples {x_i^s} for i = 1, ..., n and s = 0, ..., K.
  Output: learned g_s ∈ R^{r×r} for s = 1, ..., K.
  1) For s = 0, ..., K, PCA-learn a state subspace S_s = (m_s, A_s) from the data {x_i^s}_{i=1}^n.
  2) Project {x_i^s} onto the state subspace S_s using (14).
  3) Learn a regressor for each state s = 1, ..., K:

         min_{g_s}  Σ_{i=1}^n ||z_i^s − g_s^T z_i^0||_2^2.                     (10)

Algorithm 2: Regression on Subspaces (Testing)
  Input: test reference x_*^0 and the learned {g_s}_{s=1}^K.
  Output: learned person subspace (μ_*, B_*) for subject *.
  1) Project x_*^0 onto the 0-state subspace:

         z_*^0 = A_0^T (x_*^0 − m_0).                                          (11)

  2) Apply the regression on subspace coordinates for s = 1, ..., K:

         z_*^s = g_s^T z_*^0.                                                  (12)

  3) Synthesize a sample for each state s = 1, ..., K:

         x_*^s = A_s z_*^s + m_s.                                              (13)

  4) PCA-learn a person-specific subspace (μ_*, B_*) from the generated samples {x_*^s}_{s=1}^K and the reference x_*^0.

Although the state-specific subspaces are learned individually and independently, one can relate them to one another by introducing certain restrictions on the subspaces. Here we implicitly impose such constraints by considering mappings from one subspace to another. More specifically, we presume that there exists a regression matrix g_s ∈ R^{r×r} for each state s = 1, ..., K that maps the 0-state coordinate z^0 to the state-s coordinate z^s. That is,

    z^s = g_s^T z^0.                                                           (15)

One can naturally treat g_s as a subspace morphing function from S_0 to S_s, accounting for how a frontal-image representation can be transformed into the representation of state s in a subject-generic manner. Although IMI and DMS explicitly aim at a similar goal, they suffer from the high dimensionality of both input and output.

ROS, on the other hand, operates on the coordinates of the subspaces, which makes it more robust to noise and more sensible, with significantly fewer parameters to be learned. By regarding the pairs {(z_i^0, z_i^s)}_{i=1}^n as samples drawn independently and identically from an underlying probability distribution on the joint subspace (S_0, S_s), one can learn g_s using the regression algorithms discussed above. Although we have explained ROS using linear regression, it can be straightforwardly extended to kernel regression.

Once we obtain the mappings among the state subspaces, we can synthesize the out-of-state images {x_*^s}_{s=1}^K for a new subject *, given its 0-state image x_*^0. Finally, with the synthesized images, one can learn a subspace for * from the data {x_*^s}_{s=1}^K ∪ {x_*^0} using PCA. The overall training and testing algorithms are summarized in Alg. 1 and Alg. 2.
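A sketch of Algorithms 1 and 2 in NumPy follows, reusing pca_subspace from the Sec. 2.1 sketch and plain least squares for the r × r maps g_s (the paper also allows kernel regression here).

```python
# Sketch of ROS training and testing (Algorithms 1 and 2).
import numpy as np

def ros_train(X_states, r):
    """X_states: list of K+1 arrays of shape (p, n); index 0 is the reference state."""
    subs = [pca_subspace(Xs, r) for Xs in X_states]                  # S_s = (m_s, A_s, .)
    Z = [A.T @ (Xs - m) for (m, A, _), Xs in zip(subs, X_states)]    # coordinates, eq. (14)
    G = [np.linalg.lstsq(Z[0].T, Z[s].T, rcond=None)[0]              # g_s solving eq. (10)
         for s in range(1, len(X_states))]
    return subs, G

def ros_predict(x0, subs, G, q):
    """Predict the person-specific subspace (mu_*, B_*) from one reference image x0 (p, 1)."""
    m0, A0, _ = subs[0]
    z0 = A0.T @ (x0 - m0)                                            # eq. (11)
    synth = [A @ (g.T @ z0) + m                                      # eqs. (12)-(13)
             for (m, A, _), g in zip(subs[1:], G)]
    samples = np.column_stack([x0] + synth)
    mu, B, _ = pca_subspace(samples, q)
    return mu, B
```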

3.3.1 Relation between IMI and ROS

Recall from Sec. 3.1 that IMI solves the following reduced-rank regression:

    min_{R_s}  Σ_{s=1}^K Σ_{i=1}^n ||x_i^s − R_s x_i^0||_2^2,   where R_s = A_s C_s^T,   (16)

which can be seen as a two-factor parametrization of the regression coefficient R_s. Although the reduced-rank regressor reduces the number of parameters through this dyadic factorization, we now show that the ROS approach described in the previous section can be seen as a more aggressive factorization with significantly fewer parameters. To see this, we rewrite the ROS objective from a more general perspective. The optimization problem underlying ROS can be expressed as:

    min_{A_s},{g_s}  Σ_{s=0}^K Σ_{i=1}^n ||x_i^s − A_s A_s^T x_i^s||_2^2
                     + η Σ_{s=1}^K Σ_{i=1}^n ||A_s^T x_i^s − g_s^T A_0^T x_i^0||_2^2.   (17)

Here, for expositional simplicity, we drop the subspace means m_s, assuming that the data points are centered.

The first term of (17) is responsible for the state-wise PCA subspace learning (i.e., learning A_s for s = 0, 1, ..., K), while the second term corresponds to the subspace regression from the 0-state subspace (S_0) to the s-state subspaces (S_s). The two terms are balanced by the positive constant η. The way we optimized this problem in Sec. 3.3 can be seen as coordinate descent: we first learn the A_s while ignoring the second term, then fix the learned A_s and least-squares optimize the second term with respect to {g_s} alone.

Now we replace A_s^T x_i^s in the first term by g_s^T A_0^T x_i^0 from the second term, as we try to make the second term become zero. This leads to:

    min_{A_s},{g_s}  Σ_{s=1}^K Σ_{i=1}^n ||x_i^s − A_s g_s^T A_0^T x_i^0||_2^2.   (18)

Note that (18) can be seen as a noise-free version of (17), obtained by forcing the subspace regression error (i.e., the second term in (17)) to be zero.

Compared to the reduced-rank regression, (18) further factorizes the post-multiplying matrix C_s as:

    C_s → A_0 g_s,                                                             (19)

which yields the tri-factor reduced-rank regression:

    x_i^s = A_s g_s^T A_0^T x_i^0.                                             (20)

This has an important consequence: one can fix A_s and A_0, which would otherwise contribute a large number of parameters to be estimated. In fact, since there can always be a rotation matrix between A_s and g_s, and another between g_s and A_0, that leaves the final mapping invariant, we keep A_s and A_0 fixed. ROS requires that these matrices come from the state-wise PCA learning, which is also quite intuitive. All that remains to be estimated are the g_s, so the number of parameters becomes K × r^2. In a nutshell, ROS essentially optimizes (18) with respect to the g_s while the high-dimensional matrices A_s are kept fixed.

3.4. Direct Subspace Alignment (DSA)

In the previous sections we have considered different ways of learning mappings from the reference image to other images or to the subspace. In this section we propose another new method called Direct Subspace Alignment (DSA). Motivated by the tri-factor RRR parametrization x_i^s = A_s g_s^T A_0^T x_i^0, we consider, for each reference sample x_i^0, the K state samples generated by the tri-factor RRR. Together with x_i^0, these samples form:

    Q_i = [x_i^0, A_1 g_1^T A_0^T x_i^0, ..., A_K g_K^T A_0^T x_i^0].          (21)

Note that Q_i ∈ R^{p×(K+1)} has (K+1) columns, where the second through last columns are samples synthesized by the tri-factor reduced-rank regression.

Given this parameterization of the (K+1) images, a more direct and discriminative approach is to optimize, over {g_s}_{s=1}^K, a score function that measures the correlation with the target training subspaces B_i. Among several possibilities, we suggest the following reasonable score function:

    E(g_1, ..., g_K) = Σ_{i=1}^n Σ_{j=1}^q ( b_j^{(i)T} eig_j(Q_i Q_i^T) )^2,   (22)

where B_i = [b_1^{(i)}, ..., b_q^{(i)}] is the orthonormal basis of the subspace of instance (subject) i. We assume that the basis vectors are always sorted in decreasing order of eigenvalue. The operator eig_j(A) returns the eigenvector of A corresponding to its jth largest eigenvalue; thus eig_j(Q_i Q_i^T) is the jth major basis vector of the column space of the tri-factor parameterized data matrix Q_i. (Note that the Q_i are functions of {g_s}_{s=1}^K.)


Figure 3. Different pose images for the first five subjects in the CMU PIE data set [20]. The images are arranged so that each row corresponds to a particular subject and each column represents a specific state (pose, in this case).

Notice that (22) is the sum of the squared cosines of the angles between the basis vectors of the target subject subspace and those of the parameterized subject subspace. It explicitly maximizes the alignment between the principal directions on the training set and the eigenvectors of Q_i Q_i^T, which are parameterized by the low-dimensional matrices g_1, ..., g_K. Optimizing (22) with respect to g_1, ..., g_K is a challenging problem with many local optima. In this paper we use numerical gradient-based optimization via Matlab's fminunc() function.
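The sketch below evaluates the score (22) and optimizes it numerically, with scipy.optimize.minimize (L-BFGS-B, finite-difference gradients) standing in for Matlab's fminunc; the flattening of {g_s} into one parameter vector and the choice of optimizer are our assumptions.

```python
# Sketch of the DSA score (22) and its numerical optimization; means are omitted,
# matching the centered formulation of Sec. 3.3.1.
import numpy as np
from scipy.optimize import minimize

def dsa_score(g_flat, X0, B_list, A_list, A0, r, q):
    """X0: (p, n) reference images; B_list: target bases B_i (p, q); A_list: state bases A_s."""
    K = len(A_list)
    G = g_flat.reshape(K, r, r)
    score = 0.0
    for i in range(X0.shape[1]):
        x0 = X0[:, i:i + 1]
        cols = [x0] + [A_list[s] @ (G[s].T @ (A0.T @ x0)) for s in range(K)]
        Q = np.column_stack(cols)                          # Q_i of eq. (21)
        U, _, _ = np.linalg.svd(Q, full_matrices=False)    # left sing. vectors = eig_j(Q Q^T)
        cosines = np.diag(B_list[i].T @ U[:, :q])          # b_j^(i)T eig_j(Q_i Q_i^T)
        score += np.sum(cosines ** 2)
    return score

def dsa_fit(g_init, X0, B_list, A_list, A0, r, q):
    """Maximize (22) by minimizing its negation; g_init: (K, r, r) initial maps."""
    res = minimize(lambda g: -dsa_score(g, X0, B_list, A_list, A0, r, q),
                   g_init.ravel(), method='L-BFGS-B')
    return res.x.reshape(len(A_list), r, r)
```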

4. Experimental Results

We conducted experiments on predicting a person-specific facial pose subspace from a frontal image, predicting a person-specific illumination subspace from a single illumination condition, and applying the predicted subspaces to subspace-based face tracking.

4.1. Predicting a Subspace for Pose

We consider the CMU PIE data set [20], which contains 41,368 images of 68 people. Each person is imaged under 13 different poses, 43 different illumination conditions, and 4 different expressions, and the images are labeled with these states. We used a subset of 720 images covering 60 subjects and 12 different poses, all under the same expression and illumination conditions. Each face image is cropped to a tight bounding box using the ground-truth facial landmark points provided with the data set, and all images are normalized to the same size (48 × 48). The images of all poses for the first five subjects are shown in Fig. 3. For each subject we estimate a PCA subspace of a fixed dimension. We then split the data randomly into 50 training and 10 test subjects. Revealing only a single frontal image for each subject in the test fold, we predict the subspaces of the test subjects.

To measure the goodness of subspace alignment between the estimated subspace and the ground-truth subspace obtained from real samples, we use three quantitative error metrics: (i) the smallest principal angle [11], which can also be computed with the subspace() function in Matlab, (ii) the sum of the squared cosine angles between basis vectors, and (iii) the subspace distance defined in [24]. More specifically, to assess the distance between two subspaces B_1 = [b_1^{(1)}, ..., b_q^{(1)}] and B_2 = [b_1^{(2)}, ..., b_q^{(2)}], we compute the SVDs B_1 = U_1 Σ_1 V_1^T and B_2 = U_2 Σ_2 V_2^T. The sum of the squared cosine angle errors between the two subspaces can then be defined as (in fact, we take an average and subtract it from 1):

    d_1(B_1, B_2) = 1 − (1/q) Σ_{j=1}^q ( u_j^{(1)T} u_j^{(2)} )^2,            (23)

where u_j^{(m)} is the jth column of U_m for m = 1, 2. The subspace distance recently defined in [24] has the following form:

    d_2(B_1, B_2) = sqrt( (1/2) || B_1 B_1^T − B_2 B_2^T ||_F^2 ).             (24)

Note that for all three measures, smaller numbers indicate better performance.

Table 1. Test errors on the subspace for pose. For each error measure, the linear regression errors are shown with kernel regression errors in parentheses. For all measures, smaller is better.

    Error Measure          IMI              DMS              ROS              DSA
    Principal Angle        1.1507 (1.2312)  1.4088 (1.1645)  1.0860 (1.0651)  1.0742 (N/A)
    Squared Cosine (d_1)   0.8909 (0.8248)  0.9286 (0.9172)  0.8193 (0.8144)  0.8236 (N/A)
    Subspace Dist. (d_2)   1.2714 (1.2869)  1.3286 (1.3464)  1.2158 (1.2112)  1.2106 (N/A)
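For reference, a NumPy/SciPy sketch of the three alignment measures is given below, assuming B1 and B2 are (p × q) matrices with orthonormal columns; scipy.linalg.subspace_angles plays the role of Matlab's subspace().

```python
# Sketch of the three alignment measures of Sec. 4.1.
import numpy as np
from scipy.linalg import subspace_angles

def smallest_principal_angle(B1, B2):
    """Smallest principal angle (radians) between the two column spaces."""
    return float(np.min(subspace_angles(B1, B2)))

def d1(B1, B2):
    """Eq. (23): one minus the average squared cosine between corresponding SVD bases."""
    U1 = np.linalg.svd(B1, full_matrices=False)[0]
    U2 = np.linalg.svd(B2, full_matrices=False)[0]
    cos2 = (U1 * U2).sum(axis=0) ** 2            # (u_j^(1)T u_j^(2))^2 for each j
    return 1.0 - cos2.mean()

def d2(B1, B2):
    """Eq. (24): projection-based subspace distance."""
    D = B1 @ B1.T - B2 @ B2.T
    return np.sqrt(0.5) * np.linalg.norm(D, 'fro')
```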

We evaluated our four subspace regression approaches, using both linear and kernel ridge regression as the underlying regressors (except for DSA, where only the linear form is applicable). The test errors are shown in Table 1. In this experiment, we set the subject subspace dimension to q = 4 and the state (pose) subspace dimension to r = 5. As the table shows, performance improves from the least sophisticated approach (IMI) to the most advanced ones (ROS and DSA), and the results are consistent regardless of the choice of regression algorithm.

4.2. Face Tracking Across Pose

This section shows how effectively the person-specific subspaces built from one sample can be applied to the problem of face tracking across pose. The tracking scenario is inside a vehicle, with a camera mounted at a fixed position. We recorded a video of about two minutes, yielding roughly 3000 frames at 25 fps.


Figure 4. Selected frames illustrating tracking performance of IVT, DMS, and ROS. Yellow bounding box represents the tracked face.

The video is very challenging: it is taken outdoors, so there are large variations in illumination as well as pose changes. One of the major challenges is that the driver's facial pose varies abruptly and dramatically. Moreover, the video is noisy, with low resolution and contrast. See Fig. 4 for a few frames of the video.

In this context, we used subspace-based trackers [4, 17], which can effectively deal with noise and large variations in appearance. In these approaches, the appearance model of the tracker is a subspace that captures the variability of the target appearance, combined with an efficient search strategy, typically sampling-based particle filtering [12]. To handle appearance change during tracking, the recent Incremental Visual Tracker (IVT) [17] updates the subspace model using the previously tracked images. However, because the training images are collected based on the decisions made by the current tracker, IVT's subspace learning can be seen as self-training or unsupervised learning. Although this is a reasonable approach given the restricted information of a single image at the first frame, IVT cannot judge in a principled manner whether its current decision is correct, which is the main cause of tracking drift.

In contrast, the way we learn a person-specific subspace (i.e., subspace regression) is highly advantageous, as we directly predict a subspace from the single image given in the first frame of the video. Besides addressing the above issues of IVT, our approach also avoids the computational overhead of updating the subspace at every frame, since the subspace model is determined and fixed at the first frame. We design a tracker that incorporates the subspace estimated by our subspace regression approaches into the particle filtering framework. The training data used for subspace regression come from the CMU PIE data set [20], using 50 subjects with 12 different poses (other conditions such as illumination and expression are not considered here).

In addition to comparing our approaches with IVT, we also report the performance of a basic template-matching tracker as a baseline. It maintains a template image model for the face target and, similarly to IVT, updates the template at every frame by computing a weighted average of the current template and the tracked image; we therefore call it Adaptive Template Matching (ATM). ATM is essentially identical to IVT except that it maintains only the mean of the subspace.

Table 2. Average RMS errors (in pixels) for face tracking.

           ATM     IVT [17]   IMI     DMS     ROS
    RMS    42.99   38.41      40.04   38.16   37.98

We enforce the same settings for all competing methods for a fair comparison. The initial location of the face is obtained from a face detector. For the tracking state, we use an axis-aligned bounding-box representation, i.e., we track three parameters: center position and scale. To provide quantitative results, we manually labeled the face location in every 10th frame to form the ground truth. The average root-mean-square (RMS) error (in pixels) is shown in Table 2. Our two most sophisticated approaches, DMS and ROS, achieve slightly better performance than IVT, even though we only take pose variation into account in the subspace learning (we do not report DSA here because its performance is similar to that of ROS). Fig. 4 also shows selected frames comparing our approaches with IVT.

4.3. Predicting a Subspace for Illumination

The third experiment illustrates the capability of subspace regression to learn a person-specific illumination subspace. We used the frontal faces of 60 subjects under 19 different illuminations from the CMU PIE data set [20]. The 19 illumination states are selected so that they are roughly uniformly spread along the horizon. The reference state s = 0 corresponds to frontal lighting (e.g., the leftmost image in Fig. 5(c)). The subject subspace dimension is set to q = 4 (for ideal Lambertian surfaces it would be three). Of the 60 subjects, 50 are used for training and the rest for testing. For ROS, the state subspace dimension is chosen as r = 10. All images are manually labeled with 66 landmarks and warped to a canonical shape representation using triangulation.

Table 3. Test errors on the subspace for illumination. For each error measure, the linear regression errors are shown with kernel regression errors in parentheses. For all measures, smaller is better.

    Error Measure          IMI              DMS              ROS              DSA
    Principal Angle        1.5592 (1.5613)  1.2797 (1.0424)  1.2185 (1.4468)  1.2795 (N/A)
    Squared Cosine (d_1)   0.9563 (0.9920)  0.9476 (0.9127)  0.9083 (0.8692)  0.8898 (N/A)
    Subspace Dist. (d_2)   2.2248 (2.2239)  2.3370 (1.4549)  1.4392 (1.4111)  1.3882 (N/A)

Figure 5. (a), (b) Subject subspaces for illumination variation for two subjects (A and B): (top) the ground-truth subspaces, (bottom) the subject subspaces predicted by ROS. (c) Synthesized images from the ROS-learned subspace.

Table 3 shows the test errors, which indicate the goodness of alignment with respect to the ground-truth subspace in terms of the three error measures described in Sec. 4.1. ROS and DSA generally outperform the straightforward regression approaches IMI and DMS, while being less sensitive to the choice of regression method. DMS also performs well, especially when kernel regression is adopted; this may be because the underlying relationship between the reference image and the subspace is highly nonlinear. Fig. 5 depicts the subspace learned by ROS for some test subjects, as well as images generated from the predicted subspace. The mean and basis of the predicted subspace appear very similar to those of the ground-truth subspace, and the synthesized images sampled from the learned subspace look quite realistic.

5. Concluding Remarks

This paper has addressed the novel problem of predicting a subspace from one sample, and we have illustrated its benefits in predicting person-specific pose and illumination subspaces as well as in face tracking. Casting the problem as subspace regression, we proposed four approaches and observed that ROS and DSA in particular are promising and perform well with a significantly reduced number of parameters. DSA optimizes the most direct objective, since its error function directly measures alignment with the target subspace; its drawbacks are that the optimization procedure is complex and prone to local optima. Developing efficient optimization methods is left as future work.

References

[1] T. W. Anderson. Estimating linear restrictions on regression coefficients for multivariate normal distributions. Annals of Mathematical Statistics, 12:327–351, 1951.
[2] T. W. Anderson. An Introduction to Multivariate Statistical Analysis. 2nd ed., Wiley, New York, 1984.
[3] H. Bischof, H. Wildenauer, and A. Leonardis. Illumination insensitive recognition using eigenspaces. Computer Vision and Image Understanding, 95(1):86–104, 2004.
[4] M. J. Black and A. D. Jepson. Eigentracking: Robust matching and tracking of objects using view-based representation. International Journal of Computer Vision, 26(1):63–84, 1998.
[5] T. F. Cootes, G. J. Edwards, and C. J. Taylor. Active appearance models. In ECCV, 1998.
[6] F. de la Torre and M. J. Black. Dynamic coupled component analysis. In CVPR, 2001.
[7] F. de la Torre and M. J. Black. A framework for robust subspace learning. IJCV, 54:117–142, 2003.
[8] K. I. Diamantaras. Principal Component Neural Networks: Theory and Applications. John Wiley & Sons, 1996.
[9] K. R. Gabriel and S. Zamir. Lower rank approximation of matrices by least squares with any choice of weights. Technometrics, 21:489–498, 1979.
[10] H. Hotelling. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24, 1933.
[11] H. Hotelling. Relations between two sets of variates. Biometrika, 28:321–377, 1936.
[12] M. Isard and A. Blake. Contour tracking by stochastic propagation of conditional density. In ECCV, 1996.
[13] I. T. Jolliffe. Principal Component Analysis. Springer-Verlag, New York, 1986.
[14] G. Kimeldorf and G. Wahba. Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications, 33(1):82–95, 1971.
[15] E. Oja. A simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15:267–273, 1982.
[16] K. Pearson. On lines and planes of closest fit to systems of points in space. The London, Edinburgh and Dublin Philosophical Magazine and Journal, 6:559–572, 1901.
[17] D. Ross, J. Lim, R.-S. Lin, and M.-H. Yang. Incremental learning for robust visual tracking. IJCV, 77(1–3):125–141, 2007.
[18] S. Roweis. EM algorithms for PCA and SPCA. In NIPS, pages 626–632, 1997.
[19] L. Scharf. The SVD and reduced rank signal processing. In SVD and Signal Processing, II. Elsevier, 1991.
[20] T. Sim, S. Baker, and M. Bsat. The CMU pose, illumination, and expression (PIE) database. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(12):1615–1618, 2003.
[21] L. Sirovich and M. Kirby. Low-dimensional procedure for the characterization of human faces. Journal of the Optical Society of America A, 4(3):519–524, 1987.
[22] P. Stoica and M. Viberg. Maximum likelihood parameter and rank estimation in reduced-rank multivariate linear regressions. IEEE Transactions on Signal Processing, 44(12):3069–3078, 1996.
[23] M. Turk and A. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1):71–86, 1991.
[24] L. Wang, X. Wang, and J. Feng. Intrapersonal subspace analysis with application to adaptive Bayesian face recognition. Pattern Recognition, 38:617–621, 2005.
