
NONLINEAR LOW-DIMENSIONAL REGRESSION

USING AUXILIARY COORDINATES.
Weiran Wang and Miguel Á. Carreira-Perpiñán.

EECS, University of California, Merced.

1 Abstract

When doing regression with inputs and outputs that are high-dimensional, it often makes sense to reduce the dimensionality of the inputs before mapping to the outputs. We propose a method where both the dimensionality reduction and the regression mapping can be nonlinear and are estimated jointly. Our key idea is to define an objective function where the low-dimensional coordinates are free parameters, in addition to the dimensionality reduction and the regression mapping. This has the effect of decoupling many groups of parameters from each other, affording a more effective optimization and allowing a good initialization from other methods.

Work funded by NSF CAREER award IIS–0754089.

2 Low-dimensional regression using auxiliary coordinates

Given a training set $\mathbf{X} \in \mathbb{R}^{D_x \times N}$ and $\mathbf{Y} \in \mathbb{R}^{D_y \times N}$, instead of directly optimizing

$$E_1(\mathbf{F}, \mathbf{g}) = \sum_{n=1}^{N} \|\mathbf{y}_n - \mathbf{g}(\mathbf{F}(\mathbf{x}_n))\|^2 + \lambda_g R(\mathbf{g}) + \lambda_F R(\mathbf{F})$$

with $\lambda_F, \lambda_g \ge 0$ for the dimension-reduction mapping $\mathbf{F}$ and the regression mapping $\mathbf{g}$, we let the low-dimensional coordinates $\mathbf{Z} = (\mathbf{z}_1, \dots, \mathbf{z}_N) \in \mathbb{R}^{D_z \times N}$ be independent, auxiliary parameters to be optimized over, and unfold the squared error into two terms that decouple given $\mathbf{Z}$:

$$E_2(\mathbf{F}, \mathbf{g}, \mathbf{Z}) = \sum_{n=1}^{N} \|\mathbf{y}_n - \mathbf{g}(\mathbf{z}_n)\|^2 + \lambda_g R(\mathbf{g}) + \sum_{n=1}^{N} \|\mathbf{z}_n - \mathbf{F}(\mathbf{x}_n)\|^2 + \lambda_F R(\mathbf{F}).$$

Now every squared error involves only a shallow mapping, compared to the deeper nesting in the function $\mathbf{g} \circ \mathbf{F}$, which leads to ill-conditioning. We apply the following alternating optimization procedure to solve the problem.

Given $\mathbf{X} \in \mathbb{R}^{D_x \times N}$, $\mathbf{Y} \in \mathbb{R}^{D_y \times N}$, and an initialization $\mathbf{Z} \in \mathbb{R}^{D_z \times N}$
repeat
  1. Optimize over $\mathbf{g}$: $\min_{\mathbf{g}} \sum_{n=1}^{N} \|\mathbf{y}_n - \mathbf{g}(\mathbf{z}_n)\|^2 + \lambda_g R_g(\mathbf{g})$
  2. Optimize over $\mathbf{F}$: $\min_{\mathbf{F}} \sum_{n=1}^{N} \|\mathbf{z}_n - \mathbf{F}(\mathbf{x}_n)\|^2 + \lambda_F R_F(\mathbf{F})$
  3. Optimize over $\mathbf{Z}$: $\min_{\mathbf{Z}} \sum_{n=1}^{N} \|\mathbf{y}_n - \mathbf{g}(\mathbf{z}_n)\|^2 + \sum_{n=1}^{N} \|\mathbf{z}_n - \mathbf{F}(\mathbf{x}_n)\|^2$
until stop
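To make the alternating loop concrete, here is a minimal NumPy sketch assuming both F and g are linear maps with quadratic (ridge) regularizers; the function and variable names are hypothetical, not the authors' code. Steps 1 and 2 are ridge regressions, and step 3 has the closed-form per-point solution $(\mathbf{W}_g^\top \mathbf{W}_g + \mathbf{I})\,\mathbf{z}_n = \mathbf{W}_g^\top \mathbf{y}_n + \mathbf{W}_F \mathbf{x}_n$.

```python
import numpy as np

def fit_linear_auxcoord(X, Y, Z0, lam_F=1e-3, lam_g=1e-3, n_iters=50):
    """Alternating optimization of E2 with linear F(x) = WF @ x and g(z) = Wg @ z.

    X: (Dx, N) inputs, Y: (Dy, N) outputs, Z0: (Dz, N) initial coordinates.
    """
    Z = Z0.copy()
    Dz, N = Z.shape
    Dx = X.shape[0]
    for _ in range(n_iters):
        # Step 1: ridge regression of Y on Z gives the regression weights Wg.
        Wg = Y @ Z.T @ np.linalg.inv(Z @ Z.T + lam_g * np.eye(Dz))
        # Step 2: ridge regression of Z on X gives the dimension-reduction weights WF.
        WF = Z @ X.T @ np.linalg.inv(X @ X.T + lam_F * np.eye(Dx))
        # Step 3: closed-form update of all coordinates z_n at once:
        # (Wg^T Wg + I) z_n = Wg^T y_n + WF x_n.
        Z = np.linalg.solve(Wg.T @ Wg + np.eye(Dz), Wg.T @ Y + WF @ X)
    return WF, Wg, Z
```

With RBF mappings, steps 1 and 2 become the fits described in section 3 and step 3 the per-point updates of section 4.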

3 Optimization over F and g

We use shallower functions, linear or RBFs, for F and g.

• Linear g: a direct regression that acts on the lower input dimension $D_z$ and reduces to a least-squares problem.

• RBF g: $\mathbf{g}(\mathbf{z}) = \mathbf{W}\boldsymbol{\Phi}(\mathbf{z})$ with $M \le N$ Gaussian RBFs $\phi_m(\mathbf{z}) = \exp(-\|\mathbf{z} - \boldsymbol{\mu}_m\|^2 / (2\sigma^2))$, where $R(\mathbf{g}) = \|\mathbf{W}\|^2$ is a quadratic regularizer on the weights.

– Centers $\boldsymbol{\mu}_m$ are chosen by k-means on Z (run once every few iterations, initialized at the previous centers).

– Weights W have a unique solution given by a linear system (a minimal sketch is given after this list).

– Time complexity: $O(NM(M + D_z))$, linear in the training set size.

– Space complexity: $O(M(D_y + D_z))$.
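For fixed centers and width, the linear system for the RBF weights is a ridge-regularized least-squares problem. A minimal sketch under those assumptions (hypothetical names, not the authors' code):

```python
import numpy as np

def fit_rbf_weights(Z, Y, centers, sigma, lam):
    """Solve for W in g(z) = W Phi(z) with fixed Gaussian RBF centers and width.

    Z: (Dz, N) coordinates, Y: (Dy, N) targets, centers: (Dz, M), sigma: RBF width,
    lam: weight of the quadratic regularizer ||W||^2.
    """
    # Phi[m, n] = exp(-||z_n - mu_m||^2 / (2 sigma^2)), shape (M, N).
    sqdist = ((Z[:, None, :] - centers[:, :, None]) ** 2).sum(axis=0)
    Phi = np.exp(-sqdist / (2.0 * sigma ** 2))
    M = Phi.shape[0]
    # Regularized least squares: W = Y Phi^T (Phi Phi^T + lam I)^{-1}, an M x M system.
    W = Y @ Phi.T @ np.linalg.inv(Phi @ Phi.T + lam * np.eye(M))
    return W
```

The centers themselves would come from k-means on Z, as described above.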

4 Optimization over Z

• For fixed g and F, the optimization of the objective function decouples over each $\mathbf{z}_n \in \mathbb{R}^{D_z}$.

• We have N independent nonlinear minimizations, each over $D_z$ parameters, of the form
$$\min_{\mathbf{z} \in \mathbb{R}^{D_z}} E(\mathbf{z}) = \|\mathbf{y} - \mathbf{g}(\mathbf{z})\|^2 + \|\mathbf{z} - \mathbf{F}(\mathbf{x})\|^2.$$

• If g is linear, then z has a closed-form solution obtained by solving a linear system of size $D_z$.

• If g is nonlinear, we use the Gauss–Newton method with a line search (a minimal sketch follows this list).

• Cost over all Z: $O(N D_z^2 D_y)$, linear in the training set size.

• The distribution of the coordinates Z changes dramatically in the first few iterations, while the error decreases quickly; after that, Z changes little.
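A minimal sketch of the per-point Gauss–Newton update for nonlinear g (hypothetical helper names; g and its Jacobian are assumed to be supplied, e.g. from the RBF expansion above). Stacking the residuals $\mathbf{y} - \mathbf{g}(\mathbf{z})$ and $\mathbf{z} - \mathbf{F}(\mathbf{x})$ gives the step $(\mathbf{J}_g^\top \mathbf{J}_g + \mathbf{I})\,\delta = \mathbf{J}_g^\top(\mathbf{y} - \mathbf{g}(\mathbf{z})) - (\mathbf{z} - \mathbf{F}(\mathbf{x}))$:

```python
import numpy as np

def z_step_gauss_newton(y, Fx, z0, g, g_jac, n_steps=10, tol=1e-8):
    """Minimize ||y - g(z)||^2 + ||z - Fx||^2 over z (Gauss-Newton with backtracking)."""
    E = lambda z: np.sum((y - g(z)) ** 2) + np.sum((z - Fx) ** 2)
    z = z0.copy()
    for _ in range(n_steps):
        r_y = y - g(z)                      # regression residual, shape (Dy,)
        J = g_jac(z)                        # Jacobian of g at z, shape (Dy, Dz)
        # Gauss-Newton direction: (J^T J + I) dz = J^T r_y - (z - Fx).
        dz = np.linalg.solve(J.T @ J + np.eye(z.size), J.T @ r_y - (z - Fx))
        # Backtracking line search on the per-point objective.
        step, E0 = 1.0, E(z)
        while E(z + step * dz) > E0 and step > 1e-6:
            step *= 0.5
        z = z + step * dz
        if np.linalg.norm(step * dz) < tol:
            break
    return z
```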

5 Initialization and Validation

Initialization for Z

•Unsupervised dimensionality reduction on X only.

• Supervised dimension reduction: Reduced Rank Regression, Kernel Sliced Inverse Regression, Kernel Dimension Reduction, etc.

•Spectral methods run on (X,Y) jointly.

Validation of hyper-parameters

• Parameters of the functions F and g (#RBFs, Gaussian kernel width).

•Regularization coefficients λg and λF.

•Dimensionality of z.

They can all be determined by evaluating the performance of g ◦ F on a validation set.
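As one possible realization, a plain grid search over these hyperparameters could look like the sketch below; fit_fn is a hypothetical training routine returning the composed predictor g ◦ F, not the authors' code.

```python
import numpy as np
from itertools import product

def select_hyperparams(Xtr, Ytr, Xval, Yval, grid, fit_fn):
    """Pick hyperparameters by validation error of the composed regressor g(F(x)).

    grid: dict mapping parameter names (e.g. 'n_rbf', 'lam_F', 'lam_g', 'Dz')
    to lists of candidate values. fit_fn(Xtr, Ytr, **params) -> predict_fn.
    """
    best_params, best_err = None, np.inf
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        predict = fit_fn(Xtr, Ytr, **params)
        err = np.sum((Yval - predict(Xval)) ** 2)   # validation SSE
        if err < best_err:
            best_params, best_err = params, err
    return best_params, best_err
```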

6 Advantages of low-dimensional regression

•We jointly optimize over both mappings F and g, unlike one-shot methods.

• Our optimization is much more efficient than using a deep network with nested mappings (a pretty good model, pretty fast).

•The low-dimensional regressor has fewer parameters when Dz is small or #RBFs is small.

• The smooth functions F and g impose regularization on the regressor and may result in better generalization performance.

7 Experimental evaluation

• We use g ◦ F as our regression function at test time, which is the natural "out-of-sample" extension of the above optimization.

•Criteria: test error, and the quality of dimension reduction.

• We use early stopping for training, which usually happens within 100 iterations.

Rotated MNIST digits ‘7’

Input: images of the digit 7, each with 60 different rotated versions.
Output: a skeleton version of the digit.

[Figure: sample outputs of different algorithms, showing the test digit, the ground truth, and the outputs of direct linear, direct RBFs, GP, F RBFs + g linear, and F RBFs + g RBFs.]

Method                                        SSE
direct linear                                 51 710
direct RBFs (1200, 10^-2)                     32 495
direct RBFs (2000, 10^-2)                     29 525
Gaussian process                              29 208
KPCA (3) + RBFs (2000, 1)                     49 782
KSIR (60, 26) + RBFs (20, 10^-5)              39 421
F RBFs (2000, 10^-2) + g linear (10^-3)       29 612
F RBFs (1200, 10^-2) + g RBFs (75, 10^-2)     27 346

Sum of squared errors (test set), with optimal parameters coded as RBFs (M, λ), KPCA (Gaussian kernel width), KSIR (number of slices, Gaussian kernel width). Number of iterations for our method: 16 (linear g), 7 (RBFs g).

[Figure: 2-dimensional embeddings obtained by different algorithms: KPCA, KSIR (60 slices), Isomap on (X, Y), and F RBFs + g linear.]

Serpentine robot forward kinematics

The forward-kinematics mapping goes from 12 to 24 dimensions through 4 dimensions.

[Figure: serpentine robot with links, a camera, and an end-effector; variables x, y, z, ρ, θ1, θ2, φ.]

Method                                        RMSE
direct regression, linear (0)                 2.2827
direct regression, RBF (2000, 10^-6)          0.3356
direct regression, Gaussian process           0.7082
KPCA (2.5) + RBF (400, 10^-10)                3.7455
KSIR (400, 100) + RBF (1000, 10^-8)           3.5533
F RBF (2000, 10^-6) + g RBF (100, 10^-9)      0.1006

Test error obtained by different algorithms.

[Figure: ground-truth coordinates (t1, t2) and (t3, t4) vs. our auxiliary coordinates (z1, z2) and (z3, z4). Correspondence between our 4-dimensional auxiliary coordinates Z and the ideal coordinates that generate the data.]

[Figure, left: validation of Dz by our algorithm (RMSE vs. dimension Dz = 1, ..., 12). Right: comparison of the run time of our approach ("auxcoord") and of optimizing the nested objective function ("nested"): training and test root mean squared regression error vs. run time (seconds), with the point marked where we switch from the auxiliary coordinates to the nested objective.]