
NONLINEAR LOW-DIMENSIONAL REGRESSION USING AUXILIARY COORDINATES

Weiran Wang and Miguel A. Carreira-Perpiñán

EECS, University of California, Merced

1 Abstract

When doing regression with inputs and outputs that are high-dimensional, it often makes sense to reduce the dimensionality of the inputs before mapping to the outputs. We propose a method where both the dimensionality reduction and the regression mapping can be nonlinear and are estimated jointly. Our key idea is to define an objective function where the low-dimensional coordinates are free parameters, in addition to the dimensionality reduction and the regression mapping. This has the effect of decoupling many groups of parameters from each other, affording a more effective optimization and allowing the use of a good initialization from other methods.

Work funded by NSF CAREER award IIS–0754089.

2 Low-dimensional regression using auxiliary coordinates

Given a training set X ∈ R^{Dx×N} and Y ∈ R^{Dy×N}, instead of directly optimizing

E1(F, g) = ∑_{n=1}^N ‖yn − g(F(xn))‖² + λg R(g) + λF R(F)

with λF, λg ≥ 0 for the dimension reduction mapping F and the regression mapping g, we let the low-dimensional coordinates Z = (z1, . . . , zN) ∈ R^{Dz×N} be independent, auxiliary parameters to be optimized over, and unfold the squared error into two terms that decouple given Z:

E2(F, g, Z) = ∑_{n=1}^N ‖yn − g(zn)‖² + λg R(g) + ∑_{n=1}^N ‖zn − F(xn)‖² + λF R(F).

Now, every squared error involves only a shallow mapping, compared to the deeper nesting in the function g ◦ F that leads to ill-conditioning. We apply the following alternating optimization procedure to solve the problem.

Given X ∈ R^{Dx×N}, Y ∈ R^{Dy×N}, and an initialization Z ∈ R^{Dz×N}
repeat
  1. Optimize over g:  min_g ∑_{n=1}^N ‖yn − g(zn)‖² + λg R(g)
  2. Optimize over F:  min_F ∑_{n=1}^N ‖zn − F(xn)‖² + λF R(F)
  3. Optimize over Z:  min_Z ∑_{n=1}^N ‖yn − g(zn)‖² + ∑_{n=1}^N ‖zn − F(xn)‖²
until stop
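
As a concrete illustration, here is a minimal Python sketch of the alternating scheme above. It assumes hypothetical regressor objects F and g with fit/predict methods (e.g. the linear or RBF mappings of section 3), a user-supplied optimize_Z routine implementing step 3, and arrays with one sample per column; it is a sketch under those assumptions, not the authors' implementation.

import numpy as np

def aux_coord_regression(X, Y, Z, F, g, optimize_Z, max_iter=100, tol=1e-5):
    # X: (Dx, N) inputs, Y: (Dy, N) outputs, Z: (Dz, N) initial coordinates.
    prev_err = np.inf
    for it in range(max_iter):
        g.fit(Z, Y)                    # step 1: fit g to map Z to Y
        F.fit(X, Z)                    # step 2: fit F to map X to Z
        Z = optimize_Z(X, Y, Z, F, g)  # step 3: update the auxiliary coordinates
        # Track E2 without the regularization terms (handled inside fit).
        err = (np.sum((Y - g.predict(Z)) ** 2) +
               np.sum((Z - F.predict(X)) ** 2))
        if prev_err - err < tol * prev_err:   # "until stop": progress has stalled
            break
        prev_err = err
    return F, g, Z

Each pass costs the sum of the three steps detailed in sections 3 and 4, all linear in N.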

3 Optimization over F and g

We use shallower functions, linear or RBFs, for F and g.

• Linear g: a direct regression acting on the lower input dimension Dz; it reduces to a least-squares problem.

• RBF g: g(z) = WΦ(z) with M ≤ N Gaussian RBFs φm(z) = exp(−‖z − µm‖²/(2σ²)), where R(g) = ‖W‖² is a quadratic regularizer on the weights.

– Centers µm are chosen by k-means on Z (once every few iterations, initialized at the previous centers).

– Weights W have a unique solution given by a linear system (see the sketch after this list).

– Time complexity: O(NM(M + Dz)), linear in the training set size.

– Space complexity: O(M(Dy + Dz)).
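
The following Python sketch illustrates the g-step for the RBF case (the F-step is analogous, with X and Z swapping roles as inputs and targets). Arrays are (Dz, N) and (Dy, N); M, sigma, and lam are hypothetical hyper-parameter values, and k-means is borrowed from SciPy. It is an illustration of the linear system for W under these assumptions, not the authors' code.

import numpy as np
from scipy.cluster.vq import kmeans2

def rbf_features(Z, centers, sigma):
    # Gaussian RBFs: phi_m(z) = exp(-||z - mu_m||^2 / (2 sigma^2)).
    sqdist = ((Z.T[None, :, :] - centers[:, None, :]) ** 2).sum(-1)  # (M, N)
    return np.exp(-sqdist / (2.0 * sigma ** 2))

def fit_rbf_map(Z, Y, M, sigma, lam):
    # Choose M centers by k-means on the current coordinates Z (Dz, N).
    centers, _ = kmeans2(Z.T, M, minit='points')                     # (M, Dz)
    Phi = rbf_features(Z, centers, sigma)                            # (M, N)
    # Ridge-regularized weights: (Phi Phi' + lam I) W' = Phi Y'.
    W = np.linalg.solve(Phi @ Phi.T + lam * np.eye(M), Phi @ Y.T).T  # (Dy, M)
    return W, centers

def predict_rbf_map(Z, W, centers, sigma):
    return W @ rbf_features(Z, centers, sigma)                       # (Dy, N)

Forming Φ costs O(NMDz) and ΦΦᵀ costs O(NM²), consistent with the O(NM(M + Dz)) time complexity quoted above.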

4 Optimization over Z

• For fixed g and F, the optimization of the objective function decouples over each zn ∈ R^{Dz}.

• We have N independent nonlinear minimizations, each over Dz parameters, of the form

  min_{z∈R^{Dz}} E(z) = ‖y − g(z)‖² + ‖z − F(x)‖².

• If g is linear, then z can be solved in closed form by solving a linear system of size Dz.

• If g is nonlinear, we use a Gauss-Newton method with line search (see the sketch after this list).

• Cost over all of Z: O(N Dz² Dy), linear in the training set size.

• The distribution of the coordinates Z changes dramatically in the first few iterations, while the error decreases quickly, but after that Z changes little.
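
To make the Z-step concrete, here is a sketch of the update for a single point zn. The first routine assumes a linear g(z) = Az + b; the second assumes hypothetical callables g(z) and Jg(z) returning g's value and its Dy×Dz Jacobian. Fx stands for F(xn), which stays fixed during this step. This is a sketch under those assumptions, not the authors' implementation.

import numpy as np

def update_z_linear(y, Fx, A, b):
    # Closed form when g(z) = A z + b: solve (A'A + I) z = A'(y - b) + F(x),
    # a linear system of size Dz.
    Dz = A.shape[1]
    return np.linalg.solve(A.T @ A + np.eye(Dz), A.T @ (y - b) + Fx)

def update_z_gauss_newton(y, Fx, z, g, Jg, steps=10):
    # Gauss-Newton with backtracking line search on
    # E(z) = ||y - g(z)||^2 + ||z - F(x)||^2.
    E = lambda v: np.sum((y - g(v)) ** 2) + np.sum((v - Fx) ** 2)
    for _ in range(steps):
        G = Jg(z)                              # (Dy, Dz) Jacobian of g at z
        grad = -G.T @ (y - g(z)) + (z - Fx)    # J'r, i.e. half the gradient of E
        H = G.T @ G + np.eye(len(z))           # Gauss-Newton approximation J'J
        p = -np.linalg.solve(H, grad)          # Gauss-Newton direction
        alpha = 1.0
        while E(z + alpha * p) > E(z) and alpha > 1e-8:
            alpha *= 0.5                       # backtracking line search
        z = z + alpha * p
    return z

The dominant per-point cost is forming GᵀG, which is O(Dz² Dy), giving the O(N Dz² Dy) total quoted above.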

5 Initialization and Validation

Initialization for Z

• Unsupervised dimensionality reduction on X only.

• Supervised dimension reduction: Reduced Rank Regression, Kernel Sliced Inverse Regression, Kernel Dimension Reduction, etc.

• Spectral methods run on (X, Y) jointly.

Validation of hyper-parameters

• Parameters of the functions F and g (number of RBFs, Gaussian kernel width).

• Regularization coefficients λg and λF.

• Dimensionality Dz of z.

These can be determined by evaluating the performance of g ◦ F on a validation set (a sketch follows).
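
A minimal sketch of such validation for one hyper-parameter (here Dz; the others can be treated the same way). train_aux_coord is a hypothetical stand-in for the alternating optimization of section 2 that returns the fitted F and g; it is illustrative only.

import numpy as np

def select_Dz(X_tr, Y_tr, X_val, Y_val, candidate_Dz, train_aux_coord):
    best_err, best_Dz = np.inf, None
    for Dz in candidate_Dz:
        F, g = train_aux_coord(X_tr, Y_tr, Dz)     # train with this Dz
        Y_hat = g.predict(F.predict(X_val))        # composite mapping g(F(x))
        err = np.sum((Y_val - Y_hat) ** 2)         # validation squared error
        if err < best_err:
            best_err, best_Dz = err, Dz
    return best_Dz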

6 Advantages of low-dimensional regression

• We jointly optimize over both mappings F and g, unlike one-shot methods.

• Our optimization is much more efficient than using a deep network with nested mappings (a pretty good model, pretty fast).

• The low-dimensional regressor has fewer parameters when Dz or the number of RBFs is small.

• The smooth functions F and g impose regularization on the regressor and may result in better generalization performance.

7 Experimental evaluation

• We use g ◦ F as our regression function for testing, which is the natural "out-of-sample" extension of the above optimization.

• Criteria: test error and the quality of the dimension reduction.

• We use early stopping for training; it usually happens within 100 iterations.

Rotated MNIST digits '7'

Input: images of the digit 7, each in 60 different rotated versions.
Output: a skeleton version of the digit.

[Figure: sample outputs of different algorithms on test images: ground truth, lin G, RBFs G, GP G, RBFs F + lin g, and RBFs F + RBFs g.]

Method                                      SSE
direct linear                               51 710
direct RBFs (1200, 10^−2)                   32 495
direct RBFs (2000, 10^−2)                   29 525
Gaussian process                            29 208
KPCA (3) + RBFs (2000, 1)                   49 782
KSIR (60, 26) + RBFs (20, 10^−5)            39 421
F RBFs (2000, 10^−2) + g linear (10^−3)     29 612
F RBFs (1200, 10^−2) + g RBFs (75, 10^−2)   27 346

Sum of squared errors (test set), with optimal parameters coded as RBFs (M, λ), KPCA (Gaussian kernel width), KSIR (number of slices, Gaussian kernel width). Number of iterations for our method: 16 (linear g), 7 (RBFs g).

[Figure: 2-dimensional embeddings (z1 vs. z2) obtained by different algorithms: KPCA, KSIR (60 slices), Isomap on (X, Y), and F RBFs + g linear.]

Serpentine robot forward kinematics

The forward kinematics mapping goes from 12 to 24 dimensions through 4 dimensions.

[Figure: diagram of the serpentine robot (links, end-effector, camera; variables x, y, z, ρ, θ1, θ2, φ).]

Method                                      RMSE
direct regression, linear (0)               2.2827
direct regression, RBF (2000, 10^−6)        0.3356
direct regression, Gaussian process         0.7082
KPCA (2.5) + RBF (400, 10^−10)              3.7455
KSIR (400, 100) + RBF (1000, 10^−8)         3.5533
F RBF (2000, 10^−6) + g RBF (100, 10^−9)    0.1006

Test error obtained by different algorithms.

[Figure: ground-truth coordinates (t1, t2) and (t3, t4) vs. our coordinates (z1, z2) and (z3, z4).] Correspondence between our 4-dimensional auxiliary coordinates Z and the ideal ones that generate the data.

[Figure: RMSE vs. dimension Dz.] Validation of Dz by our algorithm.

[Figure: root mean squared regression error vs. run time (seconds) for the nested objective and for auxiliary coordinates, on training and test sets, marking the switch from auxiliary coordinates to nested.] Comparison of the run time of our approach and of optimizing the nested objective function.