Sparse methods for machine learning: Theory and algorithms — Francis Bach, Willow project, INRIA - Ecole Normale Supérieure. NIPS Tutorial - December 2009. Special thanks to R. Jenatton, J. Mairal, G. Obozinski


Page 1:

Sparse methods for machine learning
Theory and algorithms

Francis Bach
Willow project, INRIA - Ecole Normale Supérieure

NIPS Tutorial - December 2009

Special thanks to R. Jenatton, J. Mairal, G. Obozinski

Page 2: Supervised learning and regularization

• Data: xᵢ ∈ X, yᵢ ∈ Y, i = 1, ..., n

• Minimize with respect to function f : X → Y:

  ∑_{i=1}^n ℓ(yᵢ, f(xᵢ)) + (λ/2) ‖f‖²

  Error on data + Regularization
  Loss & function space? Norm?

• Two theoretical/algorithmic issues:
  1. Loss
  2. Function space / norm

Page 3: Regularizations

• Main goal: avoid overfitting

• Two main lines of work:

  1. Euclidean and Hilbertian norms (i.e., ℓ2-norms)
     – Possibility of non linear predictors
     – Non-parametric supervised learning and kernel methods
     – Well developed theory and algorithms (see, e.g., Wahba, 1990; Schölkopf and Smola, 2001; Shawe-Taylor and Cristianini, 2004)

Page 4: Regularizations

• Main goal: avoid overfitting

• Two main lines of work:

  1. Euclidean and Hilbertian norms (i.e., ℓ2-norms)
     – Possibility of non linear predictors
     – Non-parametric supervised learning and kernel methods
     – Well developed theory and algorithms (see, e.g., Wahba, 1990; Schölkopf and Smola, 2001; Shawe-Taylor and Cristianini, 2004)

  2. Sparsity-inducing norms
     – Usually restricted to linear predictors on vectors f(x) = w⊤x
     – Main example: ℓ1-norm ‖w‖₁ = ∑_{j=1}^p |wⱼ|
     – Perform model selection as well as regularization
     – Theory and algorithms "in the making"

Page 5: ℓ2 vs. ℓ1 - Gaussian hare vs. Laplacian tortoise

• First-order methods (Fu, 1998; Wu and Lange, 2008)

• Homotopy methods (Markowitz, 1956; Efron et al., 2004)

Page 6: Lasso - Two main recent theoretical results

1. Support recovery condition (Zhao and Yu, 2006; Wainwright, 2009; Zou, 2006; Yuan and Lin, 2007): the Lasso is sign-consistent if and only if

   ‖Q_{Jᶜ J} Q_{JJ}⁻¹ sign(w_J)‖∞ ≤ 1,

   where Q = lim_{n→+∞} (1/n) ∑_{i=1}^n xᵢ xᵢ⊤ ∈ ℝ^{p×p}

Page 7: Lasso - Two main recent theoretical results

1. Support recovery condition (Zhao and Yu, 2006; Wainwright, 2009; Zou, 2006; Yuan and Lin, 2007): the Lasso is sign-consistent if and only if

   ‖Q_{Jᶜ J} Q_{JJ}⁻¹ sign(w_J)‖∞ ≤ 1,

   where Q = lim_{n→+∞} (1/n) ∑_{i=1}^n xᵢ xᵢ⊤ ∈ ℝ^{p×p}

2. Exponentially many irrelevant variables (Zhao and Yu, 2006; Wainwright, 2009; Bickel et al., 2009; Lounici, 2008; Meinshausen and Yu, 2008): under appropriate assumptions, consistency is possible as long as

   log p = O(n)

Page 8: Going beyond the Lasso

• ℓ1-norm for linear feature selection in high dimensions
  – Lasso usually not applicable directly

• Non-linearities

• Dealing with exponentially many features

• Sparse learning on matrices

Page 9: Going beyond the Lasso - Non-linearity - Multiple kernel learning

• Multiple kernel learning
  – Learn sparse combination of matrices k(x, x′) = ∑_{j=1}^p ηⱼ kⱼ(x, x′)
  – Mixing positive aspects of ℓ1-norms and ℓ2-norms

• Equivalent to group Lasso
  – p multi-dimensional features Φⱼ(x), where kⱼ(x, x′) = Φⱼ(x)⊤ Φⱼ(x′)
  – learn predictor ∑_{j=1}^p wⱼ⊤ Φⱼ(x)
  – Penalization by ∑_{j=1}^p ‖wⱼ‖₂

Page 10: Going beyond the Lasso - Structured set of features

• Dealing with exponentially many features
  – Can we design efficient algorithms for the case log p ≈ n?
  – Use structure to reduce the number of allowed patterns of zeros
  – Recursivity, hierarchies and factorization

• Prior information on sparsity patterns
  – Grouped variables with overlapping groups

Page 11: Going beyond the Lasso - Sparse methods on matrices

• Learning problems on matrices
  – Multi-task learning
  – Multi-category classification
  – Matrix completion
  – Image denoising
  – NMF, topic models, etc.

• Matrix factorization
  – Two types of sparsity (low-rank or dictionary learning)

Page 12: Sparse methods for machine learning - Outline

• Introduction - Overview

• Sparse linear estimation with the ℓ1-norm
  – Convex optimization and algorithms
  – Theoretical results

• Structured sparse methods on vectors
  – Groups of features / Multiple kernel learning
  – Extensions (hierarchical or overlapping groups)

• Sparse methods on matrices
  – Multi-task learning
  – Matrix factorization (low-rank, sparse PCA, dictionary learning)

Page 13: Why ℓ1-norm constraints lead to sparsity?

• Example: minimize quadratic function Q(w) subject to ‖w‖₁ ≤ T
  – coupled soft thresholding

• Geometric interpretation
  – NB: penalizing is "equivalent" to constraining

[Figure: level sets of Q(w) intersecting the ℓ1-ball (corners on the axes) vs. the ℓ2-ball, in the (w₁, w₂) plane.]

Page 14: ℓ1-norm regularization (linear setting)

• Data: covariates xᵢ ∈ ℝᵖ, responses yᵢ ∈ Y, i = 1, ..., n

• Minimize with respect to loadings/weights w ∈ ℝᵖ:

  J(w) = ∑_{i=1}^n ℓ(yᵢ, w⊤xᵢ) + λ‖w‖₁

  Error on data + Regularization

• Including a constant term b? Penalizing or constraining?

• square loss ⇒ basis pursuit in signal processing (Chen et al., 2001), Lasso in statistics/machine learning (Tibshirani, 1996)

Page 15: A review of nonsmooth convex analysis and optimization

• Analysis: optimality conditions

• Optimization: algorithms
  – First-order methods

• Books: Boyd and Vandenberghe (2004), Bonnans et al. (2003), Bertsekas (1995), Borwein and Lewis (2000)

Page 16: Optimality conditions for smooth optimization - Zero gradient

• Example: ℓ2-regularization: min_{w ∈ ℝᵖ} ∑_{i=1}^n ℓ(yᵢ, w⊤xᵢ) + (λ/2)‖w‖₂²

  – Gradient ∇J(w) = ∑_{i=1}^n ℓ′(yᵢ, w⊤xᵢ) xᵢ + λw, where ℓ′(yᵢ, w⊤xᵢ) is the partial derivative of the loss w.r.t. the second variable
  – If square loss, ∑_{i=1}^n ℓ(yᵢ, w⊤xᵢ) = ½‖y − Xw‖₂²
    ∗ gradient = −X⊤(y − Xw) + λw
    ∗ normal equations ⇒ w = (X⊤X + λI)⁻¹ X⊤y
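Not part of the slides: a minimal NumPy sketch of the normal equations above, w = (X⊤X + λI)⁻¹X⊤y; the data and names are purely illustrative.

    import numpy as np

    def ridge_normal_equations(X, y, lam):
        # Solve (X^T X + lam I) w = X^T y, the normal equations of l2-regularized least squares
        n, p = X.shape
        return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

    # illustrative data
    rng = np.random.default_rng(0)
    X = rng.standard_normal((50, 10))
    y = X @ rng.standard_normal(10) + 0.1 * rng.standard_normal(50)
    w = ridge_normal_equations(X, y, lam=1.0)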

Page 17: Optimality conditions for smooth optimization - Zero gradient

• Example: ℓ2-regularization: min_{w ∈ ℝᵖ} ∑_{i=1}^n ℓ(yᵢ, w⊤xᵢ) + (λ/2)‖w‖₂²

  – Gradient ∇J(w) = ∑_{i=1}^n ℓ′(yᵢ, w⊤xᵢ) xᵢ + λw, where ℓ′(yᵢ, w⊤xᵢ) is the partial derivative of the loss w.r.t. the second variable
  – If square loss, ∑_{i=1}^n ℓ(yᵢ, w⊤xᵢ) = ½‖y − Xw‖₂²
    ∗ gradient = −X⊤(y − Xw) + λw
    ∗ normal equations ⇒ w = (X⊤X + λI)⁻¹ X⊤y

• ℓ1-norm is non differentiable!
  – cannot compute the gradient of the absolute value

  ⇒ Directional derivatives (or subgradient)

Page 18: Directional derivatives - convex functions on ℝᵖ

• Directional derivative in the direction Δ at w:

  ∇J(w, Δ) = lim_{ε→0⁺} [J(w + εΔ) − J(w)] / ε

• Always exist when J is convex and continuous

• Main idea: in nonsmooth situations, may need to look at all directions Δ and not simply p independent ones

• Proposition: J is differentiable at w, if and only if Δ ↦ ∇J(w, Δ) is linear. Then, ∇J(w, Δ) = ∇J(w)⊤Δ

Page 19: Optimality conditions for convex functions

• Unconstrained minimization (function defined on ℝᵖ):
  – Proposition: w is optimal if and only if ∀Δ ∈ ℝᵖ, ∇J(w, Δ) ≥ 0
  – Go up locally in all directions

• Reduces to zero-gradient for smooth problems

• Constrained minimization (function defined on a convex set K)
  – restrict Δ to directions so that w + εΔ ∈ K for small ε

Page 20: Directional derivatives for ℓ1-norm regularization

• Function J(w) = ∑_{i=1}^n ℓ(yᵢ, w⊤xᵢ) + λ‖w‖₁ = L(w) + λ‖w‖₁

• ℓ1-norm: ‖w + εΔ‖₁ − ‖w‖₁ = ∑_{j, wⱼ≠0} {|wⱼ + εΔⱼ| − |wⱼ|} + ∑_{j, wⱼ=0} |εΔⱼ|

• Thus,

  ∇J(w, Δ) = ∇L(w)⊤Δ + λ ∑_{j, wⱼ≠0} sign(wⱼ) Δⱼ + λ ∑_{j, wⱼ=0} |Δⱼ|
           = ∑_{j, wⱼ≠0} [∇L(w)ⱼ + λ sign(wⱼ)] Δⱼ + ∑_{j, wⱼ=0} [∇L(w)ⱼ Δⱼ + λ|Δⱼ|]

• Separability of optimality conditions

Page 21: Optimality conditions for ℓ1-norm regularization

• General loss: w optimal if and only if for all j ∈ {1, ..., p},

  sign(wⱼ) ≠ 0 ⇒ ∇L(w)ⱼ + λ sign(wⱼ) = 0
  sign(wⱼ) = 0 ⇒ |∇L(w)ⱼ| ≤ λ

• Square loss: w optimal if and only if for all j ∈ {1, ..., p},

  sign(wⱼ) ≠ 0 ⇒ −Xⱼ⊤(y − Xw) + λ sign(wⱼ) = 0
  sign(wⱼ) = 0 ⇒ |Xⱼ⊤(y − Xw)| ≤ λ

  – For J ⊂ {1, ..., p}, X_J ∈ ℝ^{n×|J|} = X(:, J) denotes the columns of X indexed by J, i.e., variables indexed by J
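These separable conditions are easy to verify numerically. A hedged sketch (not from the slides) that measures how much a candidate vector w violates them for the square loss:

    import numpy as np

    def lasso_optimality_violation(X, y, w, lam, tol=1e-10):
        # Largest violation of the optimality conditions of
        # J(w) = 0.5 * ||y - Xw||^2 + lam * ||w||_1 (square loss); 0 means w is optimal.
        grad = -X.T @ (y - X @ w)                 # gradient of the smooth part L(w)
        active = np.abs(w) > tol
        viol = 0.0
        if active.any():                          # active coordinates: grad_j + lam*sign(w_j) = 0
            viol = max(viol, float(np.abs(grad[active] + lam * np.sign(w[active])).max()))
        if (~active).any():                       # inactive coordinates: |grad_j| <= lam
            viol = max(viol, float(np.maximum(np.abs(grad[~active]) - lam, 0.0).max()))
        return viol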

Page 22: First order methods for convex optimization on ℝᵖ - Smooth optimization

• Gradient descent: w_{t+1} = w_t − α_t ∇J(w_t)
  – with line search: search for a decent (not necessarily best) α_t
  – fixed diminishing step size, e.g., α_t = a(t + b)⁻¹

• Convergence of f(w_t) to f* = min_{w ∈ ℝᵖ} f(w) (Nesterov, 2003)
  – f convex and M-Lipschitz: f(w_t) − f* = O(M/√t)
  – and, differentiable with L-Lipschitz gradient: f(w_t) − f* = O(L/t)
  – and, f μ-strongly convex: f(w_t) − f* = O(L exp(−4t μ/L))

• μ/L = condition number of the optimization problem

• Coordinate descent: similar properties

• NB: "optimal scheme" f(w_t) − f* = O(L min{exp(−4t√(μ/L)), t⁻²})

Page 23: First-order methods for convex optimization on ℝᵖ - Nonsmooth optimization

• First-order methods for nondifferentiable objective
  – Subgradient descent: w_{t+1} = w_t − α_t g_t, with g_t ∈ ∂J(w_t), i.e., such that ∀Δ, g_t⊤Δ ≤ ∇J(w_t, Δ)
    ∗ with exact line search: not always convergent (see counter-example)
    ∗ diminishing step size, e.g., α_t = a(t + b)⁻¹: convergent
  – Coordinate descent: not always convergent (show counter-example)

• Convergence rates (f convex and M-Lipschitz): f(w_t) − f* = O(M/√t)
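As an illustration (not from the slides), subgradient descent with the diminishing step size α_t = a(t + b)⁻¹ applied to the Lasso objective; taking sign(0) = 0 gives a valid subgradient at zero coordinates.

    import numpy as np

    def lasso_subgradient_descent(X, y, lam, n_iter=5000, a=1.0, b=10.0):
        # w_{t+1} = w_t - alpha_t * g_t applied to 0.5*||y - Xw||^2 + lam*||w||_1
        n, p = X.shape
        w = np.zeros(p)
        for t in range(n_iter):
            g = -X.T @ (y - X @ w) + lam * np.sign(w)   # one subgradient of the objective
            w -= a / (t + b) * g                          # diminishing step size
        return w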

Page 24: Counter-example - Coordinate descent for nonsmooth objectives

[Figure: level sets (labelled 1 to 5) of a nonsmooth convex function in the (w₁, w₂) plane on which coordinate descent gets stuck at a non-optimal point.]

Page 25: Counter-example (Bertsekas, 1995) - Steepest descent for nonsmooth objectives

• q(x₁, x₂) = −5(9x₁² + 16x₂²)^{1/2}  if x₁ > |x₂|
            = −(9x₁ + 16|x₂|)^{1/2}   if x₁ ≤ |x₂|

• Steepest descent starting from any x such that x₁ > |x₂| > (9/16)² |x₁|

[Figure: steepest-descent iterates oscillating on the contour plot of q over [−5, 5] × [−5, 5].]

Page 26: Sparsity-inducing norms - Using the structure of the problem

• Problems of the form

  min_{w ∈ ℝᵖ} L(w) + λ‖w‖   or   min_{‖w‖ ≤ μ} L(w)

  – L smooth
  – Orthogonal projections on the ball or the dual ball can be performed in semi-closed form, e.g., ℓ1-norm (Maculan and Galdino de Paula, 1989) or mixed ℓ1-ℓ2 (see, e.g., van den Berg et al., 2009)

• May use similar techniques than smooth optimization
  – Projected gradient descent
  – Proximal methods (Beck and Teboulle, 2009)
  – Dual ascent methods

• Similar convergence rates

Page 27: Sparsity-inducing norms - Using the structure of the problem

• Similar convergence rates
  – depends on the condition number of the loss

Page 28: Cheap (and not dirty) algorithms for all losses

• Coordinate descent (Fu, 1998; Wu and Lange, 2008; Friedman et al., 2007)
  – convergent here under reasonable assumptions! (Bertsekas, 1995)
  – separability of optimality conditions
  – equivalent to iterative thresholding
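A minimal sketch (not from the slides) of cyclic coordinate descent for the Lasso with square loss; each coordinate update is exactly a scalar soft-thresholding step, which is why it is "equivalent to iterative thresholding".

    import numpy as np

    def soft_threshold(t, a):
        return np.sign(t) * np.maximum(np.abs(t) - a, 0.0)

    def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
        # Cyclic coordinate descent for 0.5*||y - Xw||^2 + lam*||w||_1.
        # Assumes no all-zero column in X.
        n, p = X.shape
        w = np.zeros(p)
        residual = y.copy()                       # r = y - Xw, kept up to date
        col_sq = (X ** 2).sum(axis=0)
        for _ in range(n_sweeps):
            for j in range(p):
                residual += X[:, j] * w[j]        # remove coordinate j from the fit
                w[j] = soft_threshold(X[:, j] @ residual, lam) / col_sq[j]
                residual -= X[:, j] * w[j]
        return w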

Page 29: Cheap (and not dirty) algorithms for all losses

• Coordinate descent (Fu, 1998; Wu and Lange, 2008; Friedman et al., 2007)
  – convergent here under reasonable assumptions! (Bertsekas, 1995)
  – separability of optimality conditions
  – equivalent to iterative thresholding

• "η-trick" (Micchelli and Pontil, 2006; Rakotomamonjy et al., 2008; Jenatton et al., 2009b)
  – Notice that ∑_{j=1}^p |wⱼ| = min_{η > 0} ½ ∑_{j=1}^p { wⱼ²/ηⱼ + ηⱼ }
  – Alternating minimization with respect to η (closed-form) and w (weighted squared ℓ2-norm regularized problem)

Page 30: Cheap (and not dirty) algorithms for all losses

• Coordinate descent (Fu, 1998; Wu and Lange, 2008; Friedman et al., 2007)
  – convergent here under reasonable assumptions! (Bertsekas, 1995)
  – separability of optimality conditions
  – equivalent to iterative thresholding

• "η-trick" (Micchelli and Pontil, 2006; Rakotomamonjy et al., 2008; Jenatton et al., 2009b)
  – Notice that ∑_{j=1}^p |wⱼ| = min_{η > 0} ½ ∑_{j=1}^p { wⱼ²/ηⱼ + ηⱼ }
  – Alternating minimization with respect to η (closed-form) and w (weighted squared ℓ2-norm regularized problem)

• Dedicated algorithms that use sparsity (active sets and homotopy methods)
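A sketch of the η-trick for the square loss (not from the slides); the small eps guarding against division by zero is an implementation choice, not part of the identity.

    import numpy as np

    def lasso_eta_trick(X, y, lam, n_iter=200, eps=1e-8):
        # Alternating minimization based on
        # ||w||_1 = min_{eta > 0} 0.5 * sum_j (w_j^2 / eta_j + eta_j).
        n, p = X.shape
        eta = np.ones(p)
        XtX, Xty = X.T @ X, X.T @ y
        for _ in range(n_iter):
            # w-step: min_w 0.5*||y - Xw||^2 + (lam/2) * sum_j w_j^2 / eta_j  (weighted ridge)
            w = np.linalg.solve(XtX + lam * np.diag(1.0 / (eta + eps)), Xty)
            # eta-step: closed form
            eta = np.abs(w)
        return w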

Page 31: Special case of square loss

• Quadratic programming formulation: minimize

  ½‖y − Xw‖² + λ ∑_{j=1}^p (wⱼ⁺ + wⱼ⁻)   such that   w = w⁺ − w⁻,  w⁺ ≥ 0,  w⁻ ≥ 0

Page 32: Special case of square loss

• Quadratic programming formulation: minimize

  ½‖y − Xw‖² + λ ∑_{j=1}^p (wⱼ⁺ + wⱼ⁻)   such that   w = w⁺ − w⁻,  w⁺ ≥ 0,  w⁻ ≥ 0

  – generic toolboxes ⇒ very slow

• Main property: if the sign pattern s ∈ {−1, 0, 1}ᵖ of the solution is known, the solution can be obtained in closed form
  – Lasso equivalent to minimizing ½‖y − X_J w_J‖² + λ s_J⊤ w_J w.r.t. w_J, where J = {j, sⱼ ≠ 0}
  – Closed form solution w_J = (X_J⊤ X_J)⁻¹ (X_J⊤ y − λ s_J)

• Algorithm: "Guess" s and check optimality conditions
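The "guess s and check" idea in code (illustrative, not from the slides): given a candidate sign pattern, compute the closed-form candidate and test the two optimality conditions detailed on the next slide.

    import numpy as np

    def lasso_solution_for_sign(X, y, s, lam):
        # Candidate w_J = (X_J^T X_J)^{-1} (X_J^T y - lam * s_J), w_{J^c} = 0,
        # for a sign pattern s in {-1, 0, 1}^p, plus an optimality check.
        J = np.flatnonzero(s != 0)
        Jc = np.flatnonzero(s == 0)
        w = np.zeros(X.shape[1])
        if J.size:
            XJ = X[:, J]
            w[J] = np.linalg.solve(XJ.T @ XJ, XJ.T @ y - lam * s[J])
            r = y - XJ @ w[J]
        else:
            r = y.copy()
        sign_ok = bool(np.all(np.sign(w[J]) == s[J]))                 # active variables
        inactive_ok = bool(np.all(np.abs(X[:, Jc].T @ r) <= lam)) if Jc.size else True
        return w, sign_ok and inactive_ok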

Page 33: Optimality conditions for the sign vectors (Lasso)

• For s ∈ {−1, 0, 1}ᵖ sign vector, J = {j, sⱼ ≠ 0} the nonzero pattern

• potential closed form solution: w_J = (X_J⊤ X_J)⁻¹ (X_J⊤ y − λ s_J) and w_{Jᶜ} = 0

• s is optimal if and only if
  – active variables: sign(w_J) = s_J
  – inactive variables: ‖X_{Jᶜ}⊤ (y − X_J w_J)‖∞ ≤ λ

• Active set algorithms (Lee et al., 2007; Roth and Fischer, 2008)
  – Construct J iteratively by adding variables to the active set
  – Only requires to invert small linear systems

Page 34: Homotopy methods for the square loss (Markowitz, 1956; Osborne et al., 2000; Efron et al., 2004)

• Goal: Get all solutions for all possible values of the regularization parameter λ

• Same idea as before: if the sign vector is known,

  w*_J(λ) = (X_J⊤ X_J)⁻¹ (X_J⊤ y − λ s_J)

  valid, as long as,
  – sign condition: sign(w*_J(λ)) = s_J
  – subgradient condition: ‖X_{Jᶜ}⊤ (X_J w*_J(λ) − y)‖∞ ≤ λ
  – this defines an interval on λ: the path is thus piecewise affine

• Simply need to find breakpoints and directions

Page 35: Piecewise linear paths

[Figure: Lasso regularization path - the weights (between −0.6 and 0.6) as piecewise linear functions of the regularization parameter (between 0 and 0.6).]

Page 36: Algorithms for ℓ1-norms (square loss): Gaussian hare vs. Laplacian tortoise

• Coordinate descent: O(pn) per iteration for ℓ1 and ℓ2

• "Exact" algorithms: O(kpn) for ℓ1 vs. O(p²n) for ℓ2

Page 37: Additional methods - Softwares

• Many contributions in signal processing, optimization, machine learning
  – Proximal methods (Nesterov, 2007; Beck and Teboulle, 2009)
  – Extensions to stochastic setting (Bottou and Bousquet, 2008)

• Extensions to other sparsity-inducing norms

• Softwares
  – Many available codes
  – SPAMS (SPArse Modeling Software) - note difference with SpAM (Ravikumar et al., 2008)
    http://www.di.ens.fr/willow/SPAMS/

Page 38: Sparse methods for machine learning - Outline

• Introduction - Overview

• Sparse linear estimation with the ℓ1-norm
  – Convex optimization and algorithms
  – Theoretical results

• Structured sparse methods on vectors
  – Groups of features / Multiple kernel learning
  – Extensions (hierarchical or overlapping groups)

• Sparse methods on matrices
  – Multi-task learning
  – Matrix factorization (low-rank, sparse PCA, dictionary learning)

Page 39: Theoretical results - Square loss

• Main assumption: data generated from a certain sparse w

• Three main problems:
  1. Regular consistency: convergence of estimator ŵ to w, i.e., ‖ŵ − w‖ tends to zero when n tends to ∞
  2. Model selection consistency: convergence of the sparsity pattern of ŵ to the pattern of w
  3. Efficiency: convergence of predictions with ŵ to the predictions with w, i.e., (1/n)‖Xŵ − Xw‖₂² tends to zero

• Main results:
  – Condition for model consistency (support recovery)
  – High-dimensional inference

Page 40: Model selection consistency (Lasso)

• Assume w sparse and denote J = {j, wⱼ ≠ 0} the nonzero pattern

• Support recovery condition (Zhao and Yu, 2006; Wainwright, 2009; Zou, 2006; Yuan and Lin, 2007): the Lasso is sign-consistent if and only if

  ‖Q_{Jᶜ J} Q_{JJ}⁻¹ sign(w_J)‖∞ ≤ 1

  where Q = lim_{n→+∞} (1/n) ∑_{i=1}^n xᵢ xᵢ⊤ ∈ ℝ^{p×p} (covariance matrix)

Page 41: Model selection consistency (Lasso)

• Assume w sparse and denote J = {j, wⱼ ≠ 0} the nonzero pattern

• Support recovery condition (Zhao and Yu, 2006; Wainwright, 2009; Zou, 2006; Yuan and Lin, 2007): the Lasso is sign-consistent if and only if

  ‖Q_{Jᶜ J} Q_{JJ}⁻¹ sign(w_J)‖∞ ≤ 1

  where Q = lim_{n→+∞} (1/n) ∑_{i=1}^n xᵢ xᵢ⊤ ∈ ℝ^{p×p} (covariance matrix)

• Condition depends on w and J (may be relaxed)
  – may be relaxed by maximizing out sign(w) or J

• Valid in low and high-dimensional settings

• Requires lower-bound on magnitude of nonzero wⱼ

Page 42: Model selection consistency (Lasso)

• Assume w sparse and denote J = {j, wⱼ ≠ 0} the nonzero pattern

• Support recovery condition (Zhao and Yu, 2006; Wainwright, 2009; Zou, 2006; Yuan and Lin, 2007): the Lasso is sign-consistent if and only if

  ‖Q_{Jᶜ J} Q_{JJ}⁻¹ sign(w_J)‖∞ ≤ 1

  where Q = lim_{n→+∞} (1/n) ∑_{i=1}^n xᵢ xᵢ⊤ ∈ ℝ^{p×p} (covariance matrix)

• The Lasso is usually not model-consistent
  – Selects more variables than necessary (see, e.g., Lv and Fan, 2009)
  – Fixing the Lasso: adaptive Lasso (Zou, 2006), relaxed Lasso (Meinshausen, 2008), thresholding (Lounici, 2008), Bolasso (Bach, 2008a), stability selection (Meinshausen and Bühlmann, 2008), Wasserman and Roeder (2009)

Page 43: Adaptive Lasso and concave penalization

• Adaptive Lasso (Zou, 2006; Huang et al., 2008)
  – Weighted ℓ1-norm: min_{w ∈ ℝᵖ} L(w) + λ ∑_{j=1}^p |wⱼ| / |ŵⱼ|^α
  – ŵ estimator obtained from ℓ2 or ℓ1 regularization

• Reformulation in terms of concave penalization

  min_{w ∈ ℝᵖ} L(w) + ∑_{j=1}^p g(|wⱼ|)

  – Example: g(|wⱼ|) = |wⱼ|^{1/2} or log|wⱼ|. Closer to the ℓ0 penalty
  – Concave-convex procedure: replace g(|wⱼ|) by affine upper bound
  – Better sparsity-inducing properties (Fan and Li, 2001; Zou and Li, 2008; Zhang, 2008b)
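An illustrative adaptive-Lasso sketch (not from the slides). It uses the standard rescaling trick — a weighted ℓ1 penalty is a plain Lasso on rescaled columns — with scikit-learn's Lasso as the inner solver (an assumption; any Lasso solver works, and sklearn's alpha corresponds to λ/n because of its 1/(2n) loss scaling).

    import numpy as np
    from sklearn.linear_model import Lasso

    def adaptive_lasso(X, y, alpha, gamma=1.0, eps=1e-6):
        # Weighted l1 penalty sum_j |w_j| / |w_init_j|^gamma as a plain Lasso on rescaled columns.
        n, p = X.shape
        # pilot estimator from l2 (near-unregularized ridge); an illustrative choice
        w_init = np.linalg.solve(X.T @ X + eps * np.eye(p), X.T @ y)
        scale = np.abs(w_init) ** gamma + eps
        lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000)
        lasso.fit(X * scale, y)          # column j multiplied by scale_j
        return lasso.coef_ * scale       # map back to the original parameterization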

Page 44: Bolasso (Bach, 2008a)

• Property: for a specific choice of regularization parameter λ ≈ √n:
  – all variables in J are always selected with high probability
  – all other ones selected with probability in (0, 1)

• Use the bootstrap to simulate several replications
  – Intersecting supports of variables
  – Final estimation of w on the entire dataset

[Figure: supports J₁, ..., J₅ estimated on Bootstrap replications 1-5 and their intersection J.]

Page 45: Model selection consistency of the Lasso/Bolasso

• probabilities of selection of each variable vs. regularization parameter

[Figure: selection probabilities (variable index vs. −log(μ)) for LASSO (top) and BOLASSO (bottom), with the support recovery condition satisfied (left) and not satisfied (right).]

Page 46: High-dimensional inference - Going beyond exact support recovery

• Theoretical results usually assume that non-zero wⱼ are large enough, i.e., |wⱼ| ≥ σ√(log p / n)

• May include too many variables but still predict well

• Oracle inequalities
  – Predict as well as the estimator obtained with the knowledge of J
  – Assume i.i.d. Gaussian noise with variance σ²
  – We have: (1/n) E‖X ŵ_oracle − Xw‖₂² = σ²|J| / n

Page 47: High-dimensional inference - Variable selection without computational limits

• Approaches based on penalized criteria (close to BIC)

  min_{J ⊂ {1,...,p}} { min_{w_J ∈ ℝ^{|J|}} ‖y − X_J w_J‖₂² } + C σ² |J| (1 + log(p/|J|))

• Oracle inequality if data generated by w with k non-zeros (Massart, 2003; Bunea et al., 2007):

  (1/n)‖Xŵ − Xw‖₂² ≤ C (σ² k / n) (1 + log(p/k))

• Gaussian noise - No assumptions regarding correlations

• Scaling between dimensions: (k log p)/n small

• Optimal in the minimax sense

Page 48: High-dimensional inference - Variable selection with orthogonal design

• Orthogonal design: assume that (1/n) X⊤X = I

• Lasso is equivalent to soft-thresholding (1/n) X⊤Y ∈ ℝᵖ
  – Solution: ŵⱼ = soft-thresholding of (1/n) Xⱼ⊤y = wⱼ + (1/n) Xⱼ⊤ε at λ/n

  min_{w ∈ ℝ} ½w² − wt + a|w|   has solution   w = (|t| − a)₊ sign(t)

[Figure: the soft-thresholding function t ↦ (|t| − a)₊ sign(t), flat on [−a, a].]
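The scalar problem and its soft-thresholding solution, in code (illustrative, not from the slides):

    import numpy as np

    def soft_threshold(t, a):
        # argmin_w 0.5*w^2 - w*t + a*|w|  =  sign(t) * max(|t| - a, 0)
        return np.sign(t) * np.maximum(np.abs(t) - a, 0.0)

    def lasso_orthogonal_design(X, y, lam):
        # Lasso solution when (1/n) X^T X = I: coordinate-wise soft-thresholding
        # of (1/n) X^T y at level lam / n.
        n = X.shape[0]
        return soft_threshold(X.T @ y / n, lam / n)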

Page 49: High-dimensional inference - Variable selection with orthogonal design

• Orthogonal design: assume that (1/n) X⊤X = I

• Lasso is equivalent to soft-thresholding (1/n) X⊤Y ∈ ℝᵖ
  – Solution: ŵⱼ = soft-thresholding of (1/n) Xⱼ⊤y = wⱼ + (1/n) Xⱼ⊤ε at λ/n
  – Take λ = Aσ√(n log p)

• Where does the log p = O(n) come from?
  – Expectation of the maximum of p Gaussian variables ≈ √(log p)
  – Union-bound:

    P(∃ j ∈ Jᶜ, |Xⱼ⊤ε| > λ) ≤ ∑_{j ∈ Jᶜ} P(|Xⱼ⊤ε| > λ)
                             ≤ |Jᶜ| e^{−λ²/(2nσ²)} ≤ p e^{−(A²/2) log p} = p^{1 − A²/2}
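A quick Monte Carlo check (not from the slides) of the fact driving log p = O(n): the expected maximum of p standard Gaussian variables grows like √(2 log p) (the slide drops constants).

    import numpy as np

    rng = np.random.default_rng(0)
    for p in [10, 100, 1000, 10000]:
        # empirical mean of the maximum of p standard Gaussians vs. sqrt(2*log p)
        empirical = rng.standard_normal((2000, p)).max(axis=1).mean()
        print(p, round(empirical, 3), round(np.sqrt(2 * np.log(p)), 3))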

Page 50: High-dimensional inference (Lasso)

• Main result: we only need k log p = O(n)
  – if w is sufficiently sparse
  – and input variables are not too correlated

• Precise conditions on covariance matrix Q = (1/n) X⊤X:
  – Mutual incoherence (Lounici, 2008)
  – Restricted eigenvalue conditions (Bickel et al., 2009)
  – Sparse eigenvalues (Meinshausen and Yu, 2008)
  – Null space property (Donoho and Tanner, 2005)

• Links with signal processing and compressed sensing (Candès and Wakin, 2008)

• Assume that Q has unit diagonal

Page 51: Mutual incoherence (uniform low correlations)

• Theorem (Lounici, 2008):
  – yᵢ = w⊤xᵢ + εᵢ, ε i.i.d. normal with mean zero and variance σ²
  – Q = X⊤X/n with unit diagonal and cross-terms less than 1/(14k)
  – if ‖w‖₀ ≤ k, and A² > 8, then, with λ = Aσ√(n log p),

    P( ‖ŵ − w‖∞ ≤ 5Aσ (log p / n)^{1/2} ) ≥ 1 − p^{1 − A²/8}

• Model consistency by thresholding if min_{j, wⱼ ≠ 0} |wⱼ| > Cσ√(log p / n)

• Mutual incoherence condition depends strongly on k

• Improved result by averaging over sparsity patterns (Candès and Plan, 2009b)

Page 52: Restricted eigenvalue conditions

• Theorem (Bickel et al., 2009):
  – assume κ(k)² = min_{|J| ≤ k} min_{Δ, ‖Δ_{Jᶜ}‖₁ ≤ ‖Δ_J‖₁} (Δ⊤QΔ) / ‖Δ_J‖₂² > 0
  – assume λ = Aσ√(n log p) and A² > 8
  – then, with probability 1 − p^{1 − A²/8}, we have

    estimation error:  ‖ŵ − w‖₁ ≤ (16A / κ²(k)) σ k √(log p / n)
    prediction error:  (1/n)‖Xŵ − Xw‖₂² ≤ (16A² / κ²(k)) σ² (k/n) log p

• Condition imposes a potentially hidden scaling between (n, p, k)

• Condition always satisfied for Q = I

Page 53: Checking sufficient conditions

• Most of the conditions are not computable in polynomial time

• Random matrices
  – Sample X ∈ ℝ^{n×p} from the Gaussian ensemble
  – Conditions satisfied with high probability for certain (n, p, k)
  – Example from Wainwright (2009): n ≥ Ck log p

• Checking with convex optimization
  – Relax conditions to convex optimization problems (d'Aspremont et al., 2008; Juditsky and Nemirovski, 2008; d'Aspremont and El Ghaoui, 2008)
  – Example: sparse eigenvalues min_{|J| ≤ k} λ_min(Q_JJ)
  – Open problem: verifiable assumptions still lead to weaker results

Page 54: Sparse methods - Common extensions

• Removing bias of the estimator
  – Keep the active set, and perform unregularized restricted estimation (Candès and Tao, 2007)
  – Better theoretical bounds
  – Potential problems of robustness

• Elastic net (Zou and Hastie, 2005)
  – Replace λ‖w‖₁ by λ‖w‖₁ + ε‖w‖₂²
  – Make the optimization strongly convex with unique solution
  – Better behavior with heavily correlated variables

Page 55: Relevance of theoretical results

• Most results only for the square loss
  – Extend to other losses (Van De Geer, 2008; Bach, 2009b)

• Most results only for ℓ1-regularization
  – May be extended to other norms (see, e.g., Huang and Zhang, 2009; Bach, 2008b)

• Condition on correlations
  – very restrictive, far from results for BIC penalty

• Non sparse generating vector
  – little work on robustness to lack of sparsity

• Estimation of regularization parameter
  – No satisfactory solution ⇒ open problem

Page 56: Alternative sparse methods - Greedy methods

• Forward selection

• Forward-backward selection

• Non-convex method
  – Harder to analyze
  – Simpler to implement
  – Problems of stability

• Positive theoretical results (Zhang, 2009, 2008a)
  – Similar sufficient conditions than for the Lasso

Page 57: Alternative sparse methods - Bayesian methods

• Lasso: minimize ∑_{i=1}^n (yᵢ − w⊤xᵢ)² + λ‖w‖₁
  – Equivalent to MAP estimation with Gaussian likelihood and factorized Laplace prior p(w) ∝ ∏_{j=1}^p e^{−λ|wⱼ|} (Seeger, 2008)
  – However, posterior puts zero weight on exact zeros

• Heavy-tailed distributions as a proxy to sparsity
  – Student distributions (Caron and Doucet, 2008)
  – Generalized hyperbolic priors (Archambeau and Bach, 2008)
  – Instance of automatic relevance determination (Neal, 1996)

• Mixtures of "Diracs" and another absolutely continuous distributions, e.g., "spike and slab" (Ishwaran and Rao, 2005)

• Less theory than frequentist methods

Page 58: Comparing Lasso and other strategies for linear regression

• Compared methods to reach the least-square solution
  – Ridge regression: min_{w ∈ ℝᵖ} ½‖y − Xw‖₂² + (λ/2)‖w‖₂²
  – Lasso: min_{w ∈ ℝᵖ} ½‖y − Xw‖₂² + λ‖w‖₁
  – Forward greedy:
    ∗ Initialization with empty set
    ∗ Sequentially add the variable that best reduces the square loss

• Each method builds a path of solutions from 0 to ordinary least-squares solution

• Regularization parameters selected on the test set

Page 59: Simulation results

• i.i.d. Gaussian design matrix, k = 4, n = 64, p ∈ [2, 256], SNR = 1

• Note stability to non-sparsity and variability

[Figure: mean square error vs. log₂(p) for L1, L2, greedy and oracle - left: sparse generating vector; right: rotated (non sparse) generating vector.]

Page 60: Summary - ℓ1-norm regularization

• ℓ1-norm regularization leads to nonsmooth optimization problems
  – analysis through directional derivatives or subgradients
  – optimization may or may not take advantage of sparsity

• ℓ1-norm regularization allows high-dimensional inference

• Interesting problems for ℓ1-regularization
  – Stable variable selection
  – Weaker sufficient conditions (for weaker results)
  – Estimation of regularization parameter (all bounds depend on the unknown noise variance σ²)

Page 61: Extensions

• Sparse methods are not limited to the square loss
  – e.g., theoretical results for logistic loss (Van De Geer, 2008; Bach, 2009b)

• Sparse methods are not limited to supervised learning
  – Learning the structure of Gaussian graphical models (Meinshausen and Bühlmann, 2006; Banerjee et al., 2008)
  – Sparsity on matrices (last part of the tutorial)

• Sparse methods are not limited to variable selection in a linear model
  – See next part of the tutorial

Page 62: Questions?

Page 63: Sparse methods for machine learning - Outline

• Introduction - Overview

• Sparse linear estimation with the ℓ1-norm
  – Convex optimization and algorithms
  – Theoretical results

• Structured sparse methods on vectors
  – Groups of features / Multiple kernel learning
  – Extensions (hierarchical or overlapping groups)

• Sparse methods on matrices
  – Multi-task learning
  – Matrix factorization (low-rank, sparse PCA, dictionary learning)

Page 64: Penalization with grouped variables (Yuan and Lin, 2006)

• Assume that {1, ..., p} is partitioned into m groups G₁, ..., G_m

• Penalization by ∑_{i=1}^m ‖w_{Gᵢ}‖₂, often called ℓ1-ℓ2 norm

• Induces group sparsity
  – Some groups entirely set to zero
  – no zeros within groups

• In this tutorial:
  – Groups may have infinite size ⇒ MKL
  – Groups may overlap ⇒ structured sparsity
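Not on the slide, but a standard building block for group-Lasso algorithms: the proximal operator of the ℓ1-ℓ2 norm (block soft-thresholding), which zeroes whole groups at once. A hedged NumPy sketch with illustrative names:

    import numpy as np

    def group_soft_threshold(w, groups, a):
        # Proximal operator of a * sum_G ||w_G||_2 for non-overlapping groups:
        # each block is shrunk as w_G <- max(1 - a / ||w_G||_2, 0) * w_G,
        # so whole groups are set to zero (no zeros inside surviving groups).
        out = w.copy()
        for G in groups:                          # groups: list of index arrays (a partition)
            norm = np.linalg.norm(w[G])
            out[G] = 0.0 if norm <= a else (1.0 - a / norm) * w[G]
        return out

    # Example: partition of {0,...,5} into three groups of two variables
    groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
    w = np.array([0.1, -0.05, 2.0, 1.0, -0.3, 0.2])
    print(group_soft_threshold(w, groups, a=0.5))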

Page 65: Linear vs. non-linear methods

• All methods in this tutorial are linear in the parameters

• By replacing x by features Φ(x), they can be made non linear in the data

• Implicit vs. explicit features
  – ℓ1-norm: explicit features
  – ℓ2-norm: representer theorem allows to consider implicit features if their dot products can be computed easily (kernel methods)

Page 66: Kernel methods: regularization by ℓ2-norm

• Data: xᵢ ∈ X, yᵢ ∈ Y, i = 1, ..., n, with features Φ(x) ∈ F = ℝᵖ
  – Predictor f(x) = w⊤Φ(x) linear in the features

• Optimization problem: min_{w ∈ ℝᵖ} ∑_{i=1}^n ℓ(yᵢ, w⊤Φ(xᵢ)) + (λ/2)‖w‖₂²

Page 67: Kernel methods: regularization by ℓ2-norm

• Data: xᵢ ∈ X, yᵢ ∈ Y, i = 1, ..., n, with features Φ(x) ∈ F = ℝᵖ
  – Predictor f(x) = w⊤Φ(x) linear in the features

• Optimization problem: min_{w ∈ ℝᵖ} ∑_{i=1}^n ℓ(yᵢ, w⊤Φ(xᵢ)) + (λ/2)‖w‖₂²

• Representer theorem (Kimeldorf and Wahba, 1971): solution must be of the form w = ∑_{i=1}^n αᵢ Φ(xᵢ)
  – Equivalent to solving: min_{α ∈ ℝⁿ} ∑_{i=1}^n ℓ(yᵢ, (Kα)ᵢ) + (λ/2) α⊤Kα
  – Kernel matrix Kᵢⱼ = k(xᵢ, xⱼ) = Φ(xᵢ)⊤Φ(xⱼ)
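An illustrative square-loss instance of the reduced problem above (not from the slides): with ℓ(y, u) = ½(y − u)², the problem in α is solved by α = (K + λI)⁻¹ y.

    import numpy as np

    def kernel_ridge(K, y, lam):
        # min_alpha 0.5*||y - K alpha||^2 + (lam/2)*alpha^T K alpha
        # is solved by alpha = (K + lam I)^{-1} y; predictions on training points are K alpha.
        n = K.shape[0]
        return np.linalg.solve(K + lam * np.eye(n), y)

    # Example with a Gaussian kernel on random 1-D inputs (illustrative data)
    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, size=30)
    K = np.exp(-(x[:, None] - x[None, :]) ** 2)
    y = np.sin(3 * x) + 0.1 * rng.standard_normal(30)
    alpha = kernel_ridge(K, y, lam=0.1)
    y_fit = K @ alpha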

Page 68: Multiple kernel learning (MKL) (Lanckriet et al., 2004b; Bach et al., 2004a)

• Sparse methods are linear!

• Sparsity with non-linearities
  – replace f(x) = ∑_{j=1}^p wⱼ⊤xⱼ with x ∈ ℝᵖ and wⱼ ∈ ℝ
  – by f(x) = ∑_{j=1}^p wⱼ⊤Φⱼ(x) with x ∈ X, Φⱼ(x) ∈ Fⱼ and wⱼ ∈ Fⱼ

• Replace the ℓ1-norm ∑_{j=1}^p |wⱼ| by "block" ℓ1-norm ∑_{j=1}^p ‖wⱼ‖₂

• Remarks
  – Hilbert space extension of the group Lasso (Yuan and Lin, 2006)
  – Alternative sparsity-inducing norms (Ravikumar et al., 2008)

Page 69: Multiple kernel learning (MKL) (Lanckriet et al., 2004b; Bach et al., 2004a)

• Multiple feature maps / kernels on x ∈ X:
  – p "feature maps" Φⱼ : X ↦ Fⱼ, j = 1, ..., p.
  – Minimization with respect to w₁ ∈ F₁, ..., w_p ∈ F_p
  – Predictor: f(x) = w₁⊤Φ₁(x) + ··· + w_p⊤Φ_p(x)

  [Diagram: x mapped to Φ₁(x)⊤w₁, ..., Φⱼ(x)⊤wⱼ, ..., Φ_p(x)⊤w_p, summed into w₁⊤Φ₁(x) + ··· + w_p⊤Φ_p(x).]

  – Generalized additive models (Hastie and Tibshirani, 1990)

Page 70: Regularization for multiple features

  [Diagram: x mapped to Φ₁(x)⊤w₁, ..., Φ_p(x)⊤w_p, summed into w₁⊤Φ₁(x) + ··· + w_p⊤Φ_p(x).]

• Regularization by ∑_{j=1}^p ‖wⱼ‖₂² is equivalent to using K = ∑_{j=1}^p Kⱼ
  – Summing kernels is equivalent to concatenating feature spaces
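A small numerical illustration (not from the slides) that summing kernels corresponds to concatenating feature spaces; the feature maps here are random matrices for illustration only.

    import numpy as np

    rng = np.random.default_rng(0)
    Phi1 = rng.standard_normal((20, 3))    # explicit feature map 1 for 20 points
    Phi2 = rng.standard_normal((20, 5))    # explicit feature map 2
    K1, K2 = Phi1 @ Phi1.T, Phi2 @ Phi2.T
    Phi = np.concatenate([Phi1, Phi2], axis=1)   # concatenated feature space
    assert np.allclose(Phi @ Phi.T, K1 + K2)      # its kernel is K1 + K2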

Page 71: Regularization for multiple features

  [Diagram: x mapped to Φ₁(x)⊤w₁, ..., Φ_p(x)⊤w_p, summed into w₁⊤Φ₁(x) + ··· + w_p⊤Φ_p(x).]

• Regularization by ∑_{j=1}^p ‖wⱼ‖₂² is equivalent to using K = ∑_{j=1}^p Kⱼ

• Regularization by ∑_{j=1}^p ‖wⱼ‖₂ imposes sparsity at the group level

• Main questions when regularizing by block ℓ1-norm:
  1. Algorithms
  2. Analysis of sparsity inducing properties (Ravikumar et al., 2008; Bach, 2008b)
  3. Does it correspond to a specific combination of kernels?

Page 72: General kernel learning

• Proposition (Lanckriet et al., 2004, Bach et al., 2005, Micchelli and Pontil, 2005):

  G(K) = min_{w ∈ F} ∑_{i=1}^n ℓ(yᵢ, w⊤Φ(xᵢ)) + (λ/2)‖w‖₂²
       = max_{α ∈ ℝⁿ} −∑_{i=1}^n ℓᵢ*(λαᵢ) − (λ/2) α⊤Kα

  is a convex function of the kernel matrix K

• Theoretical learning bounds (Lanckriet et al., 2004, Srebro and Ben-David, 2006)
  – Less assumptions than sparsity-based bounds, but slower rates

Page 73: Equivalence with kernel learning (Bach et al., 2004a)

• Block ℓ1-norm problem:

  ∑_{i=1}^n ℓ(yᵢ, w₁⊤Φ₁(xᵢ) + ··· + w_p⊤Φ_p(xᵢ)) + (λ/2) (‖w₁‖₂ + ··· + ‖w_p‖₂)²

• Proposition: Block ℓ1-norm regularization is equivalent to minimizing with respect to η the optimal value G(∑_{j=1}^p ηⱼ Kⱼ)

• (sparse) weights η obtained from optimality conditions

• dual parameters α optimal for K = ∑_{j=1}^p ηⱼ Kⱼ

• Single optimization problem for learning both η and α

Page 74: Proof of equivalence

  min_{w₁,...,w_p} ∑_{i=1}^n ℓ(yᵢ, ∑_{j=1}^p wⱼ⊤Φⱼ(xᵢ)) + λ (∑_{j=1}^p ‖wⱼ‖₂)²

  = min_{w₁,...,w_p} min_{∑_j ηⱼ = 1, η ≥ 0} ∑_{i=1}^n ℓ(yᵢ, ∑_{j=1}^p wⱼ⊤Φⱼ(xᵢ)) + λ ∑_{j=1}^p ‖wⱼ‖₂² / ηⱼ

  = min_{∑_j ηⱼ = 1, η ≥ 0} min_{w̃₁,...,w̃_p} ∑_{i=1}^n ℓ(yᵢ, ∑_{j=1}^p ηⱼ^{1/2} w̃ⱼ⊤Φⱼ(xᵢ)) + λ ∑_{j=1}^p ‖w̃ⱼ‖₂²,   with w̃ⱼ = wⱼ ηⱼ^{−1/2}

  = min_{∑_j ηⱼ = 1, η ≥ 0} min_{w̃} ∑_{i=1}^n ℓ(yᵢ, w̃⊤Ψ_η(xᵢ)) + λ ‖w̃‖₂²

  with Ψ_η(x) = (η₁^{1/2} Φ₁(x), ..., η_p^{1/2} Φ_p(x))

• We have: Ψ_η(x)⊤Ψ_η(x′) = ∑_{j=1}^p ηⱼ kⱼ(x, x′) with ∑_{j=1}^p ηⱼ = 1 (and η ≥ 0)

Page 75: Algorithms for the group Lasso / MKL

• Group Lasso
  – Block coordinate descent (Yuan and Lin, 2006)
  – Active set method (Roth and Fischer, 2008; Obozinski et al., 2009)
  – Nesterov's accelerated method (Liu et al., 2009)

• MKL
  – Dual ascent, e.g., sequential minimal optimization (Bach et al., 2004a)
  – η-trick + cutting-planes (Sonnenburg et al., 2006)
  – η-trick + projected gradient descent (Rakotomamonjy et al., 2008)
  – Active set (Bach, 2008c)

Page 76: Applications of multiple kernel learning

• Selection of hyperparameters for kernel methods

• Fusion from heterogeneous data sources (Lanckriet et al., 2004a)

• Two strategies for kernel combinations:
  – Uniform combination ⇔ ℓ2-norm
  – Sparse combination ⇔ ℓ1-norm
  – MKL always leads to more interpretable models
  – MKL does not always lead to better predictive performance
    ∗ In particular, with few well-designed kernels
    ∗ Be careful with normalization of kernels (Bach et al., 2004b)

Page 77: Applications of multiple kernel learning

• Selection of hyperparameters for kernel methods

• Fusion from heterogeneous data sources (Lanckriet et al., 2004a)

• Two strategies for kernel combinations:
  – Uniform combination ⇔ ℓ2-norm
  – Sparse combination ⇔ ℓ1-norm
  – MKL always leads to more interpretable models
  – MKL does not always lead to better predictive performance
    ∗ In particular, with few well-designed kernels
    ∗ Be careful with normalization of kernels (Bach et al., 2004b)

• Sparse methods: new possibilities and new features

• See NIPS 2009 workshop "Understanding MKL methods"

Page 78: Sparse methods for machine learning - Outline

• Introduction - Overview

• Sparse linear estimation with the ℓ1-norm
  – Convex optimization and algorithms
  – Theoretical results

• Structured sparse methods on vectors
  – Groups of features / Multiple kernel learning
  – Extensions (hierarchical or overlapping groups)

• Sparse methods on matrices
  – Multi-task learning
  – Matrix factorization (low-rank, sparse PCA, dictionary learning)

Page 79: Lasso - Two main recent theoretical results

1. Support recovery condition

2. Exponentially many irrelevant variables: under appropriate assumptions, consistency is possible as long as log p = O(n)

Page 80: Lasso - Two main recent theoretical results

1. Support recovery condition

2. Exponentially many irrelevant variables: under appropriate assumptions, consistency is possible as long as log p = O(n)

• Question: is it possible to build a sparse algorithm that can learn from more than 10^80 features?

Page 81: Lasso - Two main recent theoretical results

1. Support recovery condition

2. Exponentially many irrelevant variables: under appropriate assumptions, consistency is possible as long as log p = O(n)

• Question: is it possible to build a sparse algorithm that can learn from more than 10^80 features?
  – Some type of recursivity/factorization is needed!

Page 82: Hierarchical kernel learning (Bach, 2008c)

• Many kernels can be decomposed as a sum of many "small" kernels indexed by a certain set V:

  k(x, x′) = ∑_{v ∈ V} k_v(x, x′)

• Example with x = (x₁, ..., x_q) ∈ ℝ^q (⇒ non linear variable selection)
  – Gaussian/ANOVA kernels: p = #(V) = 2^q

    ∏_{j=1}^q (1 + e^{−α(xⱼ − x′ⱼ)²}) = ∑_{J ⊂ {1,...,q}} ∏_{j ∈ J} e^{−α(xⱼ − x′ⱼ)²} = ∑_{J ⊂ {1,...,q}} e^{−α‖x_J − x′_J‖₂²}

  – NB: decomposition is related to Cosso (Lin and Zhang, 2006)

• Goal: learning sparse combination ∑_{v ∈ V} η_v k_v(x, x′)

• Universally consistent non-linear variable selection requires all subsets
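A small numerical check (not from the slides) of the Gaussian/ANOVA decomposition above for q = 4; it is just the expansion of the product ∏ⱼ(1 + aⱼ) over subsets.

    import numpy as np
    from itertools import combinations

    rng = np.random.default_rng(0)
    q, alpha = 4, 0.7
    x, xp = rng.standard_normal(q), rng.standard_normal(q)
    sq = (x - xp) ** 2
    prod_form = np.prod(1.0 + np.exp(-alpha * sq))
    # sum over all subsets J of {1,...,q} of exp(-alpha * ||x_J - x'_J||^2)
    sum_form = sum(np.exp(-alpha * sq[list(J)].sum())
                   for r in range(q + 1) for J in combinations(range(q), r))
    assert np.isclose(prod_form, sum_form)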

Page 83: Restricting the set of active kernels

• With flat structure
  – Consider block ℓ1-norm: ∑_{v ∈ V} d_v ‖w_v‖₂
  – cannot avoid being linear in p = #(V) = 2^q

• Using the structure of the small kernels
  1. for computational reasons
  2. to allow more irrelevant variables

Page 84: Restricting the set of active kernels

• V is endowed with a directed acyclic graph (DAG) structure: select a kernel only after all of its ancestors have been selected

• Gaussian kernels: V = power set of {1, ..., q} with inclusion DAG
  – Select a subset only after all its subsets have been selected

[Figure: inclusion DAG over the subsets of {1, 2, 3, 4}, from the singletons 1, 2, 3, 4 through the pairs 12, 13, ..., 34 and triples up to 1234.]

Page 85: DAG-adapted norm (Zhao & Yu, 2008)

• Graph-based structured regularization
  – D(v) is the set of descendants of v ∈ V:

    ∑_{v ∈ V} d_v ‖w_{D(v)}‖₂ = ∑_{v ∈ V} d_v ( ∑_{t ∈ D(v)} ‖w_t‖₂² )^{1/2}

• Main property: If v is selected, so are all its ancestors

[Figure: inclusion DAGs over the subsets of {1, 2, 3, 4}, with an allowed sparsity pattern highlighted.]

Page 86: DAG-adapted norm (Zhao & Yu, 2008)

• Graph-based structured regularization
  – D(v) is the set of descendants of v ∈ V:

    ∑_{v ∈ V} d_v ‖w_{D(v)}‖₂ = ∑_{v ∈ V} d_v ( ∑_{t ∈ D(v)} ‖w_t‖₂² )^{1/2}

• Main property: If v is selected, so are all its ancestors

• Hierarchical kernel learning (Bach, 2008c):
  – polynomial-time algorithm for this norm
  – necessary/sufficient conditions for consistent kernel selection
  – Scaling between p, q, n for consistency
  – Applications to variable selection or other kernels

Page 87: Scaling between p, n and other graph-related quantities

  n = number of observations
  p = number of vertices in the DAG
  deg(V) = maximum out degree in the DAG
  num(V) = number of connected components in the DAG

• Proposition (Bach, 2009a): Assume consistency condition satisfied, Gaussian noise and data generated from a sparse function, then the support is recovered with high-probability as soon as:

  log deg(V) + log num(V) = O(n)

Page 88: Theory and algorithms Sparse methods for machine learningfbach/nips2009tutorial/nips... · Sparse methods for machine learning Outline • Introduction - Overview • Sparse linear

Sca

ling

betw

een

p,n

and

oth

er

gra

ph-re

late

dquantitie

sn

=num

ber

ofob

servations

p=

num

ber

ofvertices

inth

eD

AG

deg

(V)

=m

aximum

out

degree

inth

eD

AG

num

(V)

=num

ber

ofcon

nected

compon

ents

inth

eD

AG

•Pro

positio

n(B

ach,2009a):

Assu

me

consisten

cycon

dition

satisfied

,

Gau

ssiannoise

and

data

generated

froma

sparse

function

,th

enth

e

support

isrecovered

with

high

-probab

ilityas

soon

as:

logdeg

(V)+

lognum

(V)

=O

(n)

•U

nstru

ctured

case:num

(V)=

p⇒

logp

=O

(n)

•Pow

erset

ofq

elemen

ts:deg

(V)=

q⇒

logq

=log

logp

=O

(n)

Page 89: Mean-square errors (regression)

dataset       | n    | p  | k    | #(V)   | L2        | greedy     | MKL       | HKL
abalone       | 4177 | 10 | pol4 | ≈10^7  | 44.2±1.3  | 43.9±1.4   | 44.5±1.1  | 43.3±1.0
abalone       | 4177 | 10 | rbf  | ≈10^10 | 43.0±0.9  | 45.0±1.7   | 43.7±1.0  | 43.0±1.1
boston        | 506  | 13 | pol4 | ≈10^9  | 17.1±3.6  | 24.7±10.8  | 22.2±2.2  | 18.1±3.8
boston        | 506  | 13 | rbf  | ≈10^12 | 16.4±4.0  | 32.4±8.2   | 20.7±2.1  | 17.1±4.7
pumadyn-32fh  | 8192 | 32 | pol4 | ≈10^22 | 57.3±0.7  | 56.4±0.8   | 56.4±0.7  | 56.4±0.8
pumadyn-32fh  | 8192 | 32 | rbf  | ≈10^31 | 57.7±0.6  | 72.2±22.5  | 56.5±0.8  | 55.7±0.7
pumadyn-32fm  | 8192 | 32 | pol4 | ≈10^22 | 6.9±0.1   | 6.4±1.6    | 7.0±0.1   | 3.1±0.0
pumadyn-32fm  | 8192 | 32 | rbf  | ≈10^31 | 5.0±0.1   | 46.2±51.6  | 7.1±0.1   | 3.4±0.0
pumadyn-32nh  | 8192 | 32 | pol4 | ≈10^22 | 84.2±1.3  | 73.3±25.4  | 83.6±1.3  | 36.7±0.4
pumadyn-32nh  | 8192 | 32 | rbf  | ≈10^31 | 56.5±1.1  | 81.3±25.0  | 83.7±1.3  | 35.5±0.5
pumadyn-32nm  | 8192 | 32 | pol4 | ≈10^22 | 60.1±1.9  | 69.9±32.8  | 77.5±0.9  | 5.5±0.1
pumadyn-32nm  | 8192 | 32 | rbf  | ≈10^31 | 15.7±0.4  | 67.3±42.4  | 77.6±0.9  | 7.2±0.1
Page 90: Extensions to other kernels

• Extension to graph kernels, string kernels, pyramid match kernels

[Figure: hierarchy of substrings over the alphabet {A, B}]

• Exploring large feature spaces with structured sparsity-inducing norms
  – Opposite view to traditional kernel methods
  – Interpretable models

• Other structures than hierarchies or DAGs
Pages 91-92: Grouped variables

• Supervised learning with known groups:
  – The ℓ1-ℓ2 norm: Σ_{G∈𝒢} ‖w_G‖_2 = Σ_{G∈𝒢} ( Σ_{j∈G} w_j^2 )^{1/2}, with 𝒢 a partition of {1, ..., p}
  – The ℓ1-ℓ2 norm sets to zero non-overlapping groups of variables, as opposed to single variables for the ℓ1-norm (a small prox sketch follows after this slide)

• However, the ℓ1-ℓ2 norm encodes fixed/static prior information: it requires knowing in advance how to group the variables

• What happens if the set of groups 𝒢 is not a partition anymore?
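Below is a minimal sketch, not from the tutorial, of the ℓ1-ℓ2 norm for a partition of the variables and of its proximal operator (block soft-thresholding), the basic building block of the algorithms behind the group Lasso; the group definitions and the threshold λ are illustrative.

```python
import numpy as np

def l1_l2_norm(w, groups):
    # Sum over groups G of ||w_G||_2, with `groups` a partition of the indices
    return sum(np.linalg.norm(w[g]) for g in groups)

def prox_group_lasso(w, groups, lam):
    # Block soft-thresholding: each group is shrunk toward zero,
    # and set exactly to zero when its norm is below lam.
    out = w.copy()
    for g in groups:
        norm_g = np.linalg.norm(w[g])
        out[g] = 0.0 if norm_g <= lam else (1.0 - lam / norm_g) * w[g]
    return out

w = np.array([0.2, -0.1, 3.0, -2.0, 0.05])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4])]
print(l1_l2_norm(w, groups))
print(prox_group_lasso(w, groups, lam=0.5))  # first and last groups vanish entirely
```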
Page 93: Structured Sparsity (Jenatton et al., 2009a)

• When penalizing by the ℓ1-ℓ2 norm Σ_{G∈𝒢} ‖w_G‖_2 = Σ_{G∈𝒢} ( Σ_{j∈G} w_j^2 )^{1/2}
  – The ℓ1-norm induces sparsity at the group level:
    ∗ Some w_G's are set to zero
  – Inside the groups, the ℓ2-norm does not promote sparsity

• Intuitively, the zero pattern of w is given by
  {j ∈ {1, ..., p}; w_j = 0} = ⋃_{G∈𝒢′} G   for some 𝒢′ ⊆ 𝒢

• This intuition is actually true and can be formalized
Page 94: Examples of sets of groups 𝒢 (1/3)

• Selection of contiguous patterns on a sequence, p = 6
  – 𝒢 is the set of blue groups
  – Any union of blue groups set to zero leads to the selection of a contiguous pattern

Page 95: Examples of sets of groups 𝒢 (2/3)

• Selection of rectangles on a 2-D grid, p = 25
  – 𝒢 is the set of blue/green groups (with their complements, not displayed)
  – Any union of blue/green groups set to zero leads to the selection of a rectangle

Page 96: Examples of sets of groups 𝒢 (3/3)

• Selection of diamond-shaped patterns on a 2-D grid, p = 25
  – It is possible to extend such settings to 3-D space, or more complex topologies
  – See applications later (sparse PCA)
Page 97: Theory and algorithms Sparse methods for machine learningfbach/nips2009tutorial/nips... · Sparse methods for machine learning Outline • Introduction - Overview • Sparse linear

Rela

tionsh

ipbew

teen

Gand

Zero

Patte

rns

(Jenatto

n,A

udib

ert,

and

Bach

,2009a)

•G

→Zero

patte

rns:

–by

generatin

gth

eunion

-closure

ofG

•Zero

patte

rns→

G:

–D

esigngrou

psG

froman

yunio

n-clo

sed

set

ofze

ropattern

s

–D

esigngrou

ps

Gfrom

any

inte

rsectio

n-clo

sed

set

ofnon-ze

ro

pattern

s

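As a small illustration, not from the tutorial, the achievable zero patterns can be enumerated by taking the union-closure of 𝒢; the toy groups below are made up.

```python
from itertools import combinations

def union_closure(groups):
    # All zero patterns reachable as unions of groups in G (including the empty union).
    groups = [frozenset(g) for g in groups]
    patterns = {frozenset()}
    for r in range(1, len(groups) + 1):
        for subset in combinations(groups, r):
            patterns.add(frozenset().union(*subset))
    return patterns

# Toy G on p = 4 variables: prefixes and suffixes, whose unions leave contiguous
# non-zero patterns in the middle of the sequence.
G = [{0}, {0, 1}, {0, 1, 2}, {3}, {2, 3}, {1, 2, 3}]
for pattern in sorted(union_closure(G), key=lambda s: (len(s), sorted(s))):
    print(sorted(pattern))
```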
Page 98: Theory and algorithms Sparse methods for machine learningfbach/nips2009tutorial/nips... · Sparse methods for machine learning Outline • Introduction - Overview • Sparse linear

Overv

iew

ofoth

er

work

on

structu

red

sparsity

•Specifi

chierarch

icalstru

cture

(Zhao

etal.,

2009;B

ach,2008c)

•U

nion

-closed(as

opposed

toin

tersection-closed

)fam

ilyof

non

zero

pattern

s(Jacob

etal.,

2009;B

araniu

ket

al.,2008)

•N

oncon

vexpen

altiesbased

onin

formation

-theoretic

criteriaw

ith

greedy

optim

ization(H

uan

get

al.,2009)

Page 99: Theory and algorithms Sparse methods for machine learningfbach/nips2009tutorial/nips... · Sparse methods for machine learning Outline • Introduction - Overview • Sparse linear

Sparse

meth

ods

for

mach

ine

learn

ing

Outlin

e

•In

troductio

n-O

vervie

w

•Sparse

linear

estim

atio

nw

ithth

eℓ1 -n

orm

–Con

vexop

timization

and

algorithm

s

–T

heoretical

results

•Stru

cture

dsp

arsem

eth

ods

on

vecto

rs

–G

roups

offeatu

res/

Multip

lekern

ellearn

ing

–Exten

sions

(hierarch

icalor

overlappin

ggrou

ps)

•Sparse

meth

ods

on

matrice

s

–M

ulti-task

learnin

g

–M

atrixfactorization

(low-ran

k,sp

arsePCA

,diction

arylearn

ing)

Page 100: Theory and algorithms Sparse methods for machine learningfbach/nips2009tutorial/nips... · Sparse methods for machine learning Outline • Introduction - Overview • Sparse linear

Learn

ing

on

matrice

s-

Colla

bora

tive

Filte

ring

(CF)

•G

ivennX

“movies”

x∈X

and

nY

“custom

ers”y∈Y

,

•pred

ictth

e“ratin

g”z(x

,y)∈

Zof

custom

ery

form

oviex

•Train

ing

data:

largenX×

nY

incom

plete

matrix

Zth

atdescrib

esth

e

know

nratin

gsof

some

custom

ersfor

some

movies

•G

oal:com

plete

the

matrix.

Page 101: Theory and algorithms Sparse methods for machine learningfbach/nips2009tutorial/nips... · Sparse methods for machine learning Outline • Introduction - Overview • Sparse linear

Learn

ing

on

matrice

s-M

ulti-ta

skle

arnin

g

•k

prediction

taskson

same

covariatesx∈

Rp

–k

weigh

tvectors

wj ∈

Rp

–Join

tm

atrixof

predictors

W=

(w1 ,...,w

k )∈R

p×k

•M

any

application

s

–“tran

sferlearn

ing”

–M

ulti-category

classification

(one

taskper

class)(A

mit

etal.,

2007)

•Share

param

etersbetw

eenvariou

stasks

–sim

ilarto

fixed

effect/ran

dom

effect

models

(Rau

den

bush

and

Bryk,

2002)

–join

tvariab

leor

feature

selection(O

bozin

skiet

al.,2009;

Pon

til

etal.,

2007)

Page 102: Theory and algorithms Sparse methods for machine learningfbach/nips2009tutorial/nips... · Sparse methods for machine learning Outline • Introduction - Overview • Sparse linear

Learn

ing

on

matrice

s-

Image

denoisin

g

•Sim

ultan

eously

den

oiseall

patch

esof

agiven

image

•Exam

ple

fromM

airal,B

ach,Pon

ce,Sap

iro,an

dZisserm

an(2009c)

Page 103: Theory and algorithms Sparse methods for machine learningfbach/nips2009tutorial/nips... · Sparse methods for machine learning Outline • Introduction - Overview • Sparse linear

Two

types

ofsp

arsityfo

rm

atrice

sM

∈R

n×p

I-

Dire

ctlyon

the

ele

ments

of

M

•M

any

zeroelem

ents:

Mij

=0

M

•M

any

zerorow

s(or

colum

ns):

(Mi1 ,...,M

ip )=

0

M

Page 104: Theory and algorithms Sparse methods for machine learningfbach/nips2009tutorial/nips... · Sparse methods for machine learning Outline • Introduction - Overview • Sparse linear

Two

types

ofsp

arsityfo

rm

atrice

sM

∈R

n×p

II-T

hro

ugh

afa

ctoriza

tion

of

M=

UV

•M

=U

V⊤,U

∈R

n×m

and

V∈

Rn×

m

•Low

rank:

msm

all

=

T

UV

M

•Sparse

deco

mpositio

n:

Usp

arseU=

VM

T

Page 105: Theory and algorithms Sparse methods for machine learningfbach/nips2009tutorial/nips... · Sparse methods for machine learning Outline • Introduction - Overview • Sparse linear

Stru

cture

dm

atrix

facto

rizatio

ns

-M

any

insta

nce

s

•M

=U

V⊤,U

∈R

n×m

and

V∈

Rp×

m

•Stru

cture

on

Uand/or

V

–Low

-rank:

Uan

dV

have

fewcolu

mns

–D

ictionary

learnin

g/

sparse

PCA

:U

orV

has

man

yzeros

–Clu

stering

(k-m

eans):

U∈{0,1}

n×m

,U

1=

1

–Poin

twise

positivity:

non

negative

matrix

factorization(N

MF)

–Specifi

cpattern

sof

zeros

–etc.

•M

any

applica

tions

–e.g.,

source

separation

(Fevotte

etal.,

2009),exp

loratorydata

analysis

Page 106: Theory and algorithms Sparse methods for machine learningfbach/nips2009tutorial/nips... · Sparse methods for machine learning Outline • Introduction - Overview • Sparse linear

Multi-ta

skle

arnin

g

•Join

tm

atrixof

predictors

W=

(w1 ,...,w

k )∈R

p×k

•Join

tvaria

ble

sele

ction

(Obozin

skiet

al.,2009)

–Pen

alizeby

the

sum

ofth

enorm

sof

rows

ofW

(group

Lasso)

–Select

variables

which

arepred

ictivefor

alltasks

Page 107: Theory and algorithms Sparse methods for machine learningfbach/nips2009tutorial/nips... · Sparse methods for machine learning Outline • Introduction - Overview • Sparse linear

Multi-ta

skle

arnin

g

•Join

tm

atrixof

predictors

W=

(w1 ,...,w

k )∈R

p×k

•Join

tvaria

ble

sele

ction

(Obozin

skiet

al.,2009)

–Pen

alizeby

the

sum

ofth

enorm

sof

rows

ofW

(group

Lasso)

–Select

variables

which

arepred

ictivefor

alltasks

•Join

tfe

atu

rese

lectio

n(P

ontil

etal.,

2007)

–Pen

alizeby

the

trace-norm

(seelater)

–Con

struct

linear

features

comm

onto

alltasks

•T

heory:

allows

num

ber

ofob

servations

which

issu

blin

earin

the

num

ber

oftasks

(Obozin

skiet

al.,2008;

Lou

nici

etal.,

2009)

•Practice:

more

interpretab

lem

odels,

slightly

improved

perform

ance

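A minimal sketch, not from the tutorial, of the joint variable selection penalty: the ℓ1-ℓ2 norm over the rows of W and its proximal operator, which drops a variable simultaneously for all tasks; the dimensions and λ are illustrative.

```python
import numpy as np

def multitask_group_penalty(W):
    # Sum over variables j of || W[j, :] ||_2  (one row per variable, one column per task)
    return np.sum(np.linalg.norm(W, axis=1))

def prox_rows(W, lam):
    # Row-wise block soft-thresholding: a variable is removed for every task at once.
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(1.0 - lam / np.maximum(norms, 1e-12), 0.0)
    return scale * W

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 3)) * (rng.random((6, 1)) < 0.5)  # p = 6 variables, k = 3 tasks
print(multitask_group_penalty(W))
print(prox_rows(W, lam=1.0))  # rows with small norm become exactly zero
```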
Page 108: Theory and algorithms Sparse methods for machine learningfbach/nips2009tutorial/nips... · Sparse methods for machine learning Outline • Introduction - Overview • Sparse linear

Low

-rank

matrix

facto

rizatio

ns

Tra

cenorm

•G

ivena

matrix

M∈

Rn×

p

–Ran

kof

Mis

the

min

imum

sizem

ofall

factorizations

ofM

into

M=

UV

⊤,U

∈R

n×m

and

V∈

Rp×

m

–Sin

gular

value

decom

position

:M

=U

Diag

(s)V⊤

where

Uan

dV

have

orthon

ormal

colum

ns

and

s∈R

m+are

singu

larvalu

es

•Ran

kof

Meq

ual

toth

enum

ber

ofnon

-zerosin

gular

values

Page 109: Theory and algorithms Sparse methods for machine learningfbach/nips2009tutorial/nips... · Sparse methods for machine learning Outline • Introduction - Overview • Sparse linear

Low

-rank

matrix

facto

rizatio

ns

Tra

cenorm

•G

ivena

matrix

M∈

Rn×

p

–Ran

kof

Mis

the

min

imum

sizem

ofall

factorizations

ofM

into

M=

UV

⊤,U

∈R

n×m

and

V∈

Rp×

m

–Sin

gular

value

decom

position

:M

=U

Diag

(s)V⊤

where

Uan

dV

have

orthon

ormal

colum

ns

and

s∈R

m+are

singu

larvalu

es

•Ran

kof

Meq

ual

toth

enum

ber

ofnon

-zerosin

gular

values

•Tra

ce-n

orm

(a.k

.a.nucle

arnorm

)=

sum

ofsin

gular

values

•Con

vexfu

nction

,lead

sto

asem

i-defi

nite

program(F

azelet

al.,2001)

•First

used

forcollab

orativefilterin

g(S

rebroet

al.,2005)

Page 110: Theory and algorithms Sparse methods for machine learningfbach/nips2009tutorial/nips... · Sparse methods for machine learning Outline • Introduction - Overview • Sparse linear

Resu

ltsfo

rth

etra

cenorm

•Ran

krecovery

condition

(Bach

,2008d

)

–T

he

Hessian

ofth

eloss

around

the

asymptotic

solution

shou

ldbe

closeto

diagon

al

•Suffi

cient

condition

forexact

rank

min

imization

(Rech

tet

al.,2009)

•H

igh-d

imen

sional

inferen

cefor

noisy

matrix

completion

(Srebro

etal.,

2005;Can

des

and

Plan

,2009a)

–M

ayrecover

entire

matrix

fromsligh

tlym

oreen

triesth

anth

e

min

imum

ofth

etw

odim

ension

s

•Effi

cient

alg

orith

ms:

–First-ord

erm

ethodsbased

onth

esin

gular

value

decom

position

(see,

e.g.,M

azum

der

etal.,

2009)

–Low

-rank

formulation

s(R

ennie

and

Srebro,

2005;A

bern

ethy

etal.,

2009)

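The first-order methods mentioned above repeatedly apply the proximal operator of the trace norm, i.e., soft-thresholding of the singular values. Here is a minimal sketch, not the authors' implementation, with the threshold λ and the test matrix chosen arbitrarily.

```python
import numpy as np

def trace_norm(M):
    # Sum of singular values (nuclear norm)
    return np.linalg.svd(M, compute_uv=False).sum()

def prox_trace_norm(M, lam):
    # Soft-threshold the singular values: proximal operator of lam * ||.||_* at M
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_shrunk = np.maximum(s - lam, 0.0)
    return U @ np.diag(s_shrunk) @ Vt

rng = np.random.default_rng(0)
M = rng.normal(size=(8, 5)) @ rng.normal(size=(5, 6))   # matrix of rank at most 5
print(trace_norm(M))
print(np.linalg.matrix_rank(prox_trace_norm(M, lam=2.0)))  # shrinkage can reduce the rank
```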
Page 111: Theory and algorithms Sparse methods for machine learningfbach/nips2009tutorial/nips... · Sparse methods for machine learning Outline • Introduction - Overview • Sparse linear

Spectra

lre

gulariza

tions

•Exten

sions

toan

yfu

nction

sof

singu

larvalu

es

•Exten

sions

tobilin

earform

s(A

bern

ethy

etal.,

2009)

(x,y

)7→Φ

(x) ⊤

(y)

onfeatu

resΦ

(x)∈

RfX

and

Ψ(y

)∈R

fY,an

dB

∈R

fX×

fY

•Collab

orativefilterin

gw

ithattrib

utes

•Repre

sente

rth

eore

m:

the

solution

must

be

ofth

eform

B=

nX

∑i=1

nY

∑j=

1

αij Ψ

(xi )Φ

(yj ) ⊤

•O

nly

norm

sin

variantby

orthogon

altransform

s(A

rgyriouet

al.,2009)

Page 112: Theory and algorithms Sparse methods for machine learningfbach/nips2009tutorial/nips... · Sparse methods for machine learning Outline • Introduction - Overview • Sparse linear

Sparse

prin

cipalco

mponent

analy

sis

•G

ivendata

matrix

X=

(x⊤1,...,x

⊤n) ⊤

∈R

n×p,

princip

alcom

pon

ent

analysis

(PCA

)m

aybe

seenfrom

two

persp

ectives:

–Analysis

view

:find

the

projectionv∈

Rp

ofm

aximum

variance

(with

defl

ationto

obtain

more

compon

ents)

–Syn

thesis

view

:find

the

basis

v1 ,...,v

ksu

chth

atall

xihave

low

reconstru

ctionerror

when

decom

posed

onth

isbasis

•For

regular

PCA

,th

etw

oview

sare

equivalen

t

•Sparse

extension

s

–In

terpretability

–H

igh-d

imen

sional

inferen

ce

–Two

view

sare

diff

ere

nts

Page 113: Theory and algorithms Sparse methods for machine learningfbach/nips2009tutorial/nips... · Sparse methods for machine learning Outline • Introduction - Overview • Sparse linear

Sparse

prin

cipalco

mponent

analy

sis

Analy

sisvie

w

•D

SPCA

(d’A

spremon

tet

al.,2007),

with

A=

1nX

⊤X

∈R

p×p

–m

ax‖v‖

2=

1,‖

v‖06

kv ⊤

Av

relaxedin

tom

ax‖v‖

2=

1,‖

v‖16

k1/2v ⊤

Av

–usin

gM

=vv ⊤

,itself

relaxedin

tom

axM

<0,tr

M=

1, 1 ⊤

|M|1

6ktr

AM

Page 114: Theory and algorithms Sparse methods for machine learningfbach/nips2009tutorial/nips... · Sparse methods for machine learning Outline • Introduction - Overview • Sparse linear

Sparse

prin

cipalco

mponent

analy

sis

Analy

sisvie

w

•D

SPCA

(d’A

spremon

tet

al.,2007),

with

A=

1nX

⊤X

∈R

p×p

–m

ax‖v‖

2=

1, ‖

v‖06

kv ⊤

Av

relaxedin

tom

ax‖v‖

2=

1,‖

v‖16

k1/2v ⊤

Av

–usin

gM

=vv ⊤

,itself

relaxedin

tom

axM

<0,tr

M=

1, 1 ⊤

|M|1

6ktr

AM

•Req

uires

defl

ationfor

multip

lecom

pon

ents

(Mackey,

2009)

•M

orerefi

ned

convex

relaxation(d

’Asprem

ont

etal.,

2008)

•N

oncon

vexan

alysis(M

oghad

dam

etal.,

2006b)

•A

pplication

sbeyon

din

terpretable

princip

alcom

pon

ents

–used

assu

fficien

tcon

dition

sfor

high

-dim

ension

alin

ference

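For intuition only, here is a simple greedy heuristic for the ℓ0-constrained problem max_{‖v‖_2=1, ‖v‖_0≤k} v⊤Av; this is not one of the convex relaxations above nor a method from the tutorial, just an illustrative truncated power iteration on a covariance with a planted sparse direction.

```python
import numpy as np

def truncated_power_iteration(A, k, n_iter=200):
    # Keep only the k largest entries of v at each power-iteration step, then renormalize.
    p = A.shape[0]
    v = np.ones(p) / np.sqrt(p)
    for _ in range(n_iter):
        v = A @ v
        idx = np.argsort(np.abs(v))[:-k]     # indices of the p - k smallest entries
        v[idx] = 0.0
        v /= np.linalg.norm(v)
    return v

rng = np.random.default_rng(0)
n, p, k = 200, 30, 5
v_true = np.zeros(p); v_true[:k] = 1.0 / np.sqrt(k)
X = rng.normal(size=(n, 1)) @ v_true[None, :] * 3.0 + rng.normal(size=(n, p))
A = X.T @ X / n
v_hat = truncated_power_iteration(A, k)
print(np.nonzero(v_hat)[0])   # ideally the first k coordinates
```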
Page 115: Theory and algorithms Sparse methods for machine learningfbach/nips2009tutorial/nips... · Sparse methods for machine learning Outline • Introduction - Overview • Sparse linear

Sparse

prin

cipalco

mponent

analy

sis

Synth

esis

vie

w

•Fin

dv1 ,...,v

m∈

Rp

sparse

soth

at

n∑i=

1

min

u∈R

m

∥∥∥∥x

i −m

∑j=

1

uj v

j

∥∥∥∥

22

issm

all

•Equivalen

tto

look

forU

∈R

n×m

and

V∈

Rp×

msu

chth

atV

is

sparse

and‖X

−U

V⊤‖

2Fis

small

Page 116: Theory and algorithms Sparse methods for machine learningfbach/nips2009tutorial/nips... · Sparse methods for machine learning Outline • Introduction - Overview • Sparse linear

Sparse

prin

cipalco

mponent

analy

sis

Synth

esis

vie

w

•Fin

dv1 ,...,v

m∈

Rp

sparse

soth

at

n∑i=

1

min

u∈R

m

∥∥∥∥x

i −m

∑j=

1

uj v

j

∥∥∥∥

22

issm

all

•Equivalen

tto

look

forU

∈R

n×m

and

V∈

Rp×

msu

chth

atV

is

sparse

and‖X

−U

V⊤‖

2Fis

small

•Sparse

formulation

(Witten

etal.,

2009;B

achet

al.,2008)

–Pen

alizecolu

mns

viof

Vby

the

ℓ1 -n

ormfor

sparsity

–Pen

alizecolu

mns

uiof

Uby

the

ℓ2 -n

ormto

avoidtrivial

solution

s

min

U,V

‖X−

UV

⊤‖2F

m∑i=

1

{‖ui ‖

22+‖v

i ‖21

}

Page 117: Theory and algorithms Sparse methods for machine learningfbach/nips2009tutorial/nips... · Sparse methods for machine learning Outline • Introduction - Overview • Sparse linear

Stru

cture

dm

atrix

facto

rizatio

ns

min

U,V

‖X−

UV

⊤‖2F

m∑i=

1

{‖ui ‖

2+

‖vi ‖

2}

•Pen

alizing

by‖u

i ‖2+

‖vi ‖

2eq

uivalen

tto

constrain

ing‖u

i ‖6

1an

d

pen

alizing

by‖v

i ‖(B

achet

al.,2008)

•O

ptim

izatio

nby

alte

rnatin

gm

inim

izatio

n(n

on-co

nve

x)

•u

idecom

position

coeffi

cients

(or“co

de”),

vidiction

aryelem

ents

•Sparse

PCA

=sp

arsedictio

nary

(ℓ1 -n

ormon

ui )

•D

ictionary

learn

ing

=sp

arsedeco

mpositio

ns

(ℓ1 -n

ormon

vi)

–O

lshau

senan

dField

(1997);Elad

and

Aharon

(2006);Rain

aet

al.

(2007)

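Here is a minimal alternating-minimization sketch, an illustration rather than the authors' code, for the dictionary-learning variant above: an ℓ1 penalty on the codes and unit-norm dictionary elements. The data, the penalty λ and the iteration counts are illustrative, and the sparse coding step uses a few crude ISTA iterations rather than an exact Lasso solver.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def dictionary_learning(X, m, lam=0.1, n_iter=20, ista_steps=50):
    n, p = X.shape
    rng = np.random.default_rng(0)
    V = rng.normal(size=(p, m))
    V /= np.linalg.norm(V, axis=0, keepdims=True)           # unit-norm dictionary elements
    U = np.zeros((n, m))
    for _ in range(n_iter):
        # Sparse coding step: approximate Lasso in U for fixed V (ISTA iterations)
        L = np.linalg.norm(V, 2) ** 2                        # Lipschitz constant of the smooth part
        for _ in range(ista_steps):
            grad = (U @ V.T - X) @ V
            U = soft_threshold(U - grad / L, lam / L)
        # Dictionary update step: least squares in V, then renormalize the columns
        V = np.linalg.lstsq(U, X, rcond=None)[0].T
        V /= np.maximum(np.linalg.norm(V, axis=0, keepdims=True), 1e-12)
    return U, V

X = np.random.default_rng(1).normal(size=(100, 20))
U, V = dictionary_learning(X, m=5)
print(np.mean(U == 0.0))   # fraction of exactly-zero code entries
```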
Page 118: Theory and algorithms Sparse methods for machine learningfbach/nips2009tutorial/nips... · Sparse methods for machine learning Outline • Introduction - Overview • Sparse linear

Dictio

nary

learn

ing

for

image

denoisin

g

x︸︷︷︸

measu

remen

ts

=x

︸︷︷︸

origin

alim

age +

ε︸︷︷︸

noise

Page 119: Theory and algorithms Sparse methods for machine learningfbach/nips2009tutorial/nips... · Sparse methods for machine learning Outline • Introduction - Overview • Sparse linear

Dictio

nary

learn

ing

for

image

denoisin

g

•Solvin

gth

edenoisin

gpro

ble

m(E

ladan

dA

haron

,2006)

–Extract

alloverlap

pin

g8×

8patch

esx

i ∈R

64.

–Form

the

matrix

X=

[x1 ,...,x

n] ⊤

∈R

n×64

–Solve

am

atrixfactorization

problem

:

min

U,V

||X−

UV

⊤|| 2F=

n∑i=

1 ||xi −

VU

(i,:)|| 22

where

Uis

sparse

,an

dV

isth

edictio

nary

–Each

patch

isdecom

posed

into

xi=

VU

(i,:)

–A

verageth

erecon

struction

VU

(i,:)of

eachpatch

xito

reconstru

ct

afu

ll-sizedim

age

•T

he

num

ber

ofpatch

esn

islarge

(=num

ber

ofpixels)

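As an illustration of the patch pipeline, not the authors' implementation, the sketch below extracts all overlapping 8×8 patches of an image, pretends each patch has been reconstructed, and averages the overlapping reconstructions back into a full-sized image; `reconstruct_patch` is a hypothetical stand-in for the sparse reconstruction V U(i,:).

```python
import numpy as np

def extract_patches(img, k=8):
    # All overlapping k x k patches, flattened to vectors of length k*k
    H, W = img.shape
    return np.array([img[i:i + k, j:j + k].ravel()
                     for i in range(H - k + 1) for j in range(W - k + 1)])

def reconstruct_patch(patch):
    # Hypothetical stand-in: in the real pipeline this is V U(i,:),
    # the sparse reconstruction of the patch on the learned dictionary.
    return patch

def average_patches(patches, shape, k=8):
    # Put every reconstructed patch back in place and average the overlaps
    H, W = shape
    acc, count = np.zeros(shape), np.zeros(shape)
    idx = 0
    for i in range(H - k + 1):
        for j in range(W - k + 1):
            acc[i:i + k, j:j + k] += patches[idx].reshape(k, k)
            count[i:i + k, j:j + k] += 1.0
            idx += 1
    return acc / count

img = np.random.default_rng(0).random((32, 32))
patches = np.array([reconstruct_patch(p) for p in extract_patches(img)])
print(np.allclose(average_patches(patches, img.shape), img))  # True with the identity stand-in
```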
Page 120: Theory and algorithms Sparse methods for machine learningfbach/nips2009tutorial/nips... · Sparse methods for machine learning Outline • Introduction - Overview • Sparse linear

Onlin

eoptim

izatio

nfo

rdictio

nary

learn

ing

min

U∈

Rn×

m,V

∈C

n∑i=

1 ||xi −

VU

(i,:)|| 22+

λ||U(i,:)||1

C△={V

∈R

p×m

s.t.∀j

=1,...,m

,||V

(:,j)||26

1}.

•Classical

optim

izationaltern

atesbetw

eenU

and

V

•G

ood

results,

but

veryslow

!

Page 121: Theory and algorithms Sparse methods for machine learningfbach/nips2009tutorial/nips... · Sparse methods for machine learning Outline • Introduction - Overview • Sparse linear

Onlin

eoptim

izatio

nfo

rdictio

nary

learn

ing

min

U∈

Rn×

m,V

∈C

n∑i=

1 ||xi −

VU

(i,:)|| 22+

λ||U(i,:)||1

C△={V

∈R

p×m

s.t.∀j

=1,...,m

,||V

(:,j)||26

1}.

•Classical

optim

izationaltern

atesbetw

eenU

and

V.

•G

ood

results,

but

veryslow

!

•O

nlin

ele

arnin

g(M

airal,B

ach,Pon

ce,an

dSap

iro,2009a)

can

–han

dle

poten

tiallyin

finite

datasets

–ad

apt

todyn

amic

trainin

gsets

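Below is a much-simplified sketch in the spirit of the online approach: one sample at a time, sufficient statistics accumulated, dictionary updated by block coordinate descent. It is an illustration under stated assumptions, not the algorithm of Mairal et al. (2009a); the sparse coding step reuses a crude ISTA loop and all sizes are illustrative.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_code(x, V, lam, n_steps=100):
    # min_u ||x - V u||_2^2 + lam * ||u||_1, solved approximately by ISTA
    u = np.zeros(V.shape[1])
    L = np.linalg.norm(V, 2) ** 2 + 1e-12
    for _ in range(n_steps):
        u = soft_threshold(u - V.T @ (V @ u - x) / L, lam / (2 * L))
    return u

def online_dictionary_learning(stream, p, m, lam=0.1):
    rng = np.random.default_rng(0)
    V = rng.normal(size=(p, m))
    V /= np.linalg.norm(V, axis=0, keepdims=True)
    A, B = np.zeros((m, m)), np.zeros((p, m))       # accumulated sufficient statistics
    for x in stream:
        u = lasso_code(x, V, lam)                   # sparse code of the current sample
        A += np.outer(u, u)
        B += np.outer(x, u)
        for j in range(m):                          # block coordinate update of each atom
            if A[j, j] > 1e-12:
                V[:, j] += (B[:, j] - V @ A[:, j]) / A[j, j]
                V[:, j] /= max(np.linalg.norm(V[:, j]), 1.0)   # project onto the unit ball (set C)
    return V

stream = (np.random.default_rng(1).normal(size=50) for _ in range(200))
V = online_dictionary_learning(stream, p=50, m=10)
print(V.shape)
```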
Page 122: Theory and algorithms Sparse methods for machine learningfbach/nips2009tutorial/nips... · Sparse methods for machine learning Outline • Introduction - Overview • Sparse linear

Denoisin

gre

sult

(Maira

l,B

ach

,Ponce

,Sapiro

,and

Zisse

rman,2009c)

Page 123: Theory and algorithms Sparse methods for machine learningfbach/nips2009tutorial/nips... · Sparse methods for machine learning Outline • Introduction - Overview • Sparse linear

Denoisin

gre

sult

(Maira

l,B

ach

,Ponce

,Sapiro

,and

Zisse

rman,2009c)

Page 124: Theory and algorithms Sparse methods for machine learningfbach/nips2009tutorial/nips... · Sparse methods for machine learning Outline • Introduction - Overview • Sparse linear

What

does

the

dictio

nary

Vlo

ok

like?

Page 125: Theory and algorithms Sparse methods for machine learningfbach/nips2009tutorial/nips... · Sparse methods for machine learning Outline • Introduction - Overview • Sparse linear

Inpain

ting

a12-M

pix

elphoto

gra

ph

Page 126: Theory and algorithms Sparse methods for machine learningfbach/nips2009tutorial/nips... · Sparse methods for machine learning Outline • Introduction - Overview • Sparse linear

Inpain

ting

a12-M

pix

elphoto

gra

ph

Page 127: Theory and algorithms Sparse methods for machine learningfbach/nips2009tutorial/nips... · Sparse methods for machine learning Outline • Introduction - Overview • Sparse linear

Inpain

ting

a12-M

pix

elphoto

gra

ph

Page 128: Theory and algorithms Sparse methods for machine learningfbach/nips2009tutorial/nips... · Sparse methods for machine learning Outline • Introduction - Overview • Sparse linear

Inpain

ting

a12-M

pix

elphoto

gra

ph

Page 129: Theory and algorithms Sparse methods for machine learningfbach/nips2009tutorial/nips... · Sparse methods for machine learning Outline • Introduction - Overview • Sparse linear

Alte

rnativ

eusa

ges

ofdictio

nary

learn

ing

•U

sesth

e“co

de”

Uas

representation

ofob

servations

forsu

bseq

uen

t

processin

g(R

aina

etal.,

2007;Yan

get

al.,2009)

•A

dap

tdiction

aryelem

ents

tosp

ecific

tasks(M

airalet

al.,2009b

)

–D

iscrimin

ativetrain

ing

forweakly

supervised

pixelclassifi

cation(M

airal

etal.,

2008)

Page 130: Theory and algorithms Sparse methods for machine learningfbach/nips2009tutorial/nips... · Sparse methods for machine learning Outline • Introduction - Overview • Sparse linear

Sparse

Stru

cture

dPCA

(Jenatto

n,O

bozin

ski,

and

Bach

,2009b)

•Learn

ing

sparse

and

structu

red

diction

aryelem

ents:

min

U,V

‖X−

UV

⊤‖2F

m∑i=

1

{‖ui ‖

2+‖v

i ‖2}

•Stru

ctured

norm

onth

ediction

aryelem

ents

–grou

ped

pen

altyw

ithoverlap

pin

ggrou

ps

toselect

specifi

cclasses

ofsp

arsitypattern

s

–use

priorin

formation

forbetter

reconstru

ctionan

d/or

added

robustn

ess

•Effi

cient

learnin

gpro

cedures

throu

ghη-tricks

(closedform

updates)

Page 131: Theory and algorithms Sparse methods for machine learningfbach/nips2009tutorial/nips... · Sparse methods for machine learning Outline • Introduction - Overview • Sparse linear

Applica

tion

tofa

cedata

base

s(1

/3)

rawdata

(unstru

ctured

)N

MF

•N

MF

obtain

spartially

local

features

Page 132: Theory and algorithms Sparse methods for machine learningfbach/nips2009tutorial/nips... · Sparse methods for machine learning Outline • Introduction - Overview • Sparse linear

Applica

tion

tofa

cedata

base

s(2

/3)

(unstru

ctured

)sp

arsePCA

Stru

ctured

sparse

PCA

•Enforce

selectionof

convex

non

zeropattern

s⇒

robustn

essto

occlu

sion

Page 133: Theory and algorithms Sparse methods for machine learningfbach/nips2009tutorial/nips... · Sparse methods for machine learning Outline • Introduction - Overview • Sparse linear

Applica

tion

tofa

cedata

base

s(2

/3)

(unstru

ctured

)sp

arsePCA

Stru

ctured

sparse

PCA

•Enforce

selectionof

convex

non

zeropattern

s⇒

robustn

essto

occlu

sion

Page 134: Theory and algorithms Sparse methods for machine learningfbach/nips2009tutorial/nips... · Sparse methods for machine learning Outline • Introduction - Overview • Sparse linear

Applica

tion

tofa

cedata

base

s(3

/3)

•Q

uan

titativeperform

ance

evaluation

onclassifi

cationtask

20

40

60

80

10

01

20

14

05

10

15

20

25

30

35

40

45

Dic

tion

ary

siz

e

% Correct classification

raw

da

taP

CA

NM

FS

PC

Ash

are

d−

SP

CA

SS

PC

Ash

are

d−

SS

PC

A

Page 135: Theory and algorithms Sparse methods for machine learningfbach/nips2009tutorial/nips... · Sparse methods for machine learning Outline • Introduction - Overview • Sparse linear

Topic

models

and

matrix

facto

rizatio

n

•Late

nt

Dirich

let

allo

catio

n(B

leiet

al.,2003)

–For

adocu

men

t,sam

ple

θ∈

Rk

froma

Dirich

let(α)

–For

the

n-th

word

ofth

esam

edocu

men

t,

∗sam

ple

atop

iczn

froma

multin

omial

with

param

eterθ

∗sam

ple

aword

wn

froma

multin

omial

with

param

eterβ(z

n,:)

Page 136: Theory and algorithms Sparse methods for machine learningfbach/nips2009tutorial/nips... · Sparse methods for machine learning Outline • Introduction - Overview • Sparse linear

Topic

models

and

matrix

facto

rizatio

n

•Late

nt

Dirich

let

allo

catio

n(B

leiet

al.,2003)

–For

adocu

men

t,sam

ple

θ∈

Rk

froma

Dirich

let(α)

–For

the

n-th

word

ofth

esam

edocu

men

t,

∗sam

ple

atop

iczn

froma

multin

omial

with

param

eterθ

∗sam

ple

aword

wn

froma

multin

omial

with

param

eterβ(z

n,:)

•In

terp

reta

tion

as

multin

om

ialPCA

(Buntin

ean

dPerttu

,2003)

–M

arginalizin

gover

topic

zn,given

θ,each

word

wn

isselected

from

am

ultin

omial

with

param

eter∑

kz=

k β(z

,:)=

β⊤θ

–Row

ofβ

=diction

aryelem

ents,

θco

de

fora

docu

men

t

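For concreteness, here is a minimal sketch, not from the tutorial, of the LDA generative process described above, sampling a single document; k, the vocabulary size, α and β are illustrative.

```python
import numpy as np

def sample_document(alpha, beta, doc_length, rng):
    # alpha: Dirichlet parameter (length k), beta: k x vocab_size topic-word matrix
    theta = rng.dirichlet(alpha)                  # topic proportions for this document
    words = []
    for _ in range(doc_length):
        z = rng.choice(len(alpha), p=theta)       # sample a topic z_n ~ Multinomial(theta)
        w = rng.choice(beta.shape[1], p=beta[z])  # sample a word w_n ~ Multinomial(beta(z_n,:))
        words.append(w)
    return theta, words

rng = np.random.default_rng(0)
k, vocab = 3, 10
alpha = np.ones(k)
beta = rng.dirichlet(np.ones(vocab), size=k)      # each row of beta is a distribution over words
theta, words = sample_document(alpha, beta, doc_length=20, rng=rng)
# Marginally, each word is drawn from the mixture beta.T @ theta (multinomial PCA view)
print(theta, words)
```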
Page 137: Theory and algorithms Sparse methods for machine learningfbach/nips2009tutorial/nips... · Sparse methods for machine learning Outline • Introduction - Overview • Sparse linear

Topic

models

and

matrix

facto

rizatio

n

•Two

diff

ere

nt

view

son

the

sam

epro

ble

m

–In

teresting

parallels

tobe

mad

e

–Com

mon

problem

sto

be

solved

•Stru

cture

on

dictio

nary/

deco

mpositio

nco

effi

cients

with

adap

ted

priors,e.g.,

nested

Chin

eserestau

rant

processes

(Blei

etal.,

2004)

•O

ther

priorsan

dprob

abilistic

formulation

s(G

riffith

san

dGhah

raman

i,

2006;Salakh

utd

inov

and

Mnih

,2008;

Arch

ambeau

and

Bach

,2008)

•Id

entifi

ability

and

inte

rpre

tatio

n/eva

luatio

nofre

sults

•D

iscrimin

ative

task

s(B

leian

dM

cAuliff

e,2008;

Lacoste-Ju

lien

etal.,

2008;M

airalet

al.,2009b

)

•O

ptim

izatio

nand

loca

lm

inim

a

Page 138: Theory and algorithms Sparse methods for machine learningfbach/nips2009tutorial/nips... · Sparse methods for machine learning Outline • Introduction - Overview • Sparse linear

Sparsify

ing

linear

meth

ods

•Sam

epatte

rnth

an

with

kern

elm

eth

ods

–H

igh-d

imen

sional

inferen

cerath

erth

annon

-linearities

•M

aindiff

erence:

ingen

eralno

uniq

ue

way

•Sparse

CCA

(Srip

erum

buduret

al.,2009;

Hard

oon

and

Shaw

e-Taylor,

2008;A

rcham

beau

and

Bach

,2008)

•Sparse

LD

A(M

oghad

dam

etal.,

2006a)

•Sparse

...

Page 139: Theory and algorithms Sparse methods for machine learningfbach/nips2009tutorial/nips... · Sparse methods for machine learning Outline • Introduction - Overview • Sparse linear

Sparse

meth

ods

for

matrice

s

Sum

mary

•Stru

ctured

matrix

factorizationhas

man

yap

plication

s

•A

lgorithm

icissu

es

–D

ealing

with

largedatasets

–D

ealing

with

structu

redsp

arsity

•T

heoretical

issues

–Id

entifi

ability

ofstru

ctures,

diction

ariesor

codes

–O

ther

approach

esto

sparsity

and

structu

re

•N

on-con

vexop

timization

versus

convex

optim

ization

–Con

vexification

throu

ghunbou

nded

diction

arysize

(Bach

etal.,

2008;B

radley

and

Bagn

ell,2009)

-few

perform

ance

improvem

ents

Page 140: Theory and algorithms Sparse methods for machine learningfbach/nips2009tutorial/nips... · Sparse methods for machine learning Outline • Introduction - Overview • Sparse linear

Sparse

meth

ods

for

mach

ine

learn

ing

Outlin

e

•In

troductio

n-O

vervie

w

•Sparse

linear

estim

atio

nw

ithth

eℓ1 -n

orm

–Con

vexop

timization

and

algorithm

s

–T

heoretical

results

•Stru

cture

dsp

arsem

eth

ods

on

vecto

rs

–G

roups

offeatu

res/

Multip

lekern

ellearn

ing

–Exten

sions

(hierarch

icalor

overlappin

ggrou

ps)

•Sparse

meth

ods

on

matrice

s

–M

ulti-task

learnin

g

–M

atrixfactorization

(low-ran

k,sp

arsePCA

,diction

arylearn

ing)

Page 141: Theory and algorithms Sparse methods for machine learningfbach/nips2009tutorial/nips... · Sparse methods for machine learning Outline • Introduction - Overview • Sparse linear

Lin

ks

with

com

pre

ssed

sensin

g

(Bara

niu

k,2007;Candes

and

Wakin

,2008)

•G

oalof

compressed

sensin

g:recover

asign

alw

∈R

pfrom

only

n

measu

remen

tsy

=X

w∈

Rn

•A

ssum

ption

s:th

esign

alis

k-sp

arse,n

much

smaller

than

p

•A

lgorithm

:m

inw∈

Rp‖w‖

1su

chth

aty

=X

w

•Suffi

cient

condition

onX

and

(k,n

,p)

forperfect

recovery:

–Restricted

isometry

property

(allsm

allsu

bm

atricesof

X⊤X

must

be

well-con

dition

ed)

–Such

matrices

arehard

tocom

eup

with

determ

inistically,

but

random

ones

areO

Kw

ithk

logp

=O

(n)

•Random

Xfo

rm

ach

ine

learn

ing?

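A minimal sketch, not from the tutorial, of the ℓ1-minimization step, written as the standard linear program over w = w⁺ − w⁻ and solved with scipy; the dimensions, the sparsity level and the random Gaussian X are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(X, y):
    # min ||w||_1  s.t.  X w = y,  via  w = wp - wm,  wp, wm >= 0
    n, p = X.shape
    c = np.ones(2 * p)
    A_eq = np.hstack([X, -X])
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=[(0, None)] * (2 * p), method="highs")
    return res.x[:p] - res.x[p:]

rng = np.random.default_rng(0)
n, p, k = 40, 100, 5
X = rng.normal(size=(n, p)) / np.sqrt(n)          # random Gaussian measurement matrix
w_true = np.zeros(p)
w_true[rng.choice(p, k, replace=False)] = rng.normal(size=k)
y = X @ w_true
w_hat = basis_pursuit(X, y)
print(np.max(np.abs(w_hat - w_true)))              # typically tiny: exact recovery regime
```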
Page 142: Theory and algorithms Sparse methods for machine learningfbach/nips2009tutorial/nips... · Sparse methods for machine learning Outline • Introduction - Overview • Sparse linear

Why

use

sparse

meth

ods?

•Sparsity

asa

proxyto

interpretab

ility

–Stru

ctured

sparsity

•Sparse

meth

ods

arenot

limited

toleast-sq

uares

regression

•Faster

trainin

g/testing

•B

etterpred

ictiveperform

ance?

–Prob

lems

aresp

arseif

youlo

okat

them

the

right

way

–Prob

lems

aresp

arseif

youm

aketh

emsp

arse

Page 143: Theory and algorithms Sparse methods for machine learningfbach/nips2009tutorial/nips... · Sparse methods for machine learning Outline • Introduction - Overview • Sparse linear

Conclu

sion

-In

tere

sting

questio

ns/

issues

•Im

plicit

vs.exp

licitfeatu

res

–Can

we

algorithm

icallyach

ievelog

p=

O(n

)w

ithexp

licit

unstru

ctured

features?

•N

ormdesign

–W

hat

type

ofbeh

aviorm

aybe

obtain

edw

ithsp

arsity-inducin

g

norm

s?

•O

verfittin

gcon

vexity

–D

owe

actually

need

convexity

form

atrixfactorization

problem

s?

Page 144: Theory and algorithms Sparse methods for machine learningfbach/nips2009tutorial/nips... · Sparse methods for machine learning Outline • Introduction - Overview • Sparse linear

Hirin

gpostd

ocs

and

PhD

students

Euro

pean

Rese

archCouncil

projecton

Sparse

structu

red

meth

odsfo

rm

ach

ine

learn

ing

•PhD

position

s

•1-year

and

2-yearpostd

octoral

position

s

•M

achin

elearn

ing

(theory

and

algorithm

s),com

puter

vision,

audio

processin

g,sign

alpro

cessing

•Loca

ted

indow

nto

wn

Paris

(Ecole

Norm

aleSuperieu

re-

INRIA

)

•http://www.di.ens.fr/~fbach/sierra/

Pages 145-156: References

J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. A new approach to collaborative filtering: Operator estimation with spectral regularization. Journal of Machine Learning Research, 10:803–826, 2009.
Y. Amit, M. Fink, N. Srebro, and S. Ullman. Uncovering shared structures in multiclass classification. In Proceedings of the 24th International Conference on Machine Learning (ICML), 2007.
C. Archambeau and F. Bach. Sparse probabilistic projections. In Advances in Neural Information Processing Systems 21 (NIPS), 2008.
A. Argyriou, C. A. Micchelli, and M. Pontil. On spectral learning. Journal of Machine Learning Research, 2009. To appear.
F. Bach. High-dimensional non-linear variable selection through hierarchical kernel learning. Technical Report 0909.0844, arXiv, 2009a.
F. Bach. Bolasso: model consistent Lasso estimation through the bootstrap. In Proceedings of the Twenty-fifth International Conference on Machine Learning (ICML), 2008a.
F. Bach. Consistency of the group Lasso and multiple kernel learning. Journal of Machine Learning Research, 9:1179–1225, 2008b.
F. Bach. Exploring large feature spaces with hierarchical multiple kernel learning. In Advances in Neural Information Processing Systems, 2008c.
F. Bach. Self-concordant analysis for logistic regression. Technical Report 0910.4627, arXiv, 2009b.
F. Bach. Consistency of trace norm minimization. Journal of Machine Learning Research, 9:1019–1048, 2008d.
F. Bach, G. R. G. Lanckriet, and M. I. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In Proceedings of the International Conference on Machine Learning (ICML), 2004a.
F. Bach, R. Thibaux, and M. I. Jordan. Computing regularization paths for learning multiple kernels. In Advances in Neural Information Processing Systems 17, 2004b.
F. Bach, J. Mairal, and J. Ponce. Convex sparse matrix factorizations. Technical Report 0812.1869, arXiv, 2008.
O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. The Journal of Machine Learning Research, 9:485–516, 2008.
R. Baraniuk. Compressive sensing. IEEE Signal Processing Magazine, 24(4):118–121, 2007.
R. G. Baraniuk, V. Cevher, M. F. Duarte, and C. Hegde. Model-based compressive sensing. Technical report, arXiv:0808.3572, 2008.
A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
D. Bertsekas. Nonlinear Programming. Athena Scientific, 1995.
P. Bickel, Y. Ritov, and A. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics, 37(4):1705–1732, 2009.
D. Blei, T. L. Griffiths, M. I. Jordan, and J. B. Tenenbaum. Hierarchical topic models and the nested Chinese restaurant process. Advances in Neural Information Processing Systems, 16:106, 2004.
D. M. Blei and J. McAuliffe. Supervised topic models. In Advances in Neural Information Processing Systems (NIPS), volume 20, 2008.
D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022, 2003.
J. F. Bonnans, J. C. Gilbert, C. Lemarechal, and C. A. Sagastizabal. Numerical Optimization: Theoretical and Practical Aspects. Springer, 2003.
J. M. Borwein and A. S. Lewis. Convex Analysis and Nonlinear Optimization. Number 3 in CMS Books in Mathematics. Springer-Verlag, 2000.
L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems (NIPS), volume 20, 2008.
S. P. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
D. Bradley and J. D. Bagnell. Convex coding. In Proceedings of the Twenty-Fifth Annual Conference on Uncertainty in Artificial Intelligence (UAI-09), 2009.
F. Bunea, A. B. Tsybakov, and M. H. Wegkamp. Aggregation for Gaussian regression. Annals of Statistics, 35(4):1674–1697, 2007.
W. Buntine and S. Perttu. Is multinomial PCA multi-faceted clustering or dimensionality reduction? In International Workshop on Artificial Intelligence and Statistics (AISTATS), 2003.
E. Candes and T. Tao. The Dantzig selector: statistical estimation when p is much larger than n. Annals of Statistics, 35(6):2313–2351, 2007.
E. Candes and M. Wakin. An introduction to compressive sampling. IEEE Signal Processing Magazine, 25(2):21–30, 2008.
E. J. Candes and Y. Plan. Matrix completion with noise. 2009a. Submitted.
E. J. Candes and Y. Plan. Near-ideal model selection by ℓ1 minimization. The Annals of Statistics, 37(5A):2145–2177, 2009b.
F. Caron and A. Doucet. Sparse Bayesian nonparametric regression. In 25th International Conference on Machine Learning (ICML), 2008.
S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Review, 43(1):129–159, 2001.
A. d'Aspremont and L. El Ghaoui. Testing the nullspace property using semidefinite programming. Technical Report 0807.3520v5, arXiv, 2008.
A. d'Aspremont, L. El Ghaoui, M. I. Jordan, and G. R. G. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49(3):434–448, 2007.
A. d'Aspremont, F. Bach, and L. El Ghaoui. Optimal solutions for sparse principal component analysis. Journal of Machine Learning Research, 9:1269–1294, 2008.
D. L. Donoho and J. Tanner. Neighborliness of randomly projected simplices in high dimensions. Proceedings of the National Academy of Sciences of the United States of America, 102(27):9452, 2005.
B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32(2):407–451, 2004.
M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15(12):3736–3745, 2006.
J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1361, 2001.
M. Fazel, H. Hindi, and S. P. Boyd. A rank minimization heuristic with application to minimum order system approximation. In Proceedings of the American Control Conference, volume 6, pages 4734–4739, 2001.
C. Fevotte, N. Bertin, and J.-L. Durrieu. Nonnegative matrix factorization with the Itakura-Saito divergence. With application to music analysis. Neural Computation, 21(3), 2009.
J. Friedman, T. Hastie, H. Höfling, and R. Tibshirani. Pathwise coordinate optimization. Annals of Applied Statistics, 1(2):302–332, 2007.
W. Fu. Penalized regressions: the bridge vs. the Lasso. Journal of Computational and Graphical Statistics, 7(3):397–416, 1998.
T. Griffiths and Z. Ghahramani. Infinite latent feature models and the Indian buffet process. Advances in Neural Information Processing Systems (NIPS), 18, 2006.
D. R. Hardoon and J. Shawe-Taylor. Sparse canonical correlation analysis. In Sparsity and Inverse Problems in Statistical Theory and Econometrics, 2008.
T. J. Hastie and R. J. Tibshirani. Generalized Additive Models. Chapman & Hall, 1990.
J. Huang and T. Zhang. The benefit of group sparsity. Technical Report 0901.2962v2, arXiv, 2009.
J. Huang, S. Ma, and C. H. Zhang. Adaptive Lasso for sparse high-dimensional regression models. Statistica Sinica, 18:1603–1618, 2008.
J. Huang, T. Zhang, and D. Metaxas. Learning with structured sparsity. In Proceedings of the 26th International Conference on Machine Learning (ICML), 2009.
H. Ishwaran and J. S. Rao. Spike and slab variable selection: frequentist and Bayesian strategies. The Annals of Statistics, 33(2):730–773, 2005.
L. Jacob, G. Obozinski, and J.-P. Vert. Group Lasso with overlaps and graph Lasso. In Proceedings of the 26th International Conference on Machine Learning (ICML), 2009.
R. Jenatton, J.-Y. Audibert, and F. Bach. Structured variable selection with sparsity-inducing norms. Technical report, arXiv:0904.3523, 2009a.
R. Jenatton, G. Obozinski, and F. Bach. Structured sparse principal component analysis. Technical report, arXiv:0909.1440, 2009b.
A. Juditsky and A. Nemirovski. On verifiable sufficient conditions for sparse signal recovery via ℓ1-minimization. Technical Report 0809.2650v1, arXiv, 2008.
G. S. Kimeldorf and G. Wahba. Some results on Tchebycheffian spline functions. J. Math. Anal. Applicat., 33:82–95, 1971.
S. Lacoste-Julien, F. Sha, and M. I. Jordan. DiscLDA: Discriminative learning for dimensionality reduction and classification. Advances in Neural Information Processing Systems (NIPS) 21, 2008.
G. R. G. Lanckriet, T. De Bie, N. Cristianini, M. I. Jordan, and W. S. Noble. A statistical framework for genomic data fusion. Bioinformatics, 20:2626–2635, 2004a.
G. R. G. Lanckriet, N. Cristianini, L. El Ghaoui, P. Bartlett, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27–72, 2004b.
H. Lee, A. Battle, R. Raina, and A. Ng. Efficient sparse coding algorithms. In Advances in Neural Information Processing Systems (NIPS), 2007.
Y. Lin and H. H. Zhang. Component selection and smoothing in multivariate nonparametric regression. Annals of Statistics, 34(5):2272–2297, 2006.
J. Liu, S. Ji, and J. Ye. Multi-task feature learning via efficient ℓ2,1-norm minimization. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI), 2009.
K. Lounici. Sup-norm convergence rate and sign concentration property of Lasso and Dantzig estimators. Electronic Journal of Statistics, 2:90–102, 2008.
K. Lounici, A. B. Tsybakov, M. Pontil, and S. A. van de Geer. Taking advantage of sparsity in multi-task learning. In Conference on Computational Learning Theory (COLT), 2009.
J. Lv and Y. Fan. A unified approach to model selection and sparse recovery using regularized least squares. Annals of Statistics, 37(6A):3498–3528, 2009.
L. Mackey. Deflation methods for sparse PCA. Advances in Neural Information Processing Systems (NIPS), 21, 2009.
N. Maculan and G. J. R. Galdino de Paula. A linear-time median-finding algorithm for projecting a vector on the simplex of R^n. Operations Research Letters, 8(4):219–222, 1989.
J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Discriminative learned dictionaries for local image analysis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.
J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online dictionary learning for sparse coding. In International Conference on Machine Learning (ICML), 2009a.
J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Supervised dictionary learning. Advances in Neural Information Processing Systems (NIPS), 21, 2009b.
J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Non-local sparse models for image restoration. In International Conference on Computer Vision (ICCV), 2009c.
H. M. Markowitz. The optimization of a quadratic function subject to linear constraints. Naval Research Logistics Quarterly, 3:111–133, 1956.
P. Massart. Concentration Inequalities and Model Selection: Ecole d'Ete de Probabilites de Saint-Flour 23. Springer, 2003.
R. Mazumder, T. Hastie, and R. Tibshirani. Spectral regularization algorithms for learning large incomplete matrices. 2009. Submitted.
N. Meinshausen. Relaxed Lasso. Computational Statistics and Data Analysis, 52(1):374–393, 2008.
N. Meinshausen and P. Buhlmann. High-dimensional graphs and variable selection with the lasso. Annals of Statistics, 34(3):1436, 2006.
N. Meinshausen and P. Buhlmann. Stability selection. Technical report, arXiv:0809.2932, 2008.
N. Meinshausen and B. Yu. Lasso-type recovery of sparse representations for high-dimensional data. Annals of Statistics, 37(1):246–270, 2008.
C. A. Micchelli and M. Pontil. Learning the kernel function via regularization. Journal of Machine Learning Research, 6(2):1099, 2006.
B. Moghaddam, Y. Weiss, and S. Avidan. Generalized spectral bounds for sparse LDA. In Proceedings of the 23rd International Conference on Machine Learning (ICML), 2006a.
B. Moghaddam, Y. Weiss, and S. Avidan. Spectral bounds for sparse PCA: Exact and greedy algorithms. In Advances in Neural Information Processing Systems, volume 18, 2006b.
R. M. Neal. Bayesian Learning for Neural Networks. Springer Verlag, 1996.
Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, 2003.
Y. Nesterov. Gradient methods for minimizing composite objective function. Center for Operations Research and Econometrics (CORE), Catholic University of Louvain, Tech. Rep. 76, 2007.
G. Obozinski, M. J. Wainwright, and M. I. Jordan. High-dimensional union support recovery in multivariate regression. In Advances in Neural Information Processing Systems (NIPS), 2008.
G. Obozinski, B. Taskar, and M. I. Jordan. Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, pages 1–22, 2009.
B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37:3311–3325, 1997.
M. R. Osborne, B. Presnell, and B. A. Turlach. On the lasso and its dual. Journal of Computational and Graphical Statistics, 9(2):319–337, 2000.
M. Pontil, A. Argyriou, and T. Evgeniou. Multi-task feature learning. In Advances in Neural Information Processing Systems, 2007.
R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng. Self-taught learning: Transfer learning from unlabeled data. In Proceedings of the 24th International Conference on Machine Learning (ICML), 2007.
A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet. SimpleMKL. Journal of Machine Learning Research, 9:2491–2521, 2008.
S. W. Raudenbush and A. S. Bryk. Hierarchical Linear Models: Applications and Data Analysis Methods. Sage Publications, 2002.
P. Ravikumar, H. Liu, J. Lafferty, and L. Wasserman. SpAM: Sparse additive models. In Advances in Neural Information Processing Systems (NIPS), 2008.
B. Recht, W. Xu, and B. Hassibi. Null space conditions and thresholds for rank minimization. 2009. Submitted.
J. D. M. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In Proceedings of the 22nd International Conference on Machine Learning (ICML), 2005.
V. Roth and B. Fischer. The group-Lasso for generalized linear models: uniqueness of solutions and efficient algorithms. In Proceedings of the 25th International Conference on Machine Learning (ICML), 2008.
R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems, volume 20, 2008.
B. Scholkopf and A. J. Smola. Learning with Kernels. MIT Press, 2001.
M. W. Seeger. Bayesian inference and optimal design for the sparse linear model. The Journal of Machine Learning Research, 9:759–813, 2008.
J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
S. Sonnenburg, G. Raetsch, C. Schaefer, and B. Schoelkopf. Large scale multiple kernel learning. Journal of Machine Learning Research, 7:1531–1565, 2006.
N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems 17, 2005.
B. K. Sriperumbudur, D. A. Torres, and G. R. G. Lanckriet. A d.c. programming approach to the sparse generalized eigenvalue problem. Technical Report 0901.1504v2, arXiv, 2009.
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B, 58(1):267–288, 1996.
S. A. van de Geer. High-dimensional generalized linear models and the Lasso. Annals of Statistics, 36(2):614, 2008.
E. van den Berg, M. Schmidt, M. P. Friedlander, and K. Murphy. Group sparsity via linear-time projection. Technical Report TR-2008-09, Department of Computer Science, University of British Columbia, 2009.
G. Wahba. Spline Models for Observational Data. SIAM, 1990.
M. J. Wainwright. Sharp thresholds for noisy and high-dimensional recovery of sparsity using ℓ1-constrained quadratic programming. IEEE Transactions on Information Theory, 55(5):2183, 2009.
L. Wasserman and K. Roeder. High dimensional variable selection. Annals of Statistics, 37(5A):2178, 2009.
D. M. Witten, R. Tibshirani, and T. Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515–534, 2009.
T. T. Wu and K. Lange. Coordinate descent algorithms for lasso penalized regression. Annals of Applied Statistics, 2(1):224–244, 2008.
J. Yang, K. Yu, Y. Gong, and T. Huang. Linear spatial pyramid matching using sparse coding for image classification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society Series B, 68(1):49–67, 2006.
M. Yuan and Y. Lin. On the non-negative garrotte estimator. Journal of the Royal Statistical Society Series B, 69(2):143–161, 2007.
T. Zhang. Adaptive forward-backward greedy algorithm for sparse learning with linear models. Advances in Neural Information Processing Systems, 22, 2008a.
T. Zhang. Multi-stage convex relaxation for learning with sparse regularization. Advances in Neural Information Processing Systems, 22, 2008b.
T. Zhang. On the consistency of feature selection using greedy least squares regression. The Journal of Machine Learning Research, 10:555–568, 2009.
P. Zhao and B. Yu. On model selection consistency of Lasso. Journal of Machine Learning Research, 7:2541–2563, 2006.
P. Zhao, G. Rocha, and B. Yu. Grouped and hierarchical model selection through composite absolute penalties. Annals of Statistics, 37(6A):3468–3497, 2009.
H. Zou. The adaptive Lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.
H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society Series B (Statistical Methodology), 67(2):301–320, 2005.
H. Zou and R. Li. One-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics, 36(4):1509–1533, 2008.