Kernels, Margins, and Low-dimensional Mappings
Transcript of Kernels, Margins, and Low-dimensional Mappings
[NIPS 2007 Workshop on Topology Learning]
Maria-Florina Balcan, Avrim Blum, Santosh Vempala
Generic problem
Given a set of images, want to learn a linear separator to distinguish men from women.
Problem: the pixel representation is no good.
Old-style advice: Pick a better set of features! But this seems ad hoc, not scientific.
New-style advice: Use a kernel! K(x,y) = φ(x)·φ(y).
φ is an implicit, high-dimensional mapping. Feels more scientific. Many algorithms can be "kernelized". Use the "magic" of the implicit high-dim'l space. Don't pay for it if there exists a large-margin separator.
Generic problem (cont.)
E.g., K(x,y) = (x·y + 1)^m. φ: (n-dim'l space) → (n^m-dim'l space).
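To make the polynomial-kernel example concrete, here is a small numerical check (for the special case n = 2, m = 2; the explicit feature map below is a standard construction, not spelled out on the slide) that K(x,y) = (x·y + 1)^2 equals a dot product in an explicit 6-dimensional feature space:

```python
import numpy as np

def poly_kernel(x, y, m=2):
    """Polynomial kernel K(x,y) = (x . y + 1)^m."""
    return (np.dot(x, y) + 1) ** m

def phi(x):
    """Explicit feature map for the degree-2 kernel on R^2:
    (x.y + 1)^2 = 1 + 2 x1 y1 + 2 x2 y2 + x1^2 y1^2 + x2^2 y2^2 + 2 x1 x2 y1 y2,
    so the monomial features (with sqrt(2) scalings) realize K as a dot product."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

x = np.array([0.3, -1.2])
y = np.array([2.0, 0.5])
assert np.isclose(poly_kernel(x, y), np.dot(phi(x), phi(y)))
```

For general n and m the same idea gives roughly n^m monomial features, which is exactly the blow-up the kernel trick avoids computing explicitly.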
Claim: Can view the new method as a way of conducting the old method.
Given a kernel [as a black-box program K(x,y)] and access to typical inputs [samples from D], can run K and reverse-engineer an explicit (small) set of features, such that if K is good [∃ large-margin separator in φ-space for D,c], then this is a good feature set [∃ almost-as-good separator].
"You give me a kernel, I give you a set of features."
Do this using the idea of random projection…
E.g., sample z1,...,zd from D. Given x, define xi = K(x,zi).
Implications:
Practical: an alternative to kernelizing the algorithm.
Conceptual: View a kernel as a (principled) way of doing feature generation. View K as a similarity function, rather than as the "magic power" of an implicit high-dimensional space.
Basic setup, definitions
Instance space X. Distribution D, target c. Use P = (D,c). K(x,y) = φ(x)·φ(y).
P is separable with margin γ in φ-space if ∃ w (|w|=1) s.t. Pr_(x,ℓ)∈P[ℓ (w·φ(x)/|φ(x)|) < γ] = 0.
[Figure: positively and negatively labeled regions of X under P=(D,c), separated by w.]
Error ε at margin γ: replace "0" with "ε".
Goal is to use K to get a mapping to a low-dim'l space.
One idea: Johnson-Lindenstrauss lemma
If P is separable with margin γ in φ-space, then with prob 1-δ, a random linear projection down to a space of dimension d = O((1/γ²) log[1/(εδ)]) will have a linear separator of error < ε. [Arriaga, Vempala]
[Figure: + and - regions in φ-space randomly projected down to R^d, separation preserved.]
If the projection vectors are r1,r2,...,rd, then can view as features xi = φ(x)·ri.
Problem: this uses φ. Can we do it directly, using K as a black-box, without computing φ?
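A minimal sketch of the random-projection step, assuming Gaussian projection vectors r1,...,rd (the slide does not fix a particular distribution; Gaussians are one standard choice) and checking that dot products, and hence margins, are roughly preserved:

```python
import numpy as np

rng = np.random.default_rng(0)

def jl_project(X, d):
    """Random linear projection of the rows of X (points in R^N) down to R^d.
    Entries of the projection matrix are i.i.d. Gaussian, scaled by 1/sqrt(d)
    so that dot products are approximately preserved in expectation."""
    N = X.shape[1]
    R = rng.normal(size=(N, d)) / np.sqrt(d)  # columns play the role of r_1,...,r_d
    return X @ R

# Toy check: squared norms (diagonal of the Gram matrix) survive the projection.
X = rng.normal(size=(20, 1000))
Xp = jl_project(X, d=400)
orig = X @ X.T
proj = Xp @ Xp.T
```

Note this sketch works on explicit vectors, i.e. it needs φ(x); the point of the next slides is to get a comparable mapping using only black-box calls to K.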
3 methods (from simplest to best)
1. Draw d examples z1,...,zd from D. Use:
F(x) = (K(x,z1), ..., K(x,zd)). [So, "xi" = K(x,zi)]
For d = (8/ε)[1/γ² + ln 1/δ], if P was separable with margin γ in φ-space, then whp this will be separable with error ε. (But this method doesn't preserve the margin.)
2. Same d, but a little more complicated. Separable with error ε at margin γ/2.
3. Combine (2) with a further projection as in the JL lemma. Get d with log dependence on 1/ε, rather than linear. So, can set ε ≪ 1/d.
All these methods need access to D, unlike JL. Can this be removed? We show NO for generic K, but it may be possible for natural K.
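Mapping #1 is simple enough to state directly in code. A sketch, with an RBF kernel standing in for the black-box K and Gaussian samples standing in for draws from D (both are illustrative choices, not from the talk):

```python
import numpy as np

def mapping1(K, Z):
    """Mapping #1: F(x) = (K(x, z_1), ..., K(x, z_d)) for unlabeled
    samples Z = [z_1, ..., z_d] drawn from D."""
    def F(x):
        return np.array([K(x, z) for z in Z])
    return F

# Illustrative black-box kernel (any kernel program would do here).
def rbf(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(1)
Z = rng.normal(size=(50, 5))   # d = 50 samples z_i from D
F = mapping1(rbf, Z)
x = rng.normal(size=5)
print(F(x).shape)              # (50,)
```

Note that F only ever calls K; it never touches φ.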
Key fact
Claim: If ∃ a perfect w of margin γ in φ-space, then if we draw z1,...,zd ∈ D for d ≥ (8/ε)[1/γ² + ln 1/δ], whp (1-δ) there exists w' in span(φ(z1),...,φ(zd)) of error ≤ ε at margin γ/2.
Proof: Let S = examples drawn so far. Assume |w|=1, |φ(z)|=1 ∀z.
w_in = proj(w, span(S)), w_out = w − w_in.
Say w_out is large if Pr_z(|w_out·φ(z)| ≥ γ/2) ≥ ε; else small.
If small, then done: w' = w_in. Else, the next z has prob ≥ ε of improving S:
|w_out|² ← |w_out|² − (γ/2)²
This can happen at most 4/γ² times. □
So...
If we draw z1,...,zd ∈ D for d = (8/ε)[1/γ² + ln 1/δ], then whp there exists w' in span(φ(z1),...,φ(zd)) of error ≤ ε at margin γ/2.
So, for some w' = α1 φ(z1) + ... + αd φ(zd),
Pr_(x,ℓ)∈P[sign(w'·φ(x)) ≠ ℓ] ≤ ε.
But notice that w'·φ(x) = α1 K(x,z1) + ... + αd K(x,zd).
⇒ the vector (α1,...,αd) is an ε-good separator in the feature space xi = K(x,zi).
But the margin is not preserved, because the lengths of the target and of the examples are not preserved.
How to preserve the margin? (mapping #2)
We know ∃ w' in span(φ(z1),...,φ(zd)) of error ≤ ε at margin γ/2.
So, given a new x, just want to do an orthogonal projection of φ(x) into that span. (This preserves the dot-product and decreases |φ(x)|, so it only increases the margin.)
Run K(zi,zj) for all i,j = 1,...,d. Get matrix M. Decompose M = U^T U.
(Mapping #2) = (mapping #1) U^{-1}. □
Mapping #2, details
Draw a set S = {z1, ..., zd} of d = (8/ε)[1/γ² + ln 1/δ] unlabeled examples from D.
Run K(x,y) for all x,y ∈ S; get M(S) = (K(zi,zj))_{zi,zj ∈ S}.
Place S into d-dim. space based on K (or M(S)).
[Figure: points z1, z2, z3 in X mapped by F2 into R^d, with |F2(zi)|² = K(zi,zi) and F2(zi)·F2(zj) = K(zi,zj).]
Mapping #2, details, cont.
What to do with new points?
Extend the embedding F1 to all of X: consider F2: X → R^d defined as follows: for x ∈ X, let F2(x) ∈ R^d be the point of smallest length such that F2(x)·F2(zi) = K(x,zi) for all i ∈ {1, ..., d}.
This mapping is equivalent to orthogonally projecting φ(x) down to span(φ(z1),…,φ(zd)).
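A sketch of mapping #2 via the M = U^T U decomposition from the previous slide (assumptions: an RBF kernel again stands in for the black-box K, and M is positive definite so a Cholesky factorization M = L L^T, i.e. U = L^T, applies):

```python
import numpy as np

def rbf(x, y, sigma=1.0):
    """Illustrative black-box kernel."""
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(2)
Z = rng.normal(size=(10, 3))                           # unlabeled z_1,...,z_d from D
M = np.array([[rbf(zi, zj) for zj in Z] for zi in Z])  # M(S) = (K(z_i, z_j))
L = np.linalg.cholesky(M)                              # M = L L^T, so U = L^T

def F1(x):
    """Mapping #1: F1(x) = (K(x, z_1), ..., K(x, z_d))."""
    return np.array([rbf(x, z) for z in Z])

def F2(x):
    """Mapping #2: F2(x) = F1(x) U^{-1}; solve L y = F1(x) rather than inverting."""
    return np.linalg.solve(L, F1(x))

# Defining property: F2(x) . F2(zi) = K(x, zi) for all i, for any new x.
x = rng.normal(size=3)
assert np.allclose(F2(x) @ F2(Z[0]), rbf(x, Z[0]))
```

Since F2(zi)·F2(zj) = K(zi,zj), the sample S itself is embedded exactly, and any new x is placed by the same linear map, which is precisely the orthogonal projection of φ(x) onto span(φ(z1),…,φ(zd)).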
How to improve the dimension?
The current mapping (F2) gives d = (8/ε)[1/γ² + ln 1/δ].
Johnson-Lindenstrauss gives d1 = O((1/γ²) log[1/(εδ)]). Nice because we can have d1 ≪ 1/ε.
Answer: just combine the two. Run mapping #2, then do a random projection down from that. This gives the desired dimension (# features), though the sample-complexity remains as in mapping #2.
[Figure: labeled + and − points in X mapped by F2 into R^d, then randomly projected (JL) down to R^{d1}, with separation preserved at each step.]
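The combination is just composition: mapping #2 followed by a JL projection. A sketch under the same assumptions as before (RBF stands in for K; a small jitter term is added so the Cholesky factorization is numerically safe; d and d1 are arbitrary illustrative values):

```python
import numpy as np

def rbf(x, y, sigma=1.0):
    """Illustrative black-box kernel."""
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(3)
Z = rng.normal(size=(200, 4))                          # d samples from D (mapping #2)
M = np.array([[rbf(zi, zj) for zj in Z] for zi in Z])
L = np.linalg.cholesky(M + 1e-10 * np.eye(len(Z)))     # jitter for numerical safety

def F2(x):
    """Mapping #2: orthogonal projection of phi(x) onto span(phi(z_i))."""
    return np.linalg.solve(L, np.array([rbf(x, z) for z in Z]))

d1 = 50                                                # target dimension, d1 << d
R = rng.normal(size=(len(Z), d1)) / np.sqrt(d1)        # JL projection matrix

def F3(x):
    """Method #3: mapping #2 followed by a random projection down to R^{d1}."""
    return F2(x) @ R

x = rng.normal(size=4)
print(F3(x).shape)  # (50,)
```

The sample-complexity (number of z_i's, hence kernel evaluations per new point) is still that of mapping #2; only the output dimension shrinks.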
Open problems
For specific natural kernels, like K(x,y) = (1 + x·y)^m, is there an efficient analog to JL, without needing access to D? Or can one at least reduce the sample-complexity (use fewer accesses to D)?
Can one extend the results (e.g., mapping #1: x → [K(x,z1), ..., K(x,zd)]) to more general similarity functions K? It is not exactly clear what the theorem statement would look like.