Kernels, Margins, and Low-dimensional Mappings
Transcript of Kernels, Margins, and Low-dimensional Mappings
[NIPS 2007 Workshop on Topology Learning]
Maria-Florina Balcan, Avrim Blum, Santosh Vempala
Generic problem
Given a set of images, want to learn a linear separator to distinguish men from women.
Problem: the pixel representation is no good.
Old-style advice: Pick a better set of features! But this seems ad hoc, not scientific.
New-style advice: Use a kernel! K(x,y) = φ(x)·φ(y).
φ is an implicit, high-dimensional mapping. Feels more scientific. Many algorithms can be "kernelized". Use the "magic" of the implicit high-dim'l space. Don't pay for it if there exists a large-margin separator.
Generic problem (cont.)
E.g., K(x,y) = (x·y + 1)^m. φ: (n-dim'l space) → (n^m-dim'l space).
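To make the polynomial-kernel example concrete, here is a small numerical check (for the special case n = 2, m = 2; the explicit feature map below is a standard construction, not spelled out on the slide) that K(x,y) = (x·y + 1)^2 equals a dot product in an explicit 6-dimensional feature space:

```python
import numpy as np

def poly_kernel(x, y, m=2):
    """Polynomial kernel K(x,y) = (x . y + 1)^m."""
    return (np.dot(x, y) + 1) ** m

def phi(x):
    """Explicit feature map for the degree-2 kernel on R^2:
    (x.y + 1)^2 = 1 + 2 x1 y1 + 2 x2 y2 + x1^2 y1^2 + x2^2 y2^2 + 2 x1 x2 y1 y2,
    so the monomial features (with sqrt(2) scalings) realize K as a dot product."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

x = np.array([0.3, -1.2])
y = np.array([2.0, 0.5])
assert np.isclose(poly_kernel(x, y), np.dot(phi(x), phi(y)))
```

For general n and m the same idea gives roughly n^m monomial features, which is exactly the blow-up the kernel trick avoids computing explicitly.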
Claim: Can view the new method as a way of conducting the old method.
Given a kernel [as a black-box program K(x,y)] and access to typical inputs [samples from D], can run K and reverse-engineer an explicit (small) set of features, such that if K is good [∃ large-margin separator in φ-space for D,c], then this is a good feature set [∃ almost-as-good separator].
"You give me a kernel, I give you a set of features."
Do this using the idea of random projection…
E.g., sample z1,...,zd from D. Given x, define xi = K(x,zi).
Implications:
Practical: an alternative to kernelizing the algorithm.
Conceptual: View a kernel as a (principled) way of doing feature generation. View K as a similarity function, rather than as the "magic power" of an implicit high-dimensional space.
Basic setup, definitions
Instance space X. Distribution D, target c. Use P = (D,c). K(x,y) = φ(x)·φ(y).
P is separable with margin γ in φ-space if ∃ w (|w|=1) s.t. Pr_(x,ℓ)∈P[ℓ (w·φ(x)/|φ(x)|) < γ] = 0.
[Figure: positively and negatively labeled regions of X under P=(D,c), separated by w.]
Error ε at margin γ: replace "0" with "ε".
Goal is to use K to get a mapping to a low-dim'l space.
One idea: Johnson-Lindenstrauss lemma
If P is separable with margin γ in φ-space, then with prob 1-δ, a random linear projection down to a space of dimension d = O((1/γ²) log[1/(εδ)]) will have a linear separator of error < ε. [Arriaga, Vempala]
[Figure: + and - regions in φ-space randomly projected down to R^d, separation preserved.]
If the projection vectors are r1,r2,...,rd, then can view as features xi = φ(x)·ri.
Problem: this uses φ. Can we do it directly, using K as a black-box, without computing φ?
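A minimal sketch of the random-projection step, assuming Gaussian projection vectors r1,...,rd (the slide does not fix a particular distribution; Gaussians are one standard choice) and checking that dot products, and hence margins, are roughly preserved:

```python
import numpy as np

rng = np.random.default_rng(0)

def jl_project(X, d):
    """Random linear projection of the rows of X (points in R^N) down to R^d.
    Entries of the projection matrix are i.i.d. Gaussian, scaled by 1/sqrt(d)
    so that dot products are approximately preserved in expectation."""
    N = X.shape[1]
    R = rng.normal(size=(N, d)) / np.sqrt(d)  # columns play the role of r_1,...,r_d
    return X @ R

# Toy check: squared norms (diagonal of the Gram matrix) survive the projection.
X = rng.normal(size=(20, 1000))
Xp = jl_project(X, d=400)
orig = X @ X.T
proj = Xp @ Xp.T
```

Note this sketch works on explicit vectors, i.e. it needs φ(x); the point of the next slides is to get a comparable mapping using only black-box calls to K.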
3 methods (from simplest to best)
1. Draw d examples z1,...,zd from D. Use:
F(x) = (K(x,z1), ..., K(x,zd)). [So, "xi" = K(x,zi)]
For d = (8/ε)[1/γ² + ln 1/δ], if P was separable with margin γ in φ-space, then whp this will be separable with error ε. (But this method doesn't preserve the margin.)
2. Same d, but a little more complicated. Separable with error ε at margin γ/2.
3. Combine (2) with a further projection as in the JL lemma. Get d with log dependence on 1/ε, rather than linear. So, can set ε ≪ 1/d.
All these methods need access to D, unlike JL. Can this be removed? We show NO for generic K, but it may be possible for natural K.
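Mapping #1 is simple enough to state directly in code. A sketch, with an RBF kernel standing in for the black-box K and Gaussian samples standing in for draws from D (both are illustrative choices, not from the talk):

```python
import numpy as np

def mapping1(K, Z):
    """Mapping #1: F(x) = (K(x, z_1), ..., K(x, z_d)) for unlabeled
    samples Z = [z_1, ..., z_d] drawn from D."""
    def F(x):
        return np.array([K(x, z) for z in Z])
    return F

# Illustrative black-box kernel (any kernel program would do here).
def rbf(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(1)
Z = rng.normal(size=(50, 5))   # d = 50 samples z_i from D
F = mapping1(rbf, Z)
x = rng.normal(size=5)
print(F(x).shape)              # (50,)
```

Note that F only ever calls K; it never touches φ.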
Key fact
Claim: If ∃ a perfect w of margin γ in φ-space, then if we draw z1,...,zd ∈ D for d ≥ (8/ε)[1/γ² + ln 1/δ], whp (1-δ) there exists w' in span(φ(z1),...,φ(zd)) of error ≤ ε at margin γ/2.
Proof: Let S = examples drawn so far. Assume |w|=1, |φ(z)|=1 ∀z.
w_in = proj(w, span(S)), w_out = w − w_in.
Say w_out is large if Pr_z(|w_out·φ(z)| ≥ γ/2) ≥ ε; else small.
If small, then done: w' = w_in. Else, the next z has prob ≥ ε of improving S:
|w_out|² ← |w_out|² − (γ/2)²
This can happen at most 4/γ² times. □
So...
If we draw z1,...,zd ∈ D for d = (8/ε)[1/γ² + ln 1/δ], then whp there exists w' in span(φ(z1),...,φ(zd)) of error ≤ ε at margin γ/2.
So, for some w' = α1 φ(z1) + ... + αd φ(zd),
Pr_(x,ℓ)∈P[sign(w'·φ(x)) ≠ ℓ] ≤ ε.
But notice that w'·φ(x) = α1 K(x,z1) + ... + αd K(x,zd).
⇒ the vector (α1,...,αd) is an ε-good separator in the feature space xi = K(x,zi).
But the margin is not preserved, because the lengths of the target and of the examples are not preserved.
How to preserve the margin? (mapping #2)
We know ∃ w' in span(φ(z1),...,φ(zd)) of error ≤ ε at margin γ/2.
So, given a new x, just want to do an orthogonal projection of φ(x) into that span. (This preserves the dot-product and decreases |φ(x)|, so it only increases the margin.)
Run K(zi,zj) for all i,j = 1,...,d. Get matrix M. Decompose M = U^T U.
(Mapping #2) = (mapping #1) U^{-1}. □
Mapping #2, details
Draw a set S = {z1, ..., zd} of d = (8/ε)[1/γ² + ln 1/δ] unlabeled examples from D.
Run K(x,y) for all x,y ∈ S; get M(S) = (K(zi,zj))_{zi,zj ∈ S}.
Place S into d-dim. space based on K (or M(S)).
[Figure: points z1, z2, z3 in X mapped by F2 into R^d, with |F2(zi)|² = K(zi,zi) and F2(zi)·F2(zj) = K(zi,zj).]
Mapping #2, details, cont.
What to do with new points?
Extend the embedding F1 to all of X: consider F2: X → R^d defined as follows: for x ∈ X, let F2(x) ∈ R^d be the point of smallest length such that F2(x)·F2(zi) = K(x,zi) for all i ∈ {1, ..., d}.
This mapping is equivalent to orthogonally projecting φ(x) down to span(φ(z1),…,φ(zd)).
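A sketch of mapping #2 via the M = U^T U decomposition from the previous slide (assumptions: an RBF kernel again stands in for the black-box K, and M is positive definite so a Cholesky factorization M = L L^T, i.e. U = L^T, applies):

```python
import numpy as np

def rbf(x, y, sigma=1.0):
    """Illustrative black-box kernel."""
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(2)
Z = rng.normal(size=(10, 3))                           # unlabeled z_1,...,z_d from D
M = np.array([[rbf(zi, zj) for zj in Z] for zi in Z])  # M(S) = (K(z_i, z_j))
L = np.linalg.cholesky(M)                              # M = L L^T, so U = L^T

def F1(x):
    """Mapping #1: F1(x) = (K(x, z_1), ..., K(x, z_d))."""
    return np.array([rbf(x, z) for z in Z])

def F2(x):
    """Mapping #2: F2(x) = F1(x) U^{-1}; solve L y = F1(x) rather than inverting."""
    return np.linalg.solve(L, F1(x))

# Defining property: F2(x) . F2(zi) = K(x, zi) for all i, for any new x.
x = rng.normal(size=3)
assert np.allclose(F2(x) @ F2(Z[0]), rbf(x, Z[0]))
```

Since F2(zi)·F2(zj) = K(zi,zj), the sample S itself is embedded exactly, and any new x is placed by the same linear map, which is precisely the orthogonal projection of φ(x) onto span(φ(z1),…,φ(zd)).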
How to improve the dimension?
The current mapping (F2) gives d = (8/ε)[1/γ² + ln 1/δ].
Johnson-Lindenstrauss gives d1 = O((1/γ²) log[1/(εδ)]). Nice because we can have d1 ≪ 1/ε.
Answer: just combine the two. Run mapping #2, then do a random projection down from that. This gives the desired dimension (# features), though the sample-complexity remains as in mapping #2.
[Figure: labeled + and − points in X mapped by F2 into R^d, then randomly projected (JL) down to R^{d1}, with separation preserved at each step.]
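The combination is just composition: mapping #2 followed by a JL projection. A sketch under the same assumptions as before (RBF stands in for K; a small jitter term is added so the Cholesky factorization is numerically safe; d and d1 are arbitrary illustrative values):

```python
import numpy as np

def rbf(x, y, sigma=1.0):
    """Illustrative black-box kernel."""
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(3)
Z = rng.normal(size=(200, 4))                          # d samples from D (mapping #2)
M = np.array([[rbf(zi, zj) for zj in Z] for zi in Z])
L = np.linalg.cholesky(M + 1e-10 * np.eye(len(Z)))     # jitter for numerical safety

def F2(x):
    """Mapping #2: orthogonal projection of phi(x) onto span(phi(z_i))."""
    return np.linalg.solve(L, np.array([rbf(x, z) for z in Z]))

d1 = 50                                                # target dimension, d1 << d
R = rng.normal(size=(len(Z), d1)) / np.sqrt(d1)        # JL projection matrix

def F3(x):
    """Method #3: mapping #2 followed by a random projection down to R^{d1}."""
    return F2(x) @ R

x = rng.normal(size=4)
print(F3(x).shape)  # (50,)
```

The sample-complexity (number of z_i's, hence kernel evaluations per new point) is still that of mapping #2; only the output dimension shrinks.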
Open problems
For specific natural kernels, like K(x,y) = (1 + x·y)^m, is there an efficient analog to JL, without needing access to D? Or can one at least reduce the sample-complexity (use fewer accesses to D)?
Can one extend the results (e.g., mapping #1: x → [K(x,z1), ..., K(x,zd)]) to more general similarity functions K? It is not exactly clear what the theorem statement would look like.