Learnability Beyond Uniform Convergence


Transcript of Learnability Beyond Uniform Convergence

Page 1: Learnability Beyond Uniform Convergence

Learnability Beyond Uniform Convergence

Shai Shalev-Shwartz

School of CS and Engineering, The Hebrew University of Jerusalem

"Algorithmic Learning Theory", Lyon 2012

Joint work with: N. Srebro, O. Shamir, K. Sridharan (COLT'09, JMLR'11); A. Daniely, S. Sabato, S. Ben-David (COLT'11); A. Daniely, S. Sabato (NIPS'12)

Shai Shalev-Shwartz (Hebrew U) Learnability Beyond Uniform Convergence Oct’12 1 / 34

Page 2: Learnability Beyond Uniform Convergence

The Fundamental Theorem of Learning Theory

For Binary Classification

Uniform Convergence ⇒ Learnable with ERM ⇒ Learnable ⇒ Finite VC ⇒ Uniform Convergence

The first two implications are trivial; Learnable ⇒ Finite VC is the No-Free-Lunch theorem (NFL, W'96); Finite VC ⇒ Uniform Convergence is VC'71.

VC = Vapnik and Chervonenkis, W = Wolpert


Page 3: Learnability Beyond Uniform Convergence

The Fundamental Theorem of Learning Theory

For Regression

Uniform Convergence ⇒ Learnable with ERM ⇒ Learnable ⇒ Finite fat-shattering ⇒ Uniform Convergence

The first two implications are trivial; Learnable ⇒ Finite fat-shattering is due to KS'94, BLW'96, ABCH'97; Finite fat-shattering ⇒ Uniform Convergence is due to BLW'96, ABCH'97.

BLW = Bartlett, Long, Williamson. ABCH = Alon, Ben-David, Cesa-Bianchi, Haussler. KS = Kearns and Schapire


Page 4: Learnability Beyond Uniform Convergence

For general learning problems?

Uniform Convergence ⇒ Learnable with ERM ⇒ Learnable (both implications trivial)

Does the converse hold? Does learnability imply learnability with ERM, and uniform convergence?

Not true in "convex learning problems"!
Not true even in "multiclass categorization"!

What is learnable? How to learn?


Page 7: Learnability Beyond Uniform Convergence

Outline

1 Definitions

2 Learnability without uniform convergence

3 Characterizing Learnability using Stability

4 Characterizing Multiclass Learnability

5 Analyzing specific, practically relevant, classes

6 Open Questions


Page 8: Learnability Beyond Uniform Convergence

The General Learning Setting (Vapnik)

Hypothesis class H
Examples domain Z with unknown distribution D
Loss function ℓ : H × Z → R

Given: training set S ∼ D^m
Goal: solve

    min_{h ∈ H} L(h)   where   L(h) = E_{z∼D}[ℓ(h, z)]

in the Probably (w.p. ≥ 1 − δ) Approximately Correct (up to ε) sense

Training loss: L_S(h) = (1/m) ∑_{i=1}^m ℓ(h, z_i)


Page 10: Learnability Beyond Uniform Convergence

Examples

Binary classification:

Z = X × {0, 1}; h ∈ H is a predictor h : X → {0, 1}; ℓ(h, (x, y)) = 1[h(x) ≠ y]

Multiclass categorization:

Z = X × Y; h ∈ H is a predictor h : X → Y; ℓ(h, (x, y)) = 1[h(x) ≠ y]

k-means clustering:

Z = R^d; H ⊂ (R^d)^k specifies k cluster centers; ℓ((μ_1, ..., μ_k), z) = min_j ‖μ_j − z‖

Density estimation:

h is a parameter of a density p_h(z); ℓ(h, z) = − log p_h(z)

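The setting and the examples above can be made concrete in a few lines. The sketch below (our own helper names, not from the talk) instantiates the empirical risk and the ERM rule for binary classification with the 0-1 loss over a finite class of threshold predictors:

```python
import random

def empirical_risk(loss, h, S):
    """L_S(h) = (1/m) * sum_i loss(h, z_i)."""
    return sum(loss(h, z) for z in S) / len(S)

def erm(loss, H, S):
    """Return some minimizer of the empirical risk over a finite class H."""
    return min(H, key=lambda h: empirical_risk(loss, h, S))

# Binary classification on X = {0, ..., 9} with the 0-1 loss and
# threshold predictors h_t(x) = 1[x >= t].
def zero_one(h, z):
    x, y = z
    return 1 if h(x) != y else 0

H = [lambda x, t=t: 1 if x >= t else 0 for t in range(11)]

random.seed(0)
S = [(x, 1 if x >= 4 else 0) for x in random.choices(range(10), k=50)]
h_hat = erm(zero_one, H, S)
print(empirical_risk(zero_one, h_hat, S))  # 0.0 -- the sample is realizable by t = 4
```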

Page 11: Learnability Beyond Uniform Convergence

Learnability, ERM, Uniform convergence

Uniform Convergence: for m ≥ m_UC(ε, δ),

    P_{S∼D^m}[∀h ∈ H, |L_S(h) − L(h)| ≤ ε] ≥ 1 − δ

Learnable: ∃A s.t. for m ≥ m_PAC(ε, δ),

    P_{S∼D^m}[L(A(S)) ≤ min_{h∈H} L(h) + ε] ≥ 1 − δ

ERM: an algorithm that returns A(S) ∈ argmin_{h∈H} L_S(h)

Learnable by arbitrary ERM (with rate m_ERM(ε, δ)): like "Learnable", but A must be an ERM.


Page 15: Learnability Beyond Uniform Convergence

For Binary Classification

Uniform Convergence ⇒ Learnable with ERM ⇒ Learnable ⇒ Finite VC (NFL, W'96) ⇒ Uniform Convergence (VC'71); the first two implications are trivial.

Moreover, all three sample complexities coincide:

    m_UC(ε, δ) ≈ m_ERM(ε, δ) ≈ m_PAC(ε, δ) ≈ VC(H) log(1/δ) / ε²


Page 17: Learnability Beyond Uniform Convergence

Counter Example — Stochastic Convex Optimization

Consider the family of problems:

H is a convex set with max_{h∈H} ‖h‖ ≤ 1

For all z, ℓ(h, z) is convex and Lipschitz w.r.t. h

Claim:

The problem is learnable by the regularized rule

    argmin_{h∈H} (λ_m/2) ‖h‖² + (1/m) ∑_{i=1}^m ℓ(h, z_i)

No uniform convergence

Not learnable by ERM


Page 19: Learnability Beyond Uniform Convergence

Counter Example — Stochastic Convex Optimization

Proof (of "not learnable by arbitrary ERM")

1-Mean + missing features:

z = (α, x), α ∈ {0, 1}^d, x ∈ R^d, ‖x‖ ≤ 1

    ℓ(h, (α, x)) = √(∑_i α_i (h_i − x_i)²)

Take P[α_i = 1] = 1/2 and P[x = μ] = 1.

Let h^(i) be s.t.

    h^(i)_j = 1 − μ_j if j = i,   μ_j otherwise

If d is large enough, there exists i with α_i = 0 on every training example; then L_S(h^(i)) = 0, so h^(i) is an ERM.

But L(h^(i)) ≥ 1/√2

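The construction can be simulated directly. The sketch below (an illustrative simplification with μ = 0; all variable names are ours) samples the masks α, finds a coordinate i that no training example observes, and checks that h^(i) has zero empirical risk yet true risk around P[α_i = 1] = 1/2, while h = μ has true risk 0:

```python
import math
import random

rng = random.Random(0)
d, m = 200, 5                 # d >> 2^m, so some coordinate is unseen w.h.p.
mu = [0.0] * d                # P[x = mu] = 1

def loss(h, z):
    """l(h, (alpha, x)) = sqrt(sum_i alpha_i (h_i - x_i)^2)."""
    alpha, x = z
    return math.sqrt(sum(a * (hi - xi) ** 2 for a, hi, xi in zip(alpha, h, x)))

def draw():
    return ([rng.randint(0, 1) for _ in range(d)], mu)

# Resample until some coordinate i is missing from every training example.
while True:
    S = [draw() for _ in range(m)]
    unseen = [j for j in range(d) if all(alpha[j] == 0 for alpha, _ in S)]
    if unseen:
        break
i = unseen[0]

h_bad = list(mu)
h_bad[i] = 1.0                # the ERM h^(i) from the slide

emp_risk = sum(loss(h_bad, z) for z in S) / m
true_risk = sum(loss(h_bad, draw()) for _ in range(20000)) / 20000

print(emp_risk)               # exactly 0.0: coordinate i is never observed
print(round(true_risk, 2))    # about 0.5 = P[alpha_i = 1] (here mu = 0)
```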

Page 21: Learnability Beyond Uniform Convergence

Counter Example — Stochastic Convex Optimization

Proof (of “not even learnable by a unique ERM”)

Perturb the loss a little bit:

    ℓ(h, (α, x)) = √(∑_i α_i (h_i − x_i)²) + ε ∑_i 2^{−i} (h_i − 1)²

Now the loss is strictly convex, so the ERM is unique.

But the unique ERM does not generalize (as before).


Page 22: Learnability Beyond Uniform Convergence

For general learning problems?

Uniform Convergence ⇒ Learnable with ERM ⇒ Learnable (both implications trivial), but the converse fails:

Not true in "convex learning problems"! ✓
Not true even in "multiclass categorization"!


Page 23: Learnability Beyond Uniform Convergence

Counter Example — Multiclass

X – a set, Y = {0, 1, 2, ..., 2^|X| − 1}

Let n : 2^X → Y be defined by binary encoding

H = {h_T : T ⊂ X} where

    h_T(x) = n(T) if x ∈ T,   0 if x ∉ T


Page 24: Learnability Beyond Uniform Convergence

Counter Example — Multiclass

Claim: No uniform convergence: m_UC ≥ |X|/ε.

The target function is h_∅. For any training set S, take T = X \ S. Then L_S(h_T) = 0 but L(h_T) = P[T].


Page 25: Learnability Beyond Uniform Convergence

Counter Example — Multiclass

Claim: H is learnable: m_PAC ≤ 1/ε.

Let T be the target. Define A by:
A(S) = h_T if some (x, n(T)) ∈ S;
A(S) = h_∅ if S = {(x_1, 0), ..., (x_m, 0)}.
In the 1st case, L(A(S)) = 0. In the 2nd case, L(A(S)) = P[T].
With high probability, if P[T] > ε then we will be in the 1st case.

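The learner A in this claim is easy to implement once n(T) is fixed as the binary encoding of T. A small sketch with X = {0, ..., 63} (the helper names are ours, not from the talk):

```python
def n_code(T):
    """Binary encoding n(T) of a subset T of X = {0, ..., 63}."""
    return sum(1 << x for x in T)

def h(T):
    """h_T(x) = n(T) if x in T, else 0."""
    return lambda x: n_code(T) if x in T else 0

def A(S):
    """The learner from the slide: recover T from any non-zero label."""
    for x, y in S:
        if y != 0:
            T = {j for j in range(64) if (y >> j) & 1}  # decode n(T)
            return h(T)
    return h(set())  # all labels are zero: output h_emptyset

# Target T = {1, 3}; one example falls inside T, so A recovers T exactly.
T = {1, 3}
S = [(0, 0), (3, n_code(T)), (5, 0)]
g = A(S)
print(g(1), g(0))  # n(T) = 10 inside T, 0 outside
```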

Page 26: Learnability Beyond Uniform Convergence

Counter Example — Multiclass

Corollary

m_UC / m_PAC ≈ |X|. If |X| → ∞, the problem is learnable but there is no uniform convergence!


Page 28: Learnability Beyond Uniform Convergence

Characterizing Learnability using Stability

Theorem

A sufficient and necessary condition for learnability is the existence of an Asymptotic ERM (AERM) which is stable.

Uniform Convergence ⇒ ERM is stable (RMP'05, MNPR'06); ERM is stable ⇒ ∃ stable AERM (trivial); ∃ stable AERM ⇔ Learnable (this theorem)


Page 29: Learnability Beyond Uniform Convergence

More formally

Definition (Stability)

We say that A is ε_stable(m) replace-one stable if for all D,

    E_{S, z′, i} |ℓ(A(S^(i)); z′) − ℓ(A(S); z′)| ≤ ε_stable(m),

where S^(i) denotes S with z_i replaced by z′.

Definition (AERM)

We say that A is an AERM (Asymptotic Empirical Risk Minimizer) with rate ε_erm(m) if for all D,

    E_{S∼D^m}[L_S(A(S)) − min_{h∈H} L_S(h)] ≤ ε_erm(m)

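The replace-one quantity can be estimated by Monte Carlo for concrete algorithms. The sketch below (ours, not from the talk) does so for the empirical mean, which is the ERM for the squared loss ℓ(h, z) = (h − z)² over h ∈ R; its stability rate shrinks roughly like 1/m:

```python
import random

def A_mean(S):
    """ERM for the squared loss l(h, z) = (h - z)^2: the empirical mean."""
    return sum(S) / len(S)

def sq_loss(h, z):
    return (h - z) ** 2

def replace_one_stability(A, loss, m, trials=2000, seed=0):
    """Monte Carlo estimate of E_{S, z', i} |l(A(S^(i)); z') - l(A(S); z')|."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        S = [rng.random() for _ in range(m)]
        z_prime = rng.random()
        i = rng.randrange(m)
        S_i = S[:i] + [z_prime] + S[i + 1:]   # S^(i): z_i replaced by z'
        total += abs(loss(A(S_i), z_prime) - loss(A(S), z_prime))
    return total / trials

est = {m: replace_one_stability(A_mean, sq_loss, m) for m in (10, 100)}
print(est)   # the estimate shrinks as m grows, roughly like 1/m
```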

Page 31: Learnability Beyond Uniform Convergence

Proof sketch: (stable AERM is sufficient and necessary for learnability)

Sufficient:

For AERM: stability ⇒ generalization

AERM+generalization ⇒ consistency

Necessary:

∃ consistent A ⇒ ∃ consistent and generalizing A′ (using subsampling)

Consistent+generalizing ⇒ AERM

AERM+generalizing ⇒ stable


Page 32: Learnability Beyond Uniform Convergence

Intermediate Summary

Learnability ⇐⇒ ∃ stable AERM

But, how do we find one?

And, is there a combinatorial notion of learnability (like the VC dimension)?


Page 34: Learnability Beyond Uniform Convergence

Why multiclass learning

Practical relevance

A simple twist of binary classification


Page 35: Learnability Beyond Uniform Convergence

The Natarajan Dimension

Natarajan dimension: the maximal size of an N-shattered set, where:

C is N-shattered by H if ∃ f_1, f_2 ∈ H s.t. ∀x ∈ C, f_1(x) ≠ f_2(x), and for every T ⊆ C there exists h ∈ H with

    h(x) = f_1(x) if x ∈ T,   f_2(x) if x ∈ C \ T

When |Y| = 2, the Natarajan dimension equals the VC dimension.

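For tiny finite classes, N-shattering can be checked by brute force. In the sketch below (our own helper, representing hypotheses as dicts over C; not from the talk), the |Y| = 2 instance reduces to ordinary VC-shattering of a two-point set:

```python
from itertools import chain, combinations, product

def n_shatters(H, C):
    """Check whether C is N-shattered by the finite class H.

    H is a list of dicts mapping each point of C to a label.
    """
    subsets = [set(T) for r in range(len(C) + 1) for T in combinations(C, r)]
    for f1, f2 in product(H, repeat=2):
        if any(f1[x] == f2[x] for x in C):
            continue                      # need f1(x) != f2(x) on all of C
        if all(any(all(h[x] == (f1[x] if x in T else f2[x]) for x in C)
                   for h in H)
               for T in subsets):
            return True
    return False

# With |Y| = 2 this is VC-shattering: all four {0,1}-labelings of {0, 1}.
C = (0, 1)
H = [{0: a, 1: b} for a, b in product([0, 1], repeat=2)]
print(n_shatters(H, C))       # True: every labeling of C is realized
print(n_shatters(H[:2], C))   # False: no pair f1, f2 differs everywhere
```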

Page 37: Learnability Beyond Uniform Convergence

Does Natarajan dimension characterize multiclass learnability?

Theorem (Natarajan’89, Ben-David et al 95)

If H is a class of functions with Natarajan dimension d, then

    (d + ln(1/δ)) / ε ≤ m_PAC(ε, δ) ≤ (d ln(|Y|) ln(1/ε) + ln(1/δ)) / ε.

Remarks:

A large gap when Y is large

The uniform convergence rate does depend on Y


Page 39: Learnability Beyond Uniform Convergence

How to design a good ERM algorithm?

Consider again our counter example: Y = {0, ..., 2^|X| − 1} and H = {h_T : T ⊂ X} with

    h_T(x) = n(T) if x ∈ T,   0 if x ∉ T

Bad ERM:

If S = (x_1, 0), ..., (x_m, 0), return h_T with T = X \ {x_1, ..., x_m}

Good ERM:

If S = (x_1, 0), ..., (x_m, 0), return h_∅

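Both rules are empirical risk minimizers on an all-zero sample, yet they behave very differently off the sample. A small sketch (helper names ours, not from the talk):

```python
def n_code(T):
    return sum(1 << x for x in T)          # binary encoding n(T)

def h_T(T):
    return lambda x: n_code(T) if x in T else 0

X = set(range(8))
S = [(0, 0), (2, 0), (5, 0)]               # every training label is 0

# Bad ERM: return the largest h_T that is still consistent with S.
bad = h_T(X - {x for x, _ in S})
# Good ERM: on an all-zero sample, return h_emptyset (small essential range).
good = h_T(set())

emp = lambda g: sum(g(x) != y for x, y in S) / len(S)
print(emp(bad), emp(good))                 # both have zero empirical risk...
print(bad(1), good(1))                     # ...but disagree off the sample
```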

Page 41: Learnability Beyond Uniform Convergence

How to design a good ERM algorithm?

Definition

A has essential range r if ∀h ∈ H there exists Y′(h) with |Y′(h)| ≤ r s.t. for all S labeled by h, the output A(S) takes values in Y′(h)

A good ERM is an ERM that has a small essential range.

A Principle for Designing Good ERMs

Theorem

If a learner has an "essential" range r, then

    m_A(ε, δ) ≤ (d ln(r/ε) + ln(1/δ)) / ε


Page 42: Learnability Beyond Uniform Convergence

Characterizing Multiclass Learnability

Conjecture

For any H of Natarajan dimension d,

    (d + ln(1/δ)) / ε ≤ m_PAC(ε, δ) ≤ (d ln(d/ε) + ln(1/δ)) / ε.

Cannot rely on uniform convergence / an arbitrary ERM

Maybe there is always an ERM with a small essential range?

Holds for symmetric classes


Page 45: Learnability Beyond Uniform Convergence

Sample Complexity of Specific classes

Enables a rigorous comparison of known multiclass algorithms

Previous analyses (e.g. ASS'01, BL'07): how the binary error translates to multiclass error

Multiclass predictors:

One-vs-All (OvA)
Multiclass SVM (MSVM): argmax_i (Wx)_i
Tree Classifiers (TC), with O(|Y|) nodes
Error Correcting Output Codes (ECOC), with code length O(|Y|)

Use linear predictors in R^d as the binary classifiers

Theorem

The sample complexity of all the above classes is Θ(d |Y|).


Page 47: Learnability Beyond Uniform Convergence

Comparing Approximation Error

Definition

We say that H essentially contains H′ if for any distribution, the approximation error of H is at most the approximation error of H′. H strictly contains H′ if, in addition, there is a distribution for which the approximation error of H is strictly smaller than that of H′.


Page 48: Learnability Beyond Uniform Convergence

Comparing Approximation Error

[Table comparing the classes on three approximation-error criteria; ✓ = holds, ✗ = does not:]

MSVM      ✓  ✓  ✓
OvA       ✓  ✓  ✗
TC/ECOC*  ✓  ✗  ✗

* Assuming tree structure and ECOC code are chosen randomly


Page 49: Learnability Beyond Uniform Convergence

Comparing Approximation Error

              TC                    OvA       MSVM    random ECOC
Est. error    d|Y|                  d|Y|      d|Y|    d|Y|
Approx. error ≥ MSVM;               ≥ MSVM    best    incomparable;
              ≈ 1/2 if d ≪ |Y|                        ≈ 1/2 if d ≪ |Y|


Page 50: Learnability Beyond Uniform Convergence

Open Questions

The equivalence between uniform convergence and learnability breaks even in multiclass problems

What characterizes multiclass learnability ?

What is the corresponding learning rule ?

What characterizes learnability in the general learning setting ?

What is the corresponding learning rule ?

THANKS
