Outline - users.stat.umn.edu


Transcript of Outline - users.stat.umn.edu

Page 1: Outline - users.stat.umn.edu

Outline

• Background

• Binary Support Vector Machine (SVM) and ψ-Learning

• Multicategory ψ-Learning and SVM

• Optimization for Multicategory ψ-Learning

• Statistical Learning Theory for Multicategory ψ-Learning

• Numerical Examples

• Summary and Future Work

Page 2: Outline - users.stat.umn.edu

An Example

• Letter image recognition: train a machine to recognize hand-written English letters and classify them correctly.

• Dataset "Letter" from the Statlog collection: each sample contains

– a vector of 16 primitive numerical attributes: x = (x_1, …, x_16);

– a response variable (class label) representing the 26 capital letters: y ∈ {A, B, …, Z}.

• Goal: build a classifier using the training data to recognize new letters.

• A 3-class example: consider the letters D, O, Q.

Page 3: Outline - users.stat.umn.edu

[Figure: x1 vs. x2 scatter of letters D, O, Q.] Plot of samples for letters D, O, Q using the first two attributes in x.

Page 4: Outline - users.stat.umn.edu

[Figure: x1 vs. x2 scatter of letters D, O, Q.] Separable case: many possible partitions.


Page 6: Outline - users.stat.umn.edu

Literature Review

• Traditional statistical methods: Linear/Quadratic Discriminant Analysis, Nearest Neighbor, Logistic Regression, etc.

• Machine learning

– Active research in computer science, engineering, etc.

– Methods: Neural Networks, Boosting, SVM, ψ-Learning.

• Goal: maximize generalization ability.

Page 7: Outline - users.stat.umn.edu

Machine Learning and Statistics

• Statistics: estimate conditional probabilities to yield classification, e.g., CART, Logistic Regression, etc.

• Machine learning: margins → SVM (Boser, Guyon, & Vapnik, 1992; Vapnik, 1995), Boosting (Freund & Schapire, 1997), etc.

• Theoretical foundation

– Statistics: function estimation.

– Machine learning: Vapnik-Chervonenkis theory.

• Level of difficulty: classification is easier than function estimation.

Page 8: Outline - users.stat.umn.edu

Multicategory Problem

• k-class problem

– Construct a decision function vector f = (f_1, …, f_k) from a sample (X_i, Y_i), i = 1, …, n, i.i.d. from an unknown distribution P(x, y).

– f_j: a large value of f_j(x) represents class j; it need not be a probability.

– Classifier: argmax_j f_j(x).

• Accuracy: Generalization Error (GE): Err(f) = P(Y ≠ argmax_j f_j(X)).

• Goal: seek f to minimize Err(f) directly.
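As a toy illustration of the argmax rule and the empirical version of Err(f), here is a minimal sketch; the decision values and labels below are made-up placeholders, not the output of any fitted classifier.

```python
import numpy as np

# Hypothetical decision values f_j(x_i) for k = 3 classes on 4 inputs.
F = np.array([[ 0.9, -0.2, -0.7],
              [-0.1,  0.4, -0.3],
              [ 0.2,  0.1, -0.3],
              [-0.5, -0.4,  0.9]])
y = np.array([0, 1, 2, 2])           # true labels (0-based)

pred = F.argmax(axis=1)              # classifier: argmax_j f_j(x)
err = np.mean(pred != y)             # empirical analogue of Err(f) = P(Y != argmax_j f_j(X))
print(pred, err)                     # [0 1 0 2] 0.25
```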

Page 9: Outline - users.stat.umn.edu

[Figure: the (x1, x2) plane for letters D, O, Q partitioned into the regions f1 > max(f2, f3), f2 > max(f1, f3), and f3 > max(f1, f2).] Find f = (f_1, f_2, f_3) and use argmax_j f_j to do classification.

Page 10: Outline - users.stat.umn.edu

Difficulties

• Class label representation is not unique.

– Binary: y ∈ {1, 2} or y ∈ {−1, +1} (SVM, ψ-learning).

– Multicategory:

1. Scalar: y ∈ {1, 2, …, k}.

2. Vector: y = (y_1, …, y_k) with 1 in the j-th coordinate representing class j (Lee, Lin, & Wahba, 2001).

• Generalize the concept of Err(f) under the new coding system.

• Generalize the concept of margins (only available for the binary problem).

• Possible absence of a dominating class, i.e., max_j p_j(x) < 1/2, where p_j(x) = P(Y = j | X = x).

– The conventional "one-vs-rest" scheme is suboptimal in this case (Lee et al., 2001):

1. Perform k binary classifications, class j versus the rest, to obtain f_j.

2. Check whether f_j(x) > 0 for some j, j = 1, …, k.

Page 11: Outline - users.stat.umn.edu

Binary Case

• Begin with the binary case in the new setting for motivation.

• Usual setting: only one f; classifier sign(f(x)) with y ∈ {−1, +1}.

• New setting: f = (f_1, f_2) and classifier argmax_j f_j(x).

– sign(f_2 − f_1) suffices: classify x into class 2 if f_2(x) − f_1(x) > 0.

– Remove redundancy: sum-to-zero constraint f_1 + f_2 = 0.

• Linear: f_j(x) = w_j · x + b_j; w_j ∈ R^d, b_j ∈ R.

• Margins

– Functional margin for an instance (x_i, y_i): u_i = f_{y_i}(x_i) − f_j(x_i), j ≠ y_i; it plays the role of y_i f(x_i) in the usual {−1, +1} setting.

– Separation margin: γ = 2/||w_1 − w_2||.

1. The Euclidean distance between the hyperplanes f_1(x) − f_2(x) = 1 and f_2(x) − f_1(x) = 1.

2. ||·||: the Euclidean norm in R^d.

Page 12: Outline - users.stat.umn.edu

SVM for Binary Case

• Separable: find the optimal separating hyperplane, i.e.,

– maximize γ = 2/||w_1 − w_2|| subject to u_i ≥ 1 for all i: (1) "zero" training error; (2) the constraint fixes the scaling of f.

• Nonseparable: "zero" training error is not attainable ⇒ introduce slack variables ξ_i ≥ 0 and minimize (1/2) Σ_j ||w_j||² + C Σ_{i=1}^{n} ξ_i subject to u_i ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, …, n, where C > 0 is a tuning parameter.

• Equivalently, solve for f = (f_1, f_2) by minimizing

C Σ_{i=1}^{n} [1 − u_i]_+ + (1/2) Σ_j ||w_j||², subject to Σ_j f_j = 0,

where [1 − u]_+ is the hinge loss.

• Support Vectors (SVs) determine the solution.

– Instances with ξ_i = 0 lying on the hyperplane f_{y_i}(x) − f_j(x) = 1, j ≠ y_i.

– Instances with ξ_i > 0 lying in the halfspace f_{y_i}(x) − f_j(x) < 1, j ≠ y_i.
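As a concrete reading of the hinge-loss form above, here is a minimal sketch that evaluates the objective for linear f_1, f_2 under the sum-to-zero constraint (so w_2 = −w_1, b_2 = −b_1); the data, weights, and C below are placeholder values.

```python
import numpy as np

def svm_objective(w1, b1, X, y, C):
    """C * sum_i [1 - u_i]_+ + 0.5 * (||w1||^2 + ||w2||^2), with w2 = -w1, b2 = -b1."""
    f1 = X @ w1 + b1
    f2 = -f1                                  # sum-to-zero constraint
    u = np.where(y == 1, f1 - f2, f2 - f1)    # functional margin f_y - f_other
    hinge = np.maximum(0.0, 1.0 - u)
    return C * hinge.sum() + 0.5 * (2 * w1 @ w1)

X = np.array([[1.0, 0.5], [-1.0, 0.2], [0.3, -1.0]])
y = np.array([1, 2, 1])
print(svm_objective(np.array([0.4, -0.1]), 0.0, X, y, C=1.0))
```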

Page 13: Outline - users.stat.umn.edu

[Figure: classes 1 and 2 separated by the decision boundary f1(x) − f2(x) = 0, with margin hyperplanes f1(x) − f2(x) = 1 and f2(x) − f1(x) = 1 and margin = 1/||w2||.] Plot of the decision boundary defined by f_1(x) − f_2(x) = 0, the separation margin, and three SVs on the margin hyperplanes.

Page 14: Outline - users.stat.umn.edu

ψ-Learning for Binary Case

• Proposed by Shen, Tseng, Zhang, & Wong (JASA, 2003) under the usual binary {−1, +1} setting.

• New setting: f = (f_1, f_2) and classifier argmax_j f_j(x).

• Goal: minimize the GE, Err(f) = (1/2) E[1 − sign(u)], where u = f_Y(X) − f_j(X), j ≠ Y.

• Solve for f = (f_1, f_2) by minimizing

C Σ_{i=1}^{n} ψ(u_i) + (1/2) Σ_j ||w_j||², subject to Σ_j f_j = 0.

• ψ ≈ 1 − sign and is non-increasing: this resolves the scaling problem of minimizing 1 − sign directly.

• ψ(u) > 0 if u ∈ [0, 1); ψ(u) = 1 − sign(u) otherwise.

• Potential advantages over the SVM both theoretically and numerically.
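A small sketch of one binary ψ loss consistent with the requirements above (the same shape as the specific choice in (2) on a later slide: 2 for u < 0, 0 for u ≥ 1, linear in between), compared with the hinge loss to show that ψ stays bounded for large negative margins.

```python
import numpy as np

def psi(u):
    """Binary psi loss: 1 - sign(u) outside [0, 1), 2*(1 - u) on [0, 1)."""
    u = np.asarray(u, dtype=float)
    return np.where(u >= 1, 0.0, np.where(u < 0, 2.0, 2.0 * (1.0 - u)))

def hinge(u):
    return np.maximum(0.0, 1.0 - np.asarray(u, dtype=float))

u = np.array([-10.0, -0.5, 0.0, 0.5, 1.0, 2.0])
print(psi(u))    # [ 2.   2.   2.   1.   0.   0. ]  -- bounded: an outlier costs at most 2
print(hinge(u))  # [11.   1.5  1.   0.5  0.   0. ]  -- unbounded for large negative margins
```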

Page 15: Outline - users.stat.umn.edu

[Figure: plot of the loss L(u) against u ∈ [−2, 2], comparing L = 1 − sign(u) with a ψ loss.]

Page 16: Outline - users.stat.umn.edu

Multicategory Framework

• Classifier: argmax_j f_j(x) with f = (f_1, …, f_k).

• Important concept, multiple comparison: u = u(f(x), y) = (f_y(x) − f_j(x), j ≠ y) ∈ R^{k−1}.

– Compares class y with the remaining k − 1 classes.

– Reduces to the single comparison f_y(x) − f_j(x), j ≠ y, when k = 2.

• f yields correct classification for (x, y) if u_min = min_{j ≠ y} (f_y(x) − f_j(x)) > 0.

• Multivariate sign:

sign(u) = 1 if u_min = min(u_1, …, u_{k−1}) > 0;

sign(u) = −1 if u_min ≤ 0.

• GE: Err(f) = (1/2) E[1 − sign(u(f(X), Y))].
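A minimal sketch of the multiple-comparison vector u, the multivariate sign, and the identity Err(f) = (1/2) E[1 − sign(u)], checked empirically on placeholder decision values.

```python
import numpy as np

def comparison_vector(f_row, y):
    """u(f(x), y) = (f_y(x) - f_j(x), j != y)."""
    return np.delete(f_row[y] - f_row, y)

def msign(u):
    """Multivariate sign: +1 if min(u) > 0, -1 otherwise."""
    return 1.0 if np.min(u) > 0 else -1.0

F = np.array([[ 0.9, -0.2, -0.7],
              [-0.1,  0.4, -0.3],
              [ 0.2,  0.1, -0.3]])
y = np.array([0, 1, 2])

signs = np.array([msign(comparison_vector(F[i], y[i])) for i in range(len(y))])
print(np.mean(0.5 * (1 - signs)))        # 1/3: matches the argmax error rate below
print(np.mean(F.argmax(axis=1) != y))    # 1/3
```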

Page 17: Outline - users.stat.umn.edu

Multicategory Framework, cont’d

• Multivariate ψ function: for u ∈ R^{k−1} and k ≥ 2,

ψ(u) > 0 if 0 ≤ u_min < 1; ψ(u) = 1 − sign(u) otherwise; (1)

ψ is non-increasing in each u_j.

• Includes the binary ψ function as a special case.

• A specific ψ for implementation:

ψ(u) = 1 − sign(u) if u_min ≥ 1 or u_min < 0; ψ(u) = 2(1 − u_min) otherwise. (2)
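A direct sketch of the specific ψ in (2), taking the comparison vector u ∈ R^{k−1} as input.

```python
import numpy as np

def psi_multi(u):
    """Specific multicategory psi of (2): 1 - sign(u) if u_min >= 1 or u_min < 0,
    and 2*(1 - u_min) otherwise (i.e., for 0 <= u_min < 1)."""
    umin = np.min(u)
    if umin >= 1:
        return 0.0          # 1 - sign(u), with sign(u) = +1
    if umin < 0:
        return 2.0          # 1 - sign(u), with sign(u) = -1
    return 2.0 * (1.0 - umin)

print(psi_multi([1.5, 2.0]), psi_multi([0.3, 0.8]), psi_multi([-0.2, 1.0]))
# 0.0 1.4 2.0
```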

Page 18: Outline - users.stat.umn.edu

[Figure: perspective plot of ψ(u1, u2) over (u1, u2) ∈ [−2, 2]², with values ranging from 0 to 2.] Perspective plot of the 3-class ψ function as defined in (2).

Page 19: Outline - users.stat.umn.edu

Multicategory ψ-Learning

• Multicategory ψ-learning: find f by minimizing

C Σ_{i=1}^{n} ψ(u(f(x_i), y_i)) + (1/2) Σ_{j=1}^{k} ||w_j||², with the constraint Σ_{j=1}^{k} f_j = 0.

• Linear: f_j(x) = w_j · x + b_j.

• Nonlinear: apply linear learning to a nonlinear feature space F induced by a kernel K(·, ·).

– K satisfies the Mercer theorem (Courant & Hilbert, 1959);

– f_j = h_j + b_j with h_j ∈ H_K, the reproducing kernel Hilbert space of K (Wahba, 1998);

– Representer Theorem (Kimeldorf & Wahba, 1971):

h_j(x) = Σ_{i=1}^{n} v_{ji} K(x_i, x), so f_j(x) = Σ_{i=1}^{n} v_{ji} K(x_i, x) + b_j, where v_j = (v_{j1}, …, v_{jn}).

• Reduces to binary ψ-learning when k = 2.
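A brief sketch of evaluating decision functions in the representer form f_j(x) = Σ_i v_{ji} K(x_i, x) + b_j; the Gaussian kernel is used as an illustrative choice, and the coefficients v and b are placeholders rather than fitted values.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """K(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for all pairs of rows."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def decision_functions(Xnew, Xtrain, V, b, sigma=1.0):
    """f_j(x) = sum_i V[j, i] * K(x_i, x) + b_j, evaluated for each row of Xnew."""
    K = gaussian_kernel(Xnew, Xtrain, sigma)      # (n_new, n_train)
    return K @ V.T + b                            # (n_new, k)

Xtrain = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
V = np.array([[ 1.0, -0.5, 0.0],                  # placeholder coefficients, k = 2
              [-1.0,  0.5, 0.0]])
b = np.array([0.1, -0.1])
F = decision_functions(np.array([[0.5, 0.5]]), Xtrain, V, b)
print(F, F.argmax(axis=1))
```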

Page 20: Outline - users.stat.umn.edu

Multicategory SVM

• Multivariate hinge loss:

H(u) = 0 if u_min ≥ 1; H(u) = 1 − u_min otherwise.

• Multicategory SVM: find f by minimizing

C Σ_{i=1}^{n} H(u(f(x_i), y_i)) + (1/2) Σ_{j=1}^{k} ||w_j||², with the constraint Σ_{j=1}^{k} f_j = 0.

• Reduces to the binary SVM when k = 2.

• Differs from other multiclass SVM extensions (Weston & Watkins, 1998; Vapnik, 1998; Lee et al., 2001).

Page 21: Outline - users.stat.umn.edu

Generalized Margin Interpretation

• Generalized functional margin for (x_i, y_i): u_min(f(x_i), y_i) = min_{j ≠ y_i} (f_{y_i}(x_i) − f_j(x_i)).

– Indicates the correctness and strength of classification.

– Reduces to the binary functional margin y_i f(x_i) when k = 2.

• Generalized separation margin γ: γ = min_{j ≠ j'} γ_{jj'}, where γ_{jj'} = 2/||w_j − w_{j'}|| is the Euclidean distance between f_j(x) − f_{j'}(x) = 1 and f_{j'}(x) − f_j(x) = 1.

• Separable: multicategory ψ-learning and SVM find f such that

– instances with y_i = j fall into the convex polyhedron D_j = {x : min_{j' ≠ j} (f_j(x) − f_{j'}(x)) ≥ 1}, j = 1, …, k;

– γ is maximized.

Page 22: Outline - users.stat.umn.edu

Support Vectors

• Separable: instances lying on the boundaries of the polyhedra D_j, j = 1, …, k.

• Nonseparable: instances with y_i = j falling outside D_j or on the boundary of D_j, j = 1, …, k.

• Multicategory ψ-learning and SVM retain this property of SVs.

• Sparsity of the solution, i.e., a small number of SVs, is desirable since data reduction can be achieved.

Page 23: Outline - users.stat.umn.edu

[Figure: three convex polyhedra (one per class) bounded by the hyperplanes f1 − f2 = 0, f1 − f3 = 0, f2 − f3 = 0 and the margin hyperplanes f1 − f2 = 1, f2 − f1 = 1, f1 − f3 = 1, f3 − f1 = 1, f2 − f3 = 1, f3 − f2 = 1.] Illustration of margins and SVs in a 3-class separable example.

Page 24: Outline - users.stat.umn.edu

Deterministic Nonconvex Minimization

• The minimization involved in ψ-learning is nonconvex.

• Unexplored in statistics.

• D.C. programming (global minimization). Key: D.C. decomposition (difference of convex functions, i.e., convex + concave).

– DCA (An and Tao, J. Global Optimization, 1997).

– Outer approximation (Blanquero & Carrizosa, J. Global Optimization, 2000).

Page 25: Outline - users.stat.umn.edu

D.C. Decomposition

• Decompose ψ: ψ = ψ_1 − ψ_2, where

ψ_2(u) = 0 if u_min ≥ 0; ψ_2(u) = −2 u_min otherwise.

• Yields a d.c. decomposition of the cost function s = s_1 − s_2.

– s_1(Θ) = (1/2) Σ_j ||w_j||² + C Σ_{i=1}^{n} ψ_1(u(f(x_i), y_i)) is convex;

– −s_2(Θ) = −C Σ_{i=1}^{n} ψ_2(u(f(x_i), y_i)) is concave in Θ.

• ψ-Learning, subject to the sum-to-zero constraint, solves

min_Θ s(Θ) = s_1(Θ) − s_2(Θ), (3)

where Θ = vec(w_1, b_1, …, w_k, b_k) collects the parameters of f.

• Nice interpretation

– s_1: the convex cost function of the SVM;

– s_2: a bias correction for generalization.
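Assuming the decomposition above, ψ_1(u) = ψ(u) + ψ_2(u) = 2(1 − u_min)_+ and ψ_2(u) = 2(−u_min)_+ as functions of u_min; a quick numerical check that ψ_1 − ψ_2 reproduces the specific ψ of (2):

```python
import numpy as np

def psi1(umin):            # convex part: 2 * (1 - u_min)_+
    return 2.0 * np.maximum(0.0, 1.0 - umin)

def psi2(umin):            # part subtracted off: 2 * (-u_min)_+
    return 2.0 * np.maximum(0.0, -umin)

def psi(umin):             # the specific psi of (2), as a function of u_min
    return np.where(umin >= 1, 0.0, np.where(umin < 0, 2.0, 2.0 * (1.0 - umin)))

umin = np.linspace(-2, 2, 9)
print(np.allclose(psi1(umin) - psi2(umin), psi(umin)))   # True
```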

Page 26: Outline - users.stat.umn.edu

[Figure: plot of L(u) against u ∈ [−2, 2] for L = ψ, L = ψ_1, and L = ψ_2.] Plot of the D.C. decomposition ψ = ψ_1 − ψ_2 for k = 2.

Page 27: Outline - users.stat.umn.edu

D.C. Algorithm

• Idea: replace s_2(Θ) by its affine minorization s_2(Θ^(m)) + ⟨∇s_2(Θ^(m)), Θ − Θ^(m)⟩ at the current iterate Θ^(m); ∇s_2 is a subgradient of s_2.

• Solve a sequence of convex subproblems:

given Θ^(m), obtain Θ^(m+1) by solving min_Θ s_1(Θ) − ⟨∇s_2(Θ^(m)), Θ⟩.

– Employ Lagrange multipliers.

– Solve the dual problem using QP.

• Algorithm 1

Step 1 (Initialization): choose Θ^(0).

Step 2 (Iteration): at iteration m + 1, compute Θ^(m+1) by solving the QP.

Step 3 (Stopping): stop when |s(Θ^(m+1)) − s(Θ^(m))| is sufficiently small.

Solution: Θ* = argmin_l s(Θ^(l)).
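The following is a minimal numerical sketch of the DCA idea behind Algorithm 1 for linear multicategory ψ-learning. It simplifies several things relative to the slide: the sum-to-zero constraint is dropped, the convex subproblem is solved by plain subgradient descent rather than by Lagrange multipliers and the dual QP, and the toy data, step size, and iteration counts are arbitrary choices.

```python
import numpy as np

def margins_min(W, b, X, y):
    """u_min,i = min_{j != y_i} (f_{y_i}(x_i) - f_j(x_i)) for linear f_j(x) = w_j.x + b_j."""
    F = X @ W.T + b
    n = X.shape[0]
    diff = F[np.arange(n), y][:, None] - F
    diff[np.arange(n), y] = np.inf                 # exclude j = y_i
    return diff.min(axis=1), diff.argmin(axis=1)

def hinge_subgrad(W, b, X, y, C, thr):
    """Subgradient of C * sum_i 2*(thr - u_min,i)_+ with respect to (W, b)."""
    umin, jstar = margins_min(W, b, X, y)
    gW, gb = np.zeros_like(W), np.zeros_like(b)
    for i in np.where(umin < thr)[0]:
        gW[y[i]] -= 2 * C * X[i]
        gb[y[i]] -= 2 * C
        gW[jstar[i]] += 2 * C * X[i]
        gb[jstar[i]] += 2 * C
    return gW, gb

def dca_psi_learning(X, y, k, C=0.1, outer=20, inner=300, lr=1e-3):
    d = X.shape[1]
    W, b = np.zeros((k, d)), np.zeros(k)           # the slides suggest an SVM start instead
    for _ in range(outer):
        # fix the affine minorization of s_2 at the current iterate
        gW2, gb2 = hinge_subgrad(W, b, X, y, C, thr=0.0)
        for _ in range(inner):                     # minimize s_1(theta) - <grad s_2, theta>
            gW1, gb1 = hinge_subgrad(W, b, X, y, C, thr=1.0)
            W -= lr * (W + gW1 - gW2)              # gradient of 0.5*||W||^2 is W
            b -= lr * (gb1 - gb2)
    return W, b

# Toy usage: three well-separated classes in the plane.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.5, size=(30, 2))
               for m in ([0, 2], [-2, -1], [2, -1])])
y = np.repeat([0, 1, 2], 30)
W, b = dca_psi_learning(X, y, k=3)
print("training error:", np.mean((X @ W.T + b).argmax(axis=1) != y))
```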

Page 28: Outline - users.stat.umn.edu

D.C. Algorithm, cont’d

• Theorem (Convergence of Algorithm 1): s(Θ^(m)) is nonincreasing in m and converges, with s(Θ^(m)) − s(Θ^(m+1)) → 0 as m → ∞. Moreover, Algorithm 1 terminates finitely.

• Convergence in about 20 steps. Complexity ≈ that of a QP.

• The solution may not be global.

• Choice of initial values

– Important for the performance of the final solution.

– Use the SVM solution.

Page 29: Outline - users.stat.umn.edu

Outer Approximation

• Idea: s_1(Θ), being convex, is the supremum of its affine minorizations s_1(Θ^(l)) + ⟨∇s_1(Θ^(l)), Θ − Θ^(l)⟩.

• Algorithm 2: solve a sequence of concave problems, obtained by replacing s_1 with the maximum of the affine minorizations accumulated so far, via vertex enumeration.

• Theorem (Convergence of Algorithm 2): the sequence Θ^(m) converges to a global optimum, i.e., lim_{m→∞} s(Θ^(m)) = min_Θ s(Θ), and s(Θ^(m)) is within the stopping tolerance of min_Θ s(Θ) at termination.

• Comparison

Algorithm 1: good for large n; may not be global.

Algorithm 2: good for small problems; global.

Page 30: Outline - users.stat.umn.edu

Theory for Multicategory ψ-Learning

• Class of candidate classification partitions:

G = {G(f) = (G_1(f), …, G_k(f)) : f ∈ F}, induced by the function class F, where G_j(f) = {x : argmax_l f_l(x) = j}.

• Ideal performance: Err(f*) = inf_f Err(f), where f* is a Bayes rule.

• Actual performance: Err(f̂), where f̂ is the estimated rule.

• Comparison (Actual − Ideal):

– e(f̂, f*) = Err(f̂) − Err(f*).

– e_ψ(f, f*) = E ψ(u(f(X), Y)) − E ψ(u(f*(X), Y)), the corresponding quantity for the ψ loss.

• Important formula for e(f, f*):

e(f, f*) = (1/2) E[sign(u(f*(X), Y)) − sign(u(f(X), Y))].

– Reveals a dramatic difference between the binary and multicategory problems.

– Does not suffer from the difficulty of no dominating class.

Page 31: Outline - users.stat.umn.edu

Theory of Multicategory ψ-Learning, cont'd

• Assumption A (Approximation): there exist a positive sequence s_n → 0 as n → ∞ and f̄_n ∈ F such that e_ψ(f̄_n, f*) ≤ s_n.

• Assumption B (Boundary behavior): a condition on the conditional class probabilities p_j(x) near the decision boundary, controlling how e(f, f*) relates to e_ψ(f, f*).

• Assumption C (Metric entropy): a bracketing metric entropy condition on the candidate function class F, bounding its complexity.

• Assumption D: ψ satisfies (1).

Page 32: Outline - users.stat.umn.edu

Theory of Multicategory ψ-Learning, cont'd

• Theorem (Accuracy of ψ-learning: argmax f̂): for a constant c_1 > 0,

P(e(f̂, f*) ≥ δ_n²) ≤ c_1 exp(−c_2 n δ_n²), (4)

provided the tuning parameter C is chosen suitably, where the rate δ_n is determined by the approximation error in Assumption A and the metric entropy in Assumption C.

• Corollary:

e(f̂, f*) = O_P(δ_n²) and E e(f̂, f*) = O(δ_n²),

provided that n δ_n² is bounded away from zero.

• Allows studying the dependence of e(f̂, f*) on n and k simultaneously, with k = k_n allowed to grow with n.

Page 33: Outline - users.stat.umn.edu

Theoretical Example: Linear

• Candidate class F: linear decision functions.

• Input: x is uniformly distributed; the conditional class probabilities P(Y = j | x) are specified explicitly.

• e(f̂, f*) converges at a fast rate in n for a suitable choice of the tuning parameter C.

• The rate is near optimal when k is finite.

Page 34: Outline - users.stat.umn.edu

Theoretical Example: Nonlinear

• Candidate class: kernel (polynomial) learning.

• Input and conditional class probabilities: same as in the linear example.

• Same rate as linear learning if the order of the polynomial kernel is fixed; near optimal.

Page 35: Outline - users.stat.umn.edu

Simulated Examples

• Performance comparison: multicategory ψ-learning versus SVM.

• Improvement of ψ-learning over SVM: (T(SVM) − T(ψ)) / (T(SVM) − Bayes error), where T(·) denotes the testing error.

• Each training sample has size n; the testing and Bayes errors are computed via large independent testing samples.

• Perform linear learning with a grid search on the tuning parameter C.

• Results are obtained by averaging 100 repeated simulations.

Page 36: Outline - users.stat.umn.edu

Simulated Examples: Data Generation

• Generate (t_1, t_2) from a bivariate t-distribution with d.f. = 1 and 3 in cases 1 and 2.

• Randomly assign a class label y ∈ {1, 2, 3} to each observation.

• Generate (x_1, x_2) by shifting (t_1, t_2) by a class-specific center, with distinct centers for classes 1-3, respectively.

[Figure: scatter of the three classes around their respective centers.]
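A small sketch of this simulation design; the class centers mu below are illustrative placeholders rather than the centers used on the slide, and the bivariate t variate is formed as a standard normal pair divided by sqrt(chi-square/d.f.).

```python
import numpy as np

def simulate(n_per_class=50, df=1, seed=0):
    rng = np.random.default_rng(seed)
    mu = np.array([[0.0, 2.0], [-2.0, -1.0], [2.0, -1.0]])   # hypothetical class centers
    X, y = [], []
    for j in range(3):
        z = rng.standard_normal((n_per_class, 2))
        chi = rng.chisquare(df, size=(n_per_class, 1))
        t = z / np.sqrt(chi / df)            # bivariate t with the given d.f.
        X.append(t + mu[j])
        y.append(np.full(n_per_class, j))
    return np.vstack(X), np.concatenate(y)

X, y = simulate(df=1)    # case 1; use df=3 for case 2
```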

Page 37: Outline - users.stat.umn.edu

Case 1: d.f. = 1; Bayes error = 0.247; improvement of ψ over SVM is 43.22%.

Case 2: d.f. = 3; Bayes error = 0.146; improvement of ψ over SVM is 20.41%.

Case        Method   Train (s.e.)   Test (s.e.)   Test − Bayes   SVs (s.e.)
d.f. = 1    SVM      .400 (.147)    .431 (.141)   .184           141.76 (10.97)
            ψ-L      .320 (.124)    .349 (.121)   .102           64.64 (15.43)
d.f. = 3    SVM      .145 (.027)    .151 (.005)   .005           71.81 (11.02)
            ψ-L      .143 (.029)    .150 (.003)   .004           41.29 (13.51)

• ψ-learning has smaller testing error and consequently better generalization ability than the SVM.

• When d.f. = 1, the moments of the bivariate t-distribution do not exist and the SVM fails to accomplish data reduction, while ψ-learning has a much smaller number of SVs.

Page 38: Outline - users.stat.umn.edu

Applications: Letter Image Recognition

• 3-class example (letters D, O, Q).

• Search for the optimal C over grid points.

• Improvement: (T(SVM) − T(ψ)) / T(SVM).

• Observations

– ψ-learning does better in testing than the SVM.

– On average, ψ-learning reduces the number of SVs relative to the SVM. The percent of reduction, however, varies.

Testing errors of SVM and ψ-learning

Case   SVM    ψ-L    Improv.
1      .083   .079   3.39%
2      .073   .063   12.24%
3      .086   .076   11.41%
4      .072   .072   0%
5      .088   .085   3.74%
6      .077   .073   5.45%
7      .075   .072   4.39%
8      .079   .075   5.92%
9      .093   .091   1.51%
10     .090   .086   4.11%

Average number of SVs: 51.1 (SVM), 40.8 (ψ-L).

Page 39: Outline - users.stat.umn.edu

Summary

• Propose a novel multicategory methodology for ψ-learning and SVM.

• Develop a learning theory for multicategory ψ-learning.

• Propose optimization methods to solve the nonconvex minimization.

• ψ-learning is robust to outliers. In contrast, any classifier with an unbounded loss function, such as the SVM, suffers from extreme outliers.

• Numerical studies suggest that ψ-learning yields an even more "sparse" solution than the SVM.

Page 40: Outline - users.stat.umn.edu

Future Directions

• Real applications: microarray classification data, text recognition, etc.

• Choices of tuning parameters and kernels.

• Variable selection for ψ-learning and SVM (Lasso: Tibshirani, JRSS, 1996; Basis Pursuit: Chen, Donoho, & Saunders, SIAM, 1998).

• Extensions to the nonstandard case: treat the k classes unequally.

Page 41: Outline - users.stat.umn.edu

References

• Liu, Y. and Shen, X. (2003). On multicategory ψ-learning and support vector machine. J. Amer. Statist. Assoc. Under review.

• Liu, Y., Shen, X., and Doss, H. (2003). Multicategory ψ-learning and support vector machine: computational tools. J. Comput. Graph. Statist. Tentatively accepted.

• Shen, X., Tseng, G. C., Zhang, X., and Wong, W. H. (2003). On ψ-learning. J. Amer. Statist. Assoc. 98, 724-734.

• An, H. L. T., and Tao, P. D. (1997). Solving a class of linearly constrained indefinite quadratic problems by D.C. algorithms. J. Global Optim. 11, 253-285.

• Shen, X., and Wong, W. H. (1994). Convergence rate of sieve estimates. Ann. Statist. 22, 580-615.

• Lee, Y., Lin, Y., and Wahba, G. (2003). Multicategory Support Vector Machines: theory and application to the classification of microarray data and satellite radiance data. J. Amer. Statist. Assoc. To appear.