
Lecture Slides for Introduction to Machine Learning, 2nd Edition

ETHEM ALPAYDIN © The MIT Press, 2010

alpaydin@boun.edu.tr
http://www.cmpe.boun.edu.tr/~ethem/i2ml2e

Why Reduce Dimensionality?

Reduces time complexity: less computation

Reduces space complexity: fewer parameters

Saves the cost of observing the features

Simpler models are more robust on small datasets

More interpretable; simpler explanation

Data visualization (structure, groups, outliers, etc.) if plotted in 2 or 3 dimensions


Feature Selection vs Extraction

Feature selection: Choose k < d important features, ignoring the remaining d – k. Subset selection algorithms.

Feature extraction: Project the original xi, i = 1,...,d, dimensions to new k < d dimensions, zj, j = 1,...,k. Principal components analysis (PCA), linear discriminant analysis (LDA), factor analysis (FA).


Subset Selection

There are 2^d possible subsets of d features.

Forward search: Add the best feature at each step. The set of features F is initially Ø. At each iteration, find the best new feature

j = argmin_i E(F ∪ xi)

Add xj to F if E(F ∪ xj) < E(F).

This is a hill-climbing O(d²) algorithm.

Backward search: Start with all features and remove one at a time, if possible.

Floating search: Add k features, remove l.
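As a concrete illustration, here is a minimal Python sketch of forward search. The names `forward_select` and `error` are illustrative, not from the slides; `error(X_subset, y)` stands in for any validation-error estimate E(F) of a model trained on the chosen feature columns.

```python
def forward_select(X, y, error, k):
    """Greedy forward subset selection; X is an (N, d) NumPy array."""
    d = X.shape[1]
    selected = []                 # F starts empty
    best_err = float("inf")       # E(F) before any feature is added
    while len(selected) < k:
        candidates = [j for j in range(d) if j not in selected]
        # evaluate E(F ∪ x_j) for every remaining feature j
        errs = {j: error(X[:, selected + [j]], y) for j in candidates}
        j_best = min(errs, key=errs.get)      # j = argmin_i E(F ∪ x_i)
        if errs[j_best] >= best_err:          # add x_j only if E decreases
            break
        selected.append(j_best)
        best_err = errs[j_best]
    return selected
```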


Principal Components Analysis (PCA)

Find a low-dimensional space such that when x is projected there, information loss is minimized.

The projection of x on the direction of w is: z = wTx

Find w such that Var(z) is maximized

Var(z) = Var(wTx) = E[(wTx – wTμ)2]

= E[(wTx – wTμ)(wTx – wTμ)]

= E[wT(x – μ)(x – μ)Tw]

= wT E[(x – μ)(x –μ)T]w = wT ∑ w

where Var(x)= E[(x – μ)(x –μ)T] = ∑
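As a quick numerical sanity check of this identity (not part of the slides), the sample variance of z = wTx can be compared with wT∑w on synthetic Gaussian data; both agree up to sampling noise.

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic data with a known, positive-definite covariance
X = rng.multivariate_normal([0, 0, 0],
                            [[2.0, 0.5, 0.0],
                             [0.5, 1.0, 0.3],
                             [0.0, 0.3, 1.5]], size=10000)
w = np.array([0.3, -0.6, 0.2])

z = X @ w
print(z.var())              # sample Var(z)
Sigma = np.cov(X, rowvar=False)
print(w @ Sigma @ w)        # w^T Sigma w  -- should be close to Var(z)
```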


Maximize Var(z) subject to ||w||=1

∑w1 = αw1 that is, w1 is an eigenvector of ∑

Choose the one with the largest eigenvalue for Var(z) to be max

Second principal component: Max Var(z2), s.t., ||w2||=1 and orthogonal to w1

∑ w2 = α w2 that is, w2 is another eigenvector of ∑

and so on.

In Lagrangian form:

max_w1  w1T ∑ w1 – α(w1Tw1 – 1)

max_w2  w2T ∑ w2 – α(w2Tw2 – 1) – β(w2Tw1 – 0)


What PCA Does

z = WT(x – m)

where the columns of W are the eigenvectors of ∑ and m is the sample mean.

PCA centers the data at the origin and rotates the axes.
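A minimal NumPy sketch of this procedure (illustrative, not the book's code): eigendecompose the sample covariance, sort the eigenvectors by decreasing eigenvalue, and project the centered data.

```python
import numpy as np

def pca(X, k):
    """Project X (N x d) onto its first k principal components."""
    m = X.mean(axis=0)                        # sample mean
    Sigma = np.cov(X - m, rowvar=False)       # sample covariance
    eigvals, eigvecs = np.linalg.eigh(Sigma)  # eigh: Sigma is symmetric
    order = np.argsort(eigvals)[::-1]         # sort by decreasing eigenvalue
    W = eigvecs[:, order[:k]]                 # d x k matrix of eigenvectors
    Z = (X - m) @ W                           # z = W^T (x - m) for each sample
    return Z, W, eigvals[order]
```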


How to Choose k?

Proportion of Variance (PoV) explained:

PoV = (λ1 + λ2 + ... + λk) / (λ1 + λ2 + ... + λk + ... + λd)

where the λi are sorted in descending order.

Typically, stop at PoV > 0.9.

A scree graph plots PoV vs. k; stop at the "elbow".
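Continuing the PCA sketch above, k can be chosen as the smallest value whose PoV exceeds 0.9; `choose_k` is an illustrative helper, not from the slides.

```python
import numpy as np

def choose_k(eigvals_sorted, threshold=0.9):
    """Smallest k such that the cumulative PoV exceeds the threshold."""
    pov = np.cumsum(eigvals_sorted) / eigvals_sorted.sum()
    return int(np.searchsorted(pov, threshold) + 1)
```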


Factor Analysis

Find a small number of factors z which, when combined, generate x:

xi – µi = vi1z1 + vi2z2 + ... + vikzk + εi

where the zj, j = 1,...,k, are the latent factors with

E[zj] = 0, Var(zj) = 1, Cov(zi, zj) = 0, i ≠ j,

the εi are zero-mean noise sources with

Var(εi) = ψi, Cov(εi, εj) = 0, i ≠ j, Cov(εi, zj) = 0,

and the vij are the factor loadings.
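For illustration, scikit-learn's `FactorAnalysis` estimates the loadings V and the noise variances ψi; this short sketch assumes scikit-learn is available and uses synthetic data generated from the model above.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
Z = rng.standard_normal((500, 2))                   # latent factors, k = 2
V = rng.standard_normal((5, 2))                     # true loadings, d = 5
X = Z @ V.T + 0.1 * rng.standard_normal((500, 5))   # x - mu = Vz + eps

fa = FactorAnalysis(n_components=2).fit(X)
print(fa.components_.shape)    # (2, 5): estimated loadings (V transposed)
print(fa.noise_variance_)      # estimated psi_i for each observed variable
```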


PCA vs FA

PCA goes from x to z:  z = WT(x – µ)

FA goes from z to x:  x – µ = Vz + ε


Factor Analysis

In FA, the factors zj are stretched, rotated, and translated to generate x.


Multidimensional Scaling

Given the pairwise distances between N points, dij, i, j = 1,...,N, place the points on a low-dimensional map such that the distances are preserved.

z = g(x | θ). Find θ that minimizes the Sammon stress:

E(θ | X) = Σ_{r,s} ( ||zr – zs|| – ||xr – xs|| )² / ||xr – xs||²
         = Σ_{r,s} ( ||g(xr | θ) – g(xs | θ)|| – ||xr – xs|| )² / ||xr – xs||²
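One common closed-form way to obtain such a map from a distance matrix is classical (Torgerson) MDS, sketched below in NumPy. Note that it optimizes a different criterion than the Sammon stress and is shown only as an illustration.

```python
import numpy as np

def classical_mds(D, k):
    """Embed an N x N distance matrix D into k dimensions (classical MDS)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * J @ (D ** 2) @ J               # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:k]     # top-k eigenpairs
    L = np.sqrt(np.maximum(eigvals[order], 0))  # clip tiny negative eigenvalues
    return eigvecs[:, order] * L              # N x k coordinates
```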


Map of Europe by MDS


Map from CIA – The World Factbook: http://www.cia.gov/


Linear Discriminant Analysis

Find a low-dimensional space such that when x is projected onto it, the classes are well-separated.

Find w that maximizes

J(w) = (m1 – m2)² / (s1² + s2²)

where, for two classes with labels rt,

m1 = Σ_t wTxt rt / Σ_t rt,   s1² = Σ_t (wTxt – m1)² rt

and m2 and s2² are defined analogously using (1 – rt).


Between-class scatter:

(m1 – m2)² = (wTm1 – wTm2)²
           = wT(m1 – m2)(m1 – m2)Tw
           = wT SB w,   where SB = (m1 – m2)(m1 – m2)T

Within-class scatter:

s1² = Σ_t (wTxt – m1)² rt = Σ_t wT(xt – m1)(xt – m1)Tw rt = wT S1 w,
      where S1 = Σ_t rt (xt – m1)(xt – m1)T

s1² + s2² = wT SW w,   where SW = S1 + S2


Fisher’s Linear Discriminant

Find w that maximizes

J(w) = wT SB w / wT SW w = |wT(m1 – m2)|² / wT SW w

LDA solution:  w = c · SW⁻¹(m1 – m2)

Parametric solution:  w = ∑⁻¹(μ1 – μ2)  when p(x | Ci) ~ N(μi, ∑)
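A minimal NumPy sketch of the two-class solution (illustrative; assumes SW is invertible and the function name is not from the slides):

```python
import numpy as np

def fisher_lda(X1, X2):
    """Fisher discriminant direction for two classes given as (Ni, d) arrays."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - m1).T @ (X1 - m1)          # within-class scatter of class 1
    S2 = (X2 - m2).T @ (X2 - m2)          # within-class scatter of class 2
    SW = S1 + S2
    w = np.linalg.solve(SW, m1 - m2)      # w proportional to S_W^{-1}(m1 - m2)
    return w / np.linalg.norm(w)
```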


K>2 Classes

Within-class scatter:

SW = Σ_{i=1..K} Si,   where Si = Σ_t ri^t (xt – mi)(xt – mi)T

Between-class scatter:

SB = Σ_{i=1..K} Ni (mi – m)(mi – m)T,   where m = (1/K) Σ_{i=1..K} mi

Find W that maximizes

J(W) = |WT SB W| / |WT SW W|

The solution is given by the largest eigenvectors of SW⁻¹SB; SB has a maximum rank of K – 1.
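A corresponding NumPy sketch for K > 2 classes, again assuming SW is invertible; the function name and interface are illustrative.

```python
import numpy as np

def lda_multiclass(X, y, k):
    """Columns of the returned matrix span the discriminant subspace (k <= K-1)."""
    classes = np.unique(y)
    means = {c: X[y == c].mean(axis=0) for c in classes}
    m = np.mean(list(means.values()), axis=0)   # overall mean of the class means
    d = X.shape[1]
    SW = np.zeros((d, d))
    SB = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        SW += (Xc - means[c]).T @ (Xc - means[c])             # sum of S_i
        SB += len(Xc) * np.outer(means[c] - m, means[c] - m)  # N_i (m_i - m)(m_i - m)^T
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(SW, SB))
    order = np.argsort(eigvals.real)[::-1][:k]  # at most K-1 nonzero eigenvalues
    return eigvecs[:, order].real
```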


Isomap

Geodesic distance is the distance along the manifold that the data lies on, as opposed to the Euclidean distance in the input space.


Isomap

Instances r and s are connected in the graph if ||xr – xs|| < ε or if xs is one of the k nearest neighbors of xr. The edge length is ||xr – xs||.

For two nodes r and s that are not directly connected, the distance is the length of the shortest path between them in the graph.

Once the N×N distance matrix is formed this way, use MDS to find a lower-dimensional mapping.
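A short sketch of these steps, assuming scikit-learn and SciPy are available and reusing the `classical_mds` sketch from the MDS slide above (illustrative, not the authors' Matlab code):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def isomap(X, n_neighbors=10, k=2):
    """k-NN graph with Euclidean edge lengths, geodesic distances, then MDS."""
    G = kneighbors_graph(X, n_neighbors=n_neighbors, mode="distance")
    D = shortest_path(G, directed=False)   # N x N geodesic distance matrix
    # assumes the neighborhood graph is connected; otherwise D contains inf
    return classical_mds(D, k)
```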


[Figure: Optdigits after Isomap (with neighborhood graph).]

Matlab source from http://web.mit.edu/cocosci/isomap/isomap.html


Locally Linear Embedding

1. Given xr, find its neighbors xs(r)

2. Find the reconstruction weights Wrs that minimize

   E(W | X) = Σ_r || xr – Σ_s Wrs xs(r) ||²

3. Find the new coordinates zr that minimize

   E(z | W) = Σ_r || zr – Σ_s Wrs zs(r) ||²
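For illustration, scikit-learn's `LocallyLinearEmbedding` carries out these three steps internally (neighbors, reconstruction weights W, then the embedding z); a minimal usage sketch on stand-in data, assuming scikit-learn is available:

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

X = np.random.default_rng(0).standard_normal((200, 64))  # stand-in data matrix
Z = LocallyLinearEmbedding(n_neighbors=10, n_components=2).fit_transform(X)
print(Z.shape)   # (200, 2): one 2-D coordinate per instance
```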


LLE on Optdigits


Matlab source from http://www.cs.toronto.edu/~roweis/lle/code.html