
  • Lecture 7: Unsupervised Learning
    C4B Machine Learning, Hilary 2011, A. Zisserman

    • Dimensionality reduction – Principal Component Analysis
      • algorithm
      • applications

    • Isomap
      • Non-linear map
      • applications

    • clustering

    • dimensionality reduction

  • Dimensionality Reduction

    Why reduce dimensionality?

    1. Intrinsic dimension of data: often data is measured in high dimensions, but its actual variation lives on a low dimensional surface (plus noise)

    Example

    [figure: data plotted in coordinates (x1, x2, x3); the data lives on a low-dimensional surface]

    • 64 × 64 bitmap → $\{0,1\}^{4096}$

    • There is irrelevant noise (variation in stroke width)

    • and a much smaller dimension of variations in the digit

    2. Feature extraction, rather than feature selection

    • new features are a linear combination of originals (not a subset)

    3. Visualization

  • Projection to lower dimensions

    Dimensionality reduction usually involves determining a projection $f: \mathbb{R}^D \to \mathbb{R}^d$, where $D \gg d$, and often $d = 2$.

    If the projection is linear, then it can be written as a $d \times D$ matrix $W$, so that $\mathbf{y} = W\mathbf{x}$.
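    As a minimal sketch of this idea (NumPy is an assumption, not part of the lecture), a linear projection is just a matrix-vector product; PCA, introduced next, is one way of choosing the matrix:

```python
import numpy as np

D, d = 10, 2                  # original and reduced dimensions (illustrative values)
W = np.random.randn(d, D)     # a d x D projection matrix; here random, PCA chooses a better one
x = np.random.randn(D)        # a data point in R^D
y = W @ x                     # the projected point in R^d
print(y.shape)                # (2,)
```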

    Principal Component Analysis (PCA)

    Determine a set of (orthogonal) axes which best represent the data

    [figure: data with principal axes u1 and u2; the projection from R^D to R^d is a linear d × D map]

  • Principal Component Analysis (PCA)

    Determine a set of (orthogonal) axes which best represent the data

    Step 1: compute a vector to the data centroid, c

    Step 2: compute the principal axes, ui

    [figure: data cloud with centroid c and principal axes u1, u2]

    Principal Component Analysis (PCA)

    Given a set of N data points $\mathbf{x}_i \in \mathbb{R}^D$:

    1. Centre the data: compute the centroid

    $$\mathbf{c} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i$$

    and transform the data so that c becomes the new origin

    $$\mathbf{x}_i \leftarrow \mathbf{x}_i - \mathbf{c}$$

  • Principal Component Analysis (PCA)

    2a. Compute the first principal axis: determine the direction that best explains (or approximates) the data. Find a direction (unit vector) $\mathbf{u}$ such that

    $$\sum_i \left( \mathbf{x}_i - (\mathbf{u}^\top \mathbf{x}_i)\,\mathbf{u} \right)^2$$

    is minimized, or equivalently such that

    $$\sum_i \left( \mathbf{u}^\top \mathbf{x}_i \right)^2$$

    is maximized. This is the direction of maximum variation.

    Introduce a Lagrange multiplier to enforce $\|\mathbf{u}\| = 1$, and find the stationary point of

    $$\mathcal{L} = \sum_i \left( \mathbf{u}^\top \mathbf{x}_i \right)^2 + \lambda \left( 1 - \mathbf{u}^\top \mathbf{u} \right)$$

    with respect to $\mathbf{u}$. Expanding,

    $$\mathcal{L} = \mathbf{u}^\top \left( \sum_i \mathbf{x}_i \mathbf{x}_i^\top \right) \mathbf{u} + \lambda \left( 1 - \mathbf{u}^\top \mathbf{u} \right) = \mathbf{u}^\top S \mathbf{u} + \lambda \left( 1 - \mathbf{u}^\top \mathbf{u} \right)$$

    where $S$ is the $D \times D$ symmetric matrix $S = \sum_i \mathbf{x}_i \mathbf{x}_i^\top$. Then

    $$\frac{d\mathcal{L}}{d\mathbf{u}} = 2 S \mathbf{u} - 2 \lambda \mathbf{u} = 0$$

    and hence

    $$S \mathbf{u} = \lambda \mathbf{u}$$

    i.e. $\mathbf{u}$ is an eigenvector of $S$. Thus the variation

    $$\sum_i \left( \mathbf{u}^\top \mathbf{x}_i \right)^2 = \mathbf{u}^\top S \mathbf{u} = \lambda\, \mathbf{u}^\top \mathbf{u} = \lambda$$

    is maximised by the eigenvector $\mathbf{u}_1$ corresponding to the largest eigenvalue $\lambda_1$ of $S$. $\mathbf{u}_1$ is the first principal component.
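    To make the result concrete, here is a quick numerical check (a sketch assuming NumPy; not part of the lecture) that the variation $\sum_i (\mathbf{u}^\top \mathbf{x}_i)^2$ is largest for the leading eigenvector of $S$:

```python
import numpy as np

rng = np.random.default_rng(0)
# Columns of X are data points in R^3, stretched along the first axis
X = rng.standard_normal((3, 200)) * np.array([[3.0], [1.0], [0.3]])
S = X @ X.T                                   # S = sum_i x_i x_i^T

eigvals, eigvecs = np.linalg.eigh(S)          # ascending eigenvalues of a symmetric matrix
u1 = eigvecs[:, -1]                           # eigenvector with the largest eigenvalue

# Variation along u1 equals lambda_1 ...
print(np.allclose(np.sum((u1 @ X) ** 2), eigvals[-1]))    # True
# ... and no other unit direction does better
u = rng.standard_normal(3); u /= np.linalg.norm(u)
print(np.sum((u @ X) ** 2) <= eigvals[-1])                # True
```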

  • 2b. Now compute the next axis, which has the most variation and is orthogonal to $\mathbf{u}_1$.

    This must again be an eigenvector of $S$, since $S\mathbf{u} = \lambda\mathbf{u}$ gives all the stationary points of the variation, and hence is given by $\mathbf{u}_2$, the eigenvector corresponding to the second largest eigenvalue of $S$. Why?

    $\mathbf{u}_2$ is the second principal component.

    Continuing in this manner, it can be seen that the $d$ principal components of the data are the $d$ eigenvectors of $S$ with the largest eigenvalues.

    [figure: data cloud with principal axes u1 and u2]

  • Example

    Data: three points $\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3 \in \mathbb{R}^3$:

    $$\mathbf{x}_1 = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} \quad \mathbf{x}_2 = \begin{pmatrix} 2 \\ 2 \\ 1 \end{pmatrix} \quad \mathbf{x}_3 = \begin{pmatrix} 3 \\ 3 \\ 1 \end{pmatrix}$$

    The centroid is $\bar{\mathbf{x}} = (2, 2, 1)^\top$, and so the centred data is:

    $$\mathbf{x}_1 = \begin{pmatrix} -1 \\ -1 \\ 0 \end{pmatrix} \quad \mathbf{x}_2 = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix} \quad \mathbf{x}_3 = \begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix}$$

    Write $X = [\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3]$, then

    $$S = \sum_i \mathbf{x}_i \mathbf{x}_i^\top = X X^\top = \begin{bmatrix} -1 & 0 & 1 \\ -1 & 0 & 1 \\ 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} -1 & -1 & 0 \\ 0 & 0 & 0 \\ 1 & 1 & 0 \end{bmatrix} = \begin{bmatrix} 2 & 2 & 0 \\ 2 & 2 & 0 \\ 0 & 0 & 0 \end{bmatrix}$$

    [figure: the three points lie on the plane z = 1]

    and its eigen-decomposition is:

    $$S = [\mathbf{u}_1, \mathbf{u}_2, \mathbf{u}_3] \begin{bmatrix} \lambda_1 & 0 & 0 \\ 0 & \lambda_2 & 0 \\ 0 & 0 & \lambda_3 \end{bmatrix} [\mathbf{u}_1, \mathbf{u}_2, \mathbf{u}_3]^\top = \begin{bmatrix} \tfrac{1}{\sqrt{2}} & \tfrac{1}{\sqrt{2}} & 0 \\ \tfrac{1}{\sqrt{2}} & -\tfrac{1}{\sqrt{2}} & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 4 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} \tfrac{1}{\sqrt{2}} & \tfrac{1}{\sqrt{2}} & 0 \\ \tfrac{1}{\sqrt{2}} & -\tfrac{1}{\sqrt{2}} & 0 \\ 0 & 0 & 1 \end{bmatrix}^\top$$

    so $\lambda_1 = 4$ with $\mathbf{u}_1 = \tfrac{1}{\sqrt{2}}(1, 1, 0)^\top$.

    Then $y_i = \mathbf{u}_1^\top \mathbf{x}_i$, giving $y_i = \{-\sqrt{2},\, 0,\, \sqrt{2}\}$ for the three points.

    [figure: the centred points with principal axes u1 and u2]
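    The numbers above can be checked with a few lines of NumPy (an illustrative sketch; the library is an assumption, not part of the lecture):

```python
import numpy as np

# The three data points of the example, as columns
X = np.array([[1., 2., 3.],
              [1., 2., 3.],
              [1., 1., 1.]])

c = X.mean(axis=1, keepdims=True)      # centroid (2, 2, 1)^T
Xc = X - c                             # centred data

S = Xc @ Xc.T                          # S = sum_i x_i x_i^T
eigvals, eigvecs = np.linalg.eigh(S)   # eigenvalues in ascending order
u1 = eigvecs[:, -1]                    # first principal component, (1, 1, 0)/sqrt(2) up to sign

print(eigvals[-1])                     # 4.0
print(u1 @ Xc)                         # approximately [-sqrt(2), 0, sqrt(2)] (possibly negated)
```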

  • The PCA Algorithm

    Given data $\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\}$, $\mathbf{x}_i \in \mathbb{R}^D$:

    1. Compute the centroid and centre the data:

    $$\mathbf{c} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i, \qquad \mathbf{x}_i \leftarrow \mathbf{x}_i - \mathbf{c}$$

    2. Write the centred data as $X = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N]$ and compute the covariance matrix

    $$S = \frac{1}{N} \sum_i \mathbf{x}_i \mathbf{x}_i^\top = \frac{1}{N} X X^\top$$

    3. Compute the eigen-decomposition of $S$:

    $$S = U D U^\top$$

    4. The principal components are the columns $\mathbf{u}_i$ of $U$, ordered by the magnitude of the eigenvalues.

    5. The dimensionality of the data is reduced to $d$ by the projection

    $$\mathbf{y} = U_d^\top \mathbf{x}$$

    where $U_d$ contains the first $d$ columns of $U$, and $\mathbf{y}$ is a $d$-vector.
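    As a minimal sketch of these five steps (assuming NumPy; the function name `pca` is illustrative, not from the lecture):

```python
import numpy as np

def pca(X, d):
    """PCA following the algorithm above.

    X : D x N array whose columns are the data points x_i.
    d : number of principal components to keep.
    Returns the centroid c, the D x d matrix U_d of principal axes,
    and the d x N array Y of projected data.
    """
    c = X.mean(axis=1, keepdims=True)     # step 1: centroid
    Xc = X - c                            # centre the data
    S = (Xc @ Xc.T) / X.shape[1]          # step 2: covariance matrix (1/N) X X^T
    eigvals, U = np.linalg.eigh(S)        # step 3: eigen-decomposition (ascending eigenvalues)
    order = np.argsort(eigvals)[::-1]     # step 4: order by decreasing eigenvalue
    U_d = U[:, order[:d]]
    Y = U_d.T @ Xc                        # step 5: project, y = U_d^T x
    return c, U_d, Y

# Example: reduce random data in R^10 to d = 2
X = np.random.randn(10, 100)
c, U_d, Y = pca(X, d=2)
print(Y.shape)                            # (2, 100)
```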

    Notes

    • The PCA is a linear transformation that rotates the data so that it is maximally decorrelated

    • Often each coordinate is first transformed independently to have unit variance. Why?

    • A limitation of PCA is the linearity – it can’t “fit” curved surfaces well. We will return to this problem later.

  • Example: Visualization

    Suppose we are given high dimensional data and want to get some idea of its distribution,

    e.g. the "iris" dataset:
    • three classes, 50 instances per class, 4 attributes

    $$\mathbf{x}_1 = \begin{pmatrix} 0.2911 \\ 0.5909 \\ -0.5942 \\ -0.8400 \end{pmatrix} \quad \mathbf{x}_2 = \begin{pmatrix} 0.2405 \\ 0.3636 \\ -0.5942 \\ -0.8400 \end{pmatrix} \quad \ldots \quad \mathbf{x}_{150} = \begin{pmatrix} 0.4937 \\ 0.3636 \\ 0.4783 \\ 0.4400 \end{pmatrix} \qquad y_i \in \{1, 2, 3\}$$

    [figure: scatter plot of the data projected onto the first two principal components]

    • the data can be visualized
    • in this case the data can (almost) be classified using the 2 principal components as new feature vectors
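    A sketch of this visualization, using scikit-learn only to load the iris data and matplotlib to plot (both libraries are assumptions; the lecture's copy of the data is rescaled, so the exact coordinates will differ):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data.T                              # 4 x 150, columns are the flowers
labels = iris.target                         # class labels 0, 1, 2

# Centre and project onto the first two principal components
c = X.mean(axis=1, keepdims=True)
Xc = X - c
eigvals, U = np.linalg.eigh(Xc @ Xc.T / X.shape[1])
U2 = U[:, np.argsort(eigvals)[::-1][:2]]
Y = U2.T @ Xc                                # 2 x 150

plt.scatter(Y[0], Y[1], c=labels)
plt.xlabel('first principal component')
plt.ylabel('second principal component')
plt.show()
```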

  • The eigenvectors $U$ provide an orthogonal basis for any $\mathbf{x} \in \mathbb{R}^D$:

    $$\mathbf{x} = \sum_{j=1}^{D} (\mathbf{u}_j^\top \mathbf{x})\,\mathbf{u}_j$$

    The PCA approximation with $d$ principal components is

    $$\tilde{\mathbf{x}} = \sum_{j=1}^{d} (\mathbf{u}_j^\top \mathbf{x})\,\mathbf{u}_j$$

    and so the error is

    $$\mathbf{x} - \tilde{\mathbf{x}} = \sum_{j=d+1}^{D} (\mathbf{u}_j^\top \mathbf{x})\,\mathbf{u}_j$$

    Using $\mathbf{u}_j^\top \mathbf{u}_k = \delta_{jk}$, the squared error is

    $$\|\mathbf{x} - \tilde{\mathbf{x}}\|^2 = \left( \sum_{j=d+1}^{D} (\mathbf{u}_j^\top \mathbf{x})\,\mathbf{u}_j \right)^2 = \sum_{j=d+1}^{D} \mathbf{u}_j^\top (\mathbf{x}\mathbf{x}^\top)\,\mathbf{u}_j$$

    How much is lost by the PCA approximation?

    Hence the mean squared error is

    $$\frac{1}{N} \sum_{i=1}^{N} \|\mathbf{x}_i - \tilde{\mathbf{x}}_i\|^2 = \sum_{j=d+1}^{D} \mathbf{u}_j^\top \left( \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i \mathbf{x}_i^\top \right) \mathbf{u}_j = \sum_{j=d+1}^{D} \mathbf{u}_j^\top S\, \mathbf{u}_j$$

    and since $S\mathbf{u}_j = \lambda_j \mathbf{u}_j$,

    $$\frac{1}{N} \sum_{i=1}^{N} \|\mathbf{x}_i - \tilde{\mathbf{x}}_i\|^2 = \sum_{j=d+1}^{D} \mathbf{u}_j^\top S\, \mathbf{u}_j = \sum_{j=d+1}^{D} \lambda_j$$

    • the (squared reconstruction) error is given by the sum of the eigenvalues for the unused eigenvectors
    • the error is minimized by discarding the components with the smallest eigenvalues (as expected)
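    A quick numerical check of this identity (a sketch assuming NumPy, and using the covariance convention $S = \frac{1}{N} X X^\top$ from the algorithm above):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 500))            # columns are data points in R^5
X -= X.mean(axis=1, keepdims=True)           # centre the data

S = X @ X.T / X.shape[1]                     # covariance matrix
eigvals, U = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]            # sort by decreasing eigenvalue
U, eigvals = U[:, order], eigvals[order]

d = 2
X_tilde = U[:, :d] @ (U[:, :d].T @ X)        # PCA approximation with d components
mse = np.mean(np.sum((X - X_tilde) ** 2, axis=0))

print(np.allclose(mse, eigvals[d:].sum()))   # True: MSE = sum of the unused eigenvalues
```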

  • Example: Compression

    Natural application: we can choose how "much" of the data to keep.
    • Represent the image by patches of size s × s pixels
    • Compute the PCA for all patches (each patch is an s²-vector)
    • Project each patch onto d principal components

    [figure: original image; splitting into patches with s = 16, D = s² = 256; compressed image with d = 20]

    [figure: reconstruction error (MSE) against output dimension d]

    d = 40: ratio (compressed/original) = 31.25%
    d = 20: ratio (compressed/original) = 15.63%
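    A sketch of the patch-based compression described above (assuming NumPy; the helper `compress_patches` and the random placeholder image are illustrative, not from the lecture):

```python
import numpy as np

def compress_patches(img, s=16, d=20):
    """Split a grayscale image into s x s patches, fit PCA to the patches,
    and keep only d coefficients per patch (image size assumed a multiple of s)."""
    H, W = img.shape
    patches = (img[:H - H % s, :W - W % s]
               .reshape(H // s, s, W // s, s)
               .transpose(0, 2, 1, 3)
               .reshape(-1, s * s).T)              # s^2 x N, columns are patches
    c = patches.mean(axis=1, keepdims=True)
    Pc = patches - c
    eigvals, U = np.linalg.eigh(Pc @ Pc.T / Pc.shape[1])
    U_d = U[:, np.argsort(eigvals)[::-1][:d]]      # top d principal components
    Y = U_d.T @ Pc                                 # d numbers stored per patch
    recon = U_d @ Y + c                            # reconstructed patches
    mse = np.mean(np.sum((patches - recon) ** 2, axis=0))
    return Y, mse

img = np.random.rand(256, 256)                     # placeholder for a real grayscale image
Y, mse = compress_patches(img)
print(Y.shape, mse)                                # (20, 256) coefficients, reconstruction error
```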

  • Example: Graphics – PCA for faces

    3D PCA

    3D faces

    CyberScan faces

    Thomas Vetter, Sami Romdhani, Volker Blanz

  • Example: 3D PCA for faces

    Fitting to an image

  • [figure: the original image and the fitted 3D face model]

    Isomap

  • Limitations of linear methods for dimensionality reduction

    The images in each row cannot be expressed as a linear combination of the others

    The Swiss Roll Problem

    • Would like to unravel local structure

    • Preserve the intrinsic “manifold” structure

    • Need more than linear methods

  • Isomap

    Starting point – MDS linear method

    Another formulation of PCA (called Multi-Dimensional Scaling) arranges the low-dimensional points so as to minimize the discrepancy between the pairwise distances in the original space and the pairwise distances in the low-d space.

    $$\text{Cost} = \sum_{ij} \left( \|\mathbf{x}_i - \mathbf{x}_j\| - \|\mathbf{y}_i - \mathbf{y}_j\| \right)^2$$

    where $\|\mathbf{x}_i - \mathbf{x}_j\|$ is the high-D distance and $\|\mathbf{y}_i - \mathbf{y}_j\|$ is the low-d distance.

    slide credit: Geoffrey Hinton

  • Isomap

    Instead of measuring actual Euclidean distances between points (in high dimensional space) measure the distances along the manifold and then model these intrinsic distances.

    • The main problem is to find a robust way of measuring distances along the manifold.

    • If we can measure manifold distances, the global optimisation is easy: it’s just PCA.

    [figure: points numbered 1 to 6 along a curve; the 2-D embedding vs the intrinsic 1-D manifold]

    If we measure distances along the manifold, d(1,6) > d(1,4).

    slide credit: Geoffrey Hinton

    How Isomap measures intrinsic distances

    • Connect each datapoint to its K nearest neighbours in the high-dimensional space.

    • Put the true Euclidean distance on each of these links.

    • Then approximate the manifold distance between any pair of points as the shortest path in this graph.

    [figure: the shortest path through the neighbourhood graph between two points A and B on the manifold]

    slide credit: Geoffrey Hinton
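    An illustrative sketch of this construction (the use of scikit-learn's kneighbors_graph and SciPy's shortest_path is an assumption; this is not the lecture's code):

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.neighbors import kneighbors_graph

def intrinsic_distances(X, K=8):
    """Approximate manifold distances as described above: connect each point
    to its K nearest neighbours, weight each link by the Euclidean distance,
    then take shortest paths in the resulting graph.

    X : N x D array of data points (one row per point).
    Returns an N x N matrix of graph (geodesic) distances.
    """
    G = kneighbors_graph(X, n_neighbors=K, mode='distance')   # sparse weighted kNN graph
    return shortest_path(G, directed=False)                   # all-pairs shortest paths

# Toy example: points along a curled-up 1-D curve ('Swiss roll'-like) in 2-D
t = np.linspace(0, 3 * np.pi, 200)
X = np.stack([t * np.cos(t), t * np.sin(t)], axis=1)
D_geo = intrinsic_distances(X)
# The geodesic distance between the curve's endpoints exceeds their Euclidean distance
print(D_geo[0, -1] > np.linalg.norm(X[0] - X[-1]))            # True
```

    scikit-learn's `sklearn.manifold.Isomap` packages the same steps (neighbourhood graph, shortest paths, and the final low-dimensional embedding) into a single estimator.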

  • Intrinsic distances by shortest paths between neighbours

    $$\text{Cost} = \sum_{ij} \left( \|\mathbf{x}_i - \mathbf{x}_j\| - \|\mathbf{y}_i - \mathbf{y}_j\| \right)^2$$

    where $\|\mathbf{x}_i - \mathbf{x}_j\|$ is now the high-D intrinsic (manifold) distance and $\|\mathbf{y}_i - \mathbf{y}_j\|$ the low-d distance.

    Example 1

    2000 64 × 64 hand images

  • Example 2

    Unsupervised embedding of the digits 0-4 from MNIST. Not all the data is displayed

    Example 3

    Unsupervised embedding of 100 flower classes based on their shape.

  • Background reading

    • Bishop, chapter 12

    • Other dimensionality reduction methods:
      • Multi-dimensional scaling (MDS)
      • Locally Linear Embedding (LLE)

    • More on web page: http://www.robots.ox.ac.uk/~az/lectures/ml