Object Orie’d Data Analysis, Last Time
• Kernel Embedding– Use linear methods in a non-linear way
• Support Vector Machines– Completely Non-Gaussian Classification
• Distance Weighted Discrimination– HDLSS Improvement of SVM
– Used in microarray data combination
– Face Data, Male vs. Female
Support Vector Machines
Forgotten last time, Important Extension:
Multi-Class SVMs
Hsu & Lin (2002)
Lee, Lin & Wahba (2002)
• Defined for “implicit” version
• “Direction Based” variation???
Distance Weighted Discrim’n 2-d Visualization:
Pushes Plane Away From Data
All Points Have Some Influence
Criterion (roughly): min over w of Σ_{i=1}^n 1 / r_i
where r_i is the distance of the i-th data point to the separating plane
Distance Weighted Discrim’n Maximal Data Piling
HDLSS Discrim’n Simulations
Main idea:
Comparison of
• SVM (Support Vector Machine)
• DWD (Distance Weighted Discrimination)
• MD (Mean Difference, a.k.a. Centroid)
Linear versions, across dimensions
HDLSS Discrim’n Simulations
Overall Approach:
• Study different known phenomena:
– Spherical Gaussians
– Outliers
– Polynomial Embedding
• Common Sample Sizes
• But wide range of dimensions
n₊ = n₋ = 25 (each class)
d = 10, 40, 100, 400, 1600
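As a concrete illustration of this setup, here is a minimal numpy sketch of the Mean Difference (centroid) rule across dimensions; the class shift (2.2 in dim 1) and the error estimates are illustrative choices, not the exact simulation from the slides.

```python
import numpy as np

def md_error(d, n=25, shift=2.2, n_test=1000, seed=0):
    """Estimate the test error of the Mean Difference (centroid) rule
    for two spherical Gaussians in dimension d, means differing in dim 1."""
    rng = np.random.default_rng(seed)
    mu = np.zeros(d)
    mu[0] = shift
    Xp = rng.standard_normal((n, d)) + mu      # class +1 training sample
    Xm = rng.standard_normal((n, d)) - mu      # class -1 training sample
    w = Xp.mean(axis=0) - Xm.mean(axis=0)      # MD direction
    mid = (Xp.mean(axis=0) + Xm.mean(axis=0)) / 2
    T = rng.standard_normal((n_test, d)) + mu  # test points from class +1
    return float(np.mean((T - mid) @ w <= 0))  # fraction misclassified

errors = {d: md_error(d) for d in (10, 40, 100, 400, 1600)}
```

With n fixed at 25 per class, the error degrades as d grows: exactly the HDLSS effect these simulations study.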
HDLSS Discrim’n Simulations
Spherical Gaussians:
HDLSS Discrim’n Simulations
Spherical Gaussians:
• Same setup as before
• Means shifted in dim 1 only (by 2.2)
• All methods pretty good
• Harder problem for higher dimension
• SVM noticeably worse
• MD best (Likelihood method)
• DWD very close to MD
• Methods converge for higher dimension??
HDLSS Discrim’n Simulations
Outlier Mixture:
HDLSS Discrim’n Simulations
Outlier Mixture:
80%: dim. 1 ±2.2, other dims 0
20%: dim. 1 ±100, dim. 2 ±500, others 0
• MD is a disaster, driven by outliers
• SVM & DWD are both very robust
• SVM is best
• DWD very close to SVM (insig’t difference)
• Methods converge for higher dimension??
Ignore RLR (a mistake)
HDLSS Discrim’n Simulations
Wobble Mixture:
HDLSS Discrim’n Simulations
Wobble Mixture:
80%: dim. 1 ±2.2, other dims 0
20%: dim. 1 ±0.1, rand dim ±100, others 0
• MD still very bad, driven by outliers
• SVM & DWD are both very robust
• SVM loses (affected by margin push)
• DWD slightly better (by w’ted influence)
• Methods converge for higher dimension??
Ignore RLR (a mistake)
HDLSS Discrim’n Simulations
Nested Spheres:
HDLSS Discrim’n Simulations
Nested Spheres:
1st d/2 dim’s: Gaussian with var 1 or C
2nd d/2 dim’s: the squares of the 1st dim’s (as for 2nd degree polynomial embedding)
• Each method best somewhere
• MD best in highest d (data non-Gaussian)
• Methods not comparable (realistic)
• Methods converge for higher dimension??
• HDLSS space is a strange place
Ignore RLR (a mistake)
HDLSS Discrim’n Simulations
Conclusions:
• Everything (sensible) is best sometimes
• DWD often very near best
• MD weak beyond Gaussian
Caution about simulations (and examples):
• Very easy to cherry-pick best ones
• Good practice in Machine Learning:
– “Ignore method proposed, but read paper for useful comparison of others”
HDLSS Discrim’n Simulations
Caution: There are additional players
E.g. Regularized Logistic Regression
also looks very competitive
Interesting Phenomenon:
All methods come together
in very high dimensions???
UNC, Stat & OR
HDLSS Asymptotics: Simple Paradoxes, I
For d-dim’al Standard Normal dist’n: Z ~ N_d(0, I_d)
Euclidean Distance to Origin (as d → ∞):
‖Z‖ = √d + O_p(1)
- Data lie roughly on surface of sphere of radius √d
- Yet origin is point of highest density???
- Paradox resolved by: density w.r.t. Lebesgue Measure
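This concentration is easy to see numerically; a small sketch (dimensions and number of draws chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
stats = {}
for d in (10, 100, 1000, 10000):
    Z = rng.standard_normal((500, d))       # 500 draws of Z ~ N(0, I_d)
    norms = np.linalg.norm(Z, axis=1)       # distances to the origin
    stats[d] = (norms.mean(), norms.std())  # mean ~ sqrt(d), spread stays O(1)
```

The mean distance tracks √d while the spread stays bounded, so the data sit in a thin shell around the sphere of radius √d.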
HDLSS Asymptotics: Simple Paradoxes, II
For d-dim’al Standard Normal dist’n: Z_1, Z_2 ~ N_d(0, I_d), independent
Euclidean Dist. between Z_1 and Z_2 (as d → ∞):
‖Z_1 − Z_2‖ = √(2d) + O_p(1)
Distance tends to non-random constant
Can extend to Z_1, …, Z_n
Where do they all go??? (we can only perceive 3 dim’ns)
HDLSS Asymptotics: Simple Paradoxes, III
For d-dim’al Standard Normal dist’n: Z_1, Z_2 ~ N_d(0, I_d), independent
High dim’al Angles (as d → ∞):
Angle(Z_1, Z_2) = 90° + O_p(d^{−1/2})
- Everything is orthogonal???
- Where do they all go??? (again our perceptual limitations)
- Again 1st order structure is non-random
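Both paradoxes (distance √(2d) and near-orthogonality) can be checked with a few lines of numpy; the dimension here is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5000
Z1, Z2 = rng.standard_normal(d), rng.standard_normal(d)

dist = np.linalg.norm(Z1 - Z2)            # ~ sqrt(2 d) + O_p(1)
cos = Z1 @ Z2 / (np.linalg.norm(Z1) * np.linalg.norm(Z2))
angle = np.degrees(np.arccos(cos))        # ~ 90 degrees + O_p(d^{-1/2})
```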
HDLSS Asy’s: Geometrical Representation, I
Assume n ≤ d, let Z_1, …, Z_n ~ N_d(0, I_d)
Study Subspace Generated by Data:
Hyperplane through 0, of dimension n
Points are “nearly equidistant to 0”, at dist ≈ √d
Within plane, can “rotate towards √d × Unit Simplex”
All Gaussian data sets are “near Unit Simplex Vertices”!!!
“Randomness” appears only in rotation of simplex
Hall, Marron & Neeman (2005)
HDLSS Asy’s: Geometrical Representation, II
Assume n ≤ d, let Z_1, …, Z_n ~ N_d(0, I_d)
Study Hyperplane Generated by Data:
(n − 1)-dimensional hyperplane
Points are pairwise equidistant, at dist ≈ √(2d)
Points lie at vertices of a “regular n-hedron”
Again “randomness in data” is only in rotation
Surprisingly rigid structure in data?
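The rigidity is easy to verify numerically: for n much smaller than d, all pairwise distances are nearly identical, so the point set is close to a regular simplex (n and d below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 5, 20000
Z = rng.standard_normal((n, d))                       # n Gaussian points, n << d

dists = [np.linalg.norm(Z[i] - Z[j])
         for i in range(n) for j in range(i + 1, n)]  # all pairwise distances
rel_spread = (max(dists) - min(dists)) / np.mean(dists)  # tiny for large d
```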
HDLSS Asy’s: Geometrical Representation, III
Simulation View: shows “rigidity after rotation”
HDLSS Asy’s: Geometrical Representation, III
Straightforward Generalizations:
non-Gaussian data: only need moments
non-independent: use “mixing conditions”
Mild Eigenvalue condition on Theoretical Cov. (with J. Ahn, K. Muller & Y. Chi)
All based on simple “Laws of Large Numbers”
HDLSS Asy’s: Geometrical Representation, IV
Explanation of Observed (Simulation) Behavior:
“everything similar for very high d ”
2 popn’s are 2 simplices (i.e. regular n-hedrons)
All are same distance from the other class
i.e. everything is a support vector
i.e. all sensible directions show “data piling”
so “sensible methods are all nearly the same”
Including 1 - NN
HDLSS Asy’s: Geometrical Representation, V
Further Consequences of Geometric Representation
1. Inefficiency of DWD for uneven sample size (motivates weighted version, work in progress)
2. DWD more stable than SVM (based on deeper limiting distributions) (reflects intuitive feeling of sampling variation) (something like mean vs. median)
3. 1-NN rule inefficiency is quantified.
The Future of Geometrical Representation?
HDLSS version of “optimality” results?
“Contiguity” approach? Params depend on d?
Rates of Convergence?
Improvements of DWD?
(e.g. other functions of distance than inverse)
It is still early days …
NCI 60 Data
Recall from Sept. 6 & 8
NCI 60 Cell Lines
Interesting benchmark, since same cells
Data Web available:
http://discover.nci.nih.gov/datasetsNature2000.jsp
Both cDNA and Affymetrix Platforms
NCI 60: Fully Adjusted Data, Melanoma Cluster
BREAST.MDAMB435 BREAST.MDN MELAN.MALME3M MELAN.SKMEL2 MELAN.SKMEL5 MELAN.SKMEL28 MELAN.M14 MELAN.UACC62 MELAN.UACC257
NCI 60: Fully Adjusted Data, Leukemia Cluster
LEUK.CCRFCEM LEUK.K562 LEUK.MOLT4 LEUK.HL60 LEUK.RPMI8266 LEUK.SR
NCI 60: Views using DWD Dir’ns (focus on biology)
Real Clusters in NCI 60 Data?
From Sept. 8: Simple Visual Approach:
• Randomly relabel data (Cancer Types)
• Recompute DWD dir’ns & visualization
• Get heuristic impression from this
• Some types appeared signif’ly different
• Others did not
Deeper Approach:
Formal Hypothesis Testing
HDLSS Hypothesis Testing
Approach: DiProPerm Test
Direction – Projection – Permutation
Ideas:
• Find an appropriate Direction vector
• Project data into that 1-d subspace
• Construct a 1-d test statistic
• Analyze significance by Permutation
HDLSS Hypothesis Testing – DiProPerm test
DiProPerm Test
Context:
Given 2 sub-populations, X & Y
Are they from the same distribution?
Or significantly different?
H0: LX = LY vs. H1: LX ≠ LY
HDLSS Hypothesis Testing – DiProPerm test
Reasonable Direction vectors:
Mean Difference
SVM
Maximal Data Piling
DWD (used in the following)
Any good discrimination direction…
HDLSS Hypothesis Testing – DiProPerm test
Reasonable Projected 1-d statistics:
Two sample t-test (used here)
Chi-square test for different variances
Kolmogorov - Smirnov
Any good distributional test…
HDLSS Hypothesis Testing – DiProPerm test
DiProPerm Test Steps:
1. For original data: Find Direction vector; Project Data; Compute True Test Statistic
2. For (many) random relabellings of data: Find Direction vector; Project Data; Compute Perm’d Test Stat
3. Compare: True Stat among population of Perm’d Stat’s; Quantile gives p-value
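The steps above can be sketched in numpy; for simplicity this sketch uses the Mean Difference direction and the two-sample t statistic (the slides favor DWD), so it illustrates the recipe rather than the exact implementation.

```python
import numpy as np

def t_stat(A, B):
    """Project both samples on the Mean Difference direction, then
    compute the two-sample t statistic in that 1-d subspace."""
    w = A.mean(axis=0) - B.mean(axis=0)           # direction vector
    a, b = A @ w, B @ w                           # 1-d projections
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return (a.mean() - b.mean()) / se

def diproperm(X, Y, n_perm=200, seed=0):
    """Direction - Projection - Permutation: compare the true statistic
    with its distribution under random relabellings."""
    rng = np.random.default_rng(seed)
    t_true = t_stat(X, Y)
    pooled = np.vstack([X, Y])
    t_perm = []
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))        # random relabelling
        t_perm.append(t_stat(pooled[idx[:len(X)]], pooled[idx[len(X):]]))
    return float(np.mean(np.array(t_perm) >= t_true))   # permutation p-value
```

Note the direction is recomputed for every relabelling; reusing the original direction would bias the permutation null.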
HDLSS Hypothesis Testing – DiProPerm test
Remarks:
• Generally can’t use standard null dist’ns, e.g. Student’s t-table for the t-statistic
• Because Direction and Projection give nonstandard context
• I.e. violate traditional assumptions
• E.g. DWD finds separating directions, giving a completely invalid test
• This motivates the Permutation approach
Improved Statistical Power - NCI 60 Melanoma
Improved Statistical Power - NCI 60 Leukemia
Improved Statistical Power - NCI 60 NSCLC
Improved Statistical Power - NCI 60 Renal
Improved Statistical Power - NCI 60 CNS
Improved Statistical Power - NCI 60 Ovarian
Improved Statistical Power - NCI 60 Colon
Improved Statistical Power - NCI 60 Breast
Improved Statistical Power - Summary
Type       cDNA-t   Affy-t   Comb-t   Affy-P   Comb-P
Melanoma   36.8     39.9     51.8     e-7      0
Leukemia   18.3     23.8     27.5     0.12     0.00001
NSCLC      17.3     25.1     23.5     0.18     0.02
Renal      15.6     20.1     22.0     0.54     0.04
CNS        13.4     18.6     18.9     0.62     0.21
Ovarian    11.2     20.8     17.0     0.21     0.27
Colon      10.3     17.4     16.3     0.74     0.58
Breast     13.8     19.6     19.3     0.51     0.16
HDLSS Hypothesis Testing – DiProPerm test
Many Open Questions on DiProPerm Test:
Which Direction is “Best”?
Which 1-d Projected test statistic?
Permutation vs. altern’es (bootstrap?)???
How do these interact?
What are asymptotic properties?
Independent Component Analysis
Idea: Find dir’ns that maximize indepen’ce
Motivating Context: Signal Processing, Blind Source Separation
References:
• Cardoso (1989)
• Cardoso & Souloumiac (1993)
• Lee (1998)
• Hyvärinen and Oja (1999)
• Hyvärinen, Karhunen and Oja (2001)
Independent Component Analysis
ICA, motivating example: Cocktail party problem
Hear several simultaneous conversations
would like to “separate them”
Model for “conversations”: time series s_1(t) and s_2(t)
Independent Component Analysis
Cocktail Party Problem
Independent Component Analysis
ICA, motivating example: Cocktail party problem
What the ears hear:
Ear 1: Mixed version of signals: x_1(t) = a_11 s_1(t) + a_12 s_2(t)
Ear 2: A second mixture: x_2(t) = a_21 s_1(t) + a_22 s_2(t)
Independent Component Analysis
What the ears hear: Mixed versions
Independent Component Analysis
Goal: Recover “signal” s(t) = (s_1(t), s_2(t))′
from “data” x(t) = (x_1(t), x_2(t))′
for unknown “mixture matrix” A = [a_11 a_12; a_21 a_22],
where x(t) = A s(t), for all t
Goal is to find “separating weights” W,
so that s(t) = W x(t), for all t
Problem: W = A^{−1} would be fine,
but A is unknown
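A tiny numpy demonstration of this model (toy signals and a made-up mixing matrix A): if A were known, W = A^{-1} would recover the signals exactly; ICA's job is to find such a W without ever seeing A.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
s = np.vstack([np.sign(rng.standard_normal(n)),   # s_1(t): a binary "voice"
               rng.uniform(-1, 1, n)])            # s_2(t): a uniform "voice"

A = np.array([[1.0, 0.5],                         # mixing matrix (unknown to ICA)
              [0.3, 1.0]])
x = A @ s                                         # x(t) = A s(t): what the ears hear

W = np.linalg.inv(A)                              # separating weights, W = A^{-1}
s_hat = W @ x                                     # recovers s(t) exactly
```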
Independent Component Analysis
Solution 1: PCA
Independent Component Analysis
Solution 2: ICA
Independent Component Analysis
“Solutions” for Cocktail Party example:
Approach 1: PCA (on “population of 2-d vectors”)
Directions of Greatest Variability do not solve this problem
Approach 2: ICA (will describe method later)
Independent Component directions do solve the problem (modulo “sign changes” and “identification”)
Independent Component Analysis
Relation to FDA: recall “data matrix”
X = [X_11 … X_1n; ⋮ ; X_d1 … X_dn]   (d × n)
Signal Processing: focus on rows (d time series, for t = 1, …, n)
Functional Data Analysis: focus on columns (n data vectors)
Note: same 2 different viewpoints as dual problems in PCA
n
Independent Component Analysis
FDA Style Scatterplot View - Signals
{(s_1(t), s_2(t)) : t = 1, …, n}
Independent Component Analysis
FDA Style Scatterplot View - Data
{(x_1(t), x_2(t)) : t = 1, …, n}
Independent Component Analysis
FDA Style Scatterplot View:
• Scatterplots give hint how blind recovery is possible
• Affine Transformation x = A s stretches indep’t signals into dependent
• Inversion is key to ICA (even when A is unknown)
Independent Component Analysis
Why not PCA?
• Finds direction of greatest variability
• Wrong direction for signal separation
Independent Component Analysis
ICA Step 1:
• “sphere the data” (i.e. find linear transfo to make mean = 0, cov = I)
• i.e. work with Z = Σ̂^{−1/2} (X − X̄)
• requires Σ̂ of full rank (at least d ≤ n, i.e. no HDLSS)
• search for independence beyond linear and quadratic structure
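Step 1 can be sketched directly in numpy, sphering via the symmetric inverse square root of the sample covariance:

```python
import numpy as np

def sphere(X):
    """Sphere a d x n data matrix: return Z = Sigma^{-1/2} (X - mean),
    which has mean 0 and identity sample covariance. Needs a full-rank
    covariance, hence d <= n (no HDLSS)."""
    Xc = X - X.mean(axis=1, keepdims=True)
    Sigma = Xc @ Xc.T / X.shape[1]                # d x d sample covariance
    vals, vecs = np.linalg.eigh(Sigma)            # symmetric eigendecomposition
    Sigma_inv_half = vecs @ np.diag(vals ** -0.5) @ vecs.T
    return Sigma_inv_half @ Xc
```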
Independent Component Analysis
ICA Step 2:
• Find directions that make (sphered) data as independent as possible
• Worst case: Gaussian
Sphered data are independent
Interesting “converse application” of C.L.T.:
• For S_1 and S_2 independent (& non-Gaussian)
• X = u_1 S_1 + (1 − u_1²)^{1/2} S_2 is “more Gaussian” for intermediate u_1
• so maximal independence comes from least Gaussian directions
Independent Component Analysis
ICA Step 2:
• Find dir’ns that make (sphered) data as independent as possible
Recall “independence” means: Joint distribution is product of Marginals
In cocktail party example:
• Happens only when rotated so support parallel to axes
• Otherwise have blank areas, while marginals are non-zero
Independent Component Analysis
Parallel Idea (and key to algorithm):
Find directions that max non-Gaussianity
Reason:
• starting from independent coordinates
most projections are Gaussian
(since projection is “linear combo”)
Mathematics behind this:
Diaconis and Freedman (1984)
Independent Component Analysis
Worst case for ICA:
• Gaussian marginals
• Then sphered data are independent
• So have independence in all directions
• Thus can’t find useful directions
Gaussian distribution is characterized by:
Independent & spherically symmetric
Independent Component Analysis
Criteria for non-Gaussianity / independence:
• kurtosis (E X⁴ − 3 (E X²)², the 4th order cumulant)
• negative entropy
• mutual information
• nonparametric maximum likelihood
• “infomax” in neural networks
• interesting connections between these
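The kurtosis criterion is simple to compute; a quick sketch (sample sizes arbitrary) showing it vanishes for Gaussian data and is signed by tail weight:

```python
import numpy as np

def excess_kurtosis(x):
    """4th order cumulant E[X^4] - 3 (E[X^2])^2 of a centered sample:
    ~0 for Gaussian, negative for light tails, positive for heavy tails."""
    x = x - x.mean()
    return np.mean(x ** 4) - 3 * np.mean(x ** 2) ** 2

rng = np.random.default_rng(6)
k_gauss = excess_kurtosis(rng.standard_normal(100_000))  # ~ 0
k_unif = excess_kurtosis(rng.uniform(-1, 1, 100_000))    # light tails: < 0
k_lap = excess_kurtosis(rng.laplace(size=100_000))       # heavy tails: > 0
```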
Independent Component Analysis
Matlab Algorithm (optimizing any of above):
FastICA
• http://www.cis.hut.fi/projects/ica/fastica/
• Numerical gradient search method
• Can find directions iteratively
• Or by simultaneous optimization
• Appears fast, with good defaults
• Should we worry about local optima???
Independent Component Analysis
Notational summary:
1. First sphere data: Z = Σ̂^{−1/2} (X − X̄)
2. Apply ICA: find rotation W_S to make rows of S = W_S Z independent
3. Can transform back to original data scale: S̃ = Σ̂^{1/2} S
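For the 2-d case only, the whole pipeline can be sketched with a brute-force rotation search over the kurtosis criterion; this is an illustrative stand-in for FastICA's gradient steps, not the FastICA algorithm itself.

```python
import numpy as np

def ica_2d(X, n_angles=360):
    """2-d ICA sketch: sphere X (2 x n), then grid-search the rotation
    whose rows have maximal total |excess kurtosis|."""
    Xc = X - X.mean(axis=1, keepdims=True)
    Sigma = Xc @ Xc.T / X.shape[1]
    vals, vecs = np.linalg.eigh(Sigma)
    Z = vecs @ np.diag(vals ** -0.5) @ vecs.T @ Xc      # step 1: sphered data

    def kurt(v):
        return abs(np.mean(v ** 4) - 3 * np.mean(v ** 2) ** 2)

    best, S = -np.inf, Z
    for theta in np.linspace(0.0, np.pi / 2, n_angles):  # step 2: rotation search
        c, s = np.cos(theta), np.sin(theta)
        W_S = np.array([[c, -s], [s, c]])                # candidate rotation
        cand = W_S @ Z
        score = kurt(cand[0]) + kurt(cand[1])
        if score > best:
            best, S = score, cand
    return S
```

Searching only [0, π/2] suffices because further rotation merely permutes and flips the rows, which is exactly the identifiability ambiguity discussed next.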
Independent Component Analysis
Identifiability problem 1:
Generally can’t order rows of S (& W_S)
Since for a permutation matrix P
(pre-multiplication by P swaps rows)
(post-multiplication by P swaps columns)
z = A s = A P^{−1} P s, and s = W_S z gives P s = P W_S z
So P S and P W_S are also solutions (i.e. P S = P W_S Z)
Independent Component Analysis
Identifiability problem 1: Row Order
Saw this in Cocktail Party Example
FastICA: orders by non-Gaussian-ness?
Independent Component Analysis
Identifiability problem 2: Can’t find scale of elements of s
Since for a (full rank) diagonal matrix D
(pre-mult’n by D is scalar mult’n of rows)
(post-mult’n by D is scalar mult’n of col’s)
z = A s = A D^{−1} D s, and s = W_S z gives D s = D W_S z
So D S and D W_S are also solutions
Independent Component Analysis
Identifiability problem 2: Signal Scale
Not so clear in Cocktail Party Example
Independent Component Analysis
Signal Processing Scale identification (Hyvärinen and Oja):
Choose scale so each signal s_i(t) has unit average energy:
(1/n) Σ_t s_i(t)² = 1
• Preserves energy along rows of data matrix
• Explains same scales in Cocktail Party Example
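This convention is a one-liner in numpy (assuming the average-energy normalization above):

```python
import numpy as np

def unit_energy(S):
    """Rescale each row (signal) of S to unit average energy:
    the mean over t of s_i(t)^2 becomes 1. Resolves the scale
    ambiguity up to sign."""
    return S / np.sqrt(np.mean(S ** 2, axis=1, keepdims=True))
```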
Independent Component Analysis
Would like to do:
• More toy examples
• Illustrating how non-Gaussianity works
Like to see some? Check out old course notes:
http://www.stat.unc.edu/postscript/papers/marron/Teaching/CornellFDA/Lecture03-11-02/FDA03-11-02.pdf
http://www.stat.unc.edu/postscript/papers/marron/Teaching/CornellFDA/Lecture03-25-02/FDA03-25-02.pdf
Independent Component Analysis
One more “Would like to do”: ICA testing of multivariate Gaussianity
Usual approaches: 1-d tests on marginals
New Idea: use ICA to find “least Gaussian Directions”, and base the test on those.
Koch, Marron and Chen (2004)
Unfortunately Not Covered
• DWD & Micro-array Outcomes Data
• Windup from FDA04-22-02.doc
– General Conclusion
– Validation