Object Oriented Data Analysis, Last Time

HDLSS Asymptotics

• Studied from Dual Viewpoint

NCI 60 Data

• Visualization – found (DWD) directions that showed clusters of cancer types

• Investigated with DiProPerm test

HDLSS hypothesis testing

HDLSS Asymptotics

Interesting Idea from Travis Gaydos:

Interpret from viewpoint of dual space

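Recall from Aug. 25: for independent $Z_1, Z_2 \sim N_d(0, I)$

• Distance to origin: $\|Z_i\| = \left( \sum_{j=1}^{d} Z_{i,j}^2 \right)^{1/2} = d^{1/2} + O_p(1)$

• Pairwise Distance: $\|Z_1 - Z_2\| = \left( \sum_{j=1}^{d} (Z_{1,j} - Z_{2,j})^2 \right)^{1/2} = 2^{1/2} d^{1/2} + O_p(1)$

• Angle from origin: $\mathrm{Angle}(Z_1, Z_2) = \cos^{-1}\left( \frac{\sum_{j=1}^{d} Z_{1,j} Z_{2,j}}{\|Z_1\| \, \|Z_2\|} \right) = 90^\circ + O_p(d^{-1/2})$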
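The three limits above can be checked numerically. The following is a minimal sketch, not part of the original slides: the function name `hdlss_geometry` and the chosen dimensions are illustrative assumptions, and only the Python standard library is used.

```python
import math
import random


def hdlss_geometry(d, seed=0):
    """Draw independent Z1, Z2 ~ N_d(0, I).

    Returns (norm of Z1, distance between Z1 and Z2, angle in degrees).
    The name and interface are hypothetical, chosen for this illustration.
    """
    rng = random.Random(seed)
    z1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    z2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm1 = math.sqrt(sum(a * a for a in z1))
    norm2 = math.sqrt(sum(b * b for b in z2))
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(z1, z2)))
    cos_angle = sum(a * b for a, b in zip(z1, z2)) / (norm1 * norm2)
    angle = math.degrees(math.acos(cos_angle))
    return norm1, dist, angle


if __name__ == "__main__":
    # Ratios should approach 1 and the angle should approach 90 as d grows.
    for d in (10, 1000, 100000):
        norm1, dist, angle = hdlss_geometry(d)
        print(d, norm1 / math.sqrt(d), dist / math.sqrt(2 * d), angle)
```

Dividing by $d^{1/2}$ and $(2d)^{1/2}$ puts the first two quantities on a common scale, so the $O_p(1)$ error shows up as a relative error that shrinks like $d^{-1/2}$.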

http://www.stat-or.unc.edu/webspace/courses/marron/UNCstor891OODA-2007/10-25-07.ppt

HDLSS Asymptotics – Dual View

Would be interesting to try:

• Study (i.e. explore conditions for):

– Consistency

– Strong Inconsistency

for PCA direction vectors, from this viewpoint

Perhaps other things as well…

NCI 60 Data

Recall from:

• Aug. 28

• Aug. 30

NCI 60 Cancer Cell Lines Microarray Data

• Explored Data Combination

• cDNA & Affymetrix Measurements

• Right answer is known



Real Clusters in NCI 60 Data

Simple Visual Approach:

• Randomly relabel data (Cancer Types)

• Recompute DWD dir’ns & visualization

• Get heuristic impression from this

Deeper Approach

• Formal Hypothesis Testing

(Done later)

Real Clusters in NCI 60 Data?

From Aug. 30: Simple Visual Approach:

• Randomly relabel data (Cancer Types)

• Recompute DWD dir’ns & visualization

• Get heuristic impression from this

• Some types appeared signif’ly different

• Others did not

Deeper Approach:

Formal Hypothesis Testing


HDLSS Hypothesis Testing

Approach: DiProPerm Test

DIrection – PROjection – PERMutation

Ideas:

• Find an appropriate Direction vector

• Project data into that 1-d subspace

• Construct a 1-d test statistic

• Analyze significance by Permutation
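The four steps can be sketched in a few lines. This is an illustration under stated assumptions, not the implementation used for the results in these slides: the direction is the unit vector between class means (a simple stand-in for DWD), the 1-d statistic is the absolute difference of projected class means, and all function names are hypothetical.

```python
import math
import random


def mean_vector(rows):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def projected_statistic(x, y):
    """Direction + Projection + 1-d statistic for two classes x and y.

    Direction: unit vector between class means (assumed stand-in for DWD).
    Statistic: absolute difference of the projected class means (assumed).
    """
    diff = [a - b for a, b in zip(mean_vector(x), mean_vector(y))]
    length = math.sqrt(dot(diff, diff))
    if length == 0.0:
        return 0.0  # identical class means: no separation in any direction
    w = [v / length for v in diff]
    proj_x = [dot(row, w) for row in x]
    proj_y = [dot(row, w) for row in y]
    return abs(sum(proj_x) / len(proj_x) - sum(proj_y) / len(proj_y))


def diproperm(x, y, n_perm=200, seed=0):
    """Return (observed statistic, permutation p-value)."""
    rng = random.Random(seed)
    observed = projected_statistic(x, y)
    pooled = list(x) + list(y)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabelling of the pooled data
        # The direction is recomputed for every relabelling.
        if projected_statistic(pooled[:len(x)], pooled[len(x):]) >= observed:
            exceed += 1
    return observed, exceed / n_perm


def gaussian_class(n, d, shift, rng):
    """n points in d dimensions, N(shift, 1) independently per coordinate."""
    return [[rng.gauss(shift, 1.0) for _ in range(d)] for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(1)
    x = gaussian_class(10, 50, 0.0, rng)
    y = gaussian_class(10, 50, 3.0, rng)
    print(diproperm(x, y))
```

Recomputing the direction inside the permutation loop matters: in high dimensions a direction-finding method separates even randomly labelled data to some extent, so the permuted statistics must go through the same direction-finding step to give a fair comparison.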

DiProPerm Simple Example 1, Totally Separate

Results:

Random relabelling gives much smaller Ts

Quantiles (over 1000 sim’s) give p-val of 0

I.e. Strongly conclusive

Conclude sub-populations are different

Needed final verification of Cross-platform Normal’n

• Is statistical power actually improved?

• Is there benefit to data combo by DWD?

• More data, more power?

• Will study now

Needed final verification of Cross-platform Normal’n

Summary of Results:

• P-values

– Combined results better for 7 out of 8 cases

– Combined significant, Affy not, in 3 cases (Leukemia, NSCLC, Renal)

– Shows combining platforms often worthwhile (because more data gives more power)

Comparison with previous heuristics…

Revisit Real Data (Cont.)

Previous Heuristic Results (from random re-ordering):

Strong Clust’s    Weak Clust’s    Not Clust’s
Melanoma          CNS             NSCLC
Leukemia          Ovarian         Breast
Renal             Colon

Statistically Sign’t (as expected)

Not Sign’t (as expected)

Surprising result (not consistent with vis’n)

Revisit Real Data (Cont.)

Sungkyu Jung Question:

How are those results driven by sample size?

Add sample size to above table….

Revisit Real Data (Cont.)

Previous Heuristic Results (from random re-ordering), with sample sizes:

Strong Clust’s    Weak Clust’s    Not Clust’s
Melanoma - 18     CNS - 12        NSCLC - 18
Leukemia - 12     Ovarian - 8     Breast - 12
Renal - 14        Colon - 12

Statistically Sign’t (as expected)

Not Sign’t (as expected)

Surprising result (not consistent with vis’n)

Revisit Real Data (Cont.)

Sungkyu Jung Question:

How are those results driven by sample size?

Add sample size to above table….

Good idea: Surprising result perhaps indeed due to larger sample size

DiProPerm Test

Particulate Matter Data

Consulting Class Project, for:

Lindsay Wichers, Penn Watkinson, EPA

Analysis by: Chihoon Lee

DiProPerm – Particulate Matter

Data:

• Measure Heart Rate of Rats

• Over time (several days)

• Treat with Particulate Matter

• Study effect

• See differences between treatments?

• Statistically significant?


Notes on curve view of data:

• Clear day – night effect

• Apparent changes after treatment

• Stronger effect for higher dose

• Effect diminishes over time

• Statistically significant differences?

• How does “signal” compare to “noise”?


Alternate view of data:

• Each curve is a “data point”

• Study distribution of these “points”

• Show replicates as points

• To indicate “signal” vs. “noise” issues


Notes on PCA & DWD dir’n scatterplots:

• Dose effect looks strong (PC2 direction)

• Systematic Pattern of colors

• Ordered by doses

• Suggests important differences

• Statistically significant differences?

• How does “signal” compare to “noise”?

Address by DiProPerm tests


Look for differences over 48 hours:

• Run DiProPerm

• Test Control vs. High Dose

• Study difference over long time scale


DiProPerm Results:

• P-value = 0.056

• Not quite significant

• “Noise” just overtakes “signal”

• Perhaps Interval of 48 hours is too long

• So try smaller interval

Day 0, 9 AM – 3 PM


DiProPerm Results: Day 0, 9 AM – 3 PM

Group Comparison      p-value
Control vs. High      0.002
Control vs. Mid       0.004
Control vs. Low       0.059
Low vs. High          0.009
Low vs. Mid           0.083
Mid vs. High          0.060


DiProPerm Results: Day 0, 9 AM – 3 PM

• Results consistent with data curves:

– C vs. H strongly different

– C vs. M & L vs. H significantly different

– Others not quite significant

• For more, related, results, see

Wichers, Lee, et al (2007)

HDLSS Hypothesis Testing – DiProPerm test

Chuck Perou’s 500 Breast Cancer data

Based on data Merging (using DWD):

• UNC

– GEO, # = 102

– UnP, # = 93

• NKI

– Pub, # = 220

– 97, # = 97

• Total, # = 512

Perou 500 Data

Simple PCA view of combined data

Hard to see structure

Perou 500 Data

PCA view – colored by cancer type

Shows up in PC1 (vs. others)

(good data combo by DWD)

Perou 500 Data

PCA view – add symbols for source

No obvious source effect

(good data combo by DWD)

Perou 500 Data

Rotate axes for better type separation

Carefully chosen DWD views

Separates quite well

Perou 500 Data

How distinct are classes?

Compare “signal” vs. “noise”

Measure statistical significance

Using DiProPerm test

Perou 500 Data

DiProPerm test: Normal vs. Rest

Pval = 0.22, Not strong evidence

OK, since “normal” means:

biopsy missed tumor

But mostly from cancer patients

Instead compare with “true normals”

Perou 500 Data

DiProPerm test: True Normal vs. Rest

Pval = 2.30E-06, Very strong evidence

Makes sense.

Perou 500 Data

DiProPerm test: Cancer classes

• Luminals vs {Her2 & Basals}, pval = 0

• Her2 vs Basals, pval = 0

• Lum A vs Lum B, pval = 0.0068

All strongly conclusive

Adds statistical significance to early results

Perou 500 Data

Interesting questions:

• Was the DWD combination essential?

• Were individual groups sign’t anyway?

• What was value of DWD combo?

Perou 500 Data

DiProPerm Luminals vs {Her2 & Basals}:

• All Combined, p-val = 0

• UNC Combo, p-val = 0

• NKI Combo, p-val = 0

• UNC GEO, p-val = 0

• UNC UnP, p-val = 3e-14

• NKI Pub, p-val = 1e-11

• NKI 97, p-val = 0.00078

Perou 500 Data

DiProPerm Luminal A vs Luminal B:

• All Combined, p-val = 0.0068

• UNC Combo, p-val = 0.214

• NKI Combo, p-val = 0.014

• UNC GEO, p-val = 0.298

• UNC UnP, p-val = 0.396

• NKI Pub, p-val = 0.052

• NKI 97, p-val = 0.246

(shows clear value to combining)

Perou 500 Data

DiProPerm Her2 vs Basals:

• All Combined, p-val = 0

• UNC Combo, p-val = 0

• NKI Combo, p-val = 0

• UNC GEO, p-val = 0

• UNC UnP, p-val = 0.02

• NKI Pub, p-val = 0.00008

• NKI 97, p-val = 0.246

Perou 500 Data

Drawback to DiProPerm here:

• Classes found by clustering

• Different from e.g. NCI 60 classes

• So maybe not surprising they are different from random

• I.e. find significant differences

• Does this really mean: Cluster is really there???

Needs deeper thought…

HDLSS Hypothesis Testing – DiProPerm test

Many Open Questions on DiProPerm Test:

• Which Direction is “Best”?

• Which 1-d Projected test statistic?

• Permutation vs. alternatives (bootstrap?)???

• How do these interact?

• What are asymptotic properties?

Clustering

Idea: Given data $X_1, \dots, X_n$

• Assign each object to a class

• Of similar objects

• Completely data driven

• I.e. assign labels to data

• “Unsupervised Learning”

Contrast to Classification (Discrimination)

• With predetermined classes

• “Supervised Learning”

Clustering

Important References:

• MacQueen (1967)

• Hartigan (1975)

• Kaufman and Rousseeuw (2005)

K-means Clustering

Main Idea: for data $X_1, \dots, X_n$

Partition indices $i = 1, \dots, n$ among classes $C_1, \dots, C_K$

Given index sets $C_1, \dots, C_K$

• that partition $\{1, \dots, n\}$

• represent clusters by “class mean” $\bar{X}_j$

where $\bar{X}_j = \frac{1}{\#(C_j)} \sum_{i \in C_j} X_i$ (within class means)

K-means Clustering

Given index sets $C_1, \dots, C_K$

Measure how well clustered, using

Within Class Sum of Squares: $\sum_{j=1}^{K} \sum_{i \in C_j} \|X_i - \bar{X}_j\|^2$

Weak point: not very interpretable

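K-means Clustering

Common Variation:

Put on scale of proportions (i.e. in [0,1])

By dividing “within class SS”

by “overall SS”

Gives Cluster Index:

$CI(C_1, \dots, C_K) = \frac{\sum_{j=1}^{K} \sum_{i \in C_j} \|X_i - \bar{X}_j\|^2}{\sum_{i=1}^{n} \|X_i - \bar{X}\|^2}$

where $\bar{X}_j$ is the mean of class $C_j$ and $\bar{X}$ is the overall mean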
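The Cluster Index (within class sum of squares divided by overall sum of squares) is straightforward to compute directly from its definition. The sketch below is illustrative only: the function names are assumptions, data points are plain lists of floats, and each class is assumed to be a nonempty list of indices into the data.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mean_vector(rows):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def cluster_index(data, classes):
    """Within class SS divided by overall SS.

    data:    list of vectors (lists of floats)
    classes: list of nonempty index lists that partition range(len(data))
    Assumes the data are not all identical (overall SS > 0).
    """
    within = 0.0
    for index_set in classes:
        members = [data[i] for i in index_set]
        center = mean_vector(members)
        within += sum(squared_distance(x, center) for x in members)
    overall = mean_vector(data)
    total = sum(squared_distance(x, overall) for x in data)
    return within / total


if __name__ == "__main__":
    data = [[0.0], [2.0], [10.0], [12.0]]
    print(cluster_index(data, [[0, 1], [2, 3]]))  # 4/104, tight clustering
    print(cluster_index(data, [[0, 2], [1, 3]]))  # 100/104, poor clustering
```

In the small example the overall sum of squares is 104. Grouping neighbouring points leaves a within class sum of squares of 4, while mixing the two groups leaves 100, so the index moves from near 0 to near 1.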

K-means Clustering

Notes on Cluster Index:

$CI(C_1, \dots, C_K) = \frac{\sum_{j=1}^{K} \sum_{i \in C_j} \|X_i - \bar{X}_j\|^2}{\sum_{i=1}^{n} \|X_i - \bar{X}\|^2}$

• CI = 0 when all data at cluster means

• CI small when $C_1, \dots, C_K$ gives tight clustering

(within SS contains little variation)

• CI big when $C_1, \dots, C_K$ gives poor clustering

(within SS contains most of variation)

• CI = 1 when all cluster means are same
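The two boundary cases follow from the standard decomposition of the overall sum of squares into within class and between class parts. The derivation below is added as a worked step; it uses only the symbols already defined ($\bar{X}_j$ for class means, $\bar{X}$ for the overall mean, $\#(C_j)$ for class sizes).

```latex
\sum_{i=1}^{n} \|X_i - \bar{X}\|^2
  = \sum_{j=1}^{K} \sum_{i \in C_j} \|X_i - \bar{X}_j\|^2
  + \sum_{j=1}^{K} \#(C_j) \, \|\bar{X}_j - \bar{X}\|^2
\quad\Longrightarrow\quad
CI(C_1, \dots, C_K)
  = 1 - \frac{\sum_{j=1}^{K} \#(C_j) \, \|\bar{X}_j - \bar{X}\|^2}
             {\sum_{i=1}^{n} \|X_i - \bar{X}\|^2}
```

So CI = 0 exactly when the within class term vanishes (every $X_i$ equals its class mean), and CI = 1 exactly when the between class term vanishes (every $\bar{X}_j$ equals $\bar{X}$).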

K-means Clustering

Clustering Goal:

• Given data $X_1, \dots, X_n$

• Choose classes $C_1, \dots, C_K$

• To minimize $CI(C_1, \dots, C_K) = \frac{\sum_{j=1}^{K} \sum_{i \in C_j} \|X_i - \bar{X}_j\|^2}{\sum_{i=1}^{n} \|X_i - \bar{X}\|^2}$
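Since the denominator of CI does not depend on the classes, minimizing CI is the same as minimizing the within class sum of squares. One common heuristic for this is the Lloyd-style alternation sketched below. It is an illustration, not a method stated in these slides: the function names, the random initialization from data points, and the empty-class rule are all assumptions, and the iteration can stop at a local rather than global minimum.

```python
import random


def squared_distance(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mean_vector(rows):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def k_means(data, k, n_iter=100, seed=0):
    """Lloyd-style iteration; returns index sets C_1, ..., C_K as lists."""
    rng = random.Random(seed)
    # Assumed initialization: k distinct data points chosen at random.
    centers = [list(c) for c in rng.sample(data, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest class mean.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(x, centers[j]))
            for x in data
        ]
        if new_labels == labels:
            break  # assignments stable, so the class means are stable too
        labels = new_labels
        # Update step: recompute each class mean.
        for j in range(k):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:  # keep the old center if a class becomes empty
                centers[j] = mean_vector(members)
    return [[i for i, lab in enumerate(labels) if lab == j] for j in range(k)]


if __name__ == "__main__":
    data = [[0.0], [1.0], [2.0], [100.0], [101.0], [102.0]]
    print(k_means(data, 2))
```

Neither step can increase the within class sum of squares, which is why the alternation settles down. In practice several random starts are usually compared, keeping the partition with the smallest CI.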

2-means Clustering

Study CI, using simple 1-d examples

• Varying Standard Deviation

• Varying Mean
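A 1-d study of this kind can be reproduced in outline. In one dimension an optimal 2-means partition is always a split of the sorted data at some cut point, so trying every cut gives the smallest possible CI. The sketch below is an assumption-laden illustration (hypothetical function names, a two-Gaussian sample with means at plus and minus a chosen value), not the simulation behind the original figures.

```python
import random


def best_two_means_ci(values):
    """Smallest 2-class Cluster Index over all splits of the sorted 1-d data.

    Assumes at least two values that are not all identical.
    """
    xs = sorted(values)
    n = len(xs)
    overall = sum(xs) / n
    total = sum((x - overall) ** 2 for x in xs)
    best = 1.0
    for cut in range(1, n):  # left class = xs[:cut], right class = xs[cut:]
        left, right = xs[:cut], xs[cut:]
        m_left = sum(left) / len(left)
        m_right = sum(right) / len(right)
        within = (sum((x - m_left) ** 2 for x in left)
                  + sum((x - m_right) ** 2 for x in right))
        best = min(best, within / total)
    return best


def two_cluster_sample(n_per_class, mean, sd, rng):
    """n_per_class draws each from N(-mean, sd^2) and N(+mean, sd^2)."""
    return ([rng.gauss(-mean, sd) for _ in range(n_per_class)]
            + [rng.gauss(mean, sd) for _ in range(n_per_class)])


if __name__ == "__main__":
    rng = random.Random(0)
    # Varying mean with fixed standard deviation; swap the roles to vary sd.
    for mean in (0.0, 1.0, 2.0, 5.0):
        sample = two_cluster_sample(100, mean, 1.0, rng)
        print(mean, round(best_two_means_ci(sample), 3))
```

With no separation (a single Gaussian) the best CI is far from 1, since splitting a Gaussian at its center already removes a large share of the variation (about 1 - 2/π ≈ 0.36 remains in the limit). A small CI on its own therefore needs a reference value before it can be read as evidence of real clusters.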
