Object Orie’d Data Analysis, Last Time


Page 1: Object Orie’d Data Analysis, Last Time

Object Orie’d Data Analysis, Last Time

HDLSS Asymptotics

• Studied from Dual Viewpoint

NCI 60 Data

• Visualization – found (DWD) directions that showed clusters of cancer types

• Investigated with DiProPerm test

HDLSS hypothesis testing

Page 2: Object Orie’d Data Analysis, Last Time

HDLSS Asymptotics

Interesting Idea from Travis Gaydos:

Interpret from viewpoint of dual space

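Recall from Aug. 25: for $Z_1, Z_2 \sim N_d(0, I)$

• Distance to origin: $\|Z_i\| = \left( \sum_{j=1}^{d} Z_{i,j}^2 \right)^{1/2} = d^{1/2} + O_p(1)$

• Pairwise Distance: $\|Z_1 - Z_2\| = \left( \sum_{j=1}^{d} (Z_{1,j} - Z_{2,j})^2 \right)^{1/2} = (2d)^{1/2} + O_p(1)$

• Angle from origin: $\mathrm{Angle}(Z_1, Z_2) = \cos^{-1}\left( \frac{\sum_{j=1}^{d} Z_{1,j} Z_{2,j}}{\|Z_1\| \, \|Z_2\|} \right) = 90^\circ + O_p(d^{-1/2})$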
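The three geometric facts on this slide (distance to origin, pairwise distance, angle) can be checked numerically. The following is a minimal sketch, not part of the original slides: it draws two independent standard Gaussian vectors in dimension d and compares the observed quantities with d^{1/2}, (2d)^{1/2} and 90 degrees. The function name `hdlss_geometry` and the chosen dimensions are illustrative assumptions.

```python
import math
import random


def hdlss_geometry(d, seed=0):
    """Draw Z1, Z2 ~ N_d(0, I); return (|Z1|, |Z1 - Z2|, angle in degrees)."""
    rng = random.Random(seed)
    z1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    z2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm1 = math.sqrt(sum(a * a for a in z1))
    norm2 = math.sqrt(sum(b * b for b in z2))
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(z1, z2)))
    cos_angle = sum(a * b for a, b in zip(z1, z2)) / (norm1 * norm2)
    return norm1, dist, math.degrees(math.acos(cos_angle))


# Ratios should approach 1 and the angle should approach 90 as d grows.
for d in (10, 1000, 100000):
    norm1, dist, angle = hdlss_geometry(d)
    print(f"d={d:6d}  |Z1|/sqrt(d)={norm1 / math.sqrt(d):.3f}  "
          f"|Z1-Z2|/sqrt(2d)={dist / math.sqrt(2 * d):.3f}  angle={angle:.2f}")
```

In low dimension the ratios and the angle vary noticeably from draw to draw; in high dimension the data behave almost deterministically, which is the point of the HDLSS view.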

Page 3: Object Orie’d Data Analysis, Last Time

HDLSS Asymptotics – Dual View

Would be interesting to try:

• Study (i.e. explore conditions for):

– Consistency

– Strong Inconsistency

for PCA direction vectors, from this viewpoint

Perhaps other things as well…

Page 4: Object Orie’d Data Analysis, Last Time

NCI 60 Data

Recall from:

• Aug. 28

• Aug. 30

NCI 60 Cancer Cell Lines Microarray Data

• Explored Data Combination

• cDNA & Affymetrix Measurements

• Right answer is known

Page 5: Object Orie’d Data Analysis, Last Time

Real Clusters in NCI 60 Data

Simple Visual Approach:

• Randomly relabel data (Cancer Types)

• Recompute DWD dir’ns & visualization

• Get heuristic impression from this

Deeper Approach

• Formal Hypothesis Testing

(Done later)

Page 6: Object Orie’d Data Analysis, Last Time

Real Clusters in NCI 60 Data?

From Aug. 30: Simple Visual Approach:

• Randomly relabel data (Cancer Types)

• Recompute DWD dir’ns & visualization

• Get heuristic impression from this

• Some types appeared signif’ly different

• Others did not

Deeper Approach:

Formal Hypothesis Testing

Page 7: Object Orie’d Data Analysis, Last Time

HDLSS Hypothesis Testing

Approach: DiProPerm Test

DIrection – PROjection – PERMutation

Ideas:

• Find an appropriate Direction vector

• Project data into that 1-d subspace

• Construct a 1-d test statistic

• Analyze significance by Permutation
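The four DiProPerm steps can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the implementation behind the slides: the direction is the difference of class means (a simple stand-in for the DWD direction used in the lecture), the 1-d statistic is a two-sample t statistic, and all function names and the toy data are hypothetical. The direction is recomputed for every relabelling.

```python
import random
import statistics


def mean_diff_direction(a, b):
    """DIrection: unit vector along mean(a) - mean(b) (stand-in for DWD)."""
    diff = [statistics.fmean(ca) - statistics.fmean(cb)
            for ca, cb in zip(zip(*a), zip(*b))]
    norm = sum(c * c for c in diff) ** 0.5
    return [c / norm for c in diff] if norm > 0 else diff


def project(rows, w):
    """PROjection: 1-d score of each row on direction w."""
    return [sum(r * c for r, c in zip(row, w)) for row in rows]


def t_statistic(pa, pb):
    """1-d test statistic: two-sample t statistic of the projected scores."""
    se = (statistics.variance(pa) / len(pa)
          + statistics.variance(pb) / len(pb)) ** 0.5
    return (statistics.fmean(pa) - statistics.fmean(pb)) / se if se > 0 else 0.0


def diproperm(a, b, n_perm=1000, seed=0):
    """Return (observed statistic, permutation p-value)."""
    rng = random.Random(seed)

    def statistic(g1, g2):
        w = mean_diff_direction(g1, g2)
        return t_statistic(project(g1, w), project(g2, w))

    observed = statistic(a, b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):  # PERMutation: relabel, then redo all steps
        rng.shuffle(pooled)
        if statistic(pooled[:len(a)], pooled[len(a):]) >= observed:
            hits += 1
    return observed, hits / n_perm


# Toy data (assumed for illustration): two classes with shifted means.
rng = random.Random(1)
d, n = 20, 15
class_a = [[rng.gauss(3.0, 1.0) for _ in range(d)] for _ in range(n)]
class_b = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
stat, p_value = diproperm(class_a, class_b, n_perm=200)
print(f"T = {stat:.2f}, permutation p-value = {p_value:.3f}")
```

Recomputing the direction inside each permutation matters: in high dimension even randomly labelled classes look separated along a direction fitted to those labels, so the null distribution has to include that fitting step.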

Page 8: Object Orie’d Data Analysis, Last Time

DiProPerm Simple Example 1, Totally Separate

Results:

Random relabelling gives much smaller Ts

Quantiles (over 1000 sim’s) give p-val of 0

I.e. Strongly conclusive

Conclude sub-populations are different

Page 9: Object Orie’d Data Analysis, Last Time

Needed final verification of Cross-platform Normal’n

• Is statistical power actually improved?

• Is there benefit to data combo by DWD?

• More data, more power?

• Will study now

Page 10: Object Orie’d Data Analysis, Last Time

Needed final verification of Cross-platform Normal’n

Summary of Results:

• P-values

– Combined Results Better for 7 out of 8 cases

– Combined significant, Affy not, in 3 cases (Leukemia, NSCLC, Renal)

– Shows combining platforms often worthwhile (because more data gives more power)

Comparison with previous heuristics…

Page 11: Object Orie’d Data Analysis, Last Time

Revisit Real Data (Cont.)

Previous Heuristic Results (from rand re-ord):

Strong Clust’s    Weak Clust’s    Not Clust’s
Melanoma          CNS             NSCLC
Leukemia          Ovarian         Breast
Renal             Colon

Statistically Sign’t (as expected)

Not Sign’t (as expected)

Surprising result (not consistent with vis’n)

Page 12: Object Orie’d Data Analysis, Last Time

Revisit Real Data (Cont.)

Sungkyu Jung Question:

How are those results driven by sample size?

Add sample size to above table….

Page 13: Object Orie’d Data Analysis, Last Time

Revisit Real Data (Cont.)

Previous Heuristic Results (from rand re-ord), with sample sizes:

Strong Clust’s    Weak Clust’s    Not Clust’s
Melanoma - 18     CNS - 12        NSCLC - 18
Leukemia - 12     Ovarian - 8     Breast - 12
Renal - 14        Colon - 12

Statistically Sign’t (as expected)

Not Sign’t (as expected)

Surprising result (not consistent with vis’n)

Page 14: Object Orie’d Data Analysis, Last Time

Revisit Real Data (Cont.)

Sungkyu Jung Question:

How are those results driven by sample size?

Add sample size to above table….

Good idea: Surprising result perhaps indeed due to larger sample size

Page 15: Object Orie’d Data Analysis, Last Time

DiProPerm Test

Particulate Matter Data

Consulting Class Project, for:

Lindsay Wichers, Penn Watkinson, EPA

Analysis by: Chihoon Lee

Page 16: Object Orie’d Data Analysis, Last Time

DiProPerm – Particulate Matter

Data:

• Measure Heart Rate of Rats

• Over time (several days)

• Treat with Particulate Matter

• Study effect

• See differences between treatments?

• Statistically significant?

Page 17: Object Orie’d Data Analysis, Last Time

DiProPerm – Particulate Matter

Page 18: Object Orie’d Data Analysis, Last Time

DiProPerm – Particulate Matter

Notes on curve view of data:

• Clear day – night effect

• Apparent changes after treatment

• Stronger effect for higher dose

• Effect diminishes over time

• Statistically significant differences?

• How does “signal” compare to “noise”?

Page 19: Object Orie’d Data Analysis, Last Time

DiProPerm – Particulate Matter

Alternate view of data:

• Each curve is a “data point”

• Study distribution of these “points”

• Show replicates as points

• To indicate “signal” vs. “noise” issues

Page 20: Object Orie’d Data Analysis, Last Time

DiProPerm – Particulate Matter

Page 21: Object Orie’d Data Analysis, Last Time

DiProPerm – Particulate Matter

Notes on PCA & DWD dir’n scatterplots:

• Dose effect looks strong (PC2 direction)

• Systematic Pattern of colors

• Ordered by doses

• Suggests important differences

• Statistically significant differences?

• How does “signal” compare to “noise”?

Address by DiProPerm tests

Page 22: Object Orie’d Data Analysis, Last Time

DiProPerm – Particulate Matter

Look for differences over 48 hours:

• Run DiProPerm

• Test Control vs. High Dose

• Study difference over long time scale

Page 23: Object Orie’d Data Analysis, Last Time

DiProPerm – Particulate Matter

Page 24: Object Orie’d Data Analysis, Last Time

DiProPerm – Particulate Matter

DiProPerm Results:

• P-value = 0.056

• Not quite significant

• “Noise” just overtakes “signal”

• Perhaps Interval of 48 hours is too long

• So try smaller interval

Day 0, 9 AM – 3 PM

Page 25: Object Orie’d Data Analysis, Last Time

DiProPerm – Particulate Matter

DiProPerm Results: Day 0, 9 AM – 3 PM

Group Comparison     p-value
Control vs. High     0.002
Control vs. Mid      0.004
Control vs. Low      0.059
Low vs. High         0.009
Low vs. Mid          0.083
Mid vs. High         0.060

Page 26: Object Orie’d Data Analysis, Last Time

DiProPerm – Particulate Matter

DiProPerm Results: Day 0, 9 AM – 3 PM

• Results consistent with data curves:

– C vs. H strongly different

– C vs. M & L vs. H significantly different

– Others not quite significant

• For more, related, results, see

Wichers, Lee, et al (2007)

Page 27: Object Orie’d Data Analysis, Last Time

DiProPerm – Particulate Matter

Page 28: Object Orie’d Data Analysis, Last Time

HDLSS Hypothesis Testing – DiProPerm test

Chuck Perou’s 500 Breast Cancer data

Based on data Merging (using DWD):

• UNC

– GEO, # = 102

– UnP, # = 93

• NKI

– Pub, # = 220

– 97, # = 97

Total # = 512

Page 29: Object Orie’d Data Analysis, Last Time

Perou 500 Data

Simple PCA view combined data

Hard to see structure

Page 30: Object Orie’d Data Analysis, Last Time

Perou 500 Data

PCA view – colored by cancer type

Shows up in PC1 (vs. others)

(good data combo by DWD)

Page 31: Object Orie’d Data Analysis, Last Time

Perou 500 Data

PCA view – add symbols for source

No obvious source effect

(good data combo by DWD)

Page 32: Object Orie’d Data Analysis, Last Time

Perou 500 Data

Rotate axes for better type separation

Carefully chosen DWD views

Separates quite well

Page 33: Object Orie’d Data Analysis, Last Time

Perou 500 Data

How distinct are classes?

Compare “signal” vs. “noise”

Measure statistical significance

Using DiProPerm test

Page 34: Object Orie’d Data Analysis, Last Time

Perou 500 Data

DiProPerm test: Normal vs. Rest

Pval = 0.22

Not strong evidence

Page 35: Object Orie’d Data Analysis, Last Time

Perou 500 Data

DiProPerm test: Normal vs. Rest

Pval = 0.22, Not strong evidence

OK, since “normal” means:

biopsy missed tumor

But mostly from cancer patients

Instead compare with “true normals”

Page 36: Object Orie’d Data Analysis, Last Time

Perou 500 Data

DiProPerm test: True Normal vs. Rest

Pval = 2.30E-06 , Very strong evidence

Makes sense.

Page 37: Object Orie’d Data Analysis, Last Time

Perou 500 Data

DiProPerm test: Cancer classes

• Luminals vs {Her2 & Basals}, pval = 0

• Her2 vs Basals, pval = 0

• Lum A vs Lum B, pval = 0.0068

All strongly conclusive

Adds statistical significance to early results

Page 38: Object Orie’d Data Analysis, Last Time

Perou 500 Data

Interesting questions:

• Was the DWD combination essential?

• Were individual groups sign’t anyway?

• What was value of DWD combo?

Page 39: Object Orie’d Data Analysis, Last Time

Perou 500 Data

DiProPerm Luminals vs {Her2 & Basals}:

• All Combined, p-val = 0

• UNC Combo, p-val = 0

• NKI Combo, p-val = 0

• UNC GEO, p-val = 0

• UNC UnP, p-val = 3e-14

• NKI Pub, p-val = 1e-11

• NKI 97, p-val = 0.00078

Page 40: Object Orie’d Data Analysis, Last Time

Perou 500 Data

DiProPerm Luminal A vs Luminal B:

• All Combined, p-val = 0.0068

• UNC Combo, p-val = 0.214

• NKI Combo, p-val = 0.014

• UNC GEO, p-val = 0.298

• UNC UnP, p-val = 0.396

• NKI Pub, p-val = 0.052

• NKI 97, p-val = 0.246

(shows clear value to combining)

Page 41: Object Orie’d Data Analysis, Last Time

Perou 500 Data

DiProPerm Her2 vs Basals:

• All Combined, p-val = 0

• UNC Combo, p-val = 0

• NKI Combo, p-val = 0

• UNC GEO, p-val = 0

• UNC UnP, p-val = 0.02

• NKI Pub, p-val = 0.00008

• NKI 97, p-val = 0.246

Page 42: Object Orie’d Data Analysis, Last Time

Perou 500 Data

Drawback to DiProPerm here:

• Classes found by clustering

• Different from e.g. NCI 60 classes

• So maybe not surprising they are different from random

• I.e. find significant differences

• Does this really mean: Cluster is really there???

Needs deeper thought…
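The concern on this slide can be illustrated with a small simulation. This is a hedged sketch, not an analysis from the lecture: it draws 1-d data from a single Gaussian population (so no real clusters exist), finds classes by 2-means, and then runs a permutation test on those classes. In 1-d the direction and projection steps are trivial, so only the permutation step remains. All names and settings are assumptions for illustration.

```python
import random
import statistics


def two_means_1d(values, n_iter=100):
    """Lloyd-style 2-means in 1-d, started from the smallest and largest value."""
    c_low, c_high = min(values), max(values)
    for _ in range(n_iter):
        low = [v for v in values if abs(v - c_low) <= abs(v - c_high)]
        high = [v for v in values if abs(v - c_low) > abs(v - c_high)]
        new_low, new_high = statistics.fmean(low), statistics.fmean(high)
        if (new_low, new_high) == (c_low, c_high):
            break
        c_low, c_high = new_low, new_high
    return low, high


def permutation_p_value(group_a, group_b, n_perm=1000, seed=0):
    """Permutation p-value for the absolute difference of group means."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(group_a) - statistics.fmean(group_b))
    pooled = group_a + group_b
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        a, b = pooled[:len(group_a)], pooled[len(group_a):]
        if abs(statistics.fmean(a) - statistics.fmean(b)) >= observed:
            hits += 1
    return hits / n_perm


rng = random.Random(42)
noise = [rng.gauss(0.0, 1.0) for _ in range(60)]  # one homogeneous population
low, high = two_means_1d(noise)
p_noise = permutation_p_value(low, high)
print(f"cluster sizes {len(low)} and {len(high)}, permutation p-value = {p_noise:.3f}")
```

Because the labels were chosen to separate the data, random relabellings almost never match the observed separation, and the test reports strong significance even though the population is homogeneous. A small p-value for clustering-derived classes is therefore weak evidence that the clusters are real.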

Page 43: Object Orie’d Data Analysis, Last Time

HDLSS Hypothesis Testing – DiProPerm test

Many Open Questions on DiProPerm Test:

• Which Direction is “Best”?

• Which 1-d Projected test statistic?

• Permutation vs. alternatives (bootstrap?)???

• How do these interact?

• What are asymptotic properties?

Page 44: Object Orie’d Data Analysis, Last Time

Clustering

Idea: Given data $X_1, \dots, X_n$

• Assign each object to a class

• Of similar objects

• Completely data driven

• I.e. assign labels to data

• “Unsupervised Learning”

Contrast to Classification (Discrimination)

• With predetermined classes

• “Supervised Learning”

Page 45: Object Orie’d Data Analysis, Last Time

Clustering

Important References:

• MacQueen (1967)

• Hartigan (1975)

• Kaufman and Rousseeuw (2005)

Page 46: Object Orie’d Data Analysis, Last Time

K-means Clustering

Main Idea: for data $X_1, \dots, X_n$

Partition indices $i = 1, \dots, n$ among classes $C_1, \dots, C_K$

Given index sets $C_1, \dots, C_K$

• that partition $\{1, \dots, n\}$

• represent clusters by “class mean” $\bar{X}_j$

where $\bar{X}_j = \frac{1}{\#(C_j)} \sum_{i \in C_j} X_i$ (within class means)

Page 47: Object Orie’d Data Analysis, Last Time

K-means Clustering

Given index sets $C_1, \dots, C_K$

Measure how well clustered, using Within Class Sum of Squares

$\sum_{j=1}^{K} \sum_{i \in C_j} \|X_i - \bar{X}_j\|^2$

Weak point: not very interpretable

Page 48: Object Orie’d Data Analysis, Last Time

K-means Clustering

Common Variation:

Put on scale of proportions (i.e. in [0,1])

By dividing “within class SS” by “overall SS”

Gives Cluster Index:

$CI(C_1, \dots, C_K) = \frac{\sum_{j=1}^{K} \sum_{i \in C_j} \|X_i - \bar{X}_j\|^2}{\sum_{i=1}^{n} \|X_i - \bar{X}\|^2}$
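The Cluster Index is simple to compute directly from its definition (within class SS divided by overall SS). The sketch below is illustrative only: the function name `cluster_index`, the representation of data as lists of coordinate lists, and the tiny 1-d data sets are assumptions. It also exercises the two extreme cases, CI = 0 and CI = 1.

```python
def cluster_index(data, labels):
    """CI = within-class sum of squares / overall sum of squares.

    data:   list of points, each a list of floats (not all identical)
    labels: class label for each point
    """
    def sum_of_squares(rows):
        center = [sum(col) / len(rows) for col in zip(*rows)]
        return sum((v - c) ** 2 for row in rows for v, c in zip(row, center))

    classes = {}
    for row, label in zip(data, labels):
        classes.setdefault(label, []).append(row)
    within = sum(sum_of_squares(rows) for rows in classes.values())
    return within / sum_of_squares(data)


points = [[0.0], [2.0], [10.0], [12.0]]
print(cluster_index(points, [0, 0, 1, 1]))  # tight clusters: small CI
print(cluster_index(points, [0, 1, 0, 1]))  # classes mix both groups: large CI

# Extreme cases on data where every point sits at its cluster mean.
at_means = [[0.0], [0.0], [4.0], [4.0]]
print(cluster_index(at_means, [0, 0, 1, 1]))  # all data at cluster means
print(cluster_index(at_means, [0, 1, 0, 1]))  # both class means equal overall mean
```

Dividing by the overall sum of squares makes the index unit-free, which addresses the interpretability weakness of the raw within class sum of squares.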

Page 49: Object Orie’d Data Analysis, Last Time

K-means Clustering

Notes on Cluster Index:

• CI = 0 when all data at cluster means

• CI small when $C_1, \dots, C_K$ gives tight clustering

(within SS contains little variation)

• CI big when $C_1, \dots, C_K$ gives poor clustering

(within SS contains most of variation)

• CI = 1 when all cluster means are same

$CI(C_1, \dots, C_K) = \frac{\sum_{j=1}^{K} \sum_{i \in C_j} \|X_i - \bar{X}_j\|^2}{\sum_{i=1}^{n} \|X_i - \bar{X}\|^2}$

Page 50: Object Orie’d Data Analysis, Last Time

K-means Clustering

Clustering Goal:

• Given data $X_1, \dots, X_n$

• Choose classes $C_1, \dots, C_K$

• To minimize

$CI(C_1, \dots, C_K) = \frac{\sum_{j=1}^{K} \sum_{i \in C_j} \|X_i - \bar{X}_j\|^2}{\sum_{i=1}^{n} \|X_i - \bar{X}\|^2}$
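For 1-d data and K = 2 the minimization can be done exactly, because the two optimal classes are always a lower group and an upper group of the sorted data. The sketch below tries every such split and keeps the one with the smallest CI. It is an illustration under these assumptions (1-d, K = 2, data not all identical), with hypothetical function names; in higher dimensions or for larger K, iterative heuristics are the usual approach instead.

```python
def within_ss(values):
    """Sum of squared deviations from the mean of a list of numbers."""
    center = sum(values) / len(values)
    return sum((v - center) ** 2 for v in values)


def best_two_means_1d(values):
    """Return (CI, low cluster, high cluster) minimizing CI over all 2-class splits."""
    xs = sorted(values)
    overall = within_ss(xs)
    best = None
    for k in range(1, len(xs)):  # split into first k and remaining n - k points
        ci = (within_ss(xs[:k]) + within_ss(xs[k:])) / overall
        if best is None or ci < best[0]:
            best = (ci, xs[:k], xs[k:])
    return best


ci, low, high = best_two_means_1d([0.1, 5.2, 0.0, 4.9, 5.0, -0.2])
print(f"CI = {ci:.4f}, clusters: {low} | {high}")
```

This brute-force version recomputes sums for each split, which is fine for small examples; running sums would make it linear after sorting.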

Page 51: Object Orie’d Data Analysis, Last Time

2-means Clustering

Study CI, using simple 1-d examples

• Varying Standard Deviation

Pages 52-62: Object Orie’d Data Analysis, Last Time

2-means Clustering (figure-only slides)

Page 63: Object Orie’d Data Analysis, Last Time

2-means Clustering

Study CI, using simple 1-d examples

• Varying Standard Deviation

• Varying Mean
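One way to explore how CI responds to the standard deviation and to the mean separation is a small simulation. The sketch below is an assumption-laden illustration, not a reconstruction of the slide figures: it draws 1-d samples from two Gaussian classes centered at -mean and +mean, computes CI for every split of the sorted data, and reports the minimum. The function names, sample sizes and parameter grid are all hypothetical.

```python
import random


def ci_curve(values):
    """CI for each split of the sorted data into {first k} and {rest}, k = 1..n-1."""
    xs = sorted(values)
    n = len(xs)
    total, total_sq = sum(xs), sum(v * v for v in xs)
    overall_ss = total_sq - total * total / n
    curve = []
    left_sum = left_sq = 0.0
    for k in range(1, n):
        left_sum += xs[k - 1]
        left_sq += xs[k - 1] ** 2
        right_sum, right_sq = total - left_sum, total_sq - left_sq
        within = (left_sq - left_sum ** 2 / k) + (right_sq - right_sum ** 2 / (n - k))
        curve.append(within / overall_ss)
    return curve


def two_cluster_sample(mean, sd, n_per_class, rng):
    """Half the points from N(-mean, sd^2), half from N(+mean, sd^2)."""
    return ([rng.gauss(-mean, sd) for _ in range(n_per_class)]
            + [rng.gauss(mean, sd) for _ in range(n_per_class)])


rng = random.Random(0)
settings = [(2.0, 0.25), (2.0, 1.0), (2.0, 4.0),   # varying standard deviation
            (0.0, 1.0), (1.0, 1.0), (4.0, 1.0)]    # varying mean
for mean, sd in settings:
    sample = two_cluster_sample(mean, sd, 50, rng)
    print(f"mean = +/-{mean}, sd = {sd}: min CI = {min(ci_curve(sample)):.3f}")
```

Even with mean = 0 (a single population) the minimum CI is well below 1, since splitting any spread-out sample in two removes a good share of the variation. A small CI alone therefore does not establish that two clusters are present.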

Pages 64-75: Object Orie’d Data Analysis, Last Time

2-means Clustering (figure-only slides)