Object Orie’d Data Analysis, Last Time
HDLSS Asymptotics
• Studied from Dual Viewpoint
NCI 60 Data
• Visualization – found (DWD) directions that showed clusters of cancer types
• Investigated with DiProPerm test
HDLSS hypothesis testing
HDLSS Asymptotics
Interesting Idea from Travis Gaydos:
Interpret from viewpoint of dual space
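Recall from Aug. 25: for $Z_1, Z_2 \sim N_d(0, I)$, as $d \to \infty$:
• Distance to origin: $\|Z_i\| = \left( \sum_{j=1}^{d} Z_{i,j}^2 \right)^{1/2} = d^{1/2} + O_p(1)$
• Pairwise Distance: $\|Z_1 - Z_2\| = \left( \sum_{j=1}^{d} (Z_{1,j} - Z_{2,j})^2 \right)^{1/2} = (2d)^{1/2} + O_p(1)$
• Angle from origin: $\mathrm{Angle}(Z_1, Z_2) = \cos^{-1}\left( \frac{\sum_{j=1}^{d} Z_{1,j} Z_{2,j}}{\|Z_1\| \, \|Z_2\|} \right) = 90^\circ + O_p(d^{-1/2})$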
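The three Gaussian facts recalled here (distance to origin, pairwise distance, angle) can be checked numerically. The following is a minimal simulation sketch, not part of the original slides; the function name `hdlss_geometry` and the chosen dimensions are illustrative assumptions.

```python
import numpy as np

def hdlss_geometry(d, seed=0):
    """Draw Z1, Z2 ~ N_d(0, I); return (norm of Z1, pairwise distance, angle in degrees)."""
    rng = np.random.default_rng(seed)
    z1 = rng.standard_normal(d)
    z2 = rng.standard_normal(d)
    norm = np.linalg.norm(z1)            # expected to be close to sqrt(d)
    dist = np.linalg.norm(z1 - z2)       # expected to be close to sqrt(2 d)
    cos_angle = z1 @ z2 / (np.linalg.norm(z1) * np.linalg.norm(z2))
    angle = np.degrees(np.arccos(cos_angle))  # expected to be close to 90 degrees
    return norm, dist, angle

# Rescaled quantities should approach 1 (and the angle 90) as d grows.
for d in (10, 1000, 100000):
    norm, dist, angle = hdlss_geometry(d)
    print(f"d={d:>6}: |Z|/sqrt(d)={norm / np.sqrt(d):.3f}  "
          f"|Z1-Z2|/sqrt(2d)={dist / np.sqrt(2 * d):.3f}  angle={angle:.2f} deg")
```

The deviations from $d^{1/2}$ and $(2d)^{1/2}$ stay bounded as $d$ grows, so the rescaled ratios tighten around 1.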
HDLSS Asymptotics – Dual View
Would be interesting to try:
• Study (i.e. explore conditions for):
– Consistency
– Strong Inconsistency
for PCA direction vectors, from this viewpoint
Perhaps other things as well…
NCI 60 Data
Recall from:
• Aug. 28
• Aug. 30
NCI 60 Cancer Cell Lines Microarray Data
• Explored Data Combination
• cDNA & Affymetrix Measurements
• Right answer is known
Real Clusters in NCI 60 Data
Simple Visual Approach:
• Randomly relabel data (Cancer Types)
• Recompute DWD dir’ns & visualization
• Get heuristic impression from this
Deeper Approach
• Formal Hypothesis Testing
(Done later)
Real Clusters in NCI 60 Data?
From Aug. 30: Simple Visual Approach:
• Randomly relabel data (Cancer Types)
• Recompute DWD dir’ns & visualization
• Get heuristic impression from this
• Some types appeared signif’ly different
• Others did not
Deeper Approach:
Formal Hypothesis Testing
HDLSS Hypothesis Testing
Approach: DiProPerm Test
DIrection – PROjection – PERMutation
Ideas:
• Find an appropriate Direction vector
• Project data into that 1-d subspace
• Construct a 1-d test statistic
• Analyze significance by Permutation
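The Direction, Projection, Permutation steps can be sketched in a few lines. This is only an illustration under stated assumptions: the direction is the normalized difference of class means (a simple stand-in, since DWD is not implemented here), the 1-d statistic is the difference of projected means, and the function name `diproperm` is hypothetical. The direction is recomputed for every relabelling.

```python
import numpy as np

def diproperm(X, labels, n_perm=1000, seed=0):
    """X: (n, d) data array; labels: boolean array of length n.
    Returns (observed statistic, permutation p-value)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels, dtype=bool)

    def statistic(lab):
        # DIrection: normalized mean difference (assumed stand-in for DWD)
        direction = X[lab].mean(axis=0) - X[~lab].mean(axis=0)
        direction /= np.linalg.norm(direction)
        # PROjection onto the 1-d subspace
        proj = X @ direction
        # 1-d test statistic: difference of projected class means
        return proj[lab].mean() - proj[~lab].mean()

    t_obs = statistic(labels)
    # PERMutation: relabel at random, recompute direction and statistic
    t_perm = np.array([statistic(rng.permutation(labels)) for _ in range(n_perm)])
    p_value = np.mean(t_perm >= t_obs)
    return t_obs, p_value

# Toy example: two clearly separated classes in d = 50 dimensions.
rng = np.random.default_rng(1)
X = rng.standard_normal((40, 50))
labels = np.arange(40) < 20
X[labels] += 3.0
t_obs, p_value = diproperm(X, labels)
print(t_obs, p_value)
```

The p-value here is the empirical fraction of relabelled statistics at least as large as the observed one, so its resolution is limited by `n_perm`.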
DiProPerm Simple Example 1, Totally Separate
Results:
Random relabelling gives much smaller Ts
Quantiles (over 1000 sim’s) give p-val of 0
I.e. Strongly conclusive
Conclude sub-populations are different
Needed final verification of Cross-platform Normal’n
• Is statistical power actually improved?
• Is there benefit to data combo by DWD?
• More data, more power?
• Will study now
Needed final verification of Cross-platform Normal’n
Summary of Results:
• P-values
– Combined Results Better for 7 out of 8 cases
– Combined significant, Affy not, in 3 cases (Leukemia, NSCLC, Renal)
– Shows combining platforms often worthwhile (because more data gives more power)
Comparison with previous heuristics…
Revisit Real Data (Cont.)
Previous Heuristic Results (from rand re-ord):
Strong Clust’s    Weak Clust’s    Not Clust’s
Melanoma          C N S           NSCLC
Leukemia          Ovarian         Breast
Renal             Colon
Statistically Sign’t (as expected)
Not Sign’t (as expected)
Surprising result (not consistent with vis’n)
Revisit Real Data (Cont.)
Sungkyu Jung Question:
How are those results driven by sample size?
Add sample size to above table….
Revisit Real Data (Cont.)
Previous Heuristic Results (from rand re-ord), with sample sizes:
Strong Clust’s    Weak Clust’s    Not Clust’s
Melanoma - 18     C N S - 12      NSCLC - 18
Leukemia - 12     Ovarian - 8     Breast - 12
Renal - 14        Colon - 12
Statistically Sign’t (as expected)
Not Sign’t (as expected)
Surprising result (not consistent with vis’n)
Revisit Real Data (Cont.)
Sungkyu Jung Question:
How are those results driven by sample size?
Add sample size to above table….
Good idea: Surprising result perhaps indeed due to larger sample size
DiProPerm Test
Particulate Matter Data
Consulting Class Project, for:
Lindsay Wichers, Penn Watkinson, EPA
Analysis by: Chihoon Lee
DiProPerm – Particulate Matter
Data:
• Measure Heart Rate of Rats
• Over time (several days)
• Treat with Particulate Matter
• Study effect
• See differences between treatments?
• Statistically significant?
DiProPerm – Particulate Matter
Notes on curve view of data:
• Clear day – night effect
• Apparent changes after treatment
• Stronger effect for higher dose
• Effect diminishes over time
• Statistically significant differences?
• How does “signal” compare to “noise”?
DiProPerm – Particulate Matter
Alternate view of data:
• Each curve is a “data point”
• Study distribution of these “points”
• Show replicates as points
• To indicate “signal” vs. “noise” issues
DiProPerm – Particulate Matter
Notes on PCA & DWD dir’n scatterplots:
• Dose effect looks strong (PC2 direction)
• Systematic Pattern of colors
• Ordered by doses
• Suggests important differences
• Statistically significant differences?
• How does “signal” compare to “noise”?
Address by DiProPerm tests
DiProPerm – Particulate Matter
Look for differences over 48 hours:
• Run DiProPerm
• Test Control vs. High Dose
• Study difference over long time scale
DiProPerm – Particulate Matter
DiProPerm Results:
• P-value = 0.056
• Not quite significant
• “Noise” just overtakes “signal”
• Perhaps Interval of 48 hours is too long
• So try smaller interval
Day 0, 9 AM – 3 PM
DiProPerm – Particulate Matter
DiProPerm Results: Day 0, 9 AM – 3 PM
Group Comparison      p-value
Control vs. High      0.002
Control vs. Mid       0.004
Control vs. Low       0.059
Low vs. High          0.009
Low vs. Mid           0.083
Mid vs. High          0.060
DiProPerm – Particulate Matter
DiProPerm Results: Day 0, 9 AM – 3 PM
• Results consistent with data curves:
– C vs. H strongly different
– C vs. M & L vs. H significantly different
– Others not quite significant
• For more, related, results, see
Wichers, Lee, et al (2007)
HDLSS Hypothesis Testing – DiProPerm test
Chuck Perou’s 500 Breast Cancer data
Based on data Merging (using DWD):
• UNC
– Geo, # = 102
– Unp, # = 93
• NKI # = 512
– Pub, # = 220
– 97, # = 97
Perou 500 Data
Simple PCA view combined data
Hard to see structure
Perou 500 Data
PCA view – colored by cancer type
Shows up in PC1 (vs. others)
(good data Combo by DWD)
Perou 500 Data
PCA view – add symbols for source
No obvious source effect
(good data Combo by DWD)
Perou 500 Data
Rotate axes for better type separation
Carefully chosen DWD views
Separates quite well
Perou 500 Data
How distinct are classes?
Compare “signal” vs. “noise”
Measure statistical significance
Using DiProPerm test
Perou 500 Data
DiProPerm test: Normal vs. Rest
Pval = 0.22, Not strong evidence
OK, since “normal” means:
biopsy missed tumor
But mostly from cancer patients
Instead compare with “true normals”
Perou 500 Data
DiProPerm test: True Normal vs. Rest
Pval = 2.30E-06, Very strong evidence
Makes sense.
Perou 500 Data
DiProPerm test: Cancer classes
• Luminals vs {Her2 & Basals}, pval = 0
• Her2 vs Basals, pval = 0
• Lum A vs Lum B, pval = 0.0068
All strongly conclusive
Adds statistical significance to early results
Perou 500 Data
Interesting questions:
• Was the DWD combination essential?
• Were individual groups sign’t anyway?
• What was value of DWD combo?
Perou 500 Data
DiProPerm Luminals vs {Her2 & Basals}:
• All Combined, p-val = 0
• UNC Combo, p-val = 0
• NKI Combo, p-val = 0
• UNC GEO, p-val = 0
• UNC UnP, p-val = 3e-14
• NKI Pub, p-val = 1e-11
• NKI 97, p-val = 0.00078
Perou 500 Data
DiProPerm Luminal A vs Luminal B:
• All Combined, p-val = 0.0068
• UNC Combo, p-val = 0.214
• NKI Combo, p-val = 0.014
• UNC GEO, p-val = 0.298
• UNC UnP, p-val = 0.396
• NKI Pub, p-val = 0.052
• NKI 97, p-val = 0.246
(shows clear value to combining)
Perou 500 Data
DiProPerm Her2 vs Basals:
• All Combined, p-val = 0
• UNC Combo, p-val = 0
• NKI Combo, p-val = 0
• UNC GEO, p-val = 0
• UNC UnP, p-val = 0.02
• NKI Pub, p-val = 0.00008
• NKI 97, p-val = 0.246
Perou 500 Data
Drawback to DiProPerm here:
• Classes found by clustering
• Different from e.g. NCI 60 classes
• So maybe not surprising they are different from random
• I.e. find significant differences
• Does this really mean: Cluster is really there???
Needs deeper thought…
HDLSS Hypothesis Testing – DiProPerm test
Many Open Questions on DiProPerm Test:
• Which Direction is “Best”?
• Which 1-d Projected test statistic?
• Permutation vs. alternatives (bootstrap?)???
• How do these interact?
• What are asymptotic properties?
Clustering
Idea: Given data $X_1, \ldots, X_n$
• Assign each object to a class
• Of similar objects
• Completely data driven
• I.e. assign labels to data
• “Unsupervised Learning”
Contrast to Classification (Discrimination)
• With predetermined classes
• “Supervised Learning”
Clustering
Important References:
• MacQueen (1967)
• Hartigan (1975)
• Kaufman and Rousseeuw (2005)
K-means Clustering
Main Idea: for data $X_1, \ldots, X_n$
Partition indices $i = 1, \ldots, n$
among classes $C_1, \ldots, C_K$
Given index sets $C_1, \ldots, C_K$
• that partition $\{1, \ldots, n\}$
• represent clusters by “class mean” $\bar{X}_j$
where $\bar{X}_j = \frac{1}{\#(C_j)} \sum_{i \in C_j} X_i$ (within class means)
K-means Clustering
Given index sets $C_1, \ldots, C_K$
Measure how well clustered, using
Within Class Sum of Squares
$\sum_{j=1}^{K} \sum_{i \in C_j} \| X_i - \bar{X}_j \|^2$
(where $\bar{X}_j$ is the mean of class $C_j$)
Weak point: not very interpretable
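K-means Clustering
Common Variation:
Put on scale of proportions (i.e. in [0,1])
By dividing “within class SS”
by “overall SS”
Gives Cluster Index:
$CI(C_1, \ldots, C_K) = \frac{\sum_{j=1}^{K} \sum_{i \in C_j} \| X_i - \bar{X}_j \|^2}{\sum_{i=1}^{n} \| X_i - \bar{X} \|^2}$
(where $\bar{X}_j$ is the mean of class $C_j$ and $\bar{X}$ is the overall mean)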
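The Cluster Index, within class SS divided by overall SS, is simple to compute directly. A minimal sketch follows; the function name `cluster_index` and the tiny data set are illustrative assumptions, not from the slides.

```python
import numpy as np

def cluster_index(X, labels):
    """Within-class sum of squares divided by overall sum of squares."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]                     # treat 1-d input as n points in 1 dimension
    labels = np.asarray(labels)
    total_ss = np.sum((X - X.mean(axis=0)) ** 2)
    within_ss = sum(
        np.sum((X[labels == k] - X[labels == k].mean(axis=0)) ** 2)
        for k in np.unique(labels)
    )
    return within_ss / total_ss

x = np.array([0.0, 0.0, 10.0, 10.0])
print(cluster_index(x, [0, 0, 1, 1]))  # every point sits at its class mean
print(cluster_index(x, [0, 1, 0, 1]))  # both class means equal the overall mean
```

The two labelings in the example hit the two extremes of the [0,1] scale.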
K-means Clustering
Notes on Cluster Index:
$CI(C_1, \ldots, C_K) = \frac{\sum_{j=1}^{K} \sum_{i \in C_j} \| X_i - \bar{X}_j \|^2}{\sum_{i=1}^{n} \| X_i - \bar{X} \|^2}$
• CI = 0 when all data at cluster means
• CI small when $C_1, \ldots, C_K$ gives tight clustering
(within SS contains little variation)
• CI big when $C_1, \ldots, C_K$ gives poor clustering
(within SS contains most of variation)
• CI = 1 when all cluster means are same
K-means Clustering
Clustering Goal:
• Given data $X_1, \ldots, X_n$
• Choose classes $C_1, \ldots, C_K$
• To minimize
$CI(C_1, \ldots, C_K) = \frac{\sum_{j=1}^{K} \sum_{i \in C_j} \| X_i - \bar{X}_j \|^2}{\sum_{i=1}^{n} \| X_i - \bar{X} \|^2}$
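The slides state the minimization goal but not an algorithm. One common heuristic for it is Lloyd's alternating iteration (assign each point to the nearest class mean, then recompute the means); the sketch below uses that, and it is an assumption that this matches what the lecture had in mind. It can stop at a local minimum. The function name `k_means` and the toy data are illustrative.

```python
import numpy as np

def k_means(X, K, n_iter=100, seed=0):
    """Lloyd's algorithm: returns (labels, class means) for an (n, d) array X."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Initialize class means at K distinct data points chosen at random
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest class mean
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute class means (keep the old one if a class is empty)
        new_centers = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(K)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Toy example: two well-separated groups of 30 points each (1-d data as an (n, 1) array).
rng = np.random.default_rng(2)
X = np.concatenate([rng.normal(0.0, 1.0, 30), rng.normal(50.0, 1.0, 30)])[:, None]
labels, centers = k_means(X, K=2)
print(labels, centers.ravel())
```

Since the overall SS does not depend on the classes, lowering the within class SS at each step also lowers CI.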
2-means Clustering
Study CI, using simple 1-d examples
• Varying Standard Deviation
[Figure slides: 2-means clustering of simple 1-d examples, varying standard deviation]
2-means Clustering
Study CI, using simple 1-d examples
• Varying Standard Deviation
• Varying Mean
[Figure slides: 2-means clustering of simple 1-d examples, varying mean]
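A rough numerical version of the 1-d study of CI can be sketched as follows. In one dimension the best 2-means partition splits the sorted data at a single cut point, so the minimum CI can be found exactly by trying every cut. The two-component Gaussian mixture, the sample size, and the function name `best_two_means_ci` are assumptions made for illustration; the slide examples themselves are not recoverable from this transcript.

```python
import numpy as np

def best_two_means_ci(x):
    """Exact 2-means in 1-d: try every split of the sorted data, return the smallest CI."""
    x = np.sort(np.asarray(x, dtype=float))
    total_ss = np.sum((x - x.mean()) ** 2)
    best = 1.0
    for cut in range(1, len(x)):
        left, right = x[:cut], x[cut:]
        within_ss = (np.sum((left - left.mean()) ** 2)
                     + np.sum((right - right.mean()) ** 2))
        best = min(best, within_ss / total_ss)
    return best

# Varying mean: two groups with standard deviation 1 and growing separation.
rng = np.random.default_rng(0)
n = 100
for mu in (0.0, 2.0, 4.0, 8.0):
    x = np.concatenate([rng.normal(-mu / 2, 1.0, n), rng.normal(mu / 2, 1.0, n)])
    print(f"mean separation {mu}: CI = {best_two_means_ci(x):.3f}")
```

Varying the standard deviation with the means held fixed can be explored the same way by changing the scale argument of `rng.normal`.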