1Microarray Workshop
Introduction to Classification Issues in Microarray Data Analysis
Jane Fridlyand, Jean Yee Hwa Yang
University of California, San Francisco
Elsinore, Denmark, May 17-21, 2004
2Microarray Workshop
Brief Overview of the Life-Cycle
3Microarray Workshop
Biological verification and interpretation
Microarray experiment
Experimental design
Image analysis
Pre-processing
Biological question
Analysis: Testing, Estimation, Discrimination
Clustering
Life Cycle
Quality measurement
Failed
Pass
4Microarray Workshop
• The steps outlined in the “Life Cycle” need to be carefully thought through and re-adjusted for each data type/platform combination. Experimental design will impact what questions should be asked and may be answered once the data are collected.
• To call in the statistician after the experiment is done may be no more than asking him to perform a postmortem examination: he may be able to say what the experiment died of.
Sir RA Fisher
5Microarray Workshop
GeneChip Affymetrix
cDNA microarray
Nylon membrane
Agilent: Long oligo Ink Jet
Illumina Bead Array
CGH
SAGE
Different Technologies
6Microarray Workshop
Some statistical issues
• Designing gene expression experiments.
• Acquiring the raw data: image analysis.
• Assessing the quality of the data.
• Summarizing and removing artifacts from the data.
• Interpretation and analysis of the data:
- Discovering which genes are differentially expressed
- Discovering which genes exhibit interesting expression patterns
- Detection of gene regulatory mechanisms
- Classification of samples
- And many others…
Lots of other bioinformatics issues …
For a review see Smyth, Yang and Speed, “Statistical issues in microarray data analysis”, In: Functional Genomics: Methods and Protocols, Methods in Molecular Biology, Humana Press, March 2003
7Microarray Workshop
Short-oligonucleotide chip data:
• quality assessment,
• background correction,
• probe-level normalization,
• probe set summary.
Two-color spotted array data:
• quality assessment; diagnostic plots,
• background correction,
• array normalization.
CEL, CDF files gpr, gal files
probes by sample matrix of log-ratios or log-intensities
Analysis of expression data:• Identify D.E. genes, estimation and testing,• clustering, and • discrimination.
Quality assessment / Pre-processing
Array CGH data:
• quality assessment; diagnostic plots,
• background correction,
• clones summary,
• array normalization.
UCSF spot file
Image analysis / Analysis
8Microarray Workshop
Linear Models. Specific examples: t-tests, F-tests, empirical Bayes, SAM.
Examples
• Identify differentially expressed genes among two or more tumor subtypes or different cell treatments.
• Look for genes that have different time profiles between different mutants.
• Look for genes associated with survival.
Linear Models
9Microarray Workshop
Clustering
Algorithms•Hierarchical clustering•Self-organizing maps•Partition around medoids (pam)
Examples
• We can cluster cell samples (columns), e.g. to identify new / unknown tumor subclasses or cell subtypes using gene expression profiles.
• We can cluster genes (rows), e.g. using large numbers of yeast experiments, to identify groups of co-expressed genes.
10Microarray Workshop
Discrimination
Classification rules • DLDA or DQDA• k-nearest neighbor (knn) • Support vector machine (svm)• Classification tree
Figure: a classification tree built on a learning set splits first on Gene 1 (Mi1 < -0.67) and then on Gene 2 (Mi2 > 0.18) to separate B-ALL, AML, and T-ALL samples.
Questions
• Can we identify groups of genes that are predictive of a particular class of tumors?
• Can I use the expression profile of cancer patients to predict survival?
11Microarray Workshop
Annotation
Riken ID: ZX00049O01
GenBank accession: AV128498
LocusLink: 15903
Biochemical pathways (KEGG)
Nucleotide sequence: TCGTTCCATTTTTCTTTAGGGGGTCTTTCCCCGTCTTGGGGGGGAGGAAAAGTTCTGCTGCCCTGATTATGAACTCTATAATAGAGTATATAGCTTTTGTACCTTTTTTACAGGAAGGTGCTTTCTGTAATCATGTGATGTATATTAAACTTTTTATAAAAGTTAACATTTTGCATAATAAACCATTTTTG
Bay Genomics ES cells
UniGene: Mm.110
MGD: MGI:96398
Name: Inhibitor of DNA binding 3
Gene symbol: Idb3
Swiss-Prot: P20109
GO: GO:0000122, GO:0005634, GO:0019904
Map position: Chromosome 4, 66.0 cM
PubMed: 12858547, 2000388, etc.
Literature
12Microarray Workshop
What is your question?
• What are the target genes for my knock-out gene?
• Look for genes that have different time profiles between different cell types.
Gene discovery, differential expression
• Is a specified group of genes all up-regulated in a specified condition?
Gene set, differential expression
• Can I use the expression profile of cancer patients to predict survival?
• Identification of groups of genes that are predictive of a particular class of tumors?
Class prediction, classification
• Are there tumor sub-types not previously identified? • Are there groups of co-expressed genes?
Class discovery, clustering
• Detection of gene regulatory mechanisms. • Do my genes group into previously undiscovered pathways?
Clustering. Often expression data alone is not enough; one needs to incorporate sequence and other information.
13Microarray Workshop
Classification
14Microarray Workshop
Gene expression data Two color spotted array
Data on G genes for n samples
Genes
mRNA samples
Gene expression level of gene i in mRNA sample j
= (normalized) Log( Red intensity / Green intensity)
sample1 sample2 sample3 sample4 sample5 …
Gene 1   0.46   0.30   0.80   1.51   0.90 ...
Gene 2  -0.10   0.49   0.24   0.06   0.46 ...
Gene 3   0.15   0.74   0.04   0.10   0.20 ...
Gene 4  -0.45  -1.03  -0.79  -0.56  -0.32 ...
Gene 5  -0.06   1.06   1.35   1.09  -1.09 ...
15Microarray Workshop
Classification
• Task: assign objects to classes (groups) on the basis of measurements made on the objects
• Unsupervised: classes unknown, want to discover them from the data (cluster analysis)
• Supervised: classes are predefined, want to use a (training or learning) set of labeled objects to form a classifier for classification of future observations
16Microarray Workshop
Example: Tumor Classification
• Reliable and precise classification essential for successful cancer treatment
• Current methods for classifying human malignancies rely on a variety of morphological, clinical and molecular variables
• Uncertainties in diagnosis remain; likely that existing classes are heterogeneous
• Characterize molecular variations among tumors by monitoring gene expression (microarray)
• Hope: that microarrays will lead to more reliable tumor classification (and therefore more appropriate treatments and better outcomes)
17Microarray Workshop
Tumor Classification Using Gene Expression Data
Three main types of statistical problems associated with tumor classification:
• Identification of new/unknown tumor classes using gene expression profiles (unsupervised learning – clustering)
• Classification of malignancies into known classes (supervised learning – discrimination)
• Identification of “marker” genes that characterize the different tumor classes (feature or variable selection).
18Microarray Workshop
Clustering
19Microarray Workshop
Generic Clustering Tasks
• Estimating number of clusters
• Assign samples to the groups
• Assessing strength/confidence of cluster assignments for individual objects
20Microarray Workshop
What to cluster
• Samples: To discover novel subtypes of the existing groups or entirely new partitions. Their utility needs to be confirmed with other types of data, e.g. clinical information.
• Genes: To discover groups of co-regulated genes/ESTs and use these groups to infer function where it is unknown using members of the groups with known function.
21Microarray Workshop
Basic principles of clustering
Aim: to group observations or variables that are “similar” based on predefined criteria.
Issues: Which genes / arrays to use? Which similarity or dissimilarity measure?
Which method to use to join clusters/observations? Which clustering algorithm?
How to validate the resulting clusters?
It is advisable to reduce the number of genes from the full set to some more manageable number, before clustering. The basis for this reduction is usually quite context specific and varies depending on what is being clustered, genes or arrays.
22Microarray Workshop
Array Data
For each gene, calculate a summary statistic and/or adjusted p-value
Clustering
Clustering of genes
Set of candidate DE genes. Biological verification
Descriptive interpretation
Similarity metrics
Clustering algorithm
23Microarray Workshop
Array Data
Set of samples to cluster
Clustering
Clustering of samples and genes
Set of genes to use in clustering (DO NOT use class labels in the set determination).
Descriptive interpretation of genes separating novel subgroups of the samples
Similarity metrics
Clustering algorithm
Validation of clusters with clinical data
24Microarray Workshop
Which similarity or dissimilarity measure?
• A metric is a measure of the similarity or dissimilarity between two data objects
• Two main classes of metric:
- Correlation coefficients (similarity): compare the shape of expression curves. Types of correlation: centered, un-centered, rank correlation.
- Distance metrics (dissimilarity): City Block (Manhattan) distance, Euclidean distance.
25Microarray Workshop
• Pearson correlation coefficient (centered correlation):
r = (1/n) Σ_{i=1}^{n} ((x_i − x̄)/S_x) · ((y_i − ȳ)/S_y)
where S_x and S_y are the standard deviations of x and y.
• Others include Spearman’s and Kendall’s.
Correlation (a measure between -1 and 1)
Positive correlation Negative correlation
You can use absolute correlation to capture both positive and negative correlation
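As a minimal numpy sketch of these two ideas (the function names are illustrative, not from the workshop code): the Pearson correlation as a similarity, and one-minus-(absolute)-correlation as the dissimilarity a clustering routine would consume.

```python
import numpy as np

def pearson(x, y):
    """Centered (Pearson) correlation between two expression profiles."""
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

def corr_dissimilarity(x, y, absolute=False):
    """1 - correlation (or 1 - |correlation|) as a dissimilarity for clustering."""
    r = pearson(x, y)
    return 1.0 - (abs(r) if absolute else r)
```

With `absolute=True`, a gene and its mirror image (negative correlation) get dissimilarity near zero, capturing both positive and negative co-expression as the slide suggests.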
26Microarray Workshop
Potential pitfalls
Correlation = 1
27Microarray Workshop
Distance metrics
• City Block (Manhattan) distance:
- sum of absolute differences across dimensions
- less sensitive to outliers
- diamond-shaped clusters
• Euclidean distance:
- most commonly used distance
- sphere-shaped clusters
- corresponds to the geometric distance in multidimensional space
d(X, Y) = Σ_i |x_i − y_i|  (Manhattan)
d(X, Y) = √( Σ_i (x_i − y_i)² )  (Euclidean)
where gene X = (x1,…,xn) and gene Y=(y1,…,yn)
Figure: genes X and Y plotted in the space of Condition 1 vs. Condition 2, illustrating the two distances.
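The two formulas above can be written directly in numpy (a small sketch; the names are illustrative):

```python
import numpy as np

def manhattan(x, y):
    """City Block distance: sum of absolute differences across dimensions."""
    return float(np.abs(np.asarray(x, float) - np.asarray(y, float)).sum())

def euclidean(x, y):
    """Euclidean distance: geometric distance in multidimensional space."""
    d = np.asarray(x, float) - np.asarray(y, float)
    return float(np.sqrt(d @ d))
```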
28Microarray Workshop
Euclidean vs Correlation (I)
• Euclidean distance
• Correlation
29Microarray Workshop
How to Compute Group Similarity?
Given two groups g1 and g2, four popular methods:
• Single-link: s(g1, g2) = similarity of the closest pair
• Complete-link: s(g1, g2) = similarity of the furthest pair
• Average-link: s(g1, g2) = average similarity over all pairs
• Centroid: s(g1, g2) = distance between the centroids of the two clusters
Supplementary slide
Adapted from internet
30Microarray Workshop
Examples of clustering methods (distance between clusters):
• Single (nearest neighbor): leads to “cluster chains”
• Complete (furthest neighbor): leads to small compact clusters
• Average (mean) linkage
• Distance between centroids
31Microarray Workshop
Comparison of the Three Methods
• Single-link- Elongated clusters - Individual decision, sensitive to outliers
• Complete-link- Compact clusters - Individual decision, sensitive to outliers
• Average-link or centroid- “In between” - Group decision, insensitive to outliers
• Which one is the best? Depends on what you need!
Adapted from internet
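The four group-similarity rules above are easy to state in code. A small numpy sketch (the function name and signature are illustrative, not a standard library API):

```python
import numpy as np

def group_distance(g1, g2, link="single"):
    """Distance between two clusters under the four linkage rules above."""
    g1 = np.atleast_2d(np.asarray(g1, float))
    g2 = np.atleast_2d(np.asarray(g2, float))
    # all pairwise Euclidean distances between points of g1 and points of g2
    d = np.sqrt(((g1[:, None, :] - g2[None, :, :]) ** 2).sum(axis=-1))
    if link == "single":        # closest pair
        return float(d.min())
    if link == "complete":      # furthest pair
        return float(d.max())
    if link == "average":       # mean over all pairs
        return float(d.mean())
    if link == "centroid":      # distance between the two centroids
        c = g1.mean(axis=0) - g2.mean(axis=0)
        return float(np.sqrt(c @ c))
    raise ValueError("unknown linkage: " + link)
```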
32Microarray Workshop
Clustering algorithms
• Clustering algorithms come in two basic flavors
Partitioning Hierarchical
33Microarray Workshop
Partitioning methods
• Partition the data into a pre-specified number k of mutually exclusive and exhaustive groups.
• Iteratively reallocate the observations to clusters until some criterion is met, e.g. minimize within cluster sums of squares. Ideally, dissimilarity between clusters will be maximized while it is minimized within clusters.
• Examples:
- k-means, self-organizing maps (SOM), PAM, etc.
- Fuzzy clustering (each object is assigned a probability of being in a cluster): needs a stochastic model, e.g. Gaussian mixtures.
34Microarray Workshop
K = 2
Partitioning methods
35Microarray Workshop
K = 4
Partitioning methods
36Microarray Workshop
Example of a partitioning algorithm: K-Means or PAM (Partitioning Around Medoids)
1. Given a similarity function
2. Start with k randomly selected data points
3. Assume they are the centroids (medoids) of k clusters
4. Assign every data point to a cluster whose centroid (medoid) is the closest to the data point
5. Recompute the centroid (medoid) for each cluster
6. Repeat this process until the similarity-based objective function converges
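The six steps above (centroid version) can be sketched in a few lines of numpy; this is a minimal illustration of Lloyd's algorithm, not production clustering code:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """K-means following steps 2-6 above (centroid, not medoid, version)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, float)
    # steps 2-3: k randomly selected data points serve as initial centroids
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # step 4: assign every point to the cluster with the closest centroid
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        # step 5: recompute each centroid as the mean of its cluster
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):   # step 6: the objective has converged
            break
        centers = new
    return labels, centers
```

PAM differs in that the cluster representative is constrained to be an actual data point (a medoid), which makes it usable with arbitrary dissimilarities.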
37Microarray Workshop
Mixture Model for Clustering
P(X|Cluster1)
P(X|Cluster2)
P(X|Cluster3)
P(X) = π₁ P(X|Cluster1) + π₂ P(X|Cluster2) + π₃ P(X|Cluster3), where X|Clusterᵢ ~ N(μᵢ, σᵢ²I) and πᵢ is a cluster prior.
Adapted from internet
38Microarray Workshop
Mixture Model Estimation
• Likelihood function (generally Gaussian)
• Parameters: e.g., πᵢ, μᵢ, σᵢ
• Using the EM algorithm
- Similar to “soft” k-means
• Number of clusters can be determined using a model-selection criterion, e.g. BIC (Raftery and Fraley, 1998)
p(x) = Σ_{i=1}^{k} πᵢ (1/√(2πσᵢ²)) exp( −(x − μᵢ)² / (2σᵢ²) )
Adapted from internet
39Microarray Workshop
Hierarchical methods
• Hierarchical clustering methods produce a tree or dendrogram.
• They avoid specifying how many clusters are appropriate by providing a partition for each k obtained from cutting the tree at some level.
• The tree can be built in two distinct ways- bottom-up: agglomerative clustering (usually used).- top-down: divisive clustering.
40Microarray Workshop
Agglomerative Methods
• Start with n mRNA sample (or G gene) clusters
• At each step, merge the two closest clusters using a measure of between-cluster dissimilarity which reflects the shape of the clusters
The distance between clusters is defined by the method used (e.g., for complete linkage, the distance is the distance between the furthest pair of points in the two clusters)
Supplementary slide
41Microarray Workshop
Divisive Methods
• Start with only one cluster
• At each step, split clusters into two parts
• Advantage: Obtain the main structure of the data (i.e. focus on upper levels of dendrogram)
• Disadvantage: Computational difficulties when considering all possible divisions into two groups
Divisive methods are rarely utilized in microarray data analysis.
Supplementary slide
42Microarray Workshop
Figure: agglomerative clustering of five points in two-dimensional space. Merges: (1,5), then (1,2,5) and (3,4), finally (1,2,3,4,5); shown as a dendrogram over the leaf order 1 5 2 3 4.
43Microarray Workshop
Figure: the same agglomerative dendrogram, illustrating tree re-ordering: the leaves can be displayed in different orders without changing the clustering.
44Microarray Workshop
Partitioning vs. hierarchical
Partitioning:
Advantages
• Optimal for certain criteria.
• Objects automatically assigned to clusters.
Disadvantages
• Need initial k.
• Often require long computation times.
• All objects are forced into a cluster.
Hierarchical:
Advantages
• Faster computation.
• Visual.
Disadvantages
• Unrelated objects are eventually joined.
• Rigid: cannot correct later for erroneous decisions made earlier.
• Hard to define clusters – still need to know “where to cut”.
Note that hierarchical clustering results may be used as the starting points for the partitioning or model-based algorithms
45Microarray Workshop
Clustering microarray data
• Clustering leads to readily interpretable figures and can be helpful for identifying patterns in time or space.
Examples:
• We can cluster cell samples (columns), e.g. for the identification of new / unknown tumor classes or cell subtypes using gene expression profiles.
• We can cluster genes (rows), e.g. using large numbers of yeast experiments, to identify groups of co-regulated genes.
• We can cluster genes (rows) to reduce redundancy (cf. variable selection) in predictive models.
46Microarray Workshop
Estimating number of clusters using silhouette (see PAM)
Define the silhouette width of an observation as:
S = (b − a) / max(a, b)
where a is the average dissimilarity to all the points in its own cluster and b is the smallest, over the other clusters, of the average dissimilarity to the objects in that cluster.
Intuitively, objects with large S are well-clustered, while those with small S tend to lie between clusters.
How many clusters? Perform clustering for a sequence of numbers of clusters k and choose the number of clusters corresponding to the largest average silhouette.
The issue of the number of clusters in the data is most relevant for novel class discovery, i.e. for clustering samples.
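The silhouette width can be computed directly from a distance matrix. A minimal numpy sketch of the definition above (illustrative names; PAM implementations compute this for you):

```python
import numpy as np

def silhouette_widths(X, labels):
    """S = (b - a)/max(a, b) for every observation, as defined above."""
    X = np.asarray(X, float)
    labels = np.asarray(labels)
    # full matrix of pairwise Euclidean distances
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    S = np.empty(len(X))
    for i in range(len(X)):
        own = (labels == labels[i])
        own[i] = False
        a = D[i, own].mean()  # average dissimilarity to its own cluster
        # b: smallest average dissimilarity to any other cluster
        b = min(D[i, labels == c].mean() for c in np.unique(labels) if c != labels[i])
        S[i] = (b - a) / max(a, b)
    return S
```

Averaging `silhouette_widths` over all observations for each candidate k and taking the maximizing k implements the rule on this slide.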
47Microarray Workshop
Estimating Number of Clusters with Silhouette (ctd)
Compute the average silhouette for k = 3 and compare it with the results for other k’s.
48Microarray Workshop
Estimating number of clusters using reference distribution
Idea: define a goodness-of-clustering score to minimize, e.g. the pooled within-cluster sum of squares (WSS) around the cluster means, reflecting the compactness of the clusters:
W_k = Σ_{r=1}^{k} (1 / (2 n_r)) D_r
where n_r and D_r are the number of points in cluster r and the sum of all pairwise distances within it, respectively.
The gap statistic for k clusters is then defined as:
Gap_n(k) = E*_n[ log(W_k) ] − log(W_k)
where E*_n is the average under a sample of the same size from a reference distribution. The reference distribution can be generated either parametrically (e.g. from a multivariate distribution) or non-parametrically (e.g. by sampling from the marginal distributions of the variables). The first local maximum is chosen as the number of clusters (the actual rule is slightly more complicated) (Tibshirani et al., 2001)
Adapted from internet
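A rough numpy sketch of the gap idea, under simplifying assumptions: the reference is a uniform box over the observed ranges, and `split_by_mean` is a toy stand-in for a real clustering routine (all names here are illustrative):

```python
import numpy as np

def log_Wk(X, labels):
    """log of W_k = sum_r D_r / (2 n_r), D_r = sum of pairwise squared distances."""
    W = 0.0
    for c in np.unique(labels):
        pts = X[labels == c]
        D = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=-1)
        W += D.sum() / (2 * len(pts))
    return np.log(W)

def split_by_mean(X, k):
    """Toy stand-in for a clustering routine: k=1 -> one cluster, k=2 -> split on mean."""
    if k == 1:
        return np.zeros(len(X), dtype=int)
    return (X[:, 0] > X[:, 0].mean()).astype(int)

def gap_stat(X, k, cluster_fn=split_by_mean, B=20, seed=0):
    """Gap_n(k) = mean_b log(W_k*) - log(W_k), with a uniform reference box."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    ref = [log_Wk(Xb, cluster_fn(Xb, k))
           for Xb in (rng.uniform(lo, hi, size=X.shape) for _ in range(B))]
    return float(np.mean(ref) - log_Wk(X, cluster_fn(X, k)))
```

For clearly separated data, the gap at k = 2 exceeds the gap at k = 1, since the real W_k drops much faster than the reference W_k*.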
49Microarray Workshop
Estimating number of clusters
There are other resampling-based (e.g. Dudoit and Fridlyand, 2002) and non-resampling-based rules for estimating the number of clusters (for a review see Milligan and Cooper (1978) and Dudoit and Fridlyand (2002)).
The bottom line is that none work very well in complicated situations and, to a large extent, clustering lies outside the usual statistical framework.
It is always reassuring when you are able to characterize newly discovered clusters using information that was not used for the clustering.
50Microarray Workshop
Confidence in the individual cluster assignments
Want to assign confidence to individual observations of being in their assigned clusters.
•Model-based clustering: natural probability interpretation
•Partitioning methods: silhouette
•Dudoit and Fridlyand (2003) presented a resampling-based approach that assigns confidence by computing the proportion of resampling runs in which an observation ends up in its assigned cluster.
51Microarray Workshop
Tight clustering (genes)
Identifies small stable gene clusters by not attempting to cluster all the genes. Thus, it does not require estimating the number of clusters or assigning all points to clusters. This aids the interpretability and validity of the results. (Tseng et al., 2003)
Algorithm:
For sequence of k > k0:
1. Identify the set of genes that are consistently grouped together when genes are repeatedly sub-sampled. Order those sets by size. Consider the top largest q sets for each k.
2. Stop when for (k, (k+1)), the two sets are nearly identical. Take the set corresponding to (k+1). Remove that set from the dataset.
3. Set k0 = k0 -1 and repeat the procedure.
52Microarray Workshop
Two-way clustering of genes and samples.
Refers to methods that use samples and genes simultaneously to extract information. These methods are not yet well developed.
Some examples of the approaches include Block Clustering (Hartigan, 1972) which repeatedly rearranges rows and columns to obtain the largest reduction of total within block variance.
Another method is based on Plaid Models (Lazzeroni and Owen, 2002)
Friedman and Meulman (2002) present an algorithm that clusters samples based on subsets of attributes, i.e. each group of samples can be characterized by a different gene set.
53Microarray Workshop
Applications of clustering to themicroarray data
Alizadeh et al (2000) Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling.
•Three subtypes of lymphoma (FL, CLL and DLBCL) have different genetic signatures (81 cases total).
•The DLBCL group can be partitioned into two subgroups with significantly different survival (39 DLBCL cases).
54Microarray Workshop
Taken from Alizadeh et al, “Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling”, Nature, February 2000.
Clustering both cell samples and genes
55Microarray Workshop
Clustering cell samples: discovering sub-groups
Taken from Alizadeh et al (Nature, 2000)
56Microarray Workshop
Attempt at validation of DLBCL subgroups
Taken from Alizadeh et al (Nature, 2000)
57Microarray Workshop
Yeast cell cycle (Cho et al, 1998): 6 × 5 SOM with 828 genes
Taken from Tamayo et al, (PNAS, 1999)
Clustering genes: finding different patterns in the data
58Microarray Workshop
Summary
Which clustering method should I use?
- What is the biological question?
- Do I have a preconceived notion of how many clusters there should be?
- Hard or soft boundaries between clusters?
Keep in mind:
- Clustering cannot NOT work. That is, every clustering method will return clusters.
- Clustering helps to group / order information and is a visualization tool for learning about the data. However, clustering results do not provide biological “proof”.
- Clustering is generally used as an exploratory and hypothesis-generating tool.
59Microarray Workshop
Discrimination
60Microarray Workshop
Predefined Class
{1,2,…K}
1 2 K
Objects
Basic principles of discrimination
• Each object is associated with a class label (or response) Y ∈ {1, 2, …, K} and a feature vector (vector of predictor variables) of G measurements: X = (X1, …, XG)
Aim: predict Y from X.
X = {red, square} Y = ?
Y = Class Label = 2
X = Feature vector {colour, shape}
Classification rule ?
61Microarray Workshop
Discrimination and Allocation
Learning set: data with known classes
Classification technique
Classification rule
Data with unknown classes
Class assignment
Discrimination
Prediction
62Microarray Workshop
? Bad prognosis: recurrence < 5 yrs
Good prognosis: recurrence > 5 yrs
Reference: L van’t Veer et al (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, Jan.
Objects: arrays
Feature vectors: gene expression
Predefined classes: clinical outcome
new array
Learning set
Classification rule
Good prognosis (metastasis-free > 5 yrs)
63Microarray Workshop
B-ALL T-ALL AML
ReferenceGolub et al (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286(5439): 531-537.
Objects: arrays
Feature vectors: gene expression
Predefined classes: tumor type
?
new array
Learning set
Classification rule
T-ALL
64Microarray Workshop
Classification Rule
- Classification procedure
- Feature selection
- Parameters [pre-determined, estimable]
- Distance measure
- Aggregation methods
Performance Assessmente.g. Cross validation
• One can think of the classification rule as a black box; some methods provide more insight into the box.
• Performance assessment needs to be carried out for every classification rule.
65Microarray Workshop
Classification rule: maximum likelihood discriminant rule
• A maximum likelihood estimator (MLE) chooses the parameter value that makes the chance of the observations the highest.
• For known class conditional densities pk(X), the maximum likelihood (ML) discriminant rule predicts the class of an observation X by
C(X) = argmaxk pk(X)
66Microarray Workshop
Gaussian ML discriminant rules
• For multivariate Gaussian (normal) class densities X|Y = k ~ N(μ_k, Σ_k), the ML classifier is
C(X) = argmin_k { (X − μ_k) Σ_k⁻¹ (X − μ_k)′ + log|Σ_k| }
• In general, this is a quadratic rule (quadratic discriminant analysis, or QDA)
• In practice, the population mean vectors μ_k and covariance matrices Σ_k are estimated by the corresponding sample quantities
67Microarray Workshop
ML discriminant rules - special cases
[DLDA] Diagonal linear discriminant analysis: class densities have the same diagonal covariance matrix Σ = diag(s₁², …, s_p²)
[DQDA] Diagonal quadratic discriminant analysis: class densities have different diagonal covariance matrices Σ_k = diag(s₁k², …, s_pk²)
Note. Weighted gene voting of Golub et al. (1999) is a minor variant of DLDA for two classes (different variance calculation).
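DLDA reduces to a simple variance-scaled distance to each class mean. A minimal numpy sketch of that rule (illustrative names, not the workshop's code):

```python
import numpy as np

def dlda_fit(X, y):
    """Class means plus one pooled, diagonal variance vector (the DLDA assumption)."""
    X = np.asarray(X, float)
    y = np.asarray(y)
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # pooled within-class variance per gene; diagonal covariance shared by all classes
    resid = np.concatenate([X[y == c] - X[y == c].mean(axis=0) for c in classes])
    var = resid.var(axis=0) + 1e-8   # small constant guards against zero variance
    return classes, means, var

def dlda_predict(x, classes, means, var):
    """Assign x to the class minimizing sum_g (x_g - mean_kg)^2 / s_g^2."""
    scores = (((np.asarray(x, float) - means) ** 2) / var).sum(axis=1)
    return classes[int(scores.argmin())]
```

DQDA would differ only in estimating a separate variance vector per class instead of pooling.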
68Microarray Workshop
The Logistic Regression Model
2-class case: log[p/(1−p)] = α + βX
p is the probability that the event Y occurs given the observed gene expression pattern, p(Y=1 | X)
p/(1-p) is the "odds ratio" log[p/(1-p)] is the log odds ratio, or "logit"
This can easily be generalized to multiclass outcomes and to more general dependences than linear. Also, logistic regression makes fewer assumptions on the marginal distribution of the variables. However, the results are generally very similar to LDA. (Hastie et al, 2003)
69Microarray Workshop
Classification with SVMs
Generalization of the idea of separating hyperplanes in the original space. Linear boundaries between classes in a higher-dimensional space lead to non-linear boundaries in the original space.
Adapted from internet
70Microarray Workshop
Nearest neighbor classification
• Based on a measure of distance between observations (e.g. Euclidean distance or one minus correlation).
• The k-nearest neighbor rule (Fix and Hodges (1951)) classifies an observation X as follows:
- find the k observations in the learning set closest to X
- predict the class of X by majority vote, i.e., choose the class that is most common among those k observations.
• The number of neighbors k can be chosen by cross-validation (more on this later).
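The two steps of the rule map directly onto a few lines of numpy (a minimal sketch with illustrative names):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """k-nearest-neighbour rule: majority vote among the k closest learning-set points."""
    X_train = np.asarray(X_train, float)
    d = np.sqrt(((X_train - np.asarray(x, float)) ** 2).sum(axis=1))
    nearest = d.argsort()[:k]                    # the k closest observations
    votes = Counter(np.asarray(y_train)[nearest].tolist())
    return votes.most_common(1)[0][0]            # most common class among them
```

Any of the metrics discussed earlier (e.g. one minus correlation) could replace the Euclidean distance here.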
71Microarray Workshop
Nearest neighbor rule
72Microarray Workshop
Classification tree
• Partition the feature space into a set of rectangles, then fit a simple model in each one
• Binary tree structured classifiers are constructed by repeated splits of subsets (nodes) of the measurement space X into two descendant subsets (starting with X itself)
• Each terminal subset is assigned a class label; the resulting partition of X corresponds to the classifier
73Microarray Workshop
Classification tree
Figure: the tree splits first on Gene 1 (Mi1 < -0.67) and then on Gene 2 (Mi2 > 0.18), assigning samples to classes 0, 1, and 2; equivalently, the splits at -0.67 and 0.18 partition the (Gene 1, Gene 2) plane into rectangles.
74Microarray Workshop
Three aspects of tree construction
• Split selection rule:
- Example: at each node, choose the split maximizing the decrease in impurity (e.g. Gini index, entropy, misclassification error).
• Split-stopping:
- Example, grow large tree, prune to obtain a sequence of subtrees, then use cross-validation to identify the subtree with lowest misclassification rate.
• Class assignment:
- Example, for each terminal node, choose the class minimizing the resubstitution estimate of misclassification probability, given that a case falls into this node.
Supplementary slide
75Microarray Workshop
Another component in classification rule:aggregating classifiers
Training Set
X1, X2, … X100
Resample 1 → Classifier 1
Resample 2 → Classifier 2
Resample 499 → Classifier 499
Resample 500 → Classifier 500
Examples: bagging, boosting, random forest
→ Aggregate classifier
76Microarray Workshop
Aggregating classifiers:Bagging
Diagram: from the training set (arrays) X1, X2, …, X100, draw bootstrap resamples X*1, X*2, …, X*100 and grow Tree 1 through Tree 500. Each tree votes on the test sample (e.g. Class 1, Class 2, Class 1, Class 1, …), and the aggregate prediction is the majority vote (here 90% Class 1, 10% Class 2).
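A small numpy sketch of bagging, with the caveat that a nearest-mean classifier stands in for the trees in the diagram (all names here are illustrative):

```python
import numpy as np
from collections import Counter

def nearest_mean_fit(X, y):
    """A simple base classifier: one mean per class (stand-in for a tree)."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def nearest_mean_predict(model, x):
    return min(model, key=lambda c: ((model[c] - x) ** 2).sum())

def bagging_predict(X, y, x_new, B=50, seed=0):
    """Bagging: fit the base classifier on B bootstrap resamples and let them vote."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, float)
    y = np.asarray(y)
    votes = []
    for _ in range(B):
        boot = rng.integers(0, len(y), size=len(y))   # resample cases with replacement
        model = nearest_mean_fit(X[boot], y[boot])
        votes.append(nearest_mean_predict(model, np.asarray(x_new, float)))
    return Counter(votes).most_common(1)[0][0]        # majority vote
```

Boosting and random forests differ in how the resamples are drawn and how the votes are weighted, but the aggregate-by-voting structure is the same.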
77Microarray Workshop
Other classifiers include…
• Neural networks
• Projection pursuit
• Bayesian belief networks
• …
78Microarray Workshop
Why select features
• Lead to better classification performance by removing variables that are noise with respect to the outcome
• May provide useful insights into etiology of a disease
• Can eventually lead to diagnostic tests (e.g., a “breast cancer chip”)
79Microarray Workshop
Why select features?
Figure: correlation plots (scale -1 to +1) for the 3-class leukemia data: no feature selection, top-100 feature selection, and selection based on variance.
80Microarray Workshop
Approaches to feature selection
• Methods fall into three basic categories:
- Filter methods
- Wrapper methods
- Embedded methods
• The simplest and most frequently used methods are the filter methods.
Adapted from A. Hartemink
81Microarray Workshop
Filter methods
R^p → feature selection → R^s (s << p) → classifier design
•Features are scored independently and the top s are used by the classifier.
•Scores: correlation, mutual information, t-statistic, F-statistic, p-value, tree importance statistic, etc.
Easy to interpret. Can provide some insight into the disease markers.
Adapted from A. Hartemink
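A minimal numpy sketch of a filter method using one of the scores listed above, a per-gene two-sample t-like statistic (function name and constants are illustrative):

```python
import numpy as np

def filter_select(X, y, s):
    """Score every gene independently with a two-sample t-like statistic, keep top s."""
    X = np.asarray(X, float)
    y = np.asarray(y)
    a, b = X[y == 0], X[y == 1]
    # per-gene Welch-style t statistic; small constant avoids division by zero
    t = (a.mean(axis=0) - b.mean(axis=0)) / np.sqrt(
        a.var(axis=0) / len(a) + b.var(axis=0) / len(b) + 1e-12)
    return np.argsort(-np.abs(t))[:s]            # indices of the s top-scoring genes
```

The selected column indices are then passed to whatever classifier is being designed, which is exactly the independence of scoring and classification that the next slide criticizes.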
82Microarray Workshop
Problems with filter method
• Redundancy in selected features: features are considered independently and are not measured on whether they contribute new information
• Interactions among features generally cannot be explicitly incorporated (some filter methods are smarter than others)
• The classifier has no say in which features should be used: some scores may be more appropriate in conjunction with some classifiers than others.
Supplementary slide
Adapted from A. Hartemink
83Microarray Workshop
Dimension reduction: a variant on a filter method
• Rather than retain a subset of s features, perform dimension reduction by projecting the features onto s principal components of variation (e.g. PCA etc.)
• The problem is that we are no longer dealing with one feature at a time but rather with a linear or possibly more complicated combination of all features. It may be good enough for a black box, but how does one build a diagnostic chip on a “supergene”? (even though we don’t want to confuse the tasks)
• These methods tend not to work better than simple filter methods.
Supplementary slide
Adapted from A. Hartemink
84Microarray Workshop
Wrapper methods
R^p → feature selection → R^s (s << p) → classifier design
•Iterative approach: many feature subsets are scored based on classification performance and the best is used.
•Selection of subsets: forward selection, backward selection, forward-backward selection, tree harvesting, etc.
Adapted from A. Hartemink
85Microarray Workshop
Problems with wrapper methods
• Computationally expensive: for each feature subset to be considered, a classifier must be built and evaluated
• No exhaustive search is possible (2^p subsets to consider): generally greedy algorithms only.
• Easy to overfit.
Supplementary slide
Adapted from A. Hartemink
86Microarray Workshop
Embedded methods
• Attempt to jointly or simultaneously train both a classifier and a feature subset
• Often optimize an objective function that jointly rewards accuracy of classification and penalizes use of more features.
• Intuitively appealing
Some examples: tree-building algorithms, shrinkage methods (LDA, kNN)
Adapted from A. Hartemink
87Microarray Workshop
Performance assessment
• Any classification rule needs to be evaluated for its performance on future samples. It is almost never the case in microarray studies that a large, independent, population-based collection of samples is available at the initial classifier-building phase.
• One needs to estimate future performance based on what is available: often the same set that is used to build the classifier.
• Assessing performance of the classifier based on- Cross-validation.- Test set- Independent testing on future dataset
88Microarray Workshop
Diagram of performance assessment
A classifier built on the training set is evaluated either on the training set itself (resubstitution estimation) or on an independent test set (test set estimation).
89Microarray Workshop
Performance assessment (II)
• V-fold cross-validation (CV) estimation: cases in the learning set are randomly divided into V subsets of (nearly) equal size. Build classifiers leaving one set out at a time; compute the test set error rates on the left-out sets and average them.
- Bias-variance tradeoff: smaller V can give larger bias but smaller variance
- Computationally intensive.
• Leave-one-out cross validation (LOOCV).
(Special case for V=n). Works well for stable classifiers (k-NN, LDA, SVM)
Supplementary slide
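The V-fold procedure can be sketched generically in numpy, taking any `fit`/`predict` pair (illustrative names; real implementations would also stratify folds by class):

```python
import numpy as np

def cv_error(X, y, fit, predict, V=5, seed=0):
    """V-fold CV: train leaving one fold out, test on that fold, average the error rates."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, float)
    y = np.asarray(y)
    folds = np.array_split(rng.permutation(len(y)), V)
    errors = []
    for v in range(V):
        test = folds[v]
        train = np.concatenate([folds[u] for u in range(V) if u != v])
        model = fit(X[train], y[train])                        # build on the rest
        pred = np.array([predict(model, x) for x in X[test]])  # error on left-out fold
        errors.append(np.mean(pred != y[test]))
    return float(np.mean(errors))
```

Setting V = len(y) gives leave-one-out cross-validation as a special case.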
90Microarray Workshop
Performance assessment (I)
• Resubstitution estimation: error rate on the learning set.- Problem: downward bias
• Test set estimation:
1) Divide the learning set into two subsets, L and T; build the classifier on L and compute the error rate on T.
2) Build the classifier on the training set (L) and compute the error rate on an independent test set (T).
- L and T must be independent and identically distributed (i.i.d.).
- Problem: reduced effective sample size.
Supplementary slide
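The downward bias of resubstitution is easy to reproduce. The sketch below (synthetic pure-noise data and a 1-nearest-neighbour classifier, both hypothetical choices) gives a resubstitution error of exactly zero — each point is its own nearest neighbour — while the error on an independent test set stays near the true 50%.

```python
import numpy as np

rng = np.random.default_rng(2)

# Pure-noise data: the true error rate of ANY classifier is 50%.
n, p = 40, 500
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, p))
Xnew = rng.normal(size=(n, p))           # independent test set

def one_nn_error(Xtr, ytr, Xte, yte):
    """Error rate of a 1-nearest-neighbour classifier."""
    errs = 0
    for x, t in zip(Xte, yte):
        d = ((Xtr - x) ** 2).sum(axis=1)
        errs += ytr[np.argmin(d)] != t
    return errs / len(yte)

resub = one_nn_error(X, y, X, y)         # each point is its own neighbour
test = one_nn_error(X, y, Xnew, y)
print("resubstitution:", resub, "  test set:", test)
```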
91Microarray Workshop
Diagram of performance assessment
Training set
Performance assessment
TrainingSet
Independenttest set
(CV) Learningset
(CV) Test set
Classifier
Classifier
Classifier
Resubstitution estimation
Test set estimation
Cross Validation
92Microarray Workshop
Performance assessment (III)
• It is common practice to do feature selection using the full learning set, and then to cross-validate only the model-building and classification steps.
• However, the relevant features are usually unknown in advance and the intended inference includes feature selection. In that case, CV estimates computed as above tend to be downward biased.
• Features (variables) should be selected only from the learning set used to build the model within each CV fold (and not from the entire dataset).
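This selection bias is easy to demonstrate. The following numpy sketch (synthetic pure-noise data and a nearest-centroid classifier, both hypothetical) compares LOOCV with genes selected once on the full dataset against LOOCV with genes re-selected inside each fold; only the latter comes close to the true 50% error.

```python
import numpy as np

rng = np.random.default_rng(3)

# Pure-noise data: labels are independent of every gene, so the honest
# error rate is 50% no matter how the genes are chosen.
n, p, keep = 40, 1000, 10
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, p))

def top_genes(X, y, keep):
    """Rank genes by absolute mean difference between the classes."""
    score = np.abs(X[y == 0].mean(axis=0) - X[y == 1].mean(axis=0))
    return np.argsort(score)[-keep:]

genes_full = top_genes(X, y, keep)       # selected on the FULL dataset

def loocv_error(X, y, select_inside):
    errs = 0
    for i in range(len(y)):
        tr = np.delete(np.arange(len(y)), i)
        g = top_genes(X[tr], y[tr], keep) if select_inside else genes_full
        c0 = X[tr][:, g][y[tr] == 0].mean(axis=0)
        c1 = X[tr][:, g][y[tr] == 1].mean(axis=0)
        d0 = ((X[i, g] - c0) ** 2).sum()
        d1 = ((X[i, g] - c1) ** 2).sum()
        errs += (d1 < d0) != (y[i] == 1)
    return errs / len(y)

biased = loocv_error(X, y, select_inside=False)
honest = loocv_error(X, y, select_inside=True)
print("genes fixed outside CV:", biased, "  re-selected inside CV:", honest)
```

This is the phenomenon analyzed by Ambroise and McLachlan (PNAS, 2002), listed in the references.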
93Microarray Workshop
Comparison study
• Leukemia data – Golub et al. (1999)
- n = 72 samples
- G = 3,571 genes
- 3 classes (B-cell ALL, T-cell ALL, AML)
• Reference: S. Dudoit, J. Fridlyand, and T. P. Speed (2002). "Comparison of discrimination methods for the classification of tumors using gene expression data." Journal of the American Statistical Association, Vol. 97, No. 457, pp. 77-87.
94Microarray Workshop
Leukemia data, 3 classes: test set error rates; 150 LS/TS runs
95Microarray Workshop
Results
• In the main comparison, NN and DLDA had the smallest error rates.
• Aggregation improved the performance of CART classifiers.
• For the leukemia datasets, increasing the number of genes to G=200 didn't greatly affect the performance of the various classifiers.
96Microarray Workshop
Comparison study – discussion (I)
• “Diagonal” LDA: ignoring correlation between genes helped here. Unlike classification trees and nearest neighbors, DLDA is unable to take into account gene interactions.
• Classification trees are capable of handling and revealing interactions between variables. In addition, aggregated tree classifiers have useful by-products: prediction votes and variable importance statistics.
• Although nearest neighbors are simple and intuitive classifiers, their main limitation is that they give very little insight into mechanisms underlying the class distinctions.
97Microarray Workshop
Summary (I)
• Bias-variance trade-off. Simple classifiers do well on small datasets. As the number of samples increases, we expect to see that classifiers capable of considering higher-order interactions (and aggregated classifiers) will have an edge.
• Cross-validation. It is of utmost importance to cross-validate for every parameter that has been chosen based on the data, including meta-parameters:
- what and how many features
- how many neighbors
- pooled or unpooled variance
- the classifier itself.
If this is not done, it is possible to wrongly declare discrimination power when there is none.
98Microarray Workshop
Summary (II)
• Generalization error rate estimation. It is necessary to keep the sampling scheme in mind.
• Thousands of independent samples from a variety of sources are needed to assess the true performance of a classifier.
• We are not at that point yet with microarray studies. The van de Vijver et al. (2002) cohort is probably the only study to date with ~300 test samples.
99Microarray Workshop
Some performance assessment quantities
Assume a 2-class problem:
- class 1 = no event ~ null hypothesis (e.g., no recurrence)
- class 2 = event ~ alternative hypothesis (e.g., recurrence)
All quantities are estimated on the available dataset (test set if available)
• Misclassification error rate: proportion of misclassified samples.
• Lift: proportion of correct class 2 predictions divided by the proportion of class 2 cases:
Prob(class 2 is true | class 2 is detected) / Prob(class is 2)
• Odds ratio: measure of association between the true and predicted labels.
100Microarray Workshop
Some performance assessment quantities (ctd)
• Sensitivity: proportion of correct class 2 predictions.
Prob(detect class 2 | class 2 is true) ~ power
• Specificity: proportion of correct class 1 predictions.
Prob(declare class 1 | class 1 is true) = 1 – Prob(detect class 2 | class 1 is true) ~ 1 – type I error
101Microarray Workshop
Some performance assessment quantities (ctd)
• Positive Predictive Value (PPV): proportion of true class 2 cases among predicted class 2 cases (the quantity applicable to the population).
Prob(class 2 is true | class 2 is detected)
= Prob(detect class 2 | class 2 is true) x Prob(class 2 is true) / Prob(detect class 2)
= sensitivity x Prob(class is 2) / [sensitivity x Prob(class is 2) + (1 – specificity) x (1 – Prob(class is 2))]
Note that PPV is the only quantity that explicitly incorporates population proportions, i.e., the prevalence of class 2 in the population of interest (Prob(class is 2)), as well as sensitivity and specificity.
If the prevalence is low, the specificity of the test has to be very high for it to be clinically useful.
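The PPV formula above is just Bayes' rule and can be checked in a few lines; the example inputs below (perfect sensitivity, 95% specificity) are illustrative.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value via Bayes' rule."""
    tp = sensitivity * prevalence              # true positive mass
    fp = (1 - specificity) * (1 - prevalence)  # false positive mass
    return tp / (tp + fp)

# Perfect sensitivity, specificity 95%:
print(round(100 * ppv(1.0, 0.95, 0.43), 1))      # → 93.8 (high prevalence)
print(round(100 * ppv(1.0, 0.95, 1 / 2500), 1))  # → 0.8 (rare disease)
```

The same test drops from a 93.8% to a 0.8% PPV purely because the prevalence changes.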
102Microarray Workshop
Case studies

Reference 1 – retrospective study: L. van 't Veer et al., "Gene expression profiling predicts clinical outcome of breast cancer", Nature, Jan 2002.
- Learning set of "good" and "bad" prognosis samples used to build the classification rule.
- Feature selection: correlation with class labels, very similar to a t-test.
- Cross-validation used to select 70 genes.

Reference 2 – cohort study: M. van de Vijver et al., "A gene expression signature as a predictor of survival in breast cancer", The New England Journal of Medicine, Dec 2002.
- 295 samples selected from the Netherlands Cancer Institute tissue bank (1984–1995).
- Result: the gene expression profile is a more powerful predictor than standard systems based on clinical and histologic criteria.

Reference 3 – prospective trials: Agendia (formed by researchers from the Netherlands Cancer Institute), started Oct 2003. http://www.agendia.com/
- Clinical trials (Aug 2003): 1) 5,000 subjects [Health Council of the Netherlands]; 2) 5,000 subjects, New York-based Avon Foundation.
- Custom arrays made by Agilent, including the 70 genes + 1,000 controls.
103Microarray Workshop
Van 't Veer breast cancer study
Investigate whether a tumor's ability to metastasize is acquired later in development or is inherent in the initial gene expression signature.
• Retrospective sampling of node-negative women: 44 non-recurrences within 5 years of surgery and 34 recurrences. Additionally, 19 test samples (12 recurrences and 7 non-recurrences).
• Want to demonstrate that gene expression profile is significantly associated with recurrence independent of the other clinical variables.
Nature, 2002
104Microarray Workshop
Predictor development
• Identify a set of genes with correlation > 0.3 with the binary outcome. Show that there is significant enrichment for such genes in the dataset.
• Rank-order genes on the basis of their correlation.
• Optimize the number of genes in the classifier using leave-one-out CV: classification is made on the basis of the correlations of the expression profile of the left-out sample with the mean expression of the remaining samples from the good- and bad-prognosis patients, respectively.
N.B.: The correct way to select genes is within rather than outside cross-validation, resulting in a different set of markers for each CV iteration.
N.B.: Optimizing the number of variables and other parameters should be done via 2-level (nested) cross-validation if results are to be assessed on the training set.
The classification indicator is included in the logistic model along with the other clinical variables. The gene expression profile is shown to have the strongest effect. Note that some of this may be due to overfitting of the threshold parameter.
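A rough numpy sketch of this scheme: genes correlated with outcome are selected inside each LOOCV fold (as the N.B. above recommends), and the left-out sample is classified by which class mean profile its expression correlates with more. The data are a synthetic stand-in; the 0.3 correlation threshold follows the slide, everything else is invented.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy stand-in for the expression matrix: 30 patients x 200 genes,
# with the first 20 genes (weakly) associated with outcome.
n, p = 30, 200
y = np.repeat([0, 1], n // 2)            # 0 = good, 1 = bad prognosis
X = rng.normal(size=(n, p))
X[y == 1, :20] += 1.5

def corr_with_outcome(X, y):
    """Pearson correlation of each gene with the binary outcome."""
    yc = y - y.mean()
    Xc = X - X.mean(axis=0)
    return Xc.T @ yc / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

errs = 0
for i in range(n):
    tr = np.delete(np.arange(n), i)
    r = corr_with_outcome(X[tr], y[tr])
    g = np.where(np.abs(r) > 0.3)[0]     # select genes INSIDE the fold
    good = X[tr][:, g][y[tr] == 0].mean(axis=0)
    bad = X[tr][:, g][y[tr] == 1].mean(axis=0)
    # classify by which mean profile the left-out sample matches better
    pred = np.corrcoef(X[i, g], bad)[0, 1] > np.corrcoef(X[i, g], good)[0, 1]
    errs += pred != (y[i] == 1)

print("LOOCV error:", errs / n)
```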
105Microarray Workshop
Van ‘t Veer, et al., 2002
106Microarray Workshop
van de Vijver's breast data (NEJM, 2002)
• 295 additional breast cancer patients, mix of node-negative and node-positive samples.
• Want to use the predictor that was developed to identify patients at risk for metastasis.
• The predicted class was significantly associated with time to recurrence in a multivariate Cox proportional-hazards model.
107Microarray Workshop
108Microarray Workshop
Some examples of wrong answers and questions in microarray data analysis
109Microarray Workshop
Biological verification and interpretation
Microarray experiment
Experimental design
Image analysis
Normalization
Biological question
TestingEstimation DiscriminationAnalysis
Clustering
Life Cycle
Quality measurement
Failed
Pass
110Microarray Workshop
Prediction I: estimating misclassification error
Performance of the classifiers on future samples needs to be assessed while taking population proportions into account.
Question: Build a classifier to predict a rare (1/100) subclass of cancer and estimate its misclassification rate in the population.
Design: Retrospectively collect equal numbers of rare and common subtypes and build a classifier. Estimate its future performance using cross-validation on the collected set.
Issues: Population proportions of the two types differ from the proportions in the study. For instance, if 0/50 of rare subtype and 10/50 of common subtype were misclassified (10/100), then in population, we expect to observe 1 rare instance and 99 common ones and will misclassify approximately 20/100 samples.
Conclusion: If a dataset is not representative of population distributions, one needs to think hard about how to do the “translation”. (e.g., Positive Predictive Value on the future samples vs Specificity and Sensitivity on the current ones).
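The "translation" in this example is a one-line re-weighting of the class-specific error rates by the population prevalences (the numbers below are the ones from the slide):

```python
def population_error(class_errors, prevalences):
    """Re-weight class-specific error rates by population prevalence."""
    return sum(e * q for e, q in zip(class_errors, prevalences))

errors = [0 / 50, 10 / 50]                 # rare and common subtype
study = population_error(errors, [0.5, 0.5])      # balanced study design
future = population_error(errors, [0.01, 0.99])   # true population mix
print(study, future)                       # 10% in the study, ~20% in use
```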
111Microarray Workshop
Prediction II: Prevalence vs PPV (ctd)

PPV (%) by prevalence and specificity, assuming a constant sensitivity of 100%:

Specificity | Prevalence: 50% | 43%  | 10% | 1%  | 0.1% | One per 2500
90%         |             91  | 88   | 53  | 9   | 1    | 0.4
95%         |             95  | 94*  | 69  | 17  | 2    | 0.8**
99%         |             99  | 99   | 92  | 50  | 9    | 4
99.9%       |             99.9| 99.9 | 99  | 91  | 50   | 29

*PPV reported by Petricoin et al. (2002)
**Correct PPV assuming the prevalence of ovarian cancer in the general population is 1/2500.

Note that discovering discriminatory power is not the same as demonstrating the clinical utility of a classifier.

Adapted from the comment in the Lancet by Rockhill
112Microarray Workshop
Experimental design
Proper randomization is essential in experimental design.
Question: Build a predictor to diagnose ovarian cancer
Design: Tissue from Normal women and Ovarian cancer patients arrives at different times.
Issues: Complete confounding between tissue type and time of processing.
This phenomenon is very common in the absence of a carefully thought-through design.
Post-mortem diagnosis: lack of randomization.
113Microarray Workshop
Clustering I
The procedure should not bias results towards desired conclusions.
Question: Do expression data cluster according to survival status?
Design: Identify genes with high t-statistics comparing short and long survivors. Use these genes to cluster the samples. Get excited that the samples cluster according to survival status.
Issues: The genes were already selected based on survival status. Therefore, it would be surprising if the samples did *not* cluster according to their survival.
Conclusion: None is possible with respect to clustering, as the variable selection was driven by the class distinction.
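The circularity is easy to demonstrate on pure noise: genes selected for separating the "survival" groups will make those very groups cluster apart. A numpy sketch (the data and the bare-bones 2-means implementation are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)

# Pure noise: 20 samples x 1000 genes; "survival" labels are arbitrary.
n, p, keep = 20, 1000, 20
labels = np.repeat([0, 1], n // 2)       # e.g. short vs long survivors
X = rng.normal(size=(n, p))

# Select the genes that best separate the survival groups...
diff = np.abs(X[labels == 0].mean(axis=0) - X[labels == 1].mean(axis=0))
g = np.argsort(diff)[-keep:]
Xg = X[:, g]

# ...then cluster the samples on those genes with plain 2-means (Lloyd).
centers = Xg[[0, n - 1]].copy()          # one seed from each label group
for _ in range(20):
    assign = ((Xg[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
    centers = np.array([Xg[assign == k].mean(axis=0)
                        if (assign == k).any() else centers[k]
                        for k in (0, 1)])

# Agreement between clusters and labels (up to label swap).
agreement = max((assign == labels).mean(), (assign != labels).mean())
print("cluster/label agreement:", agreement)
```

Despite the data containing no structure at all, the clusters largely reproduce the labels used to pick the genes.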
114Microarray Workshop
Clustering II
P-values for differential expression are only valid when the class labels are independent of the current dataset.
Question: Identify genes distinguishing among "interesting" subgroups.
Design: Cluster samples into K groups. For each gene, compute the F-statistic and its associated p-value to test for differential expression among the subgroups.
Issues: The same data were used to create the groups and to test for differential expression – the p-values are invalid.
Conclusion: None with respect to DE p-values. Nevertheless, it is possible to select genes with high values of the statistic and test hypotheses about functional enrichment with, e.g., Gene Ontology. One can also cluster these genes and use the results to generate new hypotheses.
115Microarray Workshop
Acknowledgements
UCSF/CBMB: Ajay Jain, Mark Segal; UCSF Cancer Center Array Core; Jain Lab.
SFGH: Agnes Paquet, David Erle, Andrea Barczak; UCSF Sandler Genomics Core Facility.
UCB: Terry Speed, Sandrine Dudoit.
116Microarray Workshop
Some references
1. Hastie, Tibshirani, Friedman, "The Elements of Statistical Learning", Springer, 2001.
2. Speed (ed.), "Statistical Analysis of Gene Expression Microarray Data", Chapman & Hall/CRC, 2003.
3. Alizadeh et al., "Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling", Nature, 2000.
4. van 't Veer et al., "Gene expression profiling predicts clinical outcome of breast cancer", Nature, 2002.
5. van de Vijver et al., "A gene-expression signature as a predictor of survival in breast cancer", NEJM, 2002.
6. Petricoin et al., "Use of proteomic patterns in serum to identify ovarian cancer", Lancet, 2002 (and relevant correspondence).
7. Golub et al., "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring", Science, 1999.
8. Cho et al., "A genome-wide transcriptional analysis of the mitotic cell cycle", Mol. Cell, 1999.
9. Dudoit et al., "Comparison of discrimination methods for the classification of tumors using gene expression data", JASA, 2002.
117Microarray Workshop
Some references
10. Ambroise and McLachlan, "Selection bias in gene extraction on the basis of microarray gene-expression data", PNAS, 2002.
11. Tibshirani et al., "Estimating the number of clusters in a dataset via the gap statistic", Tech. Report, Stanford, 2000.
12. Tseng et al., "Tight clustering: a resampling-based approach for identifying stable and tight patterns in data", Tech. Report, 2003.
13. Dudoit and Fridlyand, "A prediction-based resampling method for estimating the number of clusters in a dataset", Genome Biology, 2002.
14. Dudoit and Fridlyand, "Bagging to improve the accuracy of a clustering procedure", Bioinformatics, 2003.
15. Kaufman and Rousseeuw, "Clustering by means of medoids", Elsevier/North-Holland, 1987.
16. See the many articles by Leo Breiman on aggregation.