1Microarray Workshop
Introduction to Classification Issues in Microarray Data Analysis
Jane Fridlyand, Jean Yee Hwa Yang
University of California, San Francisco
Elsinore, Denmark, May 17-21, 2004
2Microarray Workshop
Brief Overview of the Life-Cycle
3Microarray Workshop
Biological verification and interpretation
Microarray experiment
Experimental design
Image analysis
Pre-processing
Biological question
Analysis: Testing, Estimation, Discrimination
Clustering
Life Cycle
Quality measurement
Failed
Pass
4Microarray Workshop
• The steps outlined in the “Life Cycle” need to be carefully thought through and re-adjusted for each data type/platform combination. Experimental design will impact what questions should be asked and may be answered once the data are collected.
• To call in the statistician after the experiment is done may be no more than asking him to perform a postmortem examination: he may be able to say what the experiment died of.
Sir RA Fisher
5Microarray Workshop
GeneChip Affymetrix
cDNA microarray
Nylon membrane
Agilent: Long oligo Ink Jet
Illumina Bead Array
CGH
SAGE
Different Technologies
6Microarray Workshop
Some statistical issues
• Designing gene expression experiments.
• Acquiring the raw data: image analysis.
• Assessing the quality of the data.
• Summarizing and removing artifacts from the data.
• Interpretation and analysis of the data:
- Discovering which genes are differentially expressed
- Discovering which genes exhibit interesting expression patterns
- Detection of gene regulatory mechanisms
- Classification of samples
- And many others…
Lots of other bioinformatics issues …
For a review see Smyth, Yang and Speed, “Statistical issues in microarray data analysis”, In: Functional Genomics: Methods and Protocols, Methods in Molecular Biology, Humana Press, March 2003
7Microarray Workshop
Short-oligonucleotide chip data:
• quality assessment,
• background correction,
• probe-level normalization,
• probe set summary.
Two-color spotted array data:
• quality assessment; diagnostic plots,
• background correction,
• array normalization.
CEL, CDF files gpr, gal files
probes by sample matrix of log-ratios or log-intensities
Analysis of expression data:• Identify D.E. genes, estimation and testing,• clustering, and • discrimination.
Quality assessment / Pre-processing
Array CGH data:
• quality assessment; diagnostic plots,
• background correction,
• clones summary,
• array normalization.
UCSF spot file
Image analysis / Analysis
8Microarray Workshop
Linear Models. Specific examples: t-tests, F-tests, empirical Bayes, SAM.
Examples
• Identify differentially expressed genes among two or more tumor subtypes or different cell treatments.
• Look for genes that have different time profiles between different mutants.
• Look for genes associated with survival.
Linear Models
9Microarray Workshop
Clustering
Algorithms•Hierarchical clustering•Self-organizing maps•Partition around medoids (pam)
Examples
• We can cluster cell samples (columns), e.g. to identify new / unknown tumor subclasses or cell subtypes using gene expression profiles.
• We can cluster genes (rows), e.g. using large numbers of yeast experiments, to identify groups of co-expressed genes.
10Microarray Workshop
Discrimination
Classification rules • DLDA or DQDA• k-nearest neighbor (knn) • Support vector machine (svm)• Classification tree
Figure: a classification tree built on a learning set splits first on Gene 1 (Mi1 < -0.67) and then on Gene 2 (Mi2 > 0.18) to separate B-ALL, AML, and T-ALL samples.
Questions
• Can we identify groups of genes that are predictive of a particular class of tumors?
• Can I use the expression profile of cancer patients to predict survival?
11Microarray Workshop
Annotation
Riken ID: ZX00049O01
GenBank accession: AV128498
LocusLink: 15903
Biochemical pathways (KEGG)
Nucleotide sequence: TCGTTCCATTTTTCTTTAGGGGGTCTTTCCCCGTCTTGGGGGGGAGGAAAAGTTCTGCTGCCCTGATTATGAACTCTATAATAGAGTATATAGCTTTTGTACCTTTTTTACAGGAAGGTGCTTTCTGTAATCATGTGATGTATATTAAACTTTTTATAAAAGTTAACATTTTGCATAATAAACCATTTTTG
Bay Genomics ES cells
UniGene: Mm.110
MGD: MGI:96398
Name: Inhibitor of DNA binding 3
Gene symbol: Idb3
Swiss-Prot: P20109
GO: GO:0000122, GO:0005634, GO:0019904
Map position: Chromosome 4, 66.0 cM
PubMed: 12858547, 2000388, etc.
Literature
12Microarray Workshop
What is your question?
• What are the target genes for my knock-out gene?
• Look for genes that have different time profiles between different cell types.
Gene discovery, differential expression
• Is a specified group of genes all up-regulated in a specified condition?
Gene set, differential expression
• Can I use the expression profile of cancer patients to predict survival?
• Identification of groups of genes that are predictive of a particular class of tumors?
Class prediction, classification
• Are there tumor sub-types not previously identified? • Are there groups of co-expressed genes?
Class discovery, clustering
• Detection of gene regulatory mechanisms. • Do my genes group into previously undiscovered pathways?
Clustering. Often expression data alone is not enough; one needs to incorporate sequence and other information.
13Microarray Workshop
Classification
14Microarray Workshop
Gene expression data Two color spotted array
Data on G genes for n samples
Genes
mRNA samples
Gene expression level of gene i in mRNA sample j
= (normalized) Log( Red intensity / Green intensity)
sample1 sample2 sample3 sample4 sample5 …
Gene 1   0.46   0.30   0.80   1.51   0.90 ...
Gene 2  -0.10   0.49   0.24   0.06   0.46 ...
Gene 3   0.15   0.74   0.04   0.10   0.20 ...
Gene 4  -0.45  -1.03  -0.79  -0.56  -0.32 ...
Gene 5  -0.06   1.06   1.35   1.09  -1.09 ...
15Microarray Workshop
Classification
• Task: assign objects to classes (groups) on the basis of measurements made on the objects
• Unsupervised: classes unknown, want to discover them from the data (cluster analysis)
• Supervised: classes are predefined, want to use a (training or learning) set of labeled objects to form a classifier for classification of future observations
16Microarray Workshop
Example: Tumor Classification
• Reliable and precise classification essential for successful cancer treatment
• Current methods for classifying human malignancies rely on a variety of morphological, clinical and molecular variables
• Uncertainties in diagnosis remain; likely that existing classes are heterogeneous
• Characterize molecular variations among tumors by monitoring gene expression (microarray)
• Hope: that microarrays will lead to more reliable tumor classification (and therefore more appropriate treatments and better outcomes)
17Microarray Workshop
Tumor Classification Using Gene Expression Data
Three main types of statistical problems associated with tumor classification:
• Identification of new/unknown tumor classes using gene expression profiles (unsupervised learning – clustering)
• Classification of malignancies into known classes (supervised learning – discrimination)
• Identification of “marker” genes that characterize the different tumor classes (feature or variable selection).
18Microarray Workshop
Clustering
19Microarray Workshop
Generic Clustering Tasks
• Estimating number of clusters
• Assign samples to the groups
• Assessing strength/confidence of cluster assignments for individual objects
20Microarray Workshop
What to cluster
• Samples: To discover novel subtypes of the existing groups or entirely new partitions. Their utility needs to be confirmed with other types of data, e.g. clinical information.
• Genes: To discover groups of co-regulated genes/ESTs and use these groups to infer function where it is unknown using members of the groups with known function.
21Microarray Workshop
Basic principles of clustering
Aim: to group observations or variables that are “similar” based on predefined criteria.
Issues: Which genes / arrays to use? Which similarity or dissimilarity measure?
Which method to use to join clusters/observations? Which clustering algorithm?
How to validate the resulting clusters?
It is advisable to reduce the number of genes from the full set to some more manageable number, before clustering. The basis for this reduction is usually quite context specific and varies depending on what is being clustered, genes or arrays.
22Microarray Workshop
Array Data
For each gene, calculate a summary statistic and/or adjusted p-value
Clustering
Clustering of genes
Set of candidate DE genes. Biological verification
Descriptive interpretation
Similarity metrics
Clustering algorithm
23Microarray Workshop
Array Data
Set of samples to cluster
Clustering
Clustering of samples and genes
Set of genes to use in clustering (DO NOT use class labels in the set determination).
Descriptive interpretation of genes separating novel subgroups of the samples
Similarity metrics
Clustering algorithm
Validation of clusters with clinical data
24Microarray Workshop
Which similarity or dissimilarity measure?
• A metric is a measure of the similarity or dissimilarity between two data objects
• Two main classes of metric:
- Correlation coefficients (similarity): compare the shape of expression curves. Types of correlation: centered, un-centered, rank correlation.
- Distance metrics (dissimilarity): City Block (Manhattan) distance, Euclidean distance.
25Microarray Workshop
• Pearson correlation coefficient (centered correlation):
r = (1/n) Σ_{i=1}^{n} ((x_i − x̄)/S_x) · ((y_i − ȳ)/S_y)
where S_x and S_y are the standard deviations of x and y.
• Others include Spearman’s and Kendall’s.
Correlation (a measure between -1 and 1)
Positive correlation Negative correlation
You can use absolute correlation to capture both positive and negative correlation
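As a minimal numpy sketch of these two ideas (the function names are illustrative, not from the workshop code): the Pearson correlation as a similarity, and one-minus-(absolute)-correlation as the dissimilarity a clustering routine would consume.

```python
import numpy as np

def pearson(x, y):
    """Centered (Pearson) correlation between two expression profiles."""
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

def corr_dissimilarity(x, y, absolute=False):
    """1 - correlation (or 1 - |correlation|) as a dissimilarity for clustering."""
    r = pearson(x, y)
    return 1.0 - (abs(r) if absolute else r)
```

With `absolute=True`, a gene and its mirror image (negative correlation) get dissimilarity near zero, capturing both positive and negative co-expression as the slide suggests.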
26Microarray Workshop
Potential pitfalls
Correlation = 1
27Microarray Workshop
Distance metrics
• City Block (Manhattan) distance:
- sum of absolute differences across dimensions
- less sensitive to outliers
- diamond-shaped clusters
• Euclidean distance:
- most commonly used distance
- sphere-shaped clusters
- corresponds to the geometric distance in multidimensional space
d(X, Y) = Σ_i |x_i − y_i|  (Manhattan)
d(X, Y) = √( Σ_i (x_i − y_i)² )  (Euclidean)
where gene X = (x1,…,xn) and gene Y=(y1,…,yn)
Figure: genes X and Y plotted in the space of Condition 1 vs. Condition 2, illustrating the two distances.
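The two formulas above can be written directly in numpy (a small sketch; the names are illustrative):

```python
import numpy as np

def manhattan(x, y):
    """City Block distance: sum of absolute differences across dimensions."""
    return float(np.abs(np.asarray(x, float) - np.asarray(y, float)).sum())

def euclidean(x, y):
    """Euclidean distance: geometric distance in multidimensional space."""
    d = np.asarray(x, float) - np.asarray(y, float)
    return float(np.sqrt(d @ d))
```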
28Microarray Workshop
Euclidean vs Correlation (I)
• Euclidean distance
• Correlation
29Microarray Workshop
How to Compute Group Similarity?
Given two groups g1 and g2, four popular methods:
• Single-link: s(g1, g2) = similarity of the closest pair
• Complete-link: s(g1, g2) = similarity of the furthest pair
• Average-link: s(g1, g2) = average similarity over all pairs
• Centroid: s(g1, g2) = distance between the centroids of the two clusters
Supplementary slide
Adapted from internet
30Microarray Workshop
Examples of clustering methods (distance between clusters):
• Single (nearest neighbor): leads to “cluster chains”
• Complete (furthest neighbor): leads to small compact clusters
• Average (mean) linkage
• Distance between centroids
31Microarray Workshop
Comparison of the Three Methods
• Single-link- Elongated clusters - Individual decision, sensitive to outliers
• Complete-link- Compact clusters - Individual decision, sensitive to outliers
• Average-link or centroid- “In between” - Group decision, insensitive to outliers
• Which one is the best? Depends on what you need!
Adapted from internet
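The four group-similarity rules above are easy to state in code. A small numpy sketch (the function name and signature are illustrative, not a standard library API):

```python
import numpy as np

def group_distance(g1, g2, link="single"):
    """Distance between two clusters under the four linkage rules above."""
    g1 = np.atleast_2d(np.asarray(g1, float))
    g2 = np.atleast_2d(np.asarray(g2, float))
    # all pairwise Euclidean distances between points of g1 and points of g2
    d = np.sqrt(((g1[:, None, :] - g2[None, :, :]) ** 2).sum(axis=-1))
    if link == "single":        # closest pair
        return float(d.min())
    if link == "complete":      # furthest pair
        return float(d.max())
    if link == "average":       # mean over all pairs
        return float(d.mean())
    if link == "centroid":      # distance between the two centroids
        c = g1.mean(axis=0) - g2.mean(axis=0)
        return float(np.sqrt(c @ c))
    raise ValueError("unknown linkage: " + link)
```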
32Microarray Workshop
Clustering algorithms
• Clustering algorithms come in two basic flavors
Partitioning Hierarchical
33Microarray Workshop
Partitioning methods
• Partition the data into a pre-specified number k of mutually exclusive and exhaustive groups.
• Iteratively reallocate the observations to clusters until some criterion is met, e.g. minimize within cluster sums of squares. Ideally, dissimilarity between clusters will be maximized while it is minimized within clusters.
• Examples:
- k-means, self-organizing maps (SOM), PAM, etc.
- Fuzzy clustering (each object is assigned a probability of being in a cluster): needs a stochastic model, e.g. Gaussian mixtures.
34Microarray Workshop
K = 2
Partitioning methods
35Microarray Workshop
K = 4
Partitioning methods
36Microarray Workshop
Example of a partitioning algorithm: K-Means or PAM (Partitioning Around Medoids)
1. Given a similarity function
2. Start with k randomly selected data points
3. Assume they are the centroids (medoids) of k clusters
4. Assign every data point to a cluster whose centroid (medoid) is the closest to the data point
5. Recompute the centroid (medoid) for each cluster
6. Repeat this process until the similarity-based objective function converges
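The six steps above (centroid version) can be sketched in a few lines of numpy; this is a minimal illustration of Lloyd's algorithm, not production clustering code:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """K-means following steps 2-6 above (centroid, not medoid, version)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, float)
    # steps 2-3: k randomly selected data points serve as initial centroids
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # step 4: assign every point to the cluster with the closest centroid
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        # step 5: recompute each centroid as the mean of its cluster
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):   # step 6: the objective has converged
            break
        centers = new
    return labels, centers
```

PAM differs in that the cluster representative is constrained to be an actual data point (a medoid), which makes it usable with arbitrary dissimilarities.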
37Microarray Workshop
Mixture Model for Clustering
P(X|Cluster1)
P(X|Cluster2)
P(X|Cluster3)
P(X) = π₁ P(X|Cluster1) + π₂ P(X|Cluster2) + π₃ P(X|Cluster3), where X|Clusterᵢ ~ N(μᵢ, σᵢ²I) and πᵢ is a cluster prior.
Adapted from internet
38Microarray Workshop
Mixture Model Estimation
• Likelihood function (generally Gaussian)
• Parameters: e.g., πᵢ, μᵢ, σᵢ
• Using the EM algorithm
- Similar to “soft” k-means
• Number of clusters can be determined using a model-selection criterion, e.g. BIC (Raftery and Fraley, 1998)
p(x) = Σ_{i=1}^{k} πᵢ (1/√(2πσᵢ²)) exp( −(x − μᵢ)² / (2σᵢ²) )
Adapted from internet
39Microarray Workshop
Hierarchical methods
• Hierarchical clustering methods produce a tree or dendrogram.
• They avoid specifying how many clusters are appropriate by providing a partition for each k obtained from cutting the tree at some level.
• The tree can be built in two distinct ways- bottom-up: agglomerative clustering (usually used).- top-down: divisive clustering.
40Microarray Workshop
Agglomerative Methods
• Start with n mRNA sample (or G gene) clusters
• At each step, merge the two closest clusters using a measure of between-cluster dissimilarity which reflects the shape of the clusters
The distance between clusters is defined by the method used (e.g., for complete linkage, the distance is the distance between the furthest pair of points in the two clusters)
Supplementary slide
41Microarray Workshop
Divisive Methods
• Start with only one cluster
• At each step, split clusters into two parts
• Advantage: Obtain the main structure of the data (i.e. focus on upper levels of dendrogram)
• Disadvantage: Computational difficulties when considering all possible divisions into two groups
Divisive methods are rarely utilized in microarray data analysis.
Supplementary slide
42Microarray Workshop
Figure: agglomerative clustering of five points in two-dimensional space. Merges: (1,5), then (1,2,5) and (3,4), finally (1,2,3,4,5); shown as a dendrogram over the leaf order 1 5 2 3 4.
43Microarray Workshop
Figure: the same agglomerative dendrogram, illustrating tree re-ordering: the leaves can be displayed in different orders without changing the clustering.
44Microarray Workshop
Partitioning vs. hierarchical
Partitioning:
Advantages
• Optimal for certain criteria.
• Objects automatically assigned to clusters.
Disadvantages
• Need initial k.
• Often require long computation times.
• All objects are forced into a cluster.
Hierarchical:
Advantages
• Faster computation.
• Visual.
Disadvantages
• Unrelated objects are eventually joined.
• Rigid: cannot correct later for erroneous decisions made earlier.
• Hard to define clusters – still need to know “where to cut”.
Note that hierarchical clustering results may be used as the starting points for the partitioning or model-based algorithms
45Microarray Workshop
Clustering microarray data
• Clustering leads to readily interpretable figures and can be helpful for identifying patterns in time or space.
Examples:
• We can cluster cell samples (columns), e.g. for the identification of new / unknown tumor classes or cell subtypes using gene expression profiles.
• We can cluster genes (rows), e.g. using large numbers of yeast experiments, to identify groups of co-regulated genes.
• We can cluster genes (rows) to reduce redundancy (cf. variable selection) in predictive models.
46Microarray Workshop
Estimating number of clusters using silhouette (see PAM)
Define the silhouette width of an observation as:
S = (b − a) / max(a, b)
where a is the average dissimilarity to all the points in its own cluster and b is the smallest, over the other clusters, of the average dissimilarity to the objects in that cluster.
Intuitively, objects with large S are well-clustered, while those with small S tend to lie between clusters.
How many clusters? Perform clustering for a sequence of numbers of clusters k and choose the number of clusters corresponding to the largest average silhouette.
The issue of the number of clusters in the data is most relevant for novel class discovery, i.e. for clustering samples.
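The silhouette width can be computed directly from a distance matrix. A minimal numpy sketch of the definition above (illustrative names; PAM implementations compute this for you):

```python
import numpy as np

def silhouette_widths(X, labels):
    """S = (b - a)/max(a, b) for every observation, as defined above."""
    X = np.asarray(X, float)
    labels = np.asarray(labels)
    # full matrix of pairwise Euclidean distances
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    S = np.empty(len(X))
    for i in range(len(X)):
        own = (labels == labels[i])
        own[i] = False
        a = D[i, own].mean()  # average dissimilarity to its own cluster
        # b: smallest average dissimilarity to any other cluster
        b = min(D[i, labels == c].mean() for c in np.unique(labels) if c != labels[i])
        S[i] = (b - a) / max(a, b)
    return S
```

Averaging `silhouette_widths` over all observations for each candidate k and taking the maximizing k implements the rule on this slide.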
47Microarray Workshop
Estimating Number of Clusters with Silhouette (ctd)
Compute the average silhouette for k = 3 and compare it with the results for other k’s.
48Microarray Workshop
Estimating number of clusters using reference distribution
Idea: define a goodness-of-clustering score to minimize, e.g. the pooled within-cluster sum of squares (WSS) around the cluster means, reflecting the compactness of the clusters:
W_k = Σ_{r=1}^{k} (1 / (2 n_r)) D_r
where n_r and D_r are the number of points in cluster r and the sum of all pairwise distances within it, respectively.
The gap statistic for k clusters is then defined as:
Gap_n(k) = E*_n[ log(W_k) ] − log(W_k)
where E*_n is the average under a sample of the same size from a reference distribution. The reference distribution can be generated either parametrically (e.g. from a multivariate distribution) or non-parametrically (e.g. by sampling from the marginal distributions of the variables). The first local maximum is chosen as the number of clusters (the actual rule is slightly more complicated) (Tibshirani et al., 2001)
Adapted from internet
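A rough numpy sketch of the gap idea, under simplifying assumptions: the reference is a uniform box over the observed ranges, and `split_by_mean` is a toy stand-in for a real clustering routine (all names here are illustrative):

```python
import numpy as np

def log_Wk(X, labels):
    """log of W_k = sum_r D_r / (2 n_r), D_r = sum of pairwise squared distances."""
    W = 0.0
    for c in np.unique(labels):
        pts = X[labels == c]
        D = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=-1)
        W += D.sum() / (2 * len(pts))
    return np.log(W)

def split_by_mean(X, k):
    """Toy stand-in for a clustering routine: k=1 -> one cluster, k=2 -> split on mean."""
    if k == 1:
        return np.zeros(len(X), dtype=int)
    return (X[:, 0] > X[:, 0].mean()).astype(int)

def gap_stat(X, k, cluster_fn=split_by_mean, B=20, seed=0):
    """Gap_n(k) = mean_b log(W_k*) - log(W_k), with a uniform reference box."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    ref = [log_Wk(Xb, cluster_fn(Xb, k))
           for Xb in (rng.uniform(lo, hi, size=X.shape) for _ in range(B))]
    return float(np.mean(ref) - log_Wk(X, cluster_fn(X, k)))
```

For clearly separated data, the gap at k = 2 exceeds the gap at k = 1, since the real W_k drops much faster than the reference W_k*.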
49Microarray Workshop
Estimating number of clusters
There are other resampling-based (e.g. Dudoit and Fridlyand, 2002) and non-resampling-based rules for estimating the number of clusters (for a review see Milligan and Cooper (1978) and Dudoit and Fridlyand (2002)).
The bottom line is that none work very well in complicated situations and, to a large extent, clustering lies outside the usual statistical framework.
It is always reassuring when you are able to characterize newly discovered clusters using information that was not used for the clustering.
50Microarray Workshop
Confidence in the individual cluster assignments
Want to assign confidence to individual observations of being in their assigned clusters.
•Model-based clustering: natural probability interpretation
•Partitioning methods: silhouette
•Dudoit and Fridlyand (2003) presented a resampling-based approach that assigns confidence by computing the proportion of resampling runs in which an observation ends up in its assigned cluster.
51Microarray Workshop
Tight clustering (genes)
Identifies small stable gene clusters by not attempting to cluster all the genes. Thus, it does not require estimating the number of clusters or assigning all points to clusters. This aids the interpretability and validity of the results. (Tseng et al., 2003)
Algorithm:
For sequence of k > k0:
1. Identify the set of genes that are consistently grouped together when genes are repeatedly sub-sampled. Order those sets by size. Consider the top largest q sets for each k.
2. Stop when for (k, (k+1)), the two sets are nearly identical. Take the set corresponding to (k+1). Remove that set from the dataset.
3. Set k0 = k0 -1 and repeat the procedure.
52Microarray Workshop
Two-way clustering of genes and samples.
Refers to methods that use samples and genes simultaneously to extract information. These methods are not yet well developed.
Some examples of the approaches include Block Clustering (Hartigan, 1972) which repeatedly rearranges rows and columns to obtain the largest reduction of total within block variance.
Another method is based on Plaid Models (Lazzeroni and Owen, 2002)
Friedman and Meulman (2002) present an algorithm that clusters samples based on subsets of attributes, i.e. each group of samples can be characterized by a different gene set.
53Microarray Workshop
Applications of clustering to themicroarray data
Alizadeh et al (2000) Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling.
•Three subtypes of lymphoma (FL, CLL and DLBCL) have different genetic signatures (81 cases total).
•The DLBCL group can be partitioned into two subgroups with significantly different survival (39 DLBCL cases).
54Microarray Workshop
Taken from Alizadeh et al, “Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling”, Nature, February 2000.
Clustering both cell samples and genes
55Microarray Workshop
Clustering cell samples: discovering sub-groups
Taken from Alizadeh et al (Nature, 2000)
56Microarray Workshop
Attempt at validation of DLBCL subgroups
Taken from Alizadeh et al (Nature, 2000)
57Microarray Workshop
Yeast cell cycle (Cho et al, 1998): 6 × 5 SOM with 828 genes
Taken from Tamayo et al, (PNAS, 1999)
Clustering genes: finding different patterns in the data
58Microarray Workshop
Summary
Which clustering method should I use?
- What is the biological question?
- Do I have a preconceived notion of how many clusters there should be?
- Hard or soft boundaries between clusters?
Keep in mind:
- Clustering cannot NOT work. That is, every clustering method will return clusters.
- Clustering helps to group / order information and is a visualization tool for learning about the data. However, clustering results do not provide biological “proof”.
- Clustering is generally used as an exploratory and hypothesis-generating tool.
59Microarray Workshop
Discrimination
60Microarray Workshop
Predefined Class
{1,2,…K}
1 2 K
Objects
Basic principles of discrimination
• Each object is associated with a class label (or response) Y ∈ {1, 2, …, K} and a feature vector (vector of predictor variables) of G measurements: X = (X1, …, XG)
Aim: predict Y from X.
X = {red, square} Y = ?
Y = Class Label = 2
X = Feature vector {colour, shape}
Classification rule ?
61Microarray Workshop
Discrimination and Allocation
Learning set: data with known classes
Classification technique
Classification rule
Data with unknown classes
Class assignment
Discrimination
Prediction
62Microarray Workshop
? Bad prognosis: recurrence < 5 yrs
Good prognosis: recurrence > 5 yrs
Reference: L van’t Veer et al (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, Jan.
Objects: arrays
Feature vectors: gene expression
Predefined classes: clinical outcome
new array
Learning set
Classification rule
Good prognosis (metastasis-free > 5 yrs)
63Microarray Workshop
B-ALL T-ALL AML
ReferenceGolub et al (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286(5439): 531-537.
Objects: arrays
Feature vectors: gene expression
Predefined classes: tumor type
?
new array
Learning set
Classification rule
T-ALL
64Microarray Workshop
Classification Rule
- Classification procedure
- Feature selection
- Parameters [pre-determined, estimable]
- Distance measure
- Aggregation methods
Performance Assessmente.g. Cross validation
• One can think of the classification rule as a black box; some methods provide more insight into the box.
• Performance assessment needs to be carried out for every classification rule.
65Microarray Workshop
Classification rule: maximum likelihood discriminant rule
• A maximum likelihood estimator (MLE) chooses the parameter value that makes the chance of the observations the highest.
• For known class conditional densities pk(X), the maximum likelihood (ML) discriminant rule predicts the class of an observation X by
C(X) = argmaxk pk(X)
66Microarray Workshop
Gaussian ML discriminant rules
• For multivariate Gaussian (normal) class densities X|Y = k ~ N(μ_k, Σ_k), the ML classifier is
C(X) = argmin_k { (X − μ_k) Σ_k⁻¹ (X − μ_k)′ + log|Σ_k| }
• In general, this is a quadratic rule (quadratic discriminant analysis, or QDA)
• In practice, the population mean vectors μ_k and covariance matrices Σ_k are estimated by the corresponding sample quantities
67Microarray Workshop
ML discriminant rules - special cases
[DLDA] Diagonal linear discriminant analysis: class densities have the same diagonal covariance matrix Σ = diag(s₁², …, s_p²)
[DQDA] Diagonal quadratic discriminant analysis: class densities have different diagonal covariance matrices Σ_k = diag(s₁k², …, s_pk²)
Note. Weighted gene voting of Golub et al. (1999) is a minor variant of DLDA for two classes (different variance calculation).
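DLDA reduces to a simple variance-scaled distance to each class mean. A minimal numpy sketch of that rule (illustrative names, not the workshop's code):

```python
import numpy as np

def dlda_fit(X, y):
    """Class means plus one pooled, diagonal variance vector (the DLDA assumption)."""
    X = np.asarray(X, float)
    y = np.asarray(y)
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # pooled within-class variance per gene; diagonal covariance shared by all classes
    resid = np.concatenate([X[y == c] - X[y == c].mean(axis=0) for c in classes])
    var = resid.var(axis=0) + 1e-8   # small constant guards against zero variance
    return classes, means, var

def dlda_predict(x, classes, means, var):
    """Assign x to the class minimizing sum_g (x_g - mean_kg)^2 / s_g^2."""
    scores = (((np.asarray(x, float) - means) ** 2) / var).sum(axis=1)
    return classes[int(scores.argmin())]
```

DQDA would differ only in estimating a separate variance vector per class instead of pooling.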
68Microarray Workshop
The Logistic Regression Model
2-class case: log[p/(1−p)] = α + βX
p is the probability that the event Y occurs given the observed gene expression pattern, p(Y=1 | X)
p/(1-p) is the "odds ratio" log[p/(1-p)] is the log odds ratio, or "logit"
This can easily be generalized to multiclass outcomes and to more general dependences than linear. Also, logistic regression makes fewer assumptions on the marginal distribution of the variables. However, the results are generally very similar to LDA. (Hastie et al, 2003)
69Microarray Workshop
Classification with SVMs
Generalization of the idea of separating hyperplanes in the original space. Linear boundaries between classes in a higher-dimensional space lead to non-linear boundaries in the original space.
Adapted from internet
70Microarray Workshop
Nearest neighbor classification
• Based on a measure of distance between observations (e.g. Euclidean distance or one minus correlation).
• The k-nearest neighbor rule (Fix and Hodges (1951)) classifies an observation X as follows:
- find the k observations in the learning set closest to X
- predict the class of X by majority vote, i.e., choose the class that is most common among those k observations.
• The number of neighbors k can be chosen by cross-validation (more on this later).
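The two steps of the rule map directly onto a few lines of numpy (a minimal sketch with illustrative names):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """k-nearest-neighbour rule: majority vote among the k closest learning-set points."""
    X_train = np.asarray(X_train, float)
    d = np.sqrt(((X_train - np.asarray(x, float)) ** 2).sum(axis=1))
    nearest = d.argsort()[:k]                    # the k closest observations
    votes = Counter(np.asarray(y_train)[nearest].tolist())
    return votes.most_common(1)[0][0]            # most common class among them
```

Any of the metrics discussed earlier (e.g. one minus correlation) could replace the Euclidean distance here.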
71Microarray Workshop
Nearest neighbor rule
72Microarray Workshop
Classification tree
• Partition the feature space into a set of rectangles, then fit a simple model in each one
• Binary tree structured classifiers are constructed by repeated splits of subsets (nodes) of the measurement space X into two descendant subsets (starting with X itself)
• Each terminal subset is assigned a class label; the resulting partition of X corresponds to the classifier
73Microarray Workshop
Classification tree
Figure: the tree splits first on Gene 1 (Mi1 < -0.67) and then on Gene 2 (Mi2 > 0.18), assigning samples to classes 0, 1, and 2; equivalently, the splits at -0.67 and 0.18 partition the (Gene 1, Gene 2) plane into rectangles.
74Microarray Workshop
Three aspects of tree construction
• Split selection rule:
- Example: at each node, choose the split maximizing the decrease in impurity (e.g. Gini index, entropy, misclassification error).
• Split-stopping:
- Example, grow large tree, prune to obtain a sequence of subtrees, then use cross-validation to identify the subtree with lowest misclassification rate.
• Class assignment:
- Example, for each terminal node, choose the class minimizing the resubstitution estimate of misclassification probability, given that a case falls into this node.
Supplementary slide
75Microarray Workshop
Another component in classification rule:aggregating classifiers
Training Set
X1, X2, … X100
Resample 1 → Classifier 1
Resample 2 → Classifier 2
Resample 499 → Classifier 499
Resample 500 → Classifier 500
Examples: bagging, boosting, random forest
→ Aggregate classifier
76Microarray Workshop
Aggregating classifiers:Bagging
Diagram: from the training set (arrays) X1, X2, …, X100, draw bootstrap resamples X*1, X*2, …, X*100 and grow Tree 1 through Tree 500. Each tree votes on the test sample (e.g. Class 1, Class 2, Class 1, Class 1, …), and the aggregate prediction is the majority vote (here 90% Class 1, 10% Class 2).
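A small numpy sketch of bagging, with the caveat that a nearest-mean classifier stands in for the trees in the diagram (all names here are illustrative):

```python
import numpy as np
from collections import Counter

def nearest_mean_fit(X, y):
    """A simple base classifier: one mean per class (stand-in for a tree)."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def nearest_mean_predict(model, x):
    return min(model, key=lambda c: ((model[c] - x) ** 2).sum())

def bagging_predict(X, y, x_new, B=50, seed=0):
    """Bagging: fit the base classifier on B bootstrap resamples and let them vote."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, float)
    y = np.asarray(y)
    votes = []
    for _ in range(B):
        boot = rng.integers(0, len(y), size=len(y))   # resample cases with replacement
        model = nearest_mean_fit(X[boot], y[boot])
        votes.append(nearest_mean_predict(model, np.asarray(x_new, float)))
    return Counter(votes).most_common(1)[0][0]        # majority vote
```

Boosting and random forests differ in how the resamples are drawn and how the votes are weighted, but the aggregate-by-voting structure is the same.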
77Microarray Workshop
Other classifiers include…
• Neural networks
• Projection pursuit
• Bayesian belief networks
• …
78Microarray Workshop
Why select features
• Lead to better classification performance by removing variables that are noise with respect to the outcome
• May provide useful insights into etiology of a disease
• Can eventually lead to diagnostic tests (e.g., a “breast cancer chip”)
79Microarray Workshop
Why select features?
Figure: correlation plots (scale -1 to +1) for the 3-class leukemia data: no feature selection, top-100 feature selection, and selection based on variance.
80Microarray Workshop
Approaches to feature selection
• Methods fall into three basic categories:
- Filter methods
- Wrapper methods
- Embedded methods
• The simplest and most frequently used methods are the filter methods.
Adapted from A. Hartemink
81Microarray Workshop
Filter methods
R^p → feature selection → R^s (s << p) → classifier design
•Features are scored independently and the top s are used by the classifier.
•Scores: correlation, mutual information, t-statistic, F-statistic, p-value, tree importance statistic, etc.
Easy to interpret. Can provide some insight into the disease markers.
Adapted from A. Hartemink
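A minimal numpy sketch of a filter method using one of the scores listed above, a per-gene two-sample t-like statistic (function name and constants are illustrative):

```python
import numpy as np

def filter_select(X, y, s):
    """Score every gene independently with a two-sample t-like statistic, keep top s."""
    X = np.asarray(X, float)
    y = np.asarray(y)
    a, b = X[y == 0], X[y == 1]
    # per-gene Welch-style t statistic; small constant avoids division by zero
    t = (a.mean(axis=0) - b.mean(axis=0)) / np.sqrt(
        a.var(axis=0) / len(a) + b.var(axis=0) / len(b) + 1e-12)
    return np.argsort(-np.abs(t))[:s]            # indices of the s top-scoring genes
```

The selected column indices are then passed to whatever classifier is being designed, which is exactly the independence of scoring and classification that the next slide criticizes.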
82Microarray Workshop
Problems with filter method
• Redundancy in selected features: features are considered independently and are not measured on whether they contribute new information
• Interactions among features generally cannot be explicitly incorporated (some filter methods are smarter than others)
• The classifier has no say in which features should be used: some scores may be more appropriate in conjunction with some classifiers than others.
Supplementary slide
Adapted from A. Hartemink
83Microarray Workshop
Dimension reduction: a variant on a filter method
• Rather than retain a subset of s features, perform dimension reduction by projecting the features onto s principal components of variation (e.g. PCA etc.)
• The problem is that we are no longer dealing with one feature at a time but rather with a linear or possibly more complicated combination of all features. It may be good enough for a black box, but how does one build a diagnostic chip on a “supergene”? (even though we don’t want to confuse the tasks)
• These methods tend not to work better than simple filter methods.
Supplementary slide
Adapted from A. Hartemink
84Microarray Workshop
Wrapper methods
R^p → feature selection → R^s (s << p) → classifier design
•Iterative approach: many feature subsets are scored based on classification performance and the best is used.
•Selection of subsets: forward selection, backward selection, forward-backward selection, tree harvesting, etc.
Adapted from A. Hartemink
85Microarray Workshop
Problems with wrapper methods
• Computationally expensive: for each feature subset to be considered, a classifier must be built and evaluated
• No exhaustive search is possible (2^p subsets to consider): generally greedy algorithms only.
• Easy to overfit.
Supplementary slide
Adapted from A. Hartemink
86Microarray Workshop
Embedded methods
• Attempt to jointly or simultaneously train both a classifier and a feature subset
• Often optimize an objective function that jointly rewards accuracy of classification and penalizes use of more features.
• Intuitively appealing
Some examples: tree-building algorithms, shrinkage methods (LDA, kNN)
Adapted from A. Hartemink
87Microarray Workshop
Performance assessment
• Any classification rule needs to be evaluated for its performance on future samples. It is almost never the case in microarray studies that a large, independent, population-based collection of samples is available at the initial classifier-building phase.
• One needs to estimate future performance based on what is available: often the same set that is used to build the classifier.
• Assessing performance of the classifier based on- Cross-validation.- Test set- Independent testing on future dataset
88Microarray Workshop
Diagram of performance assessment
A classifier built on the training set is evaluated either on the training set itself (resubstitution estimation) or on an independent test set (test set estimation).
89Microarray Workshop
Performance assessment (II)
• V-fold cross-validation (CV) estimation: cases in the learning set are randomly divided into V subsets of (nearly) equal size. Build classifiers leaving one set out at a time; compute the test set error rates on the left-out sets and average them.
- Bias-variance tradeoff: smaller V can give larger bias but smaller variance
- Computationally intensive.
• Leave-one-out cross validation (LOOCV).
(Special case for V=n). Works well for stable classifiers (k-NN, LDA, SVM)
Supplementary slide
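The V-fold procedure can be sketched generically in numpy, taking any `fit`/`predict` pair (illustrative names; real implementations would also stratify folds by class):

```python
import numpy as np

def cv_error(X, y, fit, predict, V=5, seed=0):
    """V-fold CV: train leaving one fold out, test on that fold, average the error rates."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, float)
    y = np.asarray(y)
    folds = np.array_split(rng.permutation(len(y)), V)
    errors = []
    for v in range(V):
        test = folds[v]
        train = np.concatenate([folds[u] for u in range(V) if u != v])
        model = fit(X[train], y[train])                        # build on the rest
        pred = np.array([predict(model, x) for x in X[test]])  # error on left-out fold
        errors.append(np.mean(pred != y[test]))
    return float(np.mean(errors))
```

Setting V = len(y) gives leave-one-out cross-validation as a special case.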
90Microarray Workshop
Performance assessment (I)
• Resubstitution estimation: error rate on the learning set.- Problem: downward bias
• Test set estimation:
1) Divide the learning set into two subsets, L and T; build the classifier on L and compute the error rate on T.
2) Build the classifier on the training set (L) and compute the error rate on an independent test set (T).
- L and T must be independent and identically distributed (i.i.d.).
- Problem: reduced effective sample size.
Supplementary slide
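The downward bias of resubstitution is easy to reproduce. The sketch below (synthetic pure-noise data and a 1-nearest-neighbour classifier, both hypothetical choices) gives a resubstitution error of exactly zero — each point is its own nearest neighbour — while the error on an independent test set stays near the true 50%.

```python
import numpy as np

rng = np.random.default_rng(2)

# Pure-noise data: the true error rate of ANY classifier is 50%.
n, p = 40, 500
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, p))
Xnew = rng.normal(size=(n, p))           # independent test set

def one_nn_error(Xtr, ytr, Xte, yte):
    """Error rate of a 1-nearest-neighbour classifier."""
    errs = 0
    for x, t in zip(Xte, yte):
        d = ((Xtr - x) ** 2).sum(axis=1)
        errs += ytr[np.argmin(d)] != t
    return errs / len(yte)

resub = one_nn_error(X, y, X, y)         # each point is its own neighbour
test = one_nn_error(X, y, Xnew, y)
print("resubstitution:", resub, "  test set:", test)
```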
91Microarray Workshop
Diagram of performance assessment
Training set
Performance assessment
TrainingSet
Independenttest set
(CV) Learningset
(CV) Test set
Classifier
Classifier
Classifier
Resubstitution estimation
Test set estimation
Cross Validation
92Microarray Workshop
Performance assessment (III)
• It is common practice to do feature selection using the full learning set, and then to cross-validate only the model-building and classification steps.
• However, the relevant features are usually unknown in advance and the intended inference includes feature selection. In that case, CV estimates computed as above tend to be downward biased.
• Features (variables) should be selected only from the learning set used to build the model within each CV fold (and not from the entire dataset).
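This selection bias is easy to demonstrate. The following numpy sketch (synthetic pure-noise data and a nearest-centroid classifier, both hypothetical) compares LOOCV with genes selected once on the full dataset against LOOCV with genes re-selected inside each fold; only the latter comes close to the true 50% error.

```python
import numpy as np

rng = np.random.default_rng(3)

# Pure-noise data: labels are independent of every gene, so the honest
# error rate is 50% no matter how the genes are chosen.
n, p, keep = 40, 1000, 10
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, p))

def top_genes(X, y, keep):
    """Rank genes by absolute mean difference between the classes."""
    score = np.abs(X[y == 0].mean(axis=0) - X[y == 1].mean(axis=0))
    return np.argsort(score)[-keep:]

genes_full = top_genes(X, y, keep)       # selected on the FULL dataset

def loocv_error(X, y, select_inside):
    errs = 0
    for i in range(len(y)):
        tr = np.delete(np.arange(len(y)), i)
        g = top_genes(X[tr], y[tr], keep) if select_inside else genes_full
        c0 = X[tr][:, g][y[tr] == 0].mean(axis=0)
        c1 = X[tr][:, g][y[tr] == 1].mean(axis=0)
        d0 = ((X[i, g] - c0) ** 2).sum()
        d1 = ((X[i, g] - c1) ** 2).sum()
        errs += (d1 < d0) != (y[i] == 1)
    return errs / len(y)

biased = loocv_error(X, y, select_inside=False)
honest = loocv_error(X, y, select_inside=True)
print("genes fixed outside CV:", biased, "  re-selected inside CV:", honest)
```

This is the phenomenon analyzed by Ambroise and McLachlan (PNAS, 2002), listed in the references.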
93Microarray Workshop
Comparison study
• Leukemia data – Golub et al. (1999)
- n = 72 samples
- G = 3,571 genes
- 3 classes (B-cell ALL, T-cell ALL, AML)
• Reference: S. Dudoit, J. Fridlyand, and T. P. Speed (2002). "Comparison of discrimination methods for the classification of tumors using gene expression data." Journal of the American Statistical Association, Vol. 97, No. 457, pp. 77-87.
94Microarray Workshop
Leukemia data, 3 classes: test set error rates; 150 LS/TS runs
95Microarray Workshop
Results
• In the main comparison, NN and DLDA had the smallest error rates.
• Aggregation improved the performance of CART classifiers.
• For the leukemia datasets, increasing the number of genes to G=200 didn't greatly affect the performance of the various classifiers.
96Microarray Workshop
Comparison study – discussion (I)
• “Diagonal” LDA: ignoring correlation between genes helped here. Unlike classification trees and nearest neighbors, DLDA is unable to take into account gene interactions.
• Classification trees are capable of handling and revealing interactions between variables. In addition, aggregated tree classifiers have useful by-products: prediction votes and variable importance statistics.
• Although nearest neighbors are simple and intuitive classifiers, their main limitation is that they give very little insight into mechanisms underlying the class distinctions.
97Microarray Workshop
Summary (I)
• Bias-variance trade-off. Simple classifiers do well on small datasets. As the number of samples increases, we expect to see that classifiers capable of considering higher-order interactions (and aggregated classifiers) will have an edge.
• Cross-validation. It is of utmost importance to cross-validate for every parameter that has been chosen based on the data, including meta-parameters:
- what and how many features
- how many neighbors
- pooled or unpooled variance
- the classifier itself.
If this is not done, it is possible to wrongly declare discrimination power when there is none.
98Microarray Workshop
Summary (II)
• Generalization error rate estimation. It is necessary to keep the sampling scheme in mind.
• Thousands of independent samples from a variety of sources are needed to assess the true performance of a classifier.
• We are not at that point yet with microarray studies. The van de Vijver et al. (2002) cohort is probably the only study to date with ~300 test samples.
99Microarray Workshop
Some performance assessment quantities
Assume a 2-class problem:
- class 1 = no event ~ null hypothesis (e.g., no recurrence)
- class 2 = event ~ alternative hypothesis (e.g., recurrence)
All quantities are estimated on the available dataset (test set if available)
• Misclassification error rate: proportion of misclassified samples.
• Lift: proportion of correct class 2 predictions divided by the proportion of class 2 cases:
Prob(class 2 is true | class 2 is detected) / Prob(class is 2)
• Odds ratio: measure of association between the true and predicted labels.
100Microarray Workshop
Some performance assessment quantities (ctd)
• Sensitivity: proportion of correct class 2 predictions.
Prob(detect class 2 | class 2 is true) ~ power
• Specificity: proportion of correct class 1 predictions.
Prob(declare class 1 | class 1 is true) = 1 – Prob(detect class 2 | class 1 is true) ~ 1 – type I error
101Microarray Workshop
Some performance assessment quantities (ctd)
• Positive Predictive Value (PPV): proportion of true class 2 cases among predicted class 2 cases (the quantity applicable to the population).
Prob(class 2 is true | class 2 is detected)
= Prob(detect class 2 | class 2 is true) x Prob(class 2 is true) / Prob(detect class 2)
= sensitivity x Prob(class is 2) / [sensitivity x Prob(class is 2) + (1 – specificity) x (1 – Prob(class is 2))]
Note that PPV is the only quantity that explicitly incorporates population proportions, i.e., the prevalence of class 2 in the population of interest (Prob(class is 2)), as well as sensitivity and specificity.
If the prevalence is low, the specificity of the test has to be very high for it to be clinically useful.
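The PPV formula above is just Bayes' rule and can be checked in a few lines; the example inputs below (perfect sensitivity, 95% specificity) are illustrative.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value via Bayes' rule."""
    tp = sensitivity * prevalence              # true positive mass
    fp = (1 - specificity) * (1 - prevalence)  # false positive mass
    return tp / (tp + fp)

# Perfect sensitivity, specificity 95%:
print(round(100 * ppv(1.0, 0.95, 0.43), 1))      # → 93.8 (high prevalence)
print(round(100 * ppv(1.0, 0.95, 1 / 2500), 1))  # → 0.8 (rare disease)
```

The same test drops from a 93.8% to a 0.8% PPV purely because the prevalence changes.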
102Microarray Workshop
Case studies

Reference 1 – retrospective study: L. van 't Veer et al., "Gene expression profiling predicts clinical outcome of breast cancer", Nature, Jan 2002.
- Learning set of "good" and "bad" prognosis samples used to build the classification rule.
- Feature selection: correlation with class labels, very similar to a t-test.
- Cross-validation used to select 70 genes.

Reference 2 – cohort study: M. van de Vijver et al., "A gene expression signature as a predictor of survival in breast cancer", The New England Journal of Medicine, Dec 2002.
- 295 samples selected from the Netherlands Cancer Institute tissue bank (1984–1995).
- Result: the gene expression profile is a more powerful predictor than standard systems based on clinical and histologic criteria.

Reference 3 – prospective trials: Agendia (formed by researchers from the Netherlands Cancer Institute), started Oct 2003. http://www.agendia.com/
- Clinical trials (Aug 2003): 1) 5,000 subjects [Health Council of the Netherlands]; 2) 5,000 subjects, New York-based Avon Foundation.
- Custom arrays made by Agilent, including the 70 genes + 1,000 controls.
103Microarray Workshop
Van 't Veer breast cancer study
Investigate whether a tumor's ability to metastasize is acquired later in development or is inherent in the initial gene expression signature.
• Retrospective sampling of node-negative women: 44 non-recurrences within 5 years of surgery and 34 recurrences. Additionally, 19 test samples (12 recurrences and 7 non-recurrences).
• Want to demonstrate that gene expression profile is significantly associated with recurrence independent of the other clinical variables.
Nature, 2002
104Microarray Workshop
Predictor development
• Identify a set of genes with correlation > 0.3 with the binary outcome. Show that there is significant enrichment for such genes in the dataset.
• Rank-order genes on the basis of their correlation.
• Optimize the number of genes in the classifier using leave-one-out CV: classification is made on the basis of the correlations of the expression profile of the left-out sample with the mean expression of the remaining samples from the good- and bad-prognosis patients, respectively.
N.B.: The correct way to select genes is within rather than outside cross-validation, resulting in a different set of markers for each CV iteration.
N.B.: Optimizing the number of variables and other parameters should be done via 2-level (nested) cross-validation if results are to be assessed on the training set.
The classification indicator is included in the logistic model along with the other clinical variables. The gene expression profile is shown to have the strongest effect. Note that some of this may be due to overfitting of the threshold parameter.
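A rough numpy sketch of this scheme: genes correlated with outcome are selected inside each LOOCV fold (as the N.B. above recommends), and the left-out sample is classified by which class mean profile its expression correlates with more. The data are a synthetic stand-in; the 0.3 correlation threshold follows the slide, everything else is invented.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy stand-in for the expression matrix: 30 patients x 200 genes,
# with the first 20 genes (weakly) associated with outcome.
n, p = 30, 200
y = np.repeat([0, 1], n // 2)            # 0 = good, 1 = bad prognosis
X = rng.normal(size=(n, p))
X[y == 1, :20] += 1.5

def corr_with_outcome(X, y):
    """Pearson correlation of each gene with the binary outcome."""
    yc = y - y.mean()
    Xc = X - X.mean(axis=0)
    return Xc.T @ yc / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

errs = 0
for i in range(n):
    tr = np.delete(np.arange(n), i)
    r = corr_with_outcome(X[tr], y[tr])
    g = np.where(np.abs(r) > 0.3)[0]     # select genes INSIDE the fold
    good = X[tr][:, g][y[tr] == 0].mean(axis=0)
    bad = X[tr][:, g][y[tr] == 1].mean(axis=0)
    # classify by which mean profile the left-out sample matches better
    pred = np.corrcoef(X[i, g], bad)[0, 1] > np.corrcoef(X[i, g], good)[0, 1]
    errs += pred != (y[i] == 1)

print("LOOCV error:", errs / n)
```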
105Microarray Workshop
Van ‘t Veer, et al., 2002
106Microarray Workshop
van de Vijver's breast data (NEJM, 2002)
• 295 additional breast cancer patients, mix of node-negative and node-positive samples.
• Want to use the predictor that was developed to identify patients at risk for metastasis.
• The predicted class was significantly associated with time to recurrence in a multivariate Cox proportional-hazards model.
107Microarray Workshop
108Microarray Workshop
Some examples of wrong answers and questions in microarray data analysis
109Microarray Workshop
Biological verification and interpretation
Microarray experiment
Experimental design
Image analysis
Normalization
Biological question
TestingEstimation DiscriminationAnalysis
Clustering
Life Cycle
Quality measurement
Failed
Pass
110Microarray Workshop
Prediction I: estimating misclassification error
Performance of the classifiers on future samples needs to be assessed while taking population proportions into account.
Question: Build a classifier to predict a rare (1/100) subclass of cancer and estimate its misclassification rate in the population.
Design: Retrospectively collect equal numbers of rare and common subtypes and build a classifier. Estimate its future performance using cross-validation on the collected set.
Issues: Population proportions of the two types differ from the proportions in the study. For instance, if 0/50 of rare subtype and 10/50 of common subtype were misclassified (10/100), then in population, we expect to observe 1 rare instance and 99 common ones and will misclassify approximately 20/100 samples.
Conclusion: If a dataset is not representative of population distributions, one needs to think hard about how to do the “translation”. (e.g., Positive Predictive Value on the future samples vs Specificity and Sensitivity on the current ones).
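The "translation" in this example is a one-line re-weighting of the class-specific error rates by the population prevalences (the numbers below are the ones from the slide):

```python
def population_error(class_errors, prevalences):
    """Re-weight class-specific error rates by population prevalence."""
    return sum(e * q for e, q in zip(class_errors, prevalences))

errors = [0 / 50, 10 / 50]                 # rare and common subtype
study = population_error(errors, [0.5, 0.5])      # balanced study design
future = population_error(errors, [0.01, 0.99])   # true population mix
print(study, future)                       # 10% in the study, ~20% in use
```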
111Microarray Workshop
Prediction II: Prevalence vs PPV (ctd)

PPV (%) by prevalence and specificity, assuming a constant sensitivity of 100%:

Specificity | Prevalence: 50% | 43%  | 10% | 1%  | 0.1% | One per 2500
90%         |             91  | 88   | 53  | 9   | 1    | 0.4
95%         |             95  | 94*  | 69  | 17  | 2    | 0.8**
99%         |             99  | 99   | 92  | 50  | 9    | 4
99.9%       |             99.9| 99.9 | 99  | 91  | 50   | 29

*PPV reported by Petricoin et al. (2002)
**Correct PPV assuming the prevalence of ovarian cancer in the general population is 1/2500.

Note that discovering discriminatory power is not the same as demonstrating the clinical utility of a classifier.

Adapted from the comment in the Lancet by Rockhill
112Microarray Workshop
Experimental design
Proper randomization is essential in experimental design.
Question: Build a predictor to diagnose ovarian cancer
Design: Tissue from Normal women and Ovarian cancer patients arrives at different times.
Issues: Complete confounding between tissue type and time of processing.
This phenomenon is very common in the absence of a carefully thought-through design.
Post-mortem diagnosis: lack of randomization.
113Microarray Workshop
Clustering I
The procedure should not bias results towards desired conclusions.
Question: Do expression data cluster according to survival status?
Design: Identify genes with high t-statistics comparing short and long survivors. Use these genes to cluster the samples. Get excited that the samples cluster according to survival status.
Issues: The genes were already selected based on survival status. Therefore, it would be surprising if the samples did *not* cluster according to their survival.
Conclusion: None is possible with respect to clustering, as the variable selection was driven by the class distinction.
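The circularity is easy to demonstrate on pure noise: genes selected for separating the "survival" groups will make those very groups cluster apart. A numpy sketch (the data and the bare-bones 2-means implementation are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)

# Pure noise: 20 samples x 1000 genes; "survival" labels are arbitrary.
n, p, keep = 20, 1000, 20
labels = np.repeat([0, 1], n // 2)       # e.g. short vs long survivors
X = rng.normal(size=(n, p))

# Select the genes that best separate the survival groups...
diff = np.abs(X[labels == 0].mean(axis=0) - X[labels == 1].mean(axis=0))
g = np.argsort(diff)[-keep:]
Xg = X[:, g]

# ...then cluster the samples on those genes with plain 2-means (Lloyd).
centers = Xg[[0, n - 1]].copy()          # one seed from each label group
for _ in range(20):
    assign = ((Xg[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
    centers = np.array([Xg[assign == k].mean(axis=0)
                        if (assign == k).any() else centers[k]
                        for k in (0, 1)])

# Agreement between clusters and labels (up to label swap).
agreement = max((assign == labels).mean(), (assign != labels).mean())
print("cluster/label agreement:", agreement)
```

Despite the data containing no structure at all, the clusters largely reproduce the labels used to pick the genes.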
114Microarray Workshop
Clustering II
P-values for differential expression are only valid when the class labels are independent of the current dataset.
Question: Identify genes distinguishing among "interesting" subgroups.
Design: Cluster samples into K groups. For each gene, compute the F-statistic and its associated p-value to test for differential expression among the subgroups.
Issues: The same data were used to create the groups and to test for differential expression – the p-values are invalid.
Conclusion: None with respect to DE p-values. Nevertheless, it is possible to select genes with high values of the statistic and test hypotheses about functional enrichment with, e.g., Gene Ontology. One can also cluster these genes and use the results to generate new hypotheses.
115Microarray Workshop
Acknowledgements
UCSF/CBMB: Ajay Jain, Mark Segal; UCSF Cancer Center Array Core; Jain Lab.
SFGH: Agnes Paquet, David Erle, Andrea Barczak; UCSF Sandler Genomics Core Facility.
UCB: Terry Speed, Sandrine Dudoit.
116Microarray Workshop
Some references
1. Hastie, Tibshirani, Friedman, "The Elements of Statistical Learning", Springer, 2001.
2. Speed (ed.), "Statistical Analysis of Gene Expression Microarray Data", Chapman & Hall/CRC, 2003.
3. Alizadeh et al., "Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling", Nature, 2000.
4. van 't Veer et al., "Gene expression profiling predicts clinical outcome of breast cancer", Nature, 2002.
5. van de Vijver et al., "A gene-expression signature as a predictor of survival in breast cancer", NEJM, 2002.
6. Petricoin et al., "Use of proteomic patterns in serum to identify ovarian cancer", Lancet, 2002 (and relevant correspondence).
7. Golub et al., "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring", Science, 1999.
8. Cho et al., "A genome-wide transcriptional analysis of the mitotic cell cycle", Mol. Cell, 1999.
9. Dudoit et al., "Comparison of discrimination methods for the classification of tumors using gene expression data", JASA, 2002.
117Microarray Workshop
Some references
10. Ambroise and McLachlan, "Selection bias in gene extraction on the basis of microarray gene-expression data", PNAS, 2002.
11. Tibshirani et al., "Estimating the number of clusters in a dataset via the gap statistic", Tech. Report, Stanford, 2000.
12. Tseng et al., "Tight clustering: a resampling-based approach for identifying stable and tight patterns in data", Tech. Report, 2003.
13. Dudoit and Fridlyand, "A prediction-based resampling method for estimating the number of clusters in a dataset", Genome Biology, 2002.
14. Dudoit and Fridlyand, "Bagging to improve the accuracy of a clustering procedure", Bioinformatics, 2003.
15. Kaufman and Rousseeuw, "Clustering by means of medoids", Elsevier/North-Holland, 1987.
16. See the many articles by Leo Breiman on aggregation.