Slide 1

Introduction to Classification Issues in Microarray Data Analysis

Jane Fridlyand, Jean Yee Hwa Yang

University of California, San Francisco

Elsinore, Denmark, May 17-21, 2004

Slide 2

Brief Overview of the Life-Cycle

Slide 3

Life Cycle

[Diagram: Biological question → Experimental design → Microarray experiment → Image analysis → Quality measurement (Pass; Failed arrays return to the microarray experiment) → Pre-processing → Analysis (Estimation, Testing, Discrimination, Clustering) → Biological verification and interpretation.]

Slide 4

• The steps outlined in the “Life Cycle” need to be carefully thought through and re-adjusted for each data type/platform combination. The experimental design will determine what questions can be asked, and answered, once the data are collected.

• To call in the statistician after the experiment is done may be no more than asking him to perform a postmortem examination: he may be able to say what the experiment died of.

- Sir R. A. Fisher

Slide 5

Different Technologies

• GeneChip Affymetrix
• cDNA microarray
• Nylon membrane
• Agilent: long-oligo ink jet
• Illumina Bead Array
• CGH
• SAGE

Slide 6

Some statistical issues

• Designing gene expression experiments.
• Acquiring the raw data: image analysis.
• Assessing the quality of the data.
• Summarizing and removing artifacts from the data.
• Interpretation and analysis of the data:
 - Discovering which genes are differentially expressed
 - Discovering which genes exhibit interesting expression patterns
 - Detection of gene regulatory mechanisms
 - Classification of samples
 - And many others…

Lots of other bioinformatics issues …

For a review see Smyth, Yang and Speed, “Statistical issues in microarray data analysis”, In: Functional Genomics: Methods and Protocols, Methods in Molecular Biology, Humana Press, March 2003

Slide 7

[Pre-processing workflow, by platform. Image analysis produces platform-specific files, which then pass through quality assessment and pre-processing:

• Short-oligonucleotide chip data (CEL, CDF files): quality assessment; background correction; probe-level normalization; probe-set summary.
• Two-color spotted array data (gpr, gal files): quality assessment, diagnostic plots; background correction; array normalization.
• Array CGH data (UCSF Spot files): quality assessment, diagnostic plots; background correction; clone summary; array normalization.

The result is a probes-by-samples matrix of log-ratios or log-intensities, which feeds the analysis of expression data: identification of differentially expressed (D.E.) genes (estimation and testing), clustering, and discrimination.]

Slide 8

Linear Models

Specific examples: t-tests, F-tests, empirical Bayes, SAM.

Examples:
• Identify differentially expressed genes among two or more tumor subtypes or different cell treatments.
• Look for genes that have different time profiles between different mutants.
• Look for genes associated with survival.

Slide 9

Clustering

Algorithms:
• Hierarchical clustering
• Self-organizing maps
• Partitioning around medoids (PAM)

Examples:
• We can cluster cell samples (columns), e.g. to identify new / unknown tumor subclasses or cell subtypes using gene expression profiles.
• We can cluster genes (rows), e.g. using large numbers of yeast experiments, to identify groups of co-expressed genes.

Slide 10

Discrimination

Classification rules:
• DLDA or DQDA
• k-nearest neighbor (kNN)
• Support vector machine (SVM)
• Classification tree

[Figure: a learning set of B-ALL, T-ALL and AML samples and a classification tree built from it, splitting first on Gene 1 (M_i1 < -0.67) and then on Gene 2 (M_i2 > 0.18) to assign the labels B-ALL, AML and T-ALL.]

Questions:
• Can I identify groups of genes that are predictive of a particular class of tumors?
• Can I use the expression profile of cancer patients to predict survival?

Slide 11

Annotation

[Diagram: a single probe linked to its annotation across many resources.]
• Riken ID: ZX00049O01
• GenBank accession: AV128498
• LocusLink: 15903
• Biochemical pathways (KEGG)
• Nucleotide sequence: TCGTTCCATTTTTCTTTAGGGGGTCTTTCCCCGTCTTGGGGGGGAGGAAAAGTTCTGCTGCCCTGATTATGAACTCTATAATAGAGTATATAGCTTTTGTACCTTTTTTACAGGAAGGTGCTTTCTGTAATCATGTGATGTATATTAAACTTTTTATAAAAGTTAACATTTTGCATAATAAACCATTTTTG
• Bay Genomics ES cells
• UniGene: Mm.110
• MGD: MGI:96398
• Name: inhibitor of DNA binding 3
• Gene symbol: Idb3
• Swiss-Prot: P20109
• GO: GO:0000122, GO:0005634, GO:0019904
• Map position: chromosome 4, 66.0 cM
• Literature (PubMed): 12858547, 2000388, etc.

Slide 12

What is your question?

• What are the target genes of my knock-out gene? Look for genes that have different time profiles between different cell types.
→ Gene discovery, differential expression

• Is a specified group of genes all up-regulated under a specified condition?
→ Gene set, differential expression

• Can I use the expression profile of cancer patients to predict survival? Can I identify groups of genes that are predictive of a particular class of tumors?
→ Class prediction, classification

• Are there tumor sub-types not previously identified? Are there groups of co-expressed genes?
→ Class discovery, clustering

• Detection of gene regulatory mechanisms; do my genes group into previously undiscovered pathways?
→ Clustering. Often expression data alone are not enough; sequence and other information need to be incorporated.

Slide 13

Classification

Slide 14

Gene expression data: two-color spotted array

Data on G genes for n samples

Rows: genes; columns: mRNA samples. The expression level of gene i in mRNA sample j is the (normalized) log(Red intensity / Green intensity).

         sample1  sample2  sample3  sample4  sample5  ...
gene 1     0.46     0.30     0.80     1.51     0.90   ...
gene 2    -0.10     0.49     0.24     0.06     0.46   ...
gene 3     0.15     0.74     0.04     0.10     0.20   ...
gene 4    -0.45    -1.03    -0.79    -0.56    -0.32   ...
gene 5    -0.06     1.06     1.35     1.09    -1.09   ...
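To make the data layout concrete, here is a minimal Python sketch of how such a matrix arises (all intensity values are invented, and the median-centering is only a stand-in for real, platform-specific normalization):

```python
import numpy as np

# Hypothetical red/green channel intensities for 5 genes x 5 samples.
red = np.array([[1200,  900, 1800, 2900, 1900],
                [ 850, 1300, 1100,  980, 1300],
                [1050, 1600,  990, 1020, 1100],
                [ 700,  480,  570,  660,  790],
                [ 940, 2000, 2500, 2100,  450]], dtype=float)
green = np.full_like(red, 1000.0)   # reference channel

# Expression value of gene i in sample j = log2(Red / Green).
M = np.log2(red / green)

# Crude per-array centering: subtract each sample's median log-ratio
# (real normalization, e.g. loess, is platform-specific).
M_norm = M - np.median(M, axis=0)
print(M_norm.round(2))
```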

Slide 15

Classification

• Task: assign objects to classes (groups) on the basis of measurements made on the objects

• Unsupervised: classes unknown, want to discover them from the data (cluster analysis)

• Supervised: classes are predefined, want to use a (training or learning) set of labeled objects to form a classifier for classification of future observations

Slide 16

Example: Tumor Classification

• Reliable and precise classification essential for successful cancer treatment

• Current methods for classifying human malignancies rely on a variety of morphological, clinical and molecular variables

• Uncertainties in diagnosis remain; likely that existing classes are heterogeneous

• Characterize molecular variations among tumors by monitoring gene expression (microarray)

• Hope: that microarrays will lead to more reliable tumor classification (and therefore more appropriate treatments and better outcomes)

Slide 17

Tumor Classification Using Gene Expression Data

Three main types of statistical problems associated with tumor classification:

• Identification of new/unknown tumor classes using gene expression profiles (unsupervised learning – clustering)

• Classification of malignancies into known classes (supervised learning – discrimination)

• Identification of “marker” genes that characterize the different tumor classes (feature or variable selection).

Slide 18

Clustering

Slide 19

Generic Clustering Tasks

• Estimating number of clusters

• Assigning samples to the groups

• Assessing strength/confidence of cluster assignments for individual objects

Slide 20

What to cluster

• Samples: To discover novel subtypes of the existing groups or entirely new partitions. Their utility needs to be confirmed with other types of data, e.g. clinical information.

• Genes: To discover groups of co-regulated genes/ESTs and use these groups to infer function where it is unknown using members of the groups with known function.

Slide 21

Basic principles of clustering

Aim: to group observations or variables that are “similar” based on predefined criteria.

Issues:
• Which genes / arrays to use?
• Which similarity or dissimilarity measure?
• Which method to use to join clusters/observations? Which clustering algorithm?
• How to validate the resulting clusters?

It is advisable to reduce the number of genes from the full set to some more manageable number, before clustering. The basis for this reduction is usually quite context specific and varies depending on what is being clustered, genes or arrays.

Slide 22

Clustering of genes

[Diagram: array data → for each gene, calculate a summary statistic and/or adjusted p-values → a set of candidate DE genes goes to biological verification; → clustering (choose a similarity metric and a clustering algorithm) → descriptive interpretation.]

Slide 23

Clustering of samples and genes

[Diagram: array data → set of samples to cluster, plus a set of genes to use in the clustering (do NOT use class labels in determining that set) → clustering (similarity metric, clustering algorithm) → descriptive interpretation of the genes separating novel subgroups of the samples → validation of clusters with clinical data.]

Slide 24

Which similarity or dissimilarity measure?

• A metric is a measure of the similarity or dissimilarity between two data objects.
• Two main classes of metric:
 - Correlation coefficients (similarity): compare the shapes of expression curves; types include centered, un-centered and rank correlation.
 - Distance metrics (dissimilarity): city block (Manhattan) distance, Euclidean distance.

Slide 25

• Pearson correlation coefficient (centered correlation), where $\bar{x}$, $\bar{y}$ are the means and $S_x$, $S_y$ the standard deviations of x and y:

$$ r(x, y) = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{S_x} \right) \left( \frac{y_i - \bar{y}}{S_y} \right) $$

• Others include Spearman's and Kendall's rank correlations.

Correlation (a measure between -1 and 1)

[Figure: scatterplots illustrating positive and negative correlation.]

You can use absolute correlation to capture both positive and negative correlation
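The three flavors of correlation, plus the absolute-correlation trick, written out in a short Python sketch (the two expression vectors are hypothetical; Pearson and Spearman come from scipy, and the un-centered version is computed by hand):

```python
import numpy as np
from scipy import stats

x = np.array([0.4, 1.1, -0.3, 0.8, 1.5, -0.6])
y = np.array([0.5, 0.9, -0.2, 1.0, 1.2, -0.4])

# Centered (Pearson) correlation.
r_centered = stats.pearsonr(x, y)[0]

# Un-centered correlation: same formula, but without subtracting means.
r_uncentered = np.sum(x * y) / np.sqrt(np.sum(x**2) * np.sum(y**2))

# Rank correlation (Spearman).
r_rank = stats.spearmanr(x, y)[0]

# Absolute correlation treats strong negative correlation as similarity too.
print(r_centered, r_uncentered, r_rank, abs(r_centered))
```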

Slide 26

Potential pitfalls

[Figure: expression profiles with correlation = 1.]

Slide 27

Distance metrics

• City block (Manhattan) distance: the sum of absolute differences across dimensions; less sensitive to outliers; gives diamond-shaped clusters:

$$ d(X, Y) = \sum_i |x_i - y_i| $$

• Euclidean distance: the most commonly used distance; gives sphere-shaped clusters; corresponds to the geometric distance in multidimensional space:

$$ d(X, Y) = \sqrt{\sum_i (x_i - y_i)^2} $$

where gene $X = (x_1, \dots, x_n)$ and gene $Y = (y_1, \dots, y_n)$.

[Figure: points X and Y plotted against Condition 1 and Condition 2, illustrating the two distance measures.]
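A minimal sketch of the two distances in Python (toy profiles, numpy only):

```python
import numpy as np

X = np.array([0.2, 1.4, -0.5, 0.9])
Y = np.array([0.1, 1.0, -0.2, 1.3])

# City block (Manhattan): sum of absolute differences across dimensions.
d_manhattan = np.sum(np.abs(X - Y))

# Euclidean: square root of the sum of squared differences.
d_euclidean = np.sqrt(np.sum((X - Y) ** 2))

print(d_manhattan, d_euclidean)
```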

Slide 28

Euclidean vs Correlation (I)

• Euclidean distance

• Correlation

Slide 29

How to Compute Group Similarity?

Given two groups g1 and g2,

•Single-link algorithm: s(g1,g2)= similarity of the closest pair

•Complete-link algorithm: s(g1,g2)= similarity of the furtherest pair•Average-link algorithm: s(g1,g2)= average of similarity of all pairs

Four Popular Methods:

•Centroid algorithm: s(g1,g2)= distance between centroids of the two clusters

Supplementary slide

Adapted from internet
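A short sketch of the four group-similarity rules using scipy's hierarchical clustering (the data are random toy samples; fcluster cuts each merge tree into three clusters):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))   # 10 hypothetical samples, 4 features each

for method in ["single", "complete", "average", "centroid"]:
    Z = linkage(X, method=method)                     # hierarchical merge tree
    labels = fcluster(Z, t=3, criterion="maxclust")   # cut into 3 clusters
    print(method, labels)
```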

Slide 30

Examples of clustering methods: distance between clusters

[Diagram of linkage rules:
• Single (nearest neighbor): leads to “cluster chains”
• Complete (farthest neighbor): leads to small, compact clusters
• Average (mean) linkage
• Distance between centroids]

Slide 31

Comparison of the Three Methods

• Single-link- Elongated clusters - Individual decision, sensitive to outliers

• Complete-link- Compact clusters - Individual decision, sensitive to outliers

• Average-link or centroid- “In between” - Group decision, insensitive to outliers

• Which one is the best? Depends on what you need!

Adapted from internet

Slide 32

Clustering algorithms

• Clustering algorithms come in two basic flavors:

Partitioning Hierarchical

Slide 33

Partitioning methods

• Partition the data into a pre-specified number k of mutually exclusive and exhaustive groups.

• Iteratively reallocate the observations to clusters until some criterion is met, e.g. minimize within cluster sums of squares. Ideally, dissimilarity between clusters will be maximized while it is minimized within clusters.

• Examples:
 - k-means, self-organizing maps (SOM), PAM, etc.
 - Fuzzy clustering (each object is assigned a probability of being in a cluster): needs a stochastic model, e.g. Gaussian mixtures.

Slide 34

Partitioning methods

[Figure: example partition of the data with K = 2.]

Slide 35

Partitioning methods

[Figure: example partition of the same data with K = 4.]

Slide 36

Example of a partitioning algorithm: k-means or PAM (Partitioning Around Medoids)

1. Given a similarity function
2. Start with k randomly selected data points
3. Assume they are the centroids (medoids) of k clusters
4. Assign every data point to the cluster whose centroid (medoid) is closest to it
5. Recompute the centroid (medoid) of each cluster
6. Repeat this process until the similarity-based objective function converges

A minimal sketch follows below.
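A minimal numpy sketch of these steps for k-means with Euclidean distance (the two-cluster data are simulated; a production implementation would also guard against empty clusters and use multiple restarts):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means following the steps above."""
    rng = np.random.default_rng(seed)
    # Steps 2-3: k randomly selected data points act as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 4: assign every point to the closest centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step 5: recompute each centroid as the mean of its cluster.
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 6: stop once the centroids no longer move.
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])
labels, centroids = kmeans(X, k=2)
print(centroids.round(2))
```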

Slide 37

Mixture Model for Clustering

[Figure: three component densities P(X|Cluster1), P(X|Cluster2), P(X|Cluster3) and their mixture.]

$$ P(X) = \pi_1 P(X \mid \mathrm{Cluster}_1) + \pi_2 P(X \mid \mathrm{Cluster}_2) + \pi_3 P(X \mid \mathrm{Cluster}_3) $$

$$ X \mid \mathrm{Cluster}_i \sim N(\mu_i, \sigma_i^2 I), \qquad \pi_i \text{ is a cluster prior} $$

Adapted from internet

Slide 38

Mixture Model Estimation

• Likelihood function (generally Gaussian):

$$ p(x) = \sum_{i=1}^{k} \pi_i \, \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\!\left( -\frac{(x - \mu_i)^2}{2\sigma_i^2} \right) $$

• Parameters: $\pi_i$, $\mu_i$, $\sigma_i$
• Estimated using the EM algorithm; similar to a “soft” k-means
• The number of clusters can be determined using a model-selection criterion, e.g. BIC (Raftery and Fraley, 1998)

Adapted from internet
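A sketch of the EM-plus-BIC recipe with scikit-learn's GaussianMixture (the 1-D data are simulated from two components; bic and predict_proba are standard methods of that class):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2, 0.5, 150),
                    rng.normal(3, 1.0, 150)]).reshape(-1, 1)

# Fit mixtures by EM for a range of k; pick the k minimizing BIC.
bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in range(1, 6)}
best_k = min(bics, key=bics.get)
print(bics, "-> chosen k =", best_k)

# "Soft" assignments: posterior cluster probabilities per observation.
gm = GaussianMixture(n_components=best_k, random_state=0).fit(X)
print(gm.predict_proba(X[:5]).round(3))
```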

Slide 39

Hierarchical methods

• Hierarchical clustering methods produce a tree or dendrogram.

• They avoid specifying how many clusters are appropriate by providing a partition for each k obtained from cutting the tree at some level.

• The tree can be built in two distinct ways:
 - bottom-up: agglomerative clustering (usually used)
 - top-down: divisive clustering

Slide 40

Agglomerative Methods

• Start with n mRNA sample (or G gene) clusters

• At each step, merge the two closest clusters using a measure of between-cluster dissimilarity which reflects the shape of the clusters

The distance between clusters is defined by the method used (e.g., with complete linkage it is the distance between the farthest pair of points in the two clusters).

Supplementary slide

Slide 41

Divisive Methods

• Start with only one cluster

• At each step, split clusters into two parts

• Advantage: Obtain the main structure of the data (i.e. focus on upper levels of dendrogram)

• Disadvantage: Computational difficulties when considering all possible divisions into two groups

Divisive methods are rarely utilized in microarray data analysis.

Supplementary slide

Slide 42

Agglomerative clustering

[Figure: five points (1-5) in two-dimensional space and the corresponding agglomerative dendrogram, leaves ordered 1 5 2 3 4: merges (1,5), (3,4), (1,2,5), then (1,2,3,4,5).]

Slide 43

Agglomerative clustering: tree re-ordering?

[Figure: the same five points and dendrogram as on the previous slide, shown with an alternative ordering of the leaves; the leaf order of a dendrogram is not unique.]

Slide 44

Partitioning vs. hierarchical

Partitioning:
Advantages
• Optimal for certain criteria.
• Objects are automatically assigned to clusters.
Disadvantages
• Need the initial k.
• Often require long computation times.
• All objects are forced into a cluster.

Hierarchical:
Advantages
• Faster computation.
• Visual.
Disadvantages
• Unrelated objects are eventually joined.
• Rigid: cannot correct later for erroneous decisions made earlier.
• Hard to define clusters - you still need to know “where to cut”.

Note that hierarchical clustering results may be used as starting points for partitioning or model-based algorithms.

Slide 45

Clustering microarray data

• Clustering leads to readily interpretable figures and can be helpful for identifying patterns in time or space.

Examples:
• We can cluster cell samples (columns), e.g. to identify new / unknown tumor classes or cell subtypes using gene expression profiles.
• We can cluster genes (rows), e.g. using large numbers of yeast experiments, to identify groups of co-regulated genes.
• We can cluster genes (rows) to reduce redundancy (cf. variable selection) in predictive models.

Slide 46

Estimating number of clusters using silhouette (see PAM)

Define the silhouette width of an observation as:

S = (b - a) / max(a, b)

where a is the observation's average dissimilarity to all points in its own cluster, and b is the smallest of its average dissimilarities to the other clusters.

Intuitively, objects with large S are well clustered, while those with small S tend to lie between clusters.

How many clusters: perform clustering for a sequence of numbers of clusters k and choose the k giving the largest average silhouette (see the sketch below).

The issue of the number of clusters in the data is most relevant for novel class discovery, i.e. for clustering samples.
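A sketch of silhouette-based selection of the number of clusters (scikit-learn has no PAM, so k-means stands in for the partitioning step; the data are simulated with three groups):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 5)),
               rng.normal(4, 1, (30, 5)),
               rng.normal(-4, 1, (30, 5))])

# Average silhouette width for each candidate k; choose the largest.
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```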

Slide 47

Estimating Number of Clusters with Silhouette (ctd)

Compute the average silhouette for k = 3 and compare it with the results for other values of k. [Figure: silhouette plot.]

Slide 48

Estimating number of clusters using a reference distribution

Idea: define a goodness-of-clustering score to minimize, e.g. the pooled within-cluster sum of squares (WSS) around the cluster means, reflecting the compactness of the clusters:

$$ W_k = \sum_{r=1}^{k} \frac{1}{2 n_r} D_r $$

where $n_r$ and $D_r$ are the number of points in cluster r and the sum of all pairwise distances within cluster r, respectively.

The gap statistic for k clusters is then defined as:

$$ \mathrm{Gap}_n(k) = E_n^{*}[\log(W_k)] - \log(W_k) $$

where $E_n^{*}$ denotes the expectation under a sample of the same size from a reference distribution. The reference distribution can be generated either parametrically (e.g. from a multivariate uniform distribution) or non-parametrically (e.g. by sampling from the marginal distributions of the variables). The first local maximum is chosen as the number of clusters (with a slightly more complicated rule in practice) (Tibshirani et al, 2001).

Adapted from internet

Slide 49

Estimating number of clusters

There are other resampling-based rules (e.g. Dudoit and Fridlyand, 2002) and non-resampling-based rules for estimating the number of clusters (for a review see Milligan and Cooper (1985) and Dudoit and Fridlyand (2002)).

The bottom line is that none work very well in complicated situations and, to a large extent, clustering lies outside the usual statistical framework.

It is always reassuring when you are able to characterize newly discovered clusters using information that was not used for the clustering.

Slide 50

Confidence in individual cluster assignments

We want to assign to each observation a confidence of belonging to its assigned cluster.

• Model-based clustering: natural probability interpretation
• Partitioning methods: silhouette
• Dudoit and Fridlyand (2003) present a resampling-based approach that assigns confidence as the proportion of resampling runs in which an observation ends up in its assigned cluster.

Slide 51

Tight clustering (genes)

Identifies small, stable gene clusters without attempting to cluster all the genes. It therefore does not require estimating the number of clusters or assigning every point to a cluster, which aids the interpretability and validity of the results. (Tseng et al, 2003)

Algorithm:

For sequence of k > k0:

1. Identify the sets of genes that are consistently grouped together when the genes are repeatedly sub-sampled. Order those sets by size and consider the q largest sets for each k.

2. Stop when for (k, (k+1)), the two sets are nearly identical. Take the set corresponding to (k+1). Remove that set from the dataset.

3. Set k0 = k0 -1 and repeat the procedure.

Slide 52

Two-way clustering of genes and samples.

These methods use samples and genes simultaneously to extract information. They are not yet well developed.

One example is Block Clustering (Hartigan, 1972), which repeatedly rearranges rows and columns to obtain the largest reduction of the total within-block variance.

Another method is based on Plaid Models (Lazzeroni and Owen, 2002).

Friedman and Meulman (2002) present an algorithm that clusters samples based on subsets of attributes, i.e. each group of samples can be characterized by a different gene set.

Slide 53

Applications of clustering to microarray data

Alizadeh et al (2000), Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling.

• Three subtypes of lymphoma (FL, CLL and DLBCL) have different genetic signatures (81 cases total).
• The DLBCL group can be partitioned into two subgroups with significantly different survival (39 DLBCL cases).

Slide 54

Clustering both cell samples and genes

[Figure taken from Alizadeh et al, Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling, Nature, February 2000.]

Slide 55

Clustering cell samples: discovering sub-groups

[Figure taken from Alizadeh et al (Nature, 2000).]

Slide 56

Attempt at validation of DLBCL subgroups

[Figure taken from Alizadeh et al (Nature, 2000).]

Slide 57

Clustering genes: finding different patterns in the data

Yeast cell cycle (Cho et al, 1998): 6 × 5 SOM with 828 genes.

[Figure taken from Tamayo et al (PNAS, 1999).]

Slide 58

Summary

Which clustering method should I use?
 - What is the biological question?
 - Do I have a preconceived notion of how many clusters there should be?
 - Hard or soft boundaries between clusters?

Keep in mind:
 - Clustering cannot NOT work: every clustering method will return clusters.
 - Clustering helps to group / order information and is a visualization tool for learning about the data. However, clustering results do not provide biological “proof”.
 - Clustering is generally used as an exploratory and hypothesis-generating tool.

Slide 59

Discrimination

Slide 60

Basic principles of discrimination

• Each object is associated with a class label (or response) $Y \in \{1, 2, \dots, K\}$ and a feature vector (vector of predictor variables) of G measurements: $X = (X_1, \dots, X_G)$

• Aim: predict Y from X.

[Diagram: objects with predefined classes {1, 2, …, K}. Example: X = feature vector {colour, shape}; for X = {red, square}, the classification rule returns Y = class label = 2.]

Slide 61

Discrimination and Allocation

[Diagram: a learning set (data with known classes) is fed to a classification technique, producing a classification rule (discrimination); the rule is then applied to data with unknown classes to yield class assignments (prediction).]

Slide 62

[Diagram: breast cancer prognosis example]
• Objects: arrays
• Feature vectors: gene expression
• Predefined classes: clinical outcome - bad prognosis (recurrence < 5 yrs) vs good prognosis (recurrence > 5 yrs)
• A learning set of labeled arrays yields a classification rule; a new array is then classified, e.g. as good prognosis (metastasis > 5 yrs).

Reference: L. van 't Veer et al (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, Jan 2002.

Slide 63

[Diagram: leukemia subtype example]
• Objects: arrays
• Feature vectors: gene expression
• Predefined classes: tumor type (B-ALL, T-ALL, AML)
• A learning set of labeled arrays yields a classification rule; a new array is then classified, e.g. as T-ALL.

Reference: Golub et al (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286(5439): 531-537.

Slide 64

Classification Rule

Components:
• classification procedure
• feature selection
• parameters [pre-determined or estimable]
• distance measure
• aggregation method

Performance assessment, e.g. cross-validation.

• One can think of the classification rule as a black box; some methods provide more insight into the box than others.
• Performance assessment needs to be carried out for every classification rule.

Slide 65

Classification rule: maximum likelihood discriminant rule

• A maximum likelihood estimator (MLE) chooses the parameter value that makes the chance of the observations the highest.

• For known class-conditional densities $p_k(X)$, the maximum likelihood (ML) discriminant rule predicts the class of an observation X by

$$ C(X) = \arg\max_k \, p_k(X) $$

Slide 66

Gaussian ML discriminant rules

• For multivariate Gaussian (normal) class densities $X \mid Y = k \sim N(\mu_k, \Sigma_k)$, the ML classifier is

$$ C(x) = \arg\min_k \left\{ (x - \mu_k)\, \Sigma_k^{-1} (x - \mu_k)' + \log |\Sigma_k| \right\} $$

• In general, this is a quadratic rule (quadratic discriminant analysis, or QDA)

• In practice, the population mean vectors $\mu_k$ and covariance matrices $\Sigma_k$ are estimated by the corresponding sample quantities

Slide 67

ML discriminant rules - special cases

[DLDA] Diagonal linear discriminant analysis: class densities share the same diagonal covariance matrix $\Sigma = \mathrm{diag}(s_1^2, \dots, s_p^2)$.

[DQDA] Diagonal quadratic discriminant analysis: class densities have different diagonal covariance matrices $\Sigma_k = \mathrm{diag}(s_{1k}^2, \dots, s_{pk}^2)$.

Note. The weighted gene voting of Golub et al. (1999) is a minor variant of DLDA for two classes (with a different variance calculation).
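A minimal hand-rolled DLDA sketch (assumed layout: samples in rows, genes in columns; per-class means plus one pooled diagonal variance, classifying by the smallest standardized squared distance; all data simulated):

```python
import numpy as np

def dlda_fit(X, y):
    """Per-class means and pooled per-gene (diagonal) variance."""
    classes = np.unique(y)
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    resid = np.concatenate([X[y == k] - means[i]
                            for i, k in enumerate(classes)])
    var = (resid ** 2).sum(axis=0) / (len(X) - len(classes))
    return classes, means, var

def dlda_predict(model, X):
    classes, means, var = model
    # Discriminant score: sum over genes of (x_g - mean_kg)^2 / s_g^2.
    d = ((X[:, None, :] - means[None, :, :]) ** 2 / var).sum(axis=2)
    return classes[d.argmin(axis=1)]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 50)), rng.normal(1, 1, (20, 50))])
y = np.array([0] * 20 + [1] * 20)
model = dlda_fit(X, y)
print((dlda_predict(model, X) == y).mean())   # resubstitution accuracy
```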

Slide 68

The Logistic Regression Model

2-class case:

$$ \log\left[\frac{p}{1-p}\right] = \alpha + \beta^t X $$

where p is the probability that the event Y occurs given the observed gene expression pattern, $p = P(Y = 1 \mid X)$.

p/(1-p) is the "odds"; log[p/(1-p)] is the log odds, or "logit".

This generalizes easily to multiclass outcomes and to dependences more general than linear. Logistic regression also makes fewer assumptions on the marginal distribution of the variables; however, its results are generally very similar to those of LDA (Hastie et al, 2003). A small sketch follows below.
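A small sketch of the 2-class logistic model with scikit-learn (the generating logit, with intercept 0.5 and coefficients (2, -1, 0), is invented for the example):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Simulate outcomes from a known logit: log-odds = 0.5 + 2*x0 - x1.
p = 1 / (1 + np.exp(-(0.5 + 2 * X[:, 0] - X[:, 1])))
y = rng.binomial(1, p)

clf = LogisticRegression().fit(X, y)
print("intercept (alpha):", clf.intercept_.round(2))
print("coefficients (beta):", clf.coef_.round(2))
print("P(Y=1 | X) for 3 cases:", clf.predict_proba(X[:3])[:, 1].round(2))
```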

Slide 69

Classification with SVMs

Generalization of the idea of separating hyperplanes in the original space: linear boundaries between classes in a higher-dimensional feature space lead to non-linear boundaries in the original space.

Adapted from internet

Slide 70

Nearest neighbor classification

• Based on a measure of distance between observations (e.g. Euclidean distance or one minus correlation).

• The k-nearest neighbor rule (Fix and Hodges, 1951) classifies an observation X as follows:
 - find the k observations in the learning set closest to X
 - predict the class of X by majority vote, i.e., choose the class that is most common among those k observations.

• The number of neighbors k can be chosen by cross-validation (more on this later); see the sketch below.
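A sketch of choosing k by cross-validation with scikit-learn's GridSearchCV (simulated two-class data; the grid of odd k values is an arbitrary choice):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (40, 10)), rng.normal(1.5, 1, (40, 10))])
y = np.array([0] * 40 + [1] * 40)

# 5-fold cross-validation over the candidate numbers of neighbors.
search = GridSearchCV(KNeighborsClassifier(),
                      {"n_neighbors": [1, 3, 5, 7, 9]}, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```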

Slide 71

Nearest neighbor rule

Slide 72

Classification tree

• Partition the feature space into a set of rectangles, then fit a simple model in each one

• Binary tree structured classifiers are constructed by repeated splits of subsets (nodes) of the measurement space X into two descendant subsets (starting with X itself)

• Each terminal subset is assigned a class label; the resulting partition of X corresponds to the classifier

Slide 73

Classification tree

[Figure: a classification tree splitting first on Gene 1 (M_i1 < -0.67) and then on Gene 2 (M_i2 > 0.18) to assign classes 0, 1 and 2, shown next to the corresponding rectangular partition of the (Gene 1, Gene 2) feature space at the cut-points -0.67 and 0.18.]

Slide 74

Three aspects of tree construction

• Split selection rule:
 - Example: at each node, choose the split maximizing the decrease in impurity (e.g. Gini index, entropy, misclassification error).

• Split-stopping:
 - Example: grow a large tree, prune to obtain a sequence of subtrees, then use cross-validation to identify the subtree with the lowest misclassification rate (see the sketch below).

• Class assignment:
 - Example: for each terminal node, choose the class minimizing the resubstitution estimate of misclassification probability, given that a case falls into this node.

Supplementary slide
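A sketch of the grow-then-prune recipe using scikit-learn's cost-complexity pruning path, with cross-validation picking the pruning level (the three-class toy labels mimic the two-split tree shown earlier):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
# Labels follow two splits: Gene 1 at -0.67, then Gene 2 at 0.18.
y = (X[:, 0] < -0.67).astype(int) \
    + 2 * ((X[:, 0] >= -0.67) & (X[:, 1] > 0.18))

# Grow a large tree, get the sequence of pruned subtrees (one per alpha),
# and keep the alpha with the best cross-validated accuracy.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
scores = [(a, cross_val_score(
               DecisionTreeClassifier(ccp_alpha=a, random_state=0),
               X, y, cv=5).mean()) for a in path.ccp_alphas]
best_alpha, best_score = max(scores, key=lambda t: t[1])
print(round(best_alpha, 4), round(best_score, 3))
```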

Slide 75

Another component of a classification rule: aggregating classifiers

[Diagram: the training set X1, X2, … X100 is resampled 500 times; a classifier is built on each resample (Classifier 1 … Classifier 500), and the classifiers are combined into one aggregate classifier.]

Examples: bagging, boosting, random forest.

Slide 76

Aggregating classifiers: bagging

[Diagram: the training set (arrays) X1, X2, … X100 is bootstrap-resampled 500 times (X*1, X*2, … X*100); a tree is grown on each resample. A test sample is dropped down every tree and the trees vote, e.g. Tree 1 → class 1, Tree 2 → class 2, …, yielding 90% class 1 and 10% class 2.]
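A sketch of bagged trees with scikit-learn (simulated data; 500 bootstrap trees as in the diagram, with predict_proba exposing the vote shares):

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 20)), rng.normal(0.8, 1, (50, 20))])
y = np.array([1] * 50 + [2] * 50)

# 500 trees, each grown on a bootstrap resample; the trees then vote.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,
                        random_state=0).fit(X, y)
test = rng.normal(0.8, 1, (1, 20))
print(bag.predict(test))                  # majority-vote class
print(bag.predict_proba(test).round(2))   # share of votes per class
```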

Slide 77

Other classifiers include…

• Neural networks

• Projection pursuit

• Bayesian belief networks

• …

Slide 78

Why select features

• Lead to better classification performance by removing variables that are noise with respect to the outcome

• May provide useful insights into etiology of a disease

• Can eventually lead to diagnostic tests (e.g., a “breast cancer chip”)

Slide 79

Why select features?

[Figure: correlation plots for the 3-class leukemia data (color scale -1 to +1): no feature selection vs top-100 feature selection vs selection based on variance.]

Slide 80

Approaches to feature selection

• Methods fall into three basic categories:
 - Filter methods
 - Wrapper methods
 - Embedded methods

• The simplest and most frequently used methods are the filter methods.

Adapted from A. Hartemink

Slide 81

Filter methods

[Diagram: R^p → feature selection → R^s (s << p) → classifier design]

• Features are scored independently and the top s are used by the classifier.
• Scores: correlation, mutual information, t-statistic, F-statistic, p-value, tree importance statistic, etc.

Easy to interpret; can provide some insight into the disease markers. A sketch follows below.

Adapted from A. Hartemink
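A sketch of a filter method with scikit-learn's SelectKBest: every gene is scored independently with an F-statistic and the top s = 25 are kept (the data are simulated, with signal planted in the first 10 genes):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 1000))   # 60 samples, 1000 genes
y = np.array([0] * 30 + [1] * 30)
X[y == 1, :10] += 1.5             # only the first 10 genes carry signal

selector = SelectKBest(f_classif, k=25).fit(X, y)
X_small = selector.transform(X)   # reduced matrix handed to the classifier
print(X_small.shape, selector.get_support(indices=True)[:10])
```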

Slide 82

Problems with filter methods

• Redundancy in the selected features: features are considered independently and are not scored on whether they contribute new information.

• Interactions among features generally cannot be incorporated explicitly (though some filter methods are smarter than others).

• The classifier has no say in which features are used: some scores may be more appropriate in conjunction with some classifiers than with others.

Supplementary slide

Adapted from A. Hartemink

Slide 83

Dimension reduction: a variant on a filter method

• Rather than retain a subset of s features, perform dimension reduction by projecting features onto s principal components of variation (e.g. PCA etc)

• Problem is that we are no longer dealing with one feature at a time but rather a linear or possibly more complicated combination of all features. It may be good enough for a black box but how does one build a diagnostic chip on a “supergene”? (even though we don’t want to confuse the tasks)

• Those methods tend not to work better than simple filter methods.

Supplementary slide

Adapted from A. Hartemink

Slide 84

Wrapper methods

[Diagram: R^p → feature selection → R^s (s << p) → classifier design, iterated]

• Iterative approach: many feature subsets are scored based on classification performance, and the best one is used.
• Selection of subsets: forward selection, backward selection, forward-backward selection, tree harvesting, etc.

Adapted from A. Hartemink

Slide 85

Problems with wrapper methods

• Computationally expensive: for each feature subset to be considered, a classifier must be built and evaluated

• No exhaustive search is possible ($2^p$ subsets to consider): generally greedy algorithms only.

• Easy to overfit.

Supplementary slide

Adapted from A. Hartemink

Slide 86

Embedded methods

• Attempt to jointly or simultaneously train both a classifier and a feature subset

• Often optimize an objective function that jointly rewards accuracy of classification and penalizes use of more features.

• Intuitively appealing

Some examples: tree-building algorithms, shrinkage methods (LDA, kNN)

Adapted from A. Hartemink

Slide 87

Performance assessment

• Any classification rule needs to be evaluated for its performance on future samples. It is almost never the case in microarray studies that a large, independent, population-based collection of samples is available at the initial classifier-building phase.

• One needs to estimate future performance based on what is available: often the same set that is used to build the classifier.

• Performance of the classifier can be assessed by:
 - cross-validation
 - a test set
 - independent testing on a future dataset

Slide 88

Diagram of performance assessment

[Diagram: the training set produces a classifier. Resubstitution estimation assesses the classifier on the training set itself; test set estimation assesses it on an independent test set.]

Slide 89

Performance assessment (II)

• V-fold cross-validation (CV) estimation: cases in the learning set are randomly divided into V subsets of (nearly) equal size. A classifier is built leaving one subset out; the error rate is computed on the left-out subset, and the V estimates are averaged.
 - Bias-variance tradeoff: smaller V can give larger bias but smaller variance
 - Computationally intensive

• Leave-one-out cross-validation (LOOCV): the special case V = n. Works well for stable classifiers (k-NN, LDA, SVM).

Supplementary slide

Slide 90

Performance assessment (I)

• Resubstitution estimation: error rate on the learning set.
 - Problem: downward bias

• Test set estimation:
 1) Divide the learning set into two subsets, L and T; build the classifier on L and compute the error rate on T.
 2) Build the classifier on the training set (L) and compute the error rate on an independent test set (T).
 - L and T must be independent and identically distributed (i.i.d.)
 - Problem: reduced effective sample size

Supplementary slide

Slide 91

Diagram of performance assessment

[Diagram: as before, the training set produces a classifier assessed by resubstitution or on an independent test set; cross-validation adds a third route, repeatedly splitting the training set into a (CV) learning set, which builds the classifier, and a (CV) test set, on which performance is assessed.]

Slide 92

Performance assessment (III)

• It is common practice to do feature selection on the full learning set, and then cross-validate only the model building and classification.

• However, the relevant features are usually unknown in advance, and the intended inference includes feature selection. CV estimates computed as above are then downward biased.

• Features (variables) should be selected only from the learning set used to build the model (and not from the entire set), as the sketch below illustrates.
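A sketch of the leaky vs. honest protocols on pure-noise data, using a scikit-learn Pipeline so that gene selection is re-done inside every CV fold (all numbers are simulated):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2000))   # pure noise: no true class signal
y = np.array([0] * 25 + [1] * 25)

# Leaky protocol: genes selected once on the FULL data, CV afterwards.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(KNeighborsClassifier(), X_leaky, y, cv=5).mean()

# Honest protocol: selection re-done inside each CV training fold.
pipe = Pipeline([("select", SelectKBest(f_classif, k=20)),
                 ("knn", KNeighborsClassifier())])
honest = cross_val_score(pipe, X, y, cv=5).mean()

print("leaky CV accuracy:", round(leaky, 2))    # typically well above 0.5
print("honest CV accuracy:", round(honest, 2))  # should hover near 0.5
```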

Slide 93

Comparison study

• Leukemia data - Golub et al. (1999)
 - n = 72 samples
 - G = 3,571 genes
 - 3 classes (B-cell ALL, T-cell ALL, AML)

• Reference: S. Dudoit, J. Fridlyand, and T. P. Speed (2002). Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association, Vol. 97, No. 457, p. 77-87.

Slide 94

Leukemia data, 3 classes: test set error rates over 150 LS/TS runs. [Figure.]

Slide 95

Results

• In the main comparison, NN and DLDA had the smallest error rates.

• Aggregation improved the performance of CART classifiers.

• For the leukemia datasets, increasing the number of genes to G=200 didn't greatly affect the performance of the various classifiers.

Slide 96

Comparison study – discussion (I)

• “Diagonal” LDA: ignoring correlation between genes helped here. Unlike classification trees and nearest neighbors, DLDA is unable to take into account gene interactions.

• Classification trees are capable of handling and revealing interactions between variables. In addition, they have useful by-product of aggregated classifiers: prediction votes, variable importance statistics.

• Although nearest neighbors are simple and intuitive classifiers, their main limitation is that they give very little insight into mechanisms underlying the class distinctions.

Slide 97

Summary (I)

• Bias-variance trade-off. Simple classifiers do well on small datasets. As the number of samples increases, we expect classifiers capable of considering higher-order interactions (and aggregated classifiers) to gain an edge.

• Cross-validation. It is of utmost importance to cross-validate for every parameter that has been chosen based on the data, including meta-parameters:
 - what and how many features
 - how many neighbors
 - pooled or unpooled variance
 - the classifier itself.

If this is not done, it is possible to wrongly declare discrimination power where there is none.

Slide 98

Summary (II)

• Generalization error rate estimation. It is necessary to keep the sampling scheme in mind.

• Thousands and thousands of independent samples from a variety of sources are needed to address the true performance of a classifier.

• We are not at that point yet with microarray studies. The van 't Veer et al (2002) signature, through its van de Vijver et al (2002) follow-up, is probably the only one to date evaluated on ~300 test samples.

Slide 99

Some performance assessment quantities

Assume a 2-class problem:
 class 1 = no event ~ null hypothesis, e.g. no recurrence
 class 2 = event ~ alternative hypothesis, e.g. recurrence

All quantities are estimated on the available dataset (test set if available).

• Misclassification error rate: proportion of misclassified samples
• Lift: proportion of correct class 2 predictions divided by the proportion of class 2 cases,
 P(class 2 is true | class 2 is detected) / P(class is 2)
• Odds ratio: measure of association between the true and predicted labels

Slide 100

Some performance assessment quantities (ctd)

• Sensitivity: proportion of correct class 2 predictions,
 P(detect class 2 | class 2 is true) ~ power

• Specificity: proportion of correct class 1 predictions,
 P(declare class 1 | class 1 is true) = 1 - P(detect class 2 | class 1 is true) ~ 1 - type I error

Slide 101

Some performance assessment quantities (ctd)

• Positive Predictive Value (PPV): the proportion of true class 2 cases among the predicted class 2 cases (should be applicable to the population):

P(class 2 is true | class 2 is detected)
 = P(detect class 2 | class 2 is true) × P(class 2 is true) / P(detect class 2)
 = sensitivity × P(class is 2) / [sensitivity × P(class is 2) + (1 - specificity) × (1 - P(class is 2))]

Note that PPV is the only quantity that explicitly incorporates the population proportions, i.e. the prevalence of class 2 in the population of interest, P(class is 2), as well as sensitivity and specificity.

If the prevalence is low, the specificity of the test has to be very high for it to be clinically useful; the sketch below gives a worked example.
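A worked example of that point (the sensitivity, specificity and prevalence values are assumed for illustration):

```python
# PPV under assumed test characteristics and a rare class 2 (prevalence 1%).
sensitivity = 0.95   # P(detect class 2 | class 2 is true)
specificity = 0.95   # P(declare class 1 | class 1 is true)
prevalence = 0.01    # P(class is 2) in the population

ppv = (sensitivity * prevalence) / (
    sensitivity * prevalence + (1 - specificity) * (1 - prevalence))
print(round(ppv, 3))  # ~0.161: most positive calls are false positives
```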

Slide 102

Case studies

Reference 1 (retrospective study): L. van 't Veer et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature, Jan 2002.
 - A learning set with bad / good outcomes yields the classification rule.
 - Feature selection: correlation with the class labels, very similar to a t-test; cross-validation used to select 70 genes.

Reference 2 (cohort study): M. van de Vijver et al. A gene-expression signature as a predictor of survival in breast cancer. The New England Journal of Medicine, Dec 2002.
 - 295 samples selected from the Netherlands Cancer Institute tissue bank (1984-1995).
 - Results: the gene expression profile is a more powerful predictor than standard systems based on clinical and histologic criteria.

Reference 3 (prospective trials, Aug 2003): clinical trials, http://www.agendia.com/
 - Agendia (formed by researchers from the Netherlands Cancer Institute) started in Oct 2003: 1) 5,000 subjects [Health Council of the Netherlands]; 2) 5,000 subjects, New York-based Avon Foundation.
 - Custom arrays are made by Agilent, including the 70 genes + 1,000 controls.

Slide 103

Van 't Veer breast cancer study (Nature, 2002)

Investigate whether a tumor's ability to metastasize is acquired late in development or is inherent in the initial gene expression signature.

• Retrospective sampling of node-negative women: 44 non-recurrences within 5 years of surgery and 34 recurrences. Additionally, 19 test samples (12 recurrences and 7 non-recurrences).

• Want to demonstrate that the gene expression profile is significantly associated with recurrence, independently of the other clinical variables.

Slide 104

Predictor development

• Identify a set of genes with correlation > 0.3 with the binary outcome. Show that there is significant enrichment for such genes in the dataset.

• Rank-order the genes on the basis of their correlation.

• Optimize the number of genes in the classifier using leave-one-out cross-validation (CV-1).

Classification is made on the basis of the correlations of the expression profile of the left-out sample with the mean expression of the remaining samples from the good- and bad-prognosis patients, respectively.

N.B.: The correct way to select genes is within rather than outside cross-validation, resulting in a different set of markers for each CV iteration.

N.B.: Optimizing the number of variables and other parameters should be done via two-level (nested) cross-validation if results are to be assessed on the training set.

The classification indicator is included in the logistic model along with the other clinical variables. It is shown that the gene expression profile has the strongest effect. Note that some of this may be due to overfitting of the threshold parameter.
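As a concrete illustration of the first N.B., here is a minimal Python sketch (assumed data shapes, not the authors' code) of a correlation-based centroid classifier where gene selection happens inside each leave-one-out fold; the matrix X, labels y, and the choice of 70 genes are assumptions for the example.

```python
# Minimal sketch of a correlation-based centroid classifier with gene
# selection done INSIDE each leave-one-out fold, as prescribed above.
import numpy as np

def corr_with_outcome(X, y):
    # Pearson correlation of each gene (column of X) with the labels y.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return (Xc.T @ yc) / np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())

def loocv_centroid(X, y, n_genes=70):
    # X: (n_samples, n_genes) expression matrix; y: 0/1 outcome labels.
    n = len(y)
    preds = np.empty(n, dtype=int)
    for i in range(n):
        train = np.arange(n) != i
        # Select markers on the training fold only, so the gene set
        # can differ from one CV iteration to the next.
        r = corr_with_outcome(X[train], y[train])
        top = np.argsort(-np.abs(r))[:n_genes]
        # Mean profiles of the good (0) and bad (1) prognosis groups.
        c0 = X[train][y[train] == 0][:, top].mean(axis=0)
        c1 = X[train][y[train] == 1][:, top].mean(axis=0)
        # Assign the left-out sample to the class whose centroid its
        # profile correlates with most strongly.
        r0 = np.corrcoef(X[i, top], c0)[0, 1]
        r1 = np.corrcoef(X[i, top], c1)[0, 1]
        preds[i] = int(r1 > r0)
    return preds
```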

Page 105: Microarray Workshop 1 Introduction to Classification Issues in Microarray Data Analysis Jane Fridlyand Jean Yee Hwa Yang University of California, San.

105Microarray Workshop

Van ‘t Veer, et al., 2002

Page 106: Microarray Workshop 1 Introduction to Classification Issues in Microarray Data Analysis Jane Fridlyand Jean Yee Hwa Yang University of California, San.

106Microarray Workshop

van de Vijver's breast data (NEJM, 2002)

• 295 additional breast cancer patients, a mix of node-negative and node-positive samples.

• Want to use the previously developed predictor to identify patients at risk for metastasis.

• The predicted class was significantly associated with time to recurrence in the multivariate Cox proportional-hazards model.

Page 107: Microarray Workshop 1 Introduction to Classification Issues in Microarray Data Analysis Jane Fridlyand Jean Yee Hwa Yang University of California, San.

107Microarray Workshop

Page 108: Microarray Workshop 1 Introduction to Classification Issues in Microarray Data Analysis Jane Fridlyand Jean Yee Hwa Yang University of California, San.

108Microarray Workshop

Some examples of wrong answers and questions in microarray data analysis

Page 109: Microarray Workshop 1 Introduction to Classification Issues in Microarray Data Analysis Jane Fridlyand Jean Yee Hwa Yang University of California, San.

109Microarray Workshop

Life Cycle

Flow chart (as earlier in the deck): Biological question → Experimental design → Microarray experiment → Image analysis → Quality measurement (Failed: redo the microarray experiment; Pass: continue) → Normalization → Analysis (Estimation, Testing, Clustering, Discrimination) → Biological verification and interpretation.

Page 110: Microarray Workshop 1 Introduction to Classification Issues in Microarray Data Analysis Jane Fridlyand Jean Yee Hwa Yang University of California, San.

110Microarray Workshop

Prediction I: estimating misclassification error

Performance of the classifiers on future samples needs to be assessed while taking population proportions into account.

Question: Build a classifier to predict a rare (1/100) subclass of cancer and estimate its misclassification rate in the population.

Design: Retrospectively collect equal numbers of the rare and common subtypes and build a classifier. Estimate its future performance using cross-validation on the collected set.

Issues: The population proportions of the two types differ from the proportions in the study. For instance, if 0/50 of the rare subtype and 10/50 of the common subtype were misclassified (10/100 overall), then in the population we expect to observe 1 rare instance and 99 common ones per 100 samples, and will misclassify approximately 20/100 of them (0.01 x 0% + 0.99 x 20% ≈ 20%).

Conclusion: If a dataset is not representative of the population distributions, one needs to think hard about how to do the "translation" (e.g., Positive Predictive Value on future samples vs specificity and sensitivity on the current ones).
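A quick check of the arithmetic in this example; the per-class error rates are the ones assumed on the slide.

```python
# Re-weight per-class CV error rates by population prevalence instead of
# the balanced study proportions (0/50 rare, 10/50 common misclassified).
err_rare, err_common = 0 / 50, 10 / 50
study_error = 0.5 * err_rare + 0.5 * err_common         # 0.10 on the study set
population_error = 0.01 * err_rare + 0.99 * err_common  # ~0.198 in the population
print(study_error, population_error)
```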

Page 111: Microarray Workshop 1 Introduction to Classification Issues in Microarray Data Analysis Jane Fridlyand Jean Yee Hwa Yang University of California, San.

111Microarray Workshop

Prediction II: Prevalence vs PPV (ctd)

PPV (%) by specificity and prevalence, assuming a constant sensitivity of 100%:

Specificity   50%    43%    10%    1%    0.1%   One per 2500
90%           91     88     53     9     1      0.4
95%           95     94*    69     17    2      0.8**
99%           99     99     92     50    9      4
99.9%         99.9   99.9   99     91    50     29

*PPV reported by Petricoin et al (2002).

**Correct PPV assuming the prevalence of ovarian cancer in the general population is 1/2500.

Note that discovering discriminatory power is not the same as demonstrating the clinical utility of the classifier.

Adapted from the comment in The Lancet by Rockhill.
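The table rows can be regenerated from the same Bayes formula as before; a brief self-contained sketch:

```python
# Regenerate the PPV table above (sensitivity fixed at 100%).
prevalences = [0.50, 0.43, 0.10, 0.01, 0.001, 1 / 2500]
for spec in [0.90, 0.95, 0.99, 0.999]:
    row = [100 * p / (p + (1 - spec) * (1 - p)) for p in prevalences]
    print(f"{spec:6.1%}", "  ".join(f"{v:5.1f}" for v in row))
```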

Page 112: Microarray Workshop 1 Introduction to Classification Issues in Microarray Data Analysis Jane Fridlyand Jean Yee Hwa Yang University of California, San.

112Microarray Workshop

Experimental design

Proper randomization is essential in experimental design.

Question: Build a predictor to diagnose ovarian cancer.

Design: Tissue from normal women and ovarian cancer patients arrives at different times.

Issues: Complete confounding between tissue type and time of processing.

This phenomenon is very common in the absence of a carefully thought-through design.

Post-mortem diagnosis: lack of randomization.
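One simple way to avoid this confounding, sketched below with sample counts and number of processing days assumed for illustration, is to assign samples to processing batches at random rather than by arrival time.

```python
# Sketch: randomize processing order so tissue type is not confounded
# with processing day (20+20 samples and 4 days are assumed numbers).
import numpy as np

rng = np.random.default_rng(0)
labels = np.array(["normal"] * 20 + ["cancer"] * 20)
days = rng.permutation(np.repeat(np.arange(4), 10))  # random day assignment
for d in range(4):
    batch = labels[days == d]
    print(f"day {d}: {(batch == 'normal').sum()} normal, "
          f"{(batch == 'cancer').sum()} cancer")
```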

Page 113: Microarray Workshop 1 Introduction to Classification Issues in Microarray Data Analysis Jane Fridlyand Jean Yee Hwa Yang University of California, San.

113Microarray Workshop

Clustering I

The procedure should not bias results towards desired conclusions.

Question: Do expression data cluster according to survival status?

Design: Identify genes with a high t-statistic for the comparison of short and long survivors. Use these genes to cluster the samples. Get excited that the samples cluster according to survival status.

Issues: The genes were already selected based on survival status, so it would be surprising if the samples did *not* cluster according to their survival.

Conclusion: No conclusions about clustering structure are possible, as variable selection was driven by the class distinction.
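The trap is easy to reproduce on pure noise; a minimal sketch (sample and gene counts assumed) in which samples with arbitrary "survival" labels cluster by label once genes are pre-selected on those labels:

```python
# With pure noise and arbitrary labels, selecting "discriminating" genes
# first makes the samples cluster by label anyway.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 5000))          # 40 samples x 5000 noise "genes"
y = np.repeat([0, 1], 20)                # arbitrary "survival" labels

t, _ = ttest_ind(X[y == 0], X[y == 1])   # per-gene t-statistics
top = np.argsort(-np.abs(t))[:50]        # keep the 50 most "discriminating"

clusters = fcluster(linkage(X[:, top], method="average"), 2,
                    criterion="maxclust")
# Agreement between clusters and labels is far above chance despite pure noise.
agreement = max((clusters - 1 == y).mean(), (clusters - 1 != y).mean())
print(f"cluster/label agreement: {agreement:.2f}")
```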

Page 114: Microarray Workshop 1 Introduction to Classification Issues in Microarray Data Analysis Jane Fridlyand Jean Yee Hwa Yang University of California, San.

114Microarray Workshop

Clustering II

P-values for differential expression are only valid when the class labels are independent of the current dataset.

Question: Identify genes distinguishing among "interesting" subgroups.

Design: Cluster the samples into K groups. For each gene, compute an F-statistic and its associated p-value to test for differential expression among the subgroups.

Issues: The same data were used to create the groups and to test for differential expression, so the p-values are invalid.

Conclusion: None with respect to the differential-expression p-values. Nevertheless, it is possible to select genes with a high value of the statistic and test hypotheses about functional enrichment with, e.g., Gene Ontology. One can also cluster these genes and use the results to generate new hypotheses.
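A minimal sketch of the circularity, with K = 2 (where the F-test reduces to a squared t-test) and data sizes assumed for illustration:

```python
# Cluster pure noise into two groups, then test each gene between the
# resulting groups: the p-values are wildly anti-conservative even
# though no gene is truly differentially expressed.
import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 2000))                 # pure noise: no real DE genes
_, groups = kmeans2(X, 2, minit="++", seed=2)   # "interesting" subgroups

_, p = ttest_ind(X[groups == 0], X[groups == 1])
# Far more than the nominal 5% of genes look "significant".
print(f"fraction with p < 0.05: {(p < 0.05).mean():.2f}")
```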

Page 115: Microarray Workshop 1 Introduction to Classification Issues in Microarray Data Analysis Jane Fridlyand Jean Yee Hwa Yang University of California, San.

115Microarray Workshop

Acknowledgements

UCSF/CBMB
• Ajay Jain
• Mark Segal
• UCSF Cancer Center Array Core
• Jain Lab

SFGH
• Agnes Paquet
• David Erle
• Andrea Barczac
• UCSF Sandler Genomics Core Facility

UCB
• Terry Speed
• Sandrine Dudoit

Page 116: Microarray Workshop 1 Introduction to Classification Issues in Microarray Data Analysis Jane Fridlyand Jean Yee Hwa Yang University of California, San.

116Microarray Workshop

Some references
1. Hastie, Tibshirani, Friedman, "The Elements of Statistical Learning", Springer, 2001
2. Speed (editor), "Statistical Analysis of Gene Expression Microarray Data", Chapman & Hall/CRC, 2003
3. Alizadeh et al., "Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling", Nature, 2000
4. Van 't Veer et al., "Gene expression profiling predicts clinical outcome of breast cancer", Nature, 2002
5. Van de Vijver et al., "A gene-expression signature as a predictor of survival in breast cancer", NEJM, 2002
6. Petricoin et al., "Use of proteomic patterns in serum to identify ovarian cancer", Lancet, 2002 (and relevant correspondence)
7. Golub et al., "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring", Science, 1999
8. Cho et al., "A genome-wide transcriptional analysis of the mitotic cell cycle", Mol. Cell, 1999
9. Dudoit et al., "Comparison of discrimination methods for the classification of tumors using gene expression data", JASA, 2002

Page 117: Microarray Workshop 1 Introduction to Classification Issues in Microarray Data Analysis Jane Fridlyand Jean Yee Hwa Yang University of California, San.

117Microarray Workshop

Some references
10. Ambroise and McLachlan, "Selection bias in gene extraction on the basis of microarray gene-expression data", PNAS, 2002
11. Tibshirani et al., "Estimating the number of clusters in a dataset via the gap statistic", Tech report, Stanford, 2000
12. Tseng et al., "Tight clustering: a resampling-based approach for identifying stable and tight patterns in data", Tech report, 2003
13. Dudoit and Fridlyand, "A prediction-based resampling method for estimating the number of clusters in a dataset", Genome Biology, 2002
14. Dudoit and Fridlyand, "Bagging to improve the accuracy of a clustering procedure", Bioinformatics, 2003
15. Kaufman and Rousseeuw, "Clustering by means of medoids", Elsevier/North-Holland, 1987
16. See the many articles by Leo Breiman on aggregation