BIOINF 704 · Class discovery · 2012. 10. 29.


Transcript of BIOINF 704 lecture slides: Class discovery.

Page 1:

BIOINF 704

Stéphane Guindon

Department of Statistics, UoA.

Building 303, Room 381.

[email protected]

Class discovery

Page 2:

The big ‘p’, small ‘n’ problem

• Microarray classification suffers from the big ‘p’, small ‘n’ problem, aka the ‘curse of dimensionality’.

• This relates to the fact that in microarray experiments, the number of samples (n) is far smaller than the number of genes, which are the potential predictors (p).

Page 3:

Why is the big ‘p’, small ‘n’ a problem?

• Think about this in terms of the dimensionality of the data.

• We can plot each of our n samples (i.e., arrays) in p-dimensional space (where p is our number of potential predictors, i.e., genes).

• Generally we will have around two orders of magnitude difference between the size of these quantities: typically n will be in the hundreds (at most), and p will be in the tens of thousands.

• If we now try to find a hyperplane (in < p dimensions) which will separate our n samples into k classes, there is a good chance that such a hyperplane will exist, simply because we have so many dimensions.

Page 4:

Why is the big ‘p’, small ‘n’ a problem?

[Figure: the samples plotted against gene 1 and gene 2. We can’t separate the two groups using a straight line.]

Page 5:

Why is the big ‘p’, small ‘n’ a problem?

[Figure: the same samples plotted against gene 1, gene 2 and gene 3. Adding a third gene now allows us to separate the two groups using a 2D plane.]

Page 6:

Why is the big ‘p’, small ‘n’ a problem?

• Hence, we will probably find one or more hyperplanes in p − 1 dimensions that separate the different groups.

• However, if we add a new data point (i.e., a new array), these hyperplanes may no longer separate the arrays into homogeneous groups.

• We need to reduce the dimensionality in order to avoid such over-fitting issues.

• It is generally recommended to select p such that (n/k)/p > 10, with n the number of arrays, k the number of classes and p the number of ‘predictor genes’.
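
As a quick sanity check of this rule of thumb in R (the numbers below are hypothetical, not from the course):

    n <- 200; k <- 2    # hypothetical: 200 arrays, 2 classes
    (n / k) / 10        # = 10: p must stay below this, i.e. roughly ten predictor genes at most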

Page 7:

Variable selection

• We want to remove genes from the analysis which do not contain information that helps to distinguish between the classes.

• One solution is to perform a t-test (in the case of two classes) for each gene to test for differences between the classes, and then select the genes with the highest t-statistics as potential predictors (a sketch in R follows this list).

• This actually works quite well, but doesn’t take interactions between genes into account: two genes may individually be poor predictors, but a combination of the two may make a good predictor.
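
A minimal sketch of the per-gene t-test filter in R, on simulated data; the matrix layout, the sample sizes and the cut-off of 50 genes are illustrative assumptions, not prescriptions from the slides:

    set.seed(1)
    p <- 1000; n <- 20                           # hypothetical sizes
    expr <- matrix(rnorm(p * n), nrow = p)       # p genes (rows) x n arrays (columns)
    cls  <- factor(rep(c("A", "B"), each = n / 2))
    ## Absolute two-sample t-statistic for every gene, then keep the largest ones.
    t_stat    <- apply(expr, 1, function(g) abs(t.test(g ~ cls)$statistic))
    top_genes <- order(t_stat, decreasing = TRUE)[1:50]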

Page 8:

Variable selection

• Another approach, the so-called wrapper methods, aims at finding the set of genes that minimizes the classification error.

• A direct approach for finding the r best performing genes out of a total of p would involve trying all (p choose r) = p! / (r! (p − r)!) combinations.

• With p = 100 and r = 10, (p choose r) ≈ 1.731 × 10¹³!

• Smart algorithms for searching the gene subset space have been proposed.

• Branch-and-bound search, sequential forward (backward) selection and sequential forward (backward) floating search are the most popular approaches.
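
As one concrete (and deliberately naive) illustration, here is a greedy sequential forward selection sketch in R; it assumes a 1-nearest-neighbour classifier and the leave-one-out error as the selection criterion, on simulated data. None of these choices are prescribed by the slides:

    library(class)                              # provides knn.cv (leave-one-out 1-NN)
    set.seed(1)
    p <- 200; n <- 40
    expr <- matrix(rnorm(p * n), nrow = p)      # p genes x n arrays, hypothetical
    cls  <- factor(rep(c("A", "B"), each = n / 2))

    loo_error <- function(genes) {              # leave-one-out error on a gene subset
      pred <- knn.cv(t(expr[genes, , drop = FALSE]), cls, k = 1)
      mean(pred != cls)
    }

    selected <- integer(0)
    for (step in 1:5) {                         # greedily add five genes, one at a time
      candidates <- setdiff(seq_len(p), selected)
      errs <- sapply(candidates, function(g) loo_error(c(selected, g)))
      selected <- c(selected, candidates[which.min(errs)])
    }
    selected                                    # indices of the chosen genes

Branch-and-bound and the floating variants refine this basic greedy scheme; they are not sketched here.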

Page 9:

Prediction error

• Given a new sample, what is the probability of inferring its class correctly?

• This probability is generally defined empirically.

• The empirical classification error is the ratio of wrong decisions to the total number of cases studied.

• The prediction accuracy can initially be evaluated by testing the classifier back on the training set and noting the resultant training (or resubstitution, or apparent) error rate.

• Such an error rate is useful for the purpose of designing the classifier, but it underestimates the true prediction error rate, as the test and training sets are confounded here.

Page 10:

Prediction error based on cross-validation

• The estimate of the classification error depends on the particular training and test samples used, so it is a random variable.

• Cross-validation is used to estimate the mean and variance of this random variable.

• The idea is to split the original training set into m subsets of approximately equal size.

• m − 1 of these subsets are used as training samples and the prediction error is estimated on the remaining subset.

• This operation is repeated over the m sub-samples (i.e., each of the m sub-samples is successively used as the test set).

• This procedure is called m-fold cross-validation.
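
A minimal m-fold cross-validation sketch in R; the 3-nearest-neighbour classifier and the simulated data are assumptions made purely for illustration:

    library(class)                               # k-nearest-neighbour classifier
    set.seed(1)
    n <- 60; p <- 50
    X <- matrix(rnorm(n * p), nrow = n)          # n arrays x p genes (hypothetical)
    y <- factor(rep(c("A", "B"), each = n / 2))

    m    <- 5                                    # number of folds
    fold <- sample(rep(1:m, length.out = n))     # random fold assignment
    cv_err <- sapply(1:m, function(f) {
      test <- fold == f
      pred <- knn(X[!test, ], X[test, ], y[!test], k = 3)
      mean(pred != y[test])                      # error on the held-out fold
    })
    mean(cv_err)                                 # m-fold cross-validation estimate
    sd(cv_err)                                   # spread of the estimate across folds

Setting m equal to n gives the leave-one-out scheme described on the next slide.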

Page 11:

Leave-one-out cross-validation

• Training set: the original data with one observation discarded.

• Test set: discarded observation.

• Go through all the observations in the original data and use each of them as the test set.

• Prediction error: number of incorrect predictions / number of observations in the original data set.

Page 12:

Prediction errors (again)

• We now need to be more specific about the prediction (or classification) error we are interested in.

• Let Xi be the vector of expression levels for array i. Yi is the corresponding class (e.g., affected or not affected). C(Xi | T) is the class estimated for Xi using a classifier whose parameters have been estimated from a training set T.

• L(C(Xi | T), Yi) is equal to 1 if C(Xi | T) ≠ Yi (i.e., prediction error) and 0 otherwise (i.e., correct prediction).

• Two types of error are of interest:

  • The prediction error, ErrT = ET(L(C(Xi | T), Yi)), is the expected number of errors when classifying a new sample (Xi) using a classifier that has been trained using T (i.e., the expectation is conditional on a particular T).

  • The expected prediction error, Err = E(ErrT), is averaged over all the possible values that L can take. We need to acknowledge here that the training set should be considered as independent random draws from (one or more) population(s) of interest.

Page 13:

Prediction errors (again)

• Ideally, we would like to estimate ErrT, but cross-validation or leave-one-out are only good at estimating Err.

• With leave-one-out, the estimator is approximately unbiased for ErrT, but the variance of the prediction accuracy on test data can be high due to over-fitting issues, i.e., the error estimate is only accurate for test data sets that “are similar” to T.

• Cross-validation has lower variance but underestimates ErrT .

Page 14:

Outline

Class discovery

Page 15:

Class discovery

• Clustering of genes:

  • Genes with similar expression profiles (across treatments) are likely to be functionally related.

  • Use expression profiles to annotate genes.

• Clustering of arrays:

  • Samples from a given disease.

  • Identify various subtypes for this disease.

• Unsupervised approaches: no class label is available (as opposed to prediction methods).

Page 16:

Cluster analysis

• Clustering is an exploratory procedure which provides a method for grouping objects.

• No assumptions are made about the number of groups, or the group structure.

• Grouping is done on the basis of similarities, or distances (dissimilarities).

• A quantitative scale is used to measure the association (i.e., similarity) between objects.

• For microarray data, the objects are the expression profiles of the genes in the experiment (rows) or the different arrays (columns).

Page 17:

Clustering of microarray data

• We use a distance matrix to record the distances between the objects to be considered.

• A simple approach is to treat each expression profile as a vector defining a point in a p-space (or n-space for distances between arrays), and measure Euclidean distances between objects.

Page 18:

Euclidean distance

• Log ratios (or absolute expression levels) are given for each gene.

                 Array 1   Array 2   Array 3
    Gene 1 (g1)     2.3       2.4       1.8
    Gene 2 (g2)     2.1       3.4       4.8

• Euclidean distance:

d(g1, g2) = √[(g11 − g21)² + (g12 − g22)² + (g13 − g23)²]
          = √[(2.3 − 2.1)² + (2.4 − 3.4)² + (1.8 − 4.8)²] ≈ 3.17

• In matrix notation, the Euclidean distance between two vectors x and y is:

d(x, y) = √[(x − y)ᵗ(x − y)]
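
The same computation in R, using the two profiles from the table above (a quick check of the worked example):

    g1 <- c(2.3, 2.4, 1.8)
    g2 <- c(2.1, 3.4, 4.8)
    sqrt(sum((g1 - g2)^2))                      # 3.17, by the formula
    dist(rbind(g1, g2), method = "euclidean")   # same value via R's dist()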

Page 19:

Manhattan distance

d(x, y) = Σi |xi − yi|

• A weighted version can also be used.

• The advantage of Manhattan distance over Euclidean distance is its robustness to extreme observations.

• Manhattan distance is one of the best performers in microarray clustering analysis.
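
Continuing the small example above (with g1 and g2 as defined in the Euclidean-distance snippet), the Manhattan distance comes from the same dist() call:

    dist(rbind(g1, g2), method = "manhattan")   # |0.2| + |-1.0| + |-3.0| = 4.2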

Page 20:

Pearson correlation

• Measures how well the expression measurement from gene/array x can be expressed as a linear function of the expression measurement from gene/array y.

cor(x, y) = Σi (xi − x̄)(yi − ȳ) / √[ Σi (xi − x̄)² · Σi (yi − ȳ)² ]

• x and y can be very dissimilar by Euclidean distance, yet be very similar by a correlation measure.

• The correlation similarity metric is usually converted to a distance metric using, e.g., d(x, y) = 1 − cor(x, y) or d(x, y) = (1 − cor(x, y))/2.
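
In R, for the two profiles defined earlier (and, commented out, the hypothetical genes × arrays matrix expr used in an earlier sketch):

    1 - cor(g1, g2)               # d(x, y) = 1 - cor(x, y)
    (1 - cor(g1, g2)) / 2         # rescaled so that the distance lies in [0, 1]
    ## All pairwise gene-gene correlation distances for a genes x arrays matrix:
    # D <- as.dist(1 - cor(t(expr)))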

Page 21:

Clustering algorithms

• The goal of clustering methods is to form subgroups such that objects (arrays or genes) within a subgroup are more similar to one another than to objects in different subgroups.

• There are two distinct approaches: hierarchical methods and partitional methods.

• Partitional methods aim to find non-nested partitions of the data.

• Hierarchical methods aim to find a nested series of partitions.

Page 22:

Clustering algorithms

[Figure: two panels illustrating partitional methods and hierarchical methods.]

Page 23:

k-means clustering

• k-means is a popular partitional clustering procedure.

• Given some specified number of clusters K, the goal is to segregate objects into K cohesive subgroups.

• The criterion the algorithm tries to minimize is the sum of intra-cluster variances:

V = Σ_{i=1}^{K} Σ_{j ∈ Ci} (xj − x̄i)²

where Ci is the set of the ni objects assigned to cluster i and x̄i is the centroid of cluster i.

Page 24:

k-means clustering

Page 25:

k-means clustering

Page 26:

k-means clustering

Page 27:

k-means clustering

• V is generally not monotonic.

• Thus, there is no guarantee that the minimum found is the global minimum.

• It is highly recommended to run the algorithm several times (e.g., nstart = 25 in R’s kmeans) and keep the best solution.

• Despite this, the k-means method is quite fast. It is also easy to implement and therefore very popular.
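
A minimal k-means sketch in R on simulated 2-D data (all names and values here are illustrative assumptions); nstart = 25 restarts the algorithm from 25 random initialisations and keeps the solution with the smallest V:

    set.seed(1)
    X <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),   # two artificial clusters
               matrix(rnorm(50, mean = 3), ncol = 2))
    km <- kmeans(X, centers = 2, nstart = 25)
    km$tot.withinss                    # the minimised criterion V
    km$cluster                         # cluster membership of each object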

Page 28:

Hierarchical clustering

• Two distinct approaches: agglomerative and divisive.

• Agglomerative hierarchical clustering begins with each data vector (gene or array) as its own cluster and at each stage chooses the “best” merge of two clusters until, in the end, a single cluster remains.

• Divisive hierarchical clustering proceeds in the opposite direction. It starts with all data vectors in a single big cluster and at each stage finds the best split.

Page 29:

Linkage methods

• The linkage method determines the updating scheme of the original matrix of pairwise distances.

• Linkage methods rely on an agglomeration step followed by a reduction step.

• The agglomeration step identifies the pair of nodes to be agglomerated according to a pre-defined criterion.

• The reduction step updates the matrix of distances between nodes.

Page 30:

Linkage methods

• Single linkage uses the minimum distance between two clusters.

• Complete linkage uses the maximum distance between two clusters.

• Average linkage uses the average distance between all pairs of items in the two clusters.

Page 31:

UPGMA: an average linkage approach

• UPGMA stands for Unweighted Pair-Group Method usingarithmetic Averages.

• Agglomeration criterion: (x, y) = argmin_(x, y) dxy, where (x, y) is any pair of nodes that have not been agglomerated yet.

• Reduction step: dxi ← (dxi + dyi)/2 for every i ≠ x and i ≠ y. y is then considered as an agglomerated node.

Page 32:

UPGMA: an average linkage approach

• Construct an UPGMA dendrogram for the distance matrix given below:

        a   b   c   d
    a   0   3   4   6
    b       0   4   5
    c           0   6
    d               0
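
A sketch of the same exercise with R's built-in hclust (our code, not part of the slides). In hclust, method = "average" is R's UPGMA, while method = "mcquitty" applies exactly the reduction step dxi ← (dxi + dyi)/2 from the previous slide; single and complete linkage are method = "single" and method = "complete":

    D <- matrix(c(0, 3, 4, 6,
                  3, 0, 4, 5,
                  4, 4, 0, 6,
                  6, 5, 6, 0),
                nrow = 4, dimnames = list(letters[1:4], letters[1:4]))
    hc <- hclust(as.dist(D), method = "mcquitty")   # pairwise-averaging reduction step
    hc$merge; hc$height                             # merge order and merge heights
    plot(hc)                                        # draw the dendrogram

With this reduction rule the merges happen at heights 3, 4 and 5.75, matching the step-by-step construction on the next page.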

Page 33:

UPGMA: an average linkage approach

[Figure: step-by-step construction of the dendrogram. Nodes a and b are joined first at distance 3 (branch height 1.5); the (a, b) cluster is then joined with c at distance 4 (height 2); finally ((a, b), c) is joined with d at distance 5.75 (height 2.875).]
