BIOINF 704 · Class discovery · 2012. 10. 29.


Transcript of BIOINF 704 lecture slides: Class discovery.

Page 1:

BIOINF 704

Stéphane Guindon

Department of Statistics, UoA.

Building 303, Room 381.

[email protected]

Class discovery

Page 2:

The big ‘p’, small ‘n’ problem

• Microarray classification suffers from the big ‘p’, small ‘n’ problem, aka the ‘curse of dimensionality’.

• This relates to the fact that in microarray experiments, the number of samples (n) is far smaller than the number of genes, which are the potential predictors (p).

Page 3:

Why is the big ‘p’, small ‘n’ a problem?

• Think about this in terms of the dimensionality of the data.

• We can plot each of our n samples (i.e., arrays) in p-dimensional space (where p is our number of potential predictors, i.e., genes).

• Generally we will have around two orders of magnitude difference between the size of these quantities: typically n will be in the hundreds (at most), and p will be in the tens of thousands.

• If we now try to find a hyperplane (in < p dimensions) which will separate our n samples into k classes, there is a good chance that such a hyperplane will exist, simply because we have so many dimensions.

Page 4:

Why is the big ‘p’, small ‘n’ a problem?

[Figure: the samples plotted against gene 1 and gene 2. We can’t separate the two groups using a straight line.]

Page 5:

Why is the big ‘p’, small ‘n’ a problem?

[Figure: the same samples plotted against gene 1, gene 2 and gene 3. Adding a third gene now allows us to separate the two groups using a 2D plane.]

Page 6:

Why is the big ‘p’, small ‘n’ a problem?

• Hence, we will probably find one or more hyperplanes in p − 1 dimensions that separate the different groups.

• However, if we add a new data point (i.e., a new array), these hyperplanes may no longer separate the arrays into homogeneous groups.

• We need to reduce the dimensionality in order to avoid such over-fitting issues.

• It is generally recommended to select p such that (n/k)/p > 10, with n the number of arrays, k the number of classes and p the number of ‘predictor genes’.
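
As a quick sanity check of this rule of thumb in R (the numbers below are hypothetical, not from the course):

    n <- 200; k <- 2    # hypothetical: 200 arrays, 2 classes
    (n / k) / 10        # = 10: p must stay below this, i.e. roughly ten predictor genes at most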

Page 7:

Variable selection

• We want to remove genes from the analysis which do not contain information that helps to distinguish between the classes.

• One solution is to perform a t-test (in the case of two classes) for each gene to test for differences between the classes, and then select the genes with the highest t-statistics as potential predictors (a sketch in R follows this list).

• This actually works quite well, but doesn’t take interactions between genes into account: two genes may individually be poor predictors, but a combination of the two may make a good predictor.
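
A minimal sketch of the per-gene t-test filter in R, on simulated data; the matrix layout, the sample sizes and the cut-off of 50 genes are illustrative assumptions, not prescriptions from the slides:

    set.seed(1)
    p <- 1000; n <- 20                           # hypothetical sizes
    expr <- matrix(rnorm(p * n), nrow = p)       # p genes (rows) x n arrays (columns)
    cls  <- factor(rep(c("A", "B"), each = n / 2))
    ## Absolute two-sample t-statistic for every gene, then keep the largest ones.
    t_stat    <- apply(expr, 1, function(g) abs(t.test(g ~ cls)$statistic))
    top_genes <- order(t_stat, decreasing = TRUE)[1:50]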

Page 8:

Variable selection

• Another approach, the so-called wrapper methods, aims at finding the set of genes that minimizes the classification error.

• A direct approach for finding the r best performing genes out of a total of p would involve trying all (p choose r) = p! / (r! (p − r)!) combinations.

• With p = 100 and r = 10, (p choose r) ≈ 1.731 × 10¹³!

• Smart algorithms for searching the gene subset space have been proposed.

• Branch-and-bound search, sequential forward (backward) selection and sequential forward (backward) floating search are the most popular approaches.
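
As one concrete (and deliberately naive) illustration, here is a greedy sequential forward selection sketch in R; it assumes a 1-nearest-neighbour classifier and the leave-one-out error as the selection criterion, on simulated data. None of these choices are prescribed by the slides:

    library(class)                              # provides knn.cv (leave-one-out 1-NN)
    set.seed(1)
    p <- 200; n <- 40
    expr <- matrix(rnorm(p * n), nrow = p)      # p genes x n arrays, hypothetical
    cls  <- factor(rep(c("A", "B"), each = n / 2))

    loo_error <- function(genes) {              # leave-one-out error on a gene subset
      pred <- knn.cv(t(expr[genes, , drop = FALSE]), cls, k = 1)
      mean(pred != cls)
    }

    selected <- integer(0)
    for (step in 1:5) {                         # greedily add five genes, one at a time
      candidates <- setdiff(seq_len(p), selected)
      errs <- sapply(candidates, function(g) loo_error(c(selected, g)))
      selected <- c(selected, candidates[which.min(errs)])
    }
    selected                                    # indices of the chosen genes

Branch-and-bound and the floating variants refine this basic greedy scheme; they are not sketched here.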

Page 9:

Prediction error

• Given a new sample, what is the probability of inferring its class correctly?

• This probability is generally defined empirically.

• The empirical classification error is the ratio of wrong decisions to the total number of cases studied.

• The prediction accuracy can initially be evaluated by testing the classifier back on the training set and noting the resultant training (or resubstitution, or apparent) error rate.

• Such an error rate is useful for the purpose of designing the classifier, but it underestimates the true prediction error rate, as the test and training sets are confounded here.

Page 10:

Prediction error based on cross-validation

• The estimate of the classification error depends on the particular training and test samples used, so it is a random variable.

• Cross-validation is used to estimate the mean and variance of this random variable.

• The idea is to split the original training set into m subsets of approximately equal size.

• m − 1 of these subsets are used as training samples and the prediction error is estimated on the remaining subset.

• This operation is repeated over the m sub-samples (i.e., each of the m sub-samples is successively used as the test set).

• This procedure is called m-fold cross-validation.
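
A minimal m-fold cross-validation sketch in R; the 3-nearest-neighbour classifier and the simulated data are assumptions made purely for illustration:

    library(class)                               # k-nearest-neighbour classifier
    set.seed(1)
    n <- 60; p <- 50
    X <- matrix(rnorm(n * p), nrow = n)          # n arrays x p genes (hypothetical)
    y <- factor(rep(c("A", "B"), each = n / 2))

    m    <- 5                                    # number of folds
    fold <- sample(rep(1:m, length.out = n))     # random fold assignment
    cv_err <- sapply(1:m, function(f) {
      test <- fold == f
      pred <- knn(X[!test, ], X[test, ], y[!test], k = 3)
      mean(pred != y[test])                      # error on the held-out fold
    })
    mean(cv_err)                                 # m-fold cross-validation estimate
    sd(cv_err)                                   # spread of the estimate across folds

Setting m equal to n gives the leave-one-out scheme described on the next slide.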

Page 11:

Leave-one-out cross-validation

• Training set: the original data with one observation discarded.

• Test set: discarded observation.

• Go through all the observations in the original data and use each of them as the test set.

• Prediction error: number of incorrect predictions / number of observations in the original data set.

Page 12:

Prediction errors (again)

• We now need to be more specific about the prediction (or classification) error we are interested in.

• Let Xi be the vector of expression levels for array i. Yi is the corresponding class (e.g., affected or not affected). C(Xi | T) is the class estimated for Xi using a classifier whose parameters have been estimated from a training set T.

• L(C(Xi | T), Yi) is equal to 1 if C(Xi | T) ≠ Yi (i.e., prediction error) and 0 otherwise (i.e., correct prediction).

• Two types of error are of interest:

  • The prediction error, ErrT = ET(L(C(Xi | T), Yi)), is the expected number of errors when classifying a new sample (Xi) using a classifier that has been trained using T (i.e., the expectation is conditional on a particular T).

  • The expected prediction error, Err = E(ErrT), is averaged over all the possible values that L can take. We need to acknowledge here that the training set should be considered as independent random draws from (one or more) population(s) of interest.

Page 13:

Prediction errors (again)

• Ideally, we would like to estimate ErrT, but cross-validation or leave-one-out are only good at estimating Err.

• With leave-one-out, the estimator is approximately unbiased for ErrT, but the variance of the prediction accuracy on test data can be high due to over-fitting issues, i.e., the error estimate is only accurate for test data sets that “are similar” to T.

• Cross-validation has lower variance but underestimates ErrT .

Page 14:

Outline

Class discovery

Page 15:

Class discovery

• Clustering of genes:

  • Genes with similar expression profiles (across treatments) are likely to be functionally related.

  • Use expression profiles to annotate genes.

• Clustering of arrays:

  • Samples from a given disease.

  • Identify various subtypes for this disease.

• Unsupervised approaches: no class label is available (as opposed to prediction methods).

Page 16:

Cluster analysis

• Clustering is an exploratory procedure which provides a method for grouping objects.

• No assumptions are made about the number of groups, or the group structure.

• Grouping is done on the basis of similarities, or distances (dissimilarities).

• A quantitative scale is used to measure the association (i.e., similarity) between objects.

• For microarray data, the objects are the expression profiles of the genes in the experiment (rows) or the different arrays (columns).

Page 17:

Clustering of microarray data

• We use a distance matrix to record the distances between the objects to be considered.

• A simple approach is to treat each expression profile as a vector defining a point in a p-space (or n-space for distances between arrays), and measure Euclidean distances between objects.

Page 18:

Euclidean distance

• Log ratios (or absolute expression levels) are given for each gene.

                 Array 1   Array 2   Array 3
    Gene 1 (g1)     2.3       2.4       1.8
    Gene 2 (g2)     2.1       3.4       4.8

• Euclidean distance:

d(g1, g2) = √[(g11 − g21)² + (g12 − g22)² + (g13 − g23)²]
          = √[(2.3 − 2.1)² + (2.4 − 3.4)² + (1.8 − 4.8)²] ≈ 3.17

• In matrix notation, the Euclidean distance between two vectors x and y is:

d(x, y) = √[(x − y)ᵗ(x − y)]
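
The same computation in R, using the two profiles from the table above (a quick check of the worked example):

    g1 <- c(2.3, 2.4, 1.8)
    g2 <- c(2.1, 3.4, 4.8)
    sqrt(sum((g1 - g2)^2))                      # 3.17, by the formula
    dist(rbind(g1, g2), method = "euclidean")   # same value via R's dist()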

Page 19:

Manhattan distance

d(x, y) = Σi |xi − yi|

• A weighted version can also be used.

• The advantage of Manhattan distance over Euclidean distance is its robustness to extreme observations.

• Manhattan distance is one of the best performers in microarray clustering analysis.
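
Continuing the small example above (with g1 and g2 as defined in the Euclidean-distance snippet), the Manhattan distance comes from the same dist() call:

    dist(rbind(g1, g2), method = "manhattan")   # |0.2| + |-1.0| + |-3.0| = 4.2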

Page 20:

Pearson correlation

• Measures how well the expression measurement from gene/array x can be expressed as a linear function of the expression measurement from gene/array y.

cor(x, y) = Σi (xi − x̄)(yi − ȳ) / √[ Σi (xi − x̄)² · Σi (yi − ȳ)² ]

• x and y can be very dissimilar by Euclidean distance, yet be very similar by a correlation measure.

• The correlation similarity metric is usually converted to a distance metric using, e.g., d(x, y) = 1 − cor(x, y) or d(x, y) = (1 − cor(x, y))/2.
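
In R, for the two profiles defined earlier (and, commented out, the hypothetical genes × arrays matrix expr used in an earlier sketch):

    1 - cor(g1, g2)               # d(x, y) = 1 - cor(x, y)
    (1 - cor(g1, g2)) / 2         # rescaled so that the distance lies in [0, 1]
    ## All pairwise gene-gene correlation distances for a genes x arrays matrix:
    # D <- as.dist(1 - cor(t(expr)))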

Page 21:

Clustering algorithms

• The goal of clustering methods is to form subgroups such that objects (arrays or genes) within a subgroup are more similar to one another than to objects in different subgroups.

• There are two distinct approaches: hierarchical methods and partitional methods.

• Partitional methods aim to find non-nested partitions of the data.

• Hierarchical methods aim to find a nested series of partitions.

Page 22:

Clustering algorithms

[Figure: two panels illustrating partitional methods and hierarchical methods.]

Page 23:

k-means clustering

• k-means is a popular partitional clustering procedure.

• Given some specified number of clusters K, the goal is to segregate objects into K cohesive subgroups.

• The criterion the algorithm tries to minimize is the sum of intra-cluster variances:

V = Σ_{i=1}^{K} Σ_{j ∈ Ci} (xj − x̄i)²

where Ci is the set of the ni objects assigned to cluster i and x̄i is the centroid of cluster i.

Page 24:

k-means clustering

Page 25:

k-means clustering

Page 26:

k-means clustering

Page 27:

k-means clustering

• V is generally not monotonic.

• Thus, there is no guarantee that the minimum found is the global minimum.

• It is highly recommended to run the algorithm several times (e.g., nstart = 25 in R’s kmeans) and keep the best solution.

• Despite this, the k-means method is quite fast. It is also easy to implement and therefore very popular.
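
A minimal k-means sketch in R on simulated 2-D data (all names and values here are illustrative assumptions); nstart = 25 restarts the algorithm from 25 random initialisations and keeps the solution with the smallest V:

    set.seed(1)
    X <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),   # two artificial clusters
               matrix(rnorm(50, mean = 3), ncol = 2))
    km <- kmeans(X, centers = 2, nstart = 25)
    km$tot.withinss                    # the minimised criterion V
    km$cluster                         # cluster membership of each object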

Page 28:

Hierarchical clustering

• Two distinct approaches: agglomerative and divisive.

• Agglomerative hierarchical clustering begins with each data vector (gene or array) as its own cluster and at each stage chooses the “best” merge of two clusters until, in the end, a single cluster remains.

• Divisive hierarchical clustering proceeds in the opposite direction. It starts with all data vectors in a single big cluster and at each stage finds the best split.

Page 29:

Linkage methods

• The linkage method determines the updating scheme of the original matrix of pairwise distances.

• Linkage methods rely on an agglomeration step followed by a reduction step.

• The agglomeration step identifies the pair of nodes to be agglomerated according to a pre-defined criterion.

• The reduction step updates the matrix of distances between nodes.

Page 30:

Linkage methods

• Single linkage uses the minimum distance between two clusters.

• Complete linkage uses the maximum distance between two clusters.

• Average linkage uses the average distance between all pairs of items in the two clusters.

Page 31:

UPGMA: an average linkage approach

• UPGMA stands for Unweighted Pair-Group Method usingarithmetic Averages.

• Agglomeration criterion: (x, y) = argmin_(x, y) dxy, where (x, y) is any pair of nodes that have not been agglomerated yet.

• Reduction step: dxi ← (dxi + dyi)/2 for every i ≠ x and i ≠ y. y is then considered as an agglomerated node.

Page 32:

UPGMA: an average linkage approach

• Construct an UPGMA dendrogram for the distance matrix given below:

        a   b   c   d
    a   0   3   4   6
    b       0   4   5
    c           0   6
    d               0
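
A sketch of the same exercise with R's built-in hclust (our code, not part of the slides). In hclust, method = "average" is R's UPGMA, while method = "mcquitty" applies exactly the reduction step dxi ← (dxi + dyi)/2 from the previous slide; single and complete linkage are method = "single" and method = "complete":

    D <- matrix(c(0, 3, 4, 6,
                  3, 0, 4, 5,
                  4, 4, 0, 6,
                  6, 5, 6, 0),
                nrow = 4, dimnames = list(letters[1:4], letters[1:4]))
    hc <- hclust(as.dist(D), method = "mcquitty")   # pairwise-averaging reduction step
    hc$merge; hc$height                             # merge order and merge heights
    plot(hc)                                        # draw the dendrogram

With this reduction rule the merges happen at heights 3, 4 and 5.75, matching the step-by-step construction on the next page.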

Page 33:

UPGMA: an average linkage approach

[Figure: step-by-step construction of the dendrogram. Nodes a and b are joined first at distance 3 (branch height 1.5); the (a, b) cluster is then joined with c at distance 4 (height 2); finally ((a, b), c) is joined with d at distance 5.75 (height 2.875).]
