Translational Cancer Medicine: Statistical Analysis of Microarray Data
Eric Blanc (KCL), December 16, 2013


Outline

1 Introduction
   - Microarrays

2 Statistics of differential expression
   - Differential expression detection
   - Multiple testing corrections
   - Moderated statistics

3 Clustering and classification
   - Statistical Decision Theory
   - Examples

4 Data processing and quality control


Introduction: Genomics data sets

Genomics data sets: a single experiment returns a quantity measured for a large proportion of the constituents of the cell. For example:

- Expression levels for most genes.
- SNP counts for millions of SNPs covering most of the genome.
- All binding sites for a transcription factor.
- Protein-protein interactions for most protein pairs.

Genomics data sets are usually very large, much larger (and potentially noisier) than those obtained by conventional methods.

Statistical tools are necessary for:

1. Quantitative assessment of the results,
2. Discovery and identification of features in the data,
3. Analysis of possible sources of bias, or structure in the experimental noise.


Introduction: Micro-array data sets as an example

The micro-array experiment offers an experimental read-out for a large proportion of the genome. It can be DNA (SNPs, CNV, ChIP) or RNA (mRNA, small RNAs).

The primary output of a micro-array experiment is represented by a large matrix of numbers, which contains the readout values for r rows of probes (genes, SNPs, genomic regions, ...) and c columns of different samples.

The lack of a quantitative theoretical framework for the physico-chemical processes generating the data exacerbates the need for a careful analysis of the noise structures.

Only differences in expression between conditions can be reliably measured.

We consider here expression arrays to measure mRNA expression levels.


Introduction: Microarray technology

Probes are oligonucleotide sequences complementary to small genomic sequences (in genes, exons, regulatory sequences, around SNPs, ...).

Probes are covalently bound to the array surface so that probes sharing the same sequence are located in the same area of the array.


Introduction: Microarray technology (continued)

The sample (DNA or mRNA) is fragmented and labelled with a covalently bound fluorescent dye.

The labelled sample is hybridised on the array surface and, after washing, only probes whose target sequences are present in the sample remain hybridised.


Introduction: Micro-array data and expression matrix

[Figure: the array surface, with spots ranging from low to high expression, is summarised as an expression matrix with Nrow rows (Probe 1 to Probe 10) and Ncol columns (Sample 1 to Sample 5).]


Introduction: Overview

1. Quantitative assessment of the results
   - Identification of genes that follow a pre-defined pattern in several conditions.

2. Discovery and identification of features in the data
   - Classification and prediction of phenotype status based on expression pattern. Cancer class and patient response to treatment may be predicted from the patient's expression profile.
   - Clustering of genes that have similar expression patterns in several conditions. An example would be finding groups of co-regulated genes in a time series containing a large number of points.

3. Analysis of possible sources of bias, or structure in the experimental noise.
   - Looking for "surprises"

Statistics of differential expression

Statistical models: Introduction

Is gene g expressed differently between healthy controls and patients affected by the disease under study?

The expression levels of g are recorded in $N_h$ healthy controls and $N_d$ patients ($\{x_i\}$ and $\{y_j\}$).

For each gene g, a t test is carried out between the expression levels $\{x_i\}$ and $\{y_j\}$ to assess the statistical significance of the expression difference, but where is the model?

The model is the mathematical description of the situation. It provides a quantitative framework to describe the main features of the system.


Statistical models: Mathematical description

Model assumptions:

- Each sample is a "faithful representation" of the corresponding parent population.
- The parent populations follow a pre-determined distribution (generally, the normal distribution).
- These distributions represent the probability of observing a given value for the expression of gene g, when the patient is taken at random.

The sample averages of gene expression provide estimates of the mean expression in the two parent populations, and the sample standard deviations provide estimates of the populations' dispersion.

The t test provides the probability of observing a difference at least as large as the measured one if the mean expression of gene g were equal in the two parent populations.

The model is intrinsically probabilistic, and its assumptions cannot be proven.


Statistical models: Model choice

When the sample data are normally distributed, the sample average maximises the likelihood, but when the distribution is a double exponential, the sample median is the maximum likelihood estimator.

The definition of an outlier depends on the distribution: there is a probability of $1.5 \cdot 10^{-23}$ of observing a data point further than 10σ away from the mean when the distribution is normal, but this probability is 0.063 when the distribution is Cauchy (σ is then read as the scale parameter, since a Cauchy distribution has no finite standard deviation).
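The two tail probabilities quoted above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names are illustrative, and the Cauchy distribution is assumed to be the standard one (location 0, scale 1), which the slide does not state explicitly.

```python
import math


def normal_two_sided_tail(k: float) -> float:
    """P(|x| > k) for a standard normal variable (k in units of sigma)."""
    # 2 * (1 - Phi(k)) equals erfc(k / sqrt(2)).
    return math.erfc(k / math.sqrt(2.0))


def cauchy_two_sided_tail(k: float) -> float:
    """P(|x| > k) for a standard Cauchy variable (k in units of the scale)."""
    # The Cauchy CDF is 1/2 + atan(x) / pi.
    return 1.0 - (2.0 / math.pi) * math.atan(k)


for k in (5.0, 10.0):
    print(f"k = {k:4.1f}  normal: {normal_two_sided_tail(k):.3g}  "
          f"Cauchy: {cauchy_two_sided_tail(k):.3g}")
```

A point 10 scale units from the centre is essentially impossible under the normal model, but occurs in roughly 6% of draws under the Cauchy model, so the same observation is an outlier under one model and unremarkable under the other.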

[Figure: Normal vs Cauchy distributions. Probability density against x, for x from −4 to 4. Annotations: P(|x|>5) < 10^−6 for the normal distribution, P(|x|>5) = 0.1 for the Cauchy distribution.]


Differential expression detection: Application of statistical models in genomic data sets

The same statistical model applies to all genes. Usually, a normal error model is assumed, to obtain closed-form formulae.

Statistical significance of gene expression differences across conditions is usually assessed by hypothesis testing (t tests or ANOVA).

The high level of noise in the data usually implies a large number of false positives and false negatives among genes called differentially expressed.

The large number of identical tests has two major consequences:

- Need for multiple testing correction, and
- Possibility of statistic moderation


Multiple testing correction: Introduction

Consider that N statistical tests are carried out in an analysis, and N P values are produced.

A gene is called differentially expressed when its P value is under the pre-defined threshold α. For a gene that is not differentially expressed, there is therefore a probability α that it is called by mistake (a False Positive, FP).

Therefore we have (if the N tests are independent, each with threshold $\alpha_1$):

\[ P(\text{1 FP in 1 test}) = \alpha_1 \]

\[ P(\text{0 FP in 1 test}) = 1 - \alpha_1 \]

\[ P(\text{0 FP in } N \text{ tests}) = (1 - \alpha_1)^N \]

\[ P(\text{at least 1 FP in } N \text{ tests}) = \alpha_N = 1 - (1 - \alpha_1)^N \]

To ensure that P(at least 1 FP in N tests) is small, the significance cutoff $\alpha_1$ of an individual test must be set such that $\alpha_N = 1 - (1 - \alpha_1)^N$, or $\alpha_1 \approx \alpha_N / N$.

This correction (due to Bonferroni) is exceedingly stringent when the number of tests is large.
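The relation between the per-test cutoff and the family-wise error can be explored numerically. This is a small sketch under the independence assumption stated above; the function names are illustrative, and `log1p`/`expm1` are used only to keep the arithmetic accurate when the cutoffs are tiny.

```python
import math


def familywise_error_rate(alpha_1: float, n_tests: int) -> float:
    """P(at least 1 FP in N independent tests), each with cutoff alpha_1."""
    # 1 - (1 - alpha_1)^N, computed in log space for numerical accuracy.
    return -math.expm1(n_tests * math.log1p(-alpha_1))


def exact_per_test_cutoff(alpha_n: float, n_tests: int) -> float:
    """Solve alpha_N = 1 - (1 - alpha_1)^N for alpha_1."""
    return -math.expm1(math.log1p(-alpha_n) / n_tests)


def bonferroni_cutoff(alpha_n: float, n_tests: int) -> float:
    """First-order approximation alpha_1 = alpha_N / N."""
    return alpha_n / n_tests


# With 20 tests at the usual 5% cutoff, a false positive is more likely than not.
print(familywise_error_rate(0.05, 20))
# For a million tests, the exact and Bonferroni cutoffs are nearly identical.
print(exact_per_test_cutoff(0.01, 10**6), bonferroni_cutoff(0.01, 10**6))
```

The Bonferroni cutoff is always slightly smaller than the exact one, so it is marginally conservative even when the tests are independent.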


Multiple testing correction: False Discovery Rate

When there are $10^6$ tests (typical for SNP arrays), the 1% statistical significance level after Bonferroni correction corresponds to P values below $10^{-8}$, which is unreasonably stringent for most experiments.

An alternative is to control the False Discovery Rate, which is more appropriate (less stringent) than controlling the probability of the occurrence of one False Positive call (Family-Wise Error Rate).

The q value is defined as the expected proportion of False Positive calls among the tests for which the statistic (for example t) is above a given threshold.

So, from the statistic t, the q value is computed instead of the P value, and lists of differentially expressed SNPs (or genes) are obtained from the q values.
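The slides do not say which estimator of the q value is used. One common choice, assumed here purely for illustration, is the Benjamini-Hochberg adjustment, which turns a list of P values into monotone q value estimates. A minimal sketch:

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values, in the input order.

    This is one possible estimator of the q value, not necessarily the
    one intended in the lecture.
    """
    n = len(p_values)
    # Indices of the tests, sorted by increasing P value.
    order = sorted(range(n), key=lambda i: p_values[i])
    q_values = [0.0] * n
    running_min = 1.0
    # Walk from the largest P value down so that q values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        q_values[i] = running_min
    return q_values


# Hypothetical P values for six genes.
p = [0.0001, 0.004, 0.019, 0.03, 0.2, 0.7]
for p_i, q_i in zip(p, benjamini_hochberg(p)):
    print(f"P = {p_i:<7} q = {q_i:.4f}")
```

Calling genes with q below 0.05 then targets an expected 5% of false positives among the called genes, rather than a 5% chance of any false positive at all.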


Parallel statistical tests: Moderated statistics

The t statistic used to assess the significance of differential expression is given by:

\[ t = \frac{\bar{x} - \bar{y}}{s \sqrt{1/n_x + 1/n_y}} \quad \text{with} \quad s^2 = \frac{\sum_{i=1}^{n_x} (x_i - \bar{x})^2 + \sum_{j=1}^{n_y} (y_j - \bar{y})^2}{n_x + n_y - 2} \]

When s is small, the statistical significance increases, at a given value of the difference between means.

With many parallel tests and few replicates for each condition, one expects "accidentally" small standard deviations in gene expression, leading to artificially high statistical significance.

By introducing a parametric model for the distribution of the standard deviations, the t (and F) statistics can be "moderated", reducing the false positive rate.
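To make the effect of a small pooled standard deviation concrete, here is a minimal sketch of the two-sample t statistic with pooled variance, written with the standard library only. The expression values are made up for illustration.

```python
import math


def pooled_t_statistic(x, y):
    """Two-sample t statistic with pooled variance; returns (t, s squared)."""
    n_x, n_y = len(x), len(y)
    mean_x = sum(x) / n_x
    mean_y = sum(y) / n_y
    # Sums of squared deviations from each group mean.
    ss_x = sum((v - mean_x) ** 2 for v in x)
    ss_y = sum((v - mean_y) ** 2 for v in y)
    s2 = (ss_x + ss_y) / (n_x + n_y - 2)
    t = (mean_x - mean_y) / (math.sqrt(s2) * math.sqrt(1.0 / n_x + 1.0 / n_y))
    return t, s2


# Same difference between means (0.5), very different spreads (hypothetical data).
noisy = pooled_t_statistic([7.0, 8.0, 9.0], [7.5, 8.5, 9.5])
tight = pooled_t_statistic([7.99, 8.0, 8.01], [8.49, 8.5, 8.51])
print("noisy replicates: t = %.2f" % noisy[0])
print("tight replicates: t = %.2f" % tight[0])
```

With only three replicates per condition, a spread as tight as in the second case can arise by chance for some genes among thousands, which is the situation that moderation is meant to guard against.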


Using parallel tests to regularise statistics: Hierarchical models

Hierarchical models impose parametric distributions on the parameters governing the pdf of each test.

In the case of a comparison between the mean expression values in two conditions with a common variance, we could for example impose that $\rho(\mu_x), \rho(\mu_y) \propto \text{const}$ and $\rho(\sigma^2) \propto s_0^2 \chi^{-2}_{\nu_0}$.

The whole dataset is then used to fit the hyper-parameters (here $s_0$ and $\nu_0$).

The posterior values for the residual variances (and degrees of freedom) are:

\[ \tilde{s}_j^2 = \frac{\nu_0 s_0^2 + \nu s_j^2}{\nu_0 + \nu} \]

where $s_j^2$ is the sample variance of gene j on ν degrees of freedom, and the posterior variance $\tilde{s}_j^2$ carries $\nu_0 + \nu$ degrees of freedom.

When $s_j$ is "accidentally" small, the addition of $\nu_0 s_0^2$ regularises the value of the t statistic, as it avoids division by a small number.

The loss of statistical significance is compensated by the increase in the degrees of freedom.
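The shrinkage of the variance can be sketched in a few lines. In this illustration the hyper-parameters s0 squared and nu0 are simply given hypothetical values; fitting them from the whole data set, as the slide describes, is not shown.

```python
import math


def moderated_variance(s2_j: float, nu: float, s2_0: float, nu_0: float) -> float:
    """Posterior variance (nu_0 * s_0^2 + nu * s_j^2) / (nu_0 + nu)."""
    return (nu_0 * s2_0 + nu * s2_j) / (nu_0 + nu)


def moderated_t(mean_diff, s2_j, nu, s2_0, nu_0, n_x, n_y):
    """Moderated t statistic and its degrees of freedom (nu_0 + nu)."""
    s2_post = moderated_variance(s2_j, nu, s2_0, nu_0)
    t = mean_diff / math.sqrt(s2_post * (1.0 / n_x + 1.0 / n_y))
    return t, nu_0 + nu


# Hypothetical gene: 3 replicates per condition, accidentally tiny variance.
n_x = n_y = 3
nu = n_x + n_y - 2
ordinary_t = 1.0 / math.sqrt(0.01 * (1.0 / n_x + 1.0 / n_y))
mod_t, mod_df = moderated_t(1.0, 0.01, nu, s2_0=1.0, nu_0=4.0, n_x=n_x, n_y=n_y)
print("ordinary t = %.2f on %d degrees of freedom" % (ordinary_t, nu))
print("moderated t = %.2f on %d degrees of freedom" % (mod_t, mod_df))
```

The moderated statistic is much smaller than the ordinary one for this gene, but it is compared against a t distribution with more degrees of freedom.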


Clustering and classification of gene expression

[Figure from van't Veer et al. (2002). Nature 415, 530-536: the genes fall into two sets (set 1 and set 2) and the patients into two clusters (cluster 1 and cluster 2). Patients in the two clusters have different expression patterns, and the expression pattern predicts ER status.]


Statistical Decision Theory: Regression, classification and clustering

We consider a data set of N pairs $(y_i, x_i)$, where i varies from 1 to N. The $x_i$ are the inputs, and the $y_i$ the outputs of the problem.

- Regression: the numerical observations $y_i$ are predicted by a model which maps the explanatory variables $x_i$ onto $y_i$. One may be interested either in the model predictions Y for new explanatory variables X, or in the parameters θ identifying the model.
- Classification: the observations $y_i$ are the classes from which the explanatory variables $x_i$ are drawn. One is usually interested in the model predictions for new variables.
- Clustering: there are no observations $y_i$, only explanatory variables $x_i$, which must be grouped according to similar properties.

NB: In most cases, the explanatory variables have many components, and are therefore represented by a vector $\mathbf{x}_i$.


Statistical Decision Theory: Definitions

We consider a variable y (the ER class, for example) which depends on another variable (observed or modelled) x (the expression pattern). The variables have a joint probability distribution p(x, y).

We seek a function f predicting the value of y from the knowledge of x: y = f(x).

We define a loss function L(y, f(x)) penalising the prediction errors. Typically $L(y, f(x)) = (y - f(x))^2$, or 0 and 1 for correct and incorrect classifications respectively.

For the squared loss, the solution f minimising the expected loss is $f(x) = E(y|x)$; for the 0-1 loss, it is the most probable class given x.

When the choice of f is not directed by the problem, a training data set is required to select f from broad classes of functions.


Statistical Decision Theory: Algorithm training procedure

We assume a training data set $D = \{(x_i, y_i)\}$ and a loss function L(y, f(x)).

By some minimisation algorithm, we tune the function f so that the loss over the training data is minimal.

This training of f depends on the class of acceptable functions f. It involves the optimisation of the internal parameters of f.

Once the algorithm is trained, it can predict output values y for inputs it has never seen (not in D).

The true value of an algorithm lies in its prediction efficiency on new data, not on the training data, even though it is optimised only against the training data set.


Statistical Decision Theory: Examples of types of prediction function f

K nearest neighbours (KNN): Consider that the classifier knows the class of the inputs $x_i$. Then, a new point x is classified according to the classes of the K nearest points $x_i$. Each of the nearest neighbours of x "votes" for a class, and the new point is assigned to the majority class.

Support Vector Machines (SVM): Hyper-planes (or hyper-surfaces) in the input space of the $x_i$ are constructed to optimally separate the various classes. The support vectors are the data points $x_i$ which define the separation planes. Support vector machines provide a separation of the input space into C disjoint regions, where C is the number of classes allowed for the output values.


K-Nearest Neighbours (KNN algorithm): Synthetic example

The expression of 2 genes has been measured for 100 patients, 50 ER+ (red) and 50 ER− (green).

The perfect classifier allowing prediction of ER status from the expression of these two genes is known.

Some patients have an ER status that falls outside the perfect classifier domains, because the ER status is an observable, and as such is subject to experimental error.

For this example, we have set the parameters such that:

- The (0, 1) square is divided into two regions (green and red), both of equal area 0.5.
- 100 points are randomly distributed in the (0, 1) square, such that there are 50 points in each region.
- 40 points in the red region are assigned the red class, and 10 the green class. The same proportions are used for the green region.
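The construction above can be reproduced with a short script. This is only a sketch: the slides do not give the shape of the boundary between the two regions, so the diagonal x2 = x1 is used here as a hypothetical boundary of equal areas, and the function names are illustrative.

```python
import random
from collections import Counter


def make_training_set(seed=0):
    """100 points in the unit square, 50 per region, 10 mislabelled per region."""
    rng = random.Random(seed)
    points, labels = [], []
    for region, other in (("red", "green"), ("green", "red")):
        n_in_region = 0
        while n_in_region < 50:
            x1, x2 = rng.random(), rng.random()
            in_red = x2 > x1  # hypothetical boundary: the diagonal of the square
            if (region == "red") != in_red:
                continue  # rejection sampling: keep only points in this region
            points.append((x1, x2))
            # First 40 points keep the region's class, the last 10 get the other.
            labels.append(region if n_in_region < 40 else other)
            n_in_region += 1
    return points, labels


def knn_predict(points, labels, query, k):
    """Majority vote among the k training points nearest to query."""
    by_distance = sorted(
        range(len(points)),
        key=lambda i: (points[i][0] - query[0]) ** 2 + (points[i][1] - query[1]) ** 2,
    )
    votes = Counter(labels[i] for i in by_distance[:k])
    return votes.most_common(1)[0][0]


def training_errors(points, labels, k):
    """Number of training points misclassified by the k-neighbour classifier."""
    return sum(knn_predict(points, labels, p, k) != c for p, c in zip(points, labels))


points, labels = make_training_set()
for k in (1, 5, 59):
    print("K = %2d: %d misclassified training points" % (k, training_errors(points, labels, k)))
```

With a single neighbour every training point is its own nearest neighbour, so no training point is misclassified, including the 20 deliberately mislabelled ones.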

K-Nearest Neighbours (KNN algorithm): Example

[Figure sequence, with axes x1 and x2 both running from 0.0 to 1.0. Panel titles: "KNN training set: 100 points"; "Class prediction for black point"; "Number of neighbours: 1"; "Number of neighbours: 5"; "Number of neighbours: 59".]

Classification and Statistical Decision Theory: Lessons from the KNN example

Within a class of algorithms (KNN), there is still a choice to make between members of the class (the number of neighbours).

When the number of neighbours is low, the predicted region boundaries are very complex. When the number of neighbours increases, the boundaries become smoother.

The number of misclassified training set data points is 0 when the number of neighbours is 1 (each training point is its own nearest neighbour), and it increases with the number of neighbours. This shows that the training set cannot be used to select the best classifier.

Classifiers, like the usual estimators, are statistical quantities, which have statistical properties such as bias and variance.


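Statistical Decision Theory: Bias-Variance decomposition

The prediction error can be decomposed into 3 main sources. Writing $\hat{f}$ for the predictor fitted on the training set D, and f for the true function $f(x) = E(y|x)$:

\[
\begin{aligned}
E\big((y - \hat{f}(x))^2 \mid D\big) &= E\big[(y - f(x) + f(x) - \hat{f}(x))^2 \mid D\big] \\
&= E\big[(y - f(x))^2\big] + E\big[(f(x) - \hat{f}(x))^2\big] \\
&= \sigma^2 + E\big[(\hat{f}(x) - E(\hat{f}(x)) + E(\hat{f}(x)) - f(x))^2\big] \\
&= \sigma^2 + \underbrace{E\big[(\hat{f}(x) - E(\hat{f}(x)))^2\big]}_{\text{Variance}} + \underbrace{\big(E(\hat{f}(x)) - f(x)\big)^2}_{\text{Bias}^2}
\end{aligned}
\]

$\sigma^2$ is the irreducible error made on the measurement of y.

The variance is due to the choice of sample: other data samples would have led to slightly different models $\hat{f}$.

The bias is due to the choice of model function (linear models could be chosen for their simplicity, even though they may have a bias in their predictions).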


Model assessment and selection: Definitions

We can define:

- model selection, which is choosing the best performing model, and
- model assessment, which estimates the prediction error on new data.

In data-rich situations, the data can be split into 3 parts:

- the training set, against which the model parameters are optimised,
- the validation set, used to estimate the prediction error for model selection, and
- the test set, used to compute the true test error.
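A three-way split is simple to set up in practice. The sketch below shuffles sample indices and cuts them into three disjoint parts; the 50/25/25 proportions and the function name are assumptions made for illustration, since the slides do not prescribe any.

```python
import random


def three_way_split(n_samples, fractions=(0.5, 0.25, 0.25), seed=0):
    """Return disjoint lists of indices for training, validation and test."""
    indices = list(range(n_samples))
    # Shuffle so that the split does not depend on the order of collection.
    random.Random(seed).shuffle(indices)
    n_train = int(fractions[0] * n_samples)
    n_valid = int(fractions[1] * n_samples)
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]  # whatever remains goes to the test set
    return train, valid, test


train, valid, test = three_way_split(100)
print(len(train), len(valid), len(test))
```

Candidate models (for example different numbers of neighbours in KNN) would be fitted on the training indices, compared on the validation indices, and the chosen model evaluated once on the test indices.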


Clustering of gene expression: Different algorithms lead to different clustering results

[Figure from D'haeseleer (2005). Nat. Biotechnol. 23, 1499-1501. Panels: (a) original clusters; (b) hierarchical clustering; (c) K-means; (d) self-organising maps (SOM).]


Clustering of expression data: Distances and Proximity matrices

To cluster data, one needs to define the notion of proximity between data points. Formally:

The proximity $d_{ij}$ between inputs $x_i$ and $x_j$ must be defined.

In many cases (but not all), proximities enjoy the mathematical properties of distances:

- $d_{ij} \ge 0$
- $d_{ij} = 0 \Leftrightarrow x_i = x_j$
- $d_{ij} = d_{ji}$
- $d_{ij} \le d_{ik} + d_{kj}$


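Distances and Proximity matrices: Examples

Examples of true distances:

\[ d_p(x_i, x_j) = \Big( \sum_k |x_{i,k} - x_{j,k}|^p \Big)^{1/p} \]

p = 2: Euclidean, p = 1: Manhattan, p = ∞: Maximum

Examples of non-distance dissimilarities:

\[ r = 1 - \frac{x_i \cdot x_j}{\|x_i\| \, \|x_j\|} \quad \text{(Pearson correlation, for } x_i, x_j \text{ centred on their means)} \]

\[ D = \frac{\sum_k |x_{i,k} - x_{j,k}|}{\sum_k |x_{i,k} + x_{j,k}|} \quad \text{(Canberra)} \]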
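The proximities listed on this slide translate directly into code. The sketch below uses plain Python sequences; the function names are illustrative, the correlation version centres each vector on its mean before applying the formula, and the last function implements the ratio-of-sums form exactly as written on the slide.

```python
import math


def minkowski(x, y, p):
    """d_p distance; p = 1 Manhattan, p = 2 Euclidean, p = inf Maximum."""
    if math.isinf(p):
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def correlation_dissimilarity(x, y):
    """1 - Pearson correlation between x and y."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cx = [a - mean_x for a in x]  # centre both vectors on their means
    cy = [b - mean_y for b in y]
    dot = sum(a * b for a, b in zip(cx, cy))
    norm = math.sqrt(sum(a * a for a in cx)) * math.sqrt(sum(b * b for b in cy))
    return 1.0 - dot / norm


def ratio_of_sums(x, y):
    """Sum of |x_k - y_k| divided by sum of |x_k + y_k|, as on the slide."""
    return sum(abs(a - b) for a, b in zip(x, y)) / sum(abs(a + b) for a, b in zip(x, y))


gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]  # same profile shape as gene_a, twice the amplitude
print(minkowski(gene_a, gene_b, 2), correlation_dissimilarity(gene_a, gene_b))
```

The two hypothetical genes are far apart in Euclidean distance but have a correlation dissimilarity of essentially zero, which is why the choice of proximity matters when clustering co-regulated genes.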


Distances and Proximity matricesOptimisation of cluster assignments

Clustering is an assignment of data points into K clusters so that thedistances between points from the same cluster in minimised

The optimisation can be done on the within-cluster point scatter

W (C ) =1

2

K∑k=1

∑i ,i ′∈Ck

d(xi , xi ′)

Combinatorial explosion of the number of possible assignments of N data points to K clusters

$$S(N, K) = \frac{1}{K!} \sum_{k=1}^{K} (-1)^{K-k} \binom{K}{k} k^N$$
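The size of this combinatorial explosion can be checked directly from the formula for S(N, K). The sketch below uses exact integer arithmetic; the function name `n_assignments` is an illustrative choice.

```python
from math import comb, factorial

def n_assignments(n, k):
    """Number of ways to split n points into k non-empty clusters, S(n, k)."""
    total = sum((-1) ** (k - j) * comb(k, j) * j ** n for j in range(1, k + 1))
    return total // factorial(k)  # the sum is always divisible by k!

print(n_assignments(10, 4))  # 34105
print(n_assignments(19, 4))  # already more than 10**10
```

Even for a handful of points, exhaustive search over all assignments is out of reach, which is why approximate algorithms are used in practice.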


K-means: Definition

Algorithm to find a quick approximation to the optimal assignment of points into K clusters

The number of clusters K is fixed a priori, and the algorithm returns an assignment in which each observation belongs to exactly one cluster

“Representative” points (one per cluster) are used to assign the data points to clusters

The objective function is $\sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - m_k\|^2$, where $m_k$ is the centroid of cluster $C_k$, and each observation $x_i$ is assigned to the cluster whose centre is nearest
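A common way to minimise this objective is to alternate between assigning points to the nearest centroid and recomputing the centroids (often called Lloyd's algorithm; the slides do not name the variant used). The sketch below is a minimal version under that assumption, with the initial seeds passed in explicitly; the function name and the toy data are hypothetical.

```python
import numpy as np

def kmeans(x, initial_centroids, n_iter=100):
    """Alternate nearest-centroid assignment and centroid update."""
    x = np.asarray(x, dtype=float)
    centroids = np.asarray(initial_centroids, dtype=float).copy()

    def assign(c):
        # Squared distance from every point to every centroid.
        d2 = ((x[:, None, :] - c[None, :, :]) ** 2).sum(axis=-1)
        return d2.argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(centroids)
        updated = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(len(centroids))
        ])
        if np.allclose(updated, centroids):
            break  # centroids no longer move: convergence
        centroids = updated

    labels = assign(centroids)
    objective = float(((x - centroids[labels]) ** 2).sum())
    return labels, centroids, objective

# Two small, well-separated groups (toy data, not from the lecture).
square = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
data = np.vstack([square, square + 10.0])
labels, centroids, objective = kmeans(data, initial_centroids=data[[0, -1]])
print(labels, centroids, objective)
```

The result depends on the initial seeds, and the procedure only reaches a local minimum of the objective, so in practice it is usually restarted from several seeds.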


K-means: Examples

[Figure: scatter plot of the example data (3 clusters), panel "Truth"; both axes run from −2 to 8]


K-means: Examples

[Figure: scatter plot of the example data (3 clusters), panel "Starting point"; both axes run from −2 to 8]


K-means: Examples

[Figure: scatter plot of the example data (3 clusters), panel "Initial seeds"; both axes run from −2 to 8]


K-means: Examples

[Figure: scatter plot of the example data (3 clusters), panel "After 1 iteration"; both axes run from −2 to 8]


K-means: Examples

[Figure: scatter plot of the example data (3 clusters), panel "After 2 iterations"; both axes run from −2 to 8]


K-means: Examples

[Figure: scatter plot of the example data (3 clusters), panel "After 3 iterations"; both axes run from −2 to 8]


K-means: Examples

[Figure: scatter plot of the example data (3 clusters), panel "After 4 iterations"; both axes run from −2 to 8]


K-means: Examples

[Figure: scatter plot of the example data (3 clusters), panel "After 5 iterations"; both axes run from −2 to 8]


K-means: Examples

[Figure: scatter plot of the example data (3 clusters), panel "After 6 iterations"; both axes run from −2 to 8]


K-means: Examples

[Figure: scatter plot of the example data (3 clusters), panel "After 7 iterations"; both axes run from −2 to 8]


K-means: Examples

[Figure: scatter plot of the example data (3 clusters), panel "After 8 iterations"; both axes run from −2 to 8]


K-means: Examples

[Figure: scatter plot of the example data (3 clusters), panel "Convergence"; both axes run from −2 to 8]


K-means: Examples

[Figure: scatter plot of the example data (3 clusters), panel "Mislabelled"; both axes run from −2 to 8]


K-means: Examples

[Figure: scatter plot titled "Unsuccessful 2-clusters assignment", panel "Convergence"; x axis from −3 to 3, y axis from −6 to 6]


K-means: Examples

[Figure: scatter plot titled "Unsuccessful 2-clusters assignment", panel "Convergence"; both axes run from −2 to 6]


Classification and clustering: Summary

Classification and clustering are both difficult problems, with many different competing algorithms to address them.

The error made by classifiers can (in principle) be estimated using test sets (reference data sets not used for training or model selection).

An objective estimate of the error is not possible for clustering, as the true outputs are never known.
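The idea of estimating a classifier's error on a test set can be sketched as follows. Everything here is an illustrative assumption: the nearest-centroid classifier stands in for any classifier, the data are synthetic, and the function names are hypothetical.

```python
import numpy as np

def fit_nearest_centroid(x, y):
    """Store one centroid per class (a deliberately simple classifier)."""
    classes = np.unique(y)
    centroids = np.array([x[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict_nearest_centroid(model, x):
    classes, centroids = model
    d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return classes[d2.argmin(axis=1)]

def holdout_error(x, y, test_fraction=0.3, seed=0):
    """Train on one part of the data, report the error rate on the held-out part."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    n_test = int(round(test_fraction * len(x)))
    test, train = order[:n_test], order[n_test:]
    model = fit_nearest_centroid(x[train], y[train])
    predictions = predict_nearest_centroid(model, x[test])
    return float(np.mean(predictions != y[test]))

# Synthetic two-class data with known labels (not real expression data).
rng = np.random.default_rng(1)
features = np.vstack([rng.normal(0.0, 1.0, (20, 2)), rng.normal(10.0, 1.0, (20, 2))])
classes_true = np.repeat([0, 1], 20)
error = holdout_error(features, classes_true)
print(error)
```

The estimate is only meaningful because the test labels are known and were not used for training; no equivalent reference exists for a clustering result.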


Data processing and quality control: Besides specific hybridization

There are many ways in which cross-hybridization and folding can affect the measured intensity

Taken from Binder (2006). J. Phys. Condens. Matter 18 S491-S523


Data processing and quality control: Trivial experimental problems

False-colour image of a micro-array experiment

Regions with low-reliability intensity measurements are shown in green

The green regions probably mark the locations of air bubbles, which limited the hybridization


Data processing and quality control: Biological samples variability

Agreement between the measured intensities of 3 biological replicates, displayed on a logarithmic scale. The two replicates on the left agree well, while there are considerable differences with the third replicate.
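One simple way to quantify this kind of replicate agreement is the pairwise correlation of log-transformed intensities. This is a sketch under stated assumptions: the lecture does not say which measure was used, and the three replicates below are simulated, with the third deliberately made noisier.

```python
import numpy as np

def replicate_agreement(intensities):
    """Pairwise Pearson correlation between replicates (columns) on a log2 scale."""
    return np.corrcoef(np.log2(intensities), rowvar=False)

# Simulated intensities for 1000 probes and 3 replicates (not real data).
rng = np.random.default_rng(0)
log_signal = rng.normal(8.0, 2.0, size=1000)
noise_sd = [0.1, 0.1, 1.0]  # assumption: replicate 3 is much noisier
replicates = np.column_stack([2.0 ** (log_signal + rng.normal(0.0, s, 1000)) for s in noise_sd])

agreement = replicate_agreement(replicates)
print(np.round(agreement, 3))
```

A replicate whose correlations with the others are clearly lower stands out in this matrix, in the same way that it stands out in pairwise scatter plots.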


Data processing and quality control: Importance of processing algorithm

Agreement between the top differentially expressed genes as predicted by 4 different algorithms

The main processing step involved with micro-array data is called “Normalisation”

Many different algorithms have been proposed and implemented

There is no theoretical justification for choosing one over another

The choice of algorithm can have a dramatic influence on the outcome
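To give a concrete idea of what a normalisation step does, the sketch below implements quantile normalisation, one widely used method. It is chosen here purely as an illustration: the slides do not say which four algorithms were compared, and the small matrix is made-up data without ties or missing values.

```python
import numpy as np

def quantile_normalise(x):
    """Force every array (column) to share the same intensity distribution.

    Rows are probes, columns are arrays. Assumes no ties and no missing values.
    """
    x = np.asarray(x, dtype=float)
    ranks = x.argsort(axis=0).argsort(axis=0)     # rank of each probe within its array
    reference = np.sort(x, axis=0).mean(axis=1)   # mean of the k-th smallest values
    return reference[ranks]

raw = np.array([[2.0, 4.0],
                [5.0, 14.0],
                [4.0, 8.0],
                [3.0, 9.0]])
normalised = quantile_normalise(raw)
print(normalised)
```

After this step both columns contain exactly the same set of values, and only the ordering of probes within each array is preserved; other normalisation methods make different assumptions, which is one reason their downstream results can differ.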
