CBCl/AI MIT
What is bioinformatics ?
Pre 1995: Application of computing technology to providing statistical and database solutions to problems in molecular biology.
Post 1995: Defining and addressing problems in molecular biology using methodologies from statistics and computer science.
The genome project, genome-wide analysis/screening of disease, genetic regulatory networks, analysis of expression data.
Central dogma of biology
DNA → pre-mRNA → mRNA → Protein
Central dogma
DNA: CGAACAAACCTCGAACCTGCT
mRNA: GCU UGU UUA CGA (transcription)
Polypeptide: Ala Cys Leu Arg (translation)
Basic molecular biology
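The transcription and translation steps above can be sketched in code. This is an illustrative sketch, not part of the slides: the template strand and the four-codon table are chosen only to reproduce the mRNA and polypeptide of the example.

```python
# Minimal sketch of transcription and translation (illustrative only).
# The codon table covers just the four codons from the example above.
CODON_TABLE = {"GCU": "Ala", "UGU": "Cys", "UUA": "Leu", "CGA": "Arg"}

def transcribe(template_dna):
    """Transcribe a DNA template strand into mRNA (base-pair complement)."""
    pairs = {"A": "U", "T": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in template_dna)

def translate(mrna):
    """Translate mRNA into a polypeptide, one codon (3 bases) at a time."""
    return [CODON_TABLE[mrna[i:i + 3]] for i in range(0, len(mrna) - 2, 3)]

mrna = transcribe("CGAACAAATGCT")   # hypothetical template strand
peptide = translate(mrna)           # -> ['Ala', 'Cys', 'Leu', 'Arg']
```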
Transcription
End modification
Splicing
Transport
Translation
Less basic molecular biology
[Diagram: DNA → preRNA (transcription) → mRNA (splicing) → Protein (translation, stability; transport, localization). Sequence information comes from the DNA/RNA level; quantitative information comes from microarrays, RT-PCR, protein arrays, yeast two-hybrid, and chemical screens.]
Biological information
• Splice sites and branch points in eukaryotic pre-mRNA
• Gene finding in prokaryotes and eukaryotes
• Promoter recognition (transcription and termination)
• Protein structure prediction
• Protein function prediction
• Protein family classification
Sequence problems
• Predict tissue morphology
• Predict treatment/drug effect
• Infer metabolic pathways
• Infer disease pathways
• Infer developmental pathways
Gene expression problems
Basic idea:
The state of the cell is determined by its proteins. A gene codes for a protein, which is assembled via mRNA. Measuring the amount of a particular mRNA gives a measure of the amount of the corresponding protein; the number of copies of mRNA is the expression of a gene.
Microarray technology allows us to measure the expression of thousands of genes at once (cf. the Northern blot, which measures one gene at a time).
Measure the expression of thousands of genes under different experimental conditions and ask what is different and why.
Microarray technology
[Figure: two-color microarray workflow. cDNA clone library → PCR product → array, with test sample and reference labeled Cy5 and Cy3; oligonucleotide synthesis and RNA from a biological sample → array. Ramaswamy and Golub, JCO]
Microarray technology
[Figure: oligonucleotide vs. cDNA arrays; Lockhart and Winzeler 2000]
Microarray technology
Yeast experiment
Microarray experiment
When the science is not well understood, resort to statistics:
Ultimate goal: discover the genetic pathways of cancers
Infer cancer genetics by analyzing microarray data from tumors
Curse of dimensionality: Far too few examples for so many dimensions to predict accurately
Immediate goal: models that discriminate tumor types or treatment outcomes and determine genes used in model
Basic difficulty: few examples (20-100), high dimensionality (7,000-16,000 genes measured for each sample): an ill-posed problem
Analytic challenge
38 training examples of myeloid (AML) and lymphoblastic (ALL) leukemias; Affymetrix human 6800 chip (7,128 genes including control genes)
34 examples to test classifier
Results: 33/34 correct
d: perpendicular distance from the hyperplane
[Plot: d for the test data]
Cancer Classification
Coregulation: for a protein to be made, the expression of two genes must be correlated, so we need to look at pairwise correlations as well as individual expression levels
Size of feature space: with 7,000 genes there are about 24 million pairwise features, so the fact that the feature space is never explicitly computed is important
Two gene example: two genes measuring Sonic Hedgehog ($e_{sh}$) and TrkC ($e_{Trk}$). The second-order polynomial kernel corresponds to the explicit feature map
$$\phi(x) = \left(e_{sh}^2,\; e_{Trk}^2,\; \sqrt{2}\,e_{sh}e_{Trk},\; \sqrt{2}\,e_{sh},\; \sqrt{2}\,e_{Trk},\; 1\right),$$
so the kernel can be evaluated without ever forming the features:
$$K(x_i, x_j) = (x_i \cdot x_j + 1)^2 = \phi(x_i) \cdot \phi(x_j).$$
Coregulation and kernels
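The kernel trick above can be checked numerically: the second-order polynomial kernel computes the dot product in the expanded pairwise feature space without forming it. A sketch with hypothetical expression values (the numbers are made up for illustration):

```python
import numpy as np

def poly_kernel(x, y):
    """Second-order polynomial kernel: evaluated without forming features."""
    return (np.dot(x, y) + 1) ** 2

def phi(x):
    """Explicit 6-dimensional feature map for two genes (e.g. Shh, TrkC)."""
    s, t = x
    r2 = np.sqrt(2.0)
    return np.array([s * s, t * t, r2 * s * t, r2 * s, r2 * t, 1.0])

x = np.array([0.5, 2.0])    # hypothetical expression values
y = np.array([1.5, -1.0])
# The kernel value equals the dot product of the explicit features.
assert np.isclose(poly_kernel(x, y), np.dot(phi(x), phi(y)))
```

With 7,000 genes the explicit map would have millions of components; the kernel needs only the 7,000-dimensional dot product.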
Nonlinear SVMs help when the most informative genes are removed; informative as ranked using signal-to-noise (Golub et al).
Genes removed   Errors (polynomial order)
                1st    2nd    3rd
0               1      1      1
10              2      1      1
20              3      2      1
30              3      3      2
40              3      3      2
50              3      2      2
100             3      3      2
200             3      3      3
1500            7      7      8
Gene coregulation
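The signal-to-noise ranking referenced above (Golub et al) scores each gene by the separation of its class means relative to its class spreads. A minimal sketch on hypothetical toy data:

```python
import numpy as np

def signal_to_noise(X, y):
    """Golub et al. signal-to-noise score per gene.

    X: (samples, genes) expression matrix; y: binary labels (0/1).
    Score = (mean_1 - mean_0) / (std_1 + std_0); larger |score| = more informative.
    """
    X0, X1 = X[y == 0], X[y == 1]
    return (X1.mean(axis=0) - X0.mean(axis=0)) / (X1.std(axis=0) + X0.std(axis=0))

# Toy check (synthetic data): gene 0 separates the classes, gene 1 is noise.
rng = np.random.default_rng(0)
y = np.array([0] * 10 + [1] * 10)
X = rng.normal(size=(20, 2))
X[y == 1, 0] += 5.0
scores = signal_to_noise(X, y)
```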
Golub et al. classified 29 test points correctly and rejected 5, of which 2 were errors, using 50 genes
We need to introduce the concept of rejects into the SVM
[Figure: two-gene plot (g1 vs. g2) with Normal and Cancer regions separated by a Reject band around the decision boundary]
Rejecting samples
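The reject idea can be sketched for a linear decision function, assuming a fixed, hypothetical hyperplane (w, b): samples whose perpendicular distance d to the hyperplane falls below a threshold are rejected rather than labeled.

```python
import numpy as np

def predict_with_reject(w, b, X, threshold):
    """Classify with f(x) = w.x + b, rejecting samples whose perpendicular
    distance to the hyperplane is below `threshold`."""
    d = (X @ w + b) / np.linalg.norm(w)      # signed distance to hyperplane
    labels = np.where(d > 0, "Cancer", "Normal").astype(object)
    labels[np.abs(d) < threshold] = "Reject"
    return labels

w, b = np.array([1.0, 1.0]), 0.0             # hypothetical hyperplane
X = np.array([[2.0, 2.0], [-2.0, -2.0], [0.1, 0.0]])
calls = list(predict_with_reject(w, b, X, threshold=0.5))
# -> ['Cancer', 'Normal', 'Reject']
```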
Estimating a CDF
The regularized solution
[Plot: P(c = 1 | d) as a function of 1/d; at the .95 level (95% confidence, p = .05), d = .107]
Rejections for SVMs
Results: 31 correct, 3 rejected of which 1 is an error
[Plot: d for the test data]
Results with rejections
SVMs as stated use all genes/features
Molecular biologists/oncologists seem convinced that only a small subset of genes is responsible for particular biological properties, so they want the genes most important in discriminating
Practical reasons: a clinical device that measures thousands of genes is not financially practical
Possible performance improvement
Wrapper method for gene/feature selection
Gene selection
AML vs ALL: 40 genes, 34/34 correct, 0 rejects. 5 genes, 31/31 correct, 3 rejects of which 1 is an error.
B vs T cells for ALL: 10 genes, 33/33 correct, 0 rejects.
Results with gene selection
Recursive feature elimination (RFE): based upon perturbation analysis; eliminate genes that perturb the margin the least
Optimize leave-one-out (LOO): based upon optimization of the leave-one-out error of an SVM; the leave-one-out error is an almost unbiased estimate of the generalization error
Two feature selection algorithms
1. Solve the SVM problem for the weight vector w.
2. Rank-order the elements of w by absolute value.
3. Discard the input features/genes corresponding to those vector elements with small absolute magnitude (for example, the smallest 10%).
4. Retrain the SVM on the reduced gene set and go to step (2).
Recursive feature elimination
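The four steps above can be sketched in runnable form. The slides retrain a real SVM at each step; here a minimal hinge-loss linear classifier trained by subgradient descent stands in for the SVM solver, and the data are synthetic:

```python
import numpy as np

def train_linear_svm(X, y, lr=0.01, lam=0.01, epochs=200):
    """Minimal linear SVM: subgradient descent on the regularized hinge loss.
    y must be in {-1, +1}; a stand-in for a full SVM solver."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        active = y * (X @ w) < 1                     # margin violators
        grad = lam * w - (y[active, None] * X[active]).sum(axis=0) / len(X)
        w -= lr * grad
    return w

def rfe(X, y, n_keep, drop_frac=0.1):
    """Recursive feature elimination: retrain, then drop the fraction of
    remaining genes whose weights have the smallest absolute value."""
    genes = np.arange(X.shape[1])
    while len(genes) > n_keep:
        w = train_linear_svm(X[:, genes], y)
        n_drop = min(max(1, int(len(genes) * drop_frac)), len(genes) - n_keep)
        keep = np.sort(np.argsort(np.abs(w))[n_drop:])   # drop smallest |w_j|
        genes = genes[keep]
    return genes

# Synthetic data: only gene 3 carries the class signal.
rng = np.random.default_rng(1)
y = np.array([-1] * 10 + [1] * 10)
X = rng.normal(size=(20, 10))
X[:, 3] = 3.0 * y + 0.1 * rng.normal(size=20)
selected = rfe(X, y, n_keep=2)   # the informative gene survives elimination
```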
Use leave-one-out (LOO) bounds for SVMs as a criterion to select features, by searching over all possible subsets of n features for the ones that minimize the bound.
When such a search is impossible because of combinatorial explosion, scale each feature by a real value variable and compute this scaling via gradient descent on the leave-one-out bound. One can then keep the features corresponding to the largest scaling variables.
The rescaling can be done in the input space or in a “Principal Components” space.
Optimizing the LOO
Rescale features to minimize the LOO bound R²/M²
[Figure: two panels in the (x1, x2) plane: before rescaling, R²/M² > 1; after rescaling, R²/M² = 1, i.e. M = R]
Pictorial demonstration
Radius-margin bound: simple to compute, continuous; very loose, but often tracks the LOO error well
Jaakkola-Haussler bound: somewhat tighter, simple to compute; discontinuous, so it needs smoothing; valid only for SVMs with no b term
Span bound: tight, but complicated to compute; discontinuous, so it needs smoothing
Three LOO bounds
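A rough numerical sketch of the radius-margin quantity R²/M² for a linear kernel, assuming a weight vector w is given. The radius here is approximated by the largest distance from the data centroid; the exact bound uses the smallest enclosing sphere, so this is an illustration, not the precise computation:

```python
import numpy as np

def radius_margin_ratio(X, w):
    """Rough R^2/M^2 for a linear SVM with weight vector w: R approximated
    by the largest distance from the data centroid (the exact bound uses
    the smallest enclosing sphere); margin M = 1/||w||."""
    R = np.max(np.linalg.norm(X - X.mean(axis=0), axis=1))
    M = 1.0 / np.linalg.norm(w)
    return (R / M) ** 2

X = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
w = np.array([1.0, 0.0])                    # hypothetical weight vector
ratio = radius_margin_ratio(X, w)           # (sqrt(2) / 1)^2 = 2.0
```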
We add a scaling parameter $\sigma$ to the SVM, which scales genes; genes corresponding to small $\sigma_j$ are removed. The SVM function has the form
$$f(x) = \sum_{i \in SV} \alpha_i y_i K_\sigma(x_i, x) + b,$$
$$K_\sigma(x, y) = K(\sigma \ast x,\ \sigma \ast y), \qquad \sigma \in \mathbb{R}^n,$$
where $\ast$ denotes element-by-element multiplication.
Classification function with scaling
The $\alpha$'s are computed by maximizing the following quadratic form:
$$W^2(\alpha, \sigma) = \max_{\alpha}\ \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \Big(K_\sigma(x_i, x_j) + \frac{1}{C}\,\delta_{ij}\Big), \qquad \alpha_i \ge 0.$$
For computing the radius around the data, maximize
$$R^2(\beta, \sigma) = \max_{\beta}\ \sum_i \beta_i K_\sigma(x_i, x_i) - \sum_{i,j} \beta_i \beta_j K_\sigma(x_i, x_j), \qquad \sum_i \beta_i = 1,\ \beta_i \ge 0.$$
For the variance around the data,
$$V(\sigma) = \frac{1}{\ell}\sum_i K_\sigma(x_i, x_i) - \frac{1}{\ell^2}\sum_{i,j} K_\sigma(x_i, x_j).$$
Remember the functionals to be minimized: $T(\sigma) = R^2(\beta, \sigma)\,W^2(\alpha, \sigma)$ and $D(\sigma) = V(\sigma)\,W^2(\alpha, \sigma)$.
SVM and other functionals
The following steps are used to compute $\alpha$ and $\sigma$:
1. Initialize $\sigma = (1, \ldots, 1)$.
2. Solve the standard SVM algorithm: $\alpha(\sigma) = \arg\max_\alpha W(\alpha, \sigma)$.
3. Minimize the estimate of the error $T$ with respect to $\sigma$ with a gradient step.
4. If a local minimum of $T$ is not reached, go to step 3.
5. Discard dimensions corresponding to small elements of $\sigma$ and return to step 2.
Algorithm
Computing gradients
The gradient with respect to the margin:
$$\frac{\partial W^2}{\partial \sigma_f} = -\frac{1}{2}\sum_{i,j \in SV} \alpha_i \alpha_j y_i y_j \frac{\partial K_\sigma(x_i, x_j)}{\partial \sigma_f}.$$
The gradient with respect to the radius:
$$\frac{\partial R^2}{\partial \sigma_f} = \sum_i \beta_i \frac{\partial K_\sigma(x_i, x_i)}{\partial \sigma_f} - \sum_{i,j} \beta_i \beta_j \frac{\partial K_\sigma(x_i, x_j)}{\partial \sigma_f}.$$
The gradient with respect to the span of support vector $p$:
$$\frac{\partial S_p^2}{\partial \sigma_f} = S_p^4 \left(\tilde{K}_{SV}^{-1}\, \frac{\partial \tilde{K}_{SV}}{\partial \sigma_f}\, \tilde{K}_{SV}^{-1}\right)_{pp}, \qquad \text{where } S_p^2 = \frac{1}{\big(\tilde{K}_{SV}^{-1}\big)_{pp}}.$$
Linear problem with 6 relevant dimensions of 202
Nonlinear problem with 2 relevant dimensions of 52
[Plots: error rate vs. number of samples for each problem]
Toy data
Dataset                       Total Samples   Class 0        Class 1
Leukemia Morphology (train)   38              27 ALL         11 AML
Leukemia Morphology (test)    34              20 ALL         14 AML
Leukemia Lineage (ALL)        23              15 B-Cell      8 T-Cell
Lymphoma Outcome (AML)        15              8 Low risk     7 High risk

Dataset                       Total Samples   Class 0        Class 1
Lymphoma Morphology           77              19 FSC         58 DLCL
Lymphoma Outcome              58              22 Low risk    36 High risk
Brain Morphology              41              14 Glioma      27 MD
Brain Outcome                 50              38 Low risk    12 High risk
Hierarchy of difficulty:
1. Histological differences: normal vs. malignant, skin vs. brain
2. Morphologies: different leukemia types, ALL vs. AML
3. Lineage: B-Cell vs. T-Cell, follicular vs. large B-cell lymphoma
4. Outcome: treatment outcome, relapse, or drug sensitivity
Molecular classification of cancer
Dataset                      Algorithm   Total Samples   Total errors   Class 1 errors   Class 0 errors   Number Genes
Leukemia Morphology (test)   SVM         35              0/35           0/21             0/14             40
AML vs ALL                   WV          35              2/35           1/21             1/14             50
                             k-NN        35              3/35           1/21             2/14             10
Leukemia Lineage (ALL)       SVM         23              0/23           0/15             0/8              10
B vs T                       WV          23              0/23           0/15             0/8              9
                             k-NN        23              0/23           0/15             0/8              10
Lymphoma                     SVM         77              4/77           2/32             2/35             200
FS vs DLCL                   WV          77              6/77           1/32             5/35             30
                             k-NN        77              3/77           1/32             2/35             250
Brain                        SVM         41              1/41           1/27             0/14             100
MD vs Glioma                 WV          41              1/41           1/27             0/14             3
                             k-NN        41              0/41           0/27             0/14             5
Morphology classification
Dataset                  Algorithm   Total Samples   Total errors   Class 1 errors   Class 0 errors   Number Genes
Lymphoma                 SVM         58              13/58          3/32             10/26            100
LBC treatment outcome    WV          58              15/58          5/32             10/26            12
                         k-NN        58              15/58          8/32             7/26             15
Brain                    SVM         50              7/50           6/12             1/38             50
MD treatment outcome     WV          50              13/50          6/12             7/38             6
                         k-NN        50              10/50          6/12             4/38             5
Outcome classification
Error rates ignore temporal information such as when a patient dies. Survival analysis takes temporal information into account. The Kaplan-Meier survival plots and statistics for the above predictions show significance.
[Kaplan-Meier survival plots: Lymphoma, p-val = 0.0015; Medulloblastoma, p-val = 0.00039]
Outcome classification
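Kaplan-Meier curves like those above can be computed directly from follow-up data. A minimal sketch with hypothetical observations (event = 1 means a death was observed, 0 means the patient was censored):

```python
import numpy as np

def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.
    times: follow-up times; events: 1 = death observed, 0 = censored.
    Returns (event times, survival probability after each event time)."""
    times, events = np.asarray(times, float), np.asarray(events)
    order = np.argsort(times)
    times, events = times[order], events[order]
    out_t, surv, S = [], [], 1.0
    n_at_risk = len(times)
    for t in np.unique(times):
        at = times == t
        d = events[at].sum()             # deaths at time t
        if d > 0:
            S *= 1.0 - d / n_at_risk
            out_t.append(t)
            surv.append(S)
        n_at_risk -= at.sum()            # deaths and censored leave the risk set
    return np.array(out_t), np.array(surv)

# Hypothetical data: deaths at t=1, 2, 4; one patient censored at t=3.
t, s = kaplan_meier([1, 2, 3, 4], [1, 1, 0, 1])   # s -> [0.75, 0.5, 0.0]
```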
Tumor type     Abbrev   Total   Train   Test
Breast         B        11      8       3
Prostate       P        10      8       2
Lung           L        11      8       3
Colorectal     CR       13      8       5
Lymphoma       Ly       22      16      6
Bladder        Bl       11      8       3
Melanoma       M        10      8       2
Uterus         U        10      8       3
Leukemia       Le       30      24      6
Renal          R        11      8       3
Pancreas       PA       11      8       3
Ovary          Ov       11      8       3
Mesothelioma   MS       11      8       3
Brain          C        20      16      4
Note that most of these tumors came from secondary sources and were not at the tissue of origin.
Multi tumor classification
CNS, Lymphoma, Leukemia tumors separate
Adenocarcinomas do not separate
Clustering is not accurate
[Figure: output coding for combining binary classifiers; classes R (+1, +1), Y (−1, −1), G (+1, −1), B (−1, +1)]
Combination approaches: all pairs, one versus all (OVA)
Multi tumor classification
[Figure 2: supervised OVA methodology. A gene expression dataset feeds per-class OVA classifiers (Breast, Prostate, ..., CNS); the final multiclass call is the class with the highest OVA prediction strength. Example: a test sample lying far on the breast-tumor side of the "breast tumors vs. all other tumors" hyperplane (distance about +2) is called Breast with high confidence.]
Supervised methodology
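The final-call rule can be sketched directly: each OVA classifier reports a prediction strength (e.g. a signed distance to its hyperplane), and the multiclass call is the class with the largest one. The class names and numbers below are illustrative, not from the slides' data:

```python
import numpy as np

classes = ["Breast", "Prostate", "CNS"]   # illustrative subset of tumor classes

def ova_call(strengths, classes, reject_below=None):
    """Final multiclass call: the class whose OVA classifier reports the
    highest prediction strength; optionally flag low-confidence calls."""
    i = int(np.argmax(strengths))
    conf = "high confidence"
    if reject_below is not None and strengths[i] < reject_below:
        conf = "low confidence"
    return classes[i], conf

# Hypothetical OVA strengths for one test sample.
call, conf = ova_call(np.array([1.8, -0.4, 0.2]), classes)
# -> ("Breast", "high confidence")
```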
[Plots: accuracy and fraction of calls vs. prediction confidence; confidence scatter for correct vs. error calls; accuracy of first, top-2, and top-3 prediction calls; shown for train/cross-validation and Test 1]
Dataset   Sample type           Validation    Samples   Total      High confidence        Low confidence
                                method                  accuracy   Fraction   Accuracy    Fraction   Accuracy
Train     Well Differentiated   Cross-val.    144       78%        80%        90%         20%        28%
Test 1    Well Differentiated   Train/Test    54        78%        78%        83%         22%        58%
Well differentiated tumors
Feature selection hurts performance
[Plots: accuracy and fraction of calls vs. prediction confidence; confidence scatter for correct vs. error calls; accuracy of first, top-2, and top-3 prediction calls]
Dataset   Sample type             Validation    Samples   Total      High confidence        Low confidence
                                  method                  accuracy   Fraction   Accuracy    Fraction   Accuracy
Test      Poorly Differentiated   Train/test    20        30%        50%        50%         50%        10%
Poorly differentiated tumors
Predicting sample sizes
For a given number of existing samples, how significant is the performance
of a classifier? Does the classifier work at all? Are the results better than
what one would expect by chance?
Assuming we know the answer to the previous questions for a number of sample sizes, what would be the performance of the classifier when trained with additional samples? Will the accuracy of the classifier improve significantly? Is the effort to collect additional samples worthwhile?
Predicting sample sizes
Subsampling procedure:
[Diagram: from the input dataset of l samples, draw train/test realizations i = 1, 2, ..., T at sample sizes n1, n2, ..., l; train a classifier on each split and average the test errors to get error rates e_n1, e_n2, ..., e_l.]
Significance test: compare the error rate at each sample size with random predictors.
Learning curve: fit the average error rates with the power law e(n) = a·n^(−b).
[Plot: error rate vs. sample size, with "not significant" and "significant" regions marked]
This model can be used to estimate sample size requirements
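The power-law learning-curve fit can be sketched as a linear least-squares problem in log-log space. The error rates below are synthetic, generated from an exact power law so the fit recovers the parameters:

```python
import numpy as np

def fit_learning_curve(n, err):
    """Fit the learning-curve model e(n) = a * n**(-b) by linear least
    squares on log(err) vs. log(n); returns (a, b)."""
    slope, log_a = np.polyfit(np.log(n), np.log(err), 1)
    return np.exp(log_a), -slope

def predict_error(a, b, n):
    """Extrapolate the expected error rate at a new sample size n."""
    return a * n ** (-b)

# Synthetic average error rates at increasing sample sizes.
n = np.array([10, 20, 40, 80])
err = 0.5 * n ** -0.3
a, b = fit_learning_curve(n, err)   # recovers a = 0.5, b = 0.3
```

Extrapolating with `predict_error(a, b, 160)` then estimates how much additional samples would reduce the error.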
Statistical significance
The statistical significance of the tumor vs. non-tumor classification for (a) 15 samples and (b) 30 samples.
Learning curves
Tumor vs. normal
Learning curves
Tumor vs. normal
Learning curves
AML vs. ALL
Learning curves
Colon cancer
Learning curves
Brain treatment outcome