
Statistical Data Fusion to Prioritize Lists of Genes

Bert Coessens, Stein Aerts

Departement ESAT - SCD
Katholieke Universiteit Leuven

Promotor: Bart De Moor
Assessor: Yves Moreau

Database Issues in Biological Databases (DBiBD), January 8-9, 2005


Context


Linkage Analysis, Positional Cloning

NEFL

RAB7

GARS

GJB1

LMNA

High-throughput technologies


Concept

Pathology / Biological process / …

- Gene Expression
- Literature
- Anatomical Expression
- Gene Regulation
- Protein Domains
- Functional Annotation
- Evolutionary Conservation
- …


Concept

Model with multiple submodels

- Training genes (training set): choose submodels and TRAIN
- Candidate genes (test set): SCORE each gene i with every submodel
- One ranking for each submodel
- Order statistics: combined ranking
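The step from one submodel's scores to the rank ratios used by the order statistics can be sketched as follows. This is a minimal illustration, not the Endeavour implementation: it assumes that a rank ratio is a gene's rank within one submodel divided by the number of ranked genes, and that a higher score is better. The function name `rank_ratios` and the toy gene names and scores are hypothetical.

```python
def rank_ratios(scores, higher_is_better=True):
    """Map {gene: score} to {gene: rank / number_of_genes} for one submodel.

    Assumption: ties are broken by sort order; a real pipeline would need
    an explicit tie and missing-value policy.
    """
    ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
    n = len(ordered)
    return {gene: (position + 1) / n for position, gene in enumerate(ordered)}


# Hypothetical scores from a single submodel.
expression_scores = {"geneA": 0.91, "geneB": 0.15, "geneC": 0.47, "geneD": 0.78}
ratios = rank_ratios(expression_scores)
print(ratios)
```

Each candidate gene then carries one rank ratio per submodel, and that set of ratios is what the order statistics combine.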


Order Statistics

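Given a set of n rank ratios for gene i:

- what is the probability of getting these ratios by chance alone?

Joint probability density function of all n order statistics, integrated over the region bounded by the sorted rank ratios $r_1 \le r_2 \le \dots \le r_n$:

$$Q(r_1, r_2, \ldots, r_n) = n! \int_0^{r_1} \int_{s_1}^{r_2} \cdots \int_{s_{n-1}}^{r_n} ds_n \, ds_{n-1} \cdots ds_1$$

Recursive evaluation, with $V_0 = 1$ and $Q = n!\,V_n$:

$$V_k = \sum_{i=1}^{k} (-1)^{i-1} \frac{V_{k-i}}{i!} \, r_{n-k+1}^{\,i}$$

Complexity: $O(n^2)$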
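The recursion for V_k on this slide can be evaluated directly with two nested loops, which is where the quadratic complexity comes from. The sketch below assumes the rank ratios lie in [0, 1] and are sorted in ascending order before use. The function name `q_statistic` is hypothetical.

```python
from math import factorial


def q_statistic(rank_ratios):
    """Evaluate Q(r_1, ..., r_n) = n! * V_n with the recursive formula.

    V_0 = 1 and V_k = sum_{i=1..k} (-1)^(i-1) * V_(k-i) / i! * r_(n-k+1)^i,
    where r_1 <= ... <= r_n are the sorted rank ratios.
    """
    r = sorted(rank_ratios)
    n = len(r)
    v = [1.0] + [0.0] * n
    for k in range(1, n + 1):
        x = r[n - k]  # r_(n-k+1) in the 1-based notation of the slide
        v[k] = sum(
            (-1) ** (i - 1) * v[k - i] / factorial(i) * x ** i
            for i in range(1, k + 1)
        )
    return factorial(n) * v[n]


# Two rank ratios: the closed form is 2*r1*r2 - r1**2 for r1 <= r2.
print(q_statistic([0.5, 0.2]))
```

For a single rank ratio, Q equals that ratio. For n ratios that are all 1, Q equals 1, which matches the intuition that a gene ranked last everywhere gives no evidence against chance.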


Statistical Validation: Setup

29 lists of disease genes from OMIM

5 lists of random genes from the human genome

For each disease or random gene set do:
    For each gene in the set do:
        a. Leave one gene out
        b. TRAIN all submodels on the set minus the left-out gene
        c. Create a test set by adding the left-out gene to [9, 49, 99] random genes
        d. SCORE the test set with all trained submodels
        e. RANK the genes in the test set according to their order statistics p-value
    end
end

For a given cut-off x, calculate:

- TP: number of left-out genes ranked above x
- FP: number of genes other than the left-out gene ranked above x
- TN: number of genes other than the left-out gene ranked below x
- FN: number of left-out genes ranked below x

Calculate sensitivity and specificity from these values, plot (1 - specificity) versus sensitivity to obtain a Rank ROC plot, and calculate the area under the curve.
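The counting of TP, FP, TN and FN at each cut-off can be sketched as below. This is a minimal illustration that assumes every leave-one-out run uses a test set of the same size, that rank 1 is the best rank, and that "ranked above x" means a rank of at most x. The function names `rank_roc_points` and `area_under_curve` are hypothetical, and the trapezoidal rule for the area is an assumption.

```python
def rank_roc_points(left_out_ranks, test_set_size):
    """Return (1 - specificity, sensitivity) points for cut-offs 1..test_set_size.

    left_out_ranks holds, per leave-one-out run, the rank of the left-out
    gene within its test set.
    """
    runs = len(left_out_ranks)
    points = [(0.0, 0.0)]
    for x in range(1, test_set_size + 1):
        tp = sum(1 for rank in left_out_ranks if rank <= x)
        fn = runs - tp
        fp = runs * x - tp                    # other genes at or above the cut-off
        tn = runs * (test_set_size - x) - fn  # other genes below the cut-off
        sensitivity = tp / (tp + fn)
        specificity = tn / (tn + fp)
        points.append((1.0 - specificity, sensitivity))
    return points


def area_under_curve(points):
    """Trapezoidal area under the Rank ROC curve."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical ranks of the left-out gene in three runs with 10 genes each.
points = rank_roc_points([1, 2, 5], test_set_size=10)
print(area_under_curve(points))
```

If the left-out gene is always ranked first, the area is 1. If it is always ranked last, the area is 0.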


Statistical Validation: Disease genes


- 29 human diseases (OMIM) = 29 gene sets
- 627 disease genes with Ensembl identifier in total
- average gene set contains 19 genes
- smallest gene set = ALS with 4 genes
- largest gene set = leukemia with 113 genes


Statistical Validation: Submodels

Textual data: TXTGate

Sequence similarity: BLAST

- Rank genes according to e-value
- Example: Presenilin 1 vs. Presenilin 2, e-value = 10^-133


Statistical Validation: Submodels

Functional annotation: GO

Functional annotation: Kegg

Set of genes -> GO IDs -> observed frequencies

Full genome -> GO IDs -> expected frequencies


Statistical Validation: Submodels

Protein information: InterPro

Protein information: BIND

Training genes + interaction partners

Test gene + interaction partners

Overlap?
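The overlap check between the two neighbourhoods can be sketched with plain set operations. The slide does not say how the overlap is turned into a score, so the fraction used here is an assumption. The gene and partner names are made up, and `with_partners` and `overlap_score` are hypothetical helper names.

```python
def with_partners(genes, interactions):
    """Return the genes together with all of their interaction partners."""
    expanded = set(genes)
    for gene in genes:
        expanded |= interactions.get(gene, set())
    return expanded


def overlap_score(training_genes, test_gene, interactions):
    """Fraction of the test gene's neighbourhood shared with the training set.

    Assumption: this fraction is one plausible score, not the documented one.
    """
    training_side = with_partners(training_genes, interactions)
    test_side = with_partners([test_gene], interactions)
    shared = training_side & test_side
    return len(shared) / len(test_side), shared


# Hypothetical interaction data: gene -> set of interaction partners.
interactions = {
    "T1": {"P1", "P2"},
    "T2": {"P2", "P3"},
    "X": {"P2", "P4"},
}
score, shared = overlap_score(["T1", "T2"], "X", interactions)
print(score, shared)
```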


Statistical Validation: Submodels

Gene expression: Microarray data

Gene expression: ESTs

- Model is average expression profile of training genes
- Score test gene by calculating Pearson correlation

Human gene expression atlas: Su et al., 47 normal human tissues
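The expression submodel described on this slide can be sketched in a few lines: average the training profiles tissue by tissue, then correlate a test gene's profile with that average. The profiles below are invented four-tissue vectors rather than atlas data, and the function names `mean_profile` and `pearson` are hypothetical.

```python
from math import sqrt


def mean_profile(profiles):
    """Average expression per tissue over all training genes."""
    n = len(profiles)
    return [sum(values) / n for values in zip(*profiles)]


def pearson(x, y):
    """Pearson correlation of two equally long profiles.

    Assumption: neither profile is constant, otherwise the denominator is 0.
    """
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Hypothetical training profiles (rows: genes, columns: tissues).
training_profiles = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]]
model = mean_profile(training_profiles)

test_gene_profile = [10.0, 20.0, 30.0, 40.0]
score = pearson(model, test_gene_profile)
print(model, score)
```

A test gene whose profile rises and falls with the training average scores close to 1, regardless of its absolute expression level.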


Statistical Validation: Submodels

Cis-regulatory elements: TFBSs

Cis-regulatory elements: TFBS modules

- Check human-mouse CNS blocks in upstream sequence of a test gene

- Compare found motifs with motifs in training set

ModuleSearcher: searches for the best combination of 3 TFs in 300 bp of upstream sequence (US) of the genes in the training set

ModuleScanner: scores a test gene with the model


Statistical Validation: Similarity

Statistical meta-analysis

Vector-based similarity

Fisher’s method

Assume there are m independent tests of H0.

1. For the i-th test, calculate the corresponding p-value, $p_i$.
2. If each $p_i$ has a uniform distribution on [0,1], then $-2\sum_{i=1}^{m} \log p_i$ has a $\chi^2_{2m}$ distribution.
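Fisher's method as stated here can be sketched without external libraries, because a chi-square distribution with an even number of degrees of freedom (2m) has a closed-form tail probability. The sketch assumes all p-values are strictly greater than 0 and uses the natural logarithm. The function names `chi2_sf_even` and `fisher_combined_p` are hypothetical.

```python
from math import exp, log


def chi2_sf_even(x, m):
    """P(X > x) for a chi-square variable with 2*m degrees of freedom.

    Closed form for even degrees of freedom:
    exp(-x/2) * sum_{k=0..m-1} (x/2)^k / k!
    """
    half = x / 2.0
    term = 1.0
    total = 1.0
    for k in range(1, m):
        term *= half / k
        total += term
    return exp(-half) * total


def fisher_combined_p(p_values):
    """Return the Fisher statistic -2 * sum(log p_i) and its combined p-value."""
    m = len(p_values)
    statistic = -2.0 * sum(log(p) for p in p_values)
    return statistic, chi2_sf_even(statistic, m)


statistic, combined = fisher_combined_p([0.05, 0.05])
print(statistic, combined)
```

With a single test the combined p-value equals the input p-value, which is a quick sanity check on the formula.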

[Figure: three vectors labelled T1, T2, T3]

- Euclidean distance
- Pearson correlation
- Cosine similarity
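The three vector-based measures listed above can be sketched as follows. The vectors `t1`, `t2` and `t3` are invented three-dimensional examples, not data from the study. Pearson correlation is computed here as the cosine similarity of mean-centred vectors, which is a standard identity. The function names are hypothetical.

```python
from math import sqrt


def euclidean_distance(u, v):
    """Straight-line distance between two vectors of equal length."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def pearson_correlation(u, v):
    """Pearson correlation as the cosine similarity of mean-centred vectors."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    return cosine_similarity([a - mean_u for a in u], [b - mean_v for b in v])


# Hypothetical vectors.
t1 = [1.0, 0.0, 2.0]
t2 = [2.0, 0.0, 4.0]
t3 = [0.0, 3.0, 0.0]

print(euclidean_distance(t1, t2))
print(cosine_similarity(t1, t2), cosine_similarity(t1, t3))
print(pearson_correlation(t1, t2))
```

Euclidean distance is sensitive to vector length, while cosine similarity and Pearson correlation are not: `t1` and `t2` point in the same direction and so have cosine similarity 1, even though their distance is non-zero.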


Statistical Validation: Correlation


Statistical Validation: Rank ROC


Statistical Validation: Submodel Rank ROC


Statistical Validation: Bias towards known genes


Endeavour Application: Screenshot


Endeavour Application: Architecture

ESAT Web server

Linux cluster

Java RMI

SOAP messages


Conclusions and Future

Conclusions:

- Allows integration of heterogeneous data
- Solves problem of uncertainty
- Solves multiple testing problem (Bonferroni correction)
- Allows for cut-offs with statistical significance

Future:

- Different weighting for different submodels
- Explore mathematical modeling techniques (neural nets, SVM)
- Add more information models
- Define best combination of submodels


Acknowledgements

Bart De Moor, Stein Aerts, Yves Moreau

Patrick Glenisson, Steven Van Vooren, Joke Allemeersch

Endeavour Application: Demo

- Load training set
- Add submodels
- Train submodels
- Load candidate genes
- Score candidate genes with all submodels
- Results of scoring
- Ranking visualized in sprintplot