STRING Prediction of protein networks through integration of diverse large-scale data sets Lars Juhl Jensen EMBL Heidelberg

Page 1:

STRING

Prediction of protein networks through integration of diverse large-scale data sets

Lars Juhl Jensen

EMBL Heidelberg

Page 2:

STRING integrates many types of evidence

Genomic neighborhood

Species co-occurrence

Gene fusions

Database imports

Experimental interaction data

Microarray expression data

Literature co-mentioning

Page 3:

Integrating physical interaction screens

Make binary representation of complexes

Yeast two-hybrid data sets are inherently binary

Calculate score from number of (co-)occurrences

Calculate score from non-shared partners

Calibrate against KEGG maps

Infer associations in other species

Combine evidence from experiments
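
The first steps on this slide (expanding complexes into binary pairs and scoring them by co-occurrence) can be sketched as follows. This is only an illustration: the slide does not give the scoring formula, so the log ratio of observed to expected co-occurrence used here is an assumption, as are the function names and the toy data.

```python
from collections import Counter
from itertools import combinations
from math import log

def complexes_to_pairs(complexes):
    """Expand each purified complex into all unordered protein pairs."""
    pair_counts = Counter()
    protein_counts = Counter()
    for members in complexes:
        unique = sorted(set(members))
        protein_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    return pair_counts, protein_counts

def cooccurrence_score(pair, pair_counts, protein_counts, n_complexes):
    """Hypothetical raw score for an observed pair:
    log of observed over expected co-occurrence."""
    a, b = pair
    observed = pair_counts[pair]
    expected = protein_counts[a] * protein_counts[b] / n_complexes
    return log(observed / expected)

# Toy purification data (invented for illustration).
complexes = [["A", "B", "C"], ["A", "B"], ["C", "D"], ["A", "B", "D"]]
pair_counts, protein_counts = complexes_to_pairs(complexes)
score_ab = cooccurrence_score(("A", "B"), pair_counts, protein_counts, len(complexes))
```

Yeast two-hybrid results would skip `complexes_to_pairs`, since each bait and prey hit is already a pair.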

Page 4:

Gene fusion: predicting physical interactions

Detect multiple proteins matching to one protein

Exclude overlapping alignments

Infer associations in other species

Calibrate against KEGG maps
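
A minimal sketch of the first two steps, assuming alignment hits are already available as tuples of query protein, target protein, and target coordinates. The tuple layout, the `max_overlap` parameter, and the protein names are hypothetical and not taken from the slides.

```python
from collections import defaultdict
from itertools import combinations

def fusion_candidates(hits, max_overlap=0):
    """hits: (query, target, target_start, target_end), inclusive coordinates.
    Returns sorted (query1, query2, target) triples where two different
    queries align to non-overlapping regions of the same target."""
    by_target = defaultdict(list)
    for query, target, start, end in hits:
        by_target[target].append((query, start, end))
    candidates = set()
    for target, regions in by_target.items():
        for (q1, s1, e1), (q2, s2, e2) in combinations(regions, 2):
            if q1 == q2:
                continue
            # Number of target residues covered by both alignments.
            overlap = min(e1, e2) - max(s1, s2) + 1
            if overlap <= max_overlap:
                candidates.add((min(q1, q2), max(q1, q2), target))
    return sorted(candidates)

hits = [
    ("geneA", "fusedAB", 1, 200),
    ("geneB", "fusedAB", 210, 450),
    ("geneC", "fusedAB", 150, 300),  # overlaps both other alignments
]
pairs = fusion_candidates(hits)
```

Here only `geneA` and `geneB` are reported, because the `geneC` alignment overlaps both of the others.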

Page 5:

Mining microarray expression databases

Re-normalize arrays by modern method to remove biases

Build expression matrix

Combine similar arrays by PCA

Construct predictor by Gaussian kernel density estimation

Calibrate against KEGG maps

Infer associations in other species
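
The kernel density step alone can be illustrated as below; normalization and PCA are left out. The slide does not say how the predictor is built from the densities, so this sketch assumes one plausible construction: separate Gaussian kernel density estimates of expression correlation for linked and unlinked reference pairs, combined by Bayes' rule. The bandwidth, function names, and correlation values are all invented.

```python
from math import exp, pi, sqrt

def gaussian_kde(samples, bandwidth):
    """Return a density function built from Gaussian kernels centred on samples."""
    norm = 1.0 / (len(samples) * bandwidth * sqrt(2 * pi))
    def density(x):
        return norm * sum(exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
    return density

def make_predictor(pos_corr, neg_corr, bandwidth=0.1):
    """Hypothetical predictor: P(linked | correlation) from two KDEs,
    with the prior taken from the class sizes."""
    f_pos = gaussian_kde(pos_corr, bandwidth)
    f_neg = gaussian_kde(neg_corr, bandwidth)
    prior = len(pos_corr) / (len(pos_corr) + len(neg_corr))
    def predict(r):
        p = prior * f_pos(r)
        q = (1 - prior) * f_neg(r)
        return p / (p + q)
    return predict

# Toy correlation coefficients for reference pairs (invented).
pos = [0.8, 0.7, 0.9, 0.6]
neg = [0.0, 0.1, -0.2, 0.2, -0.1, 0.3, 0.05, -0.3]
predict = make_predictor(pos, neg)
```

With these toy values, `predict` increases as the correlation moves from the unlinked range towards the linked range.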

Page 6:

Gene neighborhood: predicting co-expression

Identify runs of adjacent genes with the same direction

Score each gene pair based on intergenic distances

Calibrate against KEGG maps

Infer associations in other species
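
The first two steps might look like this in code. The gene tuple layout and the choice of the summed intergenic distance as the raw score are assumptions made for illustration; the slide only states that scores are based on intergenic distances.

```python
def same_strand_runs(genes):
    """genes: (name, start, end, strand) tuples sorted by start on one replicon.
    Returns lists of adjacent genes that share a strand."""
    runs, current = [], []
    for gene in genes:
        if current and gene[3] != current[-1][3]:
            runs.append(current)
            current = []
        current.append(gene)
    if current:
        runs.append(current)
    return runs

def neighbor_pairs(run):
    """Return (gene1, gene2, summed intergenic distance) for every pair in a run."""
    pairs = []
    for i in range(len(run)):
        gap_sum = 0
        for j in range(i + 1, len(run)):
            # Gap between consecutive genes; overlapping genes count as 0.
            gap_sum += max(0, run[j][1] - run[j - 1][2] - 1)
            pairs.append((run[i][0], run[j][0], gap_sum))
    return pairs

genes = [
    ("g1", 100, 400, "+"),
    ("g2", 420, 900, "+"),
    ("g3", 1000, 1500, "+"),
    ("g4", 1600, 2000, "-"),
]
runs = same_strand_runs(genes)
pairs = neighbor_pairs(runs[0])
```

A smaller summed distance would correspond to a stronger raw neighborhood signal before calibration.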

Page 7:

Co-mentioning in the scientific literature

Associate abstracts with species

Identify gene names in title/abstract

Count (co-)occurrences of genes

Test significance of associations

Calibrate against KEGG maps

Infer associations in other species
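
The counting and significance steps can be sketched as below. The slide does not name the statistical test; a one-sided hypergeometric test is assumed here as one reasonable choice, and the abstracts are invented toy data in which gene names have already been identified.

```python
from itertools import combinations
from math import comb

def comention_counts(abstract_genes):
    """Count abstracts mentioning each gene and each unordered gene pair."""
    single, pair = {}, {}
    for genes in abstract_genes:
        unique = sorted(set(genes))
        for g in unique:
            single[g] = single.get(g, 0) + 1
        for p in combinations(unique, 2):
            pair[p] = pair.get(p, 0) + 1
    return single, pair

def hypergeom_pvalue(n_total, n_a, n_b, n_both):
    """P(X >= n_both) when n_b abstracts are drawn from n_total,
    n_a of which mention gene A (assumed test, not from the slides)."""
    upper = min(n_a, n_b)
    favourable = sum(
        comb(n_a, k) * comb(n_total - n_a, n_b - k)
        for k in range(n_both, upper + 1)
    )
    return favourable / comb(n_total, n_b)

abstracts = [
    ["cdc28", "cln2"], ["cdc28", "cln2"], ["cdc28"],
    ["act1"], ["act1", "tub1"], ["tub1"],
]
single, pair = comention_counts(abstracts)
key = ("cdc28", "cln2")
p_value = hypergeom_pvalue(len(abstracts), single["cdc28"], single["cln2"], pair[key])
```

With real literature the corpus holds many more abstracts, so the same counts would give far smaller p-values than in this toy example.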

Page 8:

Phylogenetic profile: co-mentioning in genomes

Align all proteins against all

Calculate best-hit profile

Join similar species by PCA

Calculate PC profile distances

Calibrate against KEGG maps
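
As a simplified illustration of the distance step, the sketch below compares best-hit profiles directly with a Euclidean distance. It deliberately omits the PCA step that joins similar species, and the profile values (normalized best-hit scores per genome) are invented.

```python
from math import sqrt

def profile_distance(p, q):
    """Euclidean distance between two equal-length best-hit profiles."""
    if len(p) != len(q):
        raise ValueError("profiles must cover the same genomes")
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# One value per genome: normalized best-hit score, 0.0 meaning no hit.
profiles = {
    "protA": [1.0, 0.9, 0.0, 0.8, 0.0],
    "protB": [0.9, 1.0, 0.0, 0.7, 0.1],
    "protC": [0.0, 0.1, 1.0, 0.0, 0.9],
}
d_ab = profile_distance(profiles["protA"], profiles["protB"])
d_ac = profile_distance(profiles["protA"], profiles["protC"])
```

Proteins with similar presence and absence patterns across genomes, such as `protA` and `protB` here, end up close together.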

Page 9:

Multiple evidence types from several species

Page 10:

Score calibration against a common reference

• Many diverse types of evidence

– The quality of each is judged by very different raw scores

– These are all calibrated against the same reference set

• Requirements for a reference

– Must represent a compromise of all the types of evidence

– Broad species coverage

• Both a strength and a weakness

– Scores for all evidence types are directly comparable

– The type of interaction is currently not predicted
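
To make the idea of calibration concrete, here is a minimal sketch under stated assumptions: raw scores are ranked and binned, and each bin is assigned the fraction of its pairs that share a reference map. The binning scheme, the function names, and the toy pairs are hypothetical. The `combine` function uses the 1 - prod(1 - s) rule for independent evidence, which is also an assumption, since the slides do not state how calibrated scores are merged.

```python
def calibrate(scored_pairs, same_map, n_bins=2):
    """scored_pairs: list of (pair, raw_score); same_map: set of reference pairs.
    Returns (lowest raw score in bin, fraction in same map), best bins first."""
    ranked = sorted(scored_pairs, key=lambda item: item[1], reverse=True)
    size = -(-len(ranked) // n_bins)  # ceiling division
    table = []
    for i in range(0, len(ranked), size):
        chunk = ranked[i:i + size]
        hits = sum(1 for pair, _ in chunk if pair in same_map)
        table.append((chunk[-1][1], hits / len(chunk)))
    return table

def calibrated_score(raw, table):
    """Look up the calibrated probability for a raw score."""
    for threshold, prob in table:
        if raw >= threshold:
            return prob
    return table[-1][1]

def combine(scores):
    """Combine calibrated scores assumed independent: 1 - prod(1 - s)."""
    remaining = 1.0
    for s in scores:
        remaining *= 1.0 - s
    return 1.0 - remaining

scored = [(("a", "b"), 9.0), (("a", "c"), 8.0), (("b", "c"), 7.0),
          (("a", "d"), 3.0), (("b", "d"), 2.0), (("c", "d"), 1.0)]
same_map = {("a", "b"), ("a", "c"), ("c", "d")}
table = calibrate(scored, same_map)
```

Because every evidence type is mapped onto the same probability scale, scores from different sources can be compared or passed together to `combine`.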

Page 11:

Getting more specific – generally speaking

Page 12:

Other possible improvements

• Bidirectionally transcribed gene pairs: a new genomic context method that may work on eukaryotes too [Korbel et al., Nature Biotechnology 2004]

• Information extraction from PubMed using shallow parsing [Saric et al., Proceedings of ACL 2004]

• Add more types of experiments, e.g. protein expression levels

• Infer functional relations from feature similarity

• Hook up STRING with a robot

Page 13:

Acknowledgments

• The STRING team

– Christian von Mering

– Berend Snel

– Martijn Huynen

– Daniel Jaeggi

– Steffen Schmidt

– Mathilde Foglierini

– Peer Bork

• ArrayProspector web service

– Julien Lagarde

– Chris Workman

• NetView visualization tool

– Sean Hooper

• Analysis of yeast cell cycle

– Ulrik de Lichtenberg

– Thomas Skøt

– Anders Fausbøll

– Søren Brunak

• Web resources

– string.embl.de

– www.bork.embl.de/ArrayProspector

– www.bork.embl.de/synonyms

Page 14:

Thank you!