STRING
Prediction of protein networks through integration of diverse large-scale data sets
Lars Juhl Jensen
EMBL Heidelberg
STRING integrates many types of evidence
Genomic neighborhood
Species co-occurrence
Gene fusions
Database imports
Exp. interaction data
Microarray expression data
Literature co-mentioning
Integrating physical interaction screens
Make binary representation of complexes
Yeast two-hybrid data sets are inherently binary
Calculate score from number of (co-)occurrences
Calculate score from non-shared partners
Calibrate against KEGG maps
Infer associations in other species
Combine evidence from experiments
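The first and last steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not say how complexes are made binary or how evidence is combined, so the sketch expands each complex into all protein pairs and combines calibrated scores as if they were independent probabilities, 1 - prod(1 - s_i). The function names and the protein labels are hypothetical.

```python
from collections import Counter
from itertools import combinations


def complexes_to_pairs(complexes):
    """Expand each complex into all unordered protein pairs and count
    how many complexes each pair occurs in together."""
    counts = Counter()
    for members in complexes:
        # set() removes repeated names, sorted() gives a stable pair order
        for a, b in combinations(sorted(set(members)), 2):
            counts[(a, b)] += 1
    return counts


def combine_independent(scores):
    """Combine confidence scores in [0, 1], ASSUMING the pieces of
    evidence are independent: 1 - prod(1 - s_i)."""
    p_none = 1.0
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError("scores must lie in [0, 1]")
        p_none *= 1.0 - s
    return 1.0 - p_none


# Hypothetical purifications: proteins A and B are seen together twice.
pair_counts = complexes_to_pairs([["A", "B", "C"], ["B", "A"]])
combined = combine_independent([0.5, 0.5])
```

The co-occurrence counts are raw scores only; they would still need calibration before being combined with other evidence.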
Gene fusion: predicting physical interactions
Detect multiple proteins matching to one protein
Exclude overlapping alignments
Infer associations in other species
Calibrate against KEGG maps
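A minimal sketch of the first two steps, assuming each alignment is already available as a start/end interval on the candidate fusion protein. The Hit record, the max_overlap parameter and the protein names are hypothetical; the slides do not state how much overlap, if any, is tolerated.

```python
from itertools import combinations
from typing import NamedTuple


class Hit(NamedTuple):
    query: str  # protein that aligns to the candidate fusion protein
    start: int  # first aligned residue on the fusion protein (1-based)
    end: int    # last aligned residue on the fusion protein (inclusive)


def overlap(a, b):
    """Number of residues of the fusion protein covered by both hits."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def fusion_pairs(hits, max_overlap=0):
    """Return pairs of distinct proteins whose alignments to the same
    fusion protein overlap by at most max_overlap residues."""
    pairs = set()
    for a, b in combinations(hits, 2):
        if a.query != b.query and overlap(a, b) <= max_overlap:
            pairs.add(tuple(sorted((a.query, b.query))))
    return sorted(pairs)


# P3 overlaps both P1 and P2, so only P1/P2 is reported.
hits = [Hit("P1", 1, 100), Hit("P2", 120, 250), Hit("P3", 90, 130)]
candidates = fusion_pairs(hits)
```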
Mining microarray expression databases
Re-normalize arrays by modern method to remove biases
Build expression matrix
Combine similar arrays by PCA
Construct predictor by Gaussian kernel density estimation
Calibrate against KEGG maps
Infer associations in other species
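One way the kernel density step could look is sketched below. It assumes that each gene pair has already been reduced to a single number (for example an expression correlation) and that a reference set labels pairs as sharing or not sharing a pathway. The bandwidth, the prior and the use of Bayes' rule are assumptions made for the illustration, not details given on the slides.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density of `samples` with a
    Gaussian kernel of fixed bandwidth."""
    if not samples or bandwidth <= 0:
        raise ValueError("need at least one sample and a positive bandwidth")
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


def make_predictor(positives, negatives, prior, bandwidth=0.1):
    """Build P(related | x) from the two estimated densities via Bayes' rule.
    `prior` is the assumed fraction of related pairs."""
    f_pos = gaussian_kde(positives, bandwidth)
    f_neg = gaussian_kde(negatives, bandwidth)

    def predict(x):
        num = prior * f_pos(x)
        den = num + (1.0 - prior) * f_neg(x)
        # Far from all samples both densities underflow to 0: fall back to prior
        return num / den if den > 0 else prior

    return predict


# Hypothetical correlations for related and unrelated reference pairs.
predict = make_predictor([0.8, 0.9, 0.7], [0.0, 0.1, -0.1, 0.2], prior=0.5)
```

Calling predict(0.85) then gives a value close to 1 and predict(0.0) a value close to 0 for this toy data.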
Gene neighborhood: predicting co-expression
Identify runs of adjacent genes with the same direction
Score each gene pair based on intergenic distances
Calibrate against KEGG maps
Infer associations in other species
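The first two steps can be sketched as follows, assuming a single linear chromosome and genes given as start/end/strand records. The Gene record and the choice of raw score (the summed intergenic gaps between two genes of the same run) are assumptions made for the illustration; the slides do not give the scoring formula.

```python
from itertools import combinations, groupby
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    start: int
    end: int
    strand: str  # "+" or "-"


def same_strand_runs(genes):
    """Split position-sorted genes into runs of neighbours on one strand."""
    ordered = sorted(genes, key=lambda g: g.start)
    return [list(run) for _, run in groupby(ordered, key=lambda g: g.strand)]


def pair_distances(run):
    """For every gene pair in a run, sum the intergenic gaps between them.
    Smaller values mean the genes are packed more tightly."""
    gaps = [max(0, nxt.start - cur.end - 1) for cur, nxt in zip(run, run[1:])]
    return {
        (run[i].name, run[j].name): sum(gaps[i:j])
        for i, j in combinations(range(len(run)), 2)
    }


genes = [
    Gene("a", 1, 100, "+"),
    Gene("b", 151, 300, "+"),
    Gene("c", 321, 400, "+"),
    Gene("d", 500, 600, "-"),
]
runs = same_strand_runs(genes)
distances = pair_distances(runs[0])
```

The resulting distances are raw scores; turning them into confidence values is the job of the calibration step.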
Co-mentioning in the scientific literature
Associate abstracts with species
Identify gene names in title/abstract
Count (co-)occurrences of genes
Test significance of associations
Calibrate against KEGG maps
Infer associations in other species
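The counting and testing steps might look like the sketch below. It assumes gene names have already been identified, so that each abstract is just a set of gene names. The slides do not say which significance test is used; a one-sided hypergeometric tail is used here purely as an example, and the gene names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import comb


def count_mentions(abstracts):
    """Count in how many abstracts each gene, and each gene pair, appears."""
    single, pair = Counter(), Counter()
    for genes in abstracts:
        names = sorted(set(genes))
        single.update(names)
        pair.update(combinations(names, 2))
    return single, pair


def hypergeom_tail(n_total, n_a, n_b, n_ab):
    """P(at least n_ab shared abstracts) if genes A and B, mentioned in
    n_a and n_b of n_total abstracts, were mentioned independently."""
    upper = min(n_a, n_b)
    ways = sum(
        comb(n_a, k) * comb(n_total - n_a, n_b - k)
        for k in range(n_ab, upper + 1)
    )
    return ways / comb(n_total, n_b)


abstracts = [{"X", "Y"}, {"X", "Y"}, {"X"}, {"Z"}]
single, pair = count_mentions(abstracts)
p_value = hypergeom_tail(len(abstracts), single["X"], single["Y"], pair[("X", "Y")])
```

With only four abstracts the p-value is large; the test becomes informative once counts come from a large corpus.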
Phylogenetic profile: co-mentioning in genomes
Align all proteins against all
Calculate best-hit profile
Join similar species by PCA
Calculate PC profile distances
Calibrate against KEGG maps
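The distance step can be illustrated as below. The sketch assumes each protein already has a numeric profile with one entry per species (or per principal component) and uses plain Euclidean distance; the PCA step itself is left out, and the choice of distance measure is an assumption since the slides do not name one.

```python
import math
from itertools import combinations


def profile_distance(p, q):
    """Euclidean distance between two profiles of equal length."""
    if len(p) != len(q):
        raise ValueError("profiles must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def ranked_pairs(profiles):
    """Return (protein, protein, distance) tuples, most similar first."""
    pairs = [
        (a, b, profile_distance(profiles[a], profiles[b]))
        for a, b in combinations(sorted(profiles), 2)
    ]
    return sorted(pairs, key=lambda t: t[2])


# Hypothetical presence/absence profiles over three species.
profiles = {"a": [1, 0, 1], "b": [1, 0, 1], "c": [0, 1, 0]}
ranking = ranked_pairs(profiles)
```

Proteins a and b occur in exactly the same species here, so they come out as the closest pair.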
Multiple evidence types from several species
Score calibration against a common reference
• Many diverse types of evidence
– The quality of each is judged by very different raw scores
– These are all calibrated against the same reference set
• Requirements for a reference
– Must represent a compromise of all the types of evidence
– Broad species coverage
• Both a strength and a weakness
– Scores for all evidence types are directly comparable
– The type of interaction is currently not predicted
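The idea of calibrating raw scores against one reference set can be sketched as a simple binning procedure. This is an assumption-laden illustration, not the procedure used in STRING: it takes pairs already labelled as sharing or not sharing a reference pathway (for example a KEGG map), sorts them by raw score, and reports the fraction of true pairs per bin. Function and parameter names are hypothetical.

```python
def calibrate(pairs, n_bins):
    """pairs: (raw_score, same_pathway) tuples. Returns a list of
    (lowest_raw, highest_raw, fraction_true) per equal-count bin.
    If len(pairs) is not divisible by n_bins, one extra smaller bin results."""
    if not pairs or n_bins < 1:
        raise ValueError("need at least one pair and one bin")
    ordered = sorted(pairs, key=lambda p: p[0])
    size = max(1, len(ordered) // n_bins)
    table = []
    for i in range(0, len(ordered), size):
        chunk = ordered[i:i + size]
        hits = sum(1 for _, same in chunk if same)
        table.append((chunk[0][0], chunk[-1][0], hits / len(chunk)))
    return table


def lookup(table, raw):
    """Map a raw score to the fraction of the last bin whose lower bound
    it reaches; scores below every bin get the first bin's value."""
    result = table[0][2]
    for low, _high, fraction in table:
        if raw >= low:
            result = fraction
    return result


# Hypothetical labelled reference pairs.
reference = [
    (0.1, False), (0.2, False), (0.3, False), (0.4, True),
    (0.6, True), (0.7, False), (0.8, True), (0.9, True),
]
table = calibrate(reference, n_bins=2)
```

Because every evidence type would be passed through the same kind of table built from the same reference, the resulting values can be compared directly across evidence types.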
Getting more specific – generally speaking
Other possible improvements
• Bidirectionally transcribed gene pairs: a new genomic context method that may work on eukaryotes too [Korbel et al., Nature Biotechnology 2004]
• Information extraction from PubMed using shallow parsing [Saric et al., Proceedings of ACL 2004]
• Add more experiment types, e.g. protein expression levels
• Infer functional relations from feature similarity
• Hook up STRING with a robot
Acknowledgments
• The STRING team
– Christian von Mering
– Berend Snel
– Martijn Huynen
– Daniel Jaeggi
– Steffen Schmidt
– Mathilde Foglierini
– Peer Bork
• ArrayProspector web service
– Julien Lagarde
– Chris Workman
• NetView visualization tool
– Sean Hooper
• Analysis of yeast cell cycle
– Ulrik de Lichtenberg
– Thomas Skøt
– Anders Fausbøll
– Søren Brunak
• Web resources
– string.embl.de
– www.bork.embl.de/ArrayProspector
– www.bork.embl.de/synonyms
Thank you!