STRING: Prediction of protein networks through integration of diverse large-scale data sets
Lars Juhl Jensen, EMBL Heidelberg
The problem ...
Prediction of protein function
• Homology-based methods
  – Simple sequence similarity searches (BLAST)
  – Profile searches (PSI-BLAST)
  – Databases of conserved domains (Pfam, SMART)
• Non-homology-based methods working on sequence
  – Prediction from sequence-derived features (ProtFun)
  – Prediction from genomic context (STRING)
• Prediction from high-throughput experimental data
  – Microarray gene expression data
  – Protein-protein interaction screens
  – ...
Prediction of functional associations
“Protein mode”
Separate network for each species
“COG mode”
One network covering all species
STRING provides a protein network based on integration of diverse types of evidence
Genomic Neighborhood
Species Co-occurrence
Gene Fusions
Database Imports
Exp. Interaction Data
Co-expression
Literature Co-mentioning
Score calibration against a common reference
• Many diverse types of evidence
  – The quality of each is judged by very different raw scores
  – These are all calibrated against the same reference set
  – This is the key to obtaining a consistent scoring scheme
• Requirements for a reference
  – Must represent a compromise across all types of evidence
  – Broad species coverage
• Our reference is KEGG maps
  – Two proteins are “related” if they are on a common KEGG map
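The calibration idea can be illustrated with a minimal sketch. It assumes a simple equal-size binning of raw scores; the slides do not say how the calibration curve is actually fitted, so the function name `calibrate`, the binning scheme, and the toy data are all hypothetical.

```python
from bisect import bisect_right

def calibrate(raw_scores, same_map, n_bins=4):
    """Return a lookup that maps a raw score to the fraction of
    reference pairs in its bin that share a KEGG map.

    Assumption: equal-size bins over the sorted raw scores; the
    original fitting procedure is not given on the slides.
    """
    pairs = sorted(zip(raw_scores, same_map))
    size = max(1, len(pairs) // n_bins)
    edges, probs = [], []
    for start in range(0, len(pairs), size):
        chunk = pairs[start:start + size]
        edges.append(chunk[0][0])  # lowest raw score in the bin
        probs.append(sum(hit for _, hit in chunk) / len(chunk))

    def lookup(score):
        # Pick the last bin whose lower edge is <= score.
        i = max(0, bisect_right(edges, score) - 1)
        return probs[i]

    return lookup

# Toy data: raw scores of some evidence type, and whether each
# protein pair lies on a common KEGG map (1) or not (0).
raw = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
on_common_map = [0, 0, 0, 1, 0, 1, 1, 1]
to_probability = calibrate(raw, on_common_map)
```

Because every evidence type is passed through the same kind of lookup against the same reference, the resulting values become comparable across evidence types.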
Integrating physical interaction screens
Make binary representation of complexes
Yeast two-hybrid data sets are inherently binary
Calculate score from number of (co-)occurrences
Calculate score from non-shared partners
Calibrate against KEGG maps
Infer associations in other species
Combine evidence from experiments
Gene fusion: predicting physical interactions
Detect multiple proteins matching to one protein
Exclude overlapping alignments
Infer associations in other species
Calibrate against KEGG maps
Mining microarray expression databases
Re-normalize arrays by modern method to remove biases
Build expression matrix
Combine similar arrays by PCA
Construct predictor by Gaussian kernel density estimation
Calibrate against KEGG maps
Infer associations in other species
Gene neighborhood: predicting co-expression
Identify runs of adjacent genes with the same direction
Score each gene pair based on intergenic distances
Calibrate against KEGG maps
Infer associations in other species
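The first two steps might look roughly like the sketch below. The gene tuple layout, the function names, and the use of summed intergenic gaps as the raw pair score are assumptions made for illustration; the slides do not specify the scoring function.

```python
from itertools import combinations

def find_runs(genes):
    """Split genes (sorted by start) into runs of adjacent genes
    on the same strand. Each gene is (name, start, end, strand)."""
    runs, current = [], []
    for gene in genes:
        if current and gene[3] != current[-1][3]:
            runs.append(current)
            current = []
        current.append(gene)
    if current:
        runs.append(current)
    return runs

def pair_distances(run):
    """Hypothetical raw score per gene pair in a run: the sum of the
    intergenic gaps separating the two genes (smaller = closer)."""
    out = {}
    for (i, a), (j, b) in combinations(enumerate(run), 2):
        gaps = [max(0, run[k + 1][1] - run[k][2]) for k in range(i, j)]
        out[(a[0], b[0])] = sum(gaps)
    return out

# Toy genome segment (coordinates are made up).
genes = [("a", 0, 100, "+"), ("b", 150, 300, "+"),
         ("c", 320, 500, "+"), ("d", 600, 700, "-")]
runs = find_runs(genes)
scores = pair_distances(runs[0])
```

Raw distances of this kind would then be calibrated against KEGG maps like any other evidence type.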
Co-mentioning in the scientific literature
Associate abstracts with species
Identify gene names in title/abstract
Count (co-)occurrences of genes
Test significance of associations
Calibrate against KEGG maps
Infer associations in other species
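The counting step can be sketched as follows. The slides do not name the significance test, so the observed/expected ratio below is only a stand-in chosen for illustration, and the function names and toy abstracts are hypothetical.

```python
from collections import Counter
from itertools import combinations

def count_cooccurrences(abstract_genes):
    """Count in how many abstracts each gene, and each gene pair,
    is mentioned. Input: one list of gene names per abstract."""
    single, pair = Counter(), Counter()
    for genes in abstract_genes:
        uniq = sorted(set(genes))  # count each gene once per abstract
        single.update(uniq)
        pair.update(combinations(uniq, 2))
    return single, pair

def observed_over_expected(a, b, single, pair, n_abstracts):
    """Stand-in association measure (an assumption, not the test used
    in STRING): observed co-mentions divided by the number expected
    if the two genes were mentioned independently."""
    a, b = sorted((a, b))
    expected = single[a] * single[b] / n_abstracts
    return pair[(a, b)] / expected if expected else 0.0

# Toy input: gene names already identified in four abstracts.
abstracts = [["GAL4", "GAL80"], ["GAL4", "GAL80", "CYC1"],
             ["CYC1"], ["GAL4"]]
single, pair = count_cooccurrences(abstracts)
ratio = observed_over_expected("GAL80", "GAL4", single, pair, len(abstracts))
```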
Phylogenetic profile: co-mentioning in genomes
Align all proteins against all
Calculate best-hit profile
Join similar species by PCA
Calculate PC profile distances
Calibrate against KEGG maps
COG based vs. similarity based transfer
• Resolution of the mapping
  – COGs result in many-to-many
  – Sequence similarity should resolve with better detail
• Our scoring scheme
  – Pairwise alignment scores are normalized by self-hit
  – These scores are transformed using exp(-k1/x), where k1=0.7
  – Missing values are estimated
  – Divide each score by the column and row sum
• This gives a quantitative score for protein correspondence
[Figure: score matrices with axes “Source species” and “Target species”; one panel marked “?”]
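One possible reading of the scoring scheme above is sketched here. Several details are assumptions rather than facts from the slides: the self-hit used for normalization is taken to be that of the source protein, "divide by the column and row sum" is read as dividing by row sum plus column sum with the entry itself counted once, and the estimation of missing values is left out entirely.

```python
import math

K1 = 0.7  # value given on the slide

def correspondence(scores, self_hits):
    """scores[i][j]: alignment score of source protein i against
    target protein j; self_hits[i]: self-hit score of source protein i.

    Returns a matrix of correspondence scores in [0, 1].
    """
    # Normalize by self-hit, then transform with exp(-k1/x).
    t = [[math.exp(-K1 / (s / self_hits[i])) if s > 0 else 0.0
          for s in row]
         for i, row in enumerate(scores)]
    row_sum = [sum(r) for r in t]
    col_sum = [sum(c) for c in zip(*t)]
    # Assumed normalization: row sum + column sum, entry counted once,
    # so an unambiguous one-to-one match scores 1.0.
    return [[v / (row_sum[i] + col_sum[j] - v) if v else 0.0
             for j, v in enumerate(row)]
            for i, row in enumerate(t)]

# One-to-one case: each source protein hits exactly one target.
one_to_one = correspondence([[100.0, 0.0], [0.0, 50.0]], [100.0, 50.0])
# One-to-many case: one source protein hits two targets equally well.
one_to_many = correspondence([[100.0, 100.0]], [100.0])
```

Under this reading, an ambiguous mapping splits the correspondence between the candidates, which is the kind of finer resolution that COG membership alone cannot provide.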
Transfer and combination of evidence
• Evidence scores are multiplied by “correspondence scores”
• From each set of closely related species (a clade) only the best scoring evidence of each type is transferred
• The best evidence from each clade is “added” and scaled: score_transfer = k3 * (1 - (1 - k2*clade1) * (1 - k2*clade2) * ...)
• In-species and transferred evidence is “added” and a total combined score calculated
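The transfer formula can be written out directly. The values of k2 and k3 are not given on the slides, so the ones used in the example are placeholders; the `combine` helper assumes that “added” means the same probabilistic sum for the final combined score, which the slides suggest but do not state.

```python
def transfer_score(clade_scores, k2, k3):
    """score_transfer = k3 * (1 - prod(1 - k2 * clade_i)).

    clade_scores: best evidence score of one type from each clade,
    already multiplied by the correspondence scores.
    """
    remaining = 1.0
    for s in clade_scores:
        remaining *= 1.0 - k2 * s
    return k3 * (1.0 - remaining)

def combine(in_species, transferred):
    """Assumed reading of "added": 1 - (1 - a) * (1 - b)."""
    return 1.0 - (1.0 - in_species) * (1.0 - transferred)

# Placeholder constants, chosen only to make the arithmetic visible.
two_clades = transfer_score([0.5, 0.5], k2=1.0, k3=1.0)  # 0.75
one_clade = transfer_score([0.8], k2=0.5, k3=0.9)
total = combine(0.5, two_clades)
```

With this form, weak evidence from many clades accumulates, but the score can never exceed k3.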
Combining multiple types of evidence from several species
The next step in data integration: predicting the type of interaction
Information extraction from PubMed: extracting specific types of associations
• Tokenization and multi word detection
• Part-of-speech tagging
• Semantic labeling
  – Gene names
  – Cue words for entity recognition
  – Verbs for relation extraction
• Named entity chunking
  – A CASS grammar recognizes noun chunks relevant for gene transcription
  – [nxgene The GAL4 gene]
• Relation chunking
  – Our CASS grammar also recognizes relations between entities:
    [nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]
• Output and visualization
  – TIGERSearch for inspection
  – Script for extracting a binary representation of the relations
  – Should later go into STRING
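Extracting a binary representation means that one relation chunk can yield several pairwise relations. A minimal sketch, assuming the regulators and targets have already been pulled out of the chunk; the function name and the "regulates" label are hypothetical, and the actual script is not shown in the slides.

```python
from itertools import product

def binary_relations(regulators, targets, relation="regulates"):
    """Expand one relation chunk into (regulator, relation, target)
    triples, one per regulator/target combination."""
    return [(r, relation, t) for r, t in product(regulators, targets)]

# The chunk "The expression of the cytochrome genes CYC1 and CYC7
# is controlled by HAP1" has one regulator and two targets.
triples = binary_relations(["HAP1"], ["CYC1", "CYC7"])
# Duplicate triples from different sentences collapse in a set.
unique = sorted(set(triples + triples))
```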
We extract from active, passive, and nominalized sentence constructs
[nx_prom the ATR1 promoter region] [contain contains] [nx_uas_pt [dt-a a] [bs binding site] [for for] [nx_activator the GCN4 activator protein]]
[nx_expr RNR1 expression] [bez is] [repv reduced] [by by] [nx_oprd CLN1 or CLN2 overexpression]
[dt-the the] [binding binding] [of of] [nx_prot GCN4 protein] [to to] [nx_prom the SER1 promoter in vitro]
A high confidence regulatory network
• We manage to extract a satisfactory number of relations
  – 422 relation chunks
  – 597 binary relations
  – 441 unique binary relations
• Activation/repression assigned for ~50% of relations
• High accuracy: 83-90% on event extraction
• “Arrows” generally point from known transcription factors to other genes
More STRING to come
• Adding more large scale data sets and more species
• New types of genomic context evidence
  – White seminar by Jan Korbel in May
• Assign specific interaction types to functional associations
  – Expand text mining to cover more interaction types
  – Predict interaction types from evidence types
• Interpreting the network
  – Discover functional modules/pathways
  – Network topology and network motifs
  – White seminar by Christian von Mering in June
Acknowledgments
• The STRING team
  – Christian von Mering
  – Berend Snel
  – Martijn Huynen
  – Daniel Jaeggi
  – Steffen Schmidt
  – Mathilde Foglierini
  – Peer Bork
• ArrayProspector web service
  – Julien Lagarde
  – Chris Workman
• NetView visualization tool
  – Sean Hooper
• Text mining together with EML
  – Jasmin Saric
  – Rossitza Ouzounova
  – Isabel Rojas
• All my other “partners in crime” on various projects
  – The Steinmetz Group
  – The Furlong Group
• Web resources
  – string.embl.de
  – www.bork.embl.de/ArrayProspector
  – www.bork.embl.de/synonyms
Thank you!