Post on 10-May-2015
Data integration
Integration of functional associations using STRING
Lars Juhl Jensen
Jensen, Kuhn et al., Nucleic Acids Research, 2009
functional associations
confidence scores
cross-species integration
630 genomes
model organism databases
Ensembl
RefSeq
defining orthology
two modes
protein mode
von Mering et al., Nucleic Acids Research, 2005
COG mode
von Mering et al., Nucleic Acids Research, 2005
genomic context
gene fusion
Korbel et al., Nature Biotechnology, 2004
conserved neighborhood
operons
Korbel et al., Nature Biotechnology, 2004
bidirectional promoters
Korbel et al., Nature Biotechnology, 2004
phylogenetic profiles
Korbel et al., Nature Biotechnology, 2004
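The phylogenetic-profile idea can be illustrated with a small sketch: genes that are present and absent in the same set of genomes get a high similarity. The Jaccard index, the function names, and the toy genes and species below are assumptions chosen for illustration, not the measure used in the cited work.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard index of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def profile_similarities(presence: dict) -> dict:
    """Score every gene pair by the overlap of the genomes it occurs in.

    presence maps a gene name to the set of genomes containing an ortholog.
    """
    return {
        (g1, g2): jaccard(presence[g1], presence[g2])
        for g1, g2 in combinations(sorted(presence), 2)
    }


# Hypothetical toy data: geneA and geneB share the same profile.
profiles = {
    "geneA": {"sp1", "sp2", "sp4"},
    "geneB": {"sp1", "sp2", "sp4"},
    "geneC": {"sp3"},
}
scores = profile_similarities(profiles)
```

Here `scores[("geneA", "geneB")]` is the highest-scoring pair, since the two genes occur in exactly the same genomes.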
examples
bacterial Cox assembly
Banci et al., PNAS, 2005
cellulose degradation
Cell
Cellulosomes
Cellulose
experimental data
protein interactions
yeast two-hybrid
affinity purification
fragment complementation
Jensen & Bork, Science, 2008
genetic interactions
Beyer et al., Nature Reviews Genetics, 2007
BIND (Biomolecular Interaction Network Database)
BioGRID (General Repository for Interaction Datasets)
DIP (Database of Interacting Proteins)
IntAct
MINT (Molecular Interactions Database)
HPRD (Human Protein Reference Database)
PDB (Protein Data Bank)
inferred associations
gene coexpression
GEO (Gene Expression Omnibus)
expression compendia
curated knowledge
complexes
MIPS (Munich Information Center for Protein Sequences)
Gene Ontology
pathways
Letunic & Bork, Trends in Biochemical Sciences, 2008
KEGG (Kyoto Encyclopedia of Genes and Genomes)
MetaCyc
Reactome
PID (NCI-Nature Pathway Interaction Database)
literature mining
>10 km
MEDLINE
SGD (Saccharomyces Genome Database)
The Interactive Fly
OMIM (Online Mendelian Inheritance in Man)
co-mentioning
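As a minimal sketch of co-mentioning, the snippet below counts how often two known gene names appear in the same abstract. The tokenization, the function name, and the toy abstracts are assumptions for illustration; a real pipeline would need synonym dictionaries and proper name recognition.

```python
from collections import Counter
from itertools import combinations


def comention_counts(abstracts, names):
    """Count, for each pair of names, the abstracts mentioning both."""
    counts = Counter()
    for text in abstracts:
        # Crude tokenization (assumption): strip commas and periods, split on spaces.
        tokens = set(text.replace(",", " ").replace(".", " ").split())
        found = sorted(n for n in names if n in tokens)
        counts.update(combinations(found, 2))
    return counts


# Made-up example sentences using gene names that appear on the slides.
toy_abstracts = [
    "HAP1 controls expression of CYC1 and CYC7.",
    "GAL4 activates transcription.",
    "CYC1 is regulated by HAP1.",
]
pair_counts = comention_counts(toy_abstracts, {"GAL4", "CYC1", "CYC7", "HAP1"})
```

Pairs that are co-mentioned more often than expected by chance are the candidates for a functional association; raw counts like these would still need to be normalized by how often each name occurs on its own.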
NLP (Natural Language Processing)
Gene and protein names
Cue words for entity recognition
Verbs for relation extraction
[nxgene The GAL4 gene]
[nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]
easy in theory …
… but not in practice
many data types
not comparable
variable quality
many sources
different file formats
different gene identifiers
partially redundant
spread over 630 genomes
quality scores
reproducibility
von Mering et al., Nucleic Acids Research, 2005
intergenic distances
benchmarking
calibrate vs. gold standard
von Mering et al., Nucleic Acids Research, 2005
raw quality scores
probabilistic scores
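One way to picture the step from raw quality scores to probabilistic scores is to bin the raw scores and measure, per bin, the fraction of pairs that the gold standard confirms. The sketch below assumes every scored pair is covered by the gold standard and uses fixed bin edges; the function name and toy data are hypothetical, and the actual calibration may fit a smooth curve instead.

```python
from bisect import bisect_right


def calibrate(scored_pairs, gold_positive, bin_edges):
    """Return, per raw-score bin, the fraction of pairs found in the gold standard.

    scored_pairs  : iterable of (pair, raw_score)
    gold_positive : set of pairs the gold standard marks as true associations
    bin_edges     : ascending edges; the last bin includes its upper edge
    """
    n_bins = len(bin_edges) - 1
    hits = [0] * n_bins
    totals = [0] * n_bins
    for pair, raw in scored_pairs:
        if raw < bin_edges[0] or raw > bin_edges[-1]:
            continue  # outside the calibrated range
        idx = min(bisect_right(bin_edges, raw) - 1, n_bins - 1)
        totals[idx] += 1
        hits[idx] += pair in gold_positive
    return [h / t if t else None for h, t in zip(hits, totals)]


# Hypothetical toy data: higher raw scores are more often confirmed.
toy_scores = [(("a", "b"), 1.0), (("a", "c"), 2.0), (("b", "c"), 7.0), (("c", "d"), 9.0)]
toy_gold = {("a", "c"), ("b", "c"), ("c", "d")}
calibrated = calibrate(toy_scores, toy_gold, [0, 5, 10])
```

Because each evidence type is calibrated against the same gold standard, the resulting probabilities become comparable across data types even though the raw scores are not.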
integrate over orthologs
protein mode
von Mering et al., Nucleic Acids Research, 2005
COG mode
von Mering et al., Nucleic Acids Research, 2005
combine all evidence
Frishman et al., Modern Genome Annotation, 2009
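Once every evidence type has a probabilistic score, a simple way to combine them is to treat the channels as independent and compute the probability that at least one of them is right. This is a simplified sketch under that independence assumption; the function name is hypothetical, and the published scheme may apply further corrections not shown here.

```python
from math import prod


def combine_scores(scores):
    """Combine independent probabilistic scores: 1 - product of (1 - s_i)."""
    return 1.0 - prod(1.0 - s for s in scores)


# Two channels that each give 0.5 support a combined score of 0.75.
combined = combine_scores([0.5, 0.5])
```

The combined score is never lower than the strongest single channel, so several weak but independent lines of evidence can add up to a confident association.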
small molecules
Kuhn et al., Nucleic Acids Research, 2008
metametabolomics
Acknowledgments
Christian von Mering
Michael Kuhn
Manuel Stark
Samuel Chaffron
Philippe Julien
Monica Campillos
Tobias Doerks
Jan Korbel
Berend Snel
Martijn Huynen
Peer Bork
larsjuhljensen