TMRA 06
From Biological Data to Biological Knowledge
Volker Stümpflen
Group for Biological Information Systems
MIPS / Institute for Bioinformatics
GSF – National Research Center for Environment and Health

Something About Our Problem
• We can’t understand anything without understanding the context
• For a long time we focused on individual genes / proteins …
• … but e.g. humans don’t have many more genes than “simple” organisms …
• … because complexity occurs at the level of biological networks

Small Scale “Knowledge Generation”
• Accessing some of the several hundred (web) resources (publicly available data > 2 Petabyte)
=> Compilation of required knowledge by hand

Large Scale Assessment of Information and Knowledge
R. Shamir et al., Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data, PNAS, Vol. 101, No. 9, 2004, p. 2981-2986
• “To gain deeper understanding of the [biological] systems, it is pertinent to analyze heterogeneous data sources in a truly integrated fashion and shape the analysis results into one body of knowledge.”
• “By integrating experimental data of heterogeneous sources and types, we are able to perform analysis on a much broader scope than previous studies.”

Technical Problems
• Information integration from heterogeneous and distributed data sources (databases AND applications)
• Solvable with n-tier architectures
• E.g. GenRE at MIPS
  • J2EE based middleware
  • Enterprise Java Beans (EJBs) and Web Services (WS)

Semantic Problems
• Sloppy definitions: e.g. Gene has Function
• Homonym / synonym problems: e.g. gene identifiers
• Ambiguity of terms
  • Differences in meaning of terms between different biological communities
  • In-vitro results often differ within the experimental scope (e.g. Protein Interactions)
• …

Strategies
• Complete semantic annotation of (all) resources
  • Funding?
  • Data models?
• Modeling of individual domains
  • Suited for biologists (Topic Maps)
  • Access of relevant data sources
  • Merging of individual domains to obtain the “complete picture”
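
The merging step named in the strategy above can be illustrated with a minimal sketch. It assumes the standard Topic Maps rule that topics sharing a subject identifier denote the same subject and are merged. The `Topic` class, the `merge_maps` function and the dictionary layout are hypothetical names chosen for illustration, not part of the framework described in the talk.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Topic:
    """A heavily simplified topic: one subject identifier, names, occurrences."""
    subject_identifier: str
    names: Set[str] = field(default_factory=set)
    occurrences: Set[str] = field(default_factory=set)


def merge_maps(*maps: Dict[str, Topic]) -> Dict[str, Topic]:
    """Merge domain maps keyed by subject identifier.

    Topics with the same subject identifier collapse into one topic that
    carries the union of all names and occurrences. The input maps are
    left unchanged.
    """
    merged: Dict[str, Topic] = {}
    for topic_map in maps:
        for sid, topic in topic_map.items():
            target = merged.setdefault(sid, Topic(sid))
            target.names |= topic.names
            target.occurrences |= topic.occurrences
    return merged
```

Two domain maps that both describe the same protein under one subject identifier would yield a single merged topic, which is what lets separately modeled domains add up to one picture.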

Static Generation of Topic Maps
+ Highly flexible data model
+ Straightforward process
+ Intuitive user interface
+ Finding the right information easy
- Topic maps tend to be very large
- Redundant information in DBs and Topic Map files
- Update problems
=> Dynamic generation of topic maps
[Figure labels: Extract (three times), XTM File, TM4J, Omnigator]

Dynamic Topic Map Generation
• Dynamic information retrieval via EJBs / Web Services
  • Each topic type is mapped to an EJB / Web Service
  • Each association is also represented by an EJB / Web Service
• Straightforward extension of the data model
  • User adjustments are possible afterwards
• Intuitive navigation of related information
[Figure: the topic types Protein and EC Number are linked by a Protein – ECNum Association (labels "has" / "is associated to"). Each element has its own service: Protein Web Services, EC Number Web Services, Protein – ECNum Association Web Services]
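
The mapping of topic types and associations to services can be sketched as follows. This is a rough illustration in Python, not the EJB / Web Service code of the framework: the service functions, the registries and the placeholder data (`"P1"`, `"1.1.1.1"`) are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Topic:
    topic_type: str
    identifier: str
    names: List[str] = field(default_factory=list)


# Hypothetical stand-ins for the per-type EJBs / Web Services.
def protein_service(identifier: str) -> Topic:
    return Topic("Protein", identifier, [f"protein {identifier}"])


def ec_number_service(identifier: str) -> Topic:
    return Topic("EC Number", identifier, [f"EC {identifier}"])


def protein_ecnum_association_service(protein_id: str) -> List[str]:
    # A real service would query a data source; this table is placeholder data.
    placeholder_links = {"P1": ["1.1.1.1"]}
    return placeholder_links.get(protein_id, [])


# One service per topic type.
TOPIC_SERVICES: Dict[str, Callable[[str], Topic]] = {
    "Protein": protein_service,
    "EC Number": ec_number_service,
}

# One service per association: (source type, label, target type) -> service.
ASSOCIATION_SERVICES: Dict[Tuple[str, str, str], Callable[[str], List[str]]] = {
    ("Protein", "has", "EC Number"): protein_ecnum_association_service,
}


def generate_fragment(topic_type: str, identifier: str):
    """Build the topic map fragment around one topic on request."""
    center = TOPIC_SERVICES[topic_type](identifier)
    associations = []
    for (source, label, target), service in ASSOCIATION_SERVICES.items():
        if source != topic_type:
            continue
        for target_id in service(identifier):
            associations.append((label, TOPIC_SERVICES[target](target_id)))
    return center, associations
```

Extending the data model in this sketch means registering one more entry in a registry, which mirrors the "straightforward extension" point: no stored topic map file has to be regenerated.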

Interface Definition
• Information retrieval via EJBs (Web Service)
  • Each topic type is mapped to an EJB / WS
  • Each association type is also represented by an EJB / WS
• Straightforward extension of the data model
  • User adjustments are possible afterwards

DTMG Architecture (Extension of GenRE)
[Figure: DTMG architecture]
• Tiers: Enterprise Information System Tier, Integration Tier, Syntax Tier, Semantic Tier, Web Presentation Tier
• Data sources shown: Arabidopsis thaliana, SIMAP, PEDANT DBs, FunCat, …
• Component labels: SIMAP Access EJB, GISE EJB (several instances), ProteinExtractor, ProteinPfamExtractor, ResourceManager, ProteinTopicType EJB, ProteinPfamAssType EJB, TopicMapManager, XSL Transformation, Other Types of Clients, …
K. Nenova

“Worst Case” Example
• Combination of two large resources at MIPS
  • Annotated Proteins: calculated properties of genes / proteins from various organisms
  • Orthologs: calculated similarities of proteins (all against all)
K. Nenova / R. Gregory

Large Scale Annotation with PEDANT (Protein Extraction, Description and Analysis Tool)
• Currently covers > 400 genomes
• ~ 1000 by the end of this year

SIMAP: Precalculated Sequence Homologies
[Figure: SIMAP infrastructure. LAN: SIMAP database, NFS server, grid master, grid execution hosts, web server, database and file server. Internet: BOINC daemons, SIMAP database, SIMAP client on BOINC core (Linux, Windows, Mac). External users: MIPS + WWW users]
• 450 proteomes
• 4 sequence collections
• 7.5 million protein entries
• 3.5 million sequences
• 8 billion FASTA hits
BOINC:
• 12600 hosts
• 2.3 TeraFLOPS
R. Arnold, T. Rattei, P. Tischler, V. Stümpflen, M-D. Truong and HW. Mewes; Bioinformatics, in press

Topic Map Schema
[Figure: topic map schema]
• Topic types: Protein, EC Number, PFAM Domain, Genome, FunCat, Domain
• Association labels: has, is associated to, contains, belongs, has orthologs, is represented by
• Property and link labels: Classification, Description, Length, Molecular Weight, Contig Name, Sequence, Taxonomy Id, Strain, Status, PedantURL, PfamURL, KEGGURL, FunCat URL, URL

Some Screenshots
[Screenshots; surviving label: Context]

Improvements
• Parallel searches based on Message Driven Beans
[Figure labels: Stateless Session Bean, Search Queue, Message Driven Bean (four instances), Pedant DB 1, Pedant DB 2, Pedant DB 3, Pedant DB n, Response Queue; steps numbered 1 to 4; legend: Request Message, Response Message, Database Connection]
R. Gregory
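
The queue-based parallel search can be sketched with Python threads standing in for Message Driven Beans. The message flow is an assumption read off the figure labels: a caller puts one request per database on a search queue, workers pick requests up, query their database and put the hits on a response queue. The `DATABASES` table holds invented placeholder entries, and `worker` and `parallel_search` are hypothetical names.

```python
import queue
import threading
from typing import Dict, List

# Placeholder data standing in for the individual Pedant databases.
DATABASES: Dict[str, List[str]] = {
    "Pedant DB 1": ["kinase A", "transporter B"],
    "Pedant DB 2": ["kinase C"],
    "Pedant DB 3": ["phosphatase D"],
}


def worker(search_queue: queue.Queue, response_queue: queue.Queue) -> None:
    """Plays the role of one Message Driven Bean instance."""
    while True:
        message = search_queue.get()
        if message is None:  # shutdown signal
            break
        db_name, term = message
        hits = [entry for entry in DATABASES[db_name] if term in entry]
        response_queue.put((db_name, hits))


def parallel_search(term: str, n_workers: int = 4) -> Dict[str, List[str]]:
    """Plays the role of the session bean: fan out requests, collect responses."""
    search_queue: queue.Queue = queue.Queue()
    response_queue: queue.Queue = queue.Queue()
    workers = [
        threading.Thread(target=worker, args=(search_queue, response_queue))
        for _ in range(n_workers)
    ]
    for thread in workers:
        thread.start()

    # One request message per database.
    for db_name in DATABASES:
        search_queue.put((db_name, term))

    # Exactly one response message is expected per request.
    results: Dict[str, List[str]] = {}
    for _ in DATABASES:
        db_name, hits = response_queue.get()
        results[db_name] = hits

    # Stop the workers once all responses have arrived.
    for _ in workers:
        search_queue.put(None)
    for thread in workers:
        thread.join()
    return results
```

With this layout the total search time is bounded by the slowest database rather than by the sum over all databases, which is the point of moving from sequential to message-driven searches.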

Further Improvements
• More maps
  • Diseases, Metabolisms
• Combination with Text Mining
• Inference Engines, Reasoners
• …
Computer: “Show me all proteins in Mus musculus involved in transmembrane signal transduction and show me the orthologs in Rattus norvegicus”
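
The kind of question posed to the "Computer" above can be written as a small traversal over proteins, their genome, their functional category and their ortholog links. The sketch below is only a toy: the protein identifiers and ortholog links are invented placeholders, and `Protein`, `PROTEINS`, `ORTHOLOGS` and `orthologs_for_category` are hypothetical names, not part of the DTMG framework.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class Protein:
    identifier: str
    genome: str
    funcat: FrozenSet[str]


SIGNALING = "transmembrane signal transduction"

# Invented placeholder entries, not real database records.
PROTEINS: List[Protein] = [
    Protein("mmu_p1", "Mus musculus", frozenset({SIGNALING})),
    Protein("mmu_p2", "Mus musculus", frozenset({"metabolism"})),
    Protein("rno_p1", "Rattus norvegicus", frozenset({SIGNALING})),
    Protein("rno_p2", "Rattus norvegicus", frozenset({"metabolism"})),
    Protein("sce_p1", "Saccharomyces cerevisiae", frozenset({SIGNALING})),
]

# "has orthologs" links, again placeholder data.
ORTHOLOGS: Dict[str, List[str]] = {
    "mmu_p1": ["rno_p1", "sce_p1"],
    "mmu_p2": ["rno_p2"],
}


def orthologs_for_category(
    source_genome: str, category: str, target_genome: str
) -> Dict[str, List[str]]:
    """Proteins of one genome in one category, each with its orthologs in another genome."""
    by_id = {protein.identifier: protein for protein in PROTEINS}
    answer: Dict[str, List[str]] = {}
    for protein in PROTEINS:
        if protein.genome != source_genome or category not in protein.funcat:
            continue
        answer[protein.identifier] = sorted(
            ortholog_id
            for ortholog_id in ORTHOLOGS.get(protein.identifier, [])
            if by_id[ortholog_id].genome == target_genome
        )
    return answer
```

Calling `orthologs_for_category("Mus musculus", SIGNALING, "Rattus norvegicus")` follows the same three hops as the spoken query: genome, functional category, then ortholog links filtered by target genome.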

Conclusion
• Topic Maps are suitable for semantic information integration
• Development of a Dynamic Topic Map Generation (DTMG) Framework
• Generation of fragments based on component and service oriented architectures
• Makes it possible to gain a deeper understanding of biological entities and systems in a truly integrated fashion

Acknowledgements
• Filka Nenova
• Richard Gregory
• Matthias Oesterheld
• Roland Arnold
• Octave Noubibou
• Marisa Thoma
• Konrad Schreiber
• …
• Thomas Rattei
• Ulrich Güldener
• Martin Münsterkötter
Funding: Impuls- und Vernetzungsfonds der Helmholtz-Gemeinschaft Deutscher Forschungszentren e.V.