TMRA 06: From Biological Data to Biological Knowledge

Volker Stümpflen, Group for Biological Information Systems, MIPS / Institute for Bioinformatics, GSF – National Research Center for Environment and Health

Posted: 19-Dec-2015

Page 1: TMRA 06 From Biological Data to Biological Knowledge Volker Stümpflen Group for Biological Information Systems MIPS / Institute for Bioinformatics GSF.

TMRA 06

From Biological Data to Biological Knowledge

Volker Stümpflen

Group for Biological Information Systems

MIPS / Institute for Bioinformatics

GSF – National Research Center for Environment and Health

Page 2:

Something About Our Problem

We can't understand anything without understanding the context.

• For a long time we focused on individual genes / proteins …
• … but humans, for example, don't have many more genes than "simple" organisms …
• … because complexity occurs at the level of biological networks

Page 3:

Small Scale "Knowledge Generation"

• Accessing some of the several hundred (web) resources (publicly available data > 2 petabytes)

=> Compilation of required knowledge by hand

Page 4:

Large Scale Assessment of Information and Knowledge

R. Shamir et al., Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data, PNAS, Vol. 101, No. 9, 2004, p. 2981-2986

"To gain deeper understanding of the [biological] systems, it is pertinent to analyze heterogeneous data sources in a truly integrated fashion and shape the analysis results into one body of knowledge."

"By integrating experimental data of heterogeneous sources and types, we are able to perform analysis on a much broader scope than previous studies."

Page 5:

Technical Problems

• Information integration from heterogeneous and distributed data sources (databases AND applications)
• Solvable with n-tier architectures
• E.g. GenRE at MIPS
  • J2EE-based middleware
  • Enterprise Java Beans (EJBs) and Web Services (WS)

Page 6:

Semantic Problems

• Sloppy definitions, e.g. "Gene has Function"
• Homonym / synonym problems, e.g. gene identifiers
• Ambiguity of terms
  • Differences in meaning of terms between different biological communities
  • Results of in-vitro experiments often differ within the experimental scope (e.g. protein interactions)
  • …
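The homonym / synonym problem with gene identifiers can be made concrete with a small sketch. It assumes a hand-built alias table that maps each (database, local name) pair to a canonical identifier. The databases, names, and identifiers are all invented for illustration and do not come from the talk.

```python
from collections import defaultdict

# Hypothetical alias table: (source database, local gene name) -> canonical id.
# Every name and identifier here is invented for illustration.
ALIASES = {
    ("db_a", "abc1"): "gene:0001",
    ("db_b", "xyz7"): "gene:0001",  # synonym: different names, same gene
    ("db_a", "dup1"): "gene:0002",
    ("db_b", "dup1"): "gene:0003",  # homonym: same name, different genes
}


def find_synonyms(aliases):
    """Return canonical ids that are known under more than one name."""
    names = defaultdict(set)
    for (_database, name), canonical in aliases.items():
        names[canonical].add(name)
    return {cid: sorted(ns) for cid, ns in names.items() if len(ns) > 1}


def find_homonyms(aliases):
    """Return names that point to more than one canonical id."""
    targets = defaultdict(set)
    for (_database, name), canonical in aliases.items():
        targets[name].add(canonical)
    return {name: sorted(cs) for name, cs in targets.items() if len(cs) > 1}


synonyms = find_synonyms(ALIASES)
homonyms = find_homonyms(ALIASES)
```

In practice the hard part is building such a table in the first place, which is the semantic problem this slide describes.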

Page 7:

Strategies

• Complete semantic annotation of (all) resources
  • Funding?
  • Data models?
• Modeling of individual domains
  • Suited for biologists (Topic Maps)
  • Access of relevant data sources
  • Merging of individual domains to obtain the "complete picture"
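Merging individually modeled domains can be sketched as follows. The sketch assumes that each domain map is a plain dictionary keyed by a subject identifier, and that topics sharing an identifier are merged by taking the union of their names and occurrences. This is the general idea behind Topic Maps merging, not the implementation used at MIPS, and all identifiers and values are hypothetical.

```python
def merge_topic_maps(*topic_maps):
    """Merge maps of the form {subject_id: {"names": set, "occurrences": set}}.

    Topics with the same subject identifier are treated as the same subject,
    so their names and occurrences are combined.
    """
    merged = {}
    for topic_map in topic_maps:
        for subject_id, topic in topic_map.items():
            target = merged.setdefault(
                subject_id, {"names": set(), "occurrences": set()}
            )
            target["names"] |= topic["names"]
            target["occurrences"] |= topic["occurrences"]
    return merged


# Two hypothetical domain maps that both describe the subject "protein:p1".
annotation_domain = {
    "protein:p1": {"names": {"example protein"}, "occurrences": {"length=120"}},
}
classification_domain = {
    "protein:p1": {"names": {"EP1"}, "occurrences": {"category=transport"}},
    "category:transport": {"names": {"transport"}, "occurrences": set()},
}

complete_picture = merge_topic_maps(annotation_domain, classification_domain)
```

The merge only works if both domains agree on the subject identifier, which is why the identifier problems from the previous slide matter here.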

Page 8:

Static Generation of Topic Maps

+ Highly flexible data model
+ Straightforward process
+ Intuitive user interface
+ Finding the right information is easy
- Topic maps tend to be very large
- Redundant information in DBs and Topic Map files
- Update problems

Dynamic generation of topic maps

[Figure: static generation workflow. Labels recoverable from the slide: three "Extract" steps, XTM File, TM4J, Omnigator]
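The static approach can be illustrated with a minimal sketch that dumps extracted rows into one XML document. It only borrows a few XTM 1.0 element names (`topicMap`, `topic`, `baseName`, `baseNameString`) and is not a complete or validated XTM export. The function name and the row data are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

XTM_NS = "http://www.topicmaps.org/xtm/1.0/"


def export_static_xtm(rows):
    """Serialize (topic_id, name) rows into a single XTM-like document."""
    root = ET.Element("topicMap", {"xmlns": XTM_NS})
    for topic_id, name in rows:
        topic = ET.SubElement(root, "topic", {"id": topic_id})
        base_name = ET.SubElement(topic, "baseName")
        ET.SubElement(base_name, "baseNameString").text = name
    return ET.tostring(root, encoding="unicode")


# Hypothetical rows, standing in for the result of an "Extract" step.
rows = [("p1", "example protein one"), ("p2", "example protein two")]
xtm_text = export_static_xtm(rows)
```

Every row now exists twice, once in the database and once in the file, and any change in the database requires regenerating the file. These are the redundancy and update problems listed on the slide.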

Page 9:

Dynamic Topic Map Generation

• Dynamic information retrieval via EJBs / Web Services
  • Each topic type is mapped to an EJB / Web Service
  • Each association is also represented by an EJB / Web Service
• Straightforward extension of the data model
  • Afterwards, user adjustments are possible
• Intuitive navigation of related information

[Figure: topic types Protein and EC Number, linked by a Protein – ECNum Association with the labels "has" and "is associated to". Service labels: Protein Web Services, EC Number Web Services, Protein – ECNum Association Web Services]
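The mapping of topic types and associations to services can be sketched roughly as below. Plain Python functions stand in for the remote EJBs / Web Services, and a fragment is assembled only when a topic is requested. All function names, registry names, and data values are hypothetical; the actual DTMG services are Java components whose interfaces are not shown in the slides.

```python
# Hypothetical stand-ins for remote services; the returned data is invented.
def protein_service(topic_id):
    return {"id": topic_id, "type": "Protein", "name": "example protein"}


def ec_number_service(topic_id):
    return {"id": topic_id, "type": "ECNumber", "name": "example enzyme class"}


def protein_ecnum_association_service(protein_id):
    # Returns the ids of EC Number topics associated with the given protein.
    return ["ec:placeholder-1"]


# One service per topic type, one service per association type.
TOPIC_SERVICES = {"Protein": protein_service, "ECNumber": ec_number_service}
ASSOCIATION_SERVICES = {
    ("Protein", "ECNumber"): protein_ecnum_association_service,
}


def build_fragment(topic_type, topic_id):
    """Build a topic map fragment on demand for a single topic."""
    topic = TOPIC_SERVICES[topic_type](topic_id)
    related = []
    associations = []
    for (source, target), service in ASSOCIATION_SERVICES.items():
        if source != topic_type:
            continue
        for other_id in service(topic_id):
            related.append(TOPIC_SERVICES[target](other_id))
            associations.append(
                {"type": f"{source}-{target}", "members": (topic_id, other_id)}
            )
    return {"topic": topic, "related": related, "associations": associations}


fragment = build_fragment("Protein", "protein:p1")
```

Because nothing is stored in a topic map file, the fragment always reflects what the underlying services return at request time.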

Page 10:

Interface Definition

• Information retrieval via EJBs (Web Service)
  • Each topic type is mapped to an EJB / WS
  • Each association type is also represented by an EJB / WS
• Straightforward extension of the data model
  • Afterwards, user adjustments are possible

… …
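The interface shown on this slide did not survive extraction. The following is only a guess at the general shape such a contract could take, written in Python for brevity although the original is Java. Every class and method name is hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class TopicTypeService(ABC):
    """Contract each topic type service would have to fulfil (assumed)."""

    topic_type: str

    @abstractmethod
    def get_topic(self, topic_id: str) -> Dict[str, str]:
        """Return the data for one topic of this type."""


class AssociationTypeService(ABC):
    """Contract each association type service would have to fulfil (assumed)."""

    association_type: str

    @abstractmethod
    def get_associated(self, topic_id: str) -> List[str]:
        """Return ids of the topics associated with the given topic."""


class ServiceRegistry:
    """Keeps track of the registered topic and association type services."""

    def __init__(self) -> None:
        self.topic_services: Dict[str, TopicTypeService] = {}
        self.association_services: Dict[str, AssociationTypeService] = {}

    def register_topic_type(self, service: TopicTypeService) -> None:
        self.topic_services[service.topic_type] = service

    def register_association_type(self, service: AssociationTypeService) -> None:
        self.association_services[service.association_type] = service


class GenomeService(TopicTypeService):
    """Extending the data model: one new class plus one registration."""

    topic_type = "Genome"

    def get_topic(self, topic_id: str) -> Dict[str, str]:
        return {"id": topic_id, "type": self.topic_type}  # invented data


registry = ServiceRegistry()
registry.register_topic_type(GenomeService())
```

Under this assumption, extending the data model means implementing the contract once more and registering the new service, without touching the existing ones.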

Page 11:

DTMG Architecture (Extension of GenRE)

[Figure: tiered architecture diagram. Labels recoverable from the slide:]
• Tiers: Enterprise Information System Tier, Integration Tier, Syntax Tier, Semantic Tier, Web Presentation Tier
• Data sources: PEDANT DBs, SIMAP, Arabidopsis thaliana, FunCat, …
• Components: GISE EJB (several instances), SIMAP Access EJB, ProteinExtractor, ProteinPfamExtractor, ResourceManager, ProteinTopicType EJB, ProteinPfamAssType EJB, TopicMapManager
• Presentation: XSL Transformation, Other Types of Clients

K. Nenova

Page 12:

"Worst Case" Example

• Combination of two large resources at MIPS
  • Annotated proteins: calculated properties of genes / proteins from various organisms
  • Orthologs: calculated similarities of proteins (all against all)

K. Nenova / R. Gregory

Page 13:

Large Scale Annotation with PEDANT (Protein Extraction, Description and Analysis Tool)

• Currently covers > 400 genomes
• ~ 1000 by the end of this year

Page 14:

SIMAP: Precalculated Sequence Homologies

[Figure: SIMAP infrastructure diagram. Labels recoverable from the slide: LAN (SIMAP database, NFS server, Grid master, Grid execution hosts, web server, database / file server); Internet (BOINC daemons, SIMAP database, SIMAP client, BOINC core on Linux, Windows, Mac); external users: MIPS + WWW users]

• 450 proteomes
• 4 sequence collections
• 7.5 million protein entries
• 3.5 million sequences
• 8 billion FASTA hits

BOINC:
• 12600 hosts
• 2.3 TeraFLOPS

R. Arnold, T. Rattei, P. Tischler, V. Stümpflen, M-D. Truong and HW. Mewes; Bioinformatics, in press

Page 15:

Topic Map Schema

[Figure: topic map schema diagram. Labels recoverable from the slide:]
• Topic types: Protein, EC Number, PFAM Domain, Genome, FunCat
• Association labels: has, is associated to, contains, belongs, has orthologs, is represented by
• Other labels: Classification, Description, Length, Molecular Weight, Contig Name, Sequence, Taxonomy Id, Strain, Status, Domain, PedantURL, PfamURL, KEGGURL, URL

Page 16:

Some Screenshots

[Screenshots not preserved; visible label: Context]

Page 17:

Improvements

• Parallel searches based on Message Driven Beans

[Figure: parallel search diagram. Labels recoverable from the slide: Stateless Session Bean, SearchQueue, four Message Driven Beans, Pedant DB 1, Pedant DB 2, Pedant DB 3, Pedant DB n, ResponseQueue, steps numbered 1 to 4. Legend: Request Message, Response Message, Database Connection]

R. Gregory
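The queue-based pattern named on this slide can be approximated in a few lines. The sketch uses Python threads and in-memory queues as stand-ins for Message Driven Beans and JMS queues, and dictionaries as stand-ins for the Pedant databases. The database contents and all function names are invented, and the real system is a J2EE deployment rather than this simplification.

```python
import queue
import threading

# Hypothetical in-memory stand-ins for the Pedant databases.
DATABASES = {
    "pedant_db_1": ["kinase alpha", "transporter beta"],
    "pedant_db_2": ["kinase gamma"],
    "pedant_db_3": ["ligase delta"],
}


def search_worker(search_queue, response_queue):
    """Plays the role of one message-driven consumer."""
    while True:
        request = search_queue.get()
        if request is None:  # sentinel: no more requests for this worker
            break
        db_name, term = request
        hits = [entry for entry in DATABASES[db_name] if term in entry]
        response_queue.put((db_name, hits))


def parallel_search(term, n_workers=3):
    """Send one request per database and collect all responses."""
    search_queue = queue.Queue()
    response_queue = queue.Queue()
    workers = [
        threading.Thread(target=search_worker, args=(search_queue, response_queue))
        for _ in range(n_workers)
    ]
    for worker in workers:
        worker.start()
    for db_name in DATABASES:
        search_queue.put((db_name, term))
    for _ in workers:
        search_queue.put(None)  # one sentinel per worker
    for worker in workers:
        worker.join()
    results = {}
    while not response_queue.empty():
        db_name, hits = response_queue.get()
        results[db_name] = hits
    return results


kinase_hits = parallel_search("kinase")
```

The caller only talks to the two queues, so adding another database changes the number of requests but not the calling code.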

Page 18:

Further Improvements

• More maps
  • Diseases, metabolism
• Combination with text mining
• Inference engines, reasoners
• …

"Computer: Show me all proteins in Mus musculus involved in transmembrane signal transduction and show me the orthologs in Rattus norvegicus"
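Once proteins, functional categories, and ortholog associations are available as linked topics, the example query reduces to a filter followed by an association lookup. The sketch below runs on a tiny in-memory data set. The protein ids, category assignments, and ortholog pairs are all invented, and the function names are hypothetical.

```python
# Hypothetical topics: protein id -> organism and functional categories.
PROTEINS = {
    "mm_p1": {"organism": "Mus musculus",
              "functions": {"transmembrane signal transduction"}},
    "mm_p2": {"organism": "Mus musculus", "functions": {"metabolism"}},
    "rn_p1": {"organism": "Rattus norvegicus",
              "functions": {"transmembrane signal transduction"}},
    "rn_p2": {"organism": "Rattus norvegicus", "functions": {"metabolism"}},
}

# Hypothetical "has orthologs" associations, treated as symmetric pairs.
ORTHOLOGS = [("mm_p1", "rn_p1"), ("mm_p2", "rn_p2")]


def proteins_with_function(organism, function):
    """Ids of proteins from one organism annotated with a given function."""
    return sorted(
        pid for pid, topic in PROTEINS.items()
        if topic["organism"] == organism and function in topic["functions"]
    )


def orthologs_in(protein_id, organism):
    """Ids of orthologs of a protein that belong to the given organism."""
    partners = set()
    for left, right in ORTHOLOGS:
        if left == protein_id:
            partners.add(right)
        elif right == protein_id:
            partners.add(left)
    return sorted(p for p in partners if PROTEINS[p]["organism"] == organism)


def answer_query(source_organism, function, target_organism):
    """Map each matching source protein to its orthologs in the target organism."""
    return {
        pid: orthologs_in(pid, target_organism)
        for pid in proteins_with_function(source_organism, function)
    }


answer = answer_query(
    "Mus musculus", "transmembrane signal transduction", "Rattus norvegicus"
)
```

Turning the spoken request into these three arguments is the part that would need text mining or a reasoner, as the slide suggests.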

Page 19:

Conclusion

• Topic Maps are suitable for semantic information integration
• Development of a Dynamic Topic Map Generation (DTMG) Framework
• Generation of fragments based on component and service oriented architectures
• Makes it possible to gain deeper understanding of biological entities and systems in a truly integrated fashion

Page 20:

Acknowledgements

Filka Nenova
Richard Gregory
Matthias Oesterheld
Roland Arnold
Octave Noubibou
Marisa Thoma
Konrad Schreiber
…
Thomas Rattei
Ulrich Güldener
Martin Münsterkötter

Funding: Impuls- und Vernetzungsfonds der Helmholtz-Gemeinschaft Deutscher Forschungszentren e.V.