Medicel Integrator platform - views to current themes in systems biology
Tommi Aho
Computational Systems Biology 1
6.2.2008
(slides modified from material of Medicel)
Outline
•Part 1: Biology is complex
•Part 2: How to model biology – in theory
•Part 3: Is the data available?
•Part 4: Data integration
Part 1 – Biology is complex
http://www.studiodaily.com/main/technique/tprojects/6850.html
Part 2 – How to model biology – in theory
Biology: anatomy, biochemistry, botany, cell biology, ecology, evolution, genetics, immunology, histology, microbiology, parasitology, pathology, pharmacology, zoology
Modeling: structural models, graph models, FBA, ODE, partial differential equations, statistics, optimization
ODE (state-space form): dx/dt = f(x(t), u(t), t), y(t) = g(x(t), u(t), t)
Data Integration: XML, SBML, SQL, RDF, OGSA-DAI, etc.
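As a rough illustration of the ODE form dx/dt = f(x(t), u(t), t), y(t) = g(x(t), u(t), t), the sketch below integrates a one-dimensional system with the forward Euler method. The function name `simulate`, the decay model and the rate constant `k` are assumptions made for this example only; they do not come from the slides.

```python
import math

def simulate(f, g, x0, u, t_end, dt=0.001):
    """Integrate dx/dt = f(x, u(t), t) with forward Euler.

    Returns the time points and the outputs y(t) = g(x, u(t), t).
    """
    n_steps = int(round(t_end / dt))
    x = x0
    times, outputs = [], []
    for i in range(n_steps + 1):
        t = i * dt
        times.append(t)
        outputs.append(g(x, u(t), t))   # observe the current state
        x = x + dt * f(x, u(t), t)      # Euler step to the next state
    return times, outputs

# Hypothetical model: first-order decay with an additive input that is zero here.
k = 1.0
times, ys = simulate(
    f=lambda x, u, t: -k * x + u,   # state equation
    g=lambda x, u, t: x,            # output equation: observe the state directly
    x0=1.0,
    u=lambda t: 0.0,
    t_end=1.0,
)
```

With these settings the final output should be close to exp(-1); a smaller `dt` reduces the Euler error.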
Part 3 – Is the data available?
Part 3 – Long history of research
•1924: Spemann and Mangold reveal the phenomenon of primary embryonic induction
•1970: Nieuwkoop et al. show that animal hemisphere cells are induced to become mesoderm by signals from vegetal hemisphere cells (so it depends on primary polarity axis definition)
•1990: ...
(Fig. from Scott Gilbert, Developmental Biology, Sinauer Press)
Part 3 – Documentation of components
•1990: Asashima et al., Smith et al., Sokol et al. show that activin A (TGF-β) induces mesoderm. Then noggin and Vg1 join the list. (The chicken ovalbumin gene was cloned 10 years earlier.)
•> 5000 references to text
•> 150 images
Part 3 – The components
•Once upon a time: One gene, one protein, one function
activin
Cell differentiation
Part 3 – More functions
•One gene, many functions...
activin
Gonadotropin release
Cell differentiation
Inflammation
Carbohydrate metabolism
Protein & steroid metabolism
Part 3 – More components
•Redundancy, overlapping, specificity, divergence...
activin
Gonadotropin release
Cell differentiation
Inflammation
Carbohydrate metabolism
Protein & steroid metabolism
noggin
Vg1
wnt
Part 3 – Store and share the data
•Scientific documentation today
The user hard disk
Part 3 – Data is far away
•typical
Part 3 – Data should be at hand
•integrated
Part 3 – Conclusions
•Most of the data is never shared
•No systematic data accumulation
•Lacking metadata: what parameter was measured, where did the sample come from and when was the parameter measured?
•Seriously impairs our competitiveness
•IT solutions needed - biomedical researchers cannot resolve the problems alone
Part 4 – Data integration
“Integration is difficult”
Stein, L.D., Integrating Biological Databases. Nature Rev. Genet. 4, 337-345 (2003)
Part 4 – Integration example
•Model as SBML file: 612 compounds with IDs
•Model as Excel file: 1039 compounds with IDs somewhat similar to those of the SBML model; 756 corresponding KEGG IDs
•KEGG database: 1843 compounds with KEGG IDs
Part 4 – Integration example
•SBML model vs. Excel model: 538 same, 74 not found, 501 not found
•Excel model KEGG IDs vs. KEGG database: 588 same, 169 not found, 1255 not found
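A comparison of this kind can be sketched with plain set operations. The identifiers below are invented toy values, and the normalisation rule (trim whitespace, ignore case) is only an assumption about what "somewhat similar IDs" might require; real models would need a more careful mapping.

```python
def normalise(identifier):
    """Assumed normalisation: trim whitespace and ignore case."""
    return identifier.strip().lower()

def compare_ids(left, right):
    """Count identifiers shared by two sources and those unique to each."""
    left_set = {normalise(i) for i in left}
    right_set = {normalise(i) for i in right}
    return {
        "same": len(left_set & right_set),
        "only_left": len(left_set - right_set),
        "only_right": len(right_set - left_set),
    }

# Toy identifiers, not taken from the models discussed in the slides.
sbml_ids = ["GLC", "atp ", "adp", "pyr"]
excel_ids = ["glc", "ATP", "nadh", "pyr", "co2"]
report = compare_ids(sbml_ids, excel_ids)
```

For each pair of sources, "same" plus the count unique to one side equals the size of that source, which is a quick consistency check on the figures reported for any such comparison.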
Part 4 – Integration difficulties
•Diversity of data
•Heterogeneity of available databases:
› Data stored in different formats
› Often no schema (i.e. structural definition) available
Part 4 – Integration difficulties
•Conflicts of terms: What is a gene?
•Namespace difficulties (1):
› One object, multiple names
e.g. P53_HUMAN: P04637, Cellular tumor antigen p53, Antigen NY-CO-13, Tumor suppressor p53, Phosphoprotein p53, p53, ...
P04637 = P53_HUMAN = Phosphoprotein p53 = Tumor suppressor p53 = ...
Part 4 – Integration difficulties
•Namespace difficulties (2):
› Multiple objects, one name
e.g. P53 refers to
• a set of proteins across different species
• a set of transcripts encoding those proteins
• a set of genes encoding those transcripts
Common name -> Object 1 != Object 2 != Object 3 != Object 4 != ...
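Both namespace problems can be illustrated with two small lookup tables: one from names to the objects they may denote, and one from each object to all of its names. This is only a sketch. P04637 and the protein names come from the example above, while the gene and transcript identifiers are hypothetical placeholders.

```python
from collections import defaultdict

# (name, object type, identifier); gene/transcript identifiers are hypothetical.
records = [
    ("P53_HUMAN", "protein", "P04637"),
    ("Cellular tumor antigen p53", "protein", "P04637"),
    ("p53", "protein", "P04637"),
    ("p53", "gene", "gene:example-1"),
    ("p53", "transcript", "transcript:example-1"),
]

name_to_objects = defaultdict(set)   # one name -> possibly many objects
object_to_names = defaultdict(set)   # one object -> possibly many names
for name, kind, object_id in records:
    name_to_objects[name.lower()].add((kind, object_id))
    object_to_names[object_id].add(name)

def resolve(name, kind=None):
    """Return the objects a name may refer to, optionally restricted by type."""
    hits = name_to_objects.get(name.lower(), set())
    if kind is not None:
        hits = {hit for hit in hits if hit[0] == kind}
    return sorted(hits)
```

Here `resolve("p53")` returns three candidates, while `resolve("p53", kind="protein")` narrows the result to one object, which is why a name alone cannot serve as an identifier.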
Part 4 – Technical difficulties
•Lack of metadata - or metadata exists, but in unstructured form (e.g. notes) that is not computer readable
•External databases: No standard accession method
•Database versions: Updated vs. old data
•Data model: No unified model available
•Amount of data
Part 4 – Databases in Integrator
•The system includes the following data sources:
› Ensembl
› NCBI Taxonomy
› NCBI RefSeq Proteins
› UniProt/Swiss-Prot
› UniProt/TrEMBL
› InterPro
› Mammalian Phenotype Ontology
› IntAct
› KEGG
› Human Disease Ontology
› GO (Gene Ontology)
› Cell Ontology
› ChEBI
› Cytomer
› BRENDA Tissue Ontology
› PDB
› PubMed
•Current database:
› 2.5 million proteins
› 75 000 genes
› 98 000 transcripts
› 10 million connections in 144 000 pathways
› 1200 different species
Part 4 – Database Schema - Medicel Infomodel
•Performing efficient searches across databases is a big problem because the database structures are not unified
•Answer -> Structuring of data into a unified schema
•Medicel Infomodel is the framework of the platform
•Explains how data is organized into tables and fields of the database
•A unified schema is indispensable for bringing different experimental data together
•Data is worth much more when it is compatible -> more likely to give rise to new knowledge
•Schema to model biology
•Divided into biological data and metadata
•Biological systems consist of interacting components
•Interactions cause changes in the amounts of the components
•Amounts of the components give the state of the system
•Pathways model these systems
Part 4 – Database Schema - Medicel Infomodel
•About 200 data tables constitute a relational database
•Tables define the attributes of objects and the relations of the objects to each other
•E.g. a gene can be annotated to a category and the category annotated to be part of another category
•Data in the tables is structured in rows and columns
› Table -> Object Class
› Row -> Object
› Column -> Property of Object
•Knowledge of the Infomodel is not required of every user
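The table/row/column idea and the gene-to-category example can be sketched with an in-memory SQLite database. The table names, column names and sample rows below are invented for illustration and are not the actual Medicel Infomodel schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Each table is an object class, each row an object, each column a property.
CREATE TABLE category (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    parent_id INTEGER REFERENCES category(id)  -- category is part of another category
);
CREATE TABLE gene (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE gene_annotation (
    gene_id INTEGER REFERENCES gene(id),
    category_id INTEGER REFERENCES category(id)
);
INSERT INTO category VALUES (1, 'signaling', NULL), (2, 'TGF-beta signaling', 1);
INSERT INTO gene VALUES (1, 'example_gene');
INSERT INTO gene_annotation VALUES (1, 2);
""")

def categories_of(gene_name):
    """Return the annotated categories of a gene plus all their parent categories."""
    rows = conn.execute("""
        WITH RECURSIVE chain(id, name, parent_id) AS (
            SELECT c.id, c.name, c.parent_id
            FROM gene g
            JOIN gene_annotation a ON a.gene_id = g.id
            JOIN category c ON c.id = a.category_id
            WHERE g.name = ?
            UNION ALL
            SELECT p.id, p.name, p.parent_id
            FROM category p JOIN chain ON p.id = chain.parent_id
        )
        SELECT name FROM chain
    """, (gene_name,)).fetchall()
    return [row[0] for row in rows]
```

Calling `categories_of("example_gene")` follows the annotation to 'TGF-beta signaling' and then the part-of relation up to 'signaling'.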
Part 4 – Medicel Infomodel at high level
Component Data, System Data, State Data, Laboratory Data
(This is an abstract representation showing only a fraction of the Medicel Infomodel.)
Part 4 – What is Component Data
•Definitions of quantifiable components (e.g. protein, genome, gene, macromolecular complex, organism)
› Name is not a real definition
› Structural facts are concrete definitions that
• can be detected in the laboratory
• can be compared by computer algorithms
› Component list (formula)
• implies molecular mass and charge
› Patterns
• Bonds between components
› Sequence
› Features
•Useful definitions can explain system behaviour
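To show how a component list (formula) implies a molecular mass, the sketch below sums standard atomic masses over the element counts. Glucose is used as an arbitrary example, the mass table is deliberately tiny, and charge is left out; none of this is taken from the Medicel platform.

```python
# Standard atomic masses in g/mol, rounded to three decimals.
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}

def molecular_mass(component_list):
    """Sum atomic masses over a mapping of element symbol -> atom count."""
    return sum(ATOMIC_MASS[element] * count
               for element, count in component_list.items())

glucose = {"C": 6, "H": 12, "O": 6}   # C6H12O6
mass = molecular_mass(glucose)        # about 180.156 g/mol
```

Because the mass follows from structure rather than from a name, two database entries with the same component list can be matched by an algorithm even when their names differ.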
Part 4 – Where does component data come from?
•Population of databases
› e.g. UniProt (proteins) and Ensembl (genomes and genes)
› The key is to identify “reference objects” -> one unique name which may have many database references
•Own components given in
› Individuals, e.g. patients examined
› Populations, e.g. any group of individuals like ‘the Finns’
› Organisms, e.g. genetically engineered microbe strains
Part 4 – What is system data
A system is an assemblage of inter-related elements comprising a unified whole (Wikipedia)
Location
•a named real biological system that can be identified
•a unique location needs to be created for each distinct biologically interesting context
•locations are related through common components
•for each Location, information is recorded about
› Environment
› Population
› Individual
› Organism
› Organ
› Tissue
› Cell type
› Cellular compartment
Locations are related through common components
(Figure: locations connected by input and output)
L[location1]: En[fermentor]
L[location2]: En[fermentor] Po[population] O[Saccharomyces cerevisiae] Ct[yeast_cell]
L[location3]: En[fermentor] ... Cc[nucleus]
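One way to picture a Location is as a record with one optional slot per kind of information listed above. The class and function names in this sketch are assumptions for illustration. Only location1 and location2 are modelled, because the description of location3 is elided in the slide.

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass(frozen=True)
class Location:
    """A named biological context; every descriptor is optional."""
    name: str
    environment: Optional[str] = None
    population: Optional[str] = None
    individual: Optional[str] = None
    organism: Optional[str] = None
    organ: Optional[str] = None
    tissue: Optional[str] = None
    cell_type: Optional[str] = None
    cellular_compartment: Optional[str] = None

    def context(self):
        """Return only the descriptors that are actually set."""
        return {f.name: getattr(self, f.name)
                for f in fields(self)
                if f.name != "name" and getattr(self, f.name) is not None}

def shared_context(a, b):
    """Descriptors on which two locations agree."""
    other = b.context()
    return {key: value for key, value in a.context().items()
            if other.get(key) == value}

location1 = Location("location1", environment="fermentor")
location2 = Location("location2", environment="fermentor",
                     population="population",
                     organism="Saccharomyces cerevisiae",
                     cell_type="yeast_cell")
```

Here `shared_context(location1, location2)` yields the common fermentor environment, which is the kind of shared component that relates the two locations.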
Components
•Various kinds of components
› Genes, Transcripts, Proteins, Compounds, Macromolecular complexes...
› but also, at a higher level: Cell types, Individuals, Populations, Environments
• not limited to molecular systems
Interaction
•an event (or a process)
› typically, a biochemical event
•Components are connected to Interactions via Connections
› Different types of connections:
• substrate (is consumed)
• product (is produced)
• control (is neither consumed nor produced, but affects)
• outcome (is neither consumed nor produced, but is affected)
Example: Transcription
(Figure: a gene and a transcript are linked to the transcription interaction by connections)
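The component, interaction and connection vocabulary can be sketched as a small data structure. The class names are invented for this example. Treating the gene as a control and the transcript as a product is an assumption, since the surviving slide text does not state which connection types the figure uses.

```python
from dataclasses import dataclass, field
from enum import Enum

class Role(Enum):
    SUBSTRATE = "substrate"   # is consumed
    PRODUCT = "product"       # is produced
    CONTROL = "control"       # affects, neither consumed nor produced
    OUTCOME = "outcome"       # is affected, neither consumed nor produced

@dataclass
class Interaction:
    name: str
    connections: list = field(default_factory=list)  # (component, Role) pairs

    def connect(self, component, role):
        """Attach a component to this interaction with the given role."""
        self.connections.append((component, role))
        return self

    def components(self, role):
        """List the components connected with the given role."""
        return [c for c, r in self.connections if r is role]

# Assumed roles: the gene is not consumed, the transcript is produced.
transcription = Interaction("transcription")
transcription.connect("gene", Role.CONTROL).connect("transcript", Role.PRODUCT)
```

Keeping the role on the connection rather than on the component lets the same component act as a product in one interaction and as a substrate in another.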
Pathway
•a network model of one location
› a container for the components and interactions
•there can be multiple pathways for one location
› at different abstraction levels
› alternative models from different origin, creators, evidence
Part 4 – What is state data?
•State data describes quantitatively the state that a location (system) is currently in
› May quantify something about the location itself or about a component in the location
•Non-state data can be derived from state data
› E.g. p-values are quantitative but not state data
Part 4 – Infomodel for state data
(Figure: UML diagram of the state data model)
•Diagram labels: state variable, variable, unit, component, location, pathway, state data point, sample, index, time of observation, free text description, value, x-coordinate, y-coordinate, z-coordinate
•Association multiplicities shown: 0..1, 1, 0..*
•Groupings: Quantitative information, Biological information, Storing information
•Several data points per state variable – one state variable per data point
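The one-to-many relation between a state variable and its data points can be sketched as two record types. Which attributes belong to which record is an assumption based on the order of the diagram labels, and the example measurement is invented.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class StateDataPoint:
    """One measured value; belongs to exactly one state variable."""
    value: float
    sample: Optional[str] = None
    index: Optional[int] = None
    time_of_observation: Optional[str] = None
    description: Optional[str] = None

@dataclass
class StateVariable:
    """What is quantified, in which unit, and where (assumed attribute split)."""
    variable: str
    unit: str
    location: str
    component: Optional[str] = None
    pathway: Optional[str] = None
    points: List[StateDataPoint] = field(default_factory=list)

    def add(self, value, **details):
        """Record a new data point for this state variable."""
        point = StateDataPoint(value, **details)
        self.points.append(point)
        return point

# Hypothetical example: glucose concentration in a fermentor over time.
glucose_level = StateVariable("concentration", "mmol/l", "location1",
                              component="glucose")
glucose_level.add(20.0, time_of_observation="0 h")
glucose_level.add(12.5, time_of_observation="4 h")
```

The variable, unit, component and location are stored once, so every data point carries the metadata needed to interpret its value.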