GRID PON-2000-2006 Catania, IT 10-12 February 2009, Milanesi Luciano
Computational challenges in
Bioinformatics
Milanesi Luciano
National Research Council
Institute of Biomedical Technologies, Milan, Italy
The human organism:
~ 3 billion nucleotides
~ 30,000 genes coding for
~ 100,000-300,000 transcripts
~ 1-2 million proteins
~ 60 trillion cells of
~ 300 cell types in
~14,000 distinguishable
morphological structures
Networking resources
Data analysis tools specific to bioinformatics allow the user to store and search genetic data, with direct access to the data files and applications on GRID servers.
Researchers perform their activities regardless of geographical location, interact with colleagues, and share and access data.
Scientific instruments and experiments provide huge amounts of data from microarrays.
Networks of resources
The potential of new biological and biomedical technological platforms in connection with HPC and GRID technology will be particularly useful to deal with the increasing amount, complexity, and heterogeneity of biological and biomedical data.
Bioinformatics applications for eHealth have become an ideal research area where computer scientists can apply and further develop new intelligent computation methods, in both experimental and theoretical cases.
ICT and Genomics
A key development in the computational world has been the arrival of de novo design algorithms that use all the spatial information available within the target to design novel drugs.
Coupling these algorithms to the rapidly growing body of information from structural genomics, together with new ICT technology (e.g. HPC, GRID, Web Services, bio-inspired networks, etc.), provides a powerful new possibility for exploring design across a broad spectrum of genomics targets, including more challenging techniques such as protein-protein interactions, docking, molecular dynamics, systems biology, gene networks, etc.
Systems Biology for Health
Data mining
db: GENOME, TRANSCRIPTOME, PROTEOME
Identify USEFUL and SIGNIFICANT information
Allen Institute for Brain Science
What level
At what level can a systems biology strategy be implemented?
Faugeras O. et al., 2007, Journal of Physiology
SPM for the early diagnosis of AD
Statistical Parametric Mapping (SPM) analysis for quantifying hypometabolic patterns in the brain is currently the standard in the neurological research community for analysing PET/SPECT studies in the early diagnosis of Alzheimer's disease (AD).
Grid technologies allow easy, secure access to distributed data as well as to distributed computational resources.
Remote access to SPM and to distributed databases of normal subjects has been made available through a Grid portal.
The original SPM scripts for data analysis have not been modified. Only the SPM routines that access and extract information from normal images have been rewritten to allow parallelization and Grid implementation.
National Research Council
Study and implementation of innovative techniques for multiple sclerosis lesion classification, based on an integrated approach that uses ontologies for a formal description of the domain and a fuzzy-based reasoning engine to perform lesion classification.
Ontology-based approach for the discovery of anomalies in the segmentation process of brain tissues.
Brain NMR → Brain tissue segmentation → Automatic reasoning on formalized knowledge → Lesion discovery
Biomedical Images Databases
BRAIN
• Distributed Architectures for DICOM images DBs
• Integration of textual data and image features and descriptors
• Image retrieval by template based on analysis of
– Histogram
– Texture
– Shape
User Interface for Queries by example
Reference Image
Visual query
Parameters
Template
Correlation threshold
Histogram max diff
Entropy max diff
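The query-by-example parameters listed on this slide can be illustrated with a minimal Python sketch. It is not the system's implementation: the function names, the default thresholds, the 4-bin histogram, and the reading of "Histogram max diff" as the largest per-bin difference are all assumptions made for illustration. Images are represented as flat, non-empty lists of 8-bit grey levels of equal length.

```python
import math

def histogram(pixels, bins=4, max_value=256):
    """Normalized grey-level histogram (fractions that sum to 1)."""
    counts = [0] * bins
    width = max_value / bins
    for p in pixels:
        counts[min(int(p / width), bins - 1)] += 1
    return [c / len(pixels) for c in counts]

def entropy(hist):
    """Shannon entropy (in bits) of a normalized histogram."""
    return -sum(p * math.log2(p) for p in hist if p > 0)

def correlation(a, b):
    """Pearson correlation between two equally sized pixel lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant image carries no correlation information
    return cov / math.sqrt(var_a * var_b)

def matches_template(candidate, template,
                     corr_threshold=0.9,     # hypothetical default
                     hist_max_diff=0.2,      # hypothetical default
                     entropy_max_diff=0.5):  # hypothetical default
    """Accept a candidate only if all three criteria are satisfied."""
    h_c, h_t = histogram(candidate), histogram(template)
    hist_diff = max(abs(x - y) for x, y in zip(h_c, h_t))
    ent_diff = abs(entropy(h_c) - entropy(h_t))
    return (correlation(candidate, template) >= corr_threshold
            and hist_diff <= hist_max_diff
            and ent_diff <= entropy_max_diff)

template = [0, 64, 128, 192, 255, 10, 70, 130]
inverted = [255 - p for p in template]
same_image_matches = matches_template(template, template)
inverted_image_matches = matches_template(inverted, template)
```

The inverted image is rejected by the correlation criterion alone, which shows why a pixel-wise measure is combined with the histogram and entropy comparisons.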
BRAIN
Simulation of Associative/Intuitive Capabilities
Data-Driven “Conceptual Spaces”
– Based on psychological foundations of Latent Semantic Analysis
Sub-symbolic and symbolic approaches combined together
– a conceptual similarity relationship layer added to an ontology
– tools for introduction of new concepts in existing ontologies
National Research Council
Description, matching and retrieval of 3D anatomical data
(Extended Reeb graphs and size functions)
Example of data integration
GENOME
• Entrez Gene
• Entrez SNP
• Entrez HomoloGene
TRANSCRIPTOME
• Entrez GEO
• ArrayExpress
• SMD
PROTEOME
• HPRD
• BioGRID
• String
Annotated Pathways
• KEGG Pathway
• Reactome
• GO Biological Process
Integrated database (db)
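As a rough illustration of this kind of integration, the Python sketch below merges annotations from several layers into one record per gene. It is only a toy under stated assumptions: the helper names `integrate` and `genes_covered_by_all` are hypothetical, the identifiers and values are invented placeholders, and no real database is queried.

```python
from collections import defaultdict

def integrate(sources):
    """Merge per-layer annotation tables into one record per gene identifier."""
    merged = defaultdict(dict)
    for layer, records in sources.items():
        for gene_id, annotation in records.items():
            merged[gene_id][layer] = annotation
    return dict(merged)

def genes_covered_by_all(merged, layers):
    """Return identifiers that carry an annotation from every requested layer."""
    return sorted(g for g, rec in merged.items()
                  if all(layer in rec for layer in layers))

# Placeholder records: identifiers and values are invented for illustration only.
sources = {
    "genome": {"GENE_A": {"snp_count": 12}, "GENE_B": {"snp_count": 3}},
    "transcriptome": {"GENE_A": {"expression": 8.1}},
    "proteome": {"GENE_A": {"interactors": ["GENE_B"]},
                 "GENE_B": {"interactors": ["GENE_A"]}},
    "pathways": {"GENE_A": {"pathway": "PATHWAY_1"}},
}

merged = integrate(sources)
complete = genes_covered_by_all(merged, ["genome", "transcriptome", "proteome"])
print(complete)
```

Keying every layer on a shared identifier is the design choice that matters here; in practice, mapping identifiers between sources is usually the hard part and is left out of this sketch.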
The GeneNerveCellDB
Genome-wide analysis
Current interest focuses on the genome-wide analysis of cells at the level of transcription (the 'transcriptome') and translation (the 'proteome'); the third level of analysis is the 'metabolome'.
The term 'metabolome' refers to the entire complement of all the small molecular weight metabolites inside a cell suspension of interest.
A new level of experiments is required to obtain an overall picture of when, where, and how genes are expressed.
Functional genomics includes:
• The analysis of gene expression profiles at the mRNA and protein levels
• The analysis of polymorphism or mutation patterns in the genome
Disease-resistant population vs. disease-susceptible population
Genotype all individuals for thousands of SNPs
Resistant: ATGATTATAG
Susceptible: ATGTTTATAG
Resistant people all have an ‘A’ at position 4 in geneX, while susceptible people have a ‘T’
Disease Network
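The comparison on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming aligned sequences of equal length and using the two example sequences from the slide; the function name `differing_positions` is hypothetical, and a real association study would test many individuals statistically instead of comparing two strings.

```python
def differing_positions(seq_a: str, seq_b: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, base in seq_a, base in seq_b) for each mismatch."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned and of equal length")
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(seq_a, seq_b), start=1)
        if a != b
    ]

resistant = "ATGATTATAG"    # sequence from the slide (resistant population)
susceptible = "ATGTTTATAG"  # sequence from the slide (susceptible population)

snps = differing_positions(resistant, susceptible)
print(snps)  # [(4, 'A', 'T')]
```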
Illumina Chips
[Figure: execution time in hours (0 to 350) versus number of jobs (0 to 600) for Grid (measured), Cluster, and Single CPU, with a logarithmic fit labelled "Log. (Grid - interpolated)"]
Genome-wide analysis
• This approach is most useful for large challenges, where Grid overheads have less effect on overall execution time relative to single-CPU performance. Only very small challenges may run more efficiently on a single-CPU workstation.
Performance
Computational cost (time) by Illumina chip size. Grid values marked * are interpolated.

Illumina chip   # Runs [50 SNP]   # Jobs [6 h]   Single CPU   Cluster (70 nodes, 280 CPUs)   Grid
10 k            200               6              33 h         8 h                            8 h
66 k            1320              35             220 h        9.5 h                          *30 h
100 k           2000              60             333 h        10 h                           35 h
317 k           6340              172            1056 h       13 h                           *72 h
370 k           7400              206            1233 h       15 h                           *75 h
500 k           10000             278            1665 h       16 h                           80 h
670 k           13400             373            2233 h       18 h                           *87 h
1 M             20000             556            3332 h       20 h                           100 h
[Workflow diagram labels: Grid U.I. submission, VNAS, Jobs, Workflow, Results]
Genome-wide analysis
Parametric Linkage Analysis
LOD score function for whole Chromosomes 2 and 17 (1M SNPs)
[Figure: LOD score plots for Chrom. 2 and Chrom. 17]
LOD score > 3 → high probability of linkage between these markers/loci and the disease
(the observed pedigree data are more than 1000 times as likely if the two loci are linked as if they are not linked)
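The LOD threshold can be made concrete with a small worked example in Python. This sketch uses the textbook two-point LOD score for phase-known meioses, LOD(θ) = log10[θ^k (1−θ)^(n−k) / 0.5^n], with k recombinants among n meioses. That is a simplification chosen for illustration and not the multi-marker parametric analysis behind the chromosome plots; the function names are hypothetical.

```python
import math

def lod_score(recombinants: int, meioses: int, theta: float) -> float:
    """Two-point LOD score for phase-known meioses at recombination fraction theta."""
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie in [0, 0.5]")
    if not 0 <= recombinants <= meioses:
        raise ValueError("recombinants must lie in [0, meioses]")
    non_recombinants = meioses - recombinants
    likelihood_linked = (theta ** recombinants) * ((1.0 - theta) ** non_recombinants)
    likelihood_unlinked = 0.5 ** meioses  # theta = 0.5 means no linkage
    if likelihood_linked == 0.0:
        return float("-inf")  # data impossible under this theta
    return math.log10(likelihood_linked / likelihood_unlinked)

def max_lod(recombinants: int, meioses: int, steps: int = 50):
    """Scan theta over an evenly spaced grid on [0, 0.5] and return (best theta, LOD)."""
    thetas = [0.5 * i / steps for i in range(steps + 1)]
    best = max(thetas, key=lambda t: lod_score(recombinants, meioses, t))
    return best, lod_score(recombinants, meioses, best)

# Ten informative meioses with no recombinants: odds of 2**10 = 1024 to 1.
best_theta, best_lod = max_lod(recombinants=0, meioses=10)
print(best_theta, round(best_lod, 2))
```

Under this simplified model, ten non-recombinant meioses are the smallest sample that pushes the score past 3, since 10 × log10(2) ≈ 3.01.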
Systems Biology pipeline
Significant information
Databases
Models of molecular systems: gene regulation, signal transduction, metabolism
Simulations, structural and dynamical analyses
Hypotheses formulation
Wet experiments
Data integration / data mining
Reverse engineering
Model predictions
Biological knowledge and open problems
Hypotheses acceptance/rejection
Data collection
NF-kappaB family
Inactivation of NF-kB pathway by steroids.
eHealth interoperability
Connecting eHealth services
The full benefits of eHealth services and tools will not reach patients unless a high level of interoperability is integrated at the heart of their design and deployment. Healthcare providers need to co-operate extensively with each other, and with their suppliers, to ensure that their services are well connected.
The European Union's eHealth action plan seeks to harness Information and Communication Technologies to provide better healthcare for the entire EU population.
Central to that project is the development of interoperable healthcare systems in and across Member States.
The plan calls for urgent action to set up health systems and services which are connected at local, regional, national and pan-European levels.
Early and wide collaboration is critical to share costs, thereby reducing the need for future reinvestment to update systems to ensure interoperability.
The development of eHealth systems should underpin better organisation and delivery of health services, and improve citizens' awareness of how to prevent disease and preserve good health.
For example, when eHealth systems are able to communicate with each other effectively, doctors in different hospitals, or even different countries, can manage a patient's care more efficiently.
www.bioinfogrid.eu
BioinfoGrid publications
Participation in external events was a very effective way of raising awareness about the BioinfoGRID project and its activities. The project participated in 58 international conferences over 24 months.
The project has published 23 articles in scientific journals.
The project has published 19 papers in conference proceedings.
Gene expression data Analysis
Gene expression
WISDOM Virtual screening process
Docking: predict how small molecules bind to a receptor of known 3D structure
Starting compound database + Starting target structure model → DOCKING → Predicted binding models → Post-analysis → Compounds for assay
There are successful examples: rapid, cost effective…
But there are limitations: CPU and storage needed
Virtual screening process
A few target structures × millions of chemical compounds
Docking software: 1 to 30 min per docking, a few MB per output
Total: 100 CPU years, 1 TB
Large-scale deployment on grid infrastructure
Challenges:
- Speed up the process
- Manage the data
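The order of magnitude of these requirements follows from simple arithmetic, sketched below in Python. The input values (1,000,000 compounds, 3 targets, 15 minutes and 2 MB per docking, 1000 CPUs) are illustrative assumptions chosen within the ranges quoted on the slide; they are not the actual WISDOM figures and are not meant to reproduce the slide's totals exactly. The function names are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60

def screening_cost(n_compounds, n_targets, minutes_per_docking, mb_per_output):
    """Estimate dockings, CPU time (years) and output volume (TB) for a campaign."""
    dockings = n_compounds * n_targets
    cpu_years = dockings * minutes_per_docking / MINUTES_PER_YEAR
    storage_tb = dockings * mb_per_output / 1_000_000  # decimal units: 1 TB = 10**6 MB
    return dockings, cpu_years, storage_tb

def wall_clock_days(cpu_years, n_cpus):
    """Ideal elapsed time on n_cpus, ignoring scheduling and data-transfer overheads."""
    return cpu_years * 365 / n_cpus

# Assumed inputs, for illustration only.
dockings, cpu_years, storage_tb = screening_cost(
    n_compounds=1_000_000, n_targets=3, minutes_per_docking=15, mb_per_output=2
)
days_on_grid = wall_clock_days(cpu_years, n_cpus=1000)
print(f"{dockings} dockings, {cpu_years:.1f} CPU years, "
      f"{storage_tb:.1f} TB, about {days_on_grid:.0f} days on 1000 CPUs")
```

Because every docking is independent of the others, the work divides cleanly across CPUs, which is why the slide presents large-scale grid deployment as the way to speed up the process.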
http://www.eu-egee.org/
http://www.edges-grid.eu:8080/web/edges
http://www.symbiomatics.org/
http://www.bbmri.eu/
http://www.cilab.upf.edu/aneurist1/
http://www.eu-acgt.org/
http://assist.iti.gr/
http://www.euresist.org/
http://www.health-e-child.org/
http://www.cfin.au.dk/
http://www.immunogrid.org/
http://www.livinghuman.org/
http://www.multiknowledge.eu/
http://www.biotec.tu-dresden.de/sealife
http://www.eu-share.org/
http://www.europhysiome.org/
http://www.virolab.org/
http://www.action-grid.eu/
http://www.vph-arch.eu/
http://www.euheart.eu/
http://www.eibir.org/
http://imppact.icg.tugraz.at/index.html
http://www.neomark.eu/portal/
http://www.passport-liver.eu/Homepage.html
http://www.predictad.eu/
http://www.vph-noe.eu/
http://www.vphop.eu/
Conclusion
SysBioHealth
Population
“Someday we will understand how more than 3 billion nucleotides can produce more than 300,000 transcripts, used to build 2 million proteins that interact with each other in more than 60 trillion cells capable of building 14,000 distinguishable morphological structures, just for one man”
~ 3 billion nucleotides
~ 30,000 genes coding for
~ 100,000-300,000 transcripts
~ 1-2 million proteins
~ 60 trillion cells of
~ 300 cell types in
~14,000 distinguishable morphological
structures
BioMedGrid Summer school 2009
Acknowledgments
BioinfoGRID http://www.bioinfogrid.eu
EGEE Enabling Grid for E-science project http://www.eu.egee.org
FIRB-MIUR LITBIO: Laboratory for Interdisciplinary Technologies in Bioinformatics http://www.litbio.org
FIRB-MIUR ITALBIONET: Italian Bioinformatics Network