Proteus, a Grid based Problem Solving Environment (PSE) for Bioinformatics: Architecture and Experiments
Authors: Mario Cannataro (1), Carmela Comito (2), Filippo Lo Schiavo (1), and Pierangelo Veltri (1) (February 2004)
(1) University of Magna Graecia of Catanzaro, Italy
(2) University of Calabria, Italy
Presenter: Michael Robinson Agnostic: Javier Munoz
Advanced Topics in Software Engineering CIS 6612 Florida International University July 31, 2006
2
Organization
Abstract
~60% is about Bioinformatics
Proteus Architecture
First Test Implementation
Results of First Test
Conclusion and Future Work
3
Abstract
Life sciences
Bioinformatics
Computer Science
Data file sizes
Computer power
4
The Partners
What are the Life Sciences?
What is Bioinformatics?
Other sciences used in Bioinformatics
What is Computer Science?
5
Human Genome
The sum total of DNA in an organism is its genome.
The Human Genome Project (HGP), an international effort, began in October 1990 and was completed in 1999, 2003, 2004. (http://www.pbs.org/wgbh/nova/genome/program.html)
Project goals were to:
Determine the complete sequence of the 3 billion DNA bases
Identify all human genes
Make them accessible for further biological study
6
Human Genome
The bacterium E. coli and others were used to help develop the technology and interpret human gene function.
The Human Genome Project was sponsored by:
The U.S. Department of Energy and The U.S. National Institutes of Health
http://www.preventiongenetics.com/edu/genetics_nutshell.htm
7
DNA (ACGT)
Humans have from 10 to 100 trillion cells
Each Human cell has about 3 billion nucleotides
We have approximately 30,000 genes
Of the three billion letters of DNA that we have, only 1 to 1.5 percent is gene; the rest is "stuff".
The functions are unknown for over 50% of known genes
8
DNA (ACGT)
Human Genome
3,000,000,000 ~ DNA bases
30,000,000 ~ bases in genes
2,970,000,000 ~ stuff
Adenine (A) forms a base pair with thymine (T)
Guanine (G) forms a base pair with cytosine (C)
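The counts and pairing rules on this slide can be checked with a few lines of Python. This is a minimal sketch: the 1 percent gene share is the lower bound quoted on the previous slide, and the names `complement`, `TOTAL_BASES`, and `GENE_PERCENT` are illustrative, not taken from the talk.

```python
# Watson-Crick base pairs from the slide: A pairs with T, G pairs with C.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(strand: str) -> str:
    """Return the base-paired partner of each position in the strand."""
    return "".join(PAIRS[base] for base in strand.upper())

TOTAL_BASES = 3_000_000_000  # approximate size of the human genome
GENE_PERCENT = 1             # assumption: lower bound of the quoted 1 to 1.5 percent

gene_bases = TOTAL_BASES * GENE_PERCENT // 100
other_bases = TOTAL_BASES - gene_bases

print(complement("ACGT"))        # TGCA
print(gene_bases, other_bases)   # 30000000 2970000000
```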
9
Similarities to Human DNA
Another human? 99.9% - All humans have the same genes, but some of these genes contain sequence differences that make each person unique.
A chimpanzee? 98.5% - Chimpanzees are the closest living species to humans.
A mouse? 92.0% - All mammals are quite similar genetically.
A fruit fly? 44.0% - Studies of fruit flies have shown how shared genes govern the growth and structure of both insects and mammals.
Yeast? 26.0% - Yeasts are single-celled organisms, but they have many housekeeping genes that are the same as the genes in humans, such as those that enable energy to be derived from the breakdown of sugars.
A weed (thale cress)? 18.0% - Plants have many metabolic differences from humans. For example, they use sunlight to convert carbon dioxide gas to sugars. But they also have similarities in their housekeeping genes.
10
The gene sizes
The largest known human gene is dystrophin, at 2.4 million bases.
Chromosome 21 is the smallest human chromosome. Three copies of this autosome cause Down syndrome, the most frequent genetic disorder associated with significant mental retardation.
Academic groups from Germany and Japan mapped and sequenced it; it has 33,546,361 bp of DNA.
Analysis of the chromosome revealed 127 known genes, 98 predicted genes, and 59 pseudogenes.
Smallest bacterial genome: Mycoplasma genitalium, with a size of 580 kbp.
11
Bioinformatics
DNA RNA PROTEINS
MUTATIONS, ILLNESSES
MEDICATIONS
CLONING
12
DNA (ACGT)
Pseudomonas aeruginosa PAO1: 6,264,403 bases, 5565 genes
complement(6264226..6264360)
6264181 gcttgtcccg gtcgaagtcc cgactcacca cccgtaccgg ataaatcaga cggtcagacg
6264241 cttacggcct ttggcgcgac gacgcgacag aacctgacgg ccgttcttgg tggccatacg
6264301 ggcgcggaaa ccgtggacgc gagcgcgctt gagggtgctg ggttggaaag tacgtttcat
6264361 gattcggtac ctgggttgac gacttgaggt cgcagtgacc ccg
13
RNA
In RNA, thymine (T) is replaced by uracil (U).
DNA
6264181 gcttgtcccg gtcgaagtcc cgactcacca cccgtaccgg ataaatcaga cggtcagacg
6264241 cttacggcct ttggcgcgac gacgcgacag aacctgacgg ccgttcttgg tggccatacg
6264301 ggcgcggaaa ccgtggacgc gagcgcgctt gagggtgctg ggttggaaag tacgtttcat
6264361 gattcggtac ctgggttgac gacttgaggt cgcagtgacc ccg
RNA
6264181 gcuugucccg gucgaagucc cgacucacca cccguaccgg auaaaucaga cggucagacg
6264241 cuuacggccu uuggcgcgac gacgcgacag aaccugacgg ccguucuugg uggccauacg
6264301 ggcgcggaaa ccguggacgc gagcgcgcuu gagggugcug gguuggaaag uacguuucau
6264361 gauucgguac cuggguugac gacuugaggu cgcagugacc ccg
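The DNA-to-RNA rewrite shown on this slide is a plain character substitution. The sketch below applies it to the first 60 bases of the DNA listing; the function name `to_rna` is illustrative.

```python
def to_rna(dna: str) -> str:
    """Rewrite a DNA string in the RNA alphabet: every T becomes U (case kept)."""
    return dna.replace("t", "u").replace("T", "U")

# First 60 bases of the DNA listing (positions 6264181 to 6264240).
dna_line = "gcttgtcccg gtcgaagtcc cgactcacca cccgtaccgg ataaatcaga cggtcagacg"
rna_line = to_rna(dna_line)
print(rna_line)
```

The printed line can be compared with the first 60 bases of the RNA listing on the slide.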
14
Amino Acids
UUU F phe Phenylalanine        UCU S ser Serine      UAU Y tyr Tyrosine        UGU C cys Cysteine
UUC F phe Phenylalanine        UCC S ser Serine      UAC Y tyr Tyrosine        UGC C cys Cysteine
UUA L leu Leucine              UCA S ser Serine      UAA Stop                  UGA Stop
UUG L leu Leucine              UCG S ser Serine      UAG Stop                  UGG W trp Tryptophan
CUU L leu Leucine              CCU P pro Proline     CAU H his Histidine       CGU R arg Arginine
CUC L leu Leucine              CCC P pro Proline     CAC H his Histidine       CGC R arg Arginine
CUA L leu Leucine              CCA P pro Proline     CAA Q gln Glutamine       CGA R arg Arginine
CUG L leu Leucine              CCG P pro Proline     CAG Q gln Glutamine       CGG R arg Arginine
AUU I ile Isoleucine           ACU T thr Threonine   AAU N asn Asparagine      AGU S ser Serine
AUC I ile Isoleucine           ACC T thr Threonine   AAC N asn Asparagine      AGC S ser Serine
AUA I ile Isoleucine           ACA T thr Threonine   AAA K lys Lysine          AGA R arg Arginine
AUG M met Methionine (Start)   ACG T thr Threonine   AAG K lys Lysine          AGG R arg Arginine
GUU V val Valine               GCU A ala Alanine     GAU D asp Aspartic acid   GGU G gly Glycine
GUC V val Valine               GCC A ala Alanine     GAC D asp Aspartic acid   GGC G gly Glycine
GUA V val Valine               GCA A ala Alanine     GAA E glu Glutamic acid   GGA G gly Glycine
GUG V val Valine               GCG A ala Alanine     GAG E glu Glutamic acid   GGG G gly Glycine
15
Proteins (sequences)
DNA
6264181 gcttgtcccg gtcgaagtcc cgactcacca cccgtaccgg ataaatcaga cggtcagacg
6264241 cttacggcct ttggcgcgac gacgcgacag aacctgacgg ccgttcttgg tggccatacg
6264301 ggcgcggaaa ccgtggacgc gagcgcgctt gagggtgctg ggttggaaag tacgtttcat
6264361 gattcggtac ctgggttgac gacttgaggt cgcagtgacc ccg
RNA
6264181 gcuugucccg gucgaagucc cgacucacca cccguaccgg auaaaucaga cggucagacg
6264241 cuuacggccu uuggcgcgac gacgcgacag aaccugacgg ccguucuugg uggccauacg
6264301 ggcgcggaaa ccguggacgc gagcgcgcuu gagggugcug gguuggaaag uacguuucau
6264361 gauucgguac cuggguugac gacuugaggu cgcagugacc ccg
PROTEIN MKRTFQPSTLKRARVHGFRARMATKNGRQVLSRRRAKGRKRLTV
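The step from the DNA listing to the protein can be reproduced in a short script. This is a sketch under two assumptions not stated on the slide: that the `complement(6264226..6264360)` annotation shown with the sequence marks the coding region on the reverse strand, and that the standard genetic code applies. The helper names are illustrative.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Sequence lines as listed on the slide; the first base is position 6264181.
FIRST_POSITION = 6264181
LINES = [
    "gcttgtcccg gtcgaagtcc cgactcacca cccgtaccgg ataaatcaga cggtcagacg",
    "cttacggcct ttggcgcgac gacgcgacag aacctgacgg ccgttcttgg tggccatacg",
    "ggcgcggaaa ccgtggacgc gagcgcgctt gagggtgctg ggttggaaag tacgtttcat",
    "gattcggtac ctgggttgac gacttgaggt cgcagtgacc ccg",
]
FORWARD = "".join(LINES).replace(" ", "").upper()

def reverse_complement(dna: str) -> str:
    """Complement every base, then reverse, to read the opposite strand."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna: str) -> str:
    """Translate codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

# Assumed coding region: complement(6264226..6264360), inclusive coordinates.
start, end = 6264226, 6264360
region = FORWARD[start - FIRST_POSITION:end - FIRST_POSITION + 1]
protein = translate(reverse_complement(region))
print(protein)
```

Under these assumptions the script prints the same 44-residue sequence shown on the PROTEIN line: the 135-base region holds 44 codons plus a stop codon.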
16
Proteins: Pattern Matching
G-H-E-X(2)-G-X(4,5)-[GA]
GHEGVGKVVKLGAGA
GHEKKGYF-DRGPSA
GHEGYGGRSRGGGYS
GHEFEGPK-CGALYI
GHELRGTTFMPALEC
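A pattern written in this notation can be turned into a regular expression and run against the five sequences. The sketch below handles only the notation elements that appear on the slide plus the `{...}` exclusion form; the function name `prosite_to_regex` is hypothetical, and treating `-` inside the sequences as an alignment gap is an assumption.

```python
import re

ELEMENT = re.compile(r"([A-Za-z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+(?:,\d+)?)\))?")

def prosite_to_regex(pattern: str) -> str:
    """Convert a simple PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, repeat = match.groups()
        if residue in ("x", "X"):        # X means any amino acid
            residue = "."
        elif residue.startswith("{"):    # {AB} means any residue except A or B
            residue = "[^" + residue[1:-1] + "]"
        parts.append(residue + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)

regex = re.compile(prosite_to_regex("G-H-E-X(2)-G-X(4,5)-[GA]"))

SEQUENCES = [
    "GHEGVGKVVKLGAGA",
    "GHEKKGYF-DRGPSA",
    "GHEGYGGRSRGGGYS",
    "GHEFEGPK-CGALYI",
    "GHELRGTTFMPALEC",
]
for seq in SEQUENCES:
    ungapped = seq.replace("-", "")      # assumption: '-' is an alignment gap
    hit = regex.search(ungapped)
    print(seq, "->", hit.group(0) if hit else "no match")
```

Once the gaps are removed, each of the five sequences contains a match that starts at its first residue.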
17
Proteins: Structures
Chemical properties that distinguish the 20 different amino acids cause the protein chains to fold up into specific three-dimensional structures that define their particular functions in the cell.
18
Reality
Somewhere in this dense chemical forest are genes involved in deafness, Alzheimer's, cancer, cataracts, etc.
But where? This is such a maze that scientists need a map.
Out of three billion base pairs in our DNA, just one single letter can make a difference.
19
Data Locations
GenBank in the US, 1974 http://www.ncbi.nlm.nih.gov/
1997 = 1.26 gigabases
2004 = 39 gigabases
2005 = 100 gigabases
EMBL in England, 1980 http://www.ebi.ac.uk/embl/
DDBJ in Japan, 1984 http://www.ddbj.nig.ac.jp/
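The GenBank sizes listed for 1997 and 2004 imply a steady compound growth rate, which the sketch below estimates. It assumes the two figures are comparable measurements of the same collection and that growth was smooth between them; the function names are illustrative.

```python
import math

# Database sizes in gigabases, as listed on the slide.
SIZES = {1997: 1.26, 2004: 39.0, 2005: 100.0}

def annual_growth(first_year: int, last_year: int) -> float:
    """Average yearly growth factor between two listed years."""
    ratio = SIZES[last_year] / SIZES[first_year]
    return ratio ** (1 / (last_year - first_year))

def doubling_time_years(first_year: int, last_year: int) -> float:
    """Years needed to double in size at the average growth factor."""
    return math.log(2) / math.log(annual_growth(first_year, last_year))

growth = annual_growth(1997, 2004)
doubling = doubling_time_years(1997, 2004)
print(f"growth factor per year: {growth:.2f}")
print(f"doubling time in years: {doubling:.2f}")
```

Growth of this kind is why data file sizes and computer power appear together on the Abstract slide.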
20
Some Databases
The Swiss Institute of Bioinformatics maintains the following databases:
Ashbya Genome Database
Cancer Immunome Database
Eukaryotic Promoter Database (EPD)
GermOnline
MyHits
PROSITE
Swiss-Prot and TrEMBL
SWISS-2DPAGE
SWISS-MODEL Repository
21
Specialization
PlasmoDB http://www.plasmodb.org/plasmo/home.jsp
The parasitic eukaryote Plasmodium, the causative agent of the disease malaria.
22
Proteus General Architecture
23
Proteus’ Software Modules
24
Some Taxonomies of the Bioinformatics Ontology
25
Snapshot of the Ontology Browser
26
Human Protein Clustering Workflow
27
Snapshot of VEGA: Workspace 1 of the Data Selection Phase
28
Software Installed in the Example Grid
Software Components / Grid Nodes: Minos k3 k4
seqret *
splitfasta *
blastall * * *
cat * * *
Tribe-parse * * *
Tribe-matrix *
mcl *
Tribe-families *
29
Snapshot of the Ontology Browser
30
Snapshot of the Ontology Browser
31
Snapshot of the Ontology Browser
32
Snapshot of VEGA: Workspace 1 of the Pre-processing Phase
33
Conclusions and Future Work
Execution Times of the Application
TribeMCL Application     30 Proteins   All Proteins
Data Selection           1’44”         1’41”
Pre-Processing           2’50”         8h50’13”
Clustering               1’40”         2h50’28”
Results Visualization    1’14”         1’42”
Total Execution Time     7’28”         11h50’53”
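The phase times in the table can be summed and compared with the reported totals. The sketch below enters each duration as hours, minutes, and seconds; the helper names `seconds` and `clock` are illustrative, and the output uses ASCII quote marks instead of the typographic ones in the table.

```python
def seconds(hours: int, minutes: int, secs: int) -> int:
    """Convert an h/m/s duration into seconds."""
    return hours * 3600 + minutes * 60 + secs

def clock(total: int) -> str:
    """Format seconds in the table's style, for example 11h44'04"."""
    hours, rest = divmod(total, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}h{minutes:02d}'{secs:02d}\""

# (30 proteins, all proteins) per phase, copied from the table.
PHASES = {
    "Data Selection": (seconds(0, 1, 44), seconds(0, 1, 41)),
    "Pre-Processing": (seconds(0, 2, 50), seconds(8, 50, 13)),
    "Clustering": (seconds(0, 1, 40), seconds(2, 50, 28)),
    "Results Visualization": (seconds(0, 1, 14), seconds(0, 1, 42)),
}
REPORTED_TOTAL = (seconds(0, 7, 28), seconds(11, 50, 53))

small_sum = sum(times[0] for times in PHASES.values())
full_sum = sum(times[1] for times in PHASES.values())
preprocessing_share = PHASES["Pre-Processing"][1] / full_sum

print("30 proteins: ", clock(small_sum), "reported", clock(REPORTED_TOTAL[0]))
print("all proteins:", clock(full_sum), "reported", clock(REPORTED_TOTAL[1]))
print(f"pre-processing share of the full run: {preprocessing_share:.0%}")
```

For the 30-protein run the four phases add up exactly to the reported 7’28”. For the full dataset the phases as transcribed sum to 11h44’04”, which is 6’49” less than the reported 11h50’53”; the transcript does not say what accounts for the difference. Pre-processing takes roughly three quarters of the full run.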
34
References
In the paper, the authors cite 27 references.
35
Questions
Thank you