Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative...
![Page 1: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/1.jpg)
Introduction to Bioinformatics
Lecture 4: Bioinformatics infrastructure
Centre for Integrative Bioinformatics VU (IBIVU)
![Page 2: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/2.jpg)
“Nothing in Biology makes sense except in the light of evolution” (Theodosius Dobzhansky (1900-1975))
“Nothing in bioinformatics makes sense except in the light of Biology”
Bioinformatics
![Page 3: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/3.jpg)
Divergent evolution
Ancestral sequence: ABCD
Descendants: ACCD (mutation, B → C) and ABD (deletion, C → ø)
Pairwise alignment, two candidates:
Candidate 1: ACCD over AB─D
Candidate 2: ACCD over A─BD
![Page 4: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/4.jpg)
Divergent evolution
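Ancestral sequence: ABCD
Descendants: ACCD (mutation, B → C) and ABD (deletion, C → ø)
Pairwise alignment, two candidates:
Candidate 1: ACCD over AB─D (true alignment)
Candidate 2: ACCD over A─BD
Candidate 1 is the true alignment because the B of ABD and the first C of ACCD both descend from the ancestral B.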
![Page 5: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/5.jpg)
What can be observed about divergent evolution
Ancestral sequence
Sequence 1 Sequence 2

    1: ACCTGTAATC
    2: ACGTGCGATC
         *  **

D = 3/10 (fraction of different sites (nucleotides))
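Figure: four substitution scenarios at a single site, each starting from an ancestral G
(a) Descendants G and C: one substitution, one visible
(b) Descendants A and C: two substitutions, one visible
(c) Descendants A and A: two substitutions, none visible
(d) Descendants G and G, with one lineage passing through A: back mutation, not visible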
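The observed distance D for the two example sequences can be reproduced with a few lines of Python. This is a minimal sketch; the function name `p_distance` is chosen here for illustration and does not come from the lecture.

```python
def p_distance(seq1: str, seq2: str) -> float:
    """Fraction of aligned sites at which two equal-length sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    # Count the columns where the two sequences carry different characters.
    differing = sum(1 for a, b in zip(seq1, seq2) if a != b)
    return differing / len(seq1)


d = p_distance("ACCTGTAATC", "ACGTGCGATC")
print(d)  # 0.3
```

D counts only visible differences. As scenarios (b) to (d) show, multiple substitutions at one site and back mutations are hidden, so D is a lower bound on the number of substitutions that actually occurred.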
![Page 6: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/6.jpg)
Convergent evolution
Often with shorter motifs (e.g. active sites)
Motif (function) has evolved more than once independently, e.g. starting with two very different sequences adopting different folds
Sequences and associated structures remain different, but the (functional) motif can become identical
Classical example: the serine proteinases subtilisin and chymotrypsin
![Page 7: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/7.jpg)
Serine proteinase (subtilisin) and chymotrypsin
Different evolutionary origins
These proteins chop up other proteins
Similarities in the reaction mechanisms. Chymotrypsin, subtilisin and carboxypeptidase C have a catalytic triad of serine, aspartate and histidine in common: serine acts as a nucleophile, aspartate as an electrophile, and histidine as a base.
The geometric orientations of the catalytic residues are similar between families, despite different protein folds.
The linear arrangements of the catalytic residues reflect different family relationships. For example the catalytic triad in the chymotrypsin clan is ordered HDS, but is ordered DHS in the subtilisin clan and SDH in the carboxypeptidase clan.
![Page 8: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/8.jpg)
Serine proteinase (subtilisin) and chymotrypsin
Catalytic triads (linear order of the catalytic residues in the sequence)
chymotrypsin: H D S
serine proteinase (subtilisin): D H S
carboxypeptidase C: S D H
Read http://www.ebi.ac.uk/interpro/potm/2003_5/Page1.htm
![Page 9: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/9.jpg)
Serine proteinase (subtilisin) and chymotrypsin
![Page 10: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/10.jpg)
Serine proteinase (subtilisin) and chymotrypsin
![Page 11: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/11.jpg)
A gene codes for a protein
DNA: CCTGAGCCAACTATTGATGAA
(transcription)
mRNA: CCUGAGCCAACUAUUGAUGAA
(translation)
Protein: PEPTIDE
Transcription + Translation = Expression
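The slide's example can be traced in code. The sketch below assumes that the DNA shown is the coding strand (the mRNA on the slide is identical apart from T becoming U) and uses a deliberately partial codon table that covers only the codons in this example. The names `transcribe` and `translate` are illustrative.

```python
# Partial codon table: only the codons needed for the slide's example.
CODON_TABLE = {
    "CCU": "P", "CCA": "P",  # proline
    "GAG": "E", "GAA": "E",  # glutamate
    "ACU": "T",              # threonine
    "AUU": "I",              # isoleucine
    "GAU": "D",              # aspartate
}


def transcribe(dna: str) -> str:
    """Coding-strand DNA to mRNA: replace every T with U."""
    return dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Read the mRNA codon by codon, starting at the first base."""
    usable = len(mrna) - len(mrna) % 3  # ignore an incomplete trailing codon
    return "".join(CODON_TABLE[mrna[i:i + 3]] for i in range(0, usable, 3))


dna = "CCTGAGCCAACTATTGATGAA"
mrna = transcribe(dna)
protein = translate(mrna)
print(mrna)     # CCUGAGCCAACUAUUGAUGAA
print(protein)  # PEPTIDE
```

A real translation routine would use the full 64-codon table and handle start and stop codons; those are left out here to keep the example tied to the slide.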
![Page 12: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/12.jpg)
DNA makes mRNA makes Protein
Translation happens within the ribosome
![Page 13: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/13.jpg)
Ribosome structure
In the nucleolus, ribosomal RNA is transcribed, processed, and assembled with ribosomal proteins to produce ribosomal subunits
At least 40 ribosomes must be made every second in a yeast cell with a 90-min generation time (Tollervey et al. 1991). On average, this represents the nuclear import of 3100 ribosomal proteins every second and the export of 80 ribosomal subunits out of the nucleus every second. Thus, a significant fraction of nuclear trafficking is used in the production of ribosomes.
Ribosomes are made of a small (‘2’ in Figure) and a large subunit (‘1’ in Figure)
Large (1) and small (2) subunit fit together (note this figure mislabels angstroms as nanometers)
![Page 14: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/14.jpg)
Transcriptional Regulation: Integrated View
![Page 15: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/15.jpg)
Expression..
mRNA transcription
TF binding site
TATA
TF
Pol II
DNA
![Page 16: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/16.jpg)
Epigenetics – Epigenomics: Gene Expression
Transcription factors (TFs) are essential for transcription initiation
Transcription is carried out by RNA polymerase II (eukaryotes)
mRNA must then move from the nucleus to the ribosomes (extranuclear) for translation
In eukaryotes there can be many TF-binding sites upstream of an ORF that together regulate transcription
Nucleosomes (chromatin structures composed of histones) are structures around which DNA coils. This blocks access of TFs
![Page 17: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/17.jpg)
Epigenetics – Epigenomics: Gene Expression
mRNA transcription
TF binding site (open)
TF binding site (closed)
TATA
Nucleosome
![Page 18: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/18.jpg)
Expression
Because DNA is flexible, bound TFs can move in order to interact with Pol II, which is necessary for transcription initiation (see next slide)
Recent TF-based initiation theory includes a wave function (Carlsberg) of TF binding, which is supposed to go from left to right. In this way the TF-binding site nearest to the TATA box would be bound by a TF, which would then in turn bind Pol II.
It has been suggested that “speckles” have something to do with this (speckles are protein plaques observed in the nucleus)
Current prediction methods for gene co-expression, e.g. finding a single shared TF binding site, do not take this TF cooperativity into account (“parking lot optimisation”)
![Page 19: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/19.jpg)
![Page 20: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/20.jpg)
434 Cro protein complex (phage)
PDB: 3CRO
![Page 21: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/21.jpg)
Zinc finger DNA recognition
(Drosophila) PDB: 2DRP
..YRCKVCSRVY THISNFCRHY VTSH...
![Page 22: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/22.jpg)
Characteristics of the family:
Function: The DNA-binding motif is found as part of transcription regulatory proteins.
Structure: One of the most abundant DNA-binding motifs. Proteins may contain more than one finger in a single chain. For example, Transcription Factor TF3A was the first zinc-finger protein discovered to contain 9 C2H2 zinc-finger motifs (tandem repeats). Each motif consists of 2 antiparallel beta-strands followed by an alpha-helix. A single zinc ion is tetrahedrally coordinated by conserved histidine and cysteine residues, stabilising the motif.
Zinc-finger DNA binding protein family
![Page 23: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/23.jpg)
Binding: Fingers bind to 3 base-pair subsites and specific contacts are mediated by amino acids in positions -1, 2, 3 and 6 relative to the start of the alpha-helix.
Contacts mainly involve one strand of the DNA.
Where proteins contain multiple fingers, each finger binds to adjacent subsites within a larger DNA recognition site thus allowing a relatively simple motif to specifically bind to a wide range of DNA sequences.
This means that the number and the type of zinc fingers dictate the specificity of binding to DNA
Characteristics of the family:
Zinc-finger DNA binding protein family
![Page 24: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/24.jpg)
Leucine zipper (yeast)
PDB: 1YSA
..RA RKLQRMKQLE DKVEE LLSKN YHLENEVARL...
![Page 25: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/25.jpg)
A protein sequence alignment

    MSTGAVLIY--TSILIKECHAMPAGNE-----
    ---GGILLFHRTHELIKESHAMANDEGGSNNS
       *  *    *  **** ***

A DNA sequence alignment

    attcgttggcaaatcgcccctatccggccttaa
    att---tggcggatcg-cctctacgggcc----
    ***   ****  **** **    * ****
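The asterisk line under an alignment marks the columns where both rows hold the same residue. A minimal sketch of how such a line, and a simple percent identity, could be derived from two aligned rows follows. The function names are illustrative, and percent identity is computed here over columns without gaps, which is one of several conventions in use.

```python
def identity_markers(row1: str, row2: str, gap: str = "-") -> str:
    """Return a line with '*' under every column where both rows are identical."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have the same length")
    return "".join(
        "*" if a == b and a != gap else " " for a, b in zip(row1, row2)
    )


def percent_identity(row1: str, row2: str, gap: str = "-") -> float:
    """Identical columns as a percentage of the columns that contain no gap."""
    aligned = [(a, b) for a, b in zip(row1, row2) if a != gap and b != gap]
    identical = sum(1 for a, b in aligned if a == b)
    return 100.0 * identical / len(aligned)


protein1 = "MSTGAVLIY--TSILIKECHAMPAGNE-----"
protein2 = "---GGILLFHRTHELIKESHAMANDEGGSNNS"
print(protein1)
print(protein2)
print(identity_markers(protein1, protein2))
print(round(percent_identity(protein1, protein2), 1))  # 45.5
```

For the protein example, 10 of the 22 gap-free columns are identical.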
![Page 26: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/26.jpg)
Searching for similarities
What is the function of the new gene?
The “lazy” investigation (i.e., no biological experiments, just bioinformatics techniques):
– Find a set of similar protein sequences to the unknown sequence
– Identify similarities and differences
– For long proteins: first identify domains
![Page 27: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/27.jpg)
Intermezzo: what is a domain
A domain is a:
• Compact, semi-independent unit (Richardson, 1981).
• Stable unit of a protein structure that can fold autonomously (Wetlaufer, 1973).
• Recurring functional and evolutionary module (Bork, 1992).
“Nature is a ‘tinkerer’ and not an inventor” (Jacob, 1977).
![Page 28: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/28.jpg)
The DEATH Domain (DD)
• Present in a variety of eukaryotic proteins involved with cell death.
• Six helices enclose a tightly packed hydrophobic core.
• Some DEATH domains form homotypic and heterotypic dimers.
http://www.mshri.on.ca/pawson
Protein domains recur in different combinations
![Page 29: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/29.jpg)
Pyruvate kinase
Phosphotransferase
barrel regulatory domain
barrel catalytic substrate binding domain
nucleotide binding domain
1 continuous + 2 discontinuous domains
Structural domain organisation can be intricate…
![Page 30: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/30.jpg)
Evolutionary and functional relationships
Reconstruct evolutionary relation:
• Based on sequence
  - Identity (simplest method)
  - Similarity
• Homology (common ancestry: the ultimate goal)
• Other (e.g., 3D structure)
Functional relation: Sequence → Structure → Function
![Page 31: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/31.jpg)
Common ancestry is more interesting: it makes it more likely that genes share the same function
Homology: sharing a common ancestor
– a binary property (yes/no)
– it’s a nice tool: when (an unknown) gene X is homologous to (a known) gene G, we gain a lot of information on X: what we know about G can be transferred to X as a good suggestion.
Searching for similarities
![Page 32: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/32.jpg)
The deluge of genomic information begs the following question: what do all these genes do?
Many genes are not annotated, and many more are partially or erroneously annotated. Given a genome which is partially annotated at best, how do we fill in the blanks?
For each sequenced genome, the functions of 20%-50% of the encoded proteins remain unknown!
Protein Function Prediction
![Page 33: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/33.jpg)
We are faced with the problem of predicting protein function from sequence, genomic, expression, interaction and structural data. For all these reasons and many more, automated protein function prediction is rapidly gaining interest among bioinformaticians and computational biologists
Protein Function Prediction
![Page 34: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/34.jpg)
Ways to predict function
Sequence-based function prediction
Structure-based function prediction
– Sequence-structure comparison
– Structure-structure comparison
Motif-based function prediction
Phylogenetic profile analysis
Protein interaction prediction and databases
Functional inference at systems level
![Page 35: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/35.jpg)
Classes of function prediction methods
Sequence-based approaches
– protein A has function X, and protein B is a homolog (ortholog) of protein A; hence B has function X
Structure-based approaches
– protein A has structure X, and X has such-and-such structural features; hence A’s functional sites are ….
Motif-based approaches
– a group of genes have function X and they all have motif Y; protein A has motif Y; hence protein A’s function might be related to X
Function prediction based on “guilt-by-association”
– gene A has function X and gene B is often “associated” with gene A; hence B might have a function related to X
![Page 36: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/36.jpg)
Sequence-based function prediction
Homology searching
Sequence comparison is a powerful tool for detecting homologous genes, but it is limited to genomes that are not too distantly related

    Query: 2   LSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASEDL 61
               LSD +   V  +W K+       G + L R+   +P+T   F  +      D    S ++
    Sbjct: 3   LSDKDKAAVRALWSKIGKSSDAIGNDALSRMIVVYPQTKIYFSHWP-----DVTPGSPNI 57

    Query: 62  KKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKHPG 121
               K HG  V+  +   + K    +  +  L++ HA K ++     + ++ CI+ V+ +  P
    Sbjct: 58  KAHGKKVMGGIALAVSKIDDLKTGLMELSEQHAYKLRVDPSNFKILNHCILVVISTMFPK 117

    Query: 122 DFGADAQGAMNKALELFRKDMASNYK 147
               +F  +A  +++K L      +A  Y+
    Sbjct: 118 EFTPEAHVSLDKFLSGVALALAERYR 143

We have done homology searching (FASTA, BLAST, PSI-BLAST) in earlier lectures
![Page 37: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/37.jpg)
Structure-based function prediction
Structure-based methods could possibly detect remote homologues that are not detectable by sequence-based methods
– using structural information in addition to sequence information
– protein threading (sequence-structure alignment) is a popular method
Structure-based methods could provide more than just “homology” information
![Page 38: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/38.jpg)
Threading
Query sequence
Template sequence
+
Template structure
Compatibility score
![Page 39: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/39.jpg)
![Page 40: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/40.jpg)
Structure-based function prediction
Threading
Scoring function for measuring to what extent the query sequence fits into the template structure
For scoring we have to map an amino acid (query sequence) onto a local environment (template structure)
We can use the following structural features for scoring:
o Secondary structure
o Is environment inside or outside? – Residue accessible surface area (ASA)
o Polarity of environment
The best (highest scoring) “thread” through the structure gives a so-called structural alignment. It looks exactly the same as a sequence alignment but is based on structure.
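The idea of scoring a thread can be illustrated with a toy example. The sketch below is not a real threading potential. It assumes that each template position is described only as buried (`b`) or exposed (`e`), gives +1 when a residue's hydrophobicity matches the burial state and -1 otherwise, and tries only ungapped placements. The hydrophobic residue set, the scores and all names are assumptions made for illustration.

```python
# Assumed toy classification of hydrophobic residues (illustrative only).
HYDROPHOBIC = set("AVLIMFWC")


def residue_score(aa: str, buried: bool) -> int:
    """+1 if a hydrophobic residue is buried or a polar one exposed, else -1."""
    return 1 if (aa in HYDROPHOBIC) == buried else -1


def thread_score(query: str, environment: str, offset: int = 0) -> int:
    """Score the query placed on the template starting at position `offset`.

    `environment` holds one character per template position:
    'b' for buried, 'e' for exposed.
    """
    return sum(
        residue_score(aa, environment[offset + i] == "b")
        for i, aa in enumerate(query)
    )


def best_thread(query: str, environment: str) -> tuple[int, int]:
    """Return (offset, score) of the best ungapped placement."""
    offsets = range(len(environment) - len(query) + 1)
    best = max(offsets, key=lambda o: thread_score(query, environment, o))
    return best, thread_score(query, environment, best)


print(best_thread("LKVEL", "eebebebe"))  # (2, 5)
```

Real threading methods also score secondary structure and polarity, allow gaps, and derive their scores from known structures. The sketch only shows the mapping of residues onto environments and the search for the highest scoring thread.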
![Page 41: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/41.jpg)
Threading – inverse folding
Map sequence to structural environments
Query (N to C terminus) → Template?
What is the optimal thread for each local environment?
Find the best compromise over all environments
Environment (hydrophobic or hydrophilic):
• Secondary structure
• ASA
• Polarity of environment
![Page 42: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/42.jpg)
Fold recognition by threading
Query sequence
Compatibility scores
Fold 1
Fold 2
Fold 3
Fold N
What is the most compatible structure?
The one with the highest threading score
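Fold recognition then amounts to scoring the query against every fold in a library and keeping the best one. The sketch below uses an assumed toy compatibility score (hydrophobic residues prefer buried positions, ungapped placement only) and a hypothetical three-fold library. None of the names or values come from a real threading program.

```python
# Assumed toy classification of hydrophobic residues (illustrative only).
HYDROPHOBIC = set("AVLIMFWC")


def compatibility(query: str, environment: str) -> int:
    """Best ungapped toy threading score of `query` on one fold.

    `environment` uses 'b' for buried and 'e' for exposed template positions.
    """
    def score_at(offset: int) -> int:
        return sum(
            1 if (aa in HYDROPHOBIC) == (environment[offset + i] == "b") else -1
            for i, aa in enumerate(query)
        )

    offsets = range(len(environment) - len(query) + 1)
    return max(score_at(o) for o in offsets)


def recognise_fold(query: str, fold_library: dict[str, str]) -> tuple[str, dict[str, int]]:
    """Return the name of the highest scoring fold and all scores."""
    scores = {name: compatibility(query, env) for name, env in fold_library.items()}
    return max(scores, key=scores.get), scores


# Hypothetical fold library: fold name -> burial pattern of the template.
library = {
    "fold_1": "eeeeeeee",
    "fold_2": "eebebebe",
    "fold_3": "bbbbbbbb",
}
best, scores = recognise_fold("LKVEL", library)
print(best, scores)  # fold_2 {'fold_1': -1, 'fold_2': 5, 'fold_3': 1}
```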
![Page 43: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/43.jpg)
Structure-based function prediction
SCOP (http://scop.berkeley.edu/) is a protein structure classification database in which proteins are grouped into a hierarchy of families, superfamilies, folds and classes, based on their structural and functional similarities
![Page 44: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/44.jpg)
Structure-based function prediction
SCOP hierarchy – the top level: 11 classes
![Page 45: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/45.jpg)
Structure-based function prediction
All-alpha protein
Coiled-coil protein
All-beta protein
Alpha-beta protein
Membrane protein
![Page 46: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/46.jpg)
Structure-based function prediction
SCOP hierarchy – the second level: 800 folds
![Page 47: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/47.jpg)
Structure-based function prediction
SCOP hierarchy – third level: 1294 superfamilies
![Page 48: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/48.jpg)
Structure-based function prediction
SCOP hierarchy – fourth level: 2327 families
![Page 49: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/49.jpg)
Structure-based function prediction
Using a sequence-structure alignment method, one can predict that a protein belongs to a
– SCOP family, superfamily or fold
Proteins predicted to be in the same SCOP family are orthologous
Proteins predicted to be in the same SCOP superfamily are homologous
Proteins predicted to be in the same SCOP fold are structurally analogous
folds
superfamilies
families
![Page 50: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/50.jpg)
Structure-based function prediction
Prediction of ligand binding sites
– For ~85% of ligand-binding proteins, the largest cleft is the ligand-binding site
– For an additional ~10% of ligand-binding proteins, the second largest cleft is the ligand-binding site
![Page 51: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/51.jpg)
Structure-based function prediction
Prediction of macromolecular binding sites
– there is a strong correlation between macromolecular binding sites (with protein, DNA and RNA) and disordered protein regions
– disordered regions in a protein sequence can be predicted using computational methods
http://www.pondr.com/
![Page 52: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/52.jpg)
Motif-based function prediction
Prediction of protein functions based on identified sequence motifs
PROSITE contains patterns specific for more than a thousand protein families.
ScanPROSITE scans a protein sequence for occurrences of the patterns and profiles stored in PROSITE
![Page 53: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/53.jpg)
Motif-based function prediction
Search PROSITE using ScanPROSITE
The sequence has ASN_GLYCOSYLATION N-glycosylation site: 242 - 245 NETL
MSEGSDNNGDPQQQGAEGEAVGENKMKSRLRKGALKKKNVFNVKDHCFIARFFKQPTFCSHCKDFICGYQSGYAWMGFGKQGFQCQVCSYVVHKRCHEYVTFICPGKDKGNETLIDSDSPKTQH ……..
![Page 54: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/54.jpg)
Regular expressions
Alignment
ADLGAVFALCDRYFQ
SDVGPRSCFCERFYQ
ADLGRTQNRCDRYYQ
ADIGQPHSLCERYFQ
Regular expression
[AS]-D-[IVL]-G-x4-{PG}-C-[DE]-R-[FY]2-Q
{PG} = not (P or G)
For short sequence stretches, regular expressions are often more suitable to describe the information than alignments (or profiles)
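The pattern above maps almost directly onto an ordinary regular expression. The sketch below converts the slide's shorthand (`x4`, `[FY]2`, `{PG}`) into Python `re` syntax and checks it against the four aligned sequences. The converter `pattern_to_regex` is a hypothetical helper written for this example. It also accepts counts in parentheses such as `x(4)`, but it does not cover the full PROSITE syntax (ranges, anchors).

```python
import re


def pattern_to_regex(pattern: str) -> str:
    """Translate a dash-separated motif pattern into a Python regex string."""
    parts = []
    for element in pattern.split("-"):
        # One element: a residue set, an excluded set or a single letter,
        # optionally followed by a repeat count (written 4 or (4)).
        m = re.fullmatch(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Za-z])\(?(\d*)\)?", element)
        if m is None:
            raise ValueError(f"unrecognised element: {element!r}")
        core, count = m.groups()
        if core.lower() == "x":
            core = "."                      # x = any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # {PG} = not (P or G)
        parts.append(core + ("{" + count + "}" if count else ""))
    return "".join(parts)


regex = pattern_to_regex("[AS]-D-[IVL]-G-x4-{PG}-C-[DE]-R-[FY]2-Q")
print(regex)  # [AS]D[IVL]G.{4}[^PG]C[DE]R[FY]{2}Q

alignment = ["ADLGAVFALCDRYFQ", "SDVGPRSCFCERFYQ",
             "ADLGRTQNRCDRYYQ", "ADIGQPHSLCERYFQ"]
print(all(re.fullmatch(regex, seq) for seq in alignment))  # True
```

A sequence with P at the excluded position, for example ADLGAVFAPCDRYFQ, is rejected by the same expression.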
![Page 55: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/55.jpg)
Regular expressions
Regular expression No. of exact matches in DB
D-A-V-I-D 71
D-A-V-I-[DENQ] 252
[DENQ]-A-V-I-[DENQ] 925
[DENQ]-A-[VLI]-I-[DENQ] 2739
[DENQ]-[AG]-[VLI]2-[DENQ] 51506
D-A-V-E 1088
![Page 56: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/56.jpg)
Prosite
In addition to regular expressions, the Prosite database also contains so-called extended profiles
Extended profiles contain more explicit information than classical profiles, for example to describe expected gap lengths, etc.
This is because some patterns are better described using regular expressions (e.g. short motifs), while others are better formalised using (extended) profiles
![Page 57: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/57.jpg)
Domain swapping
Domain swapping is a structurally viable mechanism for forming oligomeric assemblies (Bennett et al., 1995). In domain swapping, a secondary or tertiary element of a monomeric protein is replaced by the same element of another protein.
Domain swapping can range from secondary structure elements to whole structural domains. It also represents a model of evolution for functional adaptation by oligomerization, e.g. of oligomeric enzymes that have their active site at sub-unit interfaces (Heringa and Taylor, 1997).
![Page 58: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/58.jpg)
Domain databases
![Page 59: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/59.jpg)
COGs Domain database
The COGs (Clusters of Orthologous Groups) database is a phylogenetic classification of the proteins encoded within complete genomes (Tatusov et al., 2001).
It primarily consists of bacterial and archaeal genomes.
Operational definition of orthology is based on bidirectional best hit
Incorporation of the larger genomes of multicellular eukaryotes into the COG system is achieved by identifying eukaryotic proteins that fit into already existing COGs. Eukaryotic proteins that have orthologs within different COGs are split into their individual domains.
The COGs database currently consists of 3166 COGs including 75,725 proteins from 44 genomes.
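The bidirectional best hit criterion mentioned above can be illustrated with a minimal sketch. It assumes that pairwise similarity scores between the proteins of two genomes are already available as dictionaries; the protein names, the scores and the function names are hypothetical, and the actual COG procedure involves more than this single step (for example, consistency across at least three lineages).

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    `scores` is a dict {(query, subject): score}; higher means more similar.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return pairs (a, b) where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical similarity scores between proteins of genome A and genome B.
a_vs_b = {("a1", "b1"): 250, ("a1", "b2"): 90, ("a2", "b1"): 120, ("a2", "b3"): 80}
b_vs_a = {("b1", "a1"): 245, ("b1", "a2"): 118, ("b2", "a1"): 95, ("b3", "a2"): 85}

pairs = bidirectional_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

In this toy data a2 also hits b1 best, but b1 prefers a1, so only the a1/b1 pair is reciprocal and would be treated as a candidate ortholog pair.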
![Page 60: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/60.jpg)
COGs: the beginning (1997)
In order to extract the maximum amount of information from the rapidly accumulating genome sequences, all conserved genes need to be classified according to their homologous relationships. Comparison of proteins encoded in seven complete genomes from five major phylogenetic lineages and elucidation of consistent patterns of sequence similarities allowed the delineation of 720 clusters of orthologous groups (COGs). Each COG consists of individual orthologous proteins or orthologous sets of paralogs from at least three lineages. Orthologs typically have the same function, allowing transfer of functional information from one member to an entire COG. This relation automatically yields a number of functional predictions for poorly characterized genomes. The COGs comprise a framework for functional and evolutionary genome analysis.
![Page 61: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/61.jpg)
COG2813: 16S RNA G1207 methylase RsmC
COG members are mapped onto the genomes included in the DB
![Page 62: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/62.jpg)
PRINTS database
•PRINTS is a compendium of protein fingerprints.
•A fingerprint is a group of conserved motifs used to characterise a protein family; its diagnostic power (false positives and false negatives) is refined by iterative scanning of a SWISS-PROT/TrEMBL composite database.
•Usually the motifs do not overlap, but are separated along a sequence, though they may be contiguous in 3D-space.
•Fingerprints can encode protein folds and functionalities more flexibly and powerfully than can single motifs, full diagnostic potency deriving from the mutual context provided by motif neighbours
•PRINTS contains the most discriminating groups of regular expressions for each protein sequence
•Release 31.0 of PRINTS contains 1,550 entries, encoding 9,531 individual motifs.
![Page 63: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/63.jpg)
INITIAL MOTIF SETS
BETAHAEM1 Length of motif = 17 Motif number = 1
Beta haemoglobin motif I - 1
Motif PCODE ST INT
GRLLVVYPWTQRYFDSF HBB1_RAT 29 29
GRLLVVYPWTQRYFDSF HBB1_MOUSE 29 29
GRLLVVYPWTQRFFEHF HBB_ALCAA 28 28
GRLLVVYPWTQRFFEHF HBB_ODOVI 28 28
GRLLVVYPWTQRFFESF HBB_BOVIN 28 28
GRLLVVYPWTQRFFESF HBB_ATEGE 29 29
GRLLVVYPWTQRFFESF HBB_HUMAN 29 29
GRLLVVYPWTQRFFESF HBB_ANTPA 29 29
ARLLIVYPWTQRFFASF HBB_ANAPL 29 29
SRCLIVYPWTQRHFSGF HBB_NOTAN 29 29

BETAHAEM2 Length of motif = 16 Motif number = 2
Beta haemoglobin motif II - 1
Motif PCODE ST INT
DLSSASAIMGNPKVKA HBB1_RAT 47 1
DLSSASAIMGNAKVKA HBB1_MOUSE 47 1
DLSTADAVMHNAKVKE HBB_ALCAA 46 1
DLSSAGAVMGNPKVKA HBB_ODOVI 46 1
DLSTADAVMNNPKVKA HBB_BOVIN 46 1
DLSTPDAVMSNPKVKA HBB_ATEGE 47 1
DLSTPDAVMGNPKVKA HBB_HUMAN 47 1
DLSNAGAVMGNAKVKA HBB_ANTPA 47 1
NLSSPTAILGNPMVRA HBB_ANAPL 47 1
NLYNAEAILGNANVAA HBB_NOTAN 47 1
BETAHAEM: 2 of the 5 PRINTS motifs making up the fingerprint
After iteration the number of sequences for each motif can grow dramatically. Both the initial motifs (example here) and final motifs are provided to the user
![Page 64: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/64.jpg)
The PRODOM Database
ProDom is a comprehensive set of protein domain families automatically generated from the SWISS-PROT and TrEMBL sequence databases
![Page 65: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/65.jpg)
The PRODOM Database
ProDom (Corpet et al., 2000) is a database of protein domain families automatically generated from SWISSPROT and TrEMBL sequence databases (Bairoch and Apweiler, 2000) using a novel procedure based on recursive PSI-BLAST searches (Altschul et al., 1997). Release 2001.2 of ProDom contains 283,772 domain families, 101,957 having at least 2 sequence members. ProDom-CG (Complete Genome) is a version of the ProDom database which holds genome-specific domain data.
![Page 66: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/66.jpg)
The PROSITE Database
PROSITE is a database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs
PROSITE (Hofmann et al., 1999) is a good source of high quality annotation for protein domain families. A PROSITE sequence family is represented as a pattern or profile, providing a means of sensitive detection of common protein domains in new protein sequences.
PROSITE release 16.46 contains signatures specific for 1,098 protein families or domains. Each of these signatures comes with documentation providing background information on the structure and function of these proteins.
![Page 67: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/67.jpg)
The PROSITE Database
A PROSITE sequence family is represented as a pattern or a profile.
A pattern is given as a regular expression (next slide)
Compared to classical profiles, the generalised profiles used in PROSITE carry the same additional information as Hidden Markov Models (HMMs).
![Page 68: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/68.jpg)
Regular expressions
Alignment
ADLGAVFALCDRYFQ
SDVGPRSCFCERFYQ
ADLGRTQNRCDRYYQ
ADIGQPHSLCERYFQ
Regular expression
[AS]-D-[IVL]-G-x4-{PG}-C-[DE]-R-[FY]2-Q
{PG} = not (P or G)
For short sequence stretches, regular expressions are often more suitable to describe the information than alignments (or profiles)
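As a check on the pattern above, the following sketch translates the slide's shorthand into a Python regular expression and tests it against the four aligned sequences. The converter `slide_pattern_to_regex` is a hypothetical helper written for this example; it only handles the notation used on this slide (official PROSITE syntax writes repeats as `x(4)` rather than `x4`).

```python
import re


def slide_pattern_to_regex(pattern):
    """Translate the slide's PROSITE-like notation into a Python regex.

    Handles single residues, [..] allowed sets, {..} forbidden sets,
    x (any residue), and a trailing repeat count such as x4 or [FY]2.
    """
    parts = []
    for element in pattern.split("-"):
        match = re.fullmatch(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Za-z])(\d*)", element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, count = match.groups()
        if core == "x":
            core = "."                       # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # {PG} = not (P or G)
        parts.append(core + ("{" + count + "}" if count else ""))
    return "".join(parts)


regex = slide_pattern_to_regex("[AS]-D-[IVL]-G-x4-{PG}-C-[DE]-R-[FY]2-Q")
aligned = ["ADLGAVFALCDRYFQ", "SDVGPRSCFCERFYQ", "ADLGRTQNRCDRYYQ", "ADIGQPHSLCERYFQ"]
all_match = all(re.fullmatch(regex, seq) for seq in aligned)
print(regex, all_match)
```

A sequence with P at the ninth position, such as ADLGAVFAPCDRYFQ, is rejected because of the {PG} element.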
![Page 69: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/69.jpg)
Regular expressions
Regular expression No. of exact matches in DB
D-A-V-I-D 71
D-A-V-I-[DENQ] 252
[DENQ]-A-V-I-[DENQ] 925
[DENQ]-A-[VLI]-I-[DENQ] 2739
[DENQ]-[AG]-[VLI]2-[DENQ] 51506
D-A-V-E 1088
![Page 70: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/70.jpg)
Rationale for regular expressions
“I want to see all sequences that ...
– ... contain a C” --- C
– ... contain a C or an F” -- [CF]
– ... contain a C and an F” -- (C.*F | F.*C) (‘|’ means ‘or’ and ‘.*’ means don’t care for any length)
– ... contain a C immediately followed by an F” -- CF
– ... contain a C later followed by an F” -- C.*F
– ... begin with a C” -- ^C (‘^’ means ‘starting with’)
– ... do not contain a C” -- {C}
– ... contain at least three Cs” -- C3-
– ... contain exactly three Cs” -- C3
– ... has a C at the seventh position” -- .6C
– ... either contain a C, an E, and an F in any order except CFE, unless there are also at most three Ps, or there is a ....
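The queries above use a mixed shorthand. As a rough guide, they could be written for Python's `re` module as follows. This is a sketch under assumptions: the dictionary `QUERIES` and the function `matching_queries` are hypothetical names, and "three Cs" is interpreted here as three occurrences anywhere in the sequence, which the slide does not state explicitly.

```python
import re

# Python `re` equivalents of the plain-English queries, used with re.search.
QUERIES = {
    "contains C": r"C",
    "contains C or F": r"[CF]",
    "contains C and F": r"C.*F|F.*C",
    "C immediately followed by F": r"CF",
    "C later followed by F": r"C.*F",
    "begins with C": r"^C",
    "does not contain C": r"^[^C]*$",
    "at least three Cs": r"C.*C.*C",               # assumption: anywhere in the sequence
    "exactly three Cs": r"^[^C]*(?:C[^C]*){3}$",   # assumption: anywhere in the sequence
    "C at seventh position": r"^.{6}C",
}


def matching_queries(sequence):
    """Return the names of all queries that the sequence satisfies."""
    return [name for name, pattern in QUERIES.items() if re.search(pattern, sequence)]


print(matching_queries("ACDCFGC"))
print(matching_queries("AGGT"))
```

Note that negation over a whole sequence needs anchors (`^[^C]*$`), whereas the PROSITE-style `{C}` excludes C at a single position only.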
![Page 71: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/71.jpg)
Regex limitations
Regex cannot remember indeterminate counts!
“I want to see all sequences with ...
☺ ... six Cs followed by six Ts” -- C6T6
☺ ... any number of Cs followed by any number of Ts” -- C*T*
☹ ... Cs followed by an equal number of Ts” (this cannot be done) -- CnTn? (CT|CCTT|CCCTTT|C4T4| ... )?
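The three cases above can be sketched in Python. The first two are direct regular expressions. For the third, a classical regular expression cannot compare the two counts, so one practical workaround (an illustration, not the only option) is to let the regex check the overall shape and compare the run lengths in ordinary code. The function names are hypothetical.

```python
import re


def six_cs_then_six_ts(seq):
    """Exactly six Cs followed by exactly six Ts."""
    return re.fullmatch(r"C{6}T{6}", seq) is not None


def any_cs_then_any_ts(seq):
    """Any number of Cs (including none) followed by any number of Ts."""
    return re.fullmatch(r"C*T*", seq) is not None


def equal_cs_then_ts(seq):
    """Cs followed by an equal number of Ts.

    The regex only checks the shape (a run of Cs, then a run of Ts);
    the counts are compared outside the regex.
    """
    match = re.fullmatch(r"(C+)(T+)", seq)
    return match is not None and len(match.group(1)) == len(match.group(2))


print(six_cs_then_six_ts("CCCCCCTTTTTT"))
print(any_cs_then_any_ts("CCT"), any_cs_then_any_ts("TC"))
print(equal_cs_then_ts("CCCTTT"), equal_cs_then_ts("CCCTT"))
```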
![Page 72: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/72.jpg)
The PFAM Database
Pfam is a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains and families. For each family in Pfam you can:
•Look at multiple alignments
•View protein domain architectures
•Examine species distribution
•Follow links to other databases
•View known protein structures
•Search with Hidden Markov Model (HMM) for each alignment
![Page 73: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/73.jpg)
The PFAM Database
Pfam is a database of two parts. The first is the curated part, Pfam-A, containing over 5193 protein families. Pfam-A comprises manually crafted multiple alignments and profile-HMMs. To give Pfam a more comprehensive coverage of known proteins, we automatically generate a supplement called Pfam-B. This contains a large number of small families taken from the PRODOM database that do not overlap with Pfam-A. Although of lower quality, Pfam-B families can be useful when no Pfam-A families are found.
![Page 74: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/74.jpg)
The PFAM Database
Sequence coverage Pfam-A: 73% (Gr)
Sequence coverage Pfam-B: 20% (Bl)
Other (Grey)
![Page 75: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/75.jpg)
CYB_TRYBB/1-197 M...LYKSG..EKRKG..LLMSGC.....LYR.....IYGVGFSLGFFIALQIIC..GVCLAWLFFSCFICSNWYFVLFL
CYB_MARPO/1-208 M.ARRLSILKQPIFSTFNNHLIDY.....PTPSNISYWWGFGSLAGLCLVIQILTGVFLAMHYTPHVDLAFLSVEHIMR.
CYB_HETFR/1-205 MATNIRKTH..PLLKIINHALVDL.....PAPSNISAWWNFGSLLVLCLAVQILTGLFLAMHYTADISLAFSSVIHICR.
CYB_STELO/1-204 M.TNIRKTH..PLMKILNDAFIDL.....PTPSNISSWWNFGSLLGLCLIMQILTGLFLAMHYTPDTTTAFSSVAHICR.
CYB_ASCSU/1-196 ...........MKLDFVNSMVVSL.....PSSKVLTYGWNFGSMLGMVLGFQILTGTFLAFYYSNDGALAFLSVQYIMY.
CYB6_SPIOL/1-210 M.SKVYDWF..EERLEIQAIADDITSKYVPPHVNIFYCLGGITLT..CFLVQVATGFAMTFYYRPTVTDAFASVQYIMT.
CYB6_MARPO/1-210 M.GKVYDWF..EERLEIQAIADDITSKYVPPHVNIFYCLGGITLT..CFLVQVATGFAMTFYYRPTVTEAFSSVQYIMT.
CYB6_EUGGR/1-210 M.SRVYDWF..EERLEIQAIADDVSSKYVPPHVNIFYCLGGITFT..CFIIQVATGFAMTFYYRPTVTEAFLSVKYIMN.

CYB_TRYBB/1-197 WDFDLGFVIRSVHICFTSLLYLLLYIHIFKSITLIILFDTH..IL....VWFIGFILFVFIIIIAFIGYVLPCTMMSYWG
CYB_MARPO/1-208 .DVKGGWLLRYMHANGASMFFIVVYLHFFRGLY....YGSY..ASPRELVWCLGVVILLLMIVTAFIGYVLPWGQMSFWG
CYB_HETFR/1-205 .DVNYGWLIRNIHANGASLFFICIYLHIARGLY....YGSY..LLKE..TWNIGVILLFLLMATAFVGYVLPWGQMSFWG
CYB_STELO/1-204 .DVNYGWFIRYLHANGASMFFICLYAHMGRGLY....YGSY..MFQE..TWNIGVLLLLTVMATAFVGYVLPWGQMSFWG
CYB_ASCSU/1-196 .EVNFGWIFRVLHFNGASLFFIFLYLHLFKGLF....FMSY..RLKK..VWVSGIVILLLVMMEAFMGYVLVWAQMSFWA
CYB6_SPIOL/1-210 .EVNFGWLIRSVHRWSASMMVLMMILHVFRVYL....TGGFKKPREL..TWVTGVVLGVLTASFGVTGYSLPWDQIGYWA
CYB6_MARPO/1-210 .EVNFGWLIRSVHRWSASMMVLMMILHIFRVYL....TGGFKKPREL..TWVTGVILAVLTVSFGVTGYSLPWDQIGYWA
CYB6_EUGGR/1-210 .EVNFGWLIRSIHRWSASMMVLMMILHVCRVYL....TGGFKKPREL..TWVTGIILAILTVSFGVTGYSLPWDQVGYWA

CYB_TRYBB/1-197 LTVFSNIIATVPILGIWLCYWIWGSEFINDFTLLKLHVLHV.LLPFILLIILILHLFCLHYFM
CYB_MARPO/1-208 ATVITSLASAIPVVGDTIVTWLWGGFSVDNATLNRFFSLHY.LLPFIIAGASILHLAALHQYG
CYB_HETFR/1-205 ATVITNLLSAFPYIGDTLVQWIWGGFSIDNATLTRFFAFHF.LLPFLIIALTMLHFLFLHETG
CYB_STELO/1-204 ATVITNLLSAIPYIGTTLVEWIWGGFSVDKATLTRFFAFHF.ILPFIITALAAVHLLFLHETG
CYB_ASCSU/1-196 SVVITSLLSVIPVWGFAIVTWIWSGFTVSSATLKFFFVLHF.LVPWGLLLLVLLHLVFLHETG
CYB6_SPIOL/1-210 VKIVTGVPDAIPVIGSPLVELLRGSASVGQSTLTRFYSLHTFVLPLLTAVFMLMHFLMIRKQG
CYB6_MARPO/1-210 VKIVTGVPEAIPIIGSPLVELLRGSVSVGQSTLTRFYSLHTFVLPLLTAIFMLMHFLMIRKQG
CYB6_EUGGR/1-210 VKIVTGVPEAIPLIGNFIVELLRGSVSVGQSTLTRFYSLHTFVLPLLTATFMLGHFLMIRKQG
A PFAM alignment
![Page 76: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/76.jpg)
INTERPRO combined database
Because the underlying construction and analysis methods of the above domain family databases are different, the databases inevitably have different diagnostic strengths and weaknesses.
The InterPro database (Apweiler et al., 2000) is a collaboration between many of the domain database curators.
It aims to be a central resource reducing the amount of duplication between the databases.
Release 3.2 of InterPro contains 3,939 entries, representing 1,009 domains, 2,850 families, 65 repeats and 15 posttranslational modification sites. Entries are accompanied by regular expressions, profiles, fingerprints and Hidden Markov Models which facilitate sequence database searches.
![Page 77: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/77.jpg)
Databases integrated in INTERPRO:
The UniProt (Universal Protein Resource) is the world's most comprehensive catalog of information on proteins. It is a central repository of protein sequence and function created by joining the information contained in Swiss-Prot, TrEMBL, and PIR.
PROSITE is a database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs.
Pfam is a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains.
PRINTS is a compendium of protein fingerprints. A fingerprint is a group of conserved motifs used to characterise a protein family; its diagnostic power is refined by iterative scanning of UniProt. Usually the motifs do not overlap, but are separated along a sequence, though they may be contiguous in 3D-space. Fingerprints can encode protein folds and functionalities more flexibly and powerfully than can single motifs, their full diagnostic potency deriving from the mutual context afforded by motif neighbours.
The ProDom protein domain database consists of an automatic compilation of homologous domains. Current versions of ProDom are built using a novel procedure based on recursive PSI-BLAST searches (Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W & Lipman DJ, 1997, Nucleic Acids Res., 25:3389-3402; Gouzy J., Corpet F. & Kahn D., 1999, Computers and Chemistry 23:333-340.) Large families are much better processed with this new procedure than with the former DOMAINER program (Sonnhammer, E.L.L. & Kahn, D., 1994, Protein Sci., 3:482-492).
![Page 78: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/78.jpg)
Databases integrated in INTERPRO (Cont.):
SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 500 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database as well as search parameters and taxonomic information are stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa.
TIGRFAMs is a collection of protein families, featuring curated multiple sequence alignments, Hidden Markov Models (HMMs) and annotation, which provides a tool for identifying functionally related proteins based on sequence homology. Those entries which are "equivalogs" group homologous proteins which are conserved with respect to function.
PIR Superfamily (PIRSF) is a classification system based on evolutionary relationship of whole proteins. Members of a superfamily are monophyletic (evolved from a common evolutionary ancestor) and homeomorphic (homologous over the full-length sequence and sharing a common domain architecture). A protein may be assigned to one and only one superfamily. Curated superfamilies contain functional information, domain information, bibliography, and cross-references to other databases, as well as full-length and domain HMMs, multiple sequence alignments, and phylogenetic tree of seed members. PIRSF can be used for functional annotation of protein sequences.
SUPERFAMILY is a library of profile hidden Markov models that represent all proteins of known structure. The library is based on the SCOP classification of proteins: each model corresponds to a SCOP domain and aims to represent the entire SCOP superfamily that the domain belongs to. SUPERFAMILY has been used to carry out structural assignments to all completely sequenced genomes. The results and analysis are available from the SUPERFAMILY website.
![Page 79: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/79.jpg)
Domain structure databases
Several methods of structural classification have been developed to classify the large number of protein folds present in the PDB.
The most widely used and comprehensive databases are CATH, 3Dee, FSSP and SCOP, which use four unique methods to classify protein structures at the domain level.
![Page 80: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/80.jpg)
Examples of domain structure databases
CATH 3DEE FSSP SCOP
![Page 81: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/81.jpg)
CATH
The CATH domain database assigns domains based on a consensus approach using the three algorithms PUU (Holm and Sander, 1994), DETECTIVE (Swindells, 1995) and DOMAK (Siddiqui and Barton, 1995) as well as visual inspection (Jones et al., 1998). The CATH database release 2.3 contains approximately 30,000 domains ordered into five major levels: Class; Architecture; Topology/fold; Homologous superfamily; and Sequence family.
![Page 82: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/82.jpg)
CATH
Class covers α, β, and α/β proteins.
Architecture is the overall shape of a domain as defined by the packing of secondary structural elements, but ignoring their connectivity. The topology level consists of structures with the same number, arrangement and connectivity of secondary structure, based on structural superposition using the SSAP structure comparison algorithm (Taylor and Orengo, 1989). A homologous superfamily contains proteins having high structural similarity and similar functions, which suggests that they have evolved from a common ancestor. Finally, the sequence family level consists of proteins with sequence identities greater than 35%, again suggesting a common ancestor.
![Page 83: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/83.jpg)
CATH
CATH classifies domains into approximately 700 fold families; ten of these folds are highly populated and are referred to as ‘super-folds’.
Super-folds are defined as folds for which there are at least three structures without significant sequence similarity (Orengo et al., 1994).
The most populated is the α/β-barrel super-fold.
![Page 84: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/84.jpg)
3Dee
The 3Dee structural domain repository (Siddiqui et al., 2001) stores alternative domain definitions for the same protein and organises the domains into sequence and structural hierarchies. Most of the database creation and update processes are performed automatically using the DOMAK (Siddiqui and Barton, 1995) algorithm. However, some domains are manually assigned. It contains non-redundant sets of sequences and structures, multiple structure alignments for all domain families, secondary structure and fold name definitions. The current 3Dee release is now a few years old and contains 18,896 structural domains.
![Page 85: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/85.jpg)
FSSP
FSSP (Holm and Sander, 1997) is a complete comparison of all pairs of protein structures in the PDB. It is the basis for the Dali Domain Dictionary (Dietmann et al., 2001), a numerical taxonomy of all known structures in the PDB.
The taxonomy is derived automatically from measurements of structural, functional and sequence similarities.
The database is split into four hierarchical levels corresponding to super-secondary structural motifs, the topology of globular domains, remote homologues (functional families) and sequence families.
![Page 86: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/86.jpg)
FSSP
The top level of the fold classification corresponds to secondary structure composition and super-secondary structural motifs. Domains are assigned by the PUU algorithm (Holm and Sander, 1994) and classified into one of five ‘attractors’, which can be characterised as all-α, all-β, α/β, α-β meander, and antiparallel β-barrels. Domains which cannot be clearly assigned to a single attractor are placed in a mixed class.
In September 2000, the Dali classification contained 17,101 chains, 1,375 fold types and 3,724 domain sequence families. The database contains definitions of structurally conserved cores and a library of multiple alignments of distantly related protein families.
![Page 87: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/87.jpg)
SCOP
The SCOP database (Structural Classification of Proteins) is a manual classification of protein structure (Murzin et al., 1995). The classification is at the domain level for many proteins, but in general, a protein is only split into domains when there is a clear indication that the individual domains may have existed as independent proteins.
Therefore, many of the domain definitions in SCOP will be different to those in the other structural domain databases. The principal levels of hierarchy are family, superfamily and fold, split into the traditional four domain classes, all-α, all-β, α+β and α/β.
Release 1.55 of the SCOP database contains 13,220 PDB entries, 605 fold types and 31,474 domains.
![Page 88: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/88.jpg)
Gene Ontology (GO)
Not a genome sequence database
Developing three structured, controlled vocabularies (ontologies) to describe gene products in terms of:
– biological process
– cellular component
– molecular function
in a species-independent manner
![Page 89: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/89.jpg)
The GO ontology
![Page 90: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/90.jpg)
Gene Ontology Members
FlyBase - database for the fruitfly Drosophila melanogaster
Berkeley Drosophila Genome Project (BDGP) - Drosophila informatics; GO database & software, Sequence Ontology development
Saccharomyces Genome Database (SGD) - database for the budding yeast Saccharomyces cerevisiae
Mouse Genome Database (MGD) & Gene Expression Database (GXD) - databases for the mouse Mus musculus
The Arabidopsis Information Resource (TAIR) - database for the brassica family plant Arabidopsis thaliana
WormBase - database for the nematode Caenorhabditis elegans
EBI GOA project - annotation of UniProt (Swiss-Prot/TrEMBL/PIR) and InterPro databases
Rat Genome Database (RGD) - database for the rat Rattus norvegicus
DictyBase - informatics resource for the slime mold Dictyostelium discoideum
GeneDB S. pombe - database for the fission yeast Schizosaccharomyces pombe (part of the Pathogen Sequencing Unit at the Wellcome Trust Sanger Institute)
GeneDB for protozoa - databases for Plasmodium falciparum, Leishmania major, Trypanosoma brucei, and several other protozoan parasites (part of the Pathogen Sequencing Unit at the Wellcome Trust Sanger Institute)
Genome Knowledge Base (GK) - a collaboration between Cold Spring Harbor Laboratory and EBI
TIGR - The Institute for Genomic Research
Gramene - A Comparative Mapping Resource for Monocots
Compugen (with its Internet Research Engine)
The Zebrafish Information Network (ZFIN) - reference datasets and information on Danio rerio
![Page 91: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/91.jpg)
Protein interaction database
There are numerous databases of protein-protein interactions
DIP is a popular protein-protein interaction database
The DIP database catalogs experimentally determined interactions between proteins. It combines information from a variety of sources to create a single, consistent set of protein-protein interactions.
![Page 92: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/92.jpg)
Protein interaction databases
BIND - Biomolecular Interaction Network Database
DIP - Database of Interacting Proteins
PIM - Hybrigenics
PathCalling Yeast Interaction Database
MINT - a Molecular Interactions Database
GRID - The General Repository for Interaction Datasets
InterPreTS - protein interaction prediction through tertiary structure
STRING - predicted functional associations among genes/proteins
Mammalian protein-protein interaction database (PPI)
InterDom - database of putative interacting protein domains
FusionDB - database of bacterial and archaeal gene fusion events
IntAct Project
The Human Protein Interaction Database (HPID)
ADVICE - Automated Detection and Validation of Interaction by Co-evolution
InterWeaver - protein interaction reports with online evidence
PathBLAST - alignment of protein interaction networks
ClusPro - a fully automated algorithm for protein-protein docking
HPRD - Human Protein Reference Database
![Page 93: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/93.jpg)
Protein interaction database
![Page 94: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/94.jpg)
Network of protein interactions and predicted functional links involving silencing information regulator (SIR) proteins. Filled circles represent proteins of known function; open circles represent proteins of unknown function, represented only by their Saccharomyces genome sequence numbers (http://genome-www.stanford.edu/Saccharomyces). Solid lines show experimentally determined interactions, as summarized in the Database of Interacting Proteins (ref. 19) (http://dip.doe-mbi.ucla.edu). Dashed lines show functional links predicted by the Rosetta Stone method (ref. 12). Dotted lines show functional links predicted by phylogenetic profiles (ref. 16). Some predicted links are omitted for clarity.
![Page 95: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/95.jpg)
Network of predicted functional linkages involving the yeast prion protein Sup35 (ref. 20). The dashed line shows the only experimentally determined interaction. The other functional links were calculated from genome and expression data (ref. 11) by a combination of methods, including phylogenetic profiles, Rosetta Stone linkages and mRNA expression. Linkages predicted by more than one method, and hence particularly reliable, are shown by heavy lines. Adapted from ref. 11.
![Page 96: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/96.jpg)
STRING - predicted functional associations among genes/proteins
STRING is a database of predicted functional associations among genes/proteins.
Genes of similar function tend to be maintained in close neighborhood, tend to be present or absent together, i.e. to have the same phylogenetic occurrence, and can sometimes be found fused into a single gene encoding a combined polypeptide.
STRING integrates this information from as many genomes as possible to predict functional links between proteins.
Berend Snel and Martijn Huynen (RUN) and the group of Peer Bork (EMBL, Heidelberg)
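The "present or absent together" signal described above can be sketched as a comparison of phylogenetic profiles. The gene names, the six genomes and the presence/absence values below are invented for illustration, and the simple matching fraction is an assumption rather than the scoring scheme STRING itself uses.

```python
# Hypothetical presence (1) / absence (0) of four genes across six genomes.
profiles = {
    "geneA": (1, 1, 0, 1, 0, 1),
    "geneB": (1, 1, 0, 1, 0, 1),
    "geneC": (0, 1, 1, 0, 1, 0),
    "geneD": (1, 1, 0, 1, 1, 1),
}


def profile_similarity(p, q):
    """Fraction of genomes in which two genes are both present or both absent."""
    matches = sum(1 for a, b in zip(p, q) if a == b)
    return matches / len(p)


def co_occurring_pairs(profiles, threshold=0.8):
    """Return (gene, gene, similarity) for pairs with similar profiles."""
    names = sorted(profiles)
    pairs = []
    for i, g in enumerate(names):
        for h in names[i + 1:]:
            s = profile_similarity(profiles[g], profiles[h])
            if s >= threshold:
                pairs.append((g, h, s))
    return pairs


for g, h, s in co_occurring_pairs(profiles):
    print(f"{g} - {h}: {s:.2f}")
```

Pairs that pass the threshold are candidates for a functional link; the threshold of 0.8 is arbitrary and would need tuning on real data.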
![Page 97: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/97.jpg)
STRING - predicted functional associations among genes/proteins
STRING is a database of known and predicted protein-protein interactions. The interactions include direct (physical) and indirect (functional) associations; they are derived from four sources:
1. Genomic Context (Synteny)
2. High-throughput Experiments
3. (Conserved) Co-expression
4. Previous Knowledge
STRING quantitatively integrates interaction data from these sources for a large number of organisms, and transfers information between these organisms where applicable. The database currently contains 736429 proteins in 179 species.
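The slides do not say how the evidence sources are integrated. One simple scheme, shown here only as an illustration, treats each source's score as an independent probability that the link is real and asks how likely it is that at least one source is right. The function name, the source labels and the scores are hypothetical.

```python
def combined_score(scores):
    """Combine per-source confidence scores, assuming the sources are independent.

    Each score is read as the probability that the link is real according to
    one source; the result is 1 minus the probability that all sources are wrong.
    """
    p_all_wrong = 1.0
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError("scores must lie between 0 and 1")
        p_all_wrong *= 1.0 - s
    return 1.0 - p_all_wrong


# Hypothetical scores for one protein pair, one per evidence source.
evidence = {
    "genomic_context": 0.4,
    "experiments": 0.5,
    "coexpression": 0.2,
    "previous_knowledge": 0.0,
}
print(round(combined_score(evidence.values()), 2))
```

Under this assumption several weak sources can add up to a fairly confident link, which matches the idea that linkages predicted by more than one method are more reliable.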
![Page 98: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/98.jpg)
STRING - predicted functional associations among genes/proteins
Conserved Neighborhood
This view shows runs of genes that occur repeatedly in close neighborhood in (prokaryotic) genomes. Genes located together in a run are linked with a black line (maximum allowed intergenic distance is 300 bp). Note that if there are multiple runs for a given species, these are separated by white space. If there are other genes in the run that are below the current score threshold, they are drawn as small white triangles. Gene fusion occurrences are also drawn, but only if they are present in a run (see also the Fusion section below for more details).
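The distance rule above can be sketched as a small grouping routine. This is a simplified illustration: it handles a single genome, ignores strand, and does not check that a run recurs across genomes. The gene names and coordinates are made up.

```python
def gene_runs(genes, max_gap=300):
    """Group genes into runs separated by at most max_gap bp.

    genes: list of (name, start, end) tuples on one chromosome.
    Returns a list of runs, each a list of gene tuples in genomic order.
    """
    runs = []
    for gene in sorted(genes, key=lambda g: g[1]):
        start = gene[1]
        # Intergenic distance to the previous gene in the current run.
        if runs and start - runs[-1][-1][2] <= max_gap:
            runs[-1].append(gene)
        else:
            runs.append([gene])
    return runs


# Hypothetical gene coordinates (bp).
genes = [
    ("g1", 100, 900),
    ("g2", 1000, 1800),
    ("g3", 2500, 3300),
    ("g4", 3400, 4000),
]
for run in gene_runs(genes):
    print([name for name, _, _ in run])
```

Here g1 and g2 are 100 bp apart and form one run, while the 700 bp gap before g3 starts a new one.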
![Page 99: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/99.jpg)
Functional inference at systems level
Function prediction of individual genes could be made in the context of biological pathways/networks
Example – phoB is predicted to be a transcription regulator and it regulates all the genes in the pho-regulon (a group of co-regulated operons); and within this regulon, gene A is interacting with gene B, etc.
![Page 100: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/100.jpg)
Functional inference at systems level
KEGG is a database of biological pathways and networks
![Page 101: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/101.jpg)
Functional inference at systems level
![Page 102: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/102.jpg)
Functional inference at systems level
![Page 103: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/103.jpg)
Functional inference at systems level
By homology searching, one can map a known biological pathway from one organism onto another, and hence predict gene functions in the context of biological pathways/networks
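The mapping step can be sketched as a lookup from reference genes to their homologs in the target organism. The pathway steps, gene identifiers and homolog table below are hypothetical; in practice the table would come from a homology search such as BLAST.

```python
def map_pathway(reference_pathway, homologs):
    """Transfer a reference pathway onto a target organism.

    reference_pathway: list of (step, reference_gene) pairs.
    homologs: dict mapping a reference gene to its homolog in the target
              organism; genes without a detected homolog are simply absent.
    Returns (mapped, missing): step -> target gene, and the steps with no homolog.
    """
    mapped, missing = {}, []
    for step, ref_gene in reference_pathway:
        target_gene = homologs.get(ref_gene)
        if target_gene is None:
            missing.append(step)
        else:
            mapped[step] = target_gene
    return mapped, missing


# Hypothetical three-step pathway and homology search results.
reference_pathway = [("step1", "refA"), ("step2", "refB"), ("step3", "refC")]
homologs = {"refA": "tgt_0012", "refC": "tgt_0345"}

mapped, missing = map_pathway(reference_pathway, homologs)
print(mapped)
print(missing)
```

Steps listed as missing are either truly absent from the target organism or carried out by a gene too diverged to be detected, so they are worth a closer look.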
![Page 104: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/104.jpg)
Wrapping up
We have seen a number of ways to infer a putative function for a protein sequence
To gain confidence, it is important to combine as many different prediction protocols as possible (the STRING server is an example of this)
![Page 105: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/105.jpg)
Homework
Give an example of two proteins that have the same structural fold but different biological functions, by searching SCOP and Swiss-Prot
What is the biological function of phoR in the two-component system of prokaryotic organisms, based on a KEGG database search?
![Page 106: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/106.jpg)
Introduction to Bioinformatics
Lecture 17: Protein function
![Page 107: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/107.jpg)
Domain fusion
For example, vertebrates have a multi-enzyme protein (GARs-AIRs-GARt) comprising the enzymes GAR synthetase (GARs), AIR synthetase (AIRs), and GAR transformylase (GARt) [1].
In insects, the polypeptide appears as GARs-(AIRs)2-GARt. However, GARs-AIRs is encoded separately from GARt in yeast, and in bacteria each domain is encoded separately (Henikoff et al., 1997).
[1] GAR: glycinamide ribonucleotide; AIR: aminoimidazole ribonucleotide
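The example above is the kind of evidence the Rosetta Stone method uses: domains that are separate proteins in one organism but fused in another are likely to be functionally linked. A minimal sketch follows, using the architectures from the slide. Reading "(AIRs)2" as two AIRs copies is an assumption, and the function names are hypothetical.

```python
from itertools import combinations

# Each inner tuple is one polypeptide, listed as its domains in order.
architectures = {
    "vertebrates": [("GARs", "AIRs", "GARt")],
    "insects": [("GARs", "AIRs", "AIRs", "GARt")],
    "yeast": [("GARs", "AIRs"), ("GARt",)],
    "bacteria": [("GARs",), ("AIRs",), ("GARt",)],
}


def fused_pairs(proteins):
    """Pairs of distinct domains that occur on the same polypeptide."""
    pairs = set()
    for domains in proteins:
        for a, b in combinations(sorted(set(domains)), 2):
            pairs.add((a, b))
    return pairs


def rosetta_stone_links(architectures, organism):
    """Domain pairs that are separate in `organism` but fused elsewhere."""
    own = fused_pairs(architectures[organism])
    present = {d for protein in architectures[organism] for d in protein}
    elsewhere = set()
    for other, proteins in architectures.items():
        if other != organism:
            elsewhere |= fused_pairs(proteins)
    return sorted(
        pair for pair in elsewhere - own
        if pair[0] in present and pair[1] in present
    )


print(rosetta_stone_links(architectures, "bacteria"))
print(rosetta_stone_links(architectures, "yeast"))
```

For bacteria all three domain pairs are predicted to be linked, while for yeast only the pairs involving the separately encoded GARt remain to be inferred.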
![Page 108: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/108.jpg)
Domain fusion
Genetic mechanisms influencing the layout of multidomain proteins include gross rearrangements such as inversions, translocations, deletions and duplications, homologous recombination, and slippage of DNA polymerase during replication (Bork et al., 1992).
Although genetically conceivable, the transition from two single-domain proteins to a multidomain protein requires that both domains fold correctly and that they manage to bury a fraction of the previously solvent-exposed surface area in a newly generated inter-domain surface.
![Page 109: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/109.jpg)
Pathways and Pathway Diagrams
Pathways
– Set of nodes (entities) and edges (associations)
Pathway Diagrams
– XY coordinates
– Node splitting allowed
– Multiple views of the same pathway
– Different abstraction levels
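The distinction between a pathway and its diagrams can be sketched as two small data structures: the pathway holds only nodes and edges, while each diagram adds coordinates and may draw the same node more than once. All class, field and molecule names below are illustrative assumptions, not taken from any particular pathway database.

```python
from dataclasses import dataclass, field


@dataclass
class Pathway:
    """A set of nodes (entities) and edges (associations)."""
    nodes: set = field(default_factory=set)
    edges: set = field(default_factory=set)  # (source, target, association)

    def add_edge(self, source, target, association):
        self.nodes.update((source, target))
        self.edges.add((source, target, association))


@dataclass
class PathwayDiagram:
    """One drawing of a pathway: XY coordinates at a given abstraction level."""
    pathway: Pathway
    abstraction_level: str = "detailed"
    glyphs: list = field(default_factory=list)  # (node, x, y)

    def place(self, node, x, y):
        if node not in self.pathway.nodes:
            raise KeyError(node)
        # Node splitting: the same node may be placed several times.
        self.glyphs.append((node, x, y))

    def positions(self, node):
        return [(x, y) for n, x, y in self.glyphs if n == node]


pathway = Pathway()
pathway.add_edge("glucose", "glucose-6-phosphate", "conversion")
pathway.add_edge("ATP", "glucose-6-phosphate", "phosphate donor")

# Two views of the same pathway at different abstraction levels.
detailed = PathwayDiagram(pathway, "detailed")
overview = PathwayDiagram(pathway, "overview")
detailed.place("ATP", 0, 1)
detailed.place("ATP", 5, 1)  # ATP drawn twice to avoid long crossing edges
print(detailed.positions("ATP"))
```

Keeping the layout separate from the pathway means several diagrams can share one underlying network.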
![Page 110: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/110.jpg)
KEGG database (Japan)
Metabolic networks
Glycolysis and Gluconeogenesis
![Page 111: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/111.jpg)
Introduction to Bioinformatics
Lecture 16: Domains, their prediction and domain databases
![Page 112: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/112.jpg)
Sequence-Structure-Function
[Diagram: Sequence, Structure and Function linked by labelled arrows. Method labels: Threading; Homology searching (BLAST); Ab initio prediction and folding; Function prediction from structure. Difficulty notes on the arrows: "impossible but for the smallest structures"; "very difficult".]
![Page 113: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/113.jpg)
TERTIARY STRUCTURE (fold)
Genome
Expressome
Proteome
Metabolome
Functional Genomics – Systems Biology
Metabolomics
fluxomics
![Page 114: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/114.jpg)
Systems Biology
is the study of the interactions between the components of a biological system, and how these interactions give rise to the function and behaviour of that system (for example, the enzymes and metabolites in a metabolic pathway). The aim is to quantitatively understand the system and to be able to predict the system’s time processes
• the interactions are nonlinear
• the interactions give rise to emergent properties, i.e. properties that cannot be explained by the components in the system
• Biological processes include many time-scales, many compartments and many interconnected network levels (e.g. regulation, signalling, expression, ...)
![Page 115: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/115.jpg)
Systems Biology
understanding is often achieved through modeling and simulation of the system’s components and interactions.
Many times, the ‘four Ms’ cycle is adopted:
Measuring
Mining
Modeling
Manipulating
![Page 116: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/116.jpg)
‘The silicon cell’
(some people think ‘silly-con’ cell)
![Page 117: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/117.jpg)
![Page 118: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/118.jpg)
A system response
Apoptosis: programmed cell death
Necrosis: accidental cell death
![Page 119: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/119.jpg)
This pathway diagram shows a comparison of pathways in (left) Homo sapiens (human) and (right) Saccharomyces cerevisiae (baker’s yeast). Changes in controlling enzymes (square boxes in red) and the pathway itself have occurred (yeast has one altered (‘overtaking’) path in the graph)
We need to be able to do automatic pathway comparison (pathway alignment)
Human Yeast
‘Comparative metabolomics’
Important difference with human pathway
![Page 120: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/120.jpg)
Experimental
Structural genomics
Functional genomics
Protein-protein interaction
Metabolic pathways
Expression data
![Page 121: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/121.jpg)
Issues when elucidating function experimentally
Partial information (indirect interactions) and subsequent filling of the missing steps
Negative results (elements that have been shown not to interact, enzymes missing in an organism)
Putative interactions resulting from computational analyses
![Page 122: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/122.jpg)
Protein function categories
• Catalysis (enzymes)
• Binding – transport (active/passive)
– Protein-DNA/RNA binding (e.g. histones, transcription factors)
– Protein-protein interactions (e.g. antibody-lysozyme) (experimentally determined by yeast two-hybrid (Y2H) or bacterial two-hybrid (B2H) screening)
– Protein-fatty acid binding (e.g. apolipoproteins)
– Protein – small molecules (drug interaction, structure decoding)
• Structural component (e.g. crystallin)
• Regulation
• Signalling
• Transcription regulation
• Immune system
• Motor proteins (actin/myosin)
![Page 123: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/123.jpg)
Catalytic properties of enzymes
[Plot: reaction rate V (moles/s) against substrate concentration [S]; marked values are Vmax, Vmax/2 and Km]
Reaction scheme (Km labels the binding step, kcat the catalytic step):
E + S ⇌ ES → E + P
E = enzyme
S = substrate
ES = enzyme-substrate complex (transition state)
P = product
Km = Michaelis constant
kcat = catalytic rate constant (turnover number)
kcat/Km = specificity constant (useful for comparison)
Michaelis-Menten equation:
$$V = \frac{V_{max}\,[S]}{K_m + [S]}$$
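The equation can be checked numerically with a few lines of code. The function name and the values of Vmax and Km are hypothetical, chosen only to show that the rate is half of Vmax when [S] equals Km and approaches Vmax at high substrate concentration.

```python
def michaelis_menten_rate(s, vmax, km):
    """Reaction rate V = Vmax * [S] / (Km + [S])."""
    if s < 0 or km <= 0:
        raise ValueError("[S] must be non-negative and Km positive")
    return vmax * s / (km + s)


# Hypothetical enzyme parameters.
vmax = 10.0  # moles/s
km = 2.0     # same concentration units as [S]

for s in (0.0, km, 10 * km, 1000 * km):
    print(f"[S] = {s:7.1f}  V = {michaelis_menten_rate(s, vmax, km):.3f}")
```

A low Km therefore means the enzyme reaches half its maximal rate at a low substrate concentration.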
![Page 124: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/124.jpg)
Protein interaction domains
http://pawsonlab.mshri.on.ca/html/domains.html
![Page 125: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/125.jpg)
Energy difference upon binding
Examples of protein interactions (and of functional importance) include:
Protein – protein (pathway analysis)
Protein – small molecules (drug interaction, structure decoding)
Protein – peptides, DNA/RNA
The change in Gibbs free energy of the protein-ligand binding interaction can be monitored and expressed by the following equation:
$$\Delta G = \Delta H - T\,\Delta S$$
(H = enthalpy, S = entropy and T = temperature)
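A short worked example of the free energy equation follows. The enthalpy and entropy values are hypothetical. The conversion to a dissociation constant uses the standard relation ΔG° = RT ln Kd, which is not on the slide and assumes a 1 M standard state.

```python
import math

R = 8.314  # gas constant in J/(mol K)


def gibbs_free_energy_change(delta_h, temperature, delta_s):
    """Delta G = Delta H - T * Delta S (J/mol, K, J/(mol K))."""
    return delta_h - temperature * delta_s


def dissociation_constant(delta_g, temperature):
    """Kd in mol/l from the standard relation Delta G = R * T * ln(Kd)."""
    return math.exp(delta_g / (R * temperature))


# Hypothetical binding event: favourable enthalpy, unfavourable entropy.
delta_h = -50_000.0  # J/mol
delta_s = -50.0      # J/(mol K)
temperature = 298.15  # K

delta_g = gibbs_free_energy_change(delta_h, temperature, delta_s)
print(f"Delta G = {delta_g / 1000:.1f} kJ/mol")
print(f"Kd = {dissociation_constant(delta_g, temperature):.2e} M")
```

Binding is favourable when ΔG is negative; in this example the enthalpy gain outweighs the entropy loss.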
![Page 126: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/126.jpg)
![Page 127: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/127.jpg)
Protein-protein interaction networks
![Page 128: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/128.jpg)
Protein function
• Many proteins combine functions
• Some immunoglobulin structures are thought to have more than 100 different functions (and active/binding sites)
• Alternative splicing can generate (partially) alternative structures
![Page 129: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/129.jpg)
Protein function & Interaction
Active site / binding cleft
Shape complementarity
![Page 130: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/130.jpg)
Protein function evolution
Chymotrypsin
![Page 131: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/131.jpg)
How to infer function
• Experiment
• Deduction from sequence
– Multiple sequence alignment – conservation patterns
– Homology searching
• Deduction from structure
– Threading
– Structure-structure comparison
– Homology modelling
![Page 132: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/132.jpg)
Cholesterol Biosynthesis:
Cholesterol biosynthesis primarily occurs in eukaryotic cells. It is necessary for membrane synthesis, and is a precursor for steroid hormone production as well as for vitamin D. While the pathway had previously been assumed to be localized in the cytosol and ER, more recent evidence suggests that many of the enzymes in the pathway exist largely, if not exclusively, in the peroxisome (the enzymes listed in blue in the pathway to the left are thought to be at least partly peroxisomal). Patients with peroxisome biogenesis disorders (PBDs) have a variable deficiency in cholesterol biosynthesis.
![Page 133: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/133.jpg)
Mevalonate plays a role in epithelial cancers: it can inhibit EGFR
Cholesterol Biosynthesis: from acetyl-CoA to mevalonate
![Page 134: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/134.jpg)
Epidermal Growth Factor as a Clinical Target in Cancer
A malignant tumour is the product of uncontrolled cell proliferation. Cell growth is controlled by a delicate balance between growth-promoting and growth-inhibiting factors. In normal tissue the production and activity of these factors results in differentiated cells growing in a controlled and regulated manner that maintains the normal integrity and functioning of the organ. The malignant cell has evaded this control; the natural balance is disturbed (via a variety of mechanisms) and unregulated, aberrant cell growth occurs. A key driver for growth is the epidermal growth factor (EGF) and the receptor for EGF (the EGFR) has been implicated in the development and progression of a number of human solid tumours including those of the lung, breast, prostate, colon, ovary, head and neck.
![Page 135: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/135.jpg)
Energy housekeeping: Adenosine diphosphate (ADP) – Adenosine triphosphate (ATP)
![Page 136: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/136.jpg)
Chemical Reaction
![Page 137: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/137.jpg)
Add Enzymatic Catalysis
![Page 138: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/138.jpg)
Add Gene Expression
![Page 139: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/139.jpg)
Add Inhibition
![Page 140: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/140.jpg)
Metabolic Pathway: Proline Biosynthesis
Proline as end product effects a negative feedback loop
![Page 141: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/141.jpg)
Transcriptional Regulation
![Page 142: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/142.jpg)
Methionine Biosynthesis in E. coli
![Page 143: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/143.jpg)
Shortcut Representation
![Page 144: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/144.jpg)
High-level Interaction representation
![Page 145: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/145.jpg)
Levels of Resolution
![Page 146: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/146.jpg)
SREBP Pathway
![Page 147: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/147.jpg)
Signal Transduction
Important signalling pathways: MAP kinase (MAPK) signalling pathway, or TGF-β pathway
![Page 148: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/148.jpg)
Transport
![Page 149: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/149.jpg)
Phosphate Utilization in Yeast
![Page 150: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/150.jpg)
Multiple Levels of Regulation
• Gene expression
• Protein posttranslational modification
• Protein activity
• Protein intracellular location
• Protein degradation
• Substrate transport
![Page 151: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/151.jpg)
Graphical Representation – Gene Expression
![Page 152: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/152.jpg)
Protein interaction domains
http://pawsonlab.mshri.on.ca/index.php?option=com_content&task=view&id=30&Itemid=63
![Page 153: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/153.jpg)
Domain function
Active site / binding cleft
![Page 154: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/154.jpg)
Protein-protein (domain-domain) interaction
Shape complementarity
![Page 155: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/155.jpg)
A domain is a:
Compact, semi-independent unit (Richardson, 1981).
Stable unit of a protein structure that can fold autonomously (Wetlaufer, 1973).
Recurring functional and evolutionary module (Bork, 1992).
“Nature is a tinkerer and not an inventor” (Jacob, 1977).
Smallest unit of function
![Page 156: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/156.jpg)
Delineating domains is essential for:
• Obtaining high resolution structures (x-ray but particularly NMR – size of proteins)
• Sequence analysis
• Multiple sequence alignment methods
• Prediction algorithms (SS, Class, secondary/tertiary structure)
• Fold recognition and threading
• Elucidating the evolution, structure and function of a protein family (e.g. ‘Rosetta Stone’ method)
• Structural/functional genomics
• Cross genome comparative analysis
![Page 157: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/157.jpg)
Domain connectivity
linker
![Page 158: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/158.jpg)
Pyruvate kinase
Phosphotransferase
barrel regulatory domain
barrel catalytic substrate binding domain
nucleotide binding domain
1 continuous + 2 discontinuous domains
Structural domain organisation can be nasty…
![Page 159: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/159.jpg)
Domain size
The size of individual structural domains varies widely
– from 36 residues in E-selectin to 692 residues in lipoxygenase-1 (Jones et al., 1998)
– the majority (90%) having less than 200 residues (Siddiqui and Barton, 1995)
– with an average of about 100 residues (Islam et al., 1995).
Small domains (less than 40 residues) are often stabilised by metal ions or disulphide bonds.
Large domains (greater than 300 residues) are likely to consist of multiple hydrophobic cores (Garel, 1992).
![Page 160: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/160.jpg)
![Page 161: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/161.jpg)
![Page 162: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/162.jpg)
![Page 163: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/163.jpg)
![Page 164: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/164.jpg)
![Page 165: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/165.jpg)
![Page 166: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/166.jpg)
![Page 167: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/167.jpg)
Analysis of chain hydrophobicity in multidomain proteins
![Page 168: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/168.jpg)
Analysis of chain hydrophobicity in multidomain proteins
![Page 169: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/169.jpg)
Domain characteristics
Domains are genetically mobile units, and multidomain families are found in all three kingdoms (Archaea, Bacteria and Eukarya) underlining the finding that ‘Nature is a tinkerer and not an inventor’ (Jacob, 1977). The majority of genomic proteins, 75% in unicellular organisms and more than 80% in metazoa, are multidomain proteins created as a result of gene duplication events (Apic et al., 2001). Domains in multidomain structures are likely to have once existed as independent proteins, and many domains in eukaryotic multidomain proteins can be found as independent proteins in prokaryotes (Davidson et al., 1993).
![Page 170: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/170.jpg)
Protein function evolution: gene (domain) duplication
Chymotrypsin
Active site
![Page 171: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/171.jpg)
Pyruvate phosphate dikinase
3-domain protein
Two domains catalyse a 2-step reaction: A → B → C
Third so-called ‘swivelling domain’ actively brings the intermediate enzymatic product (B) over 45 Å from one active site to the other
![Page 172: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/172.jpg)
![Page 173: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/173.jpg)
The DEATH Domain
• Present in a variety of eukaryotic proteins involved with cell death.
• Six helices enclose a tightly packed hydrophobic core.
• Some DEATH domains form homotypic and heterotypic dimers.
http://www.mshri.on.ca/pawson
![Page 174: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/174.jpg)
Detecting Structural Domains
A structural domain may be detected as a compact, globular substructure with more interactions within itself than with the rest of the structure (Janin and Wodak, 1983).
Therefore, a structural domain can be determined by two shape characteristics: compactness and its extent of isolation (Tsai and Nussinov, 1997).
Measures of local compactness in proteins have been used in many of the early methods of domain assignment (Rossmann et al., 1974; Crippen, 1978; Rose, 1979; Go, 1978) and in several of the more recent methods (Holm and Sander, 1994; Islam et al., 1995; Siddiqui and Barton, 1995; Zehfus, 1997; Taylor, 1999).
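The idea of "more interactions within itself than with the rest of the structure" can be illustrated with a toy contact count. The sketch below is not any of the cited methods: it assumes Cα coordinates are available, uses a hypothetical 8 Å contact cutoff, and only considers splitting a chain into two continuous segments.

```python
import math

def contact_pairs(coords, cutoff=8.0):
    """Return all residue pairs (i, j), i < j, within `cutoff` of each other."""
    n = len(coords)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if math.dist(coords[i], coords[j]) <= cutoff]

def split_contacts(coords, split, cutoff=8.0):
    """Count contacts inside the two segments [0, split) and [split, n),
    and contacts that cross the split point."""
    intra = inter = 0
    for i, j in contact_pairs(coords, cutoff):
        if (i < split) == (j < split):
            intra += 1
        else:
            inter += 1
    return intra, inter

def best_split(coords, cutoff=8.0, min_size=3):
    """Pick the split point with the fewest crossing contacts
    (a crude stand-in for 'extent of isolation')."""
    n = len(coords)
    candidates = range(min_size, n - min_size + 1)
    return min(candidates, key=lambda s: split_contacts(coords, s, cutoff)[1])

# Toy chain: two tight clusters of 10 residues, far apart from each other.
toy = [(0.5 * k, 0.0, 0.0) for k in range(10)] + \
      [(100.0 + 0.5 * k, 0.0, 0.0) for k in range(10)]
print(best_split(toy), split_contacts(toy, 10))
```

Real methods also score compactness and handle discontinuous domains, which a single split point cannot represent.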
![Page 175: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/175.jpg)
Detecting Structural Domains
However, approaches encounter problems when faced with discontinuous or highly associated domains and many definitions will require manual interpretation.
Consequently there are discrepancies between assignments made by domain databases (Hadley and Jones, 1999).
![Page 176: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/176.jpg)
Detecting Domains using Sequence only
Even more difficult than prediction from structure!
![Page 177: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/177.jpg)
SnapDRAGON
Richard A. George
George R.A. and Heringa, J. (2002) J. Mol. Biol., 316, 839-851.
Integrating protein multiple sequence alignment, secondary and tertiary structure prediction in order to predict structural domain boundaries in sequence data
![Page 178: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/178.jpg)
Protein structure hierarchical levels
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
PRIMARY STRUCTURE (amino acid sequence)
QUATERNARY STRUCTURE
SECONDARY STRUCTURE (helices, strands)
TERTIARY STRUCTURE (fold)
![Page 179: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/179.jpg)
![Page 180: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/180.jpg)
![Page 181: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/181.jpg)
![Page 182: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/182.jpg)
SNAPDRAGON
Domain boundary prediction protocol using sequence information alone (Richard George)
1. Input: Multiple sequence alignment (MSA) and predicted secondary structure
2. Generate 100 DRAGON 3D models for the protein structure associated with the MSA
3. Assign domain boundaries to each of the 3D models (Taylor, 1999)
4. Sum proposed boundary positions within 100 models along the length of the sequence, and smooth boundaries using a weighted window
George R.A. and Heringa J. (2002) SnapDRAGON - a method to delineate protein structural domains from sequence data, J. Mol. Biol. 316, 839-851.
![Page 183: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/183.jpg)
SnapDRAGON
[Figure: pipeline from the multiple alignment and predicted secondary structure (e.g. CCHHHCCEEE), via folds generated by DRAGON and boundary recognition (Taylor, 1999), to summed and smoothed boundaries]
![Page 184: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/184.jpg)
SNAPDRAGON
Domain boundary prediction protocol using sequence information alone (Richard George)
1. Input: Multiple sequence alignment (MSA) and predicted secondary structure
– Sequence searches using PSI-BLAST (Altschul et al., 1997)
– followed by sequence redundancy filtering using OBSTRUCT (Heringa et al., 1992)
– and alignment by PRALINE (Heringa, 1999)
– Secondary structure predicted with the PREDATOR secondary structure prediction program
![Page 185: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/185.jpg)
Domain prediction using DRAGON
DRAGON = Distance Regularisation Algorithm for Geometry OptimisatioN (Aszodi & Taylor, 1994)
• Fold proteins based on the requirement that (conserved) hydrophobic residues cluster together.
• First construct a random high-dimensional Cα distance matrix.
• Distance geometry is used to find the 3D conformation corresponding to a prescribed target matrix of desired distances between residues.
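As background for the distance geometry step, the textbook metric-matrix embedding shows how coordinates can be recovered from a matrix of target distances. This is the classical construction, given here as a sketch; DRAGON's own projection scheme differs in its details and is not reproduced. The symbols $d_{ij}$, $g_{ij}$, $V_3$ and $\Lambda_3$ are introduced for illustration.

For $N$ residues with target distances $d_{ij}$, the squared distance of residue $i$ to the centroid (index 0) is

$$d_{i0}^2 = \frac{1}{N}\sum_{j=1}^{N} d_{ij}^2 - \frac{1}{N^2}\sum_{j<k} d_{jk}^2 .$$

The metric (Gram) matrix of the centred coordinates then follows from the cosine rule:

$$g_{ij} = \tfrac{1}{2}\left(d_{i0}^2 + d_{j0}^2 - d_{ij}^2\right).$$

Diagonalising $G = V \Lambda V^{\mathsf T}$ and keeping the three largest eigenvalues gives 3D coordinates

$$X = V_3\, \Lambda_3^{1/2},$$

where $V_3$ holds the corresponding three eigenvectors as columns. If the target distances are exactly realisable in 3D, only three eigenvalues are non-zero; otherwise the projection gives a least-squares style approximation.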
![Page 186: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/186.jpg)
SNAPDRAGON
Domain boundary prediction protocol using sequence information alone (Richard George)
2. Generate 100 DRAGON (Aszodi & Taylor, 1994) models for the protein structure associated with the MSA
– DRAGON folds proteins based on the requirement that (conserved) hydrophobic residues cluster together
– (Predicted) secondary structures are used to further estimate distances between residues (e.g. between the first and last residue in a β-strand)
– It first constructs a random high-dimensional Cα (and pseudo-Cβ) distance matrix
– Distance geometry is used to find the 3D conformation corresponding to a prescribed matrix of desired distances between residues (by gradual inertia projection and based on input MSA and predicted secondary structure)
DRAGON = Distance Regularisation Algorithm for Geometry OptimisatioN
![Page 187: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/187.jpg)
• The Cα distance matrix is divided into smaller clusters.
• Separately, each cluster is embedded into a local centroid.
• The final predicted structure is generated from full embedding of the multiple centroids and their corresponding local structures.
[Figure: input data (multiple alignment and predicted secondary structure, e.g. CCHHHCCEEE); 100 randomised initial matrices; N×N Cα distance matrix and target matrix; 100 predictions]
![Page 188: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/188.jpg)
Lysozyme 4lzm
PDB
DRAGON
![Page 189: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/189.jpg)
Methyltransferase 1sfe
DRAGON
PDB
![Page 190: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/190.jpg)
Phosphatase 2hhm-A
PDB DRAGON
![Page 191: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/191.jpg)
Taylor method (1999): DOMAIN-3D
3. Assign domain boundaries to each of the 3D models (Taylor, 1999)
• Easy and clever method
• Uses a notion of spin glass theory (disordered magnetic systems) to delineate domains in a protein 3D structure
Steps:
1. Take sequence with residue numbers (1..N)
2. Look at neighbourhood of each residue (first shell)
3. If (“average neighbourhood residue number” > resno) resno = resno+1 else resno = resno-1
4. If (convergence) then take regions with identical “residue number” as domains and terminate
Taylor, W.R. (1999) Protein structural domain identification. Protein Engineering 12: 203-216
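The steps above can be sketched in a few lines. This is a simplified illustration, not Taylor's implementation: the 8 Å neighbourhood cutoff, the 0.5 dead band that stops labels flipping back and forth, and the rule that labels differing by at most 1 belong to the same domain are all assumptions made for the sketch.

```python
import math

def taylor_labels(coords, cutoff=8.0, max_iter=1000):
    """Iteratively move each residue's label one step towards the mean
    label of its spatial neighbourhood (self included)."""
    nbrs = [[j for j, q in enumerate(coords) if math.dist(p, q) <= cutoff]
            for p in coords]
    labels = [float(i) for i in range(len(coords))]
    for _ in range(max_iter):
        new = []
        for i, lab in enumerate(labels):
            mean = sum(labels[j] for j in nbrs[i]) / len(nbrs[i])
            if mean > lab + 0.5:      # neighbourhood average is higher: up 1
                new.append(lab + 1)
            elif mean < lab - 0.5:    # neighbourhood average is lower: down 1
                new.append(lab - 1)
            else:                     # within the dead band: leave unchanged
                new.append(lab)
        if new == labels:             # convergence: nothing moved
            break
        labels = new
    return labels

def domains_from_labels(labels, max_gap=1.0):
    """Group residues whose converged labels lie within `max_gap` of each other."""
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    domain = [0] * len(labels)
    current = 0
    for prev, idx in zip(order[:-1], order[1:]):
        if labels[idx] - labels[prev] > max_gap:
            current += 1
        domain[idx] = current
    return domain

# Toy structure: two compact clusters of 10 residues each, 100 Å apart.
toy = [(0.5 * k, 0.0, 0.0) for k in range(10)] + \
      [(100.0 + 0.5 * k, 0.0, 0.0) for k in range(10)]
print(domains_from_labels(taylor_labels(toy)))
```

Because the grouping works on label values rather than sequence position, a domain built from two separate chain segments can in principle receive a single label, which is how discontinuous domains fit into this scheme.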
![Page 192: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/192.jpg)
Taylor method (1999)
[Figure: residue 41 with first-shell neighbours 5, 6, 56, 78 and 89]
repeat until convergence:
if 41 < (5+6+56+78+89)/5
then Res 41 → 42 (up 1)
else Res 41 → 40 (down 1)
![Page 193: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/193.jpg)
Taylor method (1999)
continuous
discontinuous
![Page 194: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/194.jpg)
SNAPDRAGON
Domain boundary prediction protocol using sequence information alone (Richard George)
4. Sum proposed boundary positions within 100 models along the length of the sequence, and smooth boundaries using a weighted window (the window score is assigned to the central position)
$$\text{Window score} = \sum_{i=1}^{l} S_i \, W_i, \qquad W_i = \frac{p - |p - i|}{p^2}, \qquad p = \tfrac{1}{2}(l + 1)$$
where $l$ is the window length and $S_i$ is the summed boundary count at window position $i$. It follows that $\sum_{i=1}^{l} W_i = 1$ (for odd $l$).
[Figure: triangular weight profile, $W_i$ plotted against $i$]
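A minimal sketch of step 4, summing boundary positions over models and smoothing with the triangular window defined above. The window length is not given on the slide, so `length` is a hypothetical parameter, and treating positions beyond the sequence ends as zero is an assumption of this sketch.

```python
def sum_boundaries(model_boundaries, seq_length):
    """Count how many models propose a boundary at each sequence position."""
    counts = [0] * seq_length
    for boundaries in model_boundaries:
        for pos in boundaries:
            counts[pos] += 1
    return counts

def triangular_weights(length):
    """W_i = (p - |p - i|) / p^2 with p = (length + 1) / 2, for i = 1..length."""
    if length % 2 == 0:
        raise ValueError("window length must be odd so the weights sum to 1")
    p = (length + 1) / 2
    return [(p - abs(p - i)) / p ** 2 for i in range(1, length + 1)]

def smooth(counts, length=5):
    """Assign each position the weighted window score centred on it."""
    weights = triangular_weights(length)
    half = length // 2
    smoothed = []
    for centre in range(len(counts)):
        score = 0.0
        for k, w in enumerate(weights):
            j = centre - half + k
            if 0 <= j < len(counts):   # positions outside the sequence count as 0
                score += counts[j] * w
        smoothed.append(score)
    return smoothed

counts = sum_boundaries([[2], [2], [2], [2]], seq_length=5)
print(smooth(counts, length=3))
```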
![Page 195: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/195.jpg)
SNAPDRAGON
Statistical significance:
• Convert peak scores to Z-scores using z = (x − mean)/stdev
• If z > 2 then assign domain boundary
Statistical significance using random models:
• Test hydrophobic collapse given distribution of hydrophobicity over sequence
• Make 5 scrambled multiple alignments (MSAs) and predict their secondary structure
• Make 100 models for each MSA
• Compile mean and stdev from the boundary distribution over the 500 random models
• If the observed peak has z > 2.0 (more than 2.0 stdev above the random-model mean) then assign domain boundary
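The Z-score test can be sketched as follows. The function name and the input layout (a dictionary of peak positions and a flat list of scores from the random models) are hypothetical, and the use of the sample standard deviation is an assumption; the slides do not say which estimator was used.

```python
from statistics import mean, stdev

def significant_boundaries(peak_scores, random_scores, threshold=2.0):
    """Return {position: z} for peaks whose Z-score against the
    random-model score distribution exceeds `threshold`."""
    mu = mean(random_scores)
    sd = stdev(random_scores)          # sample standard deviation (assumption)
    result = {}
    for position, score in peak_scores.items():
        z = (score - mu) / sd
        if z > threshold:
            result[position] = z
    return result

# Toy numbers only: smoothed peak scores at three positions,
# compared with scores collected from scrambled-alignment models.
peaks = {57: 3.0, 120: 7.0, 188: 6.0}
background = [1.0, 2.0, 3.0, 4.0, 5.0]
print(significant_boundaries(peaks, background))
```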
![Page 196: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/196.jpg)
SnapDRAGON prediction assessment
• Test set of 414 multiple alignments; 183 single and 231 multiple domain proteins.
• Boundary predictions are compared to the region of the protein connecting two domains (maximally 10 residues from true boundary)
![Page 197: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/197.jpg)
SnapDRAGON prediction assessment
• Baseline method I:
  • Divide sequence in equal parts based on number of domains predicted by SnapDRAGON
• Baseline method II:
  • Similar to Wheelan et al., based on domain length partition density function (PDF)
  • PDF derived from 2750 non-redundant structures (deposited at NCBI)
  • Given sequence, calculate probability of one-domain, two-domain, .., protein
  • Highest probability taken and sequence split equally as in baseline method I
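Baseline method I is simple enough to write down directly. In this sketch the function name is hypothetical and rounding to the nearest residue is an assumption; baseline II would reuse the same split once the most probable domain count has been chosen from the PDF.

```python
def equal_split_boundaries(seq_length, n_domains):
    """Place n_domains - 1 boundaries so the sequence is cut into equal parts."""
    if n_domains < 1:
        raise ValueError("need at least one domain")
    return [round(k * seq_length / n_domains) for k in range(1, n_domains)]

print(equal_split_boundaries(300, 3))   # two boundaries for a 3-domain prediction
print(equal_split_boundaries(120, 1))   # single-domain protein: no boundary
```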
![Page 198: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/198.jpg)
Average prediction results per protein

| Method | Measure | Continuous set | Discontinuous set | Full set |
|---|---|---|---|---|
| SnapDRAGON | Coverage | 63.9 (± 43.0) | 35.4 (± 25.0) | 51.8 (± 39.1) |
| SnapDRAGON | Success | 46.8 (± 36.4) | 44.4 (± 33.9) | 45.8 (± 35.4) |
| Baseline 1 | Coverage | 43.6 (± 45.3) | 20.5 (± 27.1) | 34.7 (± 40.8) |
| Baseline 1 | Success | 34.3 (± 39.6) | 22.2 (± 29.5) | 29.6 (± 36.6) |
| Baseline 2 | Coverage | 45.3 (± 46.9) | 22.7 (± 27.3) | 35.7 (± 41.3) |
| Baseline 2 | Success | 37.1 (± 42.0) | 23.1 (± 29.6) | 31.2 (± 37.9) |

Coverage is the % of linkers predicted, TP/(TP+FN).
Success is the % of correct predictions made, TP/(TP+FP).
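Per-protein coverage and success can be computed as in the sketch below. It assumes each true linker is reduced to a single boundary position and reuses the 10-residue tolerance from the assessment slide; the function name and this matching rule are illustrative, not taken from the paper.

```python
def coverage_and_success(predicted, true_boundaries, tolerance=10):
    """Coverage: % of true boundaries with a prediction within `tolerance`.
    Success:  % of predictions lying within `tolerance` of a true boundary."""
    found = sum(any(abs(p - t) <= tolerance for p in predicted)
                for t in true_boundaries)
    correct = sum(any(abs(p - t) <= tolerance for t in true_boundaries)
                  for p in predicted)
    coverage = 100.0 * found / len(true_boundaries) if true_boundaries else 0.0
    success = 100.0 * correct / len(predicted) if predicted else 0.0
    return coverage, success

# Toy case: two true boundaries, four predictions, two of them close enough.
print(coverage_and_success([95, 180, 255, 400], [100, 250]))
```

Averaging these two numbers over all proteins in a set gives values of the kind reported in the table.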
![Page 199: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/199.jpg)
Average prediction results per protein
![Page 200: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/200.jpg)
Phylogenetic profile analysis
Function prediction of genes based on “guilt-by-association” – a non-homologous approach
The phylogenetic profile of a protein is a string that encodes the presence or absence of the protein in every sequenced genome
Because proteins that participate in a common structural complex or metabolic pathway are likely to co-evolve, the phylogenetic profiles of such proteins are often “similar”
![Page 201: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/201.jpg)
Phylogenetic profile analysis
Phylogenetic profile (against N genomes)
– For each gene X in a target genome (e.g., E. coli), build a phylogenetic profile as follows
– If gene X has a homolog in genome #i, the ith bit of X’s phylogenetic profile is “1”, otherwise it is “0”
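The profile construction is a direct translation of the rule above. In this sketch the homolog search itself (for example a sequence-similarity search with some score cutoff) is assumed to have been done already; `genomes_with_homolog` is a hypothetical set of genome indices in which a homolog of gene X was found.

```python
def build_profile(genomes_with_homolog, n_genomes):
    """Return the phylogenetic profile of a gene as a bit string:
    bit i is '1' if the gene has a homolog in genome i, else '0'."""
    return "".join("1" if i in genomes_with_homolog else "0"
                   for i in range(n_genomes))

# Toy example with 5 genomes: homologs found in genomes 0, 2 and 3.
print(build_profile({0, 2, 3}, 5))
```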
![Page 202: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/202.jpg)
Phylogenetic profile analysis
Example – phylogenetic profiles based on 60 genomes
orf1034:1110110110010111110100010100000000111100011111110110111010101
orf1036:1011110001000001010000010010000000010111101110011011010000101
orf1037:1101100110000001110010000111111001101111101011101111000010100
orf1038:1110100110010010110010011100000101110101101111111111110000101
orf1039:1111111111111111111111111111111111111111101111111111111111101
orf104: 1000101000000000000000101000000000110000000000000100101000100
orf1040:1110111111111101111101111100000111111100111111110110111111101
orf1041:1111111111111111110111111111111101111111101111111111111111101
orf1042:1110100101010010010110000100001001111110111110101101100010101
orf1043:1110100110010000010100111100100001111110101111011101000010101
orf1044:1111100111110010010111010111111001111111111111101101100010101
orf1045:1111110110110011111111111111111101111111101111111111110010101
orf1046:0101100000010001011000000111110000010100000001010010100000000
orf1047:0000000000000001000010000001000100000000000000010000000000000
orf105: 0110110110100010111101101010111001101100101111100010000010001
orf1054:0100100110000001100001000100000000100100100001000100100000000
(Rows: genes; columns: genomes)
Genes with similar phylogenetic profiles have related functions or are functionally linked (D. Eisenberg and colleagues, 1999)
By correlating the rows (open reading frames (ORFs) or genes) you find out about joint presence or absence of genes: this is a signal for a functional connection
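Comparing rows can be sketched with a simple bit-difference count. The Hamming distance and the `max_distance` cutoff are illustrative choices, not the measure used in the original work, and the gene names and profiles below are made up.

```python
from itertools import combinations

def hamming(a, b):
    """Number of genomes in which exactly one of the two genes is present."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same genomes")
    return sum(x != y for x, y in zip(a, b))

def linked_pairs(profiles, max_distance=1):
    """Return (distance, gene1, gene2) for gene pairs with near-identical profiles."""
    return sorted(
        (hamming(pa, pb), a, b)
        for (a, pa), (b, pb) in combinations(profiles.items(), 2)
        if hamming(pa, pb) <= max_distance
    )

toy_profiles = {
    "geneA": "110100",
    "geneB": "110101",
    "geneC": "001011",
}
print(linked_pairs(toy_profiles))
```

Profiles that are all ones or all zeros match many other genes by chance, so a practical analysis would weight matches by how informative the profile is.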
![Page 203: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/203.jpg)
Phylogenetic profile analysis
Phylogenetic profiles contain a great amount of functional information
Phylogenetic profile analysis can be used to distinguish orthologous genes from paralogous genes
Subcellular localization: 361 yeast nucleus-encoded mitochondrial proteins are identified at 50% accuracy with 58% coverage through phylogenetic profile analysis
Functional complementarity: By examining inverse phylogenetic profiles, one can find functionally complementary genes that have evolved through one of several mechanisms of convergent evolution.
![Page 204: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/204.jpg)
Prediction of protein-protein interactions
Rosetta stone
Gene fusion is an effective method for prediction of protein-protein interactions
– If proteins A and B are homologous to two domains of a protein C, A and B are predicted to interact
Though gene fusion has low prediction coverage, its false-positive rate is low
[Figure: separate proteins A and B aligned to two domains of the fused protein C]
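The Rosetta stone rule can be sketched as a search for two different proteins that match non-overlapping regions of the same fused protein. The hit tuples `(query, fusion_protein, start, end)` are a hypothetical format standing in for the output of a homology search; the names are placeholders.

```python
from itertools import combinations

def rosetta_pairs(hits):
    """Predict interacting pairs: two different query proteins that are
    homologous to non-overlapping regions of the same fusion protein."""
    pairs = set()
    for (qa, ca, sa, ea), (qb, cb, sb, eb) in combinations(hits, 2):
        same_fusion = ca == cb and qa != qb
        disjoint = ea < sb or eb < sa      # the matched regions do not overlap
        if same_fusion and disjoint:
            pairs.add(tuple(sorted((qa, qb))))
    return pairs

toy_hits = [
    ("A", "C", 1, 150),     # A matches the first domain of C
    ("B", "C", 160, 300),   # B matches the second domain of C
    ("D", "C", 100, 200),   # D overlaps both regions, so it pairs with neither
]
print(rosetta_pairs(toy_hits))
```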
![Page 205: Introduction to Bioinformatics Lecture 4: Bioinformatics infrastructure Centre for Integrative Bioinformatics VU (IBIVU) C E N T R F O R I N T E G R A.](https://reader035.fdocuments.us/reader035/viewer/2022062217/56649ebd5503460f94bc6d37/html5/thumbnails/205.jpg)
Domain fusion example
Vertebrates have a multi-enzyme protein (GARs-AIRs-GARt) comprising the enzymes GAR synthetase (GARs), AIR synthetase (AIRs), and GAR transformylase (GARt). In insects, the polypeptide appears as GARs-(AIRs)2-GARt. In yeast, GARs-AIRs is encoded separately from GARt. In bacteria each domain is encoded separately (Henikoff et al., 1997).
GAR: glycinamide ribonucleotide; AIR: aminoimidazole ribonucleotide