Thursday and Friday Dr Michael Carton Formerly VO’F group, now National Disease Surveillance...
Thursday and Friday
Dr Michael Carton
Formerly VO’F group, now National Disease Surveillance Centre (NDSC)
Wed (tomorrow) 10am - this suite booked for BLAST searches
TODAY
www.nuigalway.ie/microbiology/bioinformaticsnode/home.html
Lots of definitions - don’t worry!!
But, later on, look stuff up on Google or Scirus
Remember: Homology
Sequences are homologous if they are related by divergence from a common ancestor.
Sequence alignment
In order to detect sequence homology we must first align sequences.
An alignment is a hypothesis of positional homology between nucleotides/amino acids.
Alignment example
Take the case of a hypothetical ancestral sequence (GAATTCGC). Over time mutation may lead to two different forms of this sequence, GAATTCGC and GATTGGC.
Example continued
Alignment without gaps:
GAATTCGC
GATTGGC
** *
Alignments with gaps:
GAATTCGC   or   GAATTC-GC
GA-TTGGC        GA-TT-GGC
** ** **        ** **  **
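The asterisk rows on this slide can be regenerated mechanically. The sketch below is only an illustration: the function name `mark_matches` is made up here, and gaps are assumed to be written as `-`.

```python
def mark_matches(top: str, bottom: str, gap: str = "-") -> str:
    """Return a marker row with '*' under each column where both rows hold the same base."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    return "".join(
        "*" if a == b and a != gap else " "
        for a, b in zip(top, bottom)
    )


# The two gapped alignments from the slide (gap character assumed to be '-').
for top, bottom in [("GAATTCGC", "GA-TTGGC"), ("GAATTC-GC", "GA-TT-GGC")]:
    marks = mark_matches(top, bottom)
    print(top)
    print(bottom)
    print(marks, f"({marks.count('*')} identical columns)")
```

Both gapped alignments recover six identical columns, against three for the ungapped one.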
Types of alignment
Local: local alignment finds short regions of similarity between a pair of sequences.
Global: global alignment attempts to find the optimal alignment over the entire length of the sequences.
Local alignment
Finds domains and short regions of similarity between a pair of sequences. The two sequences under comparison do not necessarily need to have high levels of similarity over their entire length in order to receive locally high similarity scores.
Local alignment
This feature of local similarity searches gives them the advantage of being useful when looking for domains within proteins or looking for regions of genomic DNA that contain introns. Local similarity searches do not have the constraint that similarity between two sequences needs to be observed over the entire length of each gene.
Global alignment
Finds the optimal alignment over the entire length of the two sequences under comparison. Algorithms of this nature are not particularly suited to the identification of genes that have evolved by recombination or insertion of unrelated regions of DNA. In instances such as this, a global similarity score will be greatly reduced. Where the genes being aligned are of comparable length and are homologous over their entire length (descended from a common ancestor), global alignment works well.
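The difference between the two types of alignment can be seen in a small dynamic-programming sketch. This is a textbook-style illustration, not the method used by any program named in these slides; the scores (match +1, mismatch -1, gap -1) and the test sequences are assumptions chosen for the example.

```python
def alignment_score(a: str, b: str, local: bool = False,
                    match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Best global (default) or local alignment score by dynamic programming."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment has to pay for gaps at the start of either sequence.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)       # a local alignment may restart anywhere
                best = max(best, cell)    # and may end anywhere
            score[i][j] = cell
    return best if local else score[-1][-1]


# Two made-up sequences sharing only the central region GAATTC.
x, y = "TTTGAATTCAAA", "CCCGAATTCGGG"
print("global:", alignment_score(x, y))              # global: 0
print("local: ", alignment_score(x, y, local=True))  # local:  6
```

The unrelated flanking regions cancel out the global score, while the local score reflects only the shared region.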
PROGRAMS USED
Local: Blast, Fasta3
Global: Clustalw, Clustalx
Terminology
Exact (Exhaustive): This is a method of looking at all possibilities for a particular problem and then choosing the best one. It is the most rigorous method.
Heuristic: This class of methods takes short-cuts and attempts to arrive at an optimal solution by making educated guesses.
Matrices
Write one sequence horizontally: T A T T G
Write the other sequence vertically to form a grid: T A A T G
[Grid figure; only some cell values survive from the slide: 1 1 0 / 0 1 0 / 1 0 1]
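The full grid for these two sequences can be rebuilt with a few lines of code. This sketch assumes, in line with the identity scoring matrix shown later in the slides, that a cell holds 1 when the two bases match and 0 otherwise; `match_grid` is a name invented for the example.

```python
def match_grid(horizontal: str, vertical: str) -> list[list[int]]:
    """One row per base of `vertical`, one column per base of `horizontal`; 1 = match."""
    return [[1 if v == h else 0 for h in horizontal] for v in vertical]


horizontal, vertical = "TATTG", "TAATG"
grid = match_grid(horizontal, vertical)

# Print the grid with the horizontal sequence as a header row.
print("  " + " ".join(horizontal))
for base, row in zip(vertical, grid):
    print(base, " ".join(str(cell) for cell in row))
```

Runs of 1s along a diagonal correspond to stretches where the two sequences agree without gaps.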
Calculating an Alignment Score
An alignment’s score is calculated using a scoring matrix, a gap opening penalty and a gap extension penalty.
Scoring an alignment
A C T G
A 1
C 0 1
T 0 0 1
G 0 0 0 1
Previous Example
Alignment without gaps:
GAATTCGC
GATTGGC
** *
Alignments with gaps:
GAATTCGC   or   GAATTC-GC
GA-TTGGC        GA-TT-GGC
** ** **        ** **  **
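As a rough illustration of how the two gapped alignments could be scored, the sketch below uses the identity matrix from the earlier slide (match 1, mismatch 0). The slides give no gap penalties, so the values used here (opening -2, extension -1) are assumptions, as is the convention that the opening penalty covers the first gap position.

```python
def score_alignment(top: str, bottom: str, match: int = 1, mismatch: int = 0,
                    gap_open: int = -2, gap_extend: int = -1, gap: str = "-") -> int:
    """Score an existing alignment column by column with affine gap penalties."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    total = 0
    gapped_row = None  # which row the current run of gaps sits in, if any
    for a, b in zip(top, bottom):
        if a == gap or b == gap:
            row = "top" if a == gap else "bottom"
            # The first position of a gap pays gap_open, further positions gap_extend.
            total += gap_extend if row == gapped_row else gap_open
            gapped_row = row
        else:
            total += match if a == b else mismatch
            gapped_row = None
    return total


print(score_alignment("GAATTCGC", "GA-TTGGC"))    # 4: six matches, one gap
print(score_alignment("GAATTC-GC", "GA-TT-GGC"))  # 0: six matches, three gaps
```

With these assumed penalties the single-gap alignment is preferred, even though both alignments contain six matches.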
Dotplot Matrix I
Dotplot Matrix II
Noise is caused by matches that have occurred by chance without any homology present. A filter can be used to reduce the noise, e.g. only place a dot when a specified proportion of a small group of successive bases match: a window of 10 is only highlighted if 6 of the 10 bases match.
Chimpanzee haemoglobin intergenic DNA plotted against itself, c. 400 bases
8 out of 10, even less noise
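The window filter described above can be sketched as follows. This is a naive illustration rather than the code behind the plots on these slides; the function name `dotplot` and the short test sequence are invented for the example.

```python
def dotplot(x: str, y: str, window: int = 10, min_matches: int = 6) -> list[tuple[int, int]]:
    """Return (i, j) positions where at least `min_matches` of the `window`
    bases starting at x[i] and y[j] are identical."""
    dots = []
    for i in range(len(x) - window + 1):
        for j in range(len(y) - window + 1):
            matches = sum(x[i + k] == y[j + k] for k in range(window))
            if matches >= min_matches:
                dots.append((i, j))
    return dots


seq = "GAATTCGCATCGAATTCGC"  # made-up sequence plotted against itself
loose = dotplot(seq, seq, window=10, min_matches=6)
strict = dotplot(seq, seq, window=10, min_matches=8)
print(len(loose), "dots at 6 of 10;", len(strict), "dots at 8 of 10")
```

Raising the threshold can only remove dots, which is why the 8-out-of-10 plot shows less noise than the 6-out-of-10 plot.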
IDENTITY DOT PLOT
Identity blocks: looks for blocks of perfect identity; reduces time required.
Chimp and spider monkey DNA, but c. 4,000 bases this time
Scoring matrix
In reality, we know that certain mutations are more likely to have occurred than others.
Conservation of the secondary structure of proteins is an important consideration.
The mutation of the third base in a codon often results in no change in the amino acid coded for.
Observations of alignments of amino acid sequences have been used to calculate the probability of certain substitutions.
Scoring Matrices
Scoring matrices tell how similar amino acids are.
There are two main sets of scoring matrices: PAM and BLOSUM.
PAM is based on evolutionary distances.
BLOSUM is based on structure/function similarities.
AA Matrices
Assigning a score to each of the 210 possible amino acid pairs (190 substitutions plus 20 identities) has been done by several authors, but 2 are especially noteworthy.
Dayhoff et al. (1978) used amino acid alignments of sequences that were at least 85% identical as a basis for the PAM mutation data matrices.
AA Matrices
Henikoff and Henikoff (1992) used several different alignments to produce the BLOSUM matrices.
The BLOSUM62 matrix is built from alignment blocks in which sequences at least 62% identical are clustered and counted as a single sequence.
This is possibly the most used amino acid substitution matrix and is the default in several applications.
Scoring matrices
These have been empirically determined and have been calculated by the direct comparison of related protein sequences.
In general, amino acid substitutions that are seen to occur very rarely are given a negative value.
Conservative substitutions (e.g., isoleucine for leucine) are given a positive value. Identical matches are also given a positive value.
The bottom line on PAM
(Frequencies of alignment) / (Frequencies of occurrence)
The probability that two amino acids, i and j, are aligned by evolutionary descent divided by the probability that they are aligned by chance.
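In practice this ratio is usually reported as a scaled logarithm (a log-odds score), so that likely substitutions come out positive and unlikely ones negative. The sketch below assumes the base-10 logarithm and scale factor of 10 commonly quoted for the Dayhoff matrices; the frequencies are invented purely for illustration.

```python
import math


def log_odds(q_ij: float, p_i: float, p_j: float, scale: float = 10.0) -> int:
    """Scaled log of (frequency of i aligned with j) / (frequency expected by chance)."""
    return round(scale * math.log10(q_ij / (p_i * p_j)))


# Hypothetical background frequencies for two amino acids i and j.
p_i, p_j = 0.1, 0.1
print(log_odds(0.02, p_i, p_j))   # aligned twice as often as chance  ->  3
print(log_odds(0.01, p_i, p_j))   # aligned exactly as often as chance -> 0
print(log_odds(0.005, p_i, p_j))  # aligned half as often as chance   -> -3
```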
BLOSUM Matrices
BLOSUM is built from distantly related sequences whereas PAM is built from closely related sequences.
BLOSUM is built from conserved blocks of aligned protein segments found in the BLOCKS database.
PAM and BLOSUM
Running searches with different matrices will help find different sorts of hits.
PAM30 will preferentially find homologues that are evolutionarily close.
PAM250 will tend to find long, weak, diffuse matches typical of distantly related proteins.
BLOSUM62 is built from alignment blocks in which proteins at least 62% identical are clustered together.
Evolutionary Basis of Sequence Alignment
1. Similarity: Quantity that relates to how alike two sequences are.
2. Identity: Quantity that describes how alike two sequences are in the strictest terms.
3. Homology: a conclusion drawn from data suggesting that two genes share a common evolutionary history.
Evolutionary Basis of Sequence Alignment (Cont. 1)
1. Example: Shown on the next page is a pairwise alignment of two proteins. One is mouse trypsin and the other is crayfish trypsin. They are homologous proteins. The sequences share 41% identity.
2. Underlined residues are identical. Asterisks and diamond represent those residues that participate in catalysis. Five gaps are placed to optimize the alignment.
[Figure: pairwise alignment of mouse trypsin and crayfish trypsin]
Evolutionary Basis of Sequence Alignment (Cont. 2)
Why are there regions of identity?
1) Conserved function: residues participate in reaction.
2) Structural: residues participate in maintaining structure of protein. (For example, conserved cysteine residues that form a disulfide linkage.)
3) Historical: residues that are conserved solely due to a common ancestor gene.
Sequence Homology Searching
Find related sequences in the database
Original BLAST
Segment pair: a pair of subsequences of the same length that form an ungapped alignment.
BLAST searches for all segment pairs between the query sequence and all of the sequences in the database (above a certain threshold).
HSP: High-scoring Segment Pair.
Original Blast
HSPs are derived by first finding the pairs that satisfy the threshold (T) conditions. Then the alignment is extended in both directions until the quality of the alignment drops off dramatically or falls to zero.
The HSPs are then sorted according to their score.
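The seed-and-extend idea can be sketched in a much simplified form. This is not the BLAST implementation: seeds here are exact word matches rather than words scoring above T against a substitution matrix, and the scores (match +1, mismatch -1), the drop-off limit `x_drop` and all function names are assumptions made for the example.

```python
MATCH, MISMATCH = 1, -1  # assumed scores for this sketch


def extend(query: str, subject: str, qi: int, si: int, step: int, x_drop: int) -> tuple[int, int]:
    """Extend without gaps from (qi, si) in direction `step`; stop once the running
    score falls more than x_drop below the best seen. Returns (best gain, length kept)."""
    score = best = length = best_len = 0
    while 0 <= qi < len(query) and 0 <= si < len(subject):
        score += MATCH if query[qi] == subject[si] else MISMATCH
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break
        qi += step
        si += step
    return best, best_len


def find_hsps(query: str, subject: str, word: int = 3, x_drop: int = 2, min_score: int = 5):
    """Return (score, query start, subject start, length) tuples, highest score first."""
    hsps = set()
    for qi in range(len(query) - word + 1):
        for si in range(len(subject) - word + 1):
            if query[qi:qi + word] != subject[si:si + word]:
                continue  # not a seed
            right, r_len = extend(query, subject, qi + word, si + word, +1, x_drop)
            left, l_len = extend(query, subject, qi - 1, si - 1, -1, x_drop)
            score = word * MATCH + right + left
            if score >= min_score:
                hsps.add((score, qi - l_len, si - l_len, word + l_len + r_len))
    return sorted(hsps, reverse=True)  # sorted according to score


print(find_hsps("TTTGAATTCAAA", "CCCGAATTCGGG"))  # [(6, 3, 3, 6)]
```

All four seeds in the shared region extend to the same segment pair, so it is reported once.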
Gapped BLAST
The original BLAST suffered from the limitation of not being able to introduce gaps into the alignment.
Gapped BLAST is an effort to circumvent this shortcoming.
Experience shows that often several ungapped non-overlapping alignments result from a match to a single database entry.
Two-Hit method
Find 2 HSPs within a distance m of each other on the same diagonal.
Do not attempt an HSP extension unless you find two regions that meet this criterion.
Attempt to generate a single gapped alignment in this region.
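The two-hit criterion can be illustrated with a small sketch. It treats the initial matches as exact word hits, which is a simplification, and the names `two_hit_seeds` and `max_distance` (standing in for the distance m) are invented for the example. A diagonal is identified by subject position minus query position.

```python
from collections import defaultdict


def two_hit_seeds(query: str, subject: str, word: int = 3, max_distance: int = 10):
    """Return (diagonal, first hit, second hit) for pairs of non-overlapping word
    hits on the same diagonal that lie within max_distance of each other."""
    hits_by_diagonal = defaultdict(list)
    for qi in range(len(query) - word + 1):
        for si in range(len(subject) - word + 1):
            if query[qi:qi + word] == subject[si:si + word]:
                hits_by_diagonal[si - qi].append(qi)

    pairs = []
    for diagonal, positions in sorted(hits_by_diagonal.items()):
        last = None
        for pos in sorted(positions):
            if last is None:
                last = pos
                continue
            distance = pos - last
            if distance < word:
                continue  # overlaps the previous hit, so it is ignored
            if distance <= max_distance:
                pairs.append((diagonal, last, pos))
            last = pos
    return pairs


print(two_hit_seeds("TTTGAATTCAAA", "CCCGAATTCGGG"))  # [(0, 3, 6)]
print(two_hit_seeds("AAAGATCCC", "TTTGATGGG"))        # [] (only one hit)
```

Only regions that produce such a pair would go on to the extension step, which is what saves time.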
FastA algorithm
Is the alignment significant?
Could we see an alignment like this purely by chance?
What are the statistics involved?
ktups
Sequence X: GAATTCGCATC
This 11-base sequence can be divided into six 6-long segments of DNA:
GAATTC AATTCG ATTCGC TTCGCA TCGCAT CGCATC
These are known as ‘ktuples’ (ktup in Fasta).
Sequences in databases are stored in this form.
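Splitting a sequence into ktuples, and storing them in a lookup table, takes only a few lines. The sketch below is an illustration of the idea, not FASTA's actual data structure; the function names are invented for the example.

```python
def ktuples(sequence: str, k: int) -> list[str]:
    """All overlapping words of length k, in order of position."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def ktuple_index(sequence: str, k: int) -> dict[str, list[int]]:
    """Map each ktuple to the positions at which it starts."""
    index: dict[str, list[int]] = {}
    for position, word in enumerate(ktuples(sequence, k)):
        index.setdefault(word, []).append(position)
    return index


print(ktuples("GAATTCGCATC", 6))
# ['GAATTC', 'AATTCG', 'ATTCGC', 'TTCGCA', 'TCGCAT', 'CGCATC']
print(ktuple_index("GAATTCGCATC", 2)["AT"])  # [2, 8]
```

A sequence of length n yields n - k + 1 ktuples, hence six 6-tuples from 11 bases.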
Global Alignment vs. Local Alignment
Global alignment is used when the overall gene sequence is similar to another sequence. It is often used in multiple sequence alignment, e.g. the Clustal W algorithm.
Local alignment is used when only a small portion of one gene is similar to a small portion of another gene, e.g. BLAST, FASTA.
Different forms of BLAST and FASTA
You have a nucleotide sequence and want to compare it with other nucleotide sequences: Blastn, Fasta3
Different forms of BLAST and FASTA
To compare the 6-frame conceptual translation of the nucleotide sequence against a protein database: Blastx, Fastx3, Fasty3
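A 6-frame conceptual translation reads the sequence in three frames on each strand. The sketch below assumes the standard genetic code and uses `X` for any codon containing a character other than A, C, G or T; the frame labels (`+1` to `-3`) and function names are choices made for this example, not output of the programs above.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def reverse_complement(dna: str) -> str:
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna: str) -> str:
    """Translate complete codons only; '*' marks a stop codon."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna: str) -> dict[str, str]:
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for frame, protein in six_frame_translation("ATGGCC").items():
    print(frame, protein)
```

Each of the six protein strings can then be compared against a protein database.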
Different forms of BLAST and FASTA
If we translate our nucleotide sequence, we can compare it to the translation of a nucleotide database: tBlastn, tFasty3
Homology Search Tools
BLAST (Basic Local Alignment Search Tool) by Stephen Altschul: http://www.ncbi.nih.gov/
FASTA by William Pearson: http://www.ebi.ac.uk/
Open a new Word file and 3 web browser windows