1 P6a Extra Discussion Slides Part 1. 2 Section A.
![Page 1: 1 P6a Extra Discussion Slides Part 1. 2 Section A.](https://reader030.fdocuments.us/reader030/viewer/2022033105/56649efa5503460f94c0ba43/html5/thumbnails/1.jpg)
1
P6a Extra Discussion Slides Part 1
![Page 2: 1 P6a Extra Discussion Slides Part 1. 2 Section A.](https://reader030.fdocuments.us/reader030/viewer/2022033105/56649efa5503460f94c0ba43/html5/thumbnails/2.jpg)
2
Section A
![Page 3: 1 P6a Extra Discussion Slides Part 1. 2 Section A.](https://reader030.fdocuments.us/reader030/viewer/2022033105/56649efa5503460f94c0ba43/html5/thumbnails/3.jpg)
3
Low complexity filter
![Page 4: 1 P6a Extra Discussion Slides Part 1. 2 Section A.](https://reader030.fdocuments.us/reader030/viewer/2022033105/56649efa5503460f94c0ba43/html5/thumbnails/4.jpg)
What are low-complexity sequences?
• Sequences that have low compositional complexity, such as repeats
• Examples
– Protein: PPCDPPPPPKDKKKKDDGPP
– Nucleotide: AAATAAAAAAAATAAAAAAT
4
![Page 5: 1 P6a Extra Discussion Slides Part 1. 2 Section A.](https://reader030.fdocuments.us/reader030/viewer/2022033105/56649efa5503460f94c0ba43/html5/thumbnails/5.jpg)
When the filter is on, how are low-complexity sequences displayed on the BLAST results page?
• Old blast algorithms/versions: The filter substitutes any low-complexity sequence that it finds with the letter "N" in nucleotide sequence (e.g., "NNNNNNNNNNNNN") or the letter "X" in protein sequences (e.g., "XXXXXXXXX").
• New blast algorithm: The filter substitutes any low-complexity sequence with lowercase grey characters. This allows you to see the sequence that was filtered instead of the "X"s and "N"s of the previous BLAST output.
5
![Page 6: 1 P6a Extra Discussion Slides Part 1. 2 Section A.](https://reader030.fdocuments.us/reader030/viewer/2022033105/56649efa5503460f94c0ba43/html5/thumbnails/6.jpg)
When to use and not to use the filter?
• In general, filters are used (turned on/ticked or checked) to remove low-complexity sequences because they can cause artifactual hits
• Because filtering can affect the % identity and % positives computation, you should turn it off if you want an accurate representation of % identity and % positives to infer homology
• Example:
– Filter off: AAAABCDEFGHI aligned with AAAABLDEFGHI, % identity = 11/12 ≈ 92%
– Filter on: XXXXBCDEFGHI aligned with XXXXBLDEFGHI, % identity = 7/8 ≈ 88%
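The arithmetic behind the example can be checked with a small sketch. It assumes that columns where both sequences are masked with "X" are simply left out of the count, which is the reading that matches the slide's lower value; the function name `percent_identity` and its `mask` parameter are illustrative, not part of BLAST.

```python
def percent_identity(seq_a, seq_b, mask=None):
    """Percent identity over equal-length, ungapped aligned sequences.

    Columns where BOTH residues equal `mask` are skipped entirely
    (an assumption made to mirror the slide's example).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    counted = 0
    identical = 0
    for a, b in zip(seq_a, seq_b):
        if mask is not None and a == mask and b == mask:
            continue  # masked column: contributes nothing
        counted += 1
        if a == b:
            identical += 1
    return 100.0 * identical / counted


unfiltered = percent_identity("AAAABCDEFGHI", "AAAABLDEFGHI")
filtered = percent_identity("XXXXBCDEFGHI", "XXXXBLDEFGHI", mask="X")
print(f"filter off: {unfiltered:.1f}%  filter on: {filtered:.1f}%")
```

With the filter off, 11 of 12 columns match (about 91.7%); with the four masked columns dropped, 7 of 8 match (87.5%).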
6
![Page 7: 1 P6a Extra Discussion Slides Part 1. 2 Section A.](https://reader030.fdocuments.us/reader030/viewer/2022033105/56649efa5503460f94c0ba43/html5/thumbnails/7.jpg)
Limiting Blast Results
7
![Page 8: 1 P6a Extra Discussion Slides Part 1. 2 Section A.](https://reader030.fdocuments.us/reader030/viewer/2022033105/56649efa5503460f94c0ba43/html5/thumbnails/8.jpg)
8
Blast results page top section: graphical overview
Length of query
Database information
![Page 9: 1 P6a Extra Discussion Slides Part 1. 2 Section A.](https://reader030.fdocuments.us/reader030/viewer/2022033105/56649efa5503460f94c0ba43/html5/thumbnails/9.jpg)
9
What is graphical overview for?
Graphic representation of results
• Top of graph represents query sequence
• Underlying bars show where hits occur
• Colors represent alignment scores
• Grey areas represent non similar regions surrounded by similar regions
• Scrolling over bar shows accession and description of hit
• Clicking on a bar takes you to its alignment with the query
![Page 10: 1 P6a Extra Discussion Slides Part 1. 2 Section A.](https://reader030.fdocuments.us/reader030/viewer/2022033105/56649efa5503460f94c0ba43/html5/thumbnails/10.jpg)
10
Blast results page middle section: descriptions
![Page 11: 1 P6a Extra Discussion Slides Part 1. 2 Section A.](https://reader030.fdocuments.us/reader030/viewer/2022033105/56649efa5503460f94c0ba43/html5/thumbnails/11.jpg)
11
What is Bit Score?
![Page 12: 1 P6a Extra Discussion Slides Part 1. 2 Section A.](https://reader030.fdocuments.us/reader030/viewer/2022033105/56649efa5503460f94c0ba43/html5/thumbnails/12.jpg)
12
What is Bit Score and Raw Score?
• Bit scores
– Give an indication of how good the alignment is: higher is better
– A score in bits is a normalized raw score
– Raw score = sum of substitution scores and gap penalties
– Normalized on the basis of the scoring method
– Can compare searches scored using different matrices
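The slides do not give the normalization formula. A minimal sketch, assuming the standard Karlin-Altschul form S' = (λS − ln K) / ln 2, is shown below. The parameters λ and K depend on the scoring matrix and gap penalties; the values used in the example (0.267 and 0.041) are illustrative assumptions, not values taken from the slides.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score to bits: (lam * S - ln K) / ln 2.

    `lam` and `k` are the statistical parameters of the scoring system;
    callers must supply values appropriate to their matrix and gap costs.
    """
    return (lam * raw_score - math.log(k)) / math.log(2)


# Illustrative parameter values only (assumed, not from the slides).
example_bits = bit_score(100, lam=0.267, k=0.041)
print(f"raw score 100 -> {example_bits:.1f} bits")
```

Because λ and K absorb the details of the scoring system, two searches scored with different matrices can be compared on the bit scale, which is the point made in the last bullet.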
![Page 13: 1 P6a Extra Discussion Slides Part 1. 2 Section A.](https://reader030.fdocuments.us/reader030/viewer/2022033105/56649efa5503460f94c0ba43/html5/thumbnails/13.jpg)
13
What is E-value?
![Page 14: 1 P6a Extra Discussion Slides Part 1. 2 Section A.](https://reader030.fdocuments.us/reader030/viewer/2022033105/56649efa5503460f94c0ba43/html5/thumbnails/14.jpg)
14
What is E-value?
• E-values
– It is a measure of the reliability of the S score
– In other words, it indicates how statistically significant the alignment is
• Number of times an alignment with the same (or a better) score could have arisen by chance
– Lower is better
– E-values decrease exponentially as scores for an alignment increase
– E-value of 9e-78 means 9 × 10^-78
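The exponential relationship between score and E-value can be sketched as follows, assuming the standard relation E = m · n · 2^(−S'), where S' is the bit score, m the query length and n the total database length. Real BLAST uses effective (edge-corrected) lengths, which this sketch ignores; the function name is illustrative.

```python
def e_value(bits, query_length, database_length):
    """Expected number of chance alignments scoring at least `bits`.

    Uses E = m * n * 2**(-bits) with raw lengths; BLAST itself applies
    effective-length corrections that are omitted here.
    """
    return query_length * database_length * 2.0 ** (-bits)


# Each extra bit of score halves the E-value.
for bits in (40, 41, 50):
    print(bits, e_value(bits, query_length=165, database_length=1_000_000))
```

The same bit score therefore gives a larger E-value against a larger database, which is one reason both measures are reported.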
![Page 15: 1 P6a Extra Discussion Slides Part 1. 2 Section A.](https://reader030.fdocuments.us/reader030/viewer/2022033105/56649efa5503460f94c0ba43/html5/thumbnails/15.jpg)
15
Why Bit score and E-value?
• Why do we need two measures, E-value and bit score, when both more or less tell you “how good a blast hit is”?
![Page 16: 1 P6a Extra Discussion Slides Part 1. 2 Section A.](https://reader030.fdocuments.us/reader030/viewer/2022033105/56649efa5503460f94c0ba43/html5/thumbnails/16.jpg)
16
Blast results page bottom section: alignments
![Page 17: 1 P6a Extra Discussion Slides Part 1. 2 Section A.](https://reader030.fdocuments.us/reader030/viewer/2022033105/56649efa5503460f94c0ba43/html5/thumbnails/17.jpg)
17
Anatomy of an alignment
![Page 18: 1 P6a Extra Discussion Slides Part 1. 2 Section A.](https://reader030.fdocuments.us/reader030/viewer/2022033105/56649efa5503460f94c0ba43/html5/thumbnails/18.jpg)
18
Anatomy of an alignment
• Description line of hit sequence: provides descriptions of the hit sequence such as name, accession number, and sometimes function and the species from which the sequence was isolated
![Page 19: 1 P6a Extra Discussion Slides Part 1. 2 Section A.](https://reader030.fdocuments.us/reader030/viewer/2022033105/56649efa5503460f94c0ba43/html5/thumbnails/19.jpg)
19
Anatomy of an alignment
Length of hit sequence
• How do you get length of query sequence?
- count yourself since it was provided by you, or
- refer to the top section of the blast results page
Length of query sequence
![Page 20: 1 P6a Extra Discussion Slides Part 1. 2 Section A.](https://reader030.fdocuments.us/reader030/viewer/2022033105/56649efa5503460f94c0ba43/html5/thumbnails/20.jpg)
Length of query, hit & alignment
[Figure: 1) BLAST of the query (positions 1 to 165) against a database hit (positions 1 to 900); 2) alignment of query positions 50 to 100 with hit (Sbjct) positions 850 to 900]
• Original length of input query (length of query) = 165 aa
• Original length of hit (length of hit) = 900 aa
• Length of alignment between query and hit (sbjct) = 900 - 850 + 1 + gaps (if any), or 100 - 50 + 1 + gaps (if any)
![Page 21: 1 P6a Extra Discussion Slides Part 1. 2 Section A.](https://reader030.fdocuments.us/reader030/viewer/2022033105/56649efa5503460f94c0ba43/html5/thumbnails/21.jpg)
21
Anatomy of an alignment
S Score provides alignment score in both normalized (bits) and raw (in the bracket) form
![Page 22: 1 P6a Extra Discussion Slides Part 1. 2 Section A.](https://reader030.fdocuments.us/reader030/viewer/2022033105/56649efa5503460f94c0ba43/html5/thumbnails/22.jpg)
22
Anatomy of an alignment
E-value is a measure of the reliability of the S score
![Page 23: 1 P6a Extra Discussion Slides Part 1. 2 Section A.](https://reader030.fdocuments.us/reader030/viewer/2022033105/56649efa5503460f94c0ba43/html5/thumbnails/23.jpg)
23
Anatomy of an alignment
Identities provides the number of identical residues (boxed in red above) as a fraction of the total length of the alignment (% identity)
% identity = (No. of identical residues / Length of alignment) × 100
*Note that this alignment, taken from another blast hit, is shown to demonstrate the equation of Positives below; it does not correspond to the Positives value above for the hit >gi|122295
![Page 24: 1 P6a Extra Discussion Slides Part 1. 2 Section A.](https://reader030.fdocuments.us/reader030/viewer/2022033105/56649efa5503460f94c0ba43/html5/thumbnails/24.jpg)
24
Anatomy of an alignment
Positives provides the number of positive residues (number of identical residues + number of residues with the + sign) as a fraction of the length of the alignment (% positives)
% positives = (No. of positive residues / Length of alignment) × 100
*Note that this alignment, taken from another blast hit, is shown to demonstrate the equation of Positives below; it does not correspond to the Positives value above for the hit >gi|122295
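Both percentages can be read off the middle line of a protein alignment. The sketch below assumes the usual BLAST text layout, where the midline shows the residue letter for an identical pair, "+" for a positive (similar) pair and a space otherwise. The function name and the sample midline are made up for illustration.

```python
def identity_and_positives(midline):
    """Return (% identity, % positives) from a protein alignment midline.

    Assumes one midline character per alignment column (gaps included):
    a letter marks an identical pair, '+' marks a positive pair.
    """
    length = len(midline)
    identical = sum(1 for ch in midline if ch.isalpha())
    positives = identical + midline.count("+")
    return 100.0 * identical / length, 100.0 * positives / length


# Hypothetical 6-column midline: 4 identities, 1 '+', 1 mismatch.
pct_identity, pct_positives = identity_and_positives("MK+L V")
print(f"identity {pct_identity:.1f}%  positives {pct_positives:.1f}%")
```

Note that positives always include the identities, so % positives can never be lower than % identity.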
![Page 25: 1 P6a Extra Discussion Slides Part 1. 2 Section A.](https://reader030.fdocuments.us/reader030/viewer/2022033105/56649efa5503460f94c0ba43/html5/thumbnails/25.jpg)
25
Anatomy of an alignment
Query refers to your own input sequence that you are investigating
Sbjct or subject refers to the hit sequence from the database that matched your query sequence
![Page 26: 1 P6a Extra Discussion Slides Part 1. 2 Section A.](https://reader030.fdocuments.us/reader030/viewer/2022033105/56649efa5503460f94c0ba43/html5/thumbnails/26.jpg)
26
Anatomy of an alignment
Local alignment start position for query and subject sequences
![Page 27: 1 P6a Extra Discussion Slides Part 1. 2 Section A.](https://reader030.fdocuments.us/reader030/viewer/2022033105/56649efa5503460f94c0ba43/html5/thumbnails/27.jpg)
27
Anatomy of an alignment
Local alignment end position for query and subject sequences
![Page 28: 1 P6a Extra Discussion Slides Part 1. 2 Section A.](https://reader030.fdocuments.us/reader030/viewer/2022033105/56649efa5503460f94c0ba43/html5/thumbnails/28.jpg)
28
Anatomy of an alignment
Aligned length of query = end position – start position + 1
Aligned length of hit = end position – start position + 1
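As a quick check of the "+ 1" in these formulas, here is a minimal sketch using the coordinates from the earlier query/hit example (query 50 to 100, hit 850 to 900). The helper names are illustrative, and the `gaps` argument reflects the earlier slide's "+ gaps (if any)" term.

```python
def aligned_span(start, end):
    """Number of residues covered from start to end, both inclusive."""
    return end - start + 1


def alignment_length(start, end, gaps=0):
    """Aligned span plus any gap columns opened in that sequence."""
    return aligned_span(start, end) + gaps


query_span = aligned_span(50, 100)   # query coordinates from the slide
hit_span = aligned_span(850, 900)    # hit (Sbjct) coordinates from the slide
print(query_span, hit_span, alignment_length(50, 100, gaps=2))
```

Positions are inclusive at both ends, so positions 50 to 100 cover 51 residues, not 50.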
![Page 29: 1 P6a Extra Discussion Slides Part 1. 2 Section A.](https://reader030.fdocuments.us/reader030/viewer/2022033105/56649efa5503460f94c0ba43/html5/thumbnails/29.jpg)
29
Anatomy of an alignment
• The frame number of the ORF that matched the query sequence.
• The frame number will only be shown if either the query or the database sequence is translated (blastx, tblastn, tblastx)
![Page 30: 1 P6a Extra Discussion Slides Part 1. 2 Section A.](https://reader030.fdocuments.us/reader030/viewer/2022033105/56649efa5503460f94c0ba43/html5/thumbnails/30.jpg)
30
Alignments are analyzed to infer homology
• Five key parameters of blast local alignment to be analyzed if one wants to infer homology
1) Length of the alignment
2) E value
3) S score
4) Percentage Identity
5) Percentage Positives
![Page 31: 1 P6a Extra Discussion Slides Part 1. 2 Section A.](https://reader030.fdocuments.us/reader030/viewer/2022033105/56649efa5503460f94c0ba43/html5/thumbnails/31.jpg)
31
Some rules to note when inferring homology
• Similarity can be indicative of homology
• Generally, if two sequences are significantly similar over entire length they are likely homologous
• You cannot measure homology - you cannot say two sequences are 90% homologous; instead, based on the similarity you infer whether they are homologous or not.
![Page 32: 1 P6a Extra Discussion Slides Part 1. 2 Section A.](https://reader030.fdocuments.us/reader030/viewer/2022033105/56649efa5503460f94c0ba43/html5/thumbnails/32.jpg)
32
Why the discrepancy?
![Page 33: 1 P6a Extra Discussion Slides Part 1. 2 Section A.](https://reader030.fdocuments.us/reader030/viewer/2022033105/56649efa5503460f94c0ba43/html5/thumbnails/33.jpg)
33
The culprit: query sequence matching multiple parts of hit sequences
Query
Length of Hit
![Page 34: 1 P6a Extra Discussion Slides Part 1. 2 Section A.](https://reader030.fdocuments.us/reader030/viewer/2022033105/56649efa5503460f94c0ba43/html5/thumbnails/34.jpg)
34
Section B
![Page 35: 1 P6a Extra Discussion Slides Part 1. 2 Section A.](https://reader030.fdocuments.us/reader030/viewer/2022033105/56649efa5503460f94c0ba43/html5/thumbnails/35.jpg)
35
BLAST Flavors
| Query | Database | BLAST flavor | Blast output |
|---|---|---|---|
| DNA | DNA | BLASTN | DNA |
| DNA | Protein | BLASTX | Protein |
| Protein | Protein | BLASTP | Protein |
| Protein | DNA | TBLASTN | Protein |
| DNA | DNA | TBLASTX | Protein |

• Currently, 5 different basic BLAST flavors are available (5 different combinations)
• How to remember?
- when you have “X” after “blast”, the query is translated
- when you have “T” before “blast”, the database is translated
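The five combinations can be written down as a small lookup, which also makes the "X after / T before" rule explicit. This is only a sketch: the table contents follow the slide, while the names `FLAVORS` and `pick_flavor` and the `compare_as_protein` flag are illustrative.

```python
# flavor -> (query type, database type, query translated?, database translated?)
FLAVORS = {
    "blastn": ("dna", "dna", False, False),
    "blastx": ("dna", "protein", True, False),
    "blastp": ("protein", "protein", False, False),
    "tblastn": ("protein", "dna", False, True),
    "tblastx": ("dna", "dna", True, True),
}


def pick_flavor(query, database, compare_as_protein=False):
    """Return the flavor for a query/database pair.

    For DNA vs DNA, `compare_as_protein=True` selects tblastx
    (both sides translated) instead of blastn.
    """
    for name, (q, d, q_tr, d_tr) in FLAVORS.items():
        if (q, d) == (query, database) and (q_tr and d_tr) == compare_as_protein:
            return name
    raise ValueError(f"no BLAST flavor for {query} vs {database}")


print(pick_flavor("dna", "protein"))          # query translated: X after "blast"
print(pick_flavor("protein", "dna"))          # database translated: T before "blast"
print(pick_flavor("dna", "dna", True))        # both translated
```

The mnemonic holds for every row: a trailing "x" always means the query is translated, and a leading "t" always means the database is translated.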
![Page 36: 1 P6a Extra Discussion Slides Part 1. 2 Section A.](https://reader030.fdocuments.us/reader030/viewer/2022033105/56649efa5503460f94c0ba43/html5/thumbnails/36.jpg)
36
BLAST Flavors
• TBLASTX
– You would use this instead of blastn when you want the output in protein format instead of DNA
– However, this flavour is often limited in usage because the six-frame translation of the large number of sequences in the database (which can number up to a few million or even billions) requires a lot of processing time
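To make the cost of six-frame translation concrete, here is a minimal sketch of what it involves for a single DNA sequence: three reading frames on the given strand and three on its reverse complement. It assumes the standard genetic code and is not how BLAST itself is implemented; the function names are illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna):
    """Return the protein translation for frames +1..+3 and -1..-3."""
    forward = dna.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(forward[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


frames = six_frame_translation("ATGGCC")
print(frames)
```

Every database sequence has to go through this six times over before comparison, which is why the slide describes tblastx as slow on large databases.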
![Page 37: 1 P6a Extra Discussion Slides Part 1. 2 Section A.](https://reader030.fdocuments.us/reader030/viewer/2022033105/56649efa5503460f94c0ba43/html5/thumbnails/37.jpg)
Popular NCBI databases for BLAST
DNA databases
• GenBank, or NCBI Nucleotide database, or Nucleotide collection (nt)
• Reference genomic sequences (RefSeq_genomic)
• Protein Data Bank (PDBnt)
Protein databases
• GenPept, or NCBI protein database, or non-redundant protein sequences (nr)
• Reference Protein (RefSeq_protein)
• Swissprot protein sequences (swissprot)
• Protein Data Bank proteins (PDBaa)
![Page 38: 1 P6a Extra Discussion Slides Part 1. 2 Section A.](https://reader030.fdocuments.us/reader030/viewer/2022033105/56649efa5503460f94c0ba43/html5/thumbnails/38.jpg)
Are nt or nr really non-redundant?
• Though NR and NT are called non-redundant databases, they are actually redundant. When they were first created, they were intended to be non-redundant databases of protein and nucleotide sequences, respectively. However, for some unknown reason, NCBI was not able to keep them non-redundant, and the phrase “non-redundant” remained attached to these databases. Calling a redundant database “non-redundant” is therefore somewhat misleading, but there is not much we can do about the name because NCBI seems to have decided to keep it as it is.
![Page 39: 1 P6a Extra Discussion Slides Part 1. 2 Section A.](https://reader030.fdocuments.us/reader030/viewer/2022033105/56649efa5503460f94c0ba43/html5/thumbnails/39.jpg)
What are RefSeq databases?
• Recently, NCBI created two databases called RefSeq_Protein and RefSeq_Genomic, designed to reduce duplication in NR/NT by selecting unique representative sequences for each locus
• Example:
– Take all the sequences from NR of the protein of interest (e.g., human p53)
– Remove duplicate and partial sequences of the protein of interest (e.g., human p53)
– Take one representative sequence and place a copy in RefSeq database
– Add lots of annotation to the record
• RefSeq_Protein contains reference protein sequences
• RefSeq_Genomic contains reference DNA sequences
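The example steps above can be mimicked with a toy sketch. This is emphatically not NCBI's curation pipeline: it treats a "partial" sequence as one that is an exact substring of a longer sequence, and it picks the longest survivor as the representative. The function name and the sample sequences are hypothetical.

```python
def pick_representative(sequences):
    """Toy version of the slide's steps for one locus.

    1. Remove exact duplicates.
    2. Remove partial sequences (exact substrings of another sequence).
    3. Return one representative (here: the longest remaining sequence).
    Raises ValueError if `sequences` is empty.
    """
    unique = set(sequences)
    complete = [
        seq for seq in unique
        if not any(seq != other and seq in other for other in unique)
    ]
    return max(complete, key=lambda seq: (len(seq), seq))


# Hypothetical records for one protein: a duplicate and two fragments.
records = ["MKTAYIAK", "MKTAYIAK", "KTAY", "MKTA"]
representative = pick_representative(records)
print(representative)
```

Real curation also merges evidence from multiple records and adds annotation, which a string comparison like this cannot capture.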
![Page 40: 1 P6a Extra Discussion Slides Part 1. 2 Section A.](https://reader030.fdocuments.us/reader030/viewer/2022033105/56649efa5503460f94c0ba43/html5/thumbnails/40.jpg)
NR/NT versus RefSeq_Protein/Genomic
• NR/NT database contains ALL known sequences reported at NCBI (including duplicates).
• RefSeq databases are reference databases of non-redundant and representative sequences from NR/NT. RefSeq databases are subsets of NR/NT
• RefSeq records are usually highly curated
![Page 41: 1 P6a Extra Discussion Slides Part 1. 2 Section A.](https://reader030.fdocuments.us/reader030/viewer/2022033105/56649efa5503460f94c0ba43/html5/thumbnails/41.jpg)
Which database is good for hits from single species or multiple species?
• NR is good when the user is interested in all hits, either from the same species or multiple species
– make sure you set the description and alignment limit to maximum in order to see all the hits
• RefSeq is good when the user is interested in reference hits, either from single or multiple species.
![Page 42: 1 P6a Extra Discussion Slides Part 1. 2 Section A.](https://reader030.fdocuments.us/reader030/viewer/2022033105/56649efa5503460f94c0ba43/html5/thumbnails/42.jpg)
What is Swissprot database and how does it differ from RefSeq?
• Swissprot (the manually annotated section of Uniprot) is a database of highly curated protein sequences (the sequence records are enriched with information from the literature)
• This database represents an effort to annotate/enrich all the protein sequence records in NR
• RefSeq protein versus SwissProt:
– Swissprot is smaller in size than RefSeq
– Both contain highly curated protein sequence records
42
![Page 43: 1 P6a Extra Discussion Slides Part 1. 2 Section A.](https://reader030.fdocuments.us/reader030/viewer/2022033105/56649efa5503460f94c0ba43/html5/thumbnails/43.jpg)
43
Section C
![Page 44: 1 P6a Extra Discussion Slides Part 1. 2 Section A.](https://reader030.fdocuments.us/reader030/viewer/2022033105/56649efa5503460f94c0ba43/html5/thumbnails/44.jpg)
44
Blast2Seq
• BLAST 2 Sequences (bl2seq) - aligns two sequences of your choice
- The sequence you input in the first text box is treated as the query
- The sequence you input in the second text box is treated as a sequence from an “imaginary” database
- Hence, even though you are comparing only two sequences, the different blast flavours can also be applied here
- It also provides a dot-plot-like output
![Page 45: 1 P6a Extra Discussion Slides Part 1. 2 Section A.](https://reader030.fdocuments.us/reader030/viewer/2022033105/56649efa5503460f94c0ba43/html5/thumbnails/45.jpg)
45
60 amino acids
Why the discrepancy?
![Page 46: 1 P6a Extra Discussion Slides Part 1. 2 Section A.](https://reader030.fdocuments.us/reader030/viewer/2022033105/56649efa5503460f94c0ba43/html5/thumbnails/46.jpg)
46
60 amino acids
Why the discrepancy?
Query: 84 - 25 + 1 = 60
Sbjct: 183 - 4 + 1 = 180
180 / 3 bases per codon = 60 aa
Sbjct position refers to nucleotide position
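The resolution of the discrepancy can be sketched in a few lines, assuming the subject is a nucleotide sequence reported in nucleotide coordinates while the query is a protein (as in tblastn). The function name and the `is_nucleotide` flag are illustrative.

```python
def aligned_residues(start, end, is_nucleotide=False):
    """Aligned length in amino acids for an inclusive coordinate range.

    When the coordinates are nucleotide positions of a translated
    sequence, three bases correspond to one amino acid.
    """
    span = end - start + 1
    return span // 3 if is_nucleotide else span


query_aa = aligned_residues(25, 84)                      # protein coordinates
sbjct_aa = aligned_residues(4, 183, is_nucleotide=True)  # nucleotide coordinates
print(query_aa, sbjct_aa)
```

Both sides cover 60 amino acids once the subject's 180 nucleotides are divided by three bases per codon.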