
Module 2: Sequence DBs and Similarity Searches

Learning objectives
• Understand how information is stored in GenBank.
• Learn how to read a GenBank flat file.
• Learn how to search GenBank for information.
• Understand the difference between header, features and sequence.
• Learn the difference between a primary database and a secondary database.
• Understand the principle of similarity searches using the BLAST program.


What is GenBank?

Gene sequence database

Annotated records that represent single contiguous stretches of DNA or RNA; a record may have more than one coding region (limit 350 kb)

Generated from direct submissions by the authors to the DNA sequence databases.

Part of the International Nucleotide Sequence Database Collaboration.


International Nucleotide Sequence Database Collaboration

Exchange of information on a daily basis among:
• GenBank (NCBI)
• EMBL (EBI), United Kingdom
• DDBJ, Japan


History of GenBank

Began with the Atlas of Protein Sequence and Structure (Dayhoff et al., 1965).

In 1986 it began collaborating with EMBL, and in 1987 with DDBJ.

It is a primary database (i.e., experimental data is placed into it).

Examples of secondary databases derived from GenBank/EMBL/DDBJ: Swiss-Prot, PIR.

The GenBank Flat File is a human-readable form of the records.


General Comments on GBFF (GenBank Flat File)

Three sections:
1) Header: information about the whole record
2) Features: description of annotations, each represented by a key
3) Nucleotide sequence: each record ends with // on its last line

DNA-centered

Translated sequence is only a feature
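
As a rough illustration of the three-section layout, the sketch below splits one record into header, features and sequence lines. It assumes that the FEATURES keyword opens the feature table, that BASE COUNT or ORIGIN opens the sequence part, and that // closes the record. The function name split_gbff and the toy record are made up for this example.

```python
def split_gbff(record_text):
    """Split one GenBank flat-file record into header, features, and sequence lines."""
    sections = {"header": [], "features": [], "sequence": []}
    current = "header"
    for line in record_text.splitlines():
        if line.startswith("//"):            # end-of-record marker
            break
        if line.startswith("FEATURES"):      # feature table begins
            current = "features"
        elif line.startswith(("BASE COUNT", "ORIGIN")):  # sequence part begins
            current = "sequence"
        sections[current].append(line)
    return sections


# Hypothetical toy record, not a real GenBank entry.
toy_record = """LOCUS       TOY00001       12 bp    DNA     PLN 01-JAN-2000
DEFINITION  Hypothetical toy record.
FEATURES             Location/Qualifiers
     CDS             1..12
ORIGIN
        1 atggcctaat ga
//
"""

parts = split_gbff(toy_record)
for name, lines in parts.items():
    print(name, len(lines))
```

Real records carry more header keywords than this toy one, so a production parser would need more care than this sketch takes.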


Feature Keys

Purpose:
1) Indicates the biological nature of a sequence
2) Supplies information about changes to sequences

Feature Key     Description
conflict        Separate determinations of the same sequence differ
rep_origin      Origin of replication
protein_bind    Protein binding site on DNA
CDS             Protein coding sequence


Feature Keys-Terminology

Feature Key     Location/Qualifiers
CDS             23..400
                /product="alcohol dehydrogenase"
                /gene="adhI"

Interpretation: The feature CDS is a coding sequence beginning at base 23 and ending at base 400, has a product called "alcohol dehydrogenase" and corresponds to the gene called "adhI".


Feature Keys-Terminology (Cont.)

Feature Key     Location/Qualifiers
CDS             join(544..589,688..1032)
                /product="T-cell receptor beta-chain"
                /partial

Interpretation: The feature CDS is a partial coding sequence formed by joining the indicated elements to form one contiguous sequence encoding a product called T-cell receptor beta-chain.
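
The location strings in the two examples above can be read mechanically. The following is a minimal sketch, assuming only the simple forms start..end, join(...), complement(...) and the < and > partial markers; the function name parse_location and the returned dictionary layout are illustrative choices, and real feature tables allow more location forms than are handled here.

```python
import re


def parse_location(location):
    """Parse a simple feature location into strand, partial flag, and 1-based spans."""
    text = location.replace(" ", "")
    strand = "+"
    if text.startswith("complement(") and text.endswith(")"):
        strand = "-"                               # feature lies on the other strand
        text = text[len("complement("):-1]
    if text.startswith("join(") and text.endswith(")"):
        text = text[len("join("):-1]               # several spans joined into one feature
    spans = []
    partial = False
    for piece in text.split(","):
        match = re.fullmatch(r"(<?)(\d+)\.\.(>?)(\d+)", piece)
        if match is None:
            raise ValueError(f"unsupported location: {location!r}")
        if match.group(1) or match.group(3):
            partial = True                         # '<' or '>' marks an incomplete end
        spans.append((int(match.group(2)), int(match.group(4))))
    return {"strand": strand, "partial": partial, "spans": spans}


simple = parse_location("23..400")
joined = parse_location("join(544..589,688..1032)")
joined_length = sum(end - start + 1 for start, end in joined["spans"])
print(simple, joined, joined_length)
```

Summing the span lengths gives the length of the joined coding sequence, which is how "one contiguous sequence" is assembled from the listed elements.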


Record from GenBank

LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and
            Axl2p (AXL2) and Rev7p (REV7) genes, complete cds.
ACCESSION   U49845
VERSION     U49845.1  GI:1293613
KEYWORDS    .
SOURCE      baker's yeast.
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Hemiascomycetes; Saccharomycetales;
            Saccharomycetaceae; Saccharomyces.

Annotations:
LOCUS (21-JUN-1999): Modification date
LOCUS (PLN): GenBank division (plant, fungal and algal)
DEFINITION (cds): Coding region
ACCESSION: Unique identifier (never changes)
VERSION (U49845.1): Nucleotide sequence identifier (changes when there is a change in sequence; format accession.version)
VERSION (GI:1293613): GeneInfo identifier (changes whenever there is a change)
KEYWORDS: Word or phrase describing the sequence (not based on controlled vocabulary). Not used in newer records.
SOURCE: Common name for organism
ORGANISM: Formal scientific name for the source organism and its lineage, based on the NCBI Taxonomy Database


Record from GenBank (cont.1)

REFERENCE   1  (bases 1 to 5028)
  AUTHORS   Torpey,L.E., Gibbs,P.E., Nelson,J. and Lawrence,C.W.
  TITLE     Cloning and sequence of REV7, a gene whose function is required
            for DNA damage-induced mutagenesis in Saccharomyces cerevisiae
  JOURNAL   Yeast 10 (11), 1503-1509 (1994)
  MEDLINE   95176709
REFERENCE   2  (bases 1 to 5028)
  AUTHORS   Roemer,T., Madden,K., Chang,J. and Snyder,M.
  TITLE     Selection of axial growth sites in yeast requires Axl2p, a
            novel plasma membrane glycoprotein
  JOURNAL   Genes Dev. 10 (7), 777-793 (1996)
  MEDLINE   96194260
REFERENCE   3  (bases 1 to 5028)
  AUTHORS   Roemer,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (22-FEB-1996) Terry Roemer, Biology, Yale University,
            New Haven, CT, USA

Annotations:
REFERENCE 1: Oldest reference first
MEDLINE: Medline UID
REFERENCE 3: Submitter of sequence (always the last reference)


Record from GenBank (cont.2)

FEATURES             Location/Qualifiers
     source          1..5028
                     /organism="Saccharomyces cerevisiae"
                     /db_xref="taxon:4932"
                     /chromosome="IX"
                     /map="9"
     CDS             <1..206
                     /codon_start=3
                     /product="TCP1-beta"
                     /protein_id="AAA98665.1"
                     /db_xref="GI:1293614"
                     /translation="SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEA
                     AEVLLRVDNIIRARPRTANRQHM"

There are three parts to a feature: a key (a keyword that indicates the functional group), a location (instructions for finding the feature), and qualifiers (auxiliary information about the feature).

Annotations:
Keys: e.g., source, CDS
Location: e.g., 1..5028
Qualifiers: e.g., /organism=
Values: e.g., "Saccharomyces cerevisiae". Descriptive free text must be in quotation marks.
<1..206: Partial sequence on the 5' end. The 3' end is complete.
/codon_start=3: Start of open reading frame
/db_xref: Database cross-refs
/protein_id: Protein sequence ID #
CDS <1..206: Note: only a partial sequence
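
To see what /codon_start=3 means in practice, the sketch below translates the first 60 bases of this record (taken from its ORIGIN section) starting at the third base. It assumes the standard genetic code; the table is built from the usual T, C, A, G codon ordering, and the function name translate is an illustrative choice.

```python
# Standard genetic code, codons ordered with bases t, c, a, g at each position.
BASES = "tcag"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna, codon_start=1):
    """Translate DNA beginning at the 1-based position codon_start; incomplete codons are dropped."""
    dna = dna.lower()
    offset = codon_start - 1
    protein = []
    for pos in range(offset, len(dna) - 2, 3):
        protein.append(CODON_TABLE.get(dna[pos:pos + 3], "X"))  # X for unknown codons
    return "".join(protein)


# First 60 bases of U49845, spaces removed.
first_60 = "gatcctccatatacaacggtatctccacctcaggtttagatctcaacaacggaaccattg"
peptide = translate(first_60, codon_start=3)
print(peptide)  # SSIYNGISTSGLDLNNGTI
```

The result matches the beginning of the /translation qualifier, which is why the reading frame has to be stated for a CDS whose 5' end is missing.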


Record from GenBank (cont.3)

     gene            687..3158
                     /gene="AXL2"
     CDS             687..3158
                     /gene="AXL2"
                     /note="plasma membrane glycoprotein"
                     /codon_start=1
                     /function="required for axial budding pattern of S. cerevisiae"
                     /product="Axl2p"
                     /protein_id="AAA98666.1"
                     /db_xref="GI:1293615"
                     /translation="MTQLQISLLLTATISLLHLVVATPYEAYPIGKQYPPVARVN. . ."
     gene            complement(3300..4037)
                     /gene="REV7"
     CDS             complement(3300..4037)
                     /gene="REV7"
                     /codon_start=1
                     /product="Rev7p"
                     /protein_id="AAA98667.1"
                     /db_xref="GI:1293616"
                     /translation="MNRWVEKWLRVYLKCYINLILFYRNVYPPQSFDYTTYQSFNLPQ . . ."

Annotations:
gene 687..3158 and gene complement(3300..4037): New location
/translation (. . .): Cutoff
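
A location wrapped in complement(...), as for REV7, means the feature is read from the opposite strand. The sketch below shows the idea on a hypothetical 10-base sequence, since the bases at 3300..4037 are not reproduced here; the function name extract and its parameters are made up for this illustration.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def extract(sequence, start, end, on_complement=False):
    """Return bases start..end (1-based, inclusive); reverse-complement them if requested."""
    piece = sequence[start - 1:end]
    if on_complement:
        piece = piece.translate(COMPLEMENT)[::-1]
    return piece


toy = "aaacatgggg"  # hypothetical sequence, not from the record
forward = extract(toy, 4, 6)
reverse = extract(toy, 4, 6, on_complement=True)
print(forward, reverse)
```

Here bases 4..6 read "cat" on the given strand, while the complement strand read in its own 5' to 3' direction gives "atg".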


Record from GenBank (cont.4)

BASE COUNT     1510 a   1074 c    835 g   1609 t
ORIGIN
        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg
       61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct
. . .
//
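
The sequence lines are easy to turn back into a plain string: drop the leading position number and the spaces between the blocks of ten. The sketch below does that for the two lines shown and tallies the bases, as the BASE COUNT line does for the whole record. The names read_origin and origin_lines are illustrative, and because only 120 of the 5028 bases appear here, the tallies will not equal the BASE COUNT values.

```python
from collections import Counter

# The two sequence lines shown on the slide.
origin_lines = [
    "        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg",
    "       61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct",
]


def read_origin(lines):
    """Join sequence lines into one string, skipping the position number on each line."""
    bases = []
    for line in lines:
        fields = line.split()
        bases.extend(fields[1:])  # fields[0] is the position of the first base on the line
    return "".join(bases)


sequence = read_origin(origin_lines)
base_count = Counter(sequence)
print(len(sequence), dict(base_count))
```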


Primary databases contain experimental biological information

GenBank/EMBL/DDBJ

Alu: Alu repeats in human DNA

dbEST: expressed sequence tags; single-pass cDNA sequences (high error frequency)

It is non-redundant

HTGS: high-throughput genomic sequence database (errors!)

PDB: three-dimensional structure coordinates of biological molecules

PROSITE: database of protein domain/function relationships


Types of secondary databases that contain biological information

dbSTS: non-redundant db of sequence-tagged sites (useful for physical mapping)

Genome databases (there are over 20 genome databases that can be searched)

EPD: eukaryotic promoter database

NR: non-redundant GenBank+EMBL+DDBJ+PDB. Entries with 100% sequence identity are merged as one.

Vector: a subset of GenBank containing vector DNA

ProDom

PRINTS

BLOCKS


Workshop 2 A: Look up a GenBank record. Use the annotations to determine the first open reading frame.


Similarity Searching

It is easy to score if an amino acid is identical to another (the score is 1 if identical and 0 if not). However, it is not easy to give a score for amino acids that are somewhat similar.

[Figure: structures of leucine and isoleucine]

Should they get a 0 (non-identical), a 1 (identical), or something in between?


Purpose of finding differences and similarities of amino acids.

Infer structural information

Infer functional information

Infer evolutionary relationships


Evolutionary Basis of Sequence Alignment

1. Similarity: Quantity that relates to how alike two sequences are.

2. Identity: Quantity that describes how alike two sequences are in the strictest terms.

3. Homology: A conclusion drawn from data suggesting that two genes share a common evolutionary history.


Evolutionary Basis of Sequence Alignment (Cont. 1)

1. Example: Shown on the next page is a pairwise alignment of two proteins. One is mouse trypsin and the other is crayfish trypsin. They are homologous proteins. The sequences share 41% identity.

2. Underlined residues are identical. Asterisks and diamond represent those residues that participate in catalysis. Five gaps are placed to optimize the alignment.
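
A percent identity figure like the 41% quoted above can be computed from any pairwise alignment by counting matching columns. The sketch below is one way to do it, assuming that columns containing a gap are left out of the denominator; other conventions divide by the alignment length or by the shorter sequence, so the number depends on that choice. The two short aligned strings are invented for the example and are not the trypsin alignment.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of gap-free alignment columns in which both residues are identical."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    identical = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue          # skip columns that contain a gap
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# Hypothetical aligned fragments (one gap in the first sequence).
toy_a = "IVGG-YTCQ"
toy_b = "IVGGTDAVL"
identity = percent_identity(toy_a, toy_b)
print(identity)  # 50.0
```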


Evolutionary Basis of Sequence Alignment (Cont. 2)

Why are there regions of identity?

1) Conserved function: residues participate in the reaction.

2) Structural: residues participate in maintaining the structure of the protein (for example, conserved cysteine residues that form a disulfide linkage).

3) Historical: residues that are conserved solely due to a common ancestor gene.


Evolutionary Basis of Sequence Alignment (Cont. 3)

Note: It is possible that two proteins share a high degree of similarity but have two different functions. For example, human zeta-crystallin is a lens protein that has no known enzymatic activity. It shares a high percentage of identity with E. coli quinone oxidoreductase. These proteins likely had a common ancestor but their functions diverged.

Analogy: a railroad car that now functions as a diner.


Modular nature of proteins

The previous alignment was global. However, many proteins do not display global patterns of similarity. Instead, they possess local regions of similarity.

Proteins can be thought of as assemblies of modular domains. It is thought that this may, in some cases, be due to a process known as exon shuffling.


Modular nature of proteins (cont. 1)

[Figure: exon shuffling]

Duplication:
Gene A: Exon 1a, Exon 2a
becomes
Gene A: Exon 1a, Exon 2a, Exon 2a

Exchange (between Gene A and Gene B):
Gene A: Exon 1a, Exon 2a, Exon 3 (Ex. 2b from Gene B)
Gene B: Exon 1b, Exon 2b, Exon 3 (Ex. 2a from Gene A)


Dot Plots

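Window = 1

     A T G C C T A G
A    * . . . . . * .
T    . * . . . * . .
G    . . * . . . . *
C    . . . * * . . .
C    . . . * * . . .
T    . * . . . * . .
A    * . . . . . * .
G    . . * . . . . *

Note that 25% of the table will be filled due to random chance: there is a 1 in 4 chance of a match at each position.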


Dot Plots with window = 2

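Window = 2 (a dot marks the start of each matching two-base window)

     A T G C C T A G
A    * . . . . . . .
T    . * . . . . . .
G    . . * . . . . .
C    . . . * . . . .
C    . . . . * . . .
T    . . . . . * . .
A    . . . . . . * .
G    . . . . . . . .

The larger the window, the more noise can be filtered.

What is the percent chance that you will receive a match randomly? 1/16 * 100 = 6.25%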
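
The effect of the window size can be checked with a few lines of code. The sketch below compares every window of one sequence with every window of the other and records a dot where they are identical; the function name dot_plot and the set-of-coordinates representation are illustrative choices, not how any particular program stores its plot.

```python
def dot_plot(seq_a, seq_b, window=1):
    """Return the set of (i, j) start positions where windows of seq_a and seq_b match exactly."""
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            if seq_a[i:i + window] == seq_b[j:j + window]:
                dots.add((i, j))
    return dots


seq = "ATGCCTAG"
dots_w1 = dot_plot(seq, seq, window=1)
dots_w2 = dot_plot(seq, seq, window=2)
print(len(dots_w1), len(dots_w2))  # 16 7
```

With a window of 1 the self-comparison of ATGCCTAG gives 16 dots in a table of 64 cells (25%). With a window of 2 only the 7 dots on the main diagonal survive, so the off-diagonal noise is filtered out.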


Identity Matrix

Simplest type of scoring matrix

     L  I  C  A
L    1  0  0  0
I       1  0  0
C          1  0
A             1


Similarity

It is easy to score if an amino acid is identical to another (the score is 1 if identical and 0 if not). However, it is not easy to give a score for amino acids that are somewhat similar.

[Figure: structures of leucine and isoleucine]

Should they get a 0 (non-identical), a 1 (identical), or something in between?


Scoring Matrices

Importance of scoring matrices

• Scoring matrices appear in all analyses involving sequence comparisons.
• The choice of matrix can strongly influence the outcome of the analysis.
• Scoring matrices implicitly represent a particular theory of sequence alignment.
• Understanding theories underlying a given scoring matrix can aid in making the proper choice when performing sequence alignments.


Scoring Matrices

When we consider scoring matrices, we encounter the convention that matrices have numeric indices corresponding to the rows and columns of the matrix. For example, M11 refers to the entry at the first row and the first column. In general, Mij refers to the entry at the ith row and the jth column. To use this for sequence alignment, we simply associate a numeric value with each letter in the alphabet of the sequence. For example, if the alphabet is {A, C, T, G}, then A = 1, C = 2, etc., so aligning A with A is scored by M11 and aligning A with C by M12.
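
The letter-to-index convention can be sketched as a lookup table in front of a matrix. The example below uses the {A, C, T, G} alphabet from the text and an identity matrix as the scores; the names INDEX, IDENTITY, score and score_ungapped are made up for this illustration. Note that Python counts from 0, so the entry the text calls M11 is IDENTITY[0][0] here.

```python
ALPHABET = ["A", "C", "T", "G"]
# Map each letter to its row/column position (0-based in Python).
INDEX = {letter: position for position, letter in enumerate(ALPHABET)}

# Identity matrix: 1 on the diagonal (identical letters), 0 elsewhere.
IDENTITY = [[1 if row == col else 0 for col in range(len(ALPHABET))]
            for row in range(len(ALPHABET))]


def score(matrix, a, b):
    """Look up the score for aligning letter a with letter b."""
    return matrix[INDEX[a]][INDEX[b]]


def score_ungapped(matrix, seq_a, seq_b):
    """Sum the column scores of two equal-length sequences aligned without gaps."""
    return sum(score(matrix, a, b) for a, b in zip(seq_a, seq_b))


print(score(IDENTITY, "A", "A"), score(IDENTITY, "A", "C"))
print(score_ungapped(IDENTITY, "ATGC", "ATCC"))
```

Swapping IDENTITY for another matrix of the same shape changes the scoring theory without changing any of the lookup code.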


Steps to building the first PAM (Point Accepted Mutation)

1. Dayhoff aligned sequences that were at least 85% identical.

2. Reconstructed phylogenetic trees and inferred ancestral sequences. 71 trees containing 1,572 aa exchanges were used.

3. Tallied aa replacements "accepted" by natural selection, in all pair-wise comparisons.


Steps to building PAM (cont. 1)

4. Computed amino acid mutability, $m_j$ (the propensity of a given amino acid, j, to be replaced).

5. Combined data from 3 & 4 to produce a Mutation Probability Matrix (MPM) for one PAM of evolutionary distance, according to the following formula:

$M_{jj} = 1 - m_j$

where $M_{jj}$ is the MPM entry of aa j for aa j (no change) and $m_j$ accounts for the replacements.


Steps to building PAM (cont. 2)

6. Took the log odds ratio to obtain each score:

$S_{ij} = \log(M_{ij}/f_i)$ (Note: this is what you see in the matrix)

where $f_i$ is the normalized frequency of aa i in the sequences used.

7. Note: the log-odds values are multiplied by 10 and rounded so that the matrix holds integers rather than fractions.
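
The log-odds step can be tried on a pair of numbers. The sketch below assumes the common convention of taking base-10 logarithms, multiplying by 10 and rounding to the nearest integer; the probabilities and frequency passed in are hypothetical values chosen for easy arithmetic, not entries from the published matrix.

```python
import math


def log_odds_score(m_ij, f_i, scale=10):
    """Scaled, rounded log-odds score from a mutation probability and a background frequency."""
    return round(scale * math.log10(m_ij / f_i))


# Hypothetical numbers: a replacement seen half as often as chance predicts ...
rare = log_odds_score(0.02, 0.04)
# ... and one seen ten times as often as chance predicts.
common = log_odds_score(0.4, 0.04)
print(rare, common)  # -3 10
```

Negative scores therefore mark pairings that occur less often than expected by chance, and positive scores mark pairings that occur more often.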


Assumptions in the PAM model

1. Replacement at any site depends only on the amino acid at that site and the probability given by the table (Markov model).

2. Sequences that are being compared have average amino acid composition.


The bottom line on PAM

Frequencies of alignment / Frequencies of occurrence

That is, the probability that two amino acids, i and j, are aligned by evolutionary descent divided by the probability that they are aligned by chance.


Sources of error in PAM model

1. Many sequences depart from average aa composition.

2. Rare replacements were observed too infrequently to resolve relative probabilities accurately (for 36 aa pairs, out of approximately 400 aa pairs, no replacements were observed!).

3. Errors in 1 PAM are magnified in the extrapolation to 250 PAM ($M_{ij}^{k}$ = k PAM, i.e., the 1 PAM matrix multiplied by itself k times).

4. This process (Markov) is an imperfect representation of evolution: distantly related sequences usually have islands (blocks) of conserved residues. This implies that replacement is not equally probable over entire sequence.
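
The extrapolation in point 3 amounts to repeated matrix multiplication. The sketch below raises a hypothetical two-state mutation probability matrix to the 250th power using plain Python lists; the 2x2 values are invented (each column sums to 1), whereas the real matrix is 20x20.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def mat_pow(matrix, k):
    """Multiply a square matrix by itself k times (k = 0 gives the identity matrix)."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(k):
        result = mat_mul(result, matrix)
    return result


# Hypothetical 1 PAM matrix for a two-letter alphabet; column j holds the
# probabilities of letter j staying the same or being replaced.
pam1_toy = [[0.99, 0.02],
            [0.01, 0.98]]
pam250_toy = mat_pow(pam1_toy, 250)
print(pam250_toy)
```

Because every entry of the result is built from hundreds of products of the original entries, a small error in a 1 PAM value is carried into every step of the extrapolation.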


BLOSUM Matrices

BLOSUM is built from distantly related sequences whereas PAM is built from closely related sequences

BLOSUM is built from conserved blocks of aligned protein segments found in the BLOCKS database (remember that the BLOCKS database is a secondary database that depends on the PROSITE families).


Gap Penalties

Take into account insertions and deletions.

There can't be too many gaps, or the alignment becomes meaningless.

Typically, there is a fixed deduction for introducing a gap plus an additional deduction for the length of the gap.

Gap penalty = G + Ln, where G = gap opening penalty, L = gap extension penalty and n = gap length.

G = 2 to 12, L = 2
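
The gap formula is simple enough to write out directly. The sketch below implements Gap penalty = G + Ln as stated; the default G = 10 is just one value from the quoted 2 to 12 range, and the function name gap_penalty is an illustrative choice. Some programs charge the extension only for positions after the first, so their totals can differ by L.

```python
def gap_penalty(length, gap_open=10, gap_extend=2):
    """Penalty for one gap of the given length: opening cost plus per-position extension cost."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * length


print(gap_penalty(1), gap_penalty(3))  # 12 16
```

One long gap of length 3 costs 16, whereas three separate gaps of length 1 cost 36, which is how the opening penalty discourages scattering many gaps through an alignment.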


Global Alignment vs. Local Alignment

Global alignment is used when the overall gene sequence is similar to another sequence; it is often used in multiple sequence alignment.
• Clustal W
• Needleman-Wunsch algorithm

Local alignment is used when only a small portion of one gene is similar to a small portion of another gene.
• BLAST
• FASTA
• Smith-Waterman algorithm
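
The difference between the two approaches shows up clearly in their dynamic-programming scores. The sketch below computes a Needleman-Wunsch style global score and a Smith-Waterman style local score with one function. It is a simplified illustration: it uses match/mismatch values and a fixed per-position gap cost (all hypothetical defaults) instead of a substitution matrix and affine gaps, and it returns only the score, not the alignment.

```python
def alignment_score(seq_a, seq_b, local=False, match=1, mismatch=-1, gap=-2):
    """Best global score (default) or best local score (local=True) for two sequences."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    table = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are charged along the first row and column.
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            cell = max(table[i - 1][j - 1] + pair,   # align the two residues
                       table[i - 1][j] + gap,        # gap in seq_b
                       table[i][j - 1] + gap)        # gap in seq_a
            if local:
                cell = max(cell, 0)                  # local alignment may restart anywhere
                best = max(best, cell)
            table[i][j] = cell
    return best if local else table[-1][-1]


# Two hypothetical sequences that share only the core ATGC.
a = "TTTTATGCAAAA"
b = "GGGGATGCCCCC"
global_score = alignment_score(a, b)
local_score = alignment_score(a, b, local=True)
print(global_score, local_score)
```

The local score picks out the shared ATGC core, while the global score is pulled down by the unrelated flanking regions that a global alignment is forced to include.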


Two proteins that are similar in certain regions

Tissue plasminogen activator (PLAT)

Coagulation factor 12 (F12)


The Dotter Program

• Program consists of three components:
  • A sliding window
  • A scoring matrix that gives a score for each amino acid
  • A graph that converts the score to a dot of a certain pixel density


[Figure: Dotter plot of PLAT against F12]

Region of similarity

A single region on F12 is similar to two regions on PLAT


BLAST

Basic Local Alignment Search Tool

Speed is achieved by:
• Pre-indexing the database before the search
• Parallel processing

Uses a hash table that contains neighborhood words rather than just identical words.


Neighborhood words

The program declares a hit if the word taken from the query sequence has a score >= T when a substitution matrix is used.

This allows the word size, W (similar to the ktup value in FASTA), to be kept high (for speed) without sacrificing sensitivity.

If T is increased by the user the number of background hits is reduced and the program will run faster
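
The role of the threshold T can be illustrated by listing the neighborhood of a single word. The sketch below enumerates every word of the same length over a small alphabet and keeps those scoring at least T against the query word. It is a brute-force illustration only: the four-letter alphabet and the toy_score values are hypothetical stand-ins for the 20 amino acids and a real substitution matrix, and it makes no claim about how BLAST builds its table internally.

```python
from itertools import product


def neighborhood_words(word, alphabet, score, threshold):
    """List all words of the same length whose score against `word` is >= threshold."""
    neighbors = []
    for candidate in product(alphabet, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, candidate))
        if total >= threshold:
            neighbors.append("".join(candidate))
    return neighbors


def toy_score(a, b):
    """Hypothetical scores: identical 5, the similar pair L/I 2, anything else -1."""
    if a == b:
        return 5
    if {a, b} == {"L", "I"}:
        return 2
    return -1


TOY_ALPHABET = "LICA"
strict = neighborhood_words("LIC", TOY_ALPHABET, toy_score, threshold=15)
medium = neighborhood_words("LIC", TOY_ALPHABET, toy_score, threshold=12)
loose = neighborhood_words("LIC", TOY_ALPHABET, toy_score, threshold=9)
print(strict, sorted(medium), len(loose))
```

Raising T from 9 to 12 to 15 shrinks the list each time, which is the sense in which a higher T means fewer background hits and a faster search.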


Workshop for module 2: Use the Dotter program to determine the optimal alignment between two sequences. Perform a BLAST search on a protein sequence.