Alignments Lecture

  • 8/8/2019 Alignments Lecture

    Pharmaceutical Bioinformatics, 7.5p Lecture notes

    Alignments in bioinformatics

    Lecture notes

    Compiled by:

    Ola Spjuth

    Department of Pharmaceutical Biosciences

    Uppsala University

    ALIGNMENTS IN BIOINFORMATICS

    Sequence Analysis
        Biological Background for Sequence Analysis
        Searching of databases for sequences similar to a new sequence

    Sequence alignment

    Multiple sequence alignment
        Evaluating local multiple alignments

    Tools for sequence alignment
        BLAST
        Clustal

    Uses of multiple alignment
        Searching
        PCR primer design

    Structural alignments
        Data produced by structural alignment

    References

    Four different nucleotides taken three at a time can result in 64 different possible triplet codes, more than enough to encode 20 amino acids. These 64 codes are mapped onto the 20 amino acids in two ways: first, one amino acid may be encoded by 1 to 6 different triplet codes, and second, 3 of the 64 codes, called stop codons, specify "end of peptide sequence". Where multiple codons specify the same amino acid, the different codons are used with unequal frequency and this distribution of frequency is referred to as "codon usage". Codon usage varies between species.
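
    The counts in this paragraph can be checked with a short sketch. It assumes the standard genetic code, written as the usual 64-letter string with codons enumerated in T, C, A, G order; the variable names are illustrative only.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code (assumed here), codons enumerated in TCAG order.
# '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each of the 4 * 4 * 4 = 64 triplets to its amino acid (or stop).
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

codons_per_residue = Counter(CODON_TABLE.values())
stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")
# Number of codons per amino acid, stops excluded.
degeneracy = {aa: n for aa, n in codons_per_residue.items() if aa != "*"}

print("triplet codes:", len(CODON_TABLE))
print("stop codons:", stop_codons)
print("amino acids:", len(degeneracy))
print("codons per amino acid range:",
      min(degeneracy.values()), "to", max(degeneracy.values()))
```

    Codon usage itself is not modelled here; that would need observed codon frequencies from a particular species.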

    The fact that DNA nucleotides need to be read three at a time to specify a protein sequence implies that a DNA sequence has three different reading frames determined by whether you start at nucleotide one, two, or three. (Nucleotide four will be in the same frame as nucleotide one and so on). Both strands of DNA can be copied into RNA (for translation into protein). Thus, a DNA sequence with its (inferred) complementary strand can specify six different reading frames.
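
    As a minimal sketch of the six reading frames, the following splits a DNA string into codons for each frame on both strands. The function names and the frame labels (+1 to +3, -1 to -3) are illustrative choices, not part of the notes.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the inferred complementary strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def six_frames(dna):
    """Return {frame label: list of complete codons} for all six frames."""
    frames = {}
    for label, strand in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            # Only complete triplets are kept; trailing bases are dropped.
            codons = [strand[i:i + 3]
                      for i in range(offset, len(strand) - 2, 3)]
            frames[f"{label}{offset + 1}"] = codons
    return frames


example = "ATGGCCTAAG"  # made-up sequence for illustration
for name, codons in six_frames(example).items():
    print(name, " ".join(codons))
```

    Starting at nucleotide four would give the same codons as frame +1 minus its first codon, which is why only three offsets per strand are needed.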

    It is possible to chemically determine the sequence of amino acids in a protein and of nucleotides in RNA or DNA. However, it is vastly easier at present to determine the sequence of DNA than that of RNA or protein. Since the sequence of a protein can be determined from the DNA sequence that encodes it, most protein sequences are in fact inferred from DNA sequences. Conversion of RNA to a DNA copy (cDNA) is a simple laboratory procedure, so RNA molecules are themselves sequenced as cDNA copies.

    Searching of databases for sequences similar to a new sequence

    If you have just determined a sequence of an interesting bit of DNA, one of the first questions you are likely to ask yourself is "has anybody else seen anything like this?" Fortunately, there has been a very successful international effort to collect all the sequences people have determined in one place so they can be searched. For DNA sequences, three groups have cooperated in this effort, one in Japan, one in Europe, and one in the United States to produce DDBJ, EMBL and GenBank, respectively. These databases are frequently reconciled with each other, so that searching any one is virtually the same as searching all three. The problem is that these databases are HUGE and, as a result, you must compare your sequence with this vast number of other sequences efficiently. A number of programs have been written to rapidly search a database for a query sequence, two of which, BLAST and FASTA, will be discussed in this course. The techniques used by these programs to make searching rapid result in some loss of rigor of comparison. It is possible (although, as it turns out, unlikely) that a weak but relevant similarity could be missed by these programs. In addition, many times these programs will flag a sequence as being similar to your query sequence when this similarity is not significant. Thus, these programs should be seen as tools for identifying a small subset of sequences from the database for retrieval and further analysis rather than ends in themselves.

    Databases of protein sequences, including UniProt and PIR, also exist and can similarly be searched.

    Figure: First 90 positions of a protein multiple sequence alignment of instances of the acidic ribosomal protein P0 (L10E) from several organisms. Generated with ClustalW.

    Sequences can be aligned across their entire length (global alignment) or only in certain regions (local alignment). This is true for pairwise and multiple alignments. Global alignments need to use gaps (representing insertions/deletions) while local alignments can avoid them, aligning regions between gaps.

    Evaluating local multiple alignments

    Some programs give quantitative measures for the significance of the alignment. These are usually based on the chance occurrence of such alignments and depend on the size and composition of the aligned sequences. Empirical measures are also extremely useful for deciding the 'correctness' of the multiple alignment. Consistency is a powerful measure for correct multiple alignments. If the same alignment is found in the sequence-to-sequence searches and various multiple alignment methods it is most probably correct. One pitfall to avoid is biased sequence composition that may lead to trivial alignments.

    Experimental data can be used in evaluating, and even constructing, multiple alignments. For example, if we know the catalytic site in the aligned proteins we expect the sites to be aligned together and may 'force' that alignment. Such manual alignments can serve as a seed to an alignment with more sequences.

    Local multiple alignments (blocks) from different programs can be joined or used together. Another approach is 'divide and conquer'. Blocks present in all sequences divide them into separate parts, in each of which more blocks can be searched for.

    Tools for sequence alignment

    BLAST

    BLAST is an acronym for Basic Local Alignment Search Tool, and it consists of a set of algorithms for comparing biological sequences such as nucleotide or protein sequences. A nucleotide sequence is simply a DNA sequence (or part of one) expressed as a long string of 4 characters: A, T, C and G. They stand for Adenine, Thymine, Cytosine and Guanine. So, every nucleotide sequence consists of only these four characters arranged in different orders.

    BLAST allows you to compare your sequence against a database of sequences and informs you if your sequence matches any of the sequences in the database, along with a lot of information like:

    * Identity of match (% of characters matched; often loosely called "homology")
    * Alignment length (over what length did the nucleotides match)
    * E-value (Expectation value. The number of different alignments with scores equivalent to or better than S, the score of the reported alignment, that are expected to occur in a database search by chance. The lower the E value, the more significant the score)
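
    To make the E-value concrete, here is a hedged sketch of the Karlin-Altschul relation E = K * m * n * exp(-lambda * S) on which BLAST statistics are based, where m is the query length and n the database length. The values of K and lambda below are placeholders chosen for illustration; a real BLAST run reports its own parameters and uses length-corrected (effective) values of m and n.

```python
import math

# Illustrative statistical parameters (assumptions, not taken from the notes).
LAMBDA = 0.318
K = 0.134


def expected_hits(raw_score, query_len, db_len, lam=LAMBDA, k=K):
    """E-value: chance alignments expected with a score >= raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def bit_score(raw_score, lam=LAMBDA, k=K):
    """Normalised score, so that E = m * n * 2 ** (-bit_score)."""
    return (lam * raw_score - math.log(k)) / math.log(2)


# A 100-residue query against a hypothetical 1,000,000-residue database.
for score in (40, 60, 80):
    e = expected_hits(score, 100, 1_000_000)
    print(f"S={score}  bits={bit_score(score):.1f}  E={e:.2e}")
```

    The loop shows the behaviour described in the list: the higher the score, the lower the E-value, and a larger database raises the E-value for the same score.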

    For a complete BLAST glossary you may visit http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/glossary2.html

    So, now that you know BLAST can be used to align two sequences and to study the similarity between two or more sequences, let us look into the principles of sequence alignment briefly.

    Sequence alignment refers to arranging two sequences in an order such that their similar portions are highlighted.

    For example:

    AGCTATGGGCAAATTTGGAACAAACCAAAAAGT
    ........ ........ ...............
    AGCTATGGACAAATTTGCAACAAACCAAAAAGT

    The positions in the sequences which do not match are shown by gaps in the middle line of the alignment.
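
    The middle line of such an alignment, and the percentage of matching characters, can be produced with a few lines of code. This is only a sketch for two already aligned strings of equal length; the function names are illustrative.

```python
def match_line(a, b, match=".", mismatch=" "):
    """Mark each aligned position as matching or not."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    return "".join(match if x == y else mismatch for x, y in zip(a, b))


def percent_identity(a, b):
    """Percentage of aligned positions carrying the same character."""
    line = match_line(a, b)
    return 100.0 * line.count(".") / len(line)


top = "AGCTATGGGCAAATTTGGAACAAACCAAAAAGT"
bottom = "AGCTATGGACAAATTTGCAACAAACCAAAAAGT"

print(top)
print(match_line(top, bottom))
print(bottom)
print(f"identity: {percent_identity(top, bottom):.1f}%")
```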

    Global Alignment: It refers to the alignment in which all the characters in both sequences participate in the alignment.

    Local Alignment: It refers to finding closely matching regions between sequences. In local alignment the beginning part of a sequence (say nucleotides 0-100) may align with the ending part of another sequence (say nucleotides 400-500).
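
    The difference between the two definitions can be illustrated with the classic dynamic programming scores: Needleman-Wunsch for global and Smith-Waterman for local alignment. These algorithms are not named in the notes and BLAST itself uses faster heuristics, so treat this as a sketch; the scoring values (match +1, mismatch -1, gap -2) are arbitrary assumptions and only the score is computed, not the alignment itself.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-2):
    """Best alignment score by dynamic programming (no traceback)."""
    cols = len(b) + 1
    # First row: a global alignment pays for leading gaps, a local one does not.
    prev = [0 if local else j * gap for j in range(cols)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                # A local alignment may restart anywhere, so never go below 0.
                cell = max(cell, 0)
                best = max(best, cell)
            curr.append(cell)
        prev = curr
    # Global: score of aligning everything. Local: best-scoring region.
    return best if local else prev[-1]


# The end of the first sequence matches the start of the second.
first = "TTTTACGTACGT"
second = "ACGTACGTCCCC"
print("global:", alignment_score(first, second))
print("local: ", alignment_score(first, second, local=True))
```

    With these made-up sequences the local score picks out the shared ACGTACGT region, while the global score is pulled down by the unrelated ends that must also take part in the alignment.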

    BLAST flavours

    The BLAST programs are widely used tools for searching DNA and protein databases for sequence similarity to identify homologs to a query sequence. While often referred to as just "BLAST", this can really be thought of as a set of programs: blastp, blastn, blastx, tblastn, and tblastx.

    The five flavours of BLAST perform the following tasks:

    blastp
        o Compares an amino acid query sequence against a protein sequence database

    blastn
        o Compares a nucleotide query sequence against a nucleotide sequence database

    blastx
        o Compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database

    tblastn
        o Compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands).

    tblastx
        o Compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. (Due to the nature of tblastx, gapped alignments are not available with this option)

    Links for BLAST:

    NCBI's BLAST tool can be found at http://www.ncbi.nlm.nih.gov/blast/

    An article on the methodology behind BLAST: http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html

    How to interpret BLAST output: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/tut2.html

    Clustal

    Clustal is a fully automatic program for global multiple alignment of DNA and protein sequences. The alignment is progressive and considers the sequence redundancy. Trees can also be calculated from multiple alignments (see below). The program has some adjustable parameters with reasonable defaults. ClustalW is available on the WWW and for various computer operating systems.

    How does ClustalW work (very simple explanation)?

    1. Determine all pairwise alignments between sequences and the degree of similarity between them.

    2. Construct a similarity tree.

    3. Combine the alignments from 1 in the order specified in 2 using the rule "once a gap always a gap".

    In stage 1:

    1.1. ClustalW uses a pairwise alignment algorithm to compute all pairwise alignments.

    1.2. Using the alignments from 1.1 it computes a distance for each pair.

    1.2.1. The distance is commonly calculated by looking at the non-gapped positions and counting the number of mismatches between the two sequences. This value is then divided by the number of non-gapped pairs to give the distance. Once the distances for all pairs are calculated they go into a matrix, which is used in stage 2.
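
    The distance described in 1.2.1 can be sketched as follows. This is a simplified illustration rather than ClustalW's actual code; the pairwise alignments are made-up inputs, gaps are assumed to be written as "-", and the function names are hypothetical.

```python
def pairwise_distance(a, b, gap="-"):
    """Fraction of mismatches over positions where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        raise ValueError("no non-gapped positions to compare")
    mismatches = sum(x != y for x, y in pairs)
    return mismatches / len(pairs)


def distance_matrix(names, pairwise_alignments):
    """Symmetric matrix built from one pairwise alignment per sequence pair."""
    index = {name: k for k, name in enumerate(names)}
    matrix = [[0.0] * len(names) for _ in names]
    for (first, second), (a, b) in pairwise_alignments.items():
        d = pairwise_distance(a, b)
        matrix[index[first]][index[second]] = d
        matrix[index[second]][index[first]] = d
    return matrix


names = ["s1", "s2", "s3"]
alignments = {
    ("s1", "s2"): ("ACGT-A", "ACCTGA"),
    ("s1", "s3"): ("ACGTA", "ACGTA"),
    ("s2", "s3"): ("ACCTGA", "ACGT-A"),
}
for name, row in zip(names, distance_matrix(names, alignments)):
    print(name, row)
```

    The resulting matrix is the input that a tree-building method such as Neighbor-Joining would work from in stage 2.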

    2. Using the matrix from 1.2.1. and Neighbor-Joining, ClustalW constructs the similarity tree. The root is placed in the middle of the longest chain of consecutive edges.

    3. Combine the alignments, starting from the most closely related groups (going from the tips of the tree towards the root).

    Uses of multiple alignment

    The basic information from a multiple alignment of protein sequences is the position and nature of the conserved regions in each member of the group. Conserved sequence regions correspond to functionally and structurally important parts of the protein. We often only know the sequence-to-function relation for one or two members of the group. Multiple alignments let us transfer that knowledge to the other members in the group. Hypotheses about functional importance or specific roles can then be directly tested by mutagenesis and truncation experiments.

    Viewing

    Multiple alignments of many sequences and those with different sequence weights are difficult to visualize. Sequence logos are a graphical way for presenting multiple alignments.

    ID   ADH_IRON_1; BLOCK
    AC   BL00913C; distance from previous block=(56,76)
    DE   Iron-containing alcohol dehydrogenases proteins.
    BL   HHG motif; width=22; seqs=11; 99.5%=492; strength=1428
    ADHE_CLOAB ( 720) CHSMAIKLSSEHNIPSGIANAL  66
    FUCO_ECOLI ( 262) VHGMAHPLGAFYNTPHGVANAI  44
    GLDA_BACST ( 259) HNGFTALEGEIHHLTHGEKVAF 100
    GLDA_ECOLI ( 269) VHNGLTAIPDAHHYYHGEKVAF 100
    MEDH_BACMT ( 259) VHSISHQVGGVYKLQHGICNSV  78
    ADH1_CLOAB ( 258) CHSMAHKTGAVFHIPHGCANAI  47
    ADHE_ECOLI ( 721) CHSMAHKLGSQFHIPHGLANAL  47
    ADH2_ZYMMO ( 261) VHAMAHQLGGYYNLPHGVCNAV  36
    ADH4_YEAST ( 263) VHALAHQLGGFYHLPHGVCNAV  41
    ADHA_CLOAB ( 266) CHPMEHELSAYYDITHGVGLAI  50
    ADHB_CLOAB ( 266) VHLMEHELSAYYDITHGVGLAI  49
    //

    Figure: Block and logo of a conserved region in iron containing alcohol dehydrogenases. The block is first transformed into a position specific scoring matrix (PSSM) that allows for the sequence weights and expected frequencies of different amino acids (aa). The logo shows the aa present in each alignment position. The taller an aa letter and the whole stack, the more conserved they are. The conservation is shown in bits and the aa are shaded according to their properties. The conserved histidines probably bind the ferrous ion(s) required for these enzymes' activity.
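
    The stack heights in a logo can be approximated from the block shown above. The sketch below uses the plain information content of each column, log2(20) minus the Shannon entropy of the observed residues. It is a simplification: it ignores the sequence weights and expected amino acid frequencies mentioned in the caption and applies no small-sample correction, so the numbers will differ from the real logo.

```python
import math
from collections import Counter

# Segments of the ADH_IRON_1 block (BL00913C) listed above.
BLOCK = [
    "CHSMAIKLSSEHNIPSGIANAL",
    "VHGMAHPLGAFYNTPHGVANAI",
    "HNGFTALEGEIHHLTHGEKVAF",
    "VHNGLTAIPDAHHYYHGEKVAF",
    "VHSISHQVGGVYKLQHGICNSV",
    "CHSMAHKTGAVFHIPHGCANAI",
    "CHSMAHKLGSQFHIPHGLANAL",
    "VHAMAHQLGGYYNLPHGVCNAV",
    "VHALAHQLGGFYHLPHGVCNAV",
    "CHPMEHELSAYYDITHGVGLAI",
    "VHLMEHELSAYYDITHGVGLAI",
]


def column_information(column, alphabet_size=20):
    """Information content of one alignment column in bits (unweighted)."""
    counts = Counter(column)
    total = len(column)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy


def logo_heights(sequences):
    """Stack height in bits for every column of the block."""
    return [column_information(col) for col in zip(*sequences)]


for position, bits in enumerate(logo_heights(BLOCK), start=1):
    print(f"{position:2d} {bits:.2f} {'#' * round(bits * 5)}")
```

    A column in which all 11 sequences carry the same residue reaches the maximum of log2(20), about 4.32 bits, while a highly variable column scores much lower.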

    A different graphical view of multiply aligned sequences is by a tree relating their sequence similarity. This is very useful when the aligned sequences are of several functional subtypes and we wish to know to which one our sequence(s) belong. A way to estimate the significance of a tree is by bootstrap values. Simply put, these values show how many times each bifurcation (branching point) was observed with different resamplings of the input data. The higher the fraction of the bootstrap value (number of observations/number of trials) the more confident we can be that the sequences emerging from that branch point cluster together.

    Fig: A tree made from the three blocks in the iron containing alcohol dehydrogenases family. Bootstrap values are for 100 trials. The tree was calculated from the blocks with the ClustalW program and drawn with the TreeView program.

    Searching

    Multiple alignments are powerful tools for identifying new members of the aligned group. It is possible to query databases of multiple alignments with single sequences and to query sequence databases with multiple alignments. It has been shown that such searches are more sensitive and selective than sequence-to-sequence searches. A simple (but very effective!) 'hybrid' approach is to use a properly made consensus sequence.
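
    A very rough sketch of the consensus idea: take the most frequent residue in each column of an alignment and use the result as a query. The majority rule and the ungapped scan below are assumptions made for illustration; a "properly made" consensus would normally account for sequence weights, and a real search would use a program such as BLAST rather than this naive scan.

```python
from collections import Counter


def consensus(sequences):
    """Most frequent residue per column (ties go to the first one seen)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*sequences))


def best_match(query, target):
    """Start position and match count of the best ungapped placement.

    Assumes the target is at least as long as the query.
    """
    scores = [
        (sum(q == t for q, t in zip(query, target[i:i + len(query)])), i)
        for i in range(len(target) - len(query) + 1)
    ]
    score, start = max(scores)
    return start, score


aligned = ["ACGT", "ACGA", "TCGA"]  # toy alignment, not real data
query = consensus(aligned)
print(query)
print(best_match(query, "TTACGTTT"))
```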

    PCR primer design

    Design of degenerate PCR primers is emerging as a major use for multiple alignments. PCR can identify the sequence of a gene in genomic or other DNA from two short flanking segments (primers). Conserved sequence regions are (by definition) a good source for primer design. When designing primers the conservation of the regions, the degeneracy of the genetic code and parameters of the PCR reaction must be considered. The Blocks WWW server designs PCR primers for each family in the database, for sequence groups submitted to be aligned and for multiple alignments submitted to be reformatted. These primers are degenerate at the 3' end and consensus at the 5' end (CODEHOP: COnsensus-DEgenerate Hybrid Oligonucleotide Primers). The design is fully automatic but the user can set the requested Tm and genetic code, and bias the primers toward some of the sequences. CODEHOP primers were shown to be more effective than simple degenerate primers in various cases.
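
    The role of genetic code degeneracy in primer design can be illustrated by counting how many distinct DNA sequences encode a short conserved peptide. This sketch assumes the standard genetic code and is not how the Blocks server or CODEHOP computes primers; the function name is illustrative.

```python
from collections import Counter

# Standard genetic code as a 64-letter string (codons in TCAG order);
# counting letters gives the number of codons per amino acid.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS_PER_RESIDUE = Counter(AMINO_ACIDS)


def primer_degeneracy(peptide):
    """Number of different DNA sequences that encode the peptide."""
    total = 1
    for aa in peptide:
        if aa == "*" or aa not in CODONS_PER_RESIDUE:
            raise ValueError(f"not a standard amino acid: {aa!r}")
        total *= CODONS_PER_RESIDUE[aa]
    return total


# HGVANA occurs in the FUCO_ECOLI segment of the block shown earlier.
for peptide in ("HGVANA", "MW", "LSR"):
    print(peptide, primer_degeneracy(peptide))
```

    Peptides rich in residues with one or two codons (such as M or W) give far fewer combinations than those containing six-codon residues (L, S, R), which is one reason the choice of conserved region matters.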

    Structural alignments

    Structural alignment is a form of sequence alignment based on comparison of shape. These alignments attempt to establish equivalences between two or more polymer structures based on their shape and three-dimensional conformation. This process is usually applied to protein tertiary structures but can also be used for large RNA molecules. In contrast to simple structural superposition, where at least some equivalent residues of the two structures are known, structural alignment requires no a priori knowledge of equivalent positions. Structural alignment is a valuable tool for the comparison of proteins with low sequence similarity, where evolutionary relationships between proteins cannot be easily detected by standard sequence alignment techniques. Structural alignment can therefore be used to imply evolutionary relationships between proteins that share very little common sequence. However, caution should be used in using the results as evidence for shared evolutionary ancestry because of the possible confounding effects of convergent evolution by which multiple unrelated amino acid sequences converge on a common tertiary structure.

    Structural alignments can compare two sequences or multiple sequences. Because these alignments rely on information about all the query sequences' three-dimensional conformations, the method can only be used on sequences where these structures are known. These are usually found by X-ray crystallography or NMR spectroscopy. It is possible to perform a structural alignment on structures produced by structure prediction methods. Indeed, evaluating such predictions often requires a structural alignment between the model and the true known structure to assess the model's quality. Structural alignments are especially useful in analyzing data from structural genomics and proteomics efforts, and they can be used as comparison points to evaluate alignments produced by purely sequence-based bioinformatics methods.

    The outputs of a structural alignment are a superposition of the atomic coordinate sets and a minimal root mean square distance (RMSD) between the structures. The RMSD of two aligned structures indicates their divergence from one another. Structural alignment can be complicated by the existence of multiple protein domains within one or more of the input structures, because changes in relative orientation of the domains between two structures to be aligned can artificially inflate the RMSD.
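
    For reference, the RMSD between two sets of equivalent atoms is the square root of the mean squared distance between paired coordinates. The sketch below assumes the two structures are already superposed and that the atom pairing is known; finding the optimal superposition and the equivalences is the hard part of structural alignment and is not attempted here. The coordinates are made up.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square distance between paired (x, y, z) coordinates."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate sets of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


# Hypothetical coordinates: the second set is the first shifted by (3, 4, 0).
reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 1.0, 0.5)]
shifted = [(x + 3.0, y + 4.0, z) for x, y, z in reference]

print(rmsd(reference, reference))
print(rmsd(reference, shifted))
```

    The second call shows how a rigid displacement alone inflates the RMSD, which is the same effect as a change in relative domain orientation between otherwise similar structures.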

    Fig: Structural alignment of thioredoxins from humans and the fly Drosophila melanogaster. The proteins are shown as ribbons, with the human protein in red, and the fly protein in yellow. Generated from PDB 3TRX and 1XWC.

    Data produced by structural alignment

    The minimum information produced from a successful structural alignment is a set of superposed three-dimensional coordinates for each input structure. (Note that one input element may be fixed as a reference and therefore its superposed coordinates do not change.) The fitted structures can be used to calculate mutual RMSD values, as well as other more sophisticated measures of structural similarity such as the global distance test (GDT, the metric used in CASP). The structural alignment also implies a corresponding one-dimensional sequence alignment from which a sequence identity, or the percentage of residues that are identical between the input structures, can be calculated as a measure of how closely the two sequences are related.

    References

    http://en.wikipedia.org/wiki/Structural_alignment

    http://en.wikipedia.org/wiki/Sequence_alignment_software

    http://en.wikipedia.org/wiki/Multiple_sequence_alignment

    http://puneetwadhwa.blogspot.com/2005/10/introduction-to-blast-basic-local.html

    http://en.wikipedia.org/wiki/BLAST