Introduction to Bioinformatics From Pairwise to Multiple Alignment.

Date posted: 22-Dec-2015

Page 1: Introduction to Bioinformatics From Pairwise to Multiple Alignment.

Introduction to Bioinformatics

From Pairwise to Multiple Alignment

Page 2: Introduction to Bioinformatics From Pairwise to Multiple Alignment.

Outline

• Advances in BLAST

• Multiple Sequence Alignment: CLUSTAL

Page 3: Introduction to Bioinformatics From Pairwise to Multiple Alignment.

Scoring system for BLAST

Substitution Matrix + Gap Penalty

Page 4: Introduction to Bioinformatics From Pairwise to Multiple Alignment.

Substitution Matrix

• BLOSUM matrices are based on the replacement patterns found in the more highly conserved regions of the sequences, aligned without gaps

• PAM matrices are based on mutations observed throughout a global alignment, which includes both highly conserved and highly mutable regions

Page 5: Introduction to Bioinformatics From Pairwise to Multiple Alignment.

Gap penalty

• Example showed a score of -1 per indel
  – So gap cost is proportional to its length

• Biologically, indels occur in groups
  – We want our gap score to reflect this

• Standard solution: affine gap model
  – One-off cost for opening a gap
  – Lower cost for extending the gap
  – Changes required to the algorithm
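The change to the algorithm can be sketched as follows. This is a minimal illustration, assuming the usual three-table formulation of affine gaps (one table per "last column" state); the function name, the match/mismatch scores and the penalty values are placeholders chosen for the example, not values from the slides.

```python
NEG_INF = float("-inf")

def affine_alignment_score(a, b, match=1, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Best global alignment score when a gap of length k costs
    gap_open + (k - 1) * gap_extend (illustrative parameter values)."""
    n, m = len(a), len(b)
    # M: a[i-1] aligned to b[j-1]
    # X: a[i-1] aligned to a gap, Y: b[j-1] aligned to a gap
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + s
            # Opening a gap pays gap_open; continuing one pays only gap_extend.
            X[i][j] = max(M[i - 1][j] + gap_open,
                          Y[i - 1][j] + gap_open,
                          X[i - 1][j] + gap_extend)
            Y[i][j] = max(M[i][j - 1] + gap_open,
                          X[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend)
    return max(M[n][m], X[n][m], Y[n][m])
```

Setting gap_open equal to gap_extend reduces this to the linear model from the first bullet, where every indel costs the same.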

Page 6: Introduction to Bioinformatics From Pairwise to Multiple Alignment.

Statistical significance

Page 7: Introduction to Bioinformatics From Pairwise to Multiple Alignment.

E-value

• The number of hits (with the same or a better similarity score) one can "expect" to see just by chance when searching the given string in a database of a particular size.

• The higher the E-value, the less significant the similarity
  – “sequences with E-value of less than 0.01 are almost always found to be homologous”

• The lower bound is normally 0 (we want to find the best)

Page 8: Introduction to Bioinformatics From Pairwise to Multiple Alignment.

Expectation Values

• Increases linearly with length of query sequence

• Increases linearly with length of database

• Decreases exponentially with score of alignment
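These three dependencies match the Karlin-Altschul form E = K · m · n · e^(-λS), where m is the query length, n the database length and S the alignment score. A small sketch, assuming that formula; K and λ depend on the scoring system, and the default values below are illustrative placeholders, not values from the slides.

```python
import math

def expected_hits(query_len, db_len, score, k=0.041, lam=0.267):
    """E = K * m * n * exp(-lambda * S); k and lam are placeholder values."""
    return k * query_len * db_len * math.exp(-lam * score)

# Doubling the query (or database) length doubles E;
# raising the score shrinks E exponentially.
e_short = expected_hits(200, 1_000_000, 50)
e_long = expected_hits(400, 1_000_000, 50)
e_better = expected_hits(200, 1_000_000, 60)
```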

Page 9: Introduction to Bioinformatics From Pairwise to Multiple Alignment.

Remote homologues

• Sometimes BLAST isn’t enough.

• For a large protein family, BLAST may give only the close members, but we also want the more distant members

PSI-BLAST

Page 10: Introduction to Bioinformatics From Pairwise to Multiple Alignment.

PSI-BLAST

• Position Specific Iterated BLAST

Regular BLAST → Construct profile from BLAST results → BLAST profile search → Final results
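The "construct profile" and "profile search" steps can be illustrated with a toy position-specific scoring matrix. This is only a sketch under simplifying assumptions (hits already aligned to equal length, uniform background frequencies, a flat pseudocount); it is not PSI-BLAST's actual weighting scheme, and the function names are hypothetical.

```python
import math
from collections import Counter

def build_pssm(aligned_hits, alphabet, pseudocount=1.0):
    """One dict of log-odds scores per alignment column."""
    background = 1.0 / len(alphabet)  # assumption: uniform background
    pssm = []
    for column in zip(*aligned_hits):
        counts = Counter(c for c in column if c in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        pssm.append({
            r: math.log2(((counts[r] + pseudocount) / total) / background)
            for r in alphabet
        })
    return pssm

def score_with_pssm(pssm, seq):
    """Score an ungapped sequence of the same length as the profile."""
    return sum(col[res] for col, res in zip(pssm, seq))

hits = ["ACGT", "ACGA", "ACGT"]          # toy "BLAST results"
pssm = build_pssm(hits, alphabet="ACGT")
close = score_with_pssm(pssm, "ACGT")    # resembles the hits
far = score_with_pssm(pssm, "TTTT")      # does not
```

In each further iteration the profile would be rebuilt from the newly found hits and searched again.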

Page 11: Introduction to Bioinformatics From Pairwise to Multiple Alignment.

PSI-BLAST

• Advantage: PSI-BLAST looks for sequences that are close to ours, and learns from them to extend the circle of friends

• Disadvantage: if a WRONG sequence is picked up, the search drifts toward unrelated sequences, and this gets worse with each iteration

Page 12: Introduction to Bioinformatics From Pairwise to Multiple Alignment.

Multiple Sequence Alignment

MSA

Page 13: Introduction to Bioinformatics From Pairwise to Multiple Alignment.

VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWWSNG--

Like pairwise alignment, BUT compares n sequences instead of 2

Rows represent individual sequences

Columns represent the ‘same’ position

There may be gaps in some sequences

Page 14: Introduction to Bioinformatics From Pairwise to Multiple Alignment.

Why multiple alignments?

BLAST usually obtains many sequences that are significantly similar to the query sequence

Practically, comparing each and every sequence to every other may be impractical when the number of sequences is large

Solution: generating a profile

Page 15: Introduction to Bioinformatics From Pairwise to Multiple Alignment.

MSA

MSA can give you a better picture of functional sites on proteins and nucleic acids, as well as the forces that shape evolution!

VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGSSSNIGS--ITVNWYQQLPG
LRLSCTGSGFIFSS--YAMYWYQQAPG
LSLTCTGSGTSFDD-QYYSTWYQQPPG

• Important amino acids or nucleotides are not allowed to mutate

• Less important positions change more easily
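One simple way to see which positions are conserved is to compute, for every column, the fraction of sequences that share the most common residue. The sketch below applies this to the four sequences on this slide; the function name is hypothetical and counting gaps as non-matching is an assumption.

```python
from collections import Counter

def column_conservation(alignment):
    """Fraction of rows sharing the most common residue, per column."""
    n = len(alignment)
    result = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c != "-")  # gaps never count
        top = counts.most_common(1)[0][1] if counts else 0
        result.append(top / n)
    return result

msa = [
    "VTISCTGSSSNIGAG-NHVKWYQQLPG",
    "VTISCTGSSSNIGS--ITVNWYQQLPG",
    "LRLSCTGSGFIFSS--YAMYWYQQAPG",
    "LSLTCTGSGTSFDD-QYYSTWYQQPPG",
]
conservation = column_conservation(msa)
fully_conserved = [i for i, c in enumerate(conservation) if c == 1.0]
```

Columns listed in fully_conserved are candidates for the "important" positions; columns with low values are the ones that change easily.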

Page 16: Introduction to Bioinformatics From Pairwise to Multiple Alignment.

Alignment Example

Without gaps:

GTCGTAGTCGGCTCGAC
GTCTAGCGAGCGTGAT
GCGAAGAGGCGAGC
GCCGTCGCGTCGTAAC

Score = 1*1 + 2*0.75 + 11*0.5 = 8

With gaps:

GTCGTAGTCG-GC-TCGAC
GTC-TAG-CGAGCGT-GAT
GC-GAAG-AG-GCG-AG-C
GCCGTCG-CG-TCGTA-AC

Score = 4*1 + 11*0.75 + 2*0.5 = 13.25

Column score (fraction of identical residues): 4/4 = 1, 3/4 = 0.75, 2/4 = 0.5, 1/4 = 0
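The column scoring rule can be written out in a few lines. This is a sketch of the rule as stated on the slide (a column scores the fraction of rows sharing its most common residue, and a lone residue scores 0); treating gaps as never matching is an assumption, and the function name is hypothetical.

```python
from collections import Counter

def alignment_score(rows):
    """Sum of per-column scores: 4/4 -> 1, 3/4 -> 0.75, 2/4 -> 0.5, 1/4 -> 0."""
    n = len(rows)
    total = 0.0
    for column in zip(*rows):
        counts = Counter(c for c in column if c != "-")  # gaps never match
        top = counts.most_common(1)[0][1] if counts else 0
        if top >= 2:          # a residue seen only once contributes nothing
            total += top / n
    return total

gapped = [
    "GTCGTAGTCG-GC-TCGAC",
    "GTC-TAG-CGAGCGT-GAT",
    "GC-GAAG-AG-GCG-AG-C",
    "GCCGTCG-CG-TCGTA-AC",
]
gapped_score = alignment_score(gapped)  # 13.25, as on the slide
```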

Page 17: Introduction to Bioinformatics From Pairwise to Multiple Alignment.

Example of 3 sequences:

Page 18: Introduction to Bioinformatics From Pairwise to Multiple Alignment.

Dynamic Programming

• Pairwise A–B alignment table
  – Cell (i,j) = score of best alignment between first i elements of A and first j elements of B
  – Complexity: length of A × length of B

• 3-way A–B–C alignment table
  – Cell (i,j,k) = score of best alignment between first i elements of A, first j of B, first k of C
  – Complexity: length of A × length of B × length of C

• Example: protein family alignment
  – 100 proteins, 1000 amino acids each
  – Complexity: 10^300 table cells
  – Calculation time: beyond the big bang!
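A minimal sketch of the pairwise table and of the cell-count estimate, assuming a simple scoring scheme (match +1, mismatch -1, -1 per indel) that is chosen only for illustration; the function names are hypothetical.

```python
import math

def pairwise_score(a, b, match=1, mismatch=-1, gap=-1):
    """Fill the pairwise table; table[i][j] is the best score for
    aligning the first i elements of a with the first j elements of b."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        table[i][0] = i * gap
    for j in range(1, len(b) + 1):
        table[0][j] = j * gap
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            table[i][j] = max(table[i - 1][j - 1] + s,
                              table[i - 1][j] + gap,
                              table[i][j - 1] + gap)
    return table[len(a)][len(b)]

def table_cells(lengths):
    """Size of the n-dimensional table: the product of the sequence lengths."""
    return math.prod(lengths)

cells = table_cells([1000] * 100)  # 100 proteins of 1000 residues each
```

The same recurrence extends to three or more sequences, but the table gains one dimension per sequence, which is why the exact method is infeasible for a whole protein family.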

Page 19: Introduction to Bioinformatics From Pairwise to Multiple Alignment.

Feasible Approach

• Based on pairwise alignment scores
  – Build n by n table of pairwise scores

• Align similar sequences first
  – After alignment, consider as single sequence
  – Continue aligning with further sequences

Page 20: Introduction to Bioinformatics From Pairwise to Multiple Alignment.

– For n sequences, there are n(n-1)/2 pairs

GTCGTAGTCG-GC-TCGAC
GTC-TAG-CGAGCGT-GAT
GC-GAAG-AG-GCG-AG-C
GCCGTCG-CG-TCGTA-AC
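The n(n-1)/2 count is easy to check by enumerating the pairs. A short sketch using the standard library; the function name is hypothetical.

```python
from itertools import combinations

def all_pairs(names):
    """Every unordered pair of sequences that needs a pairwise alignment."""
    return list(combinations(names, 2))

pairs_of_four = all_pairs(["seq1", "seq2", "seq3", "seq4"])   # 4*3/2 pairs
pairs_of_hundred = all_pairs(range(100))                      # 100*99/2 pairs
```

The number of pairwise alignments therefore grows quadratically with the number of sequences, which stays manageable where the full multi-dimensional table does not.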

Page 21: Introduction to Bioinformatics From Pairwise to Multiple Alignment.

1 GTCGTAGTCG-GC-TCGAC
2 GTC-TAG-CGAGCGT-GAT
3 GC-GAAGAGGCG-AGC
4 GCCGTCGCGTCGTAAC

1 GTCGTA-GTCG-GC-TCGAC
2 GTC-TA-G-CGAGCGT-GAT
3 G-C-GAAGA-G-GCG-AG-C
4 G-CCGTCGC-G-TCGTAA-C

Page 22: Introduction to Bioinformatics From Pairwise to Multiple Alignment.

CLUSTAL method

• Higgins and Sharp 1988
  – ref: CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene, 73, 237–244.

An approximation strategy (heuristic algorithm) yields a possible alignment, but not necessarily the best one

Progressive Sequence Alignment

Page 23: Introduction to Bioinformatics From Pairwise to Multiple Alignment.

First step:

Compute the pairwise alignments for all against all; the similarities are stored in a table

     A    B    C    D
A
B   11
C    3    1
D    2    2   10

Page 24: Introduction to Bioinformatics From Pairwise to Multiple Alignment.

Second step:

Cluster the sequences to create a tree

[Figure: tree built from the pairwise similarity table, with leaves A, D, C, B]

• Represents the order in which pairs of sequences are to be aligned
• Similar sequences are neighbors in the tree
• Distant sequences are distant from each other in the tree
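The clustering step can be sketched with the similarity values from the table (A–B 11, C–D 10, and so on). The example below uses a simple average-linkage scheme chosen for illustration; it is an assumption, not necessarily the exact tree-building method of any CLUSTAL version, and the function name is hypothetical.

```python
def build_guide_tree(similarity):
    """similarity maps frozenset({x, y}) -> score for every pair of leaves.
    Returns a nested tuple giving the order in which to align."""
    def leaves(node):
        if isinstance(node, str):
            return [node]
        return leaves(node[0]) + leaves(node[1])

    def average_similarity(a, b):
        pairs = [(x, y) for x in leaves(a) for y in leaves(b)]
        return sum(similarity[frozenset(p)] for p in pairs) / len(pairs)

    clusters = sorted({name for pair in similarity for name in pair})
    while len(clusters) > 1:
        # Join the two most similar clusters first.
        best = max(
            ((a, b) for i, a in enumerate(clusters) for b in clusters[i + 1:]),
            key=lambda ab: average_similarity(*ab),
        )
        clusters = [c for c in clusters if c not in best] + [best]
    return clusters[0]

similarity = {
    frozenset("AB"): 11, frozenset("AC"): 3, frozenset("BC"): 1,
    frozenset("AD"): 2, frozenset("BD"): 2, frozenset("CD"): 10,
}
tree = build_guide_tree(similarity)
```

With these values A and B are joined first, then C and D, and finally the two pairs, so the tree fixes the alignment order used in the next step.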

Page 25: Introduction to Bioinformatics From Pairwise to Multiple Alignment.

Join alignments

N Y L S + N K Y L S → N K/- Y L S

N F S + N F L S → N F L/- S

N K/- Y L S + N F L/- S → N K/- Y/F L/- S
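Joining two existing alignments can be sketched as a dynamic program over whole columns. This is a simplified illustration, assuming an average pairwise column score and a linear gap cost (match +1, mismatch or residue-versus-gap -1); the function names and scores are placeholders, not ClustalW's actual scheme.

```python
def column_score(col_a, col_b, match=1, mismatch=-1, gap=-1):
    """Average score over all residue pairs taken from two columns."""
    total = 0
    for x in col_a:
        for y in col_b:
            if x == "-" or y == "-":
                total += 0 if x == y else gap
            else:
                total += match if x == y else mismatch
    return total / (len(col_a) * len(col_b))

def align_profiles(p1, p2, gap=-1):
    """Align two alignments column by column; existing columns stay intact."""
    cols1, cols2 = list(zip(*p1)), list(zip(*p2))
    n, m = len(cols1), len(cols2)
    gap1, gap2 = ("-",) * len(p1), ("-",) * len(p2)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + column_score(cols1[i - 1], cols2[j - 1]),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
    # Trace back from the bottom-right cell to recover the joined columns.
    out, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == (
            score[i - 1][j - 1] + column_score(cols1[i - 1], cols2[j - 1])
        ):
            out.append(cols1[i - 1] + cols2[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out.append(cols1[i - 1] + gap2)
            i -= 1
        else:
            out.append(gap1 + cols2[j - 1])
            j -= 1
    out.reverse()
    return ["".join(row) for row in zip(*out)]

joined = align_profiles(["N-YLS", "NKYLS"], ["NF-S", "NFLS"])
```

For the slide's example this yields the rows N-YLS, NKYLS, N-F-S and N-FLS, which is the joined alignment N K/- Y/F L/- S.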

Page 26: Introduction to Bioinformatics From Pairwise to Multiple Alignment.

Treating Gaps in ClustalW

• Penalty for opening a gap and an additional penalty for extending it

• Gaps found in the initial alignment remain fixed

• New gaps are introduced as more sequences are added (decreased penalty if a gap already exists)

• Gap penalties are decreased within stretches of hydrophilic residues

Page 27: Introduction to Bioinformatics From Pairwise to Multiple Alignment.

MSA Approaches

• Progressive approach
  – CLUSTALW (CLUSTALX) http://www.ebi.ac.uk/clustalw/
  – PILEUP
  – T-COFFEE

• Iterative approach: repeatedly realign subsets of sequences
  – MultAlin, DiAlign

• Statistical methods: Hidden Markov Models
  – SAM2K

• Genetic algorithm
  – SAGA