Pairwise sequence comparison

Pairwise sequence comparison Prof. William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University of Washington [email protected]


Page 1: Pairwise sequence comparison

Pairwise sequence comparison

Prof. William Stafford Noble
Department of Genome Sciences
Department of Computer Science and Engineering
University of Washington
[email protected]

Page 2: Pairwise sequence comparison

One-minute responses

• Need more clarification on the alignment of sequences.
• More precise definitions, included on slides.
• Python programming very useful.
• A lot to learn at once.
• I did not get the Python part, but it’s easy when I try it myself.
• Please write on the board sometimes.
• I liked the part about evolutionary theory.
• I understood about 50% of the lecture.
• You speak too fast.
• I did not understand Moore’s law.
• I liked the biology video and the analogy with Moore’s law.
• The Python code is different from what we are used to.
• Give a practical example of BLAST usage.
• More explanation on sys.argv.
• I like the summary of the previous lecture.
• Sometimes you talk a long time before asking for questions.
• Biology of cells and genomes wasn’t clear enough.
• Take time explaining the biology concepts.
• The last part of the lecture was not clear.
• Give more examples.
• The class could go a bit slower on difficult concepts.
• It is a good habit to accept comments.
• We may forget what you are saying because we are not taking notes.
• I didn’t understand the mechanism by which we can physically read the DNA sequence.
  – This topic is outside the scope of this class to cover. If you’d like to read about this, check out “Overview of DNA sequencing strategies.”

Page 3: Pairwise sequence comparison
Page 4: Pairwise sequence comparison
Page 5: Pairwise sequence comparison

Other questions

• How many assignments will we do?

• Will there be a test?

• Is there a lot of mathematics here, or just Python and biology?

• Is a theoretical understanding of genetics and biochemistry vital for studying bioinformatics?

• Are we going to write this feedback every day?

• How can I make a dictionary from a .txt file?

• When you compare two protein sequences, how do you guess where the first codon begins?

• What is evolution? Is it dogma?

Page 6: Pairwise sequence comparison

Outline

• Responses from last class
• Sequence alignment
  – Motivation
  – Scoring alignments
• Python

Page 7: Pairwise sequence comparison

Revision

• What is the difference between the DNA and RNA alphabets?
  – In RNA, “T” is changed to “U.”
• What is a codon, and why is it significant?
  – A set of three adjacent RNA nucleotides. One codon codes for a single amino acid.
• What is the universal genetic code?
  – A set of rules for translating codons into amino acids.
• What is the purpose of aligning two DNA or protein sequences?
  – To infer common ancestry, function or structure.

Page 8: Pairwise sequence comparison

Sequence comparison overview

• Problem: Find the “best” alignment between a query sequence and a target sequence.
• To solve this problem, we need
  – a method for scoring alignments, and
  – an algorithm for finding the alignment with the best score.
• The alignment score is calculated using
  – a substitution matrix, and
  – gap penalties.
• The algorithm for finding the best alignment is dynamic programming.

Page 9: Pairwise sequence comparison

A simple alignment problem.

• Problem: find the best pairwise alignment of GAATC and CATAC.

Page 10: Pairwise sequence comparison

Scoring alignments

• We need a way to measure the quality of a candidate alignment.

• Alignment scores consist of two parts: a substitution matrix, and a gap penalty.

GAATC
CATAC

GAATC-
CA-TAC

GAAT-C
C-ATAC

GAAT-C
CA-TAC

-GAAT-C
C-A-TAC

GA-ATC
CATA-C

Page 11: Pairwise sequence comparison

Scoring aligned bases

A hypothetical substitution matrix:

     A    C    G    T
A   10   -5    0   -5
C   -5   10   -5    0
G    0   -5   10   -5
T   -5    0   -5   10

GAATC
 |  |
CATAC

-5 + 10 + -5 + -5 + 10 = 5

Page 12: Pairwise sequence comparison

   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1  0
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1  0 -1
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  3  0 -1
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4  1 -1
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0  3 -1
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1  4 -1
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -2 -1
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0  0 -1
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3 -3 -3 -1
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1 -4 -3 -1
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2  0  1 -1
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1 -3 -1 -1
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1 -3 -3 -1
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2 -2 -1 -2
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2  0  0  0
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0 -1 -1  0
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3 -4 -3 -2
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1 -3 -2 -1
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4 -3 -2 -1
B -2 -1  3  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3  4  1 -1
Z -1  0  0  1 -3  3  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -3 -2 -2  1  4 -1
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1 -1 -1

BLOSUM62

Page 13: Pairwise sequence comparison

Scoring gaps

• Linear gap penalty: every gap character receives a score of d.

• Affine gap penalty: opening a gap receives a score of d; extending a gap receives a score of e.

Linear gap penalty, d=-4:

GAAT-C
CA-TAC

-5 + 10 + -4 + 10 + -4 + 10 = 17

Affine gap penalty, d=-4, e=-1:

G--AATC
CATA--C

-5 + -4 + -1 + 10 + -4 + -1 + 10 = 5
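The two worked scores above can be reproduced with a short Python sketch. The names `SUBST`, `score_linear`, and `score_affine` are made up for this illustration and do not come from the slides. The affine version assumes that a gap in one sequence followed directly by a gap in the other sequence counts as opening a new gap; the slides do not say how that case should be handled.

```python
# Hypothetical substitution matrix from the slide, stored as a nested dict
# so that SUBST["G"]["C"] gives the score for aligning G with C.
BASES = "ACGT"
ROWS = [
    [10, -5, 0, -5],
    [-5, 10, -5, 0],
    [0, -5, 10, -5],
    [-5, 0, -5, 10],
]
SUBST = {a: dict(zip(BASES, row)) for a, row in zip(BASES, ROWS)}


def score_linear(top, bottom, subst, d):
    """Score an alignment in which every gap character receives a score of d."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += d
        else:
            total += subst[a][b]
    return total


def score_affine(top, bottom, subst, d, e):
    """Score an alignment with gap-open score d and gap-extend score e."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    prev_gap = None  # which sequence had a gap in the previous column, if any
    for a, b in zip(top, bottom):
        if a == "-":
            gap = "top"
        elif b == "-":
            gap = "bottom"
        else:
            gap = None
        if gap is None:
            total += subst[a][b]
        elif gap == prev_gap:
            total += e  # the gap continues in the same sequence
        else:
            total += d  # a new gap opens (assumption: also after a gap in the other sequence)
        prev_gap = gap
    return total


print(score_linear("GAATC", "CATAC", SUBST, -4))          # 5
print(score_linear("GAAT-C", "CA-TAC", SUBST, -4))        # 17
print(score_affine("G--AATC", "CATA--C", SUBST, -4, -1))  # 5
```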

Page 14: Pairwise sequence comparison

A simple alignment problem.

• Problem: find the best pairwise alignment of GAATC and CATAC.

• Use a linear gap penalty of -4.

• Use the following substitution matrix:

     A    C    G    T
A   10   -5    0   -5
C   -5   10   -5    0
G    0   -5   10   -5
T   -5    0   -5   10

Page 15: Pairwise sequence comparison

How many possibilities?

• How many different alignments of two sequences of length n exist?

GAATC
CATAC

GAATC-
CA-TAC

GAAT-C
C-ATAC

GAAT-C
CA-TAC

-GAAT-C
C-A-TAC

GA-ATC
CATA-C

Too many to enumerate!
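To get a feel for "too many", here is a minimal Python sketch that counts alignments under one particular convention, which is an assumption of this example and not stated on the slide: each column of an alignment is either a pair of characters, a gap in the first sequence, or a gap in the second sequence, and two alignments that differ only in the order of adjacent gaps are counted separately. The function name `count_alignments` is made up for illustration.

```python
def count_alignments(n, m):
    """Count alignments of sequences of length n and m, column by column.

    counts[i][j] is the number of ways to align the first i characters of
    one sequence with the first j characters of the other. The last column
    is either a pair, a gap in the first sequence, or a gap in the second.
    """
    counts = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            counts[i][j] = (
                counts[i - 1][j - 1]  # last column pairs two characters
                + counts[i - 1][j]    # last column has a gap in the second sequence
                + counts[i][j - 1]    # last column has a gap in the first sequence
            )
    return counts[n][m]


for n in (1, 2, 5, 10):
    print(n, count_alignments(n, n))
```

Under this convention, two sequences of length 5 such as GAATC and CATAC already have 1683 alignments, and the count grows exponentially with n.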

Page 16: Pairwise sequence comparison

DP matrix

           G    A    A    T    C

C
A
T         -8
A
C

The value in position (i,j) is the score of the best alignment of the first i positions of the first sequence versus the first j positions of the second sequence.

-G-
CAT

Page 17: Pairwise sequence comparison

DP matrix

           G    A    A    T    C

C
A
T         -8  -12
A
C

Moving horizontally in the matrix introduces a gap in the sequence along the left edge.

-G-A
CAT-

Page 18: Pairwise sequence comparison

DP matrix

           G    A    A    T    C

C
A
T         -8
A        -12
C

Moving vertically in the matrix introduces a gap in the sequence along the top edge.

-G--
CATA

Page 19: Pairwise sequence comparison

Initialization

G A A T C

0

C

A

T

A

C

Page 20: Pairwise sequence comparison

Introducing a gap

G A A T C

0 -4

C

A

T

A

C

G
-

Page 21: Pairwise sequence comparison

DP matrix

G A A T C

0 -4

C -4

A

T

A

C

-
C

Page 22: Pairwise sequence comparison

DP matrix

G A A T C

0 -4

C -4 -8

A

T

A

C

Page 23: Pairwise sequence comparison

DP matrix

G A A T C

0 -4

C -4 -5

A

T

A

C

G
C

Page 24: Pairwise sequence comparison

Three legal moves

• A diagonal move aligns a character from the left sequence with a character from the top sequence.

• A vertical move introduces a gap in the sequence along the top edge.

• A horizontal move introduces a gap in the sequence along the left edge.

Page 25: Pairwise sequence comparison

DP matrix

G A A T C

0 -4 -8 -12 -16 -20

C -4 -5

A -8

T -12

A -16

C -20

-----
CATAC

Page 26: Pairwise sequence comparison

DP matrix

G A A T C

0 -4 -8 -12 -16 -20

C -4 -5

A -8 ?

T -12

A -16

C -20

Page 27: Pairwise sequence comparison

DP matrix

G A A T C

0 -4 -8 -12 -16 -20

C -4 -5

A -8 -4

T -12

A -16

C -20

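Three candidate scores for the cell in row A, column G:

Diagonal move: -4 + 0 = -4

-G
CA

Vertical move: -5 + -4 = -9

G-
CA

Horizontal move: -8 + -4 = -12

--G
CA-

The largest of the three, -4, is the value of the cell.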

Page 28: Pairwise sequence comparison

DP matrix

G A A T C

0 -4 -8 -12 -16 -20

C -4 -5

A -8 -4

T -12 ?

A -16 ?

C -20 ?

Page 29: Pairwise sequence comparison

DP matrix

G A A T C

0 -4 -8 -12 -16 -20

C -4 -5

A -8 -4

T -12 -8

A -16 -12

C -20 -16

Page 30: Pairwise sequence comparison

DP matrix

G A A T C

0 -4 -8 -12 -16 -20

C -4 -5 ?

A -8 -4 ?

T -12 -8 ?

A -16 -12 ?

C -20 -16 ?

Page 31: Pairwise sequence comparison

DP matrix

G A A T C

0 -4 -8 -12 -16 -20

C -4 -5 -9

A -8 -4 5

T -12 -8 1

A -16 -12 2

C -20 -16 -2

Page 32: Pairwise sequence comparison

DP matrix

G A A T C

0 -4 -8 -12 -16 -20

C -4 -5 -9

A -8 -4 5

T -12 -8 1

A -16 -12 2

C -20 -16 -2 ?

Find the optimal alignment, and its score.

Page 33: Pairwise sequence comparison

DP matrix

          G    A    A    T    C
     0   -4   -8  -12  -16  -20
C   -4   -5   -9  -13  -12   -6
A   -8   -4    5    1   -3   -7
T  -12   -8    1    0   11    7
A  -16  -12    2   11    7    6
C  -20  -16   -2    7   11   17

Page 34: Pairwise sequence comparison

DP in equation form

• Align sequence x and y.
• F is the DP matrix; s is the substitution matrix; d is the linear gap penalty.

\[
F(0,0) = 0
\]

\[
F(i,j) = \max
\begin{cases}
F(i-1,j-1) + s(x_i, y_j) \\
F(i-1,j) + d \\
F(i,j-1) + d
\end{cases}
\]

Page 35: Pairwise sequence comparison

DP in equation form

F(i-1, j-1)             F(i, j-1)
        \                   |
     s(x_i, y_j)            d
          \                 |
F(i-1, j)   ---- d ---->  F(i, j)
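The recurrence can be turned into a short Python sketch. This is one possible implementation, not code from the lecture: the function name `align`, the nested-dict matrix `SUBST`, and the traceback step (walking back from the last cell to recover an alignment) are assumptions of this example. This global alignment procedure is commonly known as the Needleman-Wunsch algorithm. Note that the list `F` is indexed as `F[i][j]` with `i` running over x, so it is the transpose of the matrix drawn on the slides, where x runs along the top edge.

```python
# Hypothetical substitution matrix from the slides as a nested dict.
BASES = "ACGT"
ROWS = [
    [10, -5, 0, -5],
    [-5, 10, -5, 0],
    [0, -5, 10, -5],
    [-5, 0, -5, 10],
]
SUBST = {a: dict(zip(BASES, row)) for a, row in zip(BASES, ROWS)}


def align(x, y, subst, d):
    """Globally align x and y with linear gap score d.

    Returns (score, aligned_x, aligned_y).
    """
    n, m = len(x), len(y)
    # F[i][j] = best score for the first i characters of x
    # versus the first j characters of y.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * d  # x[:i] aligned entirely against gaps
    for j in range(1, m + 1):
        F[0][j] = j * d  # y[:j] aligned entirely against gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + subst[x[i - 1]][y[j - 1]],  # align x_i with y_j
                F[i - 1][j] + d,                              # x_i against a gap
                F[i][j - 1] + d,                              # y_j against a gap
            )
    # Traceback: walk from F[n][m] back to F[0][0], re-deriving each move.
    # Ties are broken in favor of the diagonal move, then a gap in y.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + subst[x[i - 1]][y[j - 1]]:
            ax.append(x[i - 1])
            ay.append(y[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + d:
            ax.append(x[i - 1])
            ay.append("-")
            i -= 1
        else:
            ax.append("-")
            ay.append(y[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


score, top, bottom = align("GAATC", "CATAC", SUBST, -4)
print(score)   # 17
print(top)     # GA-ATC
print(bottom)  # CATA-C
```

Several alignments can share the best score. With this tie-breaking rule the sketch returns GA-ATC over CATA-C, which scores 17, the same as GAAT-C over CA-TAC.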

Page 36: Pairwise sequence comparison

Summary

• Scoring a pairwise alignment requires a substitution matrix and gap penalties.

• Dynamic programming is an efficient algorithm for finding the optimal alignment.

• Entry (i,j) in the DP matrix stores the score of the best-scoring alignment up to those positions.

• DP iteratively fills in the matrix using a simple mathematical rule.

Page 37: Pairwise sequence comparison

A simple example

Find the optimal alignment of AAG and AGC. Use a gap penalty of d=-5.

Substitution matrix:

     A    C    G    T
A    2   -7   -5   -7
C   -7    2   -7   -5
G   -5   -7    2   -7
T   -7   -5   -7    2

DP matrix to fill in:

          A    A    G

A
G
C

Each entry F(i,j) is the maximum of F(i-1,j-1) + s(x_i, y_j), F(i-1,j) + d, and F(i,j-1) + d.

Page 38: Pairwise sequence comparison

Some useful Python tidbits

Page 39: Pairwise sequence comparison

sys.argv

• sys.argv is a list containing the strings given on the command line

• Write a program that adds up all of the numbers on the command line.

> add-numbers.py 1 2 3
6
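One possible solution, as a minimal sketch: the helper name `add_numbers` is made up for this example, and treating every argument as a float is an assumption, since the slide does not say whether only integers are expected.

```python
import sys


def add_numbers(args):
    """Return the sum of a list of numeric strings."""
    return sum(float(arg) for arg in args)


if __name__ == "__main__":
    # sys.argv[0] is the name of the program itself,
    # so the numbers start at index 1.
    print("%g" % add_numbers(sys.argv[1:]))
```

The arguments arrive as strings, which is why each one is converted with float() before it is added.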

Page 40: Pairwise sequence comparison

sys.stdout versus sys.stderr

• sys.stdout and sys.stderr are two different streams to print to
  – Use sys.stdout for the primary output of your program.
  – Use sys.stderr to report errors or give status updates.

Page 41: Pairwise sequence comparison

write() versus print()

• The write() function differs from print()
  – write() does not automatically add an end-of-line.
  – write() requires that you specify the name of the file or stream to be written to.
  – write() only accepts a single string.
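The differences can be seen side by side in a small sketch. It assumes Python 3, where print() is a function; the helper `format_line` is a made-up name used only to build the single string that write() needs.

```python
import sys


def format_line(label, value):
    """Build the single string, including the end-of-line, that write() needs."""
    return label + " " + str(value) + "\n"


# print() adds the end-of-line itself and accepts several values of any type.
print("score", 17)

# write() needs a stream, exactly one string, and an explicit "\n".
sys.stdout.write(format_line("score", 17))

# Status updates go to sys.stderr so that they do not mix with the results.
sys.stderr.write("Finished scoring.\n")
```

If the program is run with its standard output redirected to a file, the two "score 17" lines end up in the file while the status message still appears on the screen.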

Page 42: Pairwise sequence comparison

%

• The “%” operator substitutes values into a string based on the presence of format strings.
  – %s = string
  – %d = integer
  – %g = float

• Place “%” between a string and a tuple of values.

>>> "%s %s %s" % ("larry", "curly", "moe")
'larry curly moe'
>>> "%d + %d" % (21, 15)
'21 + 15'

Page 43: Pairwise sequence comparison

Using sys.stderr

• Write a program to divide one number by a second number.

> divide.py 8 3
2.66667

• Print the usage message if exactly two numbers are not given.

> ./divide.py
USAGE: divide.py <value1> <value2>

Divide <value1> by <value2> and report the result.

• Stop and print an error if the second number is zero.

> divide.py 8 0
Divide by zero error.
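A minimal sketch of one way to write this program follows. The structure (a `main` function that takes the argument list and returns a status code) is an assumption of this example and not something the slide prescribes. The result goes to sys.stdout; the usage message and the error go to sys.stderr.

```python
import sys

USAGE = """USAGE: divide.py <value1> <value2>

Divide <value1> by <value2> and report the result.
"""


def main(argv):
    """Divide argv[1] by argv[2]; return 0 on success and 1 on any error."""
    # argv[0] is the program name, so two numbers mean three entries.
    if len(argv) != 3:
        sys.stderr.write(USAGE)
        return 1
    value1 = float(argv[1])
    value2 = float(argv[2])
    if value2 == 0:
        sys.stderr.write("Divide by zero error.\n")
        return 1
    # %g prints up to six significant digits, e.g. 2.66667 for 8 / 3.
    sys.stdout.write("%g\n" % (value1 / value2))
    return 0


if __name__ == "__main__":
    main(sys.argv)
```

Returning early from main is how this sketch "stops". The returned status could also be passed to sys.exit() so that the shell can see whether the program failed.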

Page 44: Pairwise sequence comparison

Using write() and %

• Write a program that prints the command line arguments with “+” signs between, and then reports their sum.

> add-numbers2.py 1 2 3
1 + 2 + 3 = 6
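One possible solution combines write() with the "%" operator. The helper name `format_sum` is made up for this sketch, and converting the arguments with float() is an assumption.

```python
import sys


def format_sum(args):
    """Return a string such as '1 + 2 + 3 = 6' for a list of numeric strings."""
    total = sum(float(arg) for arg in args)
    # " + ".join(args) places a plus sign between the original argument strings.
    return "%s = %g" % (" + ".join(args), total)


if __name__ == "__main__":
    # write() does not add the end-of-line, so it is appended explicitly.
    sys.stdout.write(format_sum(sys.argv[1:]) + "\n")
```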

Page 45: Pairwise sequence comparison

One-minute response

At the end of each class

• Write for about one minute.
• Provide feedback about the class.
• Was part of the lecture unclear?
• What did you like about the class?
• Do you have unanswered questions?
• Sign your name

I will begin the next class by responding to the one-minute responses.