Algorithmic Approaches for Biological Data, Lecture #20 Katherine St. John City University of New York American Museum of Natural History 20 April 2016

Algorithmic Approaches for Biological Data, Lecture #20

Katherine St. John

City University of New York
American Museum of Natural History

20 April 2016

Outline

Aligning with Gaps and Substitution Matrices

Global versus Local Alignment

Searching Graphs: Breadth First & Depth First

K. St. John (CUNY & AMNH) Algorithms #20 20 April 2016 2 / 16

Pairwise Sequence Alignment

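        A   G   A   G
    0  -1  -2  -3  -4
A  -1   1
G  -2
G  -3

Pictorially: each cell is filled in from its diagonal, left, and upper neighbors.

As equations:

\[
s(i, j) = \max
\begin{cases}
\sigma(i, j) + s(i - 1, j - 1) \\
-\delta + s(i, j - 1) \\
-\delta + s(i - 1, j)
\end{cases}
\]

where s(i, j) is the best score for aligning the first i characters of one sequence with the first j characters of the other, δ is the gap penalty, and σ(i, j) scores a match or mismatch.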

Aligning with Gaps and Substitution Matrices

The basic dynamic programming format can be adjusted for different gap and substitution models, where:

δ: the gap penalty

σ: scores matches/mismatches.
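A minimal Python sketch of the basic dynamic programming table, with δ and σ passed in as parameters so either can be swapped out. The function names (`global_alignment_scores`, `unit_sigma`) and the +1/-1 scoring are assumptions made for illustration, not code from the lecture.

```python
def global_alignment_scores(x, y, sigma, delta):
    """Fill the global-alignment table s for x (rows) and y (columns)."""
    rows, cols = len(x) + 1, len(y) + 1
    s = [[0] * cols for _ in range(rows)]
    # Border cells: aligning a prefix against nothing costs one gap per character.
    for i in range(1, rows):
        s[i][0] = -delta * i
    for j in range(1, cols):
        s[0][j] = -delta * j
    for i in range(1, rows):
        for j in range(1, cols):
            s[i][j] = max(
                sigma(x[i - 1], y[j - 1]) + s[i - 1][j - 1],  # match or mismatch
                -delta + s[i][j - 1],                          # gap, coming from the left
                -delta + s[i - 1][j],                          # gap, coming from above
            )
    return s


def unit_sigma(a, b):
    """Assumed scoring: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


table = global_alignment_scores("AGG", "AGAG", unit_sigma, delta=1)
for row in table:
    print(row)
```

With these assumed parameters the first row and column come out as 0, -1, -2, ..., matching the partially filled table on the earlier slide. Changing `delta` or passing a different `sigma` changes the model without touching the loop.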

Gaps Are Treated Equally

        A   G   A   G
    0  -1  -2  -3  -4
A  -1   1
G  -2
G  -3

Commonly use an affine gap penalty function:

  - h: penalty associated with opening a gap
  - g: (smaller) penalty associated with extending the gap.

To implement this efficiently, use 2 additional matrices that keep track of the gaps (one for each sequence).
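The three-matrix idea can be sketched as follows. The slide's exact penalty function is not reproduced in this transcript, so the sketch assumes a gap of length k costs h + g·k; the names `affine_global_score`, `M`, `X`, `Y`, and `unit_sigma` are hypothetical.

```python
NEG_INF = float("-inf")


def affine_global_score(x, y, sigma, h, g):
    """Best global alignment score, ASSUMING a gap of length k costs h + g*k."""
    n, m = len(x), len(y)
    # M: x[i-1] aligned to y[j-1]; X: x[i-1] aligned to a gap; Y: y[j-1] aligned to a gap.
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = -(h + g * i)  # one leading gap of length i
    for j in range(1, m + 1):
        Y[0][j] = -(h + g * j)  # one leading gap of length j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = sigma(x[i - 1], y[j - 1]) + max(
                M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]
            )
            # Opening a gap pays h + g; extending an existing one pays only g.
            X[i][j] = max(M[i - 1][j] - h - g, X[i - 1][j] - g, Y[i - 1][j] - h - g)
            Y[i][j] = max(M[i][j - 1] - h - g, Y[i][j - 1] - g, X[i][j - 1] - h - g)
    return max(M[n][m], X[n][m], Y[n][m])


def unit_sigma(a, b):
    """Assumed scoring: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


print(affine_global_score("AAGG", "AATTGG", unit_sigma, h=2, g=1))
```

In the example, the best alignment places a single gap of length 2 opposite TT: four matches (+4) minus one gap (2 + 2·1 = 4). Splitting the same two gap characters into two separate gaps would cost 6, which is why the affine model prefers one long gap.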

Affine Gap

Burr Settles, U Wisconsin, 2008

Using Substitution Matrices

        A   C   G   T
A       1  -1  -1  -1
C      -1   1  -1  -1
G      -1  -1   1  -1
T      -1  -1  -1   1

Can view σ(i, j) as a substitution matrix.

Substitution matrices are commonly used for protein sequences.

PAM = Percent Accepted Mutation

  - Dayhoff et al., 1978
  - Used for closely related protein sequences
  - Based on global alignment

BLOSUM = Blocks Substitution Matrix

  - Henikoff & Henikoff, 1992
  - Used for more divergent sequences
  - Based on local alignment
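One way to treat σ as a matrix lookup is a nested dictionary. The sketch below encodes only the 4×4 DNA matrix shown above; the names `DNA_MATRIX`, `sigma`, and `ungapped_score` are assumptions for illustration. A protein matrix such as a PAM or BLOSUM table would have the same shape with 20 rows and columns, but no such table is included here.

```python
ALPHABET = "ACGT"

# The 4x4 matrix from the slide: +1 on the diagonal, -1 everywhere else.
DNA_MATRIX = {a: {b: (1 if a == b else -1) for b in ALPHABET} for a in ALPHABET}


def sigma(a, b, matrix=DNA_MATRIX):
    """Look up the score for aligning character a with character b."""
    return matrix[a][b]


def ungapped_score(x, y, matrix=DNA_MATRIX):
    """Score two equal-length sequences position by position (no gaps)."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return sum(sigma(a, b, matrix) for a, b in zip(x, y))


print(sigma("A", "A"), sigma("A", "C"))
print(ungapped_score("AGAG", "AGGG"))
```

Because the alignment recurrence only ever calls σ on a pair of characters, replacing the dictionary is enough to change the substitution model.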

Global versus Local Alignment

Paul Reiners, IBM, 2008

Global: Needleman & Wunsch, 1970.

Local: Smith & Waterman, 1981.

Instead of looking for the global best score, look for the best score for subsequences of the initial sequences.

Examples:

  - finding motifs (conserved patterns) across sequences,
  - comparing sequences against longer sequences (e.g. BLAST search).

Smith-Waterman Algorithm

Paul Reiners, IBM, 2008

The equation is slightly different:

\[
s(i, j) = \max
\begin{cases}
\sigma(i, j) + s(i - 1, j - 1) \\
-\delta + s(i, j - 1) \\
-\delta + s(i - 1, j) \\
0
\end{cases}
\]

Initialize: first row and first column set to 0’s

Traceback: start from the maximum value of s(i, j) anywhere in the matrix, and stop when we get to a cell with 0.
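The recurrence, initialization, and traceback can be sketched in Python as below. The function name `smith_waterman`, the +1/-1 `unit_sigma`, and the choice to report only the first maximum found are assumptions of this sketch, not part of the lecture.

```python
def smith_waterman(x, y, sigma, delta):
    """Return (best score, aligned piece of x, aligned piece of y, table)."""
    rows, cols = len(x) + 1, len(y) + 1
    s = [[0] * cols for _ in range(rows)]  # first row and column stay 0
    best, best_cell = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s[i][j] = max(
                sigma(x[i - 1], y[j - 1]) + s[i - 1][j - 1],
                -delta + s[i][j - 1],
                -delta + s[i - 1][j],
                0,  # a local alignment may start fresh at any cell
            )
            if s[i][j] > best:
                best, best_cell = s[i][j], (i, j)
    # Traceback: start at the maximum and stop at the first cell holding 0.
    i, j = best_cell
    top, bottom = [], []
    while s[i][j] > 0:
        if s[i][j] == sigma(x[i - 1], y[j - 1]) + s[i - 1][j - 1]:
            top.append(x[i - 1]); bottom.append(y[j - 1]); i -= 1; j -= 1
        elif s[i][j] == -delta + s[i - 1][j]:
            top.append(x[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom)), s


def unit_sigma(a, b):
    """Assumed scoring: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


score, top, bottom, table = smith_waterman("TTAAG", "AAGA", unit_sigma, delta=2)
print(score, top, bottom)  # 3 AAG AAG
```

When several cells tie for the maximum, this sketch follows only the first one it encounters; listing every best local alignment would mean tracing back from each tied cell.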

In Pairs: Local Alignment

        A   A   G   A

T
T
A
A
G

Use σ from Monday, but δ = 2.

What are the best local alignments?

In Pairs: Local Alignment

        A   A   G   A
    0   0   0   0   0
T   0
T   0
A   0
A   0
G   0

Use σ from Monday, but δ = 2.

What are the best local alignments?

In Pairs: Local Alignment

        A   A   G   A
    0   0   0   0   0
T   0   0   0   0   0
T   0   0   0   0   0
A   0   1   1   0   1
A   0   1   2   0   1
G   0   0   0   3   1

Use σ from Monday, but δ = 2.

What are the best local alignments?
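As a check on the completed table, here is the largest entry worked out by hand. This assumes that "σ from Monday" is the +1 match / -1 mismatch scoring, which the filled-in values are consistent with. With rows indexed by i (T, T, A, A, G) and columns by j (A, A, G, A), the cell in the last row under G is:

\[
s(5, 3) = \max\{\, \sigma(5, 3) + s(4, 2),\; -\delta + s(5, 2),\; -\delta + s(4, 3),\; 0 \,\}
        = \max\{\, 1 + 2,\; -2 + 0,\; -2 + 0,\; 0 \,\} = 3
\]

Tracing back diagonally from this 3 through 2 and 1 to a 0 gives AAG aligned with AAG, a local alignment of score 3.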

In Pairs: Searching Graphs

Bastert et al., 2002

Develop a strategy to visit every node of the graph (i.e. what data structures are needed?)

The bookkeeping is important.

In Pairs: Searching Graphs

Bastert et al., 2002

Two common strategies:

  - Breadth First Search (BFS): visit all the neighbors, then visit all the neighbors’ neighbors, etc.
  - Depth First Search (DFS): for each neighbor, visit its neighbors, and continue as far down as possible.

Bookkeeping is important:

  - Keep a “To Do” list of nodes still to visit (a queue for BFS, a stack for DFS; more generally, a priority queue).
  - Mark nodes as you visit them, so you know not to visit again.
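Both strategies can be sketched with the same bookkeeping and a different "To Do" structure. The adjacency-list `graph` below is a made-up example (it is not the graph from the slide), and the function names are assumptions.

```python
from collections import deque


def bfs(graph, start):
    """Breadth first: the To Do list is a queue (first in, first out)."""
    visited = {start}
    todo = deque([start])
    order = []
    while todo:
        node = todo.popleft()
        order.append(node)
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.add(neighbor)  # mark when queued, so it is queued only once
                todo.append(neighbor)
    return order


def dfs(graph, start):
    """Depth first: the To Do list is a stack (last in, first out)."""
    visited = set()
    todo = [start]
    order = []
    while todo:
        node = todo.pop()
        if node in visited:
            continue  # already handled via another path
        visited.add(node)
        order.append(node)
        # Push in reverse so the first-listed neighbor is explored first.
        for neighbor in reversed(graph[node]):
            if neighbor not in visited:
                todo.append(neighbor)
    return order


# Hypothetical undirected graph, stored as adjacency lists.
graph = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
}

print(bfs(graph, "a"))  # ['a', 'b', 'c', 'd', 'e']
print(dfs(graph, "a"))  # ['a', 'b', 'd', 'c', 'e']
```

The only structural difference is which end of the To Do list the next node is taken from; the visited set is what keeps either search from looping on cycles such as a-b-d-c.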

Recap

Dynamic Programming: will do local & global alignments in lab today.

More on searching graphs on Monday.

Email lab reports to [email protected].

Challenges available at rosalind.info.
