Sequence Alignments Chi-Cheng Lin, Ph.D. Associate Professor Department of Computer Science Winona...

Post on 21-Dec-2015


Sequence Alignments

Chi-Cheng Lin, Ph.D.
Associate Professor

Department of Computer Science
Winona State University – Rochester Center

clin@winona.edu

Intro to Bioinformatics – Sequence Alignment 2

Sequence Alignments

Cornerstone of bioinformatics

What is a sequence?

• Nucleotide sequence
• Amino acid sequence

Pairwise and multiple sequence alignments

What alignments can help with

• Determining the function of a newly discovered gene sequence
• Determining evolutionary relationships among genes, proteins, and species
• Predicting the structure and function of a protein

Acknowledgement: These notes are adapted from lecture notes of Wright State University’s Bioinformatics Program.

DNA Replication

Prior to cell division, all the genetic instructions must be “copied” so that each new cell will have a complete set.

Over time, genes accumulate mutations

Environmental factors

• Radiation
• Oxidation

Mistakes in replication or repair

• Deletions, duplications
• Insertions, inversions
• Translocations
• Point mutations

Deletions

Codon deletion:

ACG ATA GCG TAT GTA TAG CCG…

• Effect depends on the protein, position, etc.
• Almost always deleterious
• Sometimes lethal

Frame shift mutation:

ACG ATA GCG TAT GTA TAG CCG…
ACG ATA GCG ATG TAT AGC CG?…

• Almost always lethal
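The difference between the two kinds of deletion can be sketched in a few lines of Python. This is only an illustration: the helper names `codons` and `delete` are made up for this sketch, and the choice of the fourth codon (TAT) as the one removed in the codon-deletion case is an assumption, since the slide does not say which codon is lost.

```python
def codons(seq):
    """Split a nucleotide string into consecutive groups of three."""
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def delete(seq, start, length=1):
    """Return seq with `length` nucleotides removed, beginning at index `start`."""
    return seq[:start] + seq[start + length:]


original = "ACGATAGCGTATGTATAGCCG"

# Assumption: remove the whole fourth codon (TAT). Downstream codons keep their frame.
codon_deleted = codons(delete(original, 9, 3))

# Remove only the first T of TAT. Every later codon is read in a shifted frame.
frame_shifted = codons(delete(original, 9, 1))

print(" ".join(codons(original)))
print(" ".join(codon_deleted))
print(" ".join(frame_shifted))
```

In the frame-shifted result the last group has only two nucleotides, which corresponds to the "CG?" at the end of the slide's second line.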

Indels

When comparing two genes, it is generally impossible to tell whether an indel is an insertion in one gene or a deletion in the other, unless ancestry is known:

ACGTCTGATACGCCGTATCGTCTATCT
ACGTCTGAT---CCGTATCGTCTATCT

The Genetic Code

Substitutions

Substitutions are mutations accepted by natural selection.

Synonymous: CGC → CGA (both code for Arg)

Non-synonymous: GAU → GAA (Asp becomes Glu)
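Whether a substitution is synonymous can be checked by looking up both codons in the genetic code. The sketch below is deliberately minimal: the table `CODON_TO_AMINO_ACID` holds only the four codons from this slide, not the full genetic code, and the function name `is_synonymous` is made up for illustration.

```python
# Partial codon table (assumption: only the four codons used on this slide).
CODON_TO_AMINO_ACID = {
    "CGC": "Arg",
    "CGA": "Arg",
    "GAU": "Asp",
    "GAA": "Glu",
}


def is_synonymous(before, after):
    """True if both codons translate to the same amino acid."""
    return CODON_TO_AMINO_ACID[before] == CODON_TO_AMINO_ACID[after]


print(is_synonymous("CGC", "CGA"))  # Arg to Arg
print(is_synonymous("GAU", "GAA"))  # Asp to Glu
```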

Point Mutation Example: Sickle-cell Disease

Wild-type hemoglobin DNA: 3’----CTT----5’
mRNA: 5’----GAA----3’
Normal hemoglobin: ------[Glu]------

Mutant hemoglobin DNA: 3’----CAT----5’
mRNA: 5’----GUA----3’
Mutant hemoglobin: ------[Val]------
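The path from DNA to amino acid shown above can be traced in code. This is a sketch under stated assumptions: the DNA triplet is treated as the template strand written 3’ to 5’, so pairing each base with its RNA complement gives the mRNA codon 5’ to 3’. The names `transcribe` and `HEMOGLOBIN_CODONS` are hypothetical, and the table contains only the two codons in this example.

```python
# RNA complement of each DNA template base.
RNA_COMPLEMENT = {"A": "U", "T": "A", "C": "G", "G": "C"}

# Partial codon table (assumption: only the two codons from this example).
HEMOGLOBIN_CODONS = {"GAA": "Glu", "GUA": "Val"}


def transcribe(template_3_to_5):
    """Build the mRNA (5' to 3') from a DNA template strand written 3' to 5'."""
    return "".join(RNA_COMPLEMENT[base] for base in template_3_to_5)


for label, dna in (("wild-type", "CTT"), ("mutant", "CAT")):
    codon = transcribe(dna)
    print(label, dna, codon, HEMOGLOBIN_CODONS[codon])
```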

Image credit: U.S. Department of Energy Human Genome Program, http://www.ornl.gov/hgmis.

Comparing Two Sequences

Point mutations, easy:

ACGTCTGATACGCCGTATAGTCTATCT
ACGTCTGATTCGCCCTATCGTCTATCT

Indels are difficult, must align sequences:

ACGTCTGATACGCCGTATAGTCTATCT
CTGATTCGCATCGTCTATCT

ACGTCTGATACGCCGTATAGTCTATCT
----CTGATTCGC---ATCGTCTATCT
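Point mutations are "easy" because two sequences of equal length can be compared position by position. A minimal sketch (the function name `mismatch_positions` is made up for illustration):

```python
def mismatch_positions(a, b):
    """Return the 0-based positions where two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length; align them first")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


first = "ACGTCTGATACGCCGTATAGTCTATCT"
second = "ACGTCTGATTCGCCCTATCGTCTATCT"

print(mismatch_positions(first, second))
```

With an indel the lengths no longer agree, so the same comparison raises an error until gaps have been placed, which is the alignment problem.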

Why Align Sequences?

The draft human genome is available

Automated gene finding is possible

Gene: AGTACGTATCGTATAGCGTAA

• What does it do?

One approach: Is there a similar gene in another species?

• Align sequences with known genes
• Find the gene with the “best” match

Scoring a Sequence Alignment

Match score: +1
Mismatch score: +0
Gap penalty: –1

ACGTCTGATACGCCGTATAGTCTATCT
    ||||| |||   || ||||||||
----CTGATTCGC---ATCGTCTATCT

Matches: 18 × (+1)
Mismatches: 2 × 0
Gaps: 7 × (–1)

Score = +11
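The tally above can be reproduced with a short function. This is a sketch using the slide's scores as defaults. The function name `score_alignment` and its return shape are choices made for this illustration.

```python
def score_alignment(top, bottom, match=1, mismatch=0, gap=-1):
    """Count matches, mismatches, and gap columns of an alignment and score it."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    matches = mismatches = gaps = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            gaps += 1          # one gap penalty per gapped column
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    score = matches * match + mismatches * mismatch + gaps * gap
    return matches, mismatches, gaps, score


print(score_alignment("ACGTCTGATACGCCGTATAGTCTATCT",
                      "----CTGATTCGC---ATCGTCTATCT"))
```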

How can we find an optimal alignment?

Finding the alignment is computationally hard:

ACGTCTGATACGCCGTATAGTCTATCT
CTGAT---TCG-CATCGTC--T-ATCT

There are ~888,000 possibilities to align the two sequences given above.
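One way to arrive at a number of this size, assuming gaps are placed only in the shorter sequence: the 20-nucleotide sequence must be stretched to 27 columns, so 7 of the 27 columns hold gaps, and the count is the number of ways to choose those columns. The slide does not say how its figure was derived, so this reading is an assumption.

```python
from math import comb

columns = 27   # length of the longer sequence
gaps = 7       # 27 - 20 gap columns needed in the shorter sequence

placements = comb(columns, gaps)
print(placements)
```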

Algorithms using a technique called “dynamic programming” are used to find it; they are outside the scope of this workshop.
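For readers who want a glimpse anyway, here is a minimal sketch of a dynamic-programming scorer in the style of the Needleman-Wunsch global alignment algorithm. It is not taken from these notes: the function name is made up, the scores are the ones used earlier in these slides, and it returns only the best score, not the alignment itself.

```python
def global_alignment_score(a, b, match=1, mismatch=0, gap=-1):
    """Best global alignment score of a and b, filled in row by row."""
    # previous[j] = best score for aligning the processed prefix of a with b[:j]
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = previous[j] + gap        # a[i-1] aligned to a gap
            left = current[j - 1] + gap   # b[j-1] aligned to a gap
            current.append(max(diagonal, up, left))
        previous = current
    return previous[-1]


print(global_alignment_score("ACGT", "AGT"))
```

The work grows with the product of the two lengths (27 × 20 cells for the sequences above) instead of with the number of possible gap placements.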

Global and Local Alignments

Global alignment – score the entire alignment

Local alignment – find the best matching subsequence

Why local sequence alignment?

• Subsequence comparison between a DNA sequence and a genome
• Protein function domains
• Exon matching

Example

Compare the two sequences:

TTGACACCCTCCCAATT
ACCCCAGGCTTTACACAG

Global alignment (does it look good?)

TTGACACCCTCC-CAATT
    ||  ||   ||
ACCCCAGGCTTTACACAG

Local alignment (does it look good?)

---------TTGACACCCTCCCAATT
         || ||||
ACCCCAGGCTTTACACAG--------
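The row of bars between two aligned sequences marks the columns where they agree. A small sketch that produces it (the function name `match_line` is made up for illustration):

```python
def match_line(top, bottom):
    """Return a string with '|' under every column where both rows hold the same base."""
    marks = ["|" if a == b and a != "-" else " " for a, b in zip(top, bottom)]
    return "".join(marks).rstrip()


global_top, global_bottom = "TTGACACCCTCC-CAATT", "ACCCCAGGCTTTACACAG"
local_top, local_bottom = "---------TTGACACCCTCCCAATT", "ACCCCAGGCTTTACACAG--------"

for top, bottom in ((global_top, global_bottom), (local_top, local_bottom)):
    print(top)
    print(match_line(top, bottom))
    print(bottom)
    print()
```

Counting the bars gives six matching columns in each case. In the local alignment, four of them are consecutive.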

Dot Plots

One of the simplest and oldest methods for sequence alignment

Visualization of regions of similarity

• Assign one sequence to the horizontal axis
• Assign the other to the vertical axis
• Place a dot wherever the two sequences match
• Diagonal lines indicate adjacent regions of identity

A Simple Example

Construct a simple dot plot for

TAGTCGATG
TGGTCATC

The alignment is

TAGTCGATG
TGGTC-ATC

  T A G T C G A T G
T *     *       *
G     *     *     *
G     *     *     *
T *     *       *
C         *
A   *         *
T *     *       *
C         *
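A dot plot like this one can be generated mechanically: one row per letter of the vertical sequence, with a mark in every column whose letter matches. The sketch below uses hypothetical names and prints plain text only.

```python
def dot_plot(horizontal, vertical):
    """Return the dot plot as a list of text lines, one per vertical letter."""
    lines = ["  " + " ".join(horizontal)]  # header row: the horizontal sequence
    for v in vertical:
        cells = ["*" if v == h else " " for h in horizontal]
        lines.append((v + " " + " ".join(cells)).rstrip())
    return lines


plot = dot_plot("TAGTCGATG", "TGGTCATC")
print("\n".join(plot))
```

In the output, the stars running down and to the right along a diagonal correspond to the stretches where the aligned sequences agree.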

What else can it do (and how)?

• Gaps
• Inverse substring
• Repeat
• Palindrome
• Gene conservation and order study