
Steve Thompson

1

Special Topics BSC5936: An Introduction to Bioinformatics

Florida State University
The Department of Biological Science
www.bio.fsu.edu

Sept. 9, 2003

The Dot Matrix Method

Steven M. Thompson
Florida State University School of Computational Science and Information Technology (CSIT)


The Dot Matrix Method

A general way to see similarities in pair-wise comparisons:

Gets you started thinking about sequence alignment in general.

Provides a ‘Gestalt’ of all possible alignments between two sequences.

To begin, I will use a very simple 0, 1 identity scoring function (1 for a match, 0 for no match) without any windowing. As you will see later today, more complex scoring functions will normally be used in sequence analysis (especially with amino acid sequences). This example is based on an illustration in Sequence Analysis Primer (Gribskov and Devereux, editors, 1991).

The sequences to be compared are written out along the x and y axes of a matrix. Put a dot wherever symbols match; identities are highlighted.

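(• marks a match; . marks an empty cell)

  S E Q U E N C E A N A L Y S I S P R I M E R
S • . . . . . . . . . . . . • . • . . . . . .
E . • . . • . . • . . . . . . . . . . . . • .
Q . . • . . . . . . . . . . . . . . . . . . .
U . . . • . . . . . . . . . . . . . . . . . .
E . • . . • . . • . . . . . . . . . . . . • .
N . . . . . • . . . • . . . . . . . . . . . .
C . . . . . . • . . . . . . . . . . . . . . .
E . • . . • . . • . . . . . . . . . . . . • .
A . . . . . . . . • . • . . . . . . . . . . .
N . . . . . • . . . • . . . . . . . . . . . .
A . . . . . . . . • . • . . . . . . . . . . .
L . . . . . . . . . . . • . . . . . . . . . .
Y . . . . . . . . . . . . • . . . . . . . . .
S • . . . . . . . . . . . . • . • . . . . . .
I . . . . . . . . . . . . . . • . . . • . . .
S • . . . . . . . . . . . . • . • . . . . . .
P . . . . . . . . . . . . . . . . • . . . . .
R . . . . . . . . . . . . . . . . . • . . . •
I . . . . . . . . . . . . . . • . . . • . . .
M . . . . . . . . . . . . . . . . . . . • . .
E . • . . • . . • . . . . . . . . . . . . • .
R . . . . . . . . . . . . . . . . . • . . . •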

Since this is a comparison between two of the same sequences, an intra-sequence comparison, the most obvious feature is the main identity diagonal. Two short perfect palindromes can also be seen as crosses directly off the main diagonal; they are “ANA” and “SIS.”
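The identity dot matrix described on this slide can be sketched in a few lines of Python. This is only an illustration, not the code behind the original figure: the function names `dot_matrix` and `render`, and the choice of `.` for an empty cell, are assumptions made for this example.

```python
def dot_matrix(seq_x, seq_y):
    """Return rows of booleans: cell [i][j] is True when seq_y[i] == seq_x[j]."""
    return [[y == x for x in seq_x] for y in seq_y]


def render(seq_x, seq_y, matrix, dot="•", blank="."):
    """Format the matrix as text: seq_x across the top, seq_y down the side."""
    lines = ["  " + " ".join(seq_x)]
    for symbol, row in zip(seq_y, matrix):
        cells = " ".join(dot if hit else blank for hit in row)
        lines.append(symbol + " " + cells)
    return "\n".join(lines)


seq = "SEQUENCEANALYSISPRIMER"
matrix = dot_matrix(seq, seq)  # intra-sequence (self) comparison
print(render(seq, seq, matrix))
```

In a self comparison every cell on the main diagonal is a match, and the plot is symmetric about that diagonal, which is why the palindromes show up as small crosses.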


Since your own mind and eyes are still better than computers at discerning complex visual patterns, especially when more than one pattern is being considered, you can see all these ‘less than best’ comparisons as well as the main one, and then you can ‘zoom-in’ on those regions of interest using more detailed procedures.

If the previous plot was a double-stranded DNA or RNA sequence self comparison, the inverted repeat regions would be indicative of potential cruciform structures at that point.

Direct internal repeats will appear as parallel diagonals off the main diagonal.

The biggest asset of dot matrix analysis is that it allows you to visualize the entire comparison at once, not concentrating on any one ‘optimal’ region, but rather giving you the ‘Gestalt’ of the whole thing.

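Check out the ‘mutated’ inter-sequence comparison below:

  S E Q U E N C E A N A L Y S I S P R I M E R
S • . . . . . . . . . . . . • . • . . . . . .
E . • . . • . . • . . . . . . . . . . . . • .
Q . . • . . . . . . . . . . . . . . . . . . .
U . . . • . . . . . . . . . . . . . . . . . .
E . • . . • . . • . . . . . . . . . . . . • .
N . . . . . • . . . • . . . . . . . . . . . .
C . . . . . . • . . . . . . . . . . . . . . .
E . • . . • . . • . . . . . . . . . . . . • .
P . . . . . . . . . . . . . . . . • . . . . .
R . . . . . . . . . . . . . . . . . • . . . •
I . . . . . . . . . . . . . . • . . . • . . .
M . . . . . . . . . . . . . . . . . . . • . .
E . • . . • . . • . . . . . . . . . . . . • .
R . . . . . . . . . . . . . . . . . • . . . •

Here you can easily see the effect of a sequence ‘insertion’ or ‘deletion.’ It is impossible to tell whether the evolutionary event that caused the discrepancy between the two sequences was an insertion or a deletion, and hence this phenomenon is called an ‘indel.’ A jump or shift in the register of the main diagonal on a dotplot clearly points out the existence of an indel (again using the zero:one match score function).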


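Another phenomenon that is very easy to visualize with dot matrix analysis is duplication, or direct repeats. These are shown in the following example:

  S E Q U E N C E A N A L Y S I S P R I M E R
S • . . . . . . . . . . . . • . • . . . . . .
E . • . . • . . • . . . . . . . . . . . . • .
Q . . • . . . . . . . . . . . . . . . . . . .
U . . . • . . . . . . . . . . . . . . . . . .
E . • . . • . . • . . . . . . . . . . . . • .
N . . . . . • . . . • . . . . . . . . . . . .
C . . . . . . • . . . . . . . . . . . . . . .
E . • . . • . . • . . . . . . . . . . . . • .
S • . . . . . . . . . . . . • . • . . . . . .
E . • . . • . . • . . . . . . . . . . . . • .
Q . . • . . . . . . . . . . . . . . . . . . .
U . . . • . . . . . . . . . . . . . . . . . .
E . • . . • . . • . . . . . . . . . . . . • .
N . . . . . • . . . • . . . . . . . . . . . .
C . . . . . . • . . . . . . . . . . . . . . .
E . • . . • . . • . . . . . . . . . . . . • .
S • . . . . . . . . . . . . • . • . . . . . .
E . • . . • . . • . . . . . . . . . . . . • .
Q . . • . . . . . . . . . . . . . . . . . . .
U . . . • . . . . . . . . . . . . . . . . . .
E . • . . • . . • . . . . . . . . . . . . • .
N . . . . . • . . . • . . . . . . . . . . . .
C . . . . . . • . . . . . . . . . . . . . . .
E . • . . • . . • . . . . . . . . . . . . • .

The ‘duplication’ here is seen as a distinct column of diagonals; whenever you see either a row or column of diagonals in a dotplot, you are looking at direct repeats.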

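Now consider the more complicated ‘mutation’ in the following comparison:

  S E Q U E N C E A N A L Y S I S P R I M E R
A . . . . . . . . • . • . . . . . . . . . . .
N . . . . . • . . . • . . . . . . . . . . . .
A . . . . . . . . • . • . . . . . . . . . . .
L . . . . . . . . . . . • . . . . . . . . . .
Y . . . . . . . . . . . . • . . . . . . . . .
Z . . . . . . . . . . . . . . . . . . . . . .
E . • . . • . . • . . . . . . . . . . . . • .
S • . . . . . . . . . . . . • . • . . . . . .
E . • . . • . . • . . . . . . . . . . . . • .
Q . . • . . . . . . . . . . . . . . . . . . .
U . . . • . . . . . . . . . . . . . . . . . .
E . • . . • . . • . . . . . . . . . . . . • .
N . . . . . • . . . • . . . . . . . . . . . .
C . . . . . . • . . . . . . . . . . . . . . .
E . • . . • . . • . . . . . . . . . . . . • .
S • . . . . . . . . . . . . • . • . . . . . .

Again, notice the diagonals. However, they have now been displaced off the center diagonal of the plot and, in fact, in this example, show the occurrence of a ‘transposition.’ Dot matrix analysis is one of the only sensible ways to locate such transpositions in sequences. Inverted repeats still show up as lines perpendicular to the diagonals; they are just no longer on the center of the plot. The ‘deletion’ of ‘PRIMER’ is shown by the lack of a corresponding diagonal.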
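The diagonals in these ‘mutated’ comparisons can also be located programmatically. The sketch below is an illustration only: the function name `diagonal_runs`, the `min_len` cutoff of three, and the convention that a diagonal's offset is the x index minus the y index (both 0-based) are assumptions made here, not part of the original slides.

```python
def diagonal_runs(seq_x, seq_y, min_len=3):
    """Find runs of consecutive identities along each diagonal.

    Returns (offset, start_y, length) tuples, where offset is the x index
    minus the y index, start_y is the 0-based start of the run in seq_y,
    and only runs of at least min_len matches are reported.
    """
    runs = []
    for offset in range(-(len(seq_y) - 1), len(seq_x)):
        length = 0
        # Walk down one diagonal; the extra final step flushes an open run.
        for i in range(len(seq_y) + 1):
            j = i + offset
            inside = i < len(seq_y) and 0 <= j < len(seq_x)
            if inside and seq_y[i] == seq_x[j]:
                length += 1
            else:
                if length >= min_len:
                    runs.append((offset, i - length, length))
                length = 0
    return runs


reference = "SEQUENCEANALYSISPRIMER"
examples = {
    "indel": "SEQUENCEPRIMER",
    "duplication": "SEQUENCE" * 3,
    "transposition": "ANALYZESEQUENCES",
}
for label, other in examples.items():
    print(label, diagonal_runs(reference, other))
```

Under these assumptions the indel appears as two runs whose offsets differ by the size of the gap (0 and 8), the duplication as three runs of length 8 at offsets -16, -8, and 0, and the transposition as two runs displaced to either side of the center diagonal (offsets -7 and 8).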


Reconsider the same plot. Notice the extraneous dots that indicate neither runs of identity between the two sequences nor inverted repeats. These merely contribute ‘noise’ to the plot and are due to the ‘random’ occurrence of the letters in the sequences, the composition of the sequences themselves.

How can we ‘clean up’ the plots so that this noise does not detract from our interpretations? Consider the implementation of a filtered windowing approach; a dot will only be placed if some ‘stringency’ is met.

What is meant by this is that a dot is placed at the middle of a window of some defined size if, and only if, some defined criterion is met within that window. Then the window is shifted one position and the entire process is repeated. This very successfully rids the plot of unwanted noise.

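Filtered Windowing

In this plot a window of size three and a stringency of two are used to considerably improve the signal to noise ratio (remember, I am using a 1:0 identity scoring function).

  S E Q U E N C E A N A L Y S I S P R I M E R
A . . . . . . . . • . . . . . . . . . . . . .
N . . . . . . . . . • . . . . . . . . . . . .
A . . . . . . . . . . • . . . . . . . . . . .
L . . . . . . . . . . . • . . . . . . . . . .
Y . . . . . . . . . . . . • . . . . . . . . .
Z . . . . . . . . . . . . . . . . . . . . . .
E . . . . . . . . . . . . . . . . . . . . . .
S • . . . . . . . . . . . . . . . . . . . . .
E . • . . . . . . . . . . . . . . . . . . . .
Q . . • . . . . . . . . . . . . . . . . . . .
U . . . • . . . . . . . . . . . . . . . . . .
E . . . . • . . . . . . . . . . . . . . . . .
N . . . . . • . . . . . . . . . . . . . . . .
C . . . . . . • . . . . . . . . . . . . . . .
E . . . . . . . • . . . . . . . . . . . . . .
S . . . . . . . . . . . . . . . . . . . . . .

The only remaining dots indicate the two runs of identity between the two sequences; however, any indication of the palindrome “ANA” has been lost. This is because our filtering approach was too stringent to catch such a short element. In general you need to make your window about the same size as the element you are attempting to locate. In the case of our palindrome, “AN” and “NA” are the inverted repeat sequences, and since our window was set to three, we will not be able to see an element only two letters long. Had we set our stringency filter to one in a window of two, then these would be visible. The Wisconsin Package’s implementation of dot matrix analysis, the paired programs Compare and DotPlot, uses the window/stringency method by default.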
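The window/stringency filter can be sketched as follows. This is a minimal illustration, not the Wisconsin Package's Compare algorithm: the function name `filtered_dot_matrix` is hypothetical, the window is assumed to be odd so that it has a middle cell, and windows that run off the edge of the matrix are assumed to count only the cells that exist.

```python
def filtered_dot_matrix(seq_x, seq_y, window=3, stringency=2):
    """Place a dot at (i, j) when at least `stringency` identities fall in a
    window of `window` cells centered on (i, j) along its diagonal."""
    half = window // 2  # assumes an odd window size
    matrix = []
    for i in range(len(seq_y)):
        row = []
        for j in range(len(seq_x)):
            matches = 0
            for k in range(-half, half + 1):
                a, b = i + k, j + k
                # Cells outside the matrix are skipped (partial edge windows).
                if 0 <= a < len(seq_y) and 0 <= b < len(seq_x):
                    matches += seq_y[a] == seq_x[b]
            row.append(matches >= stringency)
        matrix.append(row)
    return matrix


seq_x = "SEQUENCEANALYSISPRIMER"
seq_y = "ANALYZESEQUENCES"
filtered = filtered_dot_matrix(seq_x, seq_y, window=3, stringency=2)
dots = {(i, j) for i, row in enumerate(filtered) for j, hit in enumerate(row) if hit}
print(sorted(dots))
```

With a window of three and a stringency of two, the only cells that survive for this pair of sequences are the “ANALY” and “SEQUENCE” runs; every isolated identity is filtered away.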


Filtered dot plot techniques

You need to be careful with window/stringency dot matrix methods. Default window sizes and stringencies may not be appropriate for the analysis at hand.

The Wisconsin Package default window size and stringency for protein sequences are 30 and 10 respectively (based on BLOSUM scores [soon to be explained in Dr. Quine’s lecture]).

Sometimes this is perfectly reasonable. Take for instance the next real-life example: the human calmodulin protein sequence compared to itself.

Human calmodulin x itself

What’s your interpretation? Do you know what the EF-hand is?


The calmodulin structure

The four EF-hand helix-loop-helix conformations (at positions 20, 56, 93, and 129) bind Ca++ ions to affect several biological systems, including mediating control of a large number of Ca++ dependent enzymes, in particular several protein kinases and phosphatases, many of which affect systems ranging from muscle action and cAMP to insulin release.

Calmodulin x alpha actinin

default parameters → some confusion
window=24/stringency=24 → clearer picture

Alpha actinin has two EF-hand motifs to calmodulin’s four.


Even more can be done with RNA

Consider the following set of examples from the phenylalanine transfer RNA (tRNA-Phe) molecule from Baker’s yeast.

The sequence and structure of this molecule are also known; the illustration will show how simple dot-matrix procedures can quickly lead to functional and structural insights (even without complex folding algorithms).

If run with all default settings (including a 0, 1 scoring table), the dotplot from a comparison of this sequence with itself is quite uninformative, only showing the main identity diagonal:

Default RNA self comparison (window of 21 and stringency of 14 with the 0, 1 scoring function)


However, if you adjust the window size down to find finer features, some elements of symmetry become apparent. Here I have changed the window size to 7 and the stringency value to 5. As a general guide, pick a window size about the same size as the feature that you are trying to recognize and a stringency such that unwanted background noise is filtered away just enough to enable you to see that desired feature.

Several direct repeats are now obvious that remained obscured in the previous analysis.

RNA comparisons of the reverse complement of a sequence to itself can often be very informative. Here the yeast tRNA sequence is compared to its reverse complement using the same 5 out of 7 stringency setting as previously. The stem-loop inverted repeats of the tRNA clover-leaf molecular shape become obvious. They appear as clearly delineated diagonals running perpendicular to an imaginary main diagonal that runs in the opposite direction from before. Take for instance the middle stem: the region of the molecule centered at approximately base number 38 has a clear propensity to base pair with itself without creating a loop, since it crosses the main diagonal, and then just after a small unpaired gap another stem is formed between the region from about base number 24 through 30 with approximately 46 through 40.


That same region ‘zoomed in on’ has some small direct repeats seen by comparing the sequence against itself without reversal:

But looking at the same region of the sequence against its reverse-complement shows a wealth of potential stem-loop structure in the transfer RNA:


22 GAGCGCCAGACT G      12, 22
   || | ||||| |  A
48 CTGGAGGTCTAG A      3

Base position 22 through position 33 base pairs with (think: is quite similar to the reverse-complement of) itself from base position 37 through position 48. MFold, Zuker’s RNA folding algorithm, uses base pairing energies to find the family of optimal and suboptimal structures; the most stable structure found is shown to possess a stem at positions 27 to 31 with 39 to 43. However, the region around position 38 is represented as a loop. The actual modeled structure as seen in PDB’s 1TRA shows ‘reality’ lies somewhere in between.
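The pairing between the two arms of this stem can be checked with a short sketch. The two strands are copied from the stem-loop figure on this slide; the function name `pairing_track` is hypothetical, and treating G with T as a pair (the G-U wobble, written with T as in the figure) is an assumption made here to match the pairing marks shown.

```python
# Pairs are (top base, bottom base); T stands in for U, as in the figure.
WATSON_CRICK = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "T"), ("T", "G")}  # assumption: G-U wobble pairs are allowed


def pairing_track(top, bottom, allow_wobble=True):
    """Return a string with '|' where top[i] can pair with bottom[i]."""
    allowed = WATSON_CRICK | WOBBLE if allow_wobble else WATSON_CRICK
    return "".join("|" if pair in allowed else " " for pair in zip(top, bottom))


top = "GAGCGCCAGACT"     # positions 22 to 33, read 5' to 3'
bottom = "CTGGAGGTCTAG"  # positions 48 down to 37, read 3' to 5'
track = pairing_track(top, bottom)
print(top)
print(track)
print(bottom)
```

Nine of the twelve positions pair under these rules, eight as Watson-Crick pairs plus one wobble pair at the end of the stem nearest the loop.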

Conclusions

What about these alike areas? What’s the best ‘path’ through the dot matrix? How long do I extend it? How can I ‘zoom-in’ on it to see exactly what’s happening? Where, specifically, is this alignment; how can I see the ‘best’ ones? And, what can I learn from these alignments?

This brings up the alignment problem. It is easy to see that two sequences are aligned when they have identical symbols at identical positions, but what happens when symbols are not identical or the sequences are not the same length? How can we know that the most alike portions of our sequences are aligned, when is an alignment optimal, and does optimal mean biologically correct?

But, how to do all of this?

A ‘brute force’ approach just won’t work. Even without considering the introduction of gaps, the computation required to compare all possible alignments between two sequences requires time proportional to the product of the lengths of the two sequences. Therefore, if the two sequences are approximately the same length (N), this is an N^2 problem. To include gaps, we would have to repeat the calculation 2N times to examine the possibility of gaps at each possible position within the sequences, now an N^(4N) problem. Waterman illustrated the problem in 1989, stating that to align two sequences 300 symbols long, 10^88 comparisons would be required, about the same number of elementary particles estimated to exist in the universe!

Part of a better solution . . . enter the dynamic programming algorithm and Dr. Jack Quine’s lecture.

FOR EVEN MORE INFO...

http://bio.fsu.edu/~stevet/workshop.html

Contact me ([email protected]) for specific bioinformatics assistance and/or collaboration.