Plasmid mapping computer program

Volume 12 Number 1 1984 Nucleic Acids Research

Garry P. Nolan*, Claude V. Maina and Aladar A. Szalay

Boyce Thompson Institute for Plant Research, Cornell University, Tower Road, Ithaca, NY 14853, USA

Received 11 October 1983

ABSTRACT

Three new computer algorithms are described which rapidly order the restriction fragments of a plasmid DNA which has been cleaved with two restriction endonucleases in single and double digestions. Two of the algorithms are contained within a single computer program (called MPCIRC). The Rule-Oriented algorithm constructs all logical circular map solutions within sixty seconds (14 double-digestion fragments) when used in conjunction with the Permutation method. The program is written in Apple Pascal and runs on an Apple II Plus Microcomputer with 64K of memory. A third algorithm is described which rapidly maps double digests and uses the above two algorithms as adjuncts. Modifications of the algorithms for linear mapping are also presented.

INTRODUCTION

Molecular biologists utilizing the tools of recombinant DNA technology frequently face the task of mapping DNA fragments using restriction endonucleases. Mapping of a circular DNA molecule involves the placement of restriction endonuclease sites on a circular map. The relative position of each restriction site should be consistent with the data obtained from electrophoretic analysis of DNA, following cleavage of DNA in single and double digestions with restriction endonucleases. Calculation of restriction maps is time-consuming, and it requires considerable effort to verify that the resulting map represents the unique solution.

For construction of restriction endonuclease maps, several computer programs and their associated algorithms have been described (1-4). The main advantages of using a computer in DNA mapping are speed, as it is able to examine hundreds of restriction site map orders per second, and accuracy in computing all consistent maps within the imposed error range. However, due to the mathematical nature of restriction mapping (1), the number of different map orders that must be examined is equal to the factorial of the number of DNA fragments in the double digestion. When the number of DNA fragments derived from a double digestion exceeds ten, the number of maps to be analyzed becomes unmanageable in terms of time and memory, even with the use of a large computer. Programs whose algorithms use the single digests to "permute" map orders are also quickly taxed (?).

SINGLE DIGESTION (SD)        DOUBLE DIGESTION (DD)

a) AA        b) BB           c) AB
   8            9               2
   13.5         10              3
   15           17.5            6
                                7.5
                                8
                                10

Table I: a) and b). DNA fragment sizes generated in single digestion of the plasmid in Figure I by the restriction enzymes "A" and "B" respectively. c). Fragment sizes of DNA from plasmid in Figure I after double digestion with restriction enzymes "A" and "B."

© IRL Press Limited, Oxford, England.

Microcomputers are unable to run some of the published programs due to the speed at which they execute instructions. Programs, including those previously published (1-3), which take one or two seconds on a mainframe computer may take several minutes or hours if run on a microcomputer. Therefore, the development of algorithms that can rapidly compute complex restriction maps was considered essential for use with microcomputers. In contrast to the known mapping programs, these algorithms are able to readily generate maps that comprise more than 10 double-digest fragments.

ALGORITHMS

The three algorithms are explained using data from a hypothetical plasmid (Fig. 2a). Table I lists the fragment sizes that are found upon cleavage with restriction endonucleases "A" and "B" in single and double digestions. To refer to the lists in Table I, the following terms will be used throughout the text. Single Digestion (SD): An "SD" fragment may be of type "AA" or "BB", depending upon whether the DNA fragment was cut on both ends with restriction endonuclease A or B respectively. Double Digestion (DD): A DD fragment may have been cleaved at its termini by the restriction endonuclease(s) A, B or A & B, resulting in, respectively, type AA, BB, or AB.

Fig. 1 represents how the fragment sizes of members of the DD list can add up to the sizes of fragments from the SD lists. Each grouping of DD fragments which possibly comprises the component(s) of an SD fragment is called a "fork". All the forks presented in Fig. 1 are subsets of possible map orders. Only some forks are correct and consistent with each other. The map would be essentially solved if it could be determined which forks are the correct forks. With the data set up in this fashion, all three algorithms can proceed to map.

Figure 1. Graphic presentation of the "fork list" generated from the data. a). Each SD fragment (subscripted with AA or BB) can be comprised of the summation (forks) of different DD fragments. b). 8 has two forks, "8" and "6,2." This fork structure is the basis for the mapping algorithms.
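As an illustration of how such a fork list could be built, the following is a minimal Python sketch; it is not the authors' Apple Pascal code. It assumes that a fork is any combination of DD fragments whose sizes sum to the SD fragment size within a relative error range. The function name build_forks and the tolerance parameter are hypothetical, and the exhaustive search over combinations is chosen for clarity rather than speed.

```python
from itertools import combinations

# Example data from Table I.
SD_A = [8, 13.5, 15]
SD_B = [9, 10, 17.5]
DD = [2, 3, 6, 7.5, 8, 10]


def build_forks(sd_size, dd, tolerance=0.0):
    """Return every combination of DD fragments that sums to sd_size.

    tolerance is a fraction of sd_size (an assumption; the text only says
    that forks are generated "within the error range").
    """
    forks = []
    for k in range(1, len(dd) + 1):
        for combo in combinations(dd, k):
            if abs(sum(combo) - sd_size) <= tolerance * sd_size:
                forks.append(combo)
    return forks


if __name__ == "__main__":
    for label, sizes in (("AA", SD_A), ("BB", SD_B)):
        for size in sizes:
            print(label, size, build_forks(size, DD))
```

With a zero error range, the SD fragment 8 receives the two forks (8) and (2, 6), in agreement with the caption of Fig. 1; widening the tolerance increases the number of forks per SD fragment.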

The Rule-Oriented Method. The Rule-Oriented Method (ROM) uses a roster of seven rules to discard forks which are illogical, leaving only those forks which can produce circular restriction maps that are consistent with the data.

ROM generates two lists to detail specific knowledge about members of the DD list. The first, called the "Termini Type" list, contains information regarding the DD fragments' molecular termini. Rules 1 and 2 in Table IV provide details on how to determine the termini type for DD fragments.

The second list ROM generates is the "Times Known" list. Whenever only one fork is left for a given SD fragment, those fork members must comprise that SD fragment. The number of times a given DD fragment is found in different "unique" forks is listed in the "Times Known" column. The results from the application of this observation and Rule 2 to the "unique" forks in Fig. 1 are listed in Table IIb.
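The two lists can be sketched in a few lines of Python. This is only an illustration under stated assumptions: fragment sizes are compared exactly (no error range), the fork list of Fig. 1 is hard-coded, and the names termini_after_rule_1 and times_known are hypothetical rather than taken from MPCIRC.

```python
# Example data from Table I (sizes treated as exact).
SD_A = [8, 13.5, 15]
SD_B = [9, 10, 17.5]
DD = [2, 3, 6, 7.5, 8, 10]

# Fork list of Fig. 1, keyed by (parental type, SD fragment size).
FORKS = {
    ("AA", 8): [(8,), (6, 2)],
    ("AA", 13.5): [(7.5, 6)],
    ("AA", 15): [(10, 3, 2)],
    ("BB", 9): [(6, 3)],
    ("BB", 10): [(10,), (8, 2)],
    ("BB", 17.5): [(10, 7.5), (8, 7.5, 2)],
}


def termini_after_rule_1(dd, sd_a, sd_b):
    """Rule 1: a DD size absent from both SD lists is typed AB, else unknown."""
    sd_sizes = set(sd_a) | set(sd_b)
    return {size: (None if size in sd_sizes else "AB") for size in dd}


def times_known(dd, forks):
    """Count how often each DD fragment occurs in a single remaining fork."""
    counts = {size: 0 for size in dd}
    for fork_list in forks.values():
        if len(fork_list) == 1:  # a "unique" fork
            for size in fork_list[0]:
                counts[size] += 1
    return counts


termini = termini_after_rule_1(DD, SD_A, SD_B)
known = times_known(DD, FORKS)
```

For the example data, this types fragments 2, 3, 6 and 7.5 as AB and leaves 8 and 10 unknown, and it counts the fragments of the three unique forks (13.5, 15 and 9) in the "Times Known" list.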

The algorithm successively sweeps through the listing of forks, using the known data and rules of logic (Table IV) to systematically discard inconsistent forks. As forks are removed, unique forks are left from which more data is gained to remove more forks.
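The sweeping process can be illustrated with a reduced Python sketch. It is an assumption-laden simplification, not the ROM implementation: only Rules 2A, 4, 5A and 5C of Table IV are applied, all DD fragment sizes are assumed to be distinct, and the starting termini types are taken from Table IIb. The names sweep and x_termini are hypothetical.

```python
def x_termini(fork, parent, termini):
    """Count the known X termini in a fork; also report if every member is typed."""
    x = parent[0]  # "A" for parental type AA, "B" for BB
    known = [termini[size] for size in fork if termini.get(size)]
    return sum(t.count(x) for t in known), len(known) == len(fork)


def sweep(forks, termini):
    """Repeat Rules 2A, 4, 5A and 5C until no further fork can be removed."""
    forks = {p: {sd: list(fl) for sd, fl in g.items()} for p, g in forks.items()}
    termini = dict(termini)
    changed = True
    while changed:
        changed = False
        for parent, group in forks.items():
            for sd in group:
                if len(group[sd]) != 1:
                    continue
                only = group[sd][0]  # the single remaining fork
                # Rule 2A: a lone DD fragment takes the parental type.
                if len(only) == 1 and termini.get(only[0]) is None:
                    termini[only[0]] = parent
                    changed = True
                # Rule 4: its members may not appear in other forks of this type.
                for other in group:
                    if other == sd:
                        continue
                    kept = [f for f in group[other] if not set(f) & set(only)]
                    if len(kept) != len(group[other]):
                        group[other] = kept
                        changed = True
            # Rule 5A / 5C: at most 2 X termini, exactly 2 once all are known.
            for sd in group:
                kept = []
                for fork in group[sd]:
                    count, all_known = x_termini(fork, parent, termini)
                    if count <= 2 and not (all_known and count != 2):
                        kept.append(fork)
                if len(kept) != len(group[sd]):
                    group[sd] = kept
                    changed = True
    return forks, termini


forks = {
    "AA": {8: [(8,), (6, 2)], 13.5: [(7.5, 6)], 15: [(10, 3, 2)]},
    "BB": {9: [(6, 3)], 10: [(10,), (8, 2)], 17.5: [(10, 7.5), (8, 7.5, 2)]},
}
termini = {2: "AB", 3: "AB", 6: "AB", 7.5: "AB", 8: None, 10: "BB"}
final_forks, final_termini = sweep(forks, termini)
```

On the example data each SD fragment is left with a single fork, and fragment 8 is typed AA. With real data and a non-zero error range, several forks per SD fragment may survive, which is why the remaining forks are handed on for ordering.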

The mapping algorithm is finished when, given the data, no more forks may be removed. The ROM program passes the sets of consistent forks to the PM portion of the program, which then dissects the different maps from each other and prints them out.

a)  DD fragment    Termini Type    Times Known
    2              AB              0
    3              AB              0
    6              AB              0
    7.5            AB              0
    8              --              0
    10             --              0

b)  DD fragment    Termini Type    Times Known
    2              AB              1
    3              AB              2
    6              AB              2
    7.5            AB              1
    8              --              0
    10             BB              1

Table II: The leftmost columns of a and b contain the weights of the double digest fragments. The middle column lists information concerning the molecular termini of each double digest ("--" indicates no termini type is currently known). The rightmost column contains the number of times a given double digest fragment is absolutely known to be a part of a single remaining fork for some SD parental fragment. a) Status of information deduced about double digest fragments of the hypothetical example plasmid after the application of Rule 1 to the data. b) Information known about double digest fragments after subsequent application of several rules from ROM.

[Table III: step-by-step trace of the Permutation Method on the example data, arranged in columns A to F]

Table III: Basic strategy for mapping by Permutation Method. A) Letters referring to horizontal rows cited in text. B) Representation of fork building process. Groups of numbers represent forks from SD fragments with the DD fragments being ordered and tested. C) Lists the type of forks in which the new seed will be sought. An "*" indicates a map order has been determined and no search type is needed. D) Updated detailing of which DD fragments have been used (X represents used fragments). E) Lists of which SD fragments and their forks (number under SD fragments) have been used in generating map orders. An "*" indicates which fork was last added. F) Seed DD fragment used in search. The seed is always the last element in a fragment order. An "*" in this column indicates a map has been found.

The Permutation Method. The Permutation Method (PM) uses the fork structure, as represented in Fig. 1, to trace a path through the data. The algorithm, starting at an arbitrary point in the data set, carefully builds one fork onto another, testing every possible fork and fragment order. The efficiency of this permutation method is manifested by its ability to rapidly and logically discard thousands of incorrect fragment orders without directly analyzing each one.

PM initiates the mapping process by sequentially stepping through each fork of the largest SD fragment (see Fig. 1). A chosen fork, which must have more than one DD member, is then placed as in row a under the first column of Table III (A). The rightmost element of this row, "2," is called the "seed." The leftmost element of the fork, "7.5," is called the "endmap." The algorithm traces a path through the fork structures using the seed to initiate a search such that the last element listed is the same as the "endmap" and such that all the elements have been used. When the seed equals the endmap and all the DD fragments have been used, a consistent map order has been determined.
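What "a consistent map order" means can be made concrete with a brute-force checker in Python. This is deliberately not the Permutation Method: it tests every circular order and every assignment of cutting enzymes directly, which is exactly the work PM is designed to avoid. Sizes are compared exactly (no error range), and the names single_digest and consistent_maps are hypothetical.

```python
from itertools import permutations, product

SD_A = [8, 13.5, 15]
SD_B = [9, 10, 17.5]
DD = [2, 3, 6, 7.5, 8, 10]


def single_digest(order, sites, enzyme):
    """Fragment sizes obtained when only `enzyme` cuts the circular map.

    order[i] is the i-th DD fragment around the circle and sites[i] is the
    enzyme cutting at the junction just after order[i].
    """
    n = len(order)
    cuts = [i for i in range(n) if sites[i] == enzyme]
    if not cuts:
        return [sum(order)]
    sizes = []
    for k, start in enumerate(cuts):
        end = cuts[(k + 1) % len(cuts)]
        i = (start + 1) % n
        total = order[i]
        while i != end:  # extend the fragment up to the next cut
            i = (i + 1) % n
            total += order[i]
        sizes.append(total)
    return sorted(sizes)


def consistent_maps(dd, sd_a, sd_b):
    """Return circular DD orders (rotations and mirror images merged) that fit."""
    first, rest = dd[0], dd[1:]
    found = set()
    for perm in permutations(rest):
        order = (first,) + perm
        for sites in product("AB", repeat=len(order)):
            if (single_digest(order, sites, "A") == sorted(sd_a)
                    and single_digest(order, sites, "B") == sorted(sd_b)):
                mirror = (first,) + perm[::-1]
                found.add(min(order, mirror))
    return sorted(found)


maps = consistent_maps(DD, SD_A, SD_B)
```

For the six-fragment example this examines 120 orders times 64 site assignments and finds a single circular order, in which 7.5, 8 and 2 are adjacent as in the fork used to start the PM search above. The factorial growth of this search is what makes a fork-guided method necessary for larger maps.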

Enzyme Group Technique. A third algorithm has been developed in our laboratory to rapidly determine which groups of double digest forks are consistent with each other. The algorithm is termed the Enzyme Group Technique (EGT).

The data is again set up in the form of forks as in Fig. 1. Starting with the largest single digest fragment of type AA, that is 15 (AA), its fork is placed in a list: "(10, 3, 2)". New forks are chosen, and their members added to a given enzyme list, from subsequently smaller SD fragments of the same digest type such that no fork is picked that contains a member which is already in the list. Only one fork is picked at a time from each SD fork list. The other possibilities are noted and attended to each in their turn. In summary, the following are the lists computed:

AA : "(10, 3, 2); (7.5, 6); (8)"

BB : "(10, 7.5); (8, 2); (6, 3)"

"(8, 7.5, 2); (10); (6, 3)"

The algorithm continues by utilizing the ROM procedure as a judge of consistency. In so doing, the algorithm carefully pairs each AA list (in this case there is only one) with each BB list and passes each pair to ROM. ROM analyzes each pair in turn and sends back a "Yes" or "No" answer as soon as it has determined if a pairing is allowable. If a pairing is correct, these forks are then sent to the PM algorithm to be ordered into a printable map.
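The construction of the enzyme lists can be sketched as follows in Python. This illustration covers only the list-building and pairing steps; the consistency check by ROM is not reproduced. It assumes distinct DD fragment sizes, and the name enzyme_group_lists is hypothetical.

```python
from itertools import product

# Forks per SD fragment, largest SD fragment first (zero error range).
AA_FORKS = [[(10, 3, 2)], [(7.5, 6)], [(8,), (6, 2)]]              # 15, 13.5, 8
BB_FORKS = [[(10, 7.5), (8, 7.5, 2)], [(8, 2), (10,)], [(6, 3)]]   # 17.5, 10, 9


def enzyme_group_lists(fork_lists):
    """Pick one fork per SD fragment so that no DD fragment is used twice."""
    lists = []
    for choice in product(*fork_lists):
        members = [size for fork in choice for size in fork]
        if len(members) == len(set(members)):
            lists.append(choice)
    return lists


aa_lists = enzyme_group_lists(AA_FORKS)
bb_lists = enzyme_group_lists(BB_FORKS)

# Every AA list is paired with every BB list; each pair would then be
# passed to a consistency judge such as ROM.
pairs = list(product(aa_lists, bb_lists))
```

For the example data this yields the one AA list and the two BB lists given above, and therefore two pairs to be judged.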

PROGRAM USE

All procedures for entering data, editing data, computing maps, suppressing unwanted maps, and printing maps along with pertinent data are contained within a UCSD Pascal program called MPCIRC. The researcher works with the program in several stages in the analysis of the data (outlined below). The program allows for a flexible treatment of the data at many points in the mapping process, giving the researcher the chance to impose relevant conditions, edit the data, and review map generation.

Data Entry. When this option is chosen, the program prompts the user for information regarding the map to be generated. First, the error range within which forks are to be generated (see Algorithms) is entered. This error range is crucial to correct analysis of the map: with too high an error range, too many "faulty" maps are generated; with too low an error range, no maps might be produced. Then, the restriction endonucleases used in the digestions are entered. The program then asks the user to enter the restriction data, first for the single digestions and then for the double digestion. The researcher must take care that several conditions are met before mapping. First, the numbers of fragments in the two single digest lanes must add up to the number of double digest fragments. One must therefore be careful to correctly detect small DNA fragments and doublets. In addition, the sums of the weights of the digests must be equal within some reasonable error (2% to 10%). If these initial conditions are not met, incorrect map(s) will result.
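The two preconditions for circular mapping can be checked mechanically before any forks are built. The Python sketch below is an illustration only; the function name check_circular_input and the default 5% tolerance (a value inside the 2% to 10% range quoted above) are assumptions, not part of MPCIRC.

```python
def check_circular_input(sd_a, sd_b, dd, tolerance=0.05):
    """Return a list of problems found in circular-plasmid digest data."""
    problems = []
    # Condition 1: the single-digest fragment counts add up to the DD count.
    if len(sd_a) + len(sd_b) != len(dd):
        problems.append(
            "fragment counts disagree: %d + %d single-digest fragments, "
            "%d double-digest fragments" % (len(sd_a), len(sd_b), len(dd))
        )
    # Condition 2: the three digests give the same total size within tolerance.
    totals = [sum(sd_a), sum(sd_b), sum(dd)]
    if max(totals) - min(totals) > tolerance * max(totals):
        problems.append("digest totals disagree: %s" % totals)
    return problems


ok = check_circular_input([8, 13.5, 15], [9, 10, 17.5], [2, 3, 6, 7.5, 8, 10])
bad = check_circular_input([13.5, 15], [9, 10, 17.5], [2, 3, 6, 7.5, 8, 10])
```

The first call returns an empty list for the example data of Table I; the second, in which the small AA fragment has been overlooked, reports both a count mismatch and a total-size mismatch.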

Edit Data. The researcher may, after entering data or after mapping, decide that the data should be altered. This procedure allows the user to add, delete or change the size of any single or double digest fragment. The user may also change the error range value here.

Enter Special Known Data. This procedure allows the user to input known parameters of the plasmid. If the researcher knows that some DD fragments are, or are not, contained as a part of certain SD fragments, this information can be entered here. If the termini type of some DD fragments is known, this may also be set with this procedure. In the construction of fork groups (see below) this information is used to discard those groups that do not meet these input conditions. In this manner, unwanted maps are suppressed.

The user may also directly input fork groups for any fragments. In this way, the researcher can input the known map portions from the vector, for instance; this permits the program to focus only on the unknown portions of the map. The CONSTRUCT GROUPS procedure will skip those SD fragments whose groups have been input in this fashion.

This procedure may be called even after CONSTRUCT GROUPS has been called, to change the data or remove unwanted groups manually.

Construct Groups. This procedure must be called before Rulmap (ROM) or Permap (PM) are used to map. This central procedure uses all the input data and special data to construct fork groups from the DD fragments within the error range for each SD fragment. It examines and generates all the forks, eliminating those forks as input conditions dictate. After this procedure is called, the researcher may examine the forks for any group, remove them, or add to them as desired (see EDIT DATA and ENTER SPECIAL KNOWN DATA).

Display Data. Here, all data and input conditions may be displayed for any single or double digest fragment. This procedure is also constantly called by the ENTER DATA, EDIT and ENTER SPECIAL KNOWN DATA procedures to update information on the screen as new data is entered or changed. This provides the researcher with a visual assurance of what is being entered.

Rulmap. When groups have been constructed, this procedure may be called to discard those groups deduced to be incorrect (see ROM algorithm). This procedure operates until no more groups may be removed using the seven listed rules. This procedure will update the termini type, times known, and fork group lists automatically. To print out maps, the remaining fork groups must be passed to PERMAP (PM).

Permap. The procedure uses the algorithm as described under the Permutation Method to order and find all the possible restriction maps. PERMAP does not alter the termini type, times known or the fork lists in any manner. This procedure may be used alone on the fork groups or after RULMAP has been called. It orders and stores all maps for subsequent printing.

Print Maps. This procedure is called to print out the generated maps in high resolution graphics, in either linear or circular form, on screen or paper (see Fig. 2).

Example: Using data from the sequence analysis of the plasmid pBR322 (6), the fragment sizes which would be generated upon single and double digestion with the enzymes Rsa I and Bgl I were determined. These fragment sizes were entered into the program and reviewed by DISPLAY DATA for correctness. An error range of 0% for construction of groups was originally entered. Forks were generated by CONSTRUCT GROUPS and the PERMAP procedure was run. It took PERMAP 4.0 seconds to find the unique map order in Fig. 2b.

Figure 2. Circular representation of computed maps. The computer prints the circular map in the standard format; printing starts at 0 degrees on the cosine/sine map and proceeds counterclockwise to 360 degrees. Inside the innermost ring are dots placed at 10 degree intervals; the computer prints out in the upper left hand corner the scale for 10 degrees. a) Circular restriction site map of example plasmid. b) Circular restriction site map of SV40 example.

However, real data has an associated error. To approximate this using the above data, the error range was changed to 3.25% and CONSTRUCT GROUPS was called again to generate fork groups. Whereas a 0% error range gives only one fork for each SD fragment, a 3.25% error gives 5 or more forks for most SD fragments. When PERMAP was called to order and store all maps within this error range, it took 3 minutes to find 5 possible maps. However, when RULMAP was used on the same set of forks (constructed at 3.25%) to remove invalid forks and then PERMAP was called to find the maps, the total time (RULMAP plus PERMAP) to find all 5 maps, and store them, was less than 30 seconds. Through such analysis, it has been found that with certain maps, a significant reduction in the time spent to generate the possible map solutions can be accomplished by using RULMAP and PERMAP in tandem to first remove invalid fork groups and then order them into printable maps.

Table IV: Rules for the Rule-Oriented Method

1) Any fragment from the DD list with a size that is not found in either SD list is presumed to be a molecule cut at one terminus with restriction endonuclease A and at the other terminus with restriction endonuclease B. This DD fragment has termini type AB.

2) Given a single remaining fork for a given SD fragment:
   A) Given one DD fragment in a fork of parental termini type XX (where XX can denote either AA or BB), the DD fragment is concluded to be of type XX as well.
   B) Given two DD fragments in a fork of parental termini type XX, both DD fragments are concluded to be of type AB.
   C) Given greater than two DD fragments, call this number N, in a fork of type XX:
      1. If two DD fragments are known to have AB ends, all other (N-2) DD fragments in the fork are concluded to be of type YY. With more than 3 DD fragments, the order of the interior DD fragments cannot be stated with certainty.
      2. If one DD fragment is of type AB and (N-2) DD fragments are known to be of type YY, then the unknown DD fragment is concluded to be of type AB.
      3. Given 2 unknowns in the fork, and N-2 knowns of type YY, then the 2 unknowns are concluded to be of type AB.
   D) If none of the above conditions apply at this point, the necessary data is not available to draw sufficient conclusions. The DD fragments in the fork, however, are still marked "known" once more in the Times Known column.

3) A DD fragment which is absolutely known to be a member of two SD forks (as indicated in the Times Known list for that DD fragment) cannot be a part of any "other" fork. The "other" fork must be discarded as illegitimate.

4) A DD fragment may be contained within only one SD fork of parental type XX. If any "other" fork with parental type XX contains this DD fragment, that "other" fork must be discarded.

5) The number of X termini of DD fragments within an SD fork of parental type XX is logically 2. If any of the following conditions apply to any fork, that fork must be discarded.
   A) If the termini of some fragments are known, the total number of X termini may not exceed 2.
   B) Given more than 1 fork member, no DD fragment may be of type XX.
   C) Given that the termini types of each fork member are known, there must be 2 termini of type X.

6) Corollary of Rule 4. Given a questionable fork of an SD fragment of type XX: if any of the DD fork members are contained at least once in all the forks of another SD fragment of type XX, then the questionable fork, by inference, violates Rule 4 and must be discarded.

7) Given that:
   1) both enzymes cut the plasmid more than once, and
   2) there is a single unique fork for a given SD fragment,
   then no other fork, call it Z, of any other SD fragment may contain more than one DD fragment of the single remaining fork. If Z does contain such, then Z may be discarded.

Table V: Rules from the Rule-Oriented Method that must be changed to accommodate linear mapping.

2) Given only one fork for an SD fragment:
   A) Given one DD fragment in a fork of parental type Xx or Xz, the DD fragment is determined to be of type Xx or Xz, respectively, as well (z may remain ambiguous until the map is ordered and the "end" pieces determined).
   B) Given two DD fragments in a fork of parental termini type Xx:
      1. If one DD fragment is XY, then the other DD fragment is [illegible].
      2. If one DD fragment is Yz, then the other DD fragment is XY.
      3. If neither DD fragment's termini are known, then:
         a) If the parental fragment is XX, both DD fragments are XY.
         b) If the parental fragment is XE or Xz, assume both fragments to be Yz (since it is impossible to determine which is XY and which is XE).
   C) Given greater than two DD fragments, call this number N, in a fork of type Xx or Xz:
      1. If two DD fragments have Yx and/or Yz termini, then all other (N-2) DD fragments in the fork are concluded to be of type YY.
      2. If one DD fragment is of type Yx or Yz and (N-2) DD fragments are of type YY, then the unknown DD fragment is of type XY or Yz, respectively.
      3. Given 2 unknowns in the fork, (N-2) knowns of type YY, and Xx = XX, then the two unknowns are XY.
      4. Given 2 unknowns in the fork, (N-2) knowns of type YY, and parental type XE or Xz, then the two unknowns are of type Yz.

5) The number of X + E + z termini of DD fragments within an SD fork is logically 2. If any of the following conditions apply to any fork, that fork must be discarded.
   A) If the termini of some fragments are known, the total number of X + E + z termini may not exceed 2.
   C) Given that the termini of each fork member are known, the total of X + E + z termini must be 2.

8) Given one fork in an SD fragment of type Xx (where x is ambiguous), if all the DD members of the fork have XY or YY termini then SD fragment Xx is concluded to be XX. If one DD fragment of the fork is XY, a second DD fragment is XE and all remaining DD fragments are YY, then the SD fragment Xx is concluded to be XE.




Of the 5 maps generated, only one is truly correct (the other 4 "appear" correct due to the error range). By entering more data as it is learned, invalid map solutions may be suppressed until only one answer is left. This example clearly shows the importance of careful measurement of fragment size.

LINEAR MAPPING

Because it is often necessary to map linear pieces of DNA rather than whole circular plasmids, modifications of the three algorithms ROM, PM and EGT are presented to map linear DNA molecules.

Rule-Oriented Method for Linear Mapping. The number of double digest fragments in a linear map must be (number of fragments in single digest A) + (number of fragments in single digest B) - 1. The difficulty in linear mapping is to determine how to name the termini types of the ends of the undigested linear DNA (i.e., in the original DNA to be mapped there are two "ends" with no termini type). For linear mapping, the two "ends" will have termini type "E". Definitions of termini type:

X: Either A or B.

Y: Opposite of X (if X = A then Y = B).

x: Known to be X, or E.

z: (X or E), i.e. unknown.

Because the SD fragment in linear mapping is ambiguous (Xz) in nature, determining the termini type from a single fork is necessarily more difficult. x and z are attempts to provide variables to represent this ambiguity. Table V shows a brief presentation of those rules from ROM which must be changed to accommodate linear mapping.

Permutation Method for Linear Mapping. In a circular map, the starting point is arbitrary for the Permutation Method because the initiation point is the same as the endpoint. In linear mapping, however, there are two endpoints (the ends of the molecule). The objective in linear mapping by the Permutation Method is to align the fragments from the forks, such that all the fragments have been used, to produce a consistent map order. To allow for an arbitrary starting point (since it is not always possible to know which are the end fragments), the Permutation Method for circular mapping must be slightly modified to apply to linear maps.

The forks of the largest SD fragment are chosen one at a time, as in circular PM. The map order is built outwards to the right as in circular PM. However, the difference with linear mapping is that if one reaches a point while building the map rightward at which no further forks can be found (and presumably the map is not finished), then the seed must be reset to the leftmost element (call it L) in the fragment order and the map built leftward (the rightmost element, previously the seed, is now considered "known"). The leftward building must be continued until every allowable fork and fragment order has been searched. When all possible leftward map orders have been exhausted and the fragment order has shrunk back to the point where the leftmost element L was chosen as seed, the algorithm switches back to rightward building. At this point, it should either permute the rightmost fork, as in normal circular PM, or remove the fork, as necessary. In this fashion, the algorithm switches back and forth between rightward and leftward map building, searching all viable fragment orders. Extremely careful note of map parameters must be kept to ensure all orders are checked. By this procedure, the map is built from the inside out (unless one end fragment is permuted into the first position of the initiating fork, in which case the map is built from one end to the other). All map orders consistent with the data will be generated by this method.

Linear Mapping Using Enzyme Group Technique. No changes are necessary to EGT to allow it to deduce enzyme group lists from linear mapping data. EGT for linear mapping would use linear ROM for checking and linear PM for ordering and map printing.

CLOSING REMARKS

We have shown here a system of rules and algorithms for mapping of circular and linear DNA molecules. The described program is based on two of these algorithms and can be used to rapidly map restriction endonuclease sites for circular DNA molecules. This program also incorporates methods for suppressing invalid maps on the basis of previously derived data.

Our work on DNA mapping programs is currently focused on developing algorithms to effectively combine the results obtained from double digests using several restriction endonucleases. Algorithms are also being developed, for use with complex maps, which will advise the user of additional experiments required to resolve map ambiguities and complete the mapping process.

*To whom correspondence concerning details of the program algorithms should be addressed.

Present address: Dept. of Genetics, Stanford Medical Center, Stanford University, Stanford, CA 94305

To whom reprint requests should be addressed

ACKNOWLEDGEMENTS

The authors would like to thank Dr. John Dill for helpful advice during the writing of the program, Dr. Roman Legocki for critical reading of the manuscript and Mrs. Julie Ruocco for her patient typing. This work was supported by the Boyce Thompson endowment and by Grant No. PCM-7820252 from the National Science Foundation to A.A. Szalay.

REFERENCES

1) Stefik, M. (1978) Artificial Intelligence, 11:85-144.

2) Pearson, W.R. (1982) Nucleic Acids Research, 10:217-227.

3) Parker, R.C., Watson, R.M. and Vinograd, J., Proc. Natl. Acad. Sci. U.S.A., 74:851-855.

4) Schroeder, J.L. and Blattner, F.R., Gene, 4:167-174.

5) Dijkstra, E.W. A Discipline of Programming, Chapter 13 "The Problem of the Next Permutation." Prentice-Hall, 1976.

6) Maniatis, T., Fritsch, E.F., Sambrook, J. (1982) Molecular Cloning: A Laboratory Manual, Cold Spring Harbor.

