
Am. J. Hum. Genet. 49:985-994, 1991

Influence of Aberrant Observations on High-Resolution Linkage Analysis Outcomes

Kenneth H. Buetow

Division of Population Science, Fox Chase Cancer Center, Philadelphia

Received May 21, 1991; revision received July 16, 1991.
Address for correspondence and reprints: Kenneth H. Buetow, Ph.D., Fox Chase Cancer Center, 7701 Burholme Avenue, Philadelphia, PA 19111.
© 1991 by The American Society of Human Genetics. All rights reserved. 0002-9297/91/4905-0010$02.00

Summary

Because of the availability of efficient, user-friendly computer analysis programs, the construction of multilocus human genetic maps has become commonplace. At the level of resolution at which most of these maps have been developed, the methods have proved to be robust. This may not be true in the construction of high-resolution linkage maps (3-cM interlocus resolution or less). High-resolution meiotic maps, by definition, have a low probability of recombination occurring in an interval. As such, even low frequencies of errors in typing (1.5% or less) may influence mapping outcomes. To investigate the influence of aberrant observations on high-resolution maps, a Monte Carlo simulation analysis of multipoint linkage data was performed. Introduction of error was observed to reduce power to discriminate orders, dramatically inflate map length, and provide significant support for incorrect over correct orders. These results appear to be due to the misclassification of nonrecombinant gametes as multiple recombinants. χ²-like goodness-of-fit analysis appears to be quite sensitive to the appearance of misclassified gametes, providing a simple test for aberrant data sets. Multiple pairwise likelihood analysis appears to be less sensitive than does multipoint analysis and may serve as a check for map validity.

Introduction

Meiotic gene maps have proved to be valuable tools in the field of human genetic research. These maps provide significant insights into human disease, genetic diversity, and chromosome/gene structure. The meiotic maps play a crucial role in integrating the information obtained from physical mapping techniques with the study of genetic disorders of unknown molecular etiology. Studies of chronic granulomatosis disease (Royer-Pokora et al. 1986), muscular dystrophy (Monaco et al. 1985), retinoblastoma (Friend et al. 1986), and cystic fibrosis (Kerem et al. 1989) have demonstrated the joint importance of meiotic and physical maps of human chromosomes in uncovering genetic lesions.

The efforts associated with the search for these disease genes have also demonstrated the important role of fine-structure meiotic gene maps (interlocus map intervals of 3 map units or less) in physical map construction. Loci ordered by meiotic methods can serve as important punctuation points in orienting noncontiguous large-scale clones or fragments. This is well illustrated by the current state of the chromosome 4p16.3 mapping efforts associated with the search for the Huntington disease locus (MacDonald et al. 1989). Physical locations for subsets of loci in the region have been obtained by application of a variety of physical mapping techniques. These subsets have then been oriented with respect to each other by meiotic mapping methods. In addition to serving as anchor points in these efforts, meiotic maps permit verification of physical mapping results.

For meiotic maps to be optimally utilized, it is imperative that they be of both high resolution and high integrity. High-resolution mapping is required if meiotic information is to be meaningfully incorporated with physical mapping results. Accurate construction of such high-resolution meiotic maps requires absolute data integrity. Disturbances in data integrity can result from sample mix-ups, incorrect interpretation of genotypes, and data-entry errors. As the number of genotypes necessary to generate high-resolution maps increases, so does the opportunity for these primary errors. Minimizing errors will also allow aberrancies related to genuine biological phenomena such as gene conversions or chromosomal microinversion polymorphisms to be detected more reliably. Data errors may have a dramatic effect on meiotic maps constructed at this level of resolution. A data-typing error for a biallelic marker informative for linkage will often be undetectable in an offspring. Such an error will likely appear as a double recombinant if flanking markers are also informative. Within a 1-cM-resolution map, this double recombinant is approximately 10,000 times less likely than no recombination events. Thus, substantial statistical weight will be placed against the correct locus order within the family. Perhaps more insidious is a typing error that negates a true recombination event. While the former can often be identified within a collection of true multiple-recombination events, the latter event would traditionally go undetected.
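The 10,000-fold figure quoted above follows from simple arithmetic. The sketch below is illustrative only: it assumes that the two intervals flanking the mistyped marker recombine independently (no interference) and that each has a recombination fraction of .01; the function name is hypothetical.

```python
import math

def double_recombinant_odds(theta: float) -> float:
    """Odds of no recombination versus a double recombinant across two
    adjacent intervals, each with recombination fraction theta.
    Assumption: the two intervals recombine independently."""
    p_none = (1.0 - theta) ** 2   # nonrecombinant in both intervals
    p_double = theta ** 2         # recombinant in both intervals
    return p_none / p_double

odds = double_recombinant_odds(0.01)
print(f"odds against the double recombinant: {odds:,.0f} to 1")
print(f"log10 odds: {math.log10(odds):.2f}")
```

For a 1-cM map the ratio is about 9,800 to 1, which is the order of magnitude cited in the text.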

Pragmatically, little is known about the statistical properties that tests currently in use have when applied to high-resolution multipoint gene mapping. In addition to the question of robustness implied above, the powers to differentiate locus order and accurately estimate map distance at this level are largely unknown. For example, there is no consensus on the appropriate statistical criteria for objective evaluation of high-resolution maps (Morton and Andrews 1989).

In the present paper, a systematic evaluation of the current fine-structure meiotic mapping methods is undertaken. To perform this evaluation, data sets similar to the extended Centre d'Etude du Polymorphisme Humain (CEPH) pedigree reference panel have been simulated by Monte Carlo methods. In the first evaluations of the simulated data sets, the empirical performance and resolution limits of meiotic mapping methods are examined. A concurrent outcome of the above evaluations is a systematic examination of the appropriate significance criteria for accurate fine-structure mapping. It is unlikely that even the most rigorous laboratory quality-control procedures will be successful in eliminating all erroneous observations from a data set. It is therefore important to understand the influence of these observations on final outcomes. To address these questions, simulated data sets incorporating small quantities of aberrant data were generated. Mapping analysis has been evaluated with regard to the effect of increasing numbers of aberrant observations.

Material and Methods

Given the complexity of high-resolution meiotic map data, it is not practical to evaluate explicitly many of the statistical questions addressed. Thus, empirical approaches based on computer simulation methods have been employed. These empiric data sets have been generated utilizing a common algorithm. The family structure used in the simulations mimicked the 61 extended CEPH reference pedigrees (Dausset et al. 1990). These data contained 974 possible meiotic events. Within these 61 CEPH pedigrees, five different levels of resolution maps of five loci were considered. In each map, at each level of resolution, the four interlocus recombination frequencies were assumed to be equal and took a value of .01, .015, .02, .025, or .03. The multilocus gametes were generated utilizing recombinant-gamete-probability distributions constructed assuming no interference. The distributions for the five different map resolutions utilized in the present study are given in figure 1. The gamete distributions were utilized in the following manner.

For simplicity, a modified backcross mating for a three-allele system was assumed for each pedigree. For convenience, the father in each pedigree was initially assumed to be heterozygous at each of the loci in the multipoint map. Parents within the same pedigree did not share alleles at the same locus. For example, within each pedigree, the father received alleles 1/2, and the mother received alleles 3/3. To simplify inspection, the paternal gametes were assumed to be in coupling. Therefore, a five-locus, paternal genotype would be 11111/22222. Grandparents, when present in the family structure, were coded as homozygotes for the maternal gametes. For the paternal gametes, the father's mother was coded as 22222/22222, and the paternal father was coded as 11111/22222. Given this structure, the parental origin of each locus can be determined, and phase can be determined unequivocally. Each child was assigned one maternal allele 3 gamete. The father's gamete was assigned according to the probabilities in figure 1. Alternative phases of each gamete were provided with equal probability. Each parent then transmits recombinant or nonrecombinant gametes according to the empiric probability distributions. Thus, it is not necessary to make specific assumptions about the recombination process. Data sets generated by this algorithm generate 487 meioses informative for each locus in the simulated map. Each model considered was evaluated for 1,000 simulation replicates. Analysis of the simulated data sets was performed utilizing the CRIMAP (version 2.3) linkage-analysis program (P. Green, personal communication).

Gamete type                       Map resolution
                                  1%      1.5%    2%      2.5%    3%
Parental                          .9594   .9388   .9176   .8964   .8746
Single recombinant (each of 4)    .0100   .0150   .0200   .0250   .0300
Double recombinant (each of 6)    .0001   .0002   .0004   .0006   .0009

Figure 1 Gamete distributions used for Monte Carlo simulation studies. The single-recombination event frequency is used to describe the map resolution. Double-recombination frequencies are products of the single-recombination probabilities. Parental frequencies were obtained by subtraction.
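The construction of the figure 1 distributions, and the drawing of paternal gametes from them, can be sketched as follows. This is an illustrative reconstruction, not the original simulation program: the function names are hypothetical, rounding the double-recombinant frequency to four decimals is an assumption inferred from the tabulated values, and gametes are represented simply as tuples of recombinant interval indices.

```python
import random
from itertools import combinations

N_INTERVALS = 4  # five loci give four intervals

def gamete_distribution(theta: float) -> dict:
    """Map each recombination pattern (tuple of recombinant interval
    indices) to its probability, following the figure 1 description:
    singles = theta, doubles = product of two singles, parental = remainder.
    Assumption: doubles are rounded to four decimals, as tabulated.
    Triple and higher recombinants are ignored, as in figure 1."""
    double = round(theta ** 2, 4)
    dist = {(i,): theta for i in range(N_INTERVALS)}
    dist.update({pair: double for pair in combinations(range(N_INTERVALS), 2)})
    dist[()] = 1.0 - sum(dist.values())  # parental class by subtraction
    return dist

def sample_gametes(theta: float, n_meioses: int, rng: random.Random) -> list:
    """Draw n_meioses recombination patterns from the distribution."""
    dist = gamete_distribution(theta)
    patterns = list(dist)
    weights = [dist[p] for p in patterns]
    return rng.choices(patterns, weights=weights, k=n_meioses)

for theta in (0.01, 0.015, 0.02, 0.025, 0.03):
    print(theta, round(gamete_distribution(theta)[()], 4))

replicate = sample_gametes(0.01, 487, random.Random(1))
print(sum(1 for g in replicate if g), "recombinant gametes in one replicate")
```

Under these assumptions the parental frequencies printed by the loop agree with the first row of figure 1.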

In the first set of evaluations, these fully informative data sets were used to determine the empirical performance and resolution limits of meiotic mapping methods. First, given that the total number of meioses within the reference panel is fixed, maps of five different levels of resolution were considered, to determine the resolving power of the panel.

The effect of aberrant observations was evaluated utilizing similar empiric data sets. Aberrant observations were generated by changing offspring typings at individual loci, with probability equal to the assumed error rate. Only alterations not generating apparent nonpaternity were considered. Aberrant observation rates of 0.5%, 1.0%, and 1.5% of the total typing were considered. These rates were chosen because they are similar to the 0.6% error rate observed in the multipoint data set used to construct the CEPH consortium chromosome 1 linkage map (Dracopoli et al. 1991). A schematic of the error-introduction algorithm is presented in figure 2.

The consequences of sorting orders by likelihood magnitude and of utilizing relative odds difference as a significance criterion were evaluated. For the maps derived above, the distributions of log-likelihood differences for "best" versus "next best" and for "best" versus "known" were examined. From these differences, empiric type I and II error levels were determined. The robustness of the mapping statistics under various map conditions and under various aberrant-data levels was also examined.
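The error-introduction step described above can be illustrated with a short sketch. It is not the original program: the function names are hypothetical, and it assumes that the only alteration compatible with paternity is a switch between the two paternal alleles (1 and 2) at a locus.

```python
import random

def introduce_typing_errors(gamete, error_rate, rng):
    """Return a copy of a paternal gamete (list of alleles 1 or 2) in which
    each locus is independently switched to the other paternal allele with
    probability error_rate. Assumption: swapping 1 <-> 2 keeps the child
    compatible with the father, so no apparent nonpaternity is created."""
    return [3 - a if rng.random() < error_rate else a for a in gamete]

def recombinant_intervals(gamete):
    """Indices of intervals whose flanking loci carry different alleles."""
    return [i for i in range(len(gamete) - 1) if gamete[i] != gamete[i + 1]]

# A single mistyping at an internal locus looks like a double recombinant;
# at an external locus it looks like a single recombinant.
print(recombinant_intervals([1, 1, 2, 1, 1]))  # [1, 2]
print(recombinant_intervals([2, 1, 1, 1, 1]))  # [0]

rng = random.Random(1991)
parental = [1, 1, 1, 1, 1]
noisy = [introduce_typing_errors(parental, 0.01, rng) for _ in range(487)]
n_apparent = sum(1 for g in noisy if recombinant_intervals(g))
print(n_apparent, "of 487 nonrecombinant gametes now appear recombinant")
```

The two printed patterns show why misclassified nonrecombinant gametes accumulate in the double-recombinant classes and in the single-recombinant classes involving the outermost loci.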

Results

The above algorithm was first tested for accuracy of data generation of the five alternative map resolutions. A total of 100 independent simulated distributions for each map resolution were tested for deviation from the input distributions, by goodness-of-fit χ². After adjustment for multiple comparisons, none of the simulated recombinant-gamete distributions differed from the simulation input values (data not shown).

Examination of 1,000 replicates for each map resolution demonstrated that the 487 meiotic panel had substantial power to correctly resolve order. Not unexpectedly, the correct order was obtained, in each replicate, for the .015, .020, .025, and .030 maps. More impressively, the average log10 differences between the highest likelihood order and the next most likely order were 17.2, 14.9, 12.3, and 9.8 for the .03, .025, .020, and .015 maps, respectively. In one run at the highest map resolution (.010), however, a simple inversion from the input order was observed to best fit the data, with log10 odds of .088.

Figure 2 Ideogram showing simulation algorithm. Circles indicate allele 1; black dots indicate allele 2; and stippled dots indicate allele 3. The x indicates a typing changed to an alternative form.

A number of runs did not exceed the conventional odds of 1,000 to 1 (lod 3) necessary to definitively choose one order over another. Utilizing the results of these simulations made it possible to obtain the distribution of likelihood differences between best and second-best orders. From these distributions it is possible to evaluate the statistical power to discriminate order at the various levels of map resolution. At the lowest level of resolution (3 cM) all replicates in the simulation exceeded the lod 3 criteria, indicating a minimum power of 99.9%. The same is not true for the higher-resolution maps. At the .010 level of resolution 9.7% of the orders (97/1,000) did not meet the lod 3 criteria. Likewise, at the .015, .020, and .025 levels of resolution 1.1%, 0.4%, and 0.2% of the orders did not meet the lod 3 criteria, respectively.

The influence of the aberrant observations on map outcomes was evaluated next. As expected, the introduction of aberrant typings resulted in a dramatic inflation of map distance. Absolute interlocus distance inflation was observed to occur proportionally to the error rate. To test the significance of this relationship, a simple linear-regression model was fit to the absolute inflation in recombination/error-percentage ratios (fig. 3). A correlation coefficient of .99 (P < .0001) was observed between the inflation and error rate. Ninety-nine percent of the variance of the ratio was explained by this model. This model indicates that interval specific map inflation will occur at 2.0 times the error rate. The net effect of such a relationship is that higher-resolution maps will be differentially inflated by constant error rates. An example of this inflation is presented in figure 4.

Figure 3 Plot of percent of interlocus map inflation as function of percent typing error. Results for each of the four levels of map resolution are shown at each typing-error frequency.

                  Total map length (cM)
Original map      4.0
0.5% error        8.0
1.0% error        11.8
1.5% error        15.9

Figure 4 Map inflation due to typing errors. The interlocus and total map inflation are shown for the 1-cM resolution map at three different levels of typing error.

The introduction of aberrant typing also had significant effects on the ability to recover the correct order from the family data. More specifically, the proportion of orders discriminated by lod 3 criteria was reduced, and the frequency of incorrect most-likely orders increased. Among the latter, a significant proportion of incorrect orders were supported by lod 3 criteria. The results of these analyses are summarized in table 1. Examination of table 1 shows that the introduction of a small fraction of aberrant observations greatly reduces the power to discriminate order at these levels of map resolution. For example, the 1-cM-resolution map has a 49% reduction in power when errors are introduced with a frequency of 1.5% per locus. This power reduction appears, though, to be dependent on map resolution. The lower-resolution 3-cM map showed only a 3.6% reduction in number of orders selected.
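The consequence of the fitted inflation relationship reported above (interval-specific inflation of 2.0 times the error rate) can be made concrete with a small calculation. The sketch below is only an illustration of that regression result; the function name is hypothetical, and it assumes that, at these small distances, percent recombination and centimorgans can be used interchangeably.

```python
def inflated_interval_cm(true_cm: float, error_pct: float, slope: float = 2.0) -> float:
    """Apparent interlocus distance (cM) when typing errors occur at
    error_pct percent per locus, using the fitted slope of 2.0.
    Assumption: percent recombination is treated as equal to cM."""
    return true_cm + slope * error_pct

for true_cm in (1.0, 3.0):
    for err in (0.5, 1.0, 1.5):
        apparent = inflated_interval_cm(true_cm, err)
        relative = 100 * (apparent - true_cm) / true_cm
        print(f"{true_cm} cM map, {err}% error: "
              f"{apparent:.1f} cM per interval ({relative:.0f}% inflation)")
```

The same absolute inflation is proportionally much larger in the 1-cM map than in the 3-cM map, which is the sense in which higher-resolution maps are differentially inflated.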

In addition to reducing power, the introduction of aberrant typings also influenced the accuracy of selected orders. In the 1-cM-resolution maps, 5.5%, 10.6%, and 15.9% of the lod 3 selected orders differed from the input order at error rates of 0.5%, 1.0%, and 1.5%, respectively. As above, this effect was most evident in the higher-resolution maps. Less than 1% of the 2-cM-resolution-map orders were incorrect at even the highest error rate used in the study. The relative frequencies of incorrect orders among all lod 3 selected orders are summarized in figure 5. Figure 5 shows that the relative frequency of incorrect orders increases linearly with error rate and that the influence of errors is dependent on map resolution.
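The percentages quoted in the two preceding paragraphs can be recomputed directly from the counts reported in table 1. The sketch below uses only the 1-cM and 3-cM rows; the variable and function names are hypothetical.

```python
# Orders (out of 1,000 replicates) selected by lod 3 criteria, from table 1,
# keyed by map resolution (cM) and then by typing-error rate (%).
SELECTED = {
    1.0: {0.0: 903, 0.5: 723, 1.0: 557, 1.5: 465},
    3.0: {0.0: 1000, 0.5: 993, 1.0: 986, 1.5: 964},
}
# Lod 3 selected orders that differed from the input order, 1-cM map only.
INCORRECT_1CM = {0.5: 40, 1.0: 59, 1.5: 74}

def power_reduction(resolution_cm: float, error_pct: float) -> float:
    """Percent drop in lod 3 selected orders relative to the error-free runs."""
    baseline = SELECTED[resolution_cm][0.0]
    return 100 * (baseline - SELECTED[resolution_cm][error_pct]) / baseline

def percent_incorrect_1cm(error_pct: float) -> float:
    """Incorrect orders as a percentage of all lod 3 selected orders."""
    return 100 * INCORRECT_1CM[error_pct] / SELECTED[1.0][error_pct]

print(f"1-cM map, 1.5% error: {power_reduction(1.0, 1.5):.1f}% reduction")
print(f"3-cM map, 1.5% error: {power_reduction(3.0, 1.5):.1f}% reduction")
for err in (0.5, 1.0, 1.5):
    print(f"{err}% error: {percent_incorrect_1cm(err):.1f}% incorrect")
```

The results agree, to rounding, with the approximately 49% and 3.6% power reductions and with the 5.5%, 10.6%, and 15.9% incorrect-order frequencies given in the text.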

It is of interest to know what types of incorrect orders are observed. Thus it is necessary to define an error index that measures the deviation of the observed order from the expected order. Deviation was measured by assigning each locus its rank in the correct order. The error index was measured by taking the sum of the number of loci with a lower rank to the right of each locus in the derived order. By this method a simple inversion between adjacent loci would have an error index of 1. Presented in figure 6 is a plot of the number of orders by lod difference and error index for the 1-cM-resolution map with 1% typing error included. It is clear that the majority of incorrect orders differ from the simulation-input order by a simple inversion. However, higher error-index orders are observed, occasionally with lod 3 support.
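The error index defined above is the number of pairwise inversions between the derived order and the correct order. A minimal sketch follows; the function name and locus labels are hypothetical, and the sketch does not treat a completely reversed order as equivalent to the input order, since the text does not say how reversals were handled.

```python
def error_index(derived_order, correct_order):
    """For each locus in the derived order, count the loci to its right
    that have a lower rank in the correct order, and sum the counts."""
    rank = {locus: i for i, locus in enumerate(correct_order)}
    ranks = [rank[locus] for locus in derived_order]
    return sum(
        1
        for i in range(len(ranks))
        for j in range(i + 1, len(ranks))
        if ranks[j] < ranks[i]
    )

correct = ["A", "B", "C", "D", "E"]
print(error_index(["A", "B", "C", "D", "E"], correct))  # 0
print(error_index(["A", "C", "B", "D", "E"], correct))  # 1
print(error_index(["A", "D", "C", "B", "E"], correct))  # 3
```

A simple inversion between adjacent loci scores 1, as stated in the text, while inverting a three-locus segment scores 3.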

Table 1

Influence of Introduction of Typing Errors on Multipoint Mapping Outcomes

No. (%) of orders at error rate of 0%, .5%, 1.0%, and 1.5%

Map          0%         .5%                    1.0%                   1.5%
resolution   Selected   Selected   Incorrect   Selected   Incorrect   Selected   Incorrect
(cM)         by lod 3   by lod 3   by lod 3    by lod 3   by lod 3    by lod 3   by lod 3
             criteria   criteria               criteria               criteria
1.0          903        723        40 (5.5)    557        59 (10.6)   465        74 (15.9)
1.5          989        878         8 (.9)     780        18 (2.3)    684        27 (3.9)
2.0          996        959         2 (.2)     907         2 (.2)     838         3 (.4)
2.5          998        982        ...         965         2 (.2)     938         2 (.2)
3.0          1,000      993        ...         986         1 (.1)     964        ...

Figure 5 Influence of introduction of error on percentage of incorrect orders selected by lod 3 criteria (log10 difference of 3 between best and second-best order). The percentage is determined by taking the ratio of incorrect orders to all orders selected by lod 3 criteria. Map resolutions are 1.0 cM, 1.5 cM, and 2.0 cM.

Given the simulation algorithm, it is possible to examine the gamete distribution for any data set at various points in the run. Inspection of these distributions provides insight into the likely basis of the observed map inflation and into the incorrect orders derived. Shown in figure 7 is the distribution of gametes for an incorrect order with and without error introduction. First, it is evident that there is both a deficiency in parental (nonrecombinant) types and an excess of double-recombinant types. More important, these double-recombinant types are instances where a single locus is flanked by recombination events. A similar pattern is observed by the most external loci in the derived order. These excess types represent instances where, because of the introduction of typing errors, true nonrecombinant types appear as recombinants (single or double).

These deviant distributions suggest that it may be possible to detect the presence of typing errors in data sets by goodness-of-fit statistics. To test the practicality of this, I performed χ²-like goodness-of-fit tests of the observed gamete distributions for the 1.5-cM-resolution-map simulation runs. Expected distributions were obtained by assuming the estimated interlocus distances and by deriving the recombinant classes as products of these values. The χ² distribution for the 1,000 simulation runs of the 1.5-cM-resolution map when no aberrant observations are introduced is presented in figure 8a. This distribution differs from the expected χ² distribution with 4 df. More specifically, 17.9% of the values were observed to have χ² values of greater than or equal to 9.49. Next the χ² distribution for the 1,000 simulated 1.5-cM-resolution maps with 1% error frequency was examined (fig. 8b). This distribution is shifted to the right, with a mode value of 151.3 observed. These results indicate that χ² tests of gamete distribution are very sensitive to the introduction of relatively small amounts of typing error. The χ² values for the incorrect orders are also shown in figure 8b. Inspection of figure 8b shows that, while overall the introduction of typing errors may be detectable, incorrect orders do not show differentially high χ² values.
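The χ²-like comparison of observed and expected gamete classes described above can be sketched as follows. This is a hedged illustration rather than the procedure actually used: the function names and example counts are hypothetical, expected class probabilities are full products over independent intervals, and the statistic is summed over all 16 recombination patterns because the text does not state how classes were pooled to give 4 df.

```python
from itertools import product

def estimate_thetas(counts, n_intervals=4):
    """Recombination fraction per interval, estimated from pattern counts.
    A pattern is a tuple of 0/1 flags, one per interval."""
    n = sum(counts.values())
    return [sum(c for pat, c in counts.items() if pat[i]) / n
            for i in range(n_intervals)]

def expected_class_probs(thetas):
    """Probability of every recombination pattern, assuming the intervals
    recombine independently (no interference)."""
    probs = {}
    for pattern in product((0, 1), repeat=len(thetas)):
        p = 1.0
        for rec, theta in zip(pattern, thetas):
            p *= theta if rec else 1.0 - theta
        probs[pattern] = p
    return probs

def chi_square_like(counts, thetas):
    """Sum of (observed - expected)^2 / expected over all patterns."""
    n = sum(counts.values())
    stat = 0.0
    for pattern, p in expected_class_probs(thetas).items():
        expected = n * p
        if expected == 0.0:
            continue  # class impossible under the estimated map
        stat += (counts.get(pattern, 0) - expected) ** 2 / expected
    return stat

# Hypothetical counts: an error-free sample, and the same sample after six
# parental gametes are misread as double recombinants flanking one locus.
clean = {(0, 0, 0, 0): 467, (1, 0, 0, 0): 5, (0, 1, 0, 0): 5,
         (0, 0, 1, 0): 5, (0, 0, 0, 1): 5}
noisy = dict(clean)
noisy[(0, 0, 0, 0)] -= 6
noisy[(1, 1, 0, 0)] = 3
noisy[(0, 1, 1, 0)] = 3

print(chi_square_like(clean, estimate_thetas(clean)))
print(chi_square_like(noisy, estimate_thetas(noisy)))
```

Even a handful of apparent double recombinants moves the statistic sharply upward, because their expected counts under the estimated map are a small fraction of one gamete.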

Figure 6 Analysis of order outcomes for 1-cM-resolution map with 1% typing errors. The numbers of inversions measures how the selected order differed from the simulation input order (see text). The log10 difference is a measure of support for the order selected in the analysis (the conventional criteria for significance is log10 differences of 3, or lod 3). The number in each box indicates the total number of observations in that cell. Axes: log10 difference in likelihoods between top 2 orders (0-1, 1-2, 2-3, 3-4, >4) and number of inversions in best order when compared to given order.

Gamete type              Relative     Expected       Observed distribution   Observed distribution
                         frequency    distribution   (without error)         (with 1% error)
Parental                 .9594        479            483                     455
Single recombinant 1     .0100        5              5                       7
Single recombinant 2     .0100        5              3                       3
Single recombinant 3     .0100        5              2                       2
Single recombinant 4     .0100        5              5                       9
Double recombinant 1     .0001        0              0                       8
Double recombinant 2     .0001        0              0                       0
Double recombinant 3     .0001        0              0                       0
Double recombinant 4     .0001        0              0                       6
Double recombinant 5     .0001        0              0                       0
Double recombinant 6     .0001        0              0                       8

Figure 7 Gamete distribution from 1-cM map with 1% typing error incorporated. The relative-frequency column is the simulation-input distribution. The expected distribution is the number of each gamete type expected in a simulation run. The observed distribution without error is the Monte Carlo sample of gametes. The observed distribution with error is the former distribution after typing errors have been incorporated.

Multipoint methods derive a large portion of their order information from multilocus, multiple-recombinant gametes (Lathrop et al. 1984, 1985). This is not true for multiple pairwise methods, where information on multiple-recombination events occurring within a single gamete is lost. Thus, it is possible that multiple pairwise methods could be less sensitive to the aberrant gamete distributions than are multipoint methods. To evaluate this, the simulated data sets for the 1-cM-resolution map were reevaluated using the multiple pairwise analysis program MAP90 (Morton and Collins 1990). In this analysis, pairwise recombination estimates were obtained using the CRIMAP TWOPOINT routine. These two-point estimates were then used as input to MAP90, where all possible five locus orders were evaluated. The best lod 3-supported order from the CRIMAP multipoint analysis was then compared with the best order (regardless of support) from the MAP90 multiple pairwise analysis. The results of this analysis are summarized in figure 9. As speculated, among the lod 3-supported best orders selected by multipoint analysis, the multiple pairwise methods consistently demonstrated a lower frequency (and absolute number) of incorrect orders. For example, when 0.5% error was introduced, 6% of the orders were incorrect by multipoint methods, and 2% were incorrect by multiple pairwise methods. Perhaps more important, there were no instances where the multiple pairwise method gave an incorrect order when the multipoint methods gave a correct order.
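The loss of within-gamete information in pairwise scoring, referred to above, can be illustrated with a toy calculation. The sketch below is not the CRIMAP TWOPOINT or MAP90 procedure; it is a hypothetical helper that simply scores each pair of loci separately in fully informative, phase-known gametes.

```python
from itertools import combinations

def pairwise_recombination_fractions(gametes):
    """For every pair of loci, the fraction of gametes whose alleles at the
    two loci differ. Each pair is scored on its own, so whether two
    recombinations fell within the same gamete is not recorded."""
    n_loci = len(gametes[0])
    n = len(gametes)
    return {
        (i, j): sum(g[i] != g[j] for g in gametes) / n
        for i, j in combinations(range(n_loci), 2)
    }

# Hypothetical sample: 98 parental gametes, one true single recombinant,
# and one parental gamete mistyped at the third locus.
gametes = [[1, 1, 1, 1, 1]] * 98 + [[1, 1, 2, 2, 2]] + [[1, 1, 2, 1, 1]]
fractions = pairwise_recombination_fractions(gametes)
print(fractions[(1, 2)], fractions[(2, 3)], fractions[(1, 3)])
```

In a multipoint likelihood the mistyped gamete enters as a single, highly improbable double recombinant, whereas here it adds one recombinant to each of the two pairs involving the mistyped locus and none to the pair of loci flanking it.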

Discussion

The construction of multilocus human genetic maps has become commonplace. This is due in part to the ubiquity of user-friendly computer analysis programs. These programs generate maps by using well-established statistical methods implemented through efficient computational algorithms. Increasingly, these programs allow users to input raw typing data and to recover a genetic map at the completion of analysis. At the level of resolution at which most of these maps have been developed, the methods have proved to be robust. More recently, though, efforts have been focused on the construction of high-resolution linkage maps. In these maps, the target interlocus map distances are 3 cM or less. While theoretically these methods should have similar analytic characteristics, practical differences in both the nature of the data and the statistical problem may result in unexpected outcomes.

High-resolution multilocus maps differ from the majority of maps constructed to date, in two key respects: the absolute number of loci in any region and the interlocus distance between them. By definition, the total number of actual recombination events occurring in any given interval of a high-resolution map is small. As it is necessary to observe at least one recombination event to determine the relative order of a collection of loci, increasingly higher-resolution maps require ever larger data sets, to guarantee that a sufficient number of events will be observed. The simulations conducted here indicate that the extended CEPH panel should have sufficient power to derive order for a small high-resolution map composed of highly informative markers.

Figure 8 χ²-like goodness-of-fit values for 1.5-cM map without error (a) and with 1% typing error introduced (b). The black bars represent all observed values; the hatched bars indicate the value runs when the best order was not the simulation-input order.

More pragmatically, the large number of loci also increases the absolute number of opportunities to generate laboratory errors from multiple sources. Mix-ups can occur when samples are initially acquired and processed, when restriction digests or PCR reactions are set up, or when samples are loaded onto gels. Genotypes can be incorrectly interpreted because of partial digests, sample contamination, or exchanges of autoradiograms or may be entered into a data base incorrectly. Since families are usually analyzed as units, and since many errors result from exchanges of adjacent samples, errors from sibling mix-ups will not be detected as parental incompatibilities. Although many such errors can be minimized through the use of automation in sample processing, data interpretation, and data entry, such technology is not yet universally available, because of its cost. In the data base used to construct the CEPH consortium map of chromosome 1, errors occurred at a rate of 0.6% for genotypes at the same locus, even after extensive data checking by individual laboratories. In version 4 of the CEPH data for chromosome 4, errors are detected at a rate of 1.4% for genotypes at the same locus but from different laboratories (23/817 discordant pairs).

Figure 9 Plot of relative frequency of incorrect-order outcomes for 1-cM map when evaluated by multilocus (i.e., CRIMAP) and multiple pairwise (i.e., MAP90) likelihood methods. Shown are the percentage of incorrect orders selected as best by lod-3 criteria by CRIMAP and MAP90, in the same runs.

In high-resolution maps constructed using highly polymorphic markers, the majority of these errors will appear as recombinants. This is because the vast majority of true gametes are nonrecombinant and because a typing introduced out of phase with the rest of the gamete will be misclassified as recombinant (see fig. 7). More important, even a low frequency of these errors (1% or less) is large relative to the frequency of true recombination events. If it is not possible to distinguish true recombination events from the error-induced events, a large fraction of misclassified recombinant gametes with statistical weight equal to that of true recombinant gametes will be introduced into an analysis. Therefore, as observed in table 1, substantial power to resolve order is lost when aberrant observations are introduced.

A more dramatic consequence of the introduction of error is map expansion. Ott (1977) has suggested that the introduction of typing error would inflate recombination estimates. It was shown that, in pairwise analysis, introduction of error at one locus at rate p resulted in an inflation of the true recombination estimate (θ) by p(1 - 2θ). More recently, multipoint map inflation as a consequence of typing errors has been examined by Chakravarti and Lasher (submitted). Similar to Ott, they observe that, with misclassification frequency p for any given locus, recombination is increased by 2pq(1 - 2θ). They show that under the assumption of no interference a map of k intervals is inflated by -(k/2) ln(1 - 4pq). When the error frequency is small, this quantity is approximately 2p per interval, identical to the 2.0-per-interval inflation estimate observed in the simulations performed above. A number of investigators have proposed methods to adjust statistically for the inflation that is due to misclassification. These methods include postanalytic adjustment of the final map (Morton and Collins 1990), introduction of an error parameter in multiple pairwise analysis (Shields et al., in press), and modeling loci with reduced penetrance in multipoint analysis (Ott 1990).

Perhaps the most profound consequence of error introduction is the observation that incorrect orders will be selected at a nontrivial rate (see table 1). Examination of figure 7 suggests that these incorrect orders result from the misclassification of nonrecombinant gametes as multiple recombinant gametes. These misclassified gametes are statistically very unlikely and provide substantial evidence against a correct order. Unlike the map-inflation problem, there is, as yet, no simple procedure for accounting for aberrant observations in order analysis. Examination of figure 6 suggests that simple solutions, such as increasing the level of support necessary to select a best order, are unlikely to reduce the proportion of incorrect orders selected. For example, increasing support to lod 4 reduces the number of incorrect orders by 1% but reduces the total number of orders selected by 23%. Moreover, it is clear that, while the lod 3 support criterion does not have a firm rationalization in statistical theory, reduction of the criterion to lod 2 (Morton and Collins 1990) would result in an even higher proportion of incorrect orders being selected.

There is reason, however, for cautious optimism regarding this problem. First, given that the expectation of the misclassified gametes is very small, empiric χ²-like goodness-of-fit tests appear to be quite sensitive to the presence of aberrant typings in a data set (see fig. 8). Second, the results obtained using multiple pairwise methods suggest that a significant number of incorrect orders may be identified by comparing the outcome of this analysis with that obtained by multipoint techniques. While multiple pairwise methods do not have sufficient power to be used independently, contradictions in results obtained from the different methods may flag data sets that require additional evaluation. Unfortunately, concurrence of results is not a guarantee of correct outcome.
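The inflation expressions quoted in this discussion can be checked numerically. The following is a minimal sketch rather than any published implementation: the function names are hypothetical, q is assumed to equal 1 - p (the text does not define q), and map lengths are taken to be in Morgans under the no-interference assumption mentioned above. It also illustrates why a 1% error rate matters in a 1-cM interval: under the one-locus expression, roughly half of the apparent recombinants would be error-induced.

```python
import math


def apparent_theta_one_locus(theta, p):
    """Apparent recombination fraction with error at one locus at rate p:
    theta + p(1 - 2*theta), the pairwise expression attributed to Ott (1977)."""
    return theta + p * (1.0 - 2.0 * theta)


def apparent_theta_multipoint(theta, p):
    """Apparent recombination fraction with misclassification frequency p at
    any given locus: theta + 2pq(1 - 2*theta). ASSUMPTION: q = 1 - p."""
    q = 1.0 - p
    return theta + 2.0 * p * q * (1.0 - 2.0 * theta)


def map_inflation(k, p):
    """Inflation of a k-interval map, -(k/2) ln(1 - 4pq), in Morgans.
    ASSUMPTION: q = 1 - p; valid for p < 0.5."""
    q = 1.0 - p
    return -(k / 2.0) * math.log(1.0 - 4.0 * p * q)


def error_share(theta, p):
    """Fraction of apparent recombinants that are error-induced
    (one-locus case)."""
    inflated = apparent_theta_one_locus(theta, p)
    return (inflated - theta) / inflated


if __name__ == "__main__":
    theta = 0.01  # a 1-cM interval, treating 1 cM as 1% recombination
    for p in (0.005, 0.010, 0.015):
        print(
            f"p={p:.3f}  apparent theta={apparent_theta_one_locus(theta, p):.4f}  "
            f"error share={error_share(theta, p):.2f}  "
            f"inflation/interval={map_inflation(1, p):.4f} (2p={2 * p:.4f})"
        )
```

With q = 1 - p, the quantity 1 - 4pq equals (1 - 2p)^2, so the per-interval inflation reduces to -ln(1 - 2p), which is close to 2p when p is small. This is consistent with the small-error approximation stated above.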

Following identification of questionable order outcomes, inspection of individual gametes that are suspect because of their low prior probability (or high number of crossover events) can be performed. Regardless of the results of attempts to identify questionable order outcomes, inspection of the distribution of recombinants in families would be prudent in all high-resolution maps. More specifically, clusters of recombination events (single or double) within a high-resolution interval should be confirmed by reinspection of the primary data (or be retyped, if that is practical). Such inspection is only marginally valuable in low-resolution maps composed of low-heterozygosity markers. However, the use of highly polymorphic markers should greatly improve this situation. These markers will make the vast majority of intervals informative and allow the location of misclassified loci to be narrowly defined. This will be possible because flanking loci will be much more likely to be informative, thereby assigning aberrant recombination events to the intervals immediately flanking the aberrant typing. The positives associated with use of these new markers, though, will be accompanied by the drawback that the absolute number of informative typings will be greater, thereby increasing the total number of opportunities for error introduction.

The benefits of the availability of high-resolution, high-integrity human maps extend beyond esoteric value. These maps may provide insight into a number of genuine biologic phenomena. For example, the empiric evidence for gene conversion events is the observation of a double crossover. Evidence for gene conversion in humans is increasingly widespread (Starck et al. 1990; Urabe et al. 1990). In primates, such events occur preferentially in the vicinity of (CA)n dinucleotide repeat elements (Fitch et al. 1990), and the increasing reliance of genetic maps on these elements (Weber and May 1989) makes characterizing their biology even more important. Identification of such conversion has also been used to identify a recombination initiation site in Saccharomyces cerevisiae (Nicolas et al. 1989). Chromosomal microinversions that are undetectable cytogenetically might also distort recombination maps if they are present as polymorphisms in given families. Such an occurrence has been hypothesized at 4p16.3 near the Huntington disease locus, on the basis of the distortion of recombination maps (Buetow et al. 1991). Such observations are only possible when aberrant observations due to experimental error are of substantially lower probability than are biologically interesting map anomalies.

Acknowledgments

The author wishes to thank 0. Jiang for his technical assistance, R. Sonlin for preparing the manuscript, and J. Murray for useful comments and additions to the manuscript, as well as for both allowing access to data sets that provided the impetus for the work and numerous stimulating conversations. This work is supported in part by USPHS grants HG00355, HG00206, CA06925, and RR05895 from the National Institutes of Health and by an appropriation from the Commonwealth of Pennsylvania.

References

Buetow KH, Shiang R, Yang P, Nakamura Y, Lathrop GM, White R, Wasmuth JJ, et al (1991) A detailed multipoint map of human chromosome 4 provides evidence for linkage heterogeneity and position-specific recombination rates. Am J Hum Genet 48:911-925

Chakravarti A, Lasher LK. Estimation of chromosome lengths under genotyping errors (submitted)

Dausset J, Cann H, Cohen D, Lathrop M, Lalouel J, White R (1990) Centre d'Etude du Polymorphisme Humain (CEPH): collaborative genetic mapping of the human genome. Genomics 6:575-577

Dracopoli NC, O'Connell P, Elsner TI, Lalouel J-M, White RL, Buetow KH, Nishimura DY, et al (1991) The CEPH consortium linkage map of human chromosome 1. Genomics 9:686-700

Fitch DHA, Mainone C, Goodman M, Slightom JL (1990) Molecular history of gene conversions in the primate fetal γ-globin genes. J Biol Chem 265:781-793

Friend SH, Bernards R, Rogelj S, Weinberg RA, Rapaport JM, Albert DM, Dryja TP (1986) A human DNA segment with properties of the gene that predisposes to retinoblastoma and osteosarcoma. Nature 323:643-646

Kerem B, Rommens JM, Buchanan JA, Markiewicz D, Cox TK, Chakravarti A, Buchwald M, et al (1989) Identification of the cystic fibrosis gene: genetic analysis. Science 245:1073-1080

Lathrop GM, Lalouel JM, Julier C, Ott J (1984) Strategies for multilocus linkage analysis in humans. Proc Natl Acad Sci USA 81:3443-3446

Lathrop GM, Lalouel JM, Julier C, Ott J (1985) Multilocus linkage analysis in humans: detection of linkage and estimation of recombination. Am J Hum Genet 37:482-498

MacDonald ME, Haines JL, Zimmer M, Cheng SV, Youngman S, Whaley WL, Bucan M, et al (1989) Recombination events suggest potential sites for the Huntington's disease gene. Neuron 3(2):183-190

Monaco AP, Bertelson CJ, Middlesworth W, Colletti C-A, Aldridge J, Fischbeck KH, Bartlett R, et al (1985) Detection of deletions spanning the Duchenne muscular dystrophy locus using a tightly linked DNA segment. Nature 316:842-845

Morton NE, Andrews V (1989) MAP, an expert system for multiple pairwise linkage analysis. Ann Hum Genet 53:263-269

Morton NE, Collins A (1990) Standard maps of chromosome 10. Ann Hum Genet 54:235-251

Nicolas A, Treco D, Schultes NP, Szostak JW (1989) An initiation site for meiotic gene conversion in the yeast Saccharomyces cerevisiae. Nature 338:35-39

Ott J (1977) Linkage analysis with misclassification at one locus. Clin Genet 12:119-124

Ott J (1990) Genetic linkage analysis under uncertain disease definition. In: Cloninger CR, Begleiter H (eds) Genetics and biology of alcoholism. Banbury rep 33. Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, pp 327-332

Royer-Pokora B, Kunkel LM, Monaco AP, Goff SC, Newburger PE, Baehner RL, Cole FS, et al (1986) Cloning the gene for an inherited human disorder - chronic granulomatous disease - on the basis of its chromosomal location. Nature 322:32-36

Shields DC, Collins A, Buetow KH, Morton NE. Error filtration, interference, and the human linkage map. Proc Natl Acad Sci USA (in press)

Starck J, Bouhass R, Morle Godet J (1990) Extent and high frequency of a short gene conversion between the human Aγ and Gγ fetal globin genes. Hum Genet 84:179-184

Urabe K, Kimura A, Harada F, Iwanaga T, Sasazuki T (1990) Gene conversion in steroid 21-hydroxylase genes. Am J Hum Genet 46:1178-1186

Weber JL, May PE (1989) Abundant class of human DNA polymorphisms which can be typed using the polymerase chain reaction. Am J Hum Genet 44:388-396