
COMMENTARY

Is It Ordered Correctly? Validating Genome Assemblies by Optical Mapping

Joshua A. Udall1* and R. Kelly Dawe2

1 Plant and Wildlife Science Department, Brigham Young University, Provo, UT, 84602

2 Department of Genetics, University of Georgia, Athens, GA, 30602

* Corresponding Author: [email protected]

Short title: Genome assembly validation by optical mapping

Abstract

Long-read single molecule sequencing, Hi-C sequencing, and improved bioinformatic tools are ushering in an era where complete genome assembly will become common for species with few

or no classical genetic resources. There are no guidelines for how to proceed in such cases.

Ideally, such genomes would be sequenced by two different methods so that one assembly serves as confirmation of the other; however, cost constraints make this approach unlikely.

Over-reliance on synteny as a means of confirming and ordering contigs will lead to

compounded errors. Optical mapping is an accessible and relatively mature technology that can be used for genome assembly validation. We discuss how optical mapping can be used as a

validation tool for genome assemblies and how to interpret the results. In addition, we discuss

methods for using optical map data to enhance genome assemblies derived from both

traditional sequence contigs and Hi-C pseudomolecules.

Plant Cell Advance Publication. Published on December 20, 2017, doi:10.1105/tpc.17.00514

©2017 American Society of Plant Biologists. All Rights Reserved


Introduction

A paradigm shift occurs within each research community when the genome of their study

organism is sequenced. Designing and executing the assembly is generally a collaborative

effort, and the display and annotation of the sequence can become a foundation for the

research community. The genome sequence opens doors to subsequent comparative, evolutionary, and translational research efforts. This process has unfolded in community after

community, starting with Arabidopsis, rice and maize and rapidly extended into all the major

food, fiber, and energy crops. As sequencing costs continue to drop, communities that began with one reference assembly are moving to multiple assemblies (Bolger et al., 2014; Kawakatsu

et al., 2016). High-quality assemblies dramatically expand the repertoire and robustness of

analyses that can be performed and provide the foundation for subsequent laboratory experimentation.

The process of genome assembly can be divided into two phases: 1) sequence assembly

(Berlin et al., 2015; Antipov et al., 2016) and 2) genome assembly (Goff et al., 2002; Schnable

et al., 2009; Schmutz et al., 2010; Paterson et al., 2012). The sequence assembly phase includes using all of the available sequence data from short reads, mate pairs, and long reads to

create contigs and scaffolds. Long-read single molecule sequencing technologies have made it

possible to dramatically extend the length of sequence contigs, often including large portions of entire chromosomes (Michael et al., 2017). The genome assembly phase includes the

integration of additional information such as prior assemblies, genetic maps, or Hi-C read pairs

to order and orient the contigs into “pseudomolecules” that are representations of the chromosomes. The genome assembly phase is particularly challenging and is difficult to

validate. Full replication of de novo sequence assembly using independent efforts could in

principle be used for assembly validation. However, genome sequencing projects generally

consume all available time and resources generating high quality data for a single assembly. Standards for assembled genomes have been previously proposed ad interim (Blakesley et al.,

2004; Chain et al., 2009), though low-quality draft sequences continue to be published.

The problem of genome validation takes on additional importance as researchers move into species that lack basic genetic resources. Indeed, the sequencing and assembly methods

required to achieve a high level of contiguity are well within reach of many laboratories,

including those working with trees (Neale et al., 2014), minor crops (Clouse et al., 2016; Jarvis

et al., 2017), or ecological systems (Martinez-Garcia et al., 2016; Olsen et al., 2016; Tang et al., 2016; Vining et al., 2016). If a research group invests in mate-pair or long-read sequencing at


high depth, it is natural to proceed to genome assembly, even though the assembly may never

be used for map-based cloning or genetic analysis in the traditional sense. In these cases, synteny relationships have been used for gross genome assembly validation (Gan et al., 2016;

Jin et al., 2016); yet when there are differences between two genome sequences it is unclear if

they arose from assembly error or biological differences. There is also a real risk that a synteny-

assisted genome assembly will be used as a reference to create another synteny-assisted assembly in a third species, compounding errors and drifting further from biological reality.

Comparison of assemblies from different accessions of a single species (pan-genome analyses)

addresses the issue of validation to a degree, but a small number of genomes (e.g. 2-15 genomes within a monophyletic branch) still have limited inference power for untangling technical and

biological differences (Li et al., 2010; Gan et al., 2011; Hirsch et al., 2014; Li et al., 2014). Here

we highlight the use of optical mapping as an alternative, affordable method for sequence assembly validation that is independent of traditional sequencing and synteny-based methods.

Validation of sequence assembly

In this commentary, we do not present how to create a de novo genome assembly or

describe when a genome is completed (Veeckman et al., 2016). Instead, we look forward to

continued technical innovations in genome assembly and propose that next-generation optical maps may be used as a standard for assembly validation. Optical maps derived from high

throughput nano-channels (Bionano optical detection or Nabsys electronic detection) offer a

relatively straightforward and independent assessment of any genome assembly that claims to provide chromosome-level contiguity. The resulting whole-genome alignment metric can be

used by reviewers and readers alike to quickly assess sequence assembly quality.

Two current technologies (Bionano Genomics and Nabsys) use nick-based labelling to

generate maps of individual DNA molecules. The characteristics and features of Bionano technology have been reviewed elsewhere (Levy-Sakin and Ebenstein, 2013; Tang et al., 2015;

Chaney et al., 2016; Yuan et al., 2017). Briefly, the method involves purifying HMW DNA and

treating with a nickase, a modified restriction enzyme that creates single stranded nicks. Nickases target specific 6- or 7-bp nucleotide recognition sites and these sites are strand-

repaired with fluorescent nucleotides. The labeled DNA molecules are then passed through

nanochannel arrays where images are iteratively collected and converted into a digital format

that reflects the nicking patterns on each molecule. Data are usually collected at 50-150X coverage and subsequently assembled to create restriction map contigs. The assembly


algorithm identifies overlapping fingerprints in an overlap-layout-consensus approach; yet the

assembly process is one step removed from DNA sequence because only nick-length patterns (not sequence) are used to find matches between molecules. Similar to sequence analysis,

matched and overlapping nick patterns can be condensed into an aligned consensus pattern.

The challenge for optical map technologies, as well as other long-read technologies, is generating

high-quality, high-molecular weight (HMW) DNA. Tissue quality is very important in these protocols (young, unstressed tissue) and each DNA preparation will differ slightly in its labeling

efficiency. This differs from Hi-C-based scaffolding technologies (see below), which do not

require purified HMW DNA and do not suffer from these limitations. The characteristics and features of the Nabsys platform have not been reviewed

elsewhere, though descriptions of the technology (Oliver et al., 2017) and verification of

structural variants in the human genome are publicly available (Kaiser et al., 2017). Nicking enzymes are used to attach a proprietary tag to the HMW DNA and the DNA+tags are coated

with RecA protein. The RecA-coated DNA is moved through a nanochannel with detectors that

measure the change in electrical resistance in the nanochannel. A spike in resistance identifies

the tag as it passes the detector and the time between spikes measures the base-pair distance between tags. This platform has an expected availability date in 2018. Both the Bionano and

Nabsys systems produce nick-based physical map assemblies based on overlapping nick

patterns (for simplicity we retain the term optical map for both technologies in this commentary). To match optical map contigs with DNA sequence, the sequence is converted into a restriction

map format. Through positive matches between the nick patterns of the optical map contigs and

in silico nick patterns from the DNA sequence, the optical map can independently validate the base-pair distances between nick sites in the DNA sequence.
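
The conversion of sequence into a restriction-map format can be illustrated with a short sketch. This is a minimal illustration only, not the procedure used by Bionano or Nabsys software; the function names are hypothetical, and the recognition motif is passed in as a parameter rather than assumed for any particular enzyme. It records where a motif occurs on either strand and reports the base-pair distances between consecutive sites, which are the quantities an optical map can check.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T, N)."""
    return seq.upper().translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def find_sites(seq, motif):
    """Sorted 0-based start positions of the motif on either strand."""
    seq = seq.upper()
    motif = motif.upper()
    positions = set()
    # A motif on the reverse strand appears as its reverse complement
    # when the forward strand is scanned.
    for pattern in {motif, reverse_complement(motif)}:
        start = seq.find(pattern)
        while start != -1:
            positions.add(start)
            start = seq.find(pattern, start + 1)
    return sorted(positions)


def site_intervals(seq, motif):
    """Base-pair distances between consecutive recognition sites."""
    sites = find_sites(seq, motif)
    return [b - a for a, b in zip(sites, sites[1:])]


if __name__ == "__main__":
    toy = "AAGCTCTTCAAAAAGAAGAGCTT"  # toy sequence, not real data
    print(find_sites(toy, "GCTCTTC"))
    print(site_intervals(toy, "GCTCTTC"))
```

The list of intervals is the in silico counterpart of the nick-length pattern measured on the molecules; a real comparison would also have to allow for sites too close together to be resolved and for sizing error, which this sketch ignores.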

The widely used N50 parameter describes how much of an assembly is composed of

segments larger than a certain size, where “N” is the contig or scaffold size, and “50” is the

percent of the assembly length. The N50 term can be applied to contigs (segments with continuous sequence), scaffolds (contiguous but with N-filled gaps), or optical map contigs (no

sequence at all). An N50 of 1 Mb indicates that at least 50% of the assembly is contained in contigs (or

scaffolds) of 1 Mb or larger; many contigs will be larger than 1 Mb, but a much greater number will be smaller. Aligning two assemblies is a process of comparing the nicking site distribution of

the optical map to the in silico distributions from the sequence assembly. The degree of

alignment generated by such a comparison provides a direct measure of the accuracy of both

assemblies, although the power of the comparison increases with N50. In practice, only assemblies with megabase-scale N50s can be validated using optical mapping technology since


contigs/scaffolds smaller than 100 kb generally do not have enough nick information to be

confidently aligned. The quality of the optical map is heavily dependent on the quality of the high-molecular

weight DNA used to prepare it, and the purity of the extracted HMW DNA affects the efficiency

of the nicking reaction. If two recognition sites of a nicking enzyme happen to be closely

positioned on opposite strands of a DNA molecule, the enzyme can create a double strand break instead of two nicks, with the effect of truncating contigs (these are called “fragile sites”).

The impact of double strand breaks can be minimized by generating two different optical maps

with different nicking enzymes to create a more complete hybrid scaffold (Figure 1). Dual-nick assemblies have been shown to increase the assembly N50 of hummingbirds and humans by 2-

to 3-fold (Bionano Genomics Inc., 2017).
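
Candidate fragile sites can be anticipated from the sequence before an enzyme is chosen. The sketch below is an assumption-laden illustration rather than a vendor method: it treats a forward-strand motif occurrence and a reverse-complement occurrence as nicks on opposite strands and flags pairs closer than `max_dist`. The function names and the default distance are hypothetical, and the distance at which a real molecule actually breaks is not specified in this commentary.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T, N)."""
    return seq.upper().translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def occurrences(seq, pattern):
    """All 0-based start positions of pattern in seq, overlaps included."""
    hits = []
    start = seq.find(pattern)
    while start != -1:
        hits.append(start)
        start = seq.find(pattern, start + 1)
    return hits


def fragile_site_candidates(seq, motif, max_dist=1000):
    """Pairs (top, bottom) of opposite-strand sites within max_dist bp.

    max_dist is an illustrative placeholder, not a measured threshold.
    """
    seq = seq.upper()
    motif = motif.upper()
    top = occurrences(seq, motif)                          # sites on the forward strand
    bottom = occurrences(seq, reverse_complement(motif))   # sites on the reverse strand
    return [(t, b) for t in top for b in bottom
            if t != b and abs(t - b) <= max_dist]


if __name__ == "__main__":
    toy = "AAGCTCTTCAAAAAGAAGAGCTT"  # toy sequence, not real data
    print(fragile_site_candidates(toy, "GCTCTTC", max_dist=20))
```

Running the same scan with the recognition motifs of two different nicking enzymes would indicate which regions are at risk of truncation with one enzyme but not the other, which is the rationale for the dual-enzyme approach described above.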

We have aligned Bionano genome assemblies to several sequenced genomes using the runCharacterize method from Bionano (Table 1). For example, aligning optical maps to the

genome assemblies from two maize inbreds resulted in a high level of congruence between

uniquely mapped Bionano consensus molecules and the assembled sequence (96% and 98%

mapping rate of B73 and W22, respectively (Jiao et al., 2017)). The very high percentage of alignment between the maize optical and sequence maps can in part be attributed to the high

N50 of their respective sequence assemblies. In rice, Bionano contigs aligned to the

Nipponbare reference sequence with a 96% mapping rate (Chen et al., 2017). These are, however, exceptional cases, and most whole genome alignments using Bionano data are closer

to 85%. Comparing the optical map data of tetraploid cotton (G. hirsutum) TM-1 to one draft

genome sequence (Zhang et al., 2015) resulted in 85% alignment, and aligning to another draft genome of the same line (Li et al., 2015) resulted in 75%. These imperfect validations are the

result of errors in the Bionano assembly, the sequence assembly, or both.
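
Because N50 values are invoked repeatedly when interpreting these alignment rates, it may help to make the statistic concrete. The following is a minimal sketch, independent of any assembler or of Bionano software; the function name `n50` is illustrative. It sorts segment lengths from largest to smallest and returns the length at which the running total first reaches half of the assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig, scaffold, or map lengths.

    The N50 is the length L such that segments of length >= L together
    contain at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("at least one length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # reached 50% of the assembly
            return length


if __name__ == "__main__":
    toy_lengths = [10, 8, 5, 4, 3]  # arbitrary units, toy data
    print(n50(toy_lengths))
```

The same function applies unchanged to sequence contigs, scaffolds, or optical map contigs, since only lengths are involved.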

The comparison of data from different genome assembly projects also highlights some

of the limitations associated with using nick-based physical map data for validation (Table 1). Note that the optical map length often differs slightly in size from the assembled genome

sequence. The differences can generally be ascribed to repetitive regions such as nucleolus

organizing regions (NORs) and tandem repeat arrays (centromeres or telomeres), but may also be caused by low quality assemblies on either side of the comparison. For example, in rice

(MH63) we found that five large Bionano contigs erroneously mapped to a single repetitive

region of centromere 8. Correctly mapping these molecules to the genome increased the overall

alignment from 85.6% to 87.1% (J. Udall, unpublished data). Similarly, in G. herbaceum (A-genome cotton), nearly all of the centromere-spanning physical map contigs initially mapped to


a single chromosome (Figure 2). This was because 1) many of the centromeric repeats were

collapsed during sequence assembly and 2) the regular spacing of BssSI sites in the cotton regions had the lowest p-value match scores of any local match between those Bionano contigs

and the genome sequence. To accurately map these contigs, we masked the repetitive regions

and used the flanking regions for appropriate placement of the physical map contigs.

In many cases, plants being considered for genome assembly retain a significant amount of heterozygosity. Although most current technologies have been designed to

accommodate the possibility of heterozygosity, it remains a significant challenge to identify

heterozygous regions and incorporate them into a final assembled product. Where there is sufficient coverage and polymorphism to differentiate heterozygous regions, two haplotypes will

be assembled separately (sequence or optical map data). This has the effect of inflating the size

of the assembled genome by creating a separate contig for each heterozygous polymorphic region. However, in practice, some polymorphic regions will be collapsed into a single haplotype

depending on specific assembly parameters. Both FALCON (PacBio sequence assembly) and

Bionano are developing haplotype detection methods, but the process of sorting out allelic

contigs remains a difficult and labor-intensive process.

Enhancement by combination with other methods including Hi-C proximity ligation

In cases where a sequence and an optical assembly are available, it makes sense to

integrate the assemblies into a single ‘hybrid’ assembly using “hybridScaffold.pl”. This process

joins contigs and creates gaps with approximated sizes based on nick distances. Hybrid scaffolding also identifies assembly conflicts, which are the result of either improperly

assembled contigs or actual mismatches between the optical assembly and sequence assembly

when different accessions/species are used. Sequence chimeras are an unavoidable outcome

of assembly in large-genome species. Chimeric contigs can be resolved (manually or automatically) by cutting or trimming the sequence contigs or the optical maps, or both. The

resulting hybrid assembly is a more accurate representation of the genome than either

individual assembly alone, as it includes structural information from the optical map, corrected chimeric scaffolds, and generally a longer N50 than either input assembly. For example, hybrid

scaffolding was used to enhance the genome assemblies of amaranth (Clouse et al., 2016),

barley (Mascher et al., 2017), and quinoa (Jarvis et al., 2017). In amaranth, the hybrid assembly

reduced the number of scaffolds from 343 to 241 and nearly doubled the final scaffold N50 by


making several key connections between existing large scaffolds. Similar outcomes were

observed in other genomes. Hybrid scaffolding becomes even more powerful when combined with other technologies

such as the 10x Chromium system (Weisenfeld et al., 2017), or Hi-C-based methods (Burton et

al., 2013; Korbel and Lee, 2013). Hi-C is a relatively new approach that is gaining rapid

acceptance because of the resulting useful arrangements of contigs in chromosome-scale scaffolds (Korbel and Lee, 2013; Bickhart et al., 2017; Dudchenko et al., 2017; Mascher et al.,

2017). Hi-C was originally developed to detect intra- and inter-chromosomal interactions such

as those between enhancers and promoter regions, but it has proven to be useful for long range scaffolding of sequence contigs as well. Its use in scaffolding relies on the distance-dependent

decay of physical interaction frequencies that explains much (though not all) of the observed

interaction patterns. The process involves cutting chromatin with restriction enzymes, biotin-labeling and re-ligating the ends, then sequencing the biotin-labeled regions. Hi-C identifies

physical contacts only and is a powerful complement to a sequence assembly with an excellent

contig N50. Unlike mate-pair sequencing, Hi-C has the ability to scaffold a continuous range of

distances using a log-likelihood (LOD) function that compares scaffolding results on a statistical basis. The results are presented as concatenations of contigs in the most likely order and

orientation based on the highest LOD scores. Because the process is likelihood based,

heterozygous regions can be represented by separate, adjacent contigs. Hi-C data can be generated in-house or through a service provider such as Dovetail

Genomics or Phase Genomics. While it is a powerful method, it is important to note that Hi-C

scaffolding data includes structural DNA linkages (i.e. mate-pair-like linkages) and biological linkages (i.e. loci co-localized in the cell on the same or different chromosomes) from a

collection of millions of nuclei from different tissue types and cell division stages. Often telomeric

and centromeric regions of different chromosomes are co-localized in the nucleus. Such data is

inherently messy since the highest likelihood based on co-localization frequency is assumed to be correct for pseudomolecule construction, while several nearly-as-likely orders or orientations

may have also been calculated. This is particularly true for small contigs (<100 kb) that have a

small number of Hi-C linkages and limited power for the respective LOD scores. One key difference between Hi-C-based scaffolding and optical map scaffolding lies in how sequence

gaps are handled. In Hi-C scaffolding the sequence gaps are marked by arbitrarily sized and N-

filled gaps (Hi-C provides proximity information only), whereas in optical maps the gaps are

filled with Ns to lengths that estimate their actual size. Since there are no gaps of significance in


a Hi-C scaffold, optical maps align to them well as long as the contigs are ordered correctly

(Bionano software will call insertions where the gap sizes are not congruent). The combination of large sequence contigs (3-10 Mb), optical maps, and Hi-C

scaffolding data provides a very powerful set of resources. When integrating these data it is best

to scaffold the sequence contigs with Hi-C data first to create initial pseudomolecules, then to

verify and modify the result using the optical map data. Optical maps generally do not align to short contigs unless they are scaffolded with additional sequence. Placing Hi-C scaffolding first

in the workflow allows a proportion of correctly placed short contigs to be confirmed or adjusted

by the optical maps. Generating a hybrid scaffold from this final product accommodates the best features of both systems (Bickhart et al., 2017). However, even in the best assemblies, some

discrepancies will require adjustments or corrections. Appropriate adjustments have strong

global support from Hi-C data (i.e. clusters of linkage data) and strong local support for order and orientation from Bionano contigs (Figure 3). The number of adjustments generally is

proportional to the N50 of the underlying sequence contigs. Some genomic regions are easily

corrected, others require multiple iterations to untangle and resolve discrepancies (Figure 3),

and others will likely remain unresolved and may require local re-assembly of the underlying DNA sequence.
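
The difference between arbitrarily sized Hi-C gaps and optically sized gaps can be made concrete with a small sketch. It is a simplified illustration of the idea, not the algorithm implemented in hybridScaffold.pl; the class and function names are hypothetical. Two sequence contigs are assumed to be anchored to the same optical map contig, and the gap between them is estimated as the optical distance between their flanking labels minus the sequence that extends past each label toward the gap.

```python
from dataclasses import dataclass


@dataclass
class FlankingLabel:
    """Aligned label nearest the gap on one sequence contig (hypothetical type)."""
    map_pos: int    # position of the label on the optical map contig (bp)
    overhang: int   # sequence bases between the label and the contig end facing the gap


def estimate_gap(left, right):
    """Estimated gap size in bp; a negative value suggests the contigs overlap."""
    return (right.map_pos - left.map_pos) - left.overhang - right.overhang


def join_with_gap(left_seq, right_seq, gap_bp):
    """Join two contigs with an N-filled gap of the estimated size."""
    if gap_bp < 0:
        raise ValueError("negative gap estimate: contigs may overlap, review manually")
    return left_seq + "N" * gap_bp + right_seq


if __name__ == "__main__":
    left = FlankingLabel(map_pos=100_000, overhang=2_000)   # toy numbers
    right = FlankingLabel(map_pos=150_000, overhang=3_000)
    print(estimate_gap(left, right))
```

A negative estimate is reported rather than silently clipped, since an apparent overlap is the kind of discrepancy that may require the adjustments discussed above.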

To obtain the highest quality assemblies, users can opt to include high coverage of long-

read technologies such as those offered by Oxford Nanopore or PacBio, which can generate read lengths with N50s of 10 kb or higher, and contig assemblies with N50s over 10 Mb.

Michael et al. (2017) describe the assembly of an entire Arabidopsis genome using Oxford

Nanopore technology and confirm the assembly with Bionano optical mapping. PacBio frequently has been used to assemble the genomes of model species, and it combines well with

optical mapping (e.g. Jiao et al., 2017), but sufficient coverage can be cost-limiting for many

projects.

How to interpret whole-genome validation data

What are the metrics to consider for genome assembly validation? As previously discussed, errors and chimeras may exist for both optical maps and sequence assemblies, thus

100% congruence is not expected. By aligning Bionano assemblies with different published

genomes, we have empirically identified ~85% to be a reasonable level of initial alignment

between the two assemblies (Bionano data and DNA sequence). We do not suggest that researchers and reviewers use this number as a hard threshold; rather we suggest it be used as


a soft threshold with a significant amount of subjectivity. For example, percentages higher than

85% should be encouraged while percentages between 70%-85% should be reasonably justified. Good reasons for low alignment might be that different accessions were used for the

two maps or that the genome has exceptionally high repeat content.

Percentages below the 70-85% range could give a reviewer a basis to suggest sequence

assembly improvement, optical map assembly improvement, or further justification. In genomes with low (<70%) alignment between the optical map and sequence assembly, researchers can

look at other parameters to identify the potential sources of error. Chimeric contigs and the

resulting conflicts detected during hybrid scaffolding are one reason for low alignment. Conflict resolution can be ignored (i.e. flagged only), manually adjusted, or automatically resolved.

Chimeric contigs can occur in either the sequence or optical assemblies. When one assembly is

inferior to the other, there will be more conflicts assigned to it than the other. Each assembly comparison is unique and neither assembly is perfect in eukaryotic plant genomes. For

example, a G. herbaceum Bionano assembly aligned to its draft sequence assembly revealed

923 Bionano conflicts and 53 sequence conflicts, suggesting that the Bionano assembly could

be improved by increasing specificity (decreasing p-value for matches during assembly) or by closely reviewing and omitting unresolved conflicts. Manual editing of the G. herbaceum

scaffolds reduced the conflicts (to 609 Bionano conflicts and 29 sequence conflicts) and

improved the percent mapping (from 89.0 to 90.3). Several hundred or thousands of conflicts would be a red-flag indicator that one or both assemblies are of poor quality. Consequently, this

information could be used to assess the need for researchers to revisit HMW DNA preparation,

data collection, or assembly. If Hi-C data is used to create scaffolded pseudomolecules, conflicts between the Hi-C and optical map data are expected during the initial alignment. An

iterative process of local contig adjustments (where groups of one or more contigs are

considered individually) can be used to order and orient the Hi-C sequence contigs based on

the optical map (Figure 3). Currently, these adjustments are made manually but it is likely that automated pipelines will be developed in the future.
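
The soft thresholds discussed in this section can be summarized as a simple lookup. This is only a sketch of our reading of the guidance; the function name and the returned labels are hypothetical, and the handling of values that fall exactly on 70% or 85% is an assumption, since the thresholds are explicitly not meant to be hard cut-offs.

```python
def interpret_alignment(percent):
    """Map a whole-genome alignment percentage to a soft interpretation band."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    if percent >= 85:
        return "encouraged"    # at or above the empirically reasonable ~85% level
    if percent >= 70:
        return "justify"       # borderline: explain accession differences, repeats, etc.
    return "investigate"       # low: revisit DNA preparation, data collection, or assembly


if __name__ == "__main__":
    for value in (96, 75, 52):  # example percentages of the kind listed in Table 1
        print(value, interpret_alignment(value))
```

The labels are prompts for discussion between authors and reviewers, in keeping with the subjectivity the thresholds are meant to retain.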

In summary, a respectable draft genome sequence will have matches to a respectable

optical map assembly. Because both assemblies are independently constructed, each has its own sources of assembly limitations and errors. The optical map assembly can be used to

independently validate the distances between nick sites in the DNA sequence assembly. If the

validation percentage is high, optical map data can be directly used in hybrid scaffolding to

improve the overall assembly. If the validation percentage is low, researchers can use the data to assess the need for re-assembly, additional data collection, or both. If the validation is


borderline (70-85%), the lack of congruence might be justified (e.g. accession or species

differences) or it might be addressed through re-assembly and conflict resolution. Resolving conflicts might be sufficient to improve the overall alignment, although this depends on the

genome and the quality of the assemblies. If Hi-C scaffolding was used, adjustments to the local

order of contigs leverage the strengths of the optical map and do not necessarily invalidate

the ordering of likelihoods used for pseudomolecule construction. We anticipate dramatic improvements in all areas of sequencing and genome assembly in the coming years, but until

chromosomes can be confidently assembled from end to end from sequence data alone, optical

mapping will continue to have an important niche.

Author Contributions and Acknowledgments

JAU and RKD jointly conceived and wrote the commentary and JAU prepared the figures. RKD would like to thank Florian Jupe, Jinghua Shi, and Joseph Ecker for providing training in optical

mapping. Alex Freeman identified and adjusted the genomic region described in Figures 2 and

3. Graduate students Alex, Chris Hanson, and Evan Long created the Bionano genome

assemblies in Table 1 (except for maize, created by RKD). We thank Mingcheng Luo, Jonathan Gent, Jianing Lui, Shawn Sullivan (Phase Genomics), and Sven Bocklandt (Bionano) for

valuable comments on an early version of this manuscript. Work in the Udall laboratory was

funded by a grant from the National Science Foundation (1339412). Another grant from the National Science Foundation (1444514) funds work in the Dawe laboratory.


Table 1. Recent results of physical maps aligned to their respective reference genomes.

The last four columns are Bionano physical map numbers.

Species (Accession) | Published Reference | Total Sequence Length (Mb) | # Contigs | Map Length (Mb) | Contig N50 (Mb) | Total Unique Aligned Len.*, Mb (%)

G. barbadense (3-79) | Zhang et al. 2015, (G. hirsutum) | 2310 | 1711 | 2120 | 1.76 | 1192 (52)

G. barbadense (3-79) | Liu et al. 2015 | 1524 | 1711 | 2120 | 1.76 | 3 (0)

G. barbadense (3-79) | Yuan et al. 2015 | 2437 | 1711 | 2120 | 1.76 | 10 (0)

G. bickii | unpublished | 1883 | 1242 | 1561 | 1.69 | 1486 (79)

G. herbaceum (Wagad) | unpublished | 1579 | 1842 | 1569 | 1.20 | 1426 (90)

G. hirsutum (TM-1) | Zhang et al. 2015 | 2310 | 2196 | 2186 | 1.37 | 1965 (85)

G. hirsutum (TM-1) | Li et al. 2015 | 2143 | 2196 | 2186 | 1.37 | 1606 (75)

G. hirsutum (Maxxa) | Zhang et al. 2015 | 2310 | 2087 | 2335 | 1.62 | 1739 (75)

G. longicalyx | unpublished | 1196 | 1111 | 1185 | 1.62 | 1158 (96)

G. raimondii - BssSI | Paterson et al. 2012* | 753 | 1020 | 743 | 0.95 | 634 (81)

G. raimondii - BspQI | Paterson et al. 2012* | 751 | 425 | 784 | 2.64 | 595 (78)

Z. mays (B73) | Jiao et al. 2017 | 2102 | 1488 | 2146 | 2.36 | 2034 (96)

Z. mays (W22) | unpublished, maizegdb.org | 2129 | 1872 | 2171 | 1.65 | 2115 (98)

O. sativa vg. indica (MH63) | Wing et al. personal comm. | 387 | 705 | 350 | 0.61 | 331 (87)

O. sativa vg. japonica (Nipponbare) | Wing et al. personal comm. | 373 | 764 | 300 | 0.42 | 290 (78)

O. sativa vg. japonica (Nipponbare) | Kawahara et al. 2013 | 321 | 482 | 377 | 1.10 | 236 (96)

O. sativa vg. indica (93-11) | Chen et al. 2017 | 321 | 394 | 394 | 1.40 | 245 (85)

* This is the “Total Unique Aligned Len / Ref Length” value produced by the exp_informaticsReport in Bionano Irysview Software.
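
The footnoted ratio is simple to recompute from the two lengths involved. The sketch below is a plain restatement of "Total Unique Aligned Len / Ref Length" as a percentage; the function name is hypothetical and the code is not part of IrysView. Recomputed values need not reproduce every printed percentage exactly, since the table reports whole megabases.

```python
def percent_unique_aligned(aligned_mb, reference_mb):
    """Total unique aligned length as a percentage of the reference length."""
    if reference_mb <= 0:
        raise ValueError("reference length must be positive")
    return 100.0 * aligned_mb / reference_mb


if __name__ == "__main__":
    # G. herbaceum (Wagad) row of Table 1: 1426 Mb aligned, 1579 Mb reference
    print(round(percent_unique_aligned(1426, 1579), 1))
```

For the G. herbaceum row this gives 90.3, in agreement with the percent mapping quoted in the text after manual editing of the scaffolds.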


REFERENCES

Antipov, D., Korobeynikov, A., McLean, J.S., and Pevzner, P.A. (2016). HYBRIDSPADES: an algorithm for hybrid assembly of short and long reads. Bioinformatics 32, 1009-1015.

Berlin, K., Koren, S., Chin, C.S., Drake, J.P., Landolin, J.M., and Phillippy, A.M. (2015). Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nat Biotechnol 33, 623-630.

Bickhart, D.M., Rosen, B.D., Koren, S., Sayre, B.L., Hastie, A.R., Chan, S., Lee, J., Lam, E.T., et al. (2017). Single-molecule sequencing and chromatin conformation capture enable de novo reference assembly of the domestic goat genome. Nat Genet 49, 643-650.

Bionano Genomics Inc. (2017). Hybrid Scaffolding Improves Genome Assembly Accuracy and Contiguity. In White Paper Series (Bionano Genomics Inc.: Bionano Genomics Inc.).

Blakesley, R.W., Hansen, N.F., Mullikin, J.C., Thomas, P.J., McDowell, J.C., Maskeri, B., Young, A.C., Benjamin, B., et al. (2004). An intermediate grade of finished genomic sequence suitable for comparative analyses. Genome Res 14, 2235-2244.

Bolger, M.E., Weisshaar, B., Scholz, U., Stein, N., Usadel, B., and Mayer, K.F.X. (2014). Plant genome sequencing - applications for crop improvement. Curr. Opin. Biotechnol. 26, 31-37.

Burton, J.N., Adey, A., Patwardhan, R.P., Qiu, R., Kitzman, J.O., and Shendure, J. (2013). Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Nat Biotechnol 31, 1119-1125.

Chain, P.S., Grafham, D.V., Fulton, R.S., Fitzgerald, M.G., Hostetler, J., Muzny, D., Ali, J., Birren, B., et al. (2009). Genomics. Genome project standards in a new era of sequencing. Science 326, 236-237.

Chaney, L., Sharp, A.R., Evans, C.R., and Udall, J.A. (2016). Genome Mapping in Plant Comparative Genomics. Trends Plant Sci 21, 770-780.

Chen, P., Jing, X., Liao, B., Zhu, Y., Xu, J., Liu, R., Zhao, Y., and Li, X. (2017). BioNano genome map resource for Oryza sativa ssp. japonica and indica and its application in rice genome sequence correction and gap filling. Mol Plant 10, 895-898.

Clouse, J.W., Adhikary, D., Page, J.T., Ramaraj, T., Deyholos, M.K., Udall, J.A., Fairbanks, D.J., Jellen, E.N., et al. (2016). The Amaranth Genome: Genome, Transcriptome, and Physical Map Assembly. Plant Genome 9, 1.

Dudchenko, O., Batra, S.S., Omer, A.D., Nyquist, S.K., Hoeger, M., Durand, N.C., Shamim, M.S., Machol, I., et al. (2017). De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science 356, 92-95.

Gan, X., Hay, A., Kwantes, M., Haberer, G., Hallab, A., Ioio, R.D., Hofhuis, H., Pieper, B., et al. (2016). The Cardamine hirsuta genome offers insight into the evolution of morphological diversity. Nat Plants 2, 16167.

Gan, X.C., Stegle, O., Behr, J., Steffen, J.G., Drewe, P., Hildebrand, K.L., Lyngsoe, R., Schultheiss, S.J., et al. (2011). Multiple reference genomes and transcriptomes for Arabidopsis thaliana. Nature 477, 419-423.

Goff, S.A., Ricke, D., Lan, T.H., Presting, G., Wang, R.L., Dunn, M., Glazebrook, J., Sessions, A., et al. (2002). A draft sequence of the rice genome (Oryza sativa L. ssp japonica). Science 296, 92-100.


Hirsch, C.N., Foerster, J.M., Johnson, J.M., Sekhon, R.S., Muttoni, G., Vaillancourt, B., Penagaricano, F., Lindquist, E., et al. (2014). Insights into the Maize Pan-Genome and Pan-Transcriptome. Plant Cell 26, 121-135.

Jarvis, D.E., Ho, Y.S., Lightfoot, D.J., Schmockel, S.M., Li, B., Borm, T.J.A., Ohyanagi, H., Mineta, K., et al. (2017). The genome of Chenopodium quinoa. Nature 542, 307-312.

Jiao, Y., Peluso, P., Shi, J., Liang, T., Stitzer, M.C., Wang, B., Campbell, M.S., Stein, J.C., et al. (2017). Improved maize reference genome with single-molecule technologies. Nature 546, 524-527.

Jin, J., Lee, M., Bai, B., Sun, Y., Qu, J., Rahmadsyah, Alfiko, Y., Lim, C.H., et al. (2016). Draft genome sequence of an elite Dura palm and whole-genome patterns of DNA variation in oil palm. DNA Res 23, 527-533.

Kaiser, M.D., Davis, J.R., Grinberg, B.S., Oliver, J.S., Sage, J.M., Seward, L., and Bready, B. (2017). Automated Structural Variant Verification In Human Genomes Using Single-Molecule Electronic DNA Mapping. bioRxiv.

Kawakatsu, T., Huang, S.S., Jupe, F., Sasaki, E., Schmitz, R.J., Urich, M.A., Castanon, R., Nery, J.R., et al. (2016). Epigenomic Diversity in a Global Collection of Arabidopsis thaliana Accessions. Cell 166, 492-505.

Korbel, J.O., and Lee, C. (2013). Genome assembly and haplotyping with Hi-C. Nat Biotechnol 31, 1099-1101.

Levy-Sakin, M., and Ebenstein, Y. (2013). Beyond sequencing: optical mapping of DNA in the age of nanotechnology and nanoscopy. Curr Opin Biotechnol 24, 690-698.

Li, F., Fan, G., Lu, C., Xiao, G., Zou, C., Kohel, R.J., Ma, Z., Shang, H., et al. (2015). Genome sequence of cultivated Upland cotton (Gossypium hirsutum TM-1) provides insights into genome evolution. Nat Biotechnol 33, 524-530.

Li, R.Q., Li, Y.R., Zheng, H.C., Luo, R.B., Zhu, H.M., Li, Q.B., Qian, W.B., Ren, Y.Y., et al. (2010). Building the sequence map of the human pan-genome. Nat Biotechnol 28, 57-U83.

Li, Y.H., Zhou, G.Y., Ma, J.X., Jiang, W.K., Jin, L.G., Zhang, Z.H., Guo, Y., Zhang, J.B., et al. (2014). De novo assembly of soybean wild relatives for pan-genome analysis of diversity and agronomic traits. Nat Biotechnol 32, 1045-1052.

Martinez-Garcia, P.J., Crepeau, M.W., Puiu, D., Gonzalez-Ibeas, D., Whalen, J., Stevens, K.A., Paul, R., Butterfield, T.S., et al. (2016). The walnut (Juglans regia) genome sequence reveals diversity in genes coding for the biosynthesis of non-structural polyphenols. Plant J 87, 507-532.

Mascher, M., Gundlach, H., Himmelbach, A., Beier, S., Twardziok, S.O., Wicker, T., Radchuk, V., Dockter, C., et al. (2017). A chromosome conformation capture ordered sequence of the barley genome. Nature 544, 427-433.

Michael, T.P., Jupe, F., Bemm, F., Motley, S.T., Sandoval, J.P., Loudet, O., Weigel, D., and Ecker, J.R. (2017). High contiguity Arabidopsis thaliana genome assembly with a single nanopore flow cell. bioRxiv.

Neale, D.B., Wegrzyn, J.L., Stevens, K.A., Zimin, A.V., Puiu, D., Crepeau, M.W., Cardeno, C., Koriabine, M., et al. (2014). Decoding the massive genome of loblolly pine using haploid DNA and novel assembly strategies. Genome Biol 15, R59.

Oliver, J.S., Catalano, A., Davis, J.R., Grinberg, B.S., Hutchins, T.E., Kaiser, M.D., Nurnberg, S., Sage, J.M., et al. (2017). High-Definition Electronic Genome Maps From Single Molecule Data. bioRxiv.

Olsen, J.L., Rouze, P., Verhelst, B., Lin, Y.C., Bayer, T., Collen, J., Dattolo, E., De Paoli, E., et al. (2016). The genome of the seagrass Zostera marina reveals angiosperm adaptation to the sea. Nature 530, 331-335.

Paterson, A.H., Wendel, J.F., Gundlach, H., Guo, H., Jenkins, J., Jin, D., Llewellyn, D., Showmaker, K.C., et al. (2012). Repeated polyploidization of Gossypium genomes and the evolution of spinnable cotton fibres. Nature 492, 423-427.

Schmutz, J., Cannon, S.B., Schlueter, J., Ma, J.X., Mitros, T., Nelson, W., Hyten, D.L., Song, Q.J., et al. (2010). Genome sequence of the palaeopolyploid soybean. Nature 463, 178-183.

Schnable, P.S., Ware, D., Fulton, R.S., Stein, J.C., Wei, F.S., Pasternak, S., Liang, C.Z., Zhang, J.W., et al. (2009). The B73 Maize Genome: Complexity, Diversity, and Dynamics. Science 326, 1112-1115.

Tang, C., Yang, M., Fang, Y., Luo, Y., Gao, S., Xiao, X., An, Z., Zhou, B., et al. (2016). The rubber tree genome reveals new insights into rubber production and species adaptation. Nat Plants 2, 16073.

Tang, H.B., Lyons, E., and Town, C.D. (2015). Optical mapping in plant comparative genomics. Gigascience 4.

Veeckman, E., Ruttink, T., and Vandepoele, K. (2016). Are We There Yet? Reliably Estimating the Completeness of Plant Genome Sequences. Plant Cell 28, 1759-1768.

Vining, K.J., Johnson, S.R., Ahkami, A., Lange, I., Parrish, A.N., Trapp, S.C., Croteau, R.B., Straub, S.C., et al. (2016). Draft Genome Sequence of Mentha longifolia and Development of Resources for Mint Cultivar Improvement. Mol Plant 10, 323-339.

Weisenfeld, N.I., Kumar, V., Shah, P., Church, D.M., and Jaffe, D.B. (2017). Direct determination of diploid genome sequences. Genome Res 27, 757-767.

Yuan, Y., Bayer, P.E., Batley, J., and Edwards, D. (2017). Improvements in Genomic Technologies: Application to Crop Genomics. Trends Biotechnol 35, 547-558.

Zhang, T., Hu, Y., Jiang, W., Fang, L., Guan, X., Chen, J., Zhang, J., Saski, C.A., et al. (2015). Sequencing of allotetraploid cotton (Gossypium hirsutum L. acc. TM-1) provides a resource for fiber improvement. Nat Biotechnol 33, 531-537.

Figure 1. An 8.5 Mb region of chromosome 1 from the Gossypium raimondii genome assembly and associated alignment data. At the top, a hybrid assembly is aligned to two scaffolds of genomic sequence, six BspQI Bionano contigs, and eight BssSI Bionano contigs. Horizontal bars from left to right represent the chromosome and contigs, while colored vertical bands within the bars represent nick sites (BspQI = orange [matched]/red [unmatched], BssSI = red [matched]/green [unmatched]). Grey lines connect matched nick sites between the hybrid scaffold and the contigs (BspQI = orange, BssSI = red). A region of a Bionano contig (upper red box) is expanded to illustrate its individual alignment to the genome sequence (green bar, labeled sequence contig). A region of the same Bionano contig (lower red box) is further expanded to illustrate the consensus contig containing individually nicked, labeled, and assembled DNA molecules. Tick marks represent 400 kb, 100 kb, and 50 kb on the top, middle, and bottom scales, respectively. The blue individual molecules overlap a single BssSI nick site (*). Red individual molecules do not overlap the selected nick site.

Figure 2. An illustration of Bionano contigs likely spanning centromeric regions in the Gossypium herbaceum reference. The best match between the reference sequence and the Bionano molecules is determined by the lowest local p-values. The repeats in one sequence contig on this chromosome are very regularly spaced, resulting in significant matches with several Bionano contigs, despite those same contigs also having flanking regions that match elsewhere in the genome sequence. Consequently, Bionano molecules spanning the centromeric region were mapped to this genomic location. The top colored bar illustrates the sequence contigs that were concatenated to form the pseudomolecule chromosome. Bionano contigs are illustrated as cyan bars with a light blue coverage plot and many dark blue vertical BssSI matches to the genome sequence. Pink ellipses illustrate the regions matching the repeat (inset right). Grey lines connect the matched nick sites between the reference sequence and Bionano contigs. A closer view of the putative centromeric regions illustrates the matching of one BssSI site to multiple sites (red lines) in the different contigs (inset left).

Figure 3. Sequence contigs from Gossypium herbaceum chromosome 4 ordered and oriented into pseudomolecules by the Hi-C methodology (as assembled by Phase Genomics). A) The first row is a colored bar that represents the concatenated contigs based on clustering and orientation likelihood ratios of Hi-C data. The second row is the same sequence after in silico digestion, displayed as the ‘Reference with nicks’ with a ruler in megabases. The two following rows represent Bionano molecules aligned to the nick sites of the reference sequence. Matches are illustrated by grey lines connecting the nick sites between the reference sequence and the Bionano molecules. B) Three steps of rearrangement can locally reorganize the sequence contigs so that they agree with the Bionano contigs that have substantive evidence of contig order. The contigs are treated as blocks (of one or more contigs), and the blocks can be inverted, moved, or both. C) Once the contigs have been re-ordered and oriented based on the Bionano evidence, the sequence is re-digested in silico and the Bionano contigs are re-aligned to the final version of the sequence. Both sequence and Bionano contigs agree in order and orientation after corrections to the genome sequence.

Joshua Udall and R. Kelly Dawe. Is it ordered correctly? Validating genome assemblies by optical mapping. Plant Cell; originally published online December 20, 2017; DOI 10.1105/tpc.17.00514
