GENOMIC VARIATION · Array comparative genomic hybridization (Array CGH). A technique based on...

Genotyping · IFT6299 H2019 · UdeM · Miklós Csűrös


GENOMIC VARIATION

Variants

• SNP = single nucleotide polymorphism; small indels

• structural variants

[Figure 1 schematic (Nature Reviews | Genetics). Panels, each drawn against a reference (Ref.) line: Deletion; Novel sequence insertion; Mobile-element insertion; Tandem duplication; Interspersed duplication; Inversion; Translocation.]

Figure 1 | Classes of structural variation. Traditionally, structural variation refers to genomic alterations that are larger than 1 kb in length, but advances in discovery techniques have led to the detection of smaller events. Currently, >50 bp is used as an operational demarcation between indels and copy number variants (CNVs). The schematic depicts deletions, novel sequence insertions, mobile-element insertions, tandem and interspersed segmental duplications, inversions and translocations in a test genome (lower line) when compared with the reference genome.

Array comparative genomic hybridization (Array CGH). A technique based on competitively hybridizing fluorescently labelled test and reference samples to a known target DNA sequence immobilized on a solid glass substrate and then interrogating the hybridization ratio.

SNP microarrays. Hybridization-based assays in which the target DNA sequences are discriminated on the basis of a single base difference. Assays are processed with a single sample per array and perform both SNP genotyping and copy-number interrogation.

Single-base extension. Single-base-extension reactions use a primer that binds to a region of interest and follow this with an extension reaction that allows the incorporation of a single base after the primer.

technologies infer copy number gains or losses compared to a reference sample or population, but differ in the details and application of the molecular assays.

Array CGH. Array CGH platforms are based on the principle of comparative hybridization of two labelled samples (test and reference) to a set of hybridization targets (typically long oligonucleotides or, historically, bacterial artificial chromosome (BAC) clones). The signal ratio is then used as a proxy for copy number (see BOX 1 for details). An important consideration is the effect of the reference sample on the copy-number profile. For example, when only one sample is examined, a loss in the reference sample is indistinguishable from a gain in the test sample. For this reason, a well-characterized reference is key to interpretation of array CGH data [19]. Early studies of germline CNVs were based on BAC arrays or low-resolution oligonucleotide platforms and allowed detection of CNVs typically greater than 100 kb [1,2,6] (BOX 2). These initial studies highlighted the incredible number of CNVs observed in healthy individuals; however, the breakpoints of these alterations were not sufficiently well-defined to allow accurate assessment of the proportion of the genome altered or its gene content. This led to a drastic overestimation of the extent of copy-number polymorphism using large-insert BAC clones [2], which was subsequently refined by oligonucleotide microarrays or sequence-based studies of the same DNA samples [4,5,20,21].

Currently, Roche NimbleGen and Agilent Technologies are the major suppliers of whole-genome array CGH platforms and routinely produce arrays with up to 2.1 million (2.1M) and 1M long oligonucleotides (50–75-mers), respectively, per microarray. Detection of a CNV typically requires a signal from at least 3 to 10 consecutive probes (BOX 1); as a result, SNP and CGH microarrays can routinely detect anywhere from dozens to several hundred events per genome depending on the platform applied (BOXES 1,2). Two studies have recently used ultra-high-resolution arrays (24M to 42M probes) for array CGH-based SV discovery in samples from HapMap individuals [5,19]. Although such high-density arrays are not practical for a large number of samples (30 and 40 samples were used in these studies), these approaches enabled the discovery of CNVs down to 500 bp, with breakpoints precise enough to allow the identification of sequence motifs at a subset of variants. One key advantage of array CGH platforms is the availability of custom, high-probe-density arrays from both major manufacturers. This has led to their widespread adoption in clinical diagnostics, essentially replacing karyotype analysis as the primary means of detecting copy-number alterations among children with developmental delay [22].
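The two ideas above (a test/reference signal ratio as a proxy for copy number, and a call that needs several consecutive probes) can be sketched as follows. This is a deliberately naive illustration, not the method of any array vendor or of the review: the function names, the 0.3 threshold, and the default of 3 probes are assumptions chosen for the example.

```python
import math

def log2_ratios(test, reference):
    """Per-probe log2(test / reference) intensity ratio."""
    return [math.log2(t / r) for t, r in zip(test, reference)]

def call_segments(ratios, threshold=0.3, min_probes=3):
    """Return (start, end, kind) for runs of at least `min_probes`
    consecutive probes whose log2 ratio lies beyond +/- threshold.
    `end` is exclusive; `kind` is "gain" or "loss".
    Threshold and run length are illustrative assumptions."""
    segments = []
    run_start, run_state = None, None
    # A trailing 0.0 acts as a sentinel that closes the last run.
    for idx, value in enumerate(list(ratios) + [0.0]):
        if value > threshold:
            state = "gain"
        elif value < -threshold:
            state = "loss"
        else:
            state = None
        if state != run_state:
            if run_state is not None and idx - run_start >= min_probes:
                segments.append((run_start, idx, run_state))
            run_start, run_state = idx, state
    return segments

# Hypothetical intensities: four probes at 3 copies vs 2 (ratio 1.5),
# and one isolated probe at half intensity.
test_signal = [100, 100, 150, 150, 150, 150, 100, 50, 100]
ref_signal = [100] * len(test_signal)
ratios = log2_ratios(test_signal, ref_signal)
segments = call_segments(ratios)
```

With these inputs the four-probe gain is reported, while the isolated low probe is ignored because a single probe does not meet the run-length requirement.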

SNP arrays. SNP microarray platforms are also based on hybridization, with a few key differences from CGH technologies. First, hybridization is performed on a single sample per microarray, and log-transformed ratios are generated by clustering the intensities measured at each probe across many samples [20,23,24]. Second, SNP platforms take advantage of probe designs that are specific to single-nucleotide differences between DNA sequences, either by single-base-extension methods (Illumina) or differential hybridization (Affymetrix) [20,23,24]. One key disadvantage is that, per probe, SNP microarrays tend to offer lower signal-to-noise ratio than do the best array CGH platforms. This is apparent in comparisons of array CGH and SNP platforms in terms of detection of CNVs by a purely ratio-based approach [24–27]. However, a key advantage of SNP microarrays is the use of SNP allele-specific probes to increase CNV sensitivity, distinguish alleles and identify regions of uniparental disomy through the calculation of a metric termed B allele frequency (BAF) (BOX 1).

SNP arrays have proved popular in CNV-detection studies, historically as complements to array CGH platforms for fine-mapping regions [2] and currently in the large-scale discovery of CNVs in a broad variety of populations [16,20,23,28,29]. Early SNP arrays demonstrated poor coverage of CNV regions, but recent arrays (such as the Affymetrix 6.0 SNP and Illumina 1M platforms) incorporate better SNP selection criteria for complex regions of the genome and non-polymorphic copy-number probes (which are examined for log ratios but not BAF) [20,23,30]. Another important consideration is the choice of population because the average heterozygosity affects the proportion of SNPs that will generate a meaningful BAF signal (typically, heterozygosity is 30–40% in Illumina platforms). This is particularly relevant when dealing with populations that may have experienced a drastic bottleneck, as opposed to more outbred populations, and thus may affect the number of probes needed to identify an alteration [23,24]. Some studies combine array CGH and SNP platforms to offer higher confidence in CNV detection [2,20,30].


• copy number variation (= duplication of > 50 bp, including complete chromosomes)

Alkan, Coe & Eichler Nature Reviews Genetics 12:363 (2011)

Methods

• CGH (comparative genomic hybridization) on a chip ⇒ copy number variation

• SNP array: hybridization probes for known alleles arranged on a chip, hybridization detected by using marked probes or samples

• sequencing + alignment with reference genome

SNPs

[Figure: a targeted personal genome (paternal chromosome and maternal chromosome) compared with the reference. Labels: heterozygous SNP, homozygous SNP, wildtype allele, (mutant allele).]

Unphased:

    ##fileformat=VCFv4.2
    ...
    ... REF ALT ... FORMAT smp
    ... A G ... GT 0/1
    ... C T ... GT 0/1
    ... T C ... GT 1/1

Phased:

    ##fileformat=VCFv4.2
    ...
    ... REF ALT ... FORMAT smp
    ... A G ... GT 0|1
    ... C T ... GT 1|0
    ... T C ... GT 1|1
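The difference between the two snippets is only the separator inside the GT field: "/" for unphased and "|" for phased genotypes. A minimal sketch of reading that field for a diploid sample follows; the helper names `parse_gt` and `zygosity` are made up for this example and do not come from any VCF library.

```python
def parse_gt(gt):
    """Split a diploid GT string such as '0/1' or '1|0'.
    Returns (alleles, phased); a missing allele '.' becomes None."""
    phased = "|" in gt
    separator = "|" if phased else "/"
    alleles = tuple(None if a == "." else int(a) for a in gt.split(separator))
    return alleles, phased

def zygosity(alleles):
    """Classify a diploid genotype; allele 0 is the reference allele."""
    if None in alleles:
        return "missing"
    if alleles[0] != alleles[1]:
        return "heterozygous"
    return "homozygous reference" if alleles[0] == 0 else "homozygous alternate"

# The three genotypes of the phased example: 0|1, 1|0, 1|1.
# In phased records the allele order matters: 0|1 and 1|0 place the
# alternate allele on different haplotypes.
phased_calls = [parse_gt(g) for g in ("0|1", "1|0", "1|1")]
```

In the unphased snippet both heterozygous sites read 0/1, so nothing can be said about which alternate alleles sit on the same chromosome.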

Variant Call Format

1 The VCF specification

VCF is a text file format (most likely stored in a compressed manner). It contains meta-information lines (prefixed with "##"), a header line (prefixed with "#"), and data lines each containing information about a position in the genome and genotype information on samples for each position (text fields separated by tabs). Zero length fields are not allowed, a dot (".") must be used instead. In order to ensure interoperability across platforms, VCF compliant implementations must support both LF (\n) and CR+LF (\r\n) newline conventions.

1.1 An example

    ##fileformat=VCFv4.3
    ##fileDate=20090805
    ##source=myImputationProgramV3.1
    ##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
    ##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
    ##phasing=partial
    ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
    ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
    ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
    ##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
    ##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
    ##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
    ##FILTER=<ID=q10,Description="Quality below 10">
    ##FILTER=<ID=s50,Description="Less than 50% of samples have data">
    ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
    ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
    ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
    ##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
    20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
    20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
    20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
    20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
    20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3

This example shows (in order): a good simple SNP, a possible SNP that has been filtered out because its quality is below 10, a site at which two alternate alleles are called, with one of them (T) being ancestral (possibly a reference sequencing error), a site that is called monomorphic reference (i.e. with no alternate alleles), and a microsatellite with two alternative alleles, one a deletion of 2 bases (TC), and the other an insertion of one base (T). Genotype data are given for three samples, two of which are phased and the third unphased, with per sample genotype quality, depth and haplotype qualities (the latter only for the phased samples) given as well as the genotypes. The microsatellite calls are unphased.
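To make the column layout concrete, here is a small sketch that splits one data line of the example into its fixed columns, the INFO key/value pairs, and the per-sample fields declared by FORMAT. It is a hand-rolled illustration only: `parse_data_line` is a hypothetical helper, it does no type conversion or validation, and it assumes the tab separators required by the specification.

```python
def parse_data_line(line, sample_names):
    """Parse one tab-separated VCF data line into nested dictionaries.
    All values are kept as strings; INFO flags map to True."""
    fields = line.rstrip("\r\n").split("\t")
    chrom, pos, ident, ref, alt, qual, filt, info, fmt = fields[:9]

    info_dict = {}
    for item in info.split(";"):
        if item == ".":
            continue
        key, sep, value = item.partition("=")
        info_dict[key] = value if sep else True  # e.g. DB, H2 are flags

    format_keys = fmt.split(":")
    samples = {}
    for name, column in zip(sample_names, fields[9:]):
        # Trailing fields may be dropped, so zip stops at the shorter list.
        samples[name] = dict(zip(format_keys, column.split(":")))

    return {
        "CHROM": chrom, "POS": int(pos), "ID": ident,
        "REF": ref, "ALT": alt.split(","), "QUAL": qual,
        "FILTER": filt, "INFO": info_dict, "samples": samples,
    }

# First data line of the example above, rebuilt with tab separators.
first_line = "\t".join([
    "20", "14370", "rs6054257", "G", "A", "29", "PASS",
    "NS=3;DP=14;AF=0.5;DB;H2", "GT:GQ:DP:HQ",
    "0|0:48:1:51,51", "1|0:48:8:51,51", "1/1:43:5:.,.",
])
record = parse_data_line(first_line, ["NA00001", "NA00002", "NA00003"])
```

Here `record["samples"]["NA00003"]["GT"]` is the unphased genotype "1/1", matching the description of the third sample.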

1.2 Character encoding, non-printable characters and characters with special meaning

The character encoding of VCF files is UTF-8. UTF-8 is a multi-byte character encoding that is a strict superset of 7-bit ASCII and has the property that none of the bytes in any multi-byte characters are 7-bit ASCII bytes. As a result, most software that processes VCF files does not have to be aware of the possible presence of multi-byte UTF-8 characters. Note that non-printable characters U+0000-U+0008, U+000B-U+000C, U+000E-U+001F are disallowed. Line separators must be CR+LF or LF and they are allowed only as line separators at end of line. Characters with special meaning (such as field delimiters ';' in INFO or ':' FORMAT fields) must be represented using the capitalized percent encoding:

    %3A  : (colon)
    %3B  ; (semicolon)
    %3D  = (equal sign)
    %25  % (percent sign)
    %2C  , (comma)
    %0D  CR
    %0A  LF
    %09  TAB
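The table above translates directly into a pair of helper functions. The sketch below is an illustration with made-up names (`vcf_escape`, `vcf_unescape`); decoding is done in a single regular-expression pass so that an encoded percent sign is never decoded twice.

```python
import re

# Special characters and their capitalized percent encodings (table above).
VCF_ESCAPES = {
    ":": "%3A", ";": "%3B", "=": "%3D", "%": "%25",
    ",": "%2C", "\r": "%0D", "\n": "%0A", "\t": "%09",
}
VCF_UNESCAPES = {code: char for char, code in VCF_ESCAPES.items()}
_CODE_RE = re.compile("|".join(re.escape(code) for code in VCF_UNESCAPES))

def vcf_escape(text):
    """Replace each special character by its percent encoding."""
    return "".join(VCF_ESCAPES.get(char, char) for char in text)

def vcf_unescape(text):
    """Inverse of vcf_escape, decoding every code exactly once."""
    return _CODE_RE.sub(lambda match: VCF_UNESCAPES[match.group(0)], text)

encoded = vcf_escape("ratio=50%; see note:a,b")
```

Escaping the percent sign itself is what keeps the encoding reversible: a literal "%3A" in the input becomes "%253A" and decodes back to "%3A" rather than to a colon.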

1.3 Data types

Data types supported by VCF are: Integer (32-bit, signed), Float (32-bit, formatted to match the regular expression ^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$, NaN, or +/-Inf), Flag, Character, and String. For the Integer type, the values from −2^31 to −2^31 + 7 cannot be stored in the binary version and therefore are disallowed in both VCF and BCF, see 6.3.3.


Header: information on data origin, methods, and definition of codes to be used (in FILTER, INFO and FORMAT fields)

INFO: information on the locus; FORMAT declares the genotyping fields for the following columns; alleles are encoded as 0, 1, ... (0 = reference)

Variant calling

base calls at a locus: $Z = \{(z_1, q_1), (z_2, q_2), \dots, (z_n, q_n)\}$

possible diploid genotypes (unordered pairs): 4 homozygotes, 6 heterozygotes

unknown genotype: $Y$

$$P\{Y = y_1 y_2 \mid Z\} \;\propto\; \underbrace{P\{Z \mid Y = y_1 y_2\}}_{\text{likelihood of } y_1 y_2} \times \underbrace{P\{y_1 y_2\}}_{\text{prob. of genotype } y_1 y_2}$$

calculate $P\{Z \mid Y = y_1 y_2\}$ for homozygotes ($y_1 = y_2$) or heterozygotes ($y_1 \neq y_2$) . . .
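The slide leaves the likelihood computation open, so the sketch below fills it in with one common textbook model, stated here as an assumption: each $q_i$ is read as a Phred quality (error probability $10^{-q_i/10}$), reads are independent, each read samples one of the two alleles with probability 1/2, and an erroneous call is equally likely to be any of the three other bases. The prior defaults to uniform over the 10 genotypes, which is also an assumption.

```python
from itertools import combinations_with_replacement

BASES = "ACGT"
# 10 unordered pairs: 4 homozygotes and 6 heterozygotes.
GENOTYPES = list(combinations_with_replacement(BASES, 2))

def base_likelihood(z, q, allele):
    """P(observed base z | true allele), assuming Phred quality q."""
    error = 10 ** (-q / 10)
    return 1 - error if z == allele else error / 3

def genotype_posteriors(calls, prior=None):
    """Posterior P{Y = y1y2 | Z} for base calls Z = [(z, q), ...]."""
    if prior is None:
        prior = {g: 1 / len(GENOTYPES) for g in GENOTYPES}  # assumed uniform
    unnormalized = {}
    for y1, y2 in GENOTYPES:
        likelihood = 1.0
        for z, q in calls:
            # Each read comes from allele y1 or y2 with probability 1/2.
            likelihood *= 0.5 * base_likelihood(z, q, y1) \
                        + 0.5 * base_likelihood(z, q, y2)
        unnormalized[(y1, y2)] = likelihood * prior[(y1, y2)]
    total = sum(unnormalized.values())
    return {g: value / total for g, value in unnormalized.items()}

# Hypothetical pile-up: five A calls and five G calls, all at quality 30.
posteriors = genotype_posteriors([("A", 30)] * 5 + [("G", 30)] * 5)
best_genotype = max(posteriors, key=posteriors.get)
```

For a homozygote the mixture collapses to a single term, which is why the slide separates the cases $y_1 = y_2$ and $y_1 \neq y_2$. A population-based prior (for example Hardy-Weinberg frequencies) can be passed in place of the uniform one.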

Genotypes: Hardy-Weinberg

Hardy-Weinberg equilibrium:

• infinite population

• discrete generations

• panmixia

• no mutation, immigration, or selection

allele frequencies: wildtype (A) with $p$, mutant (a) with $q = 1 - p$

diploid genotype frequencies: AA ∼ $p^2$, Aa ∼ $2pq$, aa ∼ $q^2$

stays constant . . .
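The "stays constant" claim can be checked numerically: starting from allele frequency $p$, random mating gives genotype frequencies $p^2$, $2pq$, $q^2$, and the frequency of allele A among the gametes of that generation is $p^2 + \tfrac{1}{2}(2pq) = p(p + q) = p$. A short sketch, with function names chosen for this example:

```python
def hardy_weinberg(p):
    """Genotype frequencies under Hardy-Weinberg for allele frequency p."""
    q = 1 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}

def allele_frequency(genotype_freqs):
    """Frequency of allele A: all of AA plus half of Aa."""
    return genotype_freqs["AA"] + 0.5 * genotype_freqs["Aa"]

p = 0.7
generation = hardy_weinberg(p)            # AA 0.49, Aa 0.42, aa 0.09
p_next = allele_frequency(generation)     # back to 0.7 (up to rounding)
next_generation = hardy_weinberg(p_next)  # same genotype frequencies
```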

SNP frequencies

minor allele frequency (MAF): mutant allele’s frequency in the population (q)

frequent SNPs (MAF > 10%) and rare ones (MAF < 5%)

HapMap: SNP arrays (phase I 2005, phase II 2009)

1000 Genomes

populations: CEU (North America, European ancestors), YRI (Yoruba, Nigeria), JPT (Japan), CHB (Han); ASW (African Americans), GIH (Gujarati in Houston), MEX (Mexicans in Los Angeles), LWK (Luhya, Kenya), . . .

⇒ population-specific MAF values

dbSNP: database on SNPs and their frequencies

Haplotyping as a graph problem


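(a) Fragments (rows) at SNP sites 1 to 4:

    1 2 3 4
    A - C -
    T A - G
    A - G G
    - C - C
    T - C C

(b) Sites 1 and 3, rows 1, 3, 5:

    1 3
    A C
    A G
    T C

(c) Sites 1 and 4, rows 2, 3, 5:

    1 4
    T G
    A G
    T C

(d) Graph on the vertices 1, 2, 3, 4.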

Fig. 3. The SNP conflict graph. (a) A hypothetical data set consisting of five fragments sequenced at four sites. (b) Highlight of a SNP conflict between columns 1 and 3 of (a), using rows 1, 3, and 5. (c) Highlight of a SNP conflict between columns 1 and 4 of (a) using rows 2, 3, and 5. (d) The SNP conflict graph for (a), with edges corresponding to the conflicts shown in (b) and (c).

define a SNP conflict graph $G_S = (V_S, E_S)$ specifying SNP pairs that are inconsistent with having come from at most two haplotypes. For the SNP conflict graph, $|V_S| = n$ (one vertex per SNP) and $E_S$ is defined as follows:

$$
\begin{aligned}
E_S = \{(v_i, v_j) \mid\; & v_i, v_j \in V_S \wedge \exists k_1, k_2, k_3 \text{ s.t. } \\
& ((f_{k_1 i}, f_{k_1 j}) \neq (f_{k_2 i}, f_{k_2 j})) \wedge ((f_{k_1 i}, f_{k_1 j}) \neq (f_{k_3 i}, f_{k_3 j})) \wedge ((f_{k_2 i}, f_{k_2 j}) \neq (f_{k_3 i}, f_{k_3 j})) \\
& \wedge (f_{k_1 i} \neq \text{‘−’}) \wedge (f_{k_1 j} \neq \text{‘−’}) \wedge (f_{k_2 i} \neq \text{‘−’}) \wedge (f_{k_2 j} \neq \text{‘−’}) \wedge (f_{k_3 i} \neq \text{‘−’}) \wedge (f_{k_3 j} \neq \text{‘−’})\}
\end{aligned}
$$

Informally, $E_S$ is the set of SNP pairs for which at least three haplotypes are observed across all genotypes. When two SNPs conflict then the set of fragments covering those sites cannot be resolved into two haplotypes without disagreements among the fragments of at least one haplotype. Fig. 3 illustrates the construction of $G_S$.
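The definition of the edge set amounts to a simple test: for each pair of sites, collect the distinct allele pairs seen in fragments that cover both sites, and declare a conflict when three or more distinct pairs appear. The sketch below applies this to the five fragments of Fig. 3(a); the function name and the 1-based site labels in the output are choices made for this example.

```python
from itertools import combinations

GAP = "-"

def snp_conflict_graph(fragments):
    """Edges (i, j), 1-based, between SNP sites whose observed allele
    pairs cannot come from at most two haplotypes."""
    num_sites = len(fragments[0])
    edges = set()
    for i, j in combinations(range(num_sites), 2):
        observed_pairs = {
            (frag[i], frag[j])
            for frag in fragments
            if frag[i] != GAP and frag[j] != GAP
        }
        if len(observed_pairs) >= 3:  # three pairwise-distinct haplotypes
            edges.add((i + 1, j + 1))
    return edges

# Fragments of Fig. 3(a), one string per row, sites 1 to 4.
fig3_fragments = ["A-C-", "TA-G", "A-GG", "-C-C", "T-CC"]
fig3_edges = snp_conflict_graph(fig3_fragments)
```

For this data set the result is the two edges (1, 3) and (1, 4), the conflicts highlighted in panels (b) and (c) of the figure.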

2.2. Problem Formulations. In general, data will not be error free and thus $G_F$ may not be bipartite. Conflicts may occur because of sequencing errors, which introduce erroneous SNP values into individual fragments, and paralogous misrecruitment, which introduces erroneous fragments into the data set. Different problem formulations reflect different ways of inducing some bipartite $G'_F$ close to $G_F$. Our first formulation of the haplotype assembly problem approximately captures the intuition that sequencing errors are the main source of conflicts in haplotypes:

Minimum edge removal (MER) [19]: Find $V_1, V_2 \subseteq V_F$ such that $V_1 \cup V_2 = V_F$ minimizing $\sum_{v_i, v_j \in V_1} w(v_i, v_j) + \sum_{v_i, v_j \in V_2} w(v_i, v_j)$.

Note that although we define MER to minimize edge weights, we could alternatively define an unweighted version of the problem:


$\vec{h}_2 = \{h_{21}, h_{22}, \dots, h_{2n}\}$

summarizing the partition $F_1$ and $F_2$, e.g., by taking the more common allele (if any) at each site covered by each part. The specifics of how we judge the quality of the partition $F_1 \cup F_2$ and form the consensus haplotypes $\vec{h}_1$ and $\vec{h}_2$ distinguish the different variants of the haplotype assembly problem.

In explaining the problem variants in the literature, it is helpful to consider two further abstractions of the data [14]. We first construct a fragment conflict graph $G_F = (V_F, E_F)$ where $|V_F| = m$ (i.e., one node per fragment) and the edge set is defined by the pairs of fragments that conflict:

$$E_F = \{(v_i, v_j) \mid v_i \in V_F,\ v_j \in V_F,\ d(\vec{f}_i, \vec{f}_j) > 0\}.$$

Fig. 2 illustrates a fragment conflict graph for a hypothetical data set. We can optionally treat $G_F$ as a weighted graph, weighting each edge by the distance between its corresponding fragments:

$$w(v_i, v_j) = d(\vec{f}_i, \vec{f}_j).$$

If the data is error-free, then it must be possible to partition the fragments into two sets such that there are no conflicts within either set. Thus, error free data implies that $G_F$ is bipartite; two fragments from the same chromosome will not conflict while two fragments from different chromosomes may or may not conflict. In Fig. 2, the grey line shows a partition of the fragments into two parts implying a solution to the fragment assignment problem. For bipartite $G_F$, the haplotype assembly problem trivially reduces to finding the two parts of a bipartite graph, which we can solve in linear time.

(a) Fragments:

    A - A - - -
    - C A - - -
    - C - - T -
    - - - G - A
    G G - - - -
    - - T A - -
    - - - - C T

(b) [Drawing of the fragment conflict graph.]

Fig. 2. The fragment conflict graph. (a) A set of hypothetical fragments with edges labeling conflicts between pairs of fragments. (b) The fragment conflict graph $G_F$ corresponding to the fragments in (a). The grey bar cuts all edges in the graph, showing that $G_F$ is bipartite and that the haplotype assembly problem hence has an error-free solution.
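For the error-free case, the construction can be sketched end to end: build the fragment conflict graph, then two-color it by breadth-first search, which runs in time linear in the size of the graph. The excerpt does not define the distance $d$, so the code assumes it counts the sites covered by both fragments at which they disagree; that assumption and all function names are illustrative only.

```python
from collections import deque

GAP = "-"

def distance(f, g):
    """Assumed distance: number of sites covered by both fragments
    where their alleles differ."""
    return sum(1 for a, b in zip(f, g) if a != GAP and b != GAP and a != b)

def fragment_conflict_graph(fragments):
    """Adjacency sets of G_F: an edge joins fragments with distance > 0."""
    adjacency = {i: set() for i in range(len(fragments))}
    for i in range(len(fragments)):
        for j in range(i + 1, len(fragments)):
            if distance(fragments[i], fragments[j]) > 0:
                adjacency[i].add(j)
                adjacency[j].add(i)
    return adjacency

def bipartition(adjacency):
    """Two-color the graph by BFS. Returns the two parts as sorted lists,
    or None if some component contains an odd cycle (not bipartite)."""
    color = {}
    for start in adjacency:
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in color:
                    color[neighbor] = 1 - color[node]
                    queue.append(neighbor)
                elif color[neighbor] == color[node]:
                    return None
    part0 = sorted(n for n in adjacency if color[n] == 0)
    part1 = sorted(n for n in adjacency if color[n] == 1)
    return part0, part1

# Fragments of Fig. 2(a), indexed 0 to 6 from top to bottom.
fig2_fragments = ["A-A---", "-CA---", "-C--T-", "---G-A",
                  "GG----", "--TA--", "----CT"]
fig2_parts = bipartition(fragment_conflict_graph(fig2_fragments))
```

On this data the first four fragments end up in one part and the last three in the other, which is an error-free assignment of fragments to the two chromosomes.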

Some formulations of the haplotype assembly problem depend on an alternative graph formulation based on conflicts between SNP sites. For this formulation, we

Schwartz Communications in Information and Systems, 10:23–38 (2010)

Haplotype assembly

[Figure 4, panels (a) to (e). The panels show a graph on the vertices 1, 2, 3, 4 and the example fragments TGAC-----, AC-----CG, ----AGT--, --GAT----, -----CATA; one panel shows the third fragment as ----TGT--.]

Fig. 4. Illustration of haplotype assembly problem variants. (a) Minimum edge removal (MER). (b) Minimum fragment removal (MFR). (c) Minimum SNP removal (MSR). (d) Minimum error correction (MEC). (e) Longest fragment reconstruction (LFR).

Unweighted minimum edge removal (UMER) [19]: Find $V_1, V_2 \subseteq V_F$ such that $V_1 \cup V_2 = V_F$ minimizing $|\{(v_i, v_j) \in E_F \mid v_i, v_j \in V_1 \vee v_i, v_j \in V_2\}|$.

MER is illustrated in Fig. 4(a). We assume that we need to remove some subset of the edges of $G_F$ (marked in grey in the figure) to produce a bipartite graph on all nodes. We may then have conflicting fragments assigned to a given chromosome and must choose consensus SNP alleles for each chromosome to derive the two haplotypes. Although it does not precisely correspond to any reasonable error model, MER can nonetheless be a useful formulation algorithmically because it reduces the problem to a well-studied graph optimization problem, Maximum Cut [11].
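The link to Maximum Cut is that minimizing the edges left inside the two parts is the same as maximizing the edges that cross between them. The sketch below solves UMER by exhaustive search over all bipartitions; it is exponential in the number of fragments and is meant only to make the objective concrete on toy inputs. The function name and the example graphs are made up for this illustration.

```python
from itertools import product

def umer_brute_force(num_nodes, edges):
    """Exhaustively find a bipartition minimizing the number of edges
    with both endpoints in the same part.
    Returns (edges_to_remove, side) where side[v] is 0 or 1.
    Requires num_nodes >= 1; runs in O(2^num_nodes * |edges|) time."""
    best_cost, best_side = None, None
    # Node 0 is fixed in part 0 to skip mirror-image partitions.
    for labels in product((0, 1), repeat=num_nodes - 1):
        side = (0,) + labels
        cost = sum(1 for u, v in edges if side[u] == side[v])
        if best_cost is None or cost < best_cost:
            best_cost, best_side = cost, side
    return best_cost, best_side

# A triangle is the smallest non-bipartite graph: one edge must go.
triangle_cost, triangle_side = umer_brute_force(3, [(0, 1), (1, 2), (0, 2)])

# A 4-cycle is already bipartite: nothing needs to be removed.
cycle_cost, cycle_side = umer_brute_force(4, [(0, 1), (1, 2), (2, 3), (3, 0)])
```

The weighted MER objective is obtained by summing edge weights $w(v_i, v_j)$ instead of counting edges.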

An analogous formulation of the problem can also be derived from the fragment conflict graph:

Minimum fragment removal (MFR) [14]: Find $V_1, V_2 \subseteq V_F$ minimizing $|V_F \setminus (V_1 \cup V_2)|$ such that $V_1 \cap V_2 = \emptyset$ and $\nexists\, v_i, v_j \in V_1$ s.t. $(v_i, v_j) \in E_F$ and $\nexists\, v_i, v_j \in V_2$ s.t. $(v_i, v_j) \in E_F$.

Informally, the problem is to remove as few fragments as possible so as to leave a bipartite fragment conflict graph. The remaining graph will then be conflict-free and we can therefore easily derive the two haplotypes from its bipartition. This variant is illustrated in Fig. 4(b), which shows how we can remove some subset of the nodes of $G_F$ (marked in grey) to produce a bipartite graph on the remaining nodes. MFR

• NP-hard problems

• the abstraction does not take biological context into account

• the abstraction does not take the technological context into account

Haplotypes with inheritance

recombinations during meiosis in father and mother; example: quartet (2 parents, 2 children)

children: identical, haploidentical (by mother or by father only), or different;

joint genotyping with HMM: 4 inheritance states + 2 error states (compression/CNV and sequencing errors);

emissions: sequence mismatch with 0.5% (or 30% in error state), or else respecting the state's constraint; compression: frequent heterozygotes

[Figures S2A and S2B]

Roach & al. Science 328:636 (2010)

Family quartet

gametes is therefore 2^n, where n is the number of meioses in a pedigree. In a nuclear family of four, the Mendelian inheritance patterns can be grouped into four inheritance states for each variant position, with children receiving (i) the same allele from both the mother and the father (identical), (ii) the same allele from the mother but opposites from the father (haploidentical maternal), (iii) the same allele from the father, but opposites from the mother (haploidentical paternal), or (iv) opposites from both parents (nonidentical) (fig. S2). Adjacent variant base pairs in alignments of the family genomes have the same inheritance state unless a recombination has occurred between these bases in one of the meioses. This delineates inheritance blocks.
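The grouping into four states can be made explicit with a short enumeration. In a quartet there are n = 4 meioses (one paternal and one maternal per child), hence 2^4 = 16 transmission patterns. The sketch below encodes each child as a pair of indices saying which of the father's and which of the mother's two haplotypes it received; this encoding and the function names are assumptions made for the illustration.

```python
from itertools import product

def inheritance_state(child1, child2):
    """Each child is (paternal_index, maternal_index), with index 0 or 1
    naming which of that parent's two haplotypes was transmitted."""
    same_paternal = child1[0] == child2[0]
    same_maternal = child1[1] == child2[1]
    if same_paternal and same_maternal:
        return "identical"
    if same_maternal:
        return "haploidentical maternal"
    if same_paternal:
        return "haploidentical paternal"
    return "nonidentical"

def count_patterns():
    """Group all 2^4 transmission patterns of a quartet by state."""
    counts = {}
    for pat1, mat1, pat2, mat2 in product((0, 1), repeat=4):
        state = inheritance_state((pat1, mat1), (pat2, mat2))
        counts[state] = counts.get(state, 0) + 1
    return counts

state_counts = count_patterns()
```

Each of the four states collects 4 of the 16 patterns, so at a position free of recombination the state only records whether the siblings share the paternal and the maternal haplotype.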

Many algorithms can identify the boundaries of blocks, and theory-driven implementations are in wide use (5–7). For our complete genome sequence data, we developed an algorithm to identify all states, including non-Mendelian states. One non-Mendelian state will occur in regions where highly similar sequences are inadvertently compressed computationally (for example, during sequence assembly of CNVs). In such a “compression block,” many positions will appear to be heterozygous in all individuals, regardless of the inheritance patterns of the positions contributing to the compression. Other non-Mendelian patterns are seen in regions prone to errors in sequence calling or assembly or that have inherited hemizygous deletions. For both of these patterns, many positions will be observed as Mendelian inheritance errors (MIEs). Our algorithm identified six states: one for each of the four Mendelian inheritance states, one for a compression state, and one for a MIE-prone state (4). We identified 1.5% of the genome in this pedigree as 409 compression blocks and 1.7% as 126 error-prone blocks. Because these blocks are a source of false positives for recombination predictions, SNPs, and disease candidate alleles, their identification is important (Fig. 1). The power to precisely determine inheritance-state boundaries is striking in families of at least four and would be reduced had we sequenced fewer individuals (Fig. 2). Meiotic gene conversions could in principle be recognized in the same way as inheritance blocks; they would be indistinguishable from a short region flanked by meiotic recombinations in the same meiosis. We found that the great majority of candidate gene-conversion regions were caused by reads mismapped to repetitive DNA, such as CNVs or satellites, and did not conclusively identify gene-conversion regions.

Recombination in maternal meioses is thought to occur 1.7 times more frequently than in paternal meioses (8). We inferred 98 crossovers in maternal and 57 in paternal meioses (count includes both offspring), which is consistent with this estimate. The median resolution of the 155 crossover sites was 2.6 kb, with a few sites localized within a 30-bp window (Fig. 1). Crossover sites were significantly correlated with hotspots of recombination as inferred from HapMap data, in which a hotspot is defined as a region with ≥10 centimorgan (cM)/Mb; 92 of the 155 recombinations took place in a hotspot.

By identifying inconsistencies across the 22% of the genomes of the two children in “identical” blocks, for which they are effectively twins, we computed an error rate of 1.0 × 10⁻⁵. We also determined error rate through other methods, including resequencing, which gave similar estimates, ranging from 8.1 × 10⁻⁶ to 1.1 × 10⁻⁵ (4). Furthermore, ~70% of the errors in a four-person pedigree can be detected as apparent MIEs and inconsistencies in inheritance state blocks, so the effective base-pair error rate in the context of a pedigree is ~3 × 10⁻⁶.

Analysis of the mutation rate, including germline and early embryonic somatic mutations, requires highly accurate sequence data. Even with such data, however, most apparent aberrations in allele inheritance will be due to errors in the data and not to mutation. Our data had thousands of such false-positive candidates for each true de novo mutation. Our initial data encompassed 2.3 billion bases and contained 49,720 candidate MIEs that were consistent with the presence of a single-nucleotide mutation. After excluding sites in MIE-prone and compression states as well as sites that were unsuitable for probe design, 33,937 potential mutations among 1.83 billion bases remained. We resequenced each of these candidates and applied a stringent base-calling algorithm to confirm 28 candidates as de novo mutations. In a final confirmation step, we verified all 28 mutations with mass spectrometry (table S3) (4), corresponding to a mutation rate of 3.8 × 10⁻⁹ per position per generation per haploid genome.

Because the raw estimate of 3.8 × 10⁻⁹ does not account for the true mutations that were not conclusively identified through resequencing, we estimated a false-negative rate by applying the base-calling algorithm to 5 Mb of independent resequencing data, divided into 25 randomly selected regions of the genome. A comparison of the resequencing data with the complete genome sequence for the same regions provided a de novo mutation false negative rate of 0.662 [95% confidence interval (CI) 0.644 to 0.680]. Adjusting for the false-negative rate produced an unbiased mutation rate estimate of 1.1 × 10⁻⁸ per position per haploid genome, corresponding to approximately 70 new mutations in each diploid human genome (95% CI of 6.8 × 10⁻⁹ to 1.7 × 10⁻⁸) (4). In great apes, CpG sites are
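The mutation-rate figures quoted in this excerpt can be reproduced with a few lines of arithmetic. One step is an inference rather than something the excerpt states: the denominator is taken to be the 1.83 billion surveyed positions times 4 haploid genomes (two children, each inheriting one paternal and one maternal copy). The haploid genome size used for the per-genome count is likewise an assumed round value.

```python
confirmed_mutations = 28
surveyed_positions = 1.83e9
haploid_genomes = 4          # assumption: 2 children x 2 inherited copies
false_negative_rate = 0.662

# Raw rate per position per generation per haploid genome.
raw_rate = confirmed_mutations / (surveyed_positions * haploid_genomes)

# Only a fraction (1 - FNR) of true mutations is detected,
# so the raw rate is divided by that fraction.
adjusted_rate = raw_rate / (1 - false_negative_rate)

# Per diploid genome, with an ASSUMED haploid size of about 3.1e9 bases.
assumed_haploid_size = 3.1e9
new_mutations_per_genome = adjusted_rate * 2 * assumed_haploid_size

# Crossover counts quoted earlier in the excerpt: 98 maternal, 57 paternal.
maternal_to_paternal = 98 / 57
```

Under these assumptions the raw rate rounds to 3.8 × 10⁻⁹, the adjusted rate to 1.1 × 10⁻⁸, and the per-genome count lands near the quoted value of about 70; the crossover ratio is about 1.7.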

Fig. 1. The landscape of recombination. Each chromosome in this schematic karyotype is used to represent information abstracted from the four corresponding chromosomes of the two children in the pedigree. It is vertically split to indicate the inheritance state from the father (left half) and mother (right half), as shown in the key. The three compound heterozygous (DHODH, DNAH5, and KIAA0556) and one recessive (CES1) candidate gene, depicted by red bands, lie in “identical” blocks. (Inset) Scatterplot of HapMap recombination rates (in centimorgans per megabase) within the predicted crossover regions. The maximum value of centimorgans per megabase found in each window is shown in red. The left histogram shows the size distribution of recombination windows (log10 value of –0.58 ± 0.92). The top graph shows the centimorgans per megabase distribution for the observed maximal values (red), for similarly sized windows shifted by 6 kb (orange), and for similarly sized windows randomly chosen from the entire genome (blue). A shift of 6 kb from the observed locations eliminates the correlation with hotspots. Of 155 recombination windows, 92 contained a HapMap site with >10 cM/Mb. Only five randomly picked windows are expected to contain such high recombination rates.


Roach & al. Science 328:636 (2010)

Long reads

10X: read length extension by barcoding

Pacific Biosciences: single-molecule real-time

Table 1. Genome assemblies

Columns, in order:
Input: Id (a), Sample (b), Ethnicity (c), Sex (d), Data description (e), X (f), F (g)
Continuity: N50 contig (kb) (h), N50 phase block (Mb) (i), N50 scaffold (Mb) (j), Gappiness (k)
Comparison to truth data: Same: N50 perfect stretch (kb) (l); Parental: Phasing error rate (m); Reference: Missing k-mers (%) (n), Haploid (o) and Diploid (p); Inconsistent at given distance (%) (q), 1 Mb and 10 Mb
Runtime: Wall clock (days) (r)

Rows (values in source order; empty cells are not marked):
A | NA19238 | Yoruban | F | One 10× library | 56 115 114.6 8.0 18.7 2.1 14.5 10.0 1.2 0.8 1.7
B | NA19240 | Yoruban | F | One 10× library | 56 125 118.8 9.3 16.4 2.3 0.00008 14.4 9.8 1.2 0.7 1.7
C | HG00733 | Puerto Rican | F | One 10× library | 56 106 123.6 3.4 17.8 2.0 0.00008 12.7 9.2 1.0 1.2 1.7
D | HG00512 | Chinese | M | One 10× library | 56 102 113.2 2.7 15.4 2.2 13.6 10.0 1.3 0.5 1.7
E | NA24385 | Ashkenazi | M | One 10× library | 56 120 106.4 4.2 15.1 2.6 0.00006 13.9 9.6 1.3 2.0 1.8
F | HGP | European | M | One 10× library | 56 139 120.2 4.5 18.6 2.5 19.8 12.4 8.8 1.8 0.9 2.0
G | NA12878 | European | F | One 10× library | 56 92 118.5 2.8 16.4 2.0 16.5 0.00077 12.6 9.1 1.1 0.6 1.8
H | NA12878 | European | F | Unknown number of PacBio libraries plus BioNano Genomics data | 46 1594.2 25.4 4.6 18.0 0.5 2.0
I | NA12878 | European | F | Six libraries (fragment, jumping, 10×) | 160 12.3 30.1 10.2 19.7 1.1 7.1
J | NA12878 | European | F | Nine libraries (fragment, jumping, Fosmid, Chicago) | 150 43.6 42.8 0.6 14.8 5.6 6.0
K | NA24385 | Ashkenazi | M | Seven PacBio libraries | 71 4525.2 4.5 0.0 11.8 2.2 17.9
L | NA24143 | Ashkenazi | F | Two PacBio libraries | 30 1048.4 1.0 0.0 14.3 15.2
M | YH | Chinese | M | ∼18,000 Fosmid pools and six fragment and jumping libraries, Illumina sequenced, plus Complete Genomics data | 702 52.5 0.5 23.2 1.5 10.4 1.2 1.6

Assemblies of this work plus preexisting assemblies (H from Pendleton et al. 2015; I from Mostovoy et al. 2016; J from Putnam et al. 2016; M from Cao et al. 2015; see Supplemental Note 4). All statistics were computed after removing scaffolds shorter than 10 kb. Comparisons to reference use GRCh37 (Chr1-22,X,Y), with ChrY excluded for female samples. Software used to create assemblies: (A–G) Supernova 1.1 with default parameters; (H) Falcon, BLASR (Chaisson and Tesler 2012), Celera Assembler (Koren et al. 2012), RefAligner (Anantharaman and Mishra 2001; Nguyen 2010), custom scripts; (I) SOAPdenovo2 (Luo et al. 2012), ABySS (Simpson et al. 2009), Longranger (Zheng et al. 2016), BWA-MEM (Li 2013), fragScaff (Adey et al. 2014), RefAligner, Lastz (Harris 2007), BioNano hybrid scaffold tool (Mak et al. 2016); (J) Meraculous (Chapman et al. 2011), HiRise (Putnam et al. 2016); (K,L) Celera Assembler, Quiver (Chin et al. 2013); (M) SOAPdenovo2, ReFHap (Duitama et al. 2012), custom pipeline.
a: Identifier of assembly in this table.
b: Source of starting material. HGP is from the donor to the Human Genome Project for libraries RPCI 1,3,4,5 (https://bacpacresources.org/library.php?id=1), for which 340 Mb of finished sequence are in GenBank. HGP was from fresh blood; others are Coriell cell lines.
c: Ethnicity of individual.
d: Sex of individual.
e: Capsule description of data type.
f: Estimated coverage of genome by sequence reads. For assemblies of this work, reads were 2×150; 1200 M reads were used for each assembly; all samples were sequenced on HiSeq X.
g: Inferred length-weighted mean molecule length of DNA in kb (for other statistics, see Supplemental Table 1).
h: N50 size of FASTA records, after breaking at sequences of 10 or more n or N characters.
i: N50 size of phase blocks, computed for A–G, and as reported for assembly M.
j: N50 size of FASTA records, excluding Ns.
k: Fraction of bases that are ambiguous.
l: N50 length in kb of segments on finished sequence from same sample that are perfectly mirrored in assembly (see text).
m: Fraction of phased sites in megabubble branches whose phasing did not agree with the majority.
n: Fraction of 100-mers in reference that are missing from the assembly (includes bona fide sample/reference differences).
o: Value for haploid version of assembly.
p: Value for diploid version of assembly.
q: Of k-mer pairs at the given distance in the assembly, and for which both are uniquely placed on the reference, fraction for which either the reference chromosome, orientation, order, or separation (±10%) are inconsistent (includes bona fide sample/reference differences).
r: Run time (days) for assemblies using a single server having 28 cores and 384 GB available memory (booted with “mem= 384G”), exclusive of subsampling to 1200 M reads, sorting by barcode and trimming of barcodes (total 2–5 h).
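The N50 statistics reported in Table 1 can be illustrated with a minimal Python sketch. It assumes the usual definition of N50 (the largest length such that records at least that long hold at least half of all bases) and breaks sequences at runs of 10 or more N characters, as the contig footnote describes. The function names `n50` and `contig_lengths` are illustrative and are not taken from the paper's software.

```python
import re

def n50(lengths):
    """Largest length x such that records of length >= x hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input

def contig_lengths(scaffold, min_run=10):
    """Break a scaffold at runs of >= min_run n/N characters; return the piece lengths."""
    pieces = re.split(r"[nN]{%d,}" % min_run, scaffold)
    return [len(piece) for piece in pieces if piece]

# Toy data: eight records with 32 bases in total; the two longest already hold half.
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
# A short run of N (here 2) does not break the record, a run of 10 does.
print(contig_lengths("ACGT" + "N" * 10 + "ACNNG"))
```

Sorting once and accumulating from the longest record keeps the computation at O(m log m) for m records.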


resolved contigs: 100 kbp, scaffolds: > 15 Mbp

Weisenfeld & al. Genome Research 27:757 (2017)

Diploid assembly

those present in only one barcode, thus reducing the incidence of false k-mers, i.e., those absent from the sample. The remaining k-mers are formed into an initial directed graph, in which edges represent unbranched DNA sequences, and abutting edges overlap by k−1 bases. Operations are then carried out to recover missing k-mers and remove residual false k-mers (Weisenfeld et al. 2014). At this point, the graph (called the base graph) is an approximation to what would be obtained by collapsing the true sample genome sequence along identical 48-base sequences (Butler et al. 2008). We then use the read pairs to effectively increase k to about 200, so that the new graph represents an approximation to what would be obtained by collapsing the true sample genome sequence along identical 200-base sequences, thus achieving considerably greater resolution (Methods).

The remainder of the assembly process consists of a series of operations that modify this graph, so as to improve it. To facilitate these operations, we decompose the graph into units called lines (Fig. 1; Methods). Lines are extended linear regions, punctuated only by “bubbles.” Bubbles are places in the graph where the sequence diverges along alternate paths that then reconnect. Common sources of bubbles are loci that are heterozygous or difficult to read (in particular, at long homopolymers).

We can use lines to scaffold the assembly graph. This involves determining the relative order and orientation of two lines, then breaking the connections at their ends, then inserting a special “gap” edge between the lines. The end result is a new line, which has a special “bubble” consisting only of a gap edge. Subsequent operations (described later) may remove some of these gaps, replacing them by sequence.

Scaffolding is first carried out using read pairs. If the right end of one line is unambiguously connected by read pairs to the left end of another line, then they can be connected. Read pairs can reach over short gaps. To scaffold across larger gaps, we use the barcodes. Briefly, if two lines are actually near each other in the genome, then with high probability, multiple molecules (in the partitions) bridge the gap between the two lines. Therefore for any line, we may find candidate lines in its neighborhood by looking for other lines sharing many of the same barcodes. By scoring alternative orders and orientations (O&Os) of these lines, we can scaffold the lines by choosing their most probable configuration, excluding short lines whose position is uncertain (Methods).
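The barcode-sharing idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Supernova's implementation: the function name `candidate_neighbors`, the input format (a mapping from line identifier to the set of barcodes with reads on that line) and the `min_shared` cutoff are all hypothetical, and the scoring of orders and orientations is left out.

```python
from collections import Counter

def candidate_neighbors(line_barcodes, query, min_shared=2):
    """Rank other lines by the number of barcodes they share with `query`.

    line_barcodes: dict mapping a line id to the set of barcodes seen on it
                   (hypothetical input format).
    Returns (line id, shared barcode count) pairs, most shared first.
    """
    shared = Counter()
    query_set = line_barcodes[query]
    for line, barcodes in line_barcodes.items():
        if line == query:
            continue
        count = len(barcodes & query_set)
        if count >= min_shared:  # ignore weak, possibly accidental sharing
            shared[line] = count
    return shared.most_common()

# Toy data: line2 shares three barcodes with line1, line3 only one, line4 none.
lines = {
    "line1": {"bc1", "bc2", "bc3", "bc4"},
    "line2": {"bc2", "bc3", "bc4", "bc9"},
    "line3": {"bc4", "bc7"},
    "line4": {"bc8"},
}
print(candidate_neighbors(lines, "line1"))
```

Because each partition holds several molecules, a single shared barcode can arise by chance, which is why the sketch asks for more than one.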

Once the assembly has been scaffolded, some gaps may be replaced by one or more sequences. For short gaps, read pairs from both sides of the gap reach in and may cover the intervening sequence, from which it may be inferred. For long gaps, we first find the barcodes that are incident upon sequence proximate to the left and right sides of the gap. Then we find all the reads in these barcodes. This set of reads will include reads that properly lie within the gap and yet be roughly 10 times larger than that set (as each partition contains about 10 molecules). We assemble this set of reads. Reads outside the gap locus tend to be at low coverage in this restricted read set and hence not assemble. In this way, it is typically possible to fill in the gap with a chunk of graph and thereby remove the gap from the assembly. The chunk may not be a single sequence. For example, at this stage, heterozygous sites within the gap would typically be manifested as simple bubbles.

The final step in the assembly process is to phase lines. First for each line (Fig. 1), we find all its simple bubbles, i.e., bubbles having just two branches. Then we define a set of molecules. These are defined by a series of reads from the same barcode, incident upon the line, and not having very large gaps (>100 kb) between successive reads. A given molecule then “votes” at certain bubbles, and the totality of this voting (across all molecules on each line) is then used to identify phaseable sections of the line, which are then separated into “megabubble” arms (Fig. 2; Methods).

Software and computational performance
Supernova takes as input FASTQ files. No algorithmic parameters are supplied by the user. Supernova is designed to run on a single Linux server. The peak memory usage across the seven human assemblies of this work was 335 GB, and accordingly we recommend using a server having ≥384 GB RAM. Wall clock run times are shown in Table 1 and are in the range of 2 d.

Supernova output
A Supernova assembly can separate homologous chromosomes over long distances, in this sense capturing the true biology of a diploid genome (Fig. 2). These separated alleles (or phase blocks) are represented as “megabubbles” in the assembly, with each branch representing one parental allele. Sequences between megabubbles are nominally homozygous. Successive megabubbles are not phased relative to each other (if they were, they would have been combined). A chain of megabubbles as shown comprise a scaffold. In addition to large-scale features, the Supernova graph encodes smaller features such as gaps and bubbles at long homopolymers, whose lengths are not fully determined by the data.

A Supernova assembly can be translated into FASTA in several distinct ways that might prove useful for different applications (Fig. 3). These allow representation of the full (or “raw”) graph (Fig. 3A), or erase microfeatures (choosing the most likely branch

Figure 1. Lines in an assembly graph. Each edge represents a DNA sequence. (A) Blue portion describes a line in an assembly graph, which is an acyclic graph part bounded on both ends by single edges. The line alternates between five common segments and four bubbles, three of which have two branches. The third bubble is more complicated. The entire graph may be partitioned so that each of its edges lies in a unique line (allowing for degenerate cases, including single edge lines, and circles). (B) The same line, but now each bubble has been replaced by a bubble consisting of all its paths. After this change, each bubble consists only of parallel edges.

Figure 2. Supernova assemblies encode diploid genome architecture. Each edge represents a sequence. Blue represents one parental allele, and gold represents the other. Megabubble arms represent alternative parental alleles at a given locus, whereas sequences between megabubbles are homozygous (or appear so to Supernova). Successive megabubbles are not phased relative to each other. Smaller scale features appear as gaps and bubbles.


information from heterozygous positions that it identifies (Fig. 1b). Phased reads are then used to assemble haplotigs and primary contigs (backbone contigs for both haplotypes) (Fig. 1c and Supplementary Fig. 1b) that form the final diploid assembly with phased single-nucleotide polymorphisms (SNPs) and structural variants (SVs).

To evaluate the accuracy of FALCON-Unzip, we applied it to a trio of Arabidopsis genomes (Col-0, Cvi-0, and the hybrid Col-0–Cvi-0) and analyzed the results with respect to each other and the TAIR10 reference genome (ref. 25). We also assessed performance on the genomes of Vitis vinifera cv. Cabernet Sauvignon, a highly heterozygous outcrossed grape cultivar of agricultural importance, and on a highly heterozygous wild-type diploid fungus, Clavicorona pyxidata, which has resisted previous short-read assembly approaches.

RESULTS
Sequencing and assembly of an Arabidopsis trio
We individually sequenced and assembled the inbred Col-0 and Cvi-0 genomes using FALCON (Supplementary Table 1). Contig N50 sizes were 7.4 Mb (Col-0) and 6.0 Mb (Cvi-0), about 10 to 100 times more contiguous than other recently published Arabidopsis assembly (ref. 26) (Table 1) and approaching the continuity of the highly curated TAIR10 assembly (10.9 Mbp), which was assembled using expensive BAC sequencing (ref. 25). The largest FALCON contigs spanned entire chromosome arms (Fig. 2), creating a high-quality draft reference for Cvi-0.

When comparing our Col-0 assembly to the TAIR10 assembly, the nucleotide sequence identity was greater than 99.98% (Supplementary Table 2). We applied BUSCO (ref. 27) to evaluate the assembly completeness by identifying a set of highly conserved plant orthologs in the assembly (Supplementary Table 3). BUSCO identified 914 (95.6%) and 906 (94.8%) genes in the Col-0 and Cvi-0 assemblies, respectively, compared with 915 (95.7%) in the TAIR10 reference. The variations between Col-0 and Cvi-0 assemblies are summarized in Table 2.

To assess performance on heterozygous genomes, we generated and assembled short- and long-read sequencing data of the F1 progeny with four leading assembly algorithms (Table 1). Canu (ref. 28) was used to assemble long-read sequence data (Table 1 and Supplementary Fig. 2) from the Col-0–Cvi-0 F1 hybrid sample. The total size of the assembly was 219 Mb, slightly smaller than the expected diploid size of 238 Mb. The high level of polymorphisms, including a SNP rate of ~1/200 bp and 1,051 SVs larger than 50 bp between the strains (Table 2), might cause fragmented assembly, as the algorithm is not currently optimized for diploid genomes. Consequently, the contiguity of the F1 assembly was substantially worse (~3-fold less) than the Canu assembly of either inbred parent alone (Table 1). Short-read assemblies with SOAPdenovo2 (ref. 29) and Platanus (ref. 15), which were designed to assemble heterogeneous diploid genomes, were significantly less contiguous compared with Canu; SOAPdenovo2 assembled a total of 260 Mbp with an N50 = 990 bp even after k-mer optimization and error correction (Supplementary Fig. 3). Contigs assembled using Platanus were marginally improved, with an N50 = 26.9 kbp and a total assembly size of 143 Mbp, which was only slightly larger than the haploid genome size.

Most assemblers generate a single set of contigs, but FALCON generates ‘primary contigs’ (p-contigs) and ‘alternative contigs’ (a-contig) that comprise the genome regions typified by SVs from the p-contigs (see Online Methods). The a-contigs, representing local alternative sequences, spanned a total of 57 Mbp (~40% of the p-contigs) with an N50 = 146 kbp. Thus, FALCON alone produced 84% of the estimated 238-Mbp diploid genome. After the initial assembly, the FALCON-Unzip algorithm used the heterozygosity information within the initial primary contigs for haplotype phasing (Fig. 1b and Supplementary Note). With phasing information from the raw reads, FALCON-Unzip generated a subsequent set of p-contigs and the final haplotig set (h-contigs) that represented more contiguous haplotype-specific sequence information than the a-contigs (Fig. 1c). After the ‘unzipping’ process, the total size of the p-contigs was 140 Mbp (N50 = 7.96 Mbp), and the total size of the haplotigs was 105 Mbp (N50 = 6.92 Mbp). FALCON-Unzip generated phased diploid genome assemblies with continuity comparable to that of the individual inbred parental genomes (Table 1).

Comparison of the F1 assembly of FALCON-Unzip, Platanus, and SOAPdenovo2 directly with the TAIR10 reference is detailed in the Supplementary Note (Supplementary Fig. 4 and Supplementary Table 4). Overall, the variants from the FALCON-Unzip assembly captured 89% of the Platanus variants and 90% of the SOAP variants at a stringent requirement of the exact same variant type, size, and genomic location. However, the Platanus and SOAP assemblies captured only 37% and 1% of the FALCON-Unzip variants, respectively.

Col-0–Cvi-0 F1 haplotig phasing quality
We aligned p-contigs and haplotigs to the parental inbred assemblies to evaluate the accuracy of haplotype separations. Ideally, each haplotig should be identical to one of the parental haplotypes and show variations against the other. We observed that most of the haplotigs only showed SNPs or SVs in one of the parental genomes, indicating that the phasing approach works accurately (Fig. 2 and Supplementary Fig. 5). We assessed accuracy by computing the ratio of differences (for example, SNPs) to either of the parental assemblies within each haplotig (Supplementary Table 5). For the largest six haplotigs spanning 50% of the genome, the minority SNP percentages were all lower than 0.2%. The small minority SNP ratio represents either a small number of (i) local phasing errors,

[Figure 1 panel labels: (a) FALCON: initial assembly graph; primary contig with associate contigs 1 and 2 (alternative alleles); SNPs and SVs marked. (b) Phase heterozygous SNPs and identify the haplotype of each read. (c) FALCON-Unzip: haplotype-resolved assembly graph; assembly output is an updated primary contig with haplotigs 1, 2 and 3.]

Figure 1 | Overview of FALCON and FALCON-Unzip. (a) An initial assembly is computed by FALCON, which error corrects the raw reads (not shown) and then assembles them using a string graph of the read overlaps. The assembled contigs are further refined by FALCON-Unzip into a final set of contigs and haplotigs. (b) Phase heterozygous SNPs and group reads by haplotype. (c) The phased reads are used to open up the haplotype-fused path and generate as output a set of primary contigs and associated haplotigs.

homozygous segments + heterozygosity bubbles

Weisenfeld & al (2017); Chin et al. Nature Methods 13:1050 (2016)

Structural variants

1. small indels

2. copy number variation

3. rearrangements

Structural variants

categories: deletion, insertion, duplication, inversion/translocation

[Figure 1 panel labels (Nature Reviews | Genetics): deletion, novel sequence insertion, mobile-element insertion, tandem duplication, interspersed duplication, inversion, translocation; each panel shows the reference (Ref.) above the test genome.]

Figure 1 | Classes of structural variation. Traditionally, structural variation refers to genomic alterations that are larger than 1 kb in length, but advances in discovery techniques have led to the detection of smaller events. Currently, >50 bp is used as an operational demarcation between indels and copy number variants (CNVs). The schematic depicts deletions, novel sequence insertions, mobile-element insertions, tandem and interspersed segmental duplications, inversions and translocations in a test genome (lower line) when compared with the reference genome.

Array comparative genomic hybridization (Array CGH). A technique based on competitively hybridizing fluorescently labelled test and reference samples to a known target DNA sequence immobilized on a solid glass substrate and then interrogating the hybridization ratio.

SNP microarrays. Hybridization-based assays in which the target DNA sequences are discriminated on the basis of a single base difference. Assays are processed with a single sample per array and perform both SNP genotyping and copy-number interrogation.

Single-base extension. Single-base-extension reactions use a primer that binds to a region of interest and follow this with an extension reaction that allows the incorporation of a single base after the primer.

technologies infer copy number gains or losses compared to a reference sample or population, but differ in the details and application of the molecular assays.

Array CGH. Array CGH platforms are based on the principle of comparative hybridization of two labelled samples (test and reference) to a set of hybridization targets (typically long oligonucleotides or, historically, bacterial artificial chromosome (BAC) clones). The signal ratio is then used as a proxy for copy number (see BOX 1 for details). An important consideration is the effect of the reference sample on the copy-number profile. For example, when only one sample is examined, a loss in the reference sample is indistinguishable from a gain in the test sample. For this reason, a well-characterized reference is key to interpretation of array CGH data (ref. 19). Early studies of germline CNVs were based on BAC arrays or low-resolution oligonucleotide platforms and allowed detection of CNVs typically greater than 100 kb (refs 1, 2, 6) (BOX 2). These initial studies highlighted the incredible number of CNVs observed in healthy individuals; however, the breakpoints of these alterations were not sufficiently well-defined to allow accurate assessment of the proportion of the genome altered or its gene content. This led to a drastic overestimation of the extent of copy-number polymorphism using large-insert BAC clones (ref. 2), which was subsequently refined by oligonucleotide microarrays or sequence-based studies of the same DNA samples (refs 4, 5, 20, 21).
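A small Python sketch can make the role of the reference concrete. It assumes an idealized probe whose signal is exactly proportional to copy number (no noise, no dye bias, no saturation), which real arrays only approximate; the function name `expected_log2_ratio` is illustrative.

```python
import math

def expected_log2_ratio(test_copies, ref_copies=2):
    """Ideal log2(test / reference) signal for one probe, ignoring all noise."""
    if test_copies <= 0 or ref_copies <= 0:
        raise ValueError("a log ratio needs positive copy numbers")
    return math.log2(test_copies / ref_copies)

examples = [
    ("heterozygous deletion in test", 1, 2),
    ("no change", 2, 2),
    ("single-copy gain in test", 3, 2),
    ("two-copy gain in test", 4, 2),
    ("heterozygous loss in reference", 2, 1),
]
for label, test, ref in examples:
    print(f"{label}: {expected_log2_ratio(test, ref):+.2f}")
```

The last two examples give the same ratio, which is the ambiguity the text describes: from the ratio alone, a loss in the reference cannot be told apart from a gain in the test sample.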

Currently, Roche NimbleGen and Agilent Technologies are the major suppliers of whole-genome array CGH platforms and routinely produce arrays with up to 2.1 million (2.1M) and 1M long oligonucleotides (50–75-mers), respectively, per microarray. Detection of a CNV typically requires a signal from at least 3 to 10 consecutive probes (BOX 1); as a result, SNP and CGH microarrays can routinely detect anywhere from dozens to several hundred events per genome depending on the platform applied (BOXES 1,2). Two studies have recently used ultra-high-resolution arrays (24M to 42M probes) for array CGH-based SV discovery in samples from HapMap individuals (refs 5, 19). Although such high-density arrays are not practical for a large number of samples (30 and 40 samples were used in these studies), these approaches enabled the discovery of CNVs down to 500 bp, with breakpoints precise enough to allow the identification of sequence motifs at a subset of variants. One key advantage of array CGH platforms is the availability of custom, high-probe-density arrays from both major manufacturers. This has led to their widespread adoption in clinical diagnostics, essentially replacing karyotype analysis as the primary means of detecting copy-number alterations among children with developmental delay (ref. 22).

SNP arrays. SNP microarray platforms are also based on hybridization, with a few key differences from CGH technologies. First, hybridization is performed on a single sample per microarray, and log-transformed ratios are generated by clustering the intensities measured at each probe across many samples (refs 20, 23, 24). Second, SNP platforms take advantage of probe designs that are specific to single-nucleotide differences between DNA sequences, either by single-base-extension methods (Illumina) or differential hybridization (Affymetrix) (refs 20, 23, 24). One key disadvantage is that, per probe, SNP microarrays tend to offer lower signal-to-noise ratio than do the best array CGH platforms. This is apparent in comparisons of array CGH and SNP platforms in terms of detection of CNVs by a purely ratio-based approach (refs 24–27). However, a key advantage of SNP microarrays is the use of SNP allele-specific probes to increase CNV sensitivity, distinguish alleles and identify regions of uniparental disomy through the calculation of a metric termed B allele frequency (BAF) (BOX 1).
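As a rough illustration of why BAF helps with copy number, the following Python sketch lists the idealized BAF values expected at a SNP for a given total copy number. It assumes BAF is simply the number of B alleles divided by the total number of copies; array software instead derives BAF from normalized intensities relative to genotype clusters, so this is a simplification, and `expected_baf_values` is a hypothetical helper name.

```python
from fractions import Fraction

def expected_baf_values(copy_number):
    """Idealized BAF clusters: (number of B alleles) / (total copies)."""
    if copy_number < 1:
        return []  # homozygous deletion: no allelic signal to measure
    return [Fraction(b, copy_number) for b in range(copy_number + 1)]

for cn in (1, 2, 3):
    values = ", ".join(str(v) for v in expected_baf_values(cn))
    print(f"copy number {cn}: {values}")
```

Under this idealization a single-copy region loses the 1/2 band, and a three-copy region replaces it with bands at 1/3 and 2/3, a change that a log ratio alone does not show.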

SNP arrays have proved popular in CNV-detection studies, historically as complements to array CGH platforms for fine-mapping regions (ref. 2) and currently in the large-scale discovery of CNVs in a broad variety of populations (refs 16, 20, 23, 28, 29). Early SNP arrays demonstrated poor coverage of CNV regions, but recent arrays (such as the Affymetrix 6.0 SNP and Illumina 1M platforms) incorporate better SNP selection criteria for complex regions of the genome and non-polymorphic copy-number probes (which are examined for log ratios but not BAF) (refs 20, 23, 30). Another important consideration is the choice of population because the average heterozygosity affects the proportion of SNPs that will generate a meaningful BAF signal (typically, heterozygosity is 30–40% in Illumina platforms). This is particularly relevant when dealing with populations that may have experienced a drastic bottleneck, as opposed to more outbred populations, and thus may affect the number of probes needed to identify an alteration (refs 23, 24). Some studies combine array CGH and SNP platforms to offer higher confidence in CNV detection (refs 2, 20, 30).


Alkan, Coe & Eichler Nature Reviews Genetics 12:363 (2011)

SV signatures

[Figure 2 panel labels (Nature Reviews | Genetics): rows are SV classes (deletion, novel sequence insertion, mobile-element insertion, inversion, interspersed duplication, tandem duplication); columns are read pair, read depth, split read and assembly. Other labels: MEI, annotated transposon, RP, assemble, contig/scaffold, align to Repbase, not applicable.]

Figure 2 | Structural variation sequence signatures. There are four general sequence-based analytical approaches used to detect structural variation. Theoretically, read-pair (RP), split-read and assembly methods can be used to discover variants from all classes of structural variant (SV), but each has different biases depending on the underlying sequence content of the variants and the data properties of the sequence reads. However, read-depth approaches can be used to detect only losses (deletions) and gains (duplications), and cannot discriminate between tandem and interspersed duplications. Briefly, read-pair methods analyse the mapping information of paired-end reads and their discordancy from the expected span size and mapped strand properties. Sensitivity, specificity and breakpoint accuracy are dependent on the read length, insert size and physical coverage (refs 3, 4, 59, 62, 65, 66, 68, 69). Breakpoints are indicated by red arrows. Read-depth analysis examines the increase and decrease in sequence coverage to detect duplications and deletions, respectively, and predict absolute copy numbers of genomic intervals (refs 45, 62, 74–76). Split-read algorithms are capable of detecting exact breakpoints of all variant classes by analysing the sequence alignment of the reads and the reference genome; however, they usually require longer reads than the other methods and have less power in repeat- and duplication-rich loci (refs 62, 78, 79). Assembly algorithms (refs 83–86, 115) have the most power to detect SVs of all classes at the breakpoint resolution, but assembling short sequences and inserts often result in contig/scaffold fragmentation in regions with high repeat and duplication content (ref. 89). MEI, mobile-element insertion. Repbase is a database of repetitive elements.
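The read-pair signature can be sketched as a simple classifier in Python. This is an illustration under stated assumptions, not any published caller: it assumes a standard paired-end library in which a concordant pair maps to the same chromosome as forward ('+') on the left and reverse ('-') on the right, and it treats a span more than `n_sd` standard deviations from the mean insert size as discordant. The function name, the labels and the cutoff are all hypothetical.

```python
def classify_read_pair(left_pos, right_pos, left_strand, right_strand,
                       mean_insert, sd_insert, n_sd=3.0):
    """Label one read pair mapped to a single chromosome by span and orientation.

    left_pos < right_pos are the outer mapping coordinates of the two reads.
    """
    if left_strand == right_strand:
        return "inversion"            # both reads map to the same strand
    if (left_strand, right_strand) == ("-", "+"):
        return "tandem_duplication"   # everted pair, reads point away from each other
    span = right_pos - left_pos
    if span > mean_insert + n_sd * sd_insert:
        return "deletion"             # pair maps farther apart than the library allows
    if span < mean_insert - n_sd * sd_insert:
        return "insertion"            # pair maps closer together than expected
    return "concordant"

# Library with a 400 bp mean insert and 30 bp standard deviation.
print(classify_read_pair(1000, 1410, "+", "-", 400, 30))
print(classify_read_pair(1000, 3000, "+", "-", 400, 30))
print(classify_read_pair(1000, 1100, "+", "-", 400, 30))
```

A single discordant pair is weak evidence; practical methods cluster many pairs that support the same breakpoints before reporting a variant.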


Alkan, Coe & Eichler Nature Reviews Genetics 12:363 (2011)


Fragment coverage — distribution


[Figure: fragments laid along a target of length L, with labels island, ocean, observed island, fragment, and contigs 1 to 3.]

number of fragments: $n$
length of a fragment: $\ell$
length of target (chromosome): $L$
minimal overlap for detection: $\theta\ell$, $0 < \theta < 1$

coverage: $c = n\ell/L$

Questions

- number of positions covered by at least 1 (at least k) fragments
- number of islands
- number of contigs
- length of an island

Lander-Waterman statistics: approximation by Poisson process

Coverage by ≥ 1 fragment

Thm. The probability that a given position is covered by at least 1 fragment is $\approx 1 - e^{-c}$.

Proof.

Prob. that a given fragment covers it: $p = \ell/(L - \ell + 1) \approx \ell/L$.
Prob. that none does: $(1-p)^n \approx \left(1 - \frac{\ell}{L}\right)^n$.

Approximation: $(1 - a/x)^x \approx e^{-a}$.

Thm. For $k = 0, 1, \dots$, the probability that exactly $k$ fragments cover a given position is $p_k \approx \frac{c^k}{k!} e^{-c}$.

Proof. Probability equals $\binom{n}{k} p^k (1-p)^{n-k} \approx p_k$, which converges to Poisson distribution with parameter $\lambda = np = c$.

Approximation by Poisson process with intensity c
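The two coverage theorems can be checked numerically with a short Python sketch. The function names and the example values (100,000 fragments of 500 bp on a 5 Mb target, so $c = 10$) are assumptions chosen for illustration, not data from the slides.

```python
import math

def coverage(n, frag_len, target_len):
    """c = n * l / L."""
    return n * frag_len / target_len

def prob_covered(c):
    """Poisson approximation: P(position covered by at least one fragment)."""
    return 1.0 - math.exp(-c)

def prob_depth(k, c):
    """Poisson approximation: P(exactly k fragments cover a position)."""
    return math.exp(-c) * c ** k / math.factorial(k)

def exact_prob_uncovered(n, frag_len, target_len):
    """(1 - p)^n with p = l / (L - l + 1), before any approximation."""
    p = frag_len / (target_len - frag_len + 1)
    return (1.0 - p) ** n

n, frag_len, target_len = 100_000, 500, 5_000_000
c = coverage(n, frag_len, target_len)
print("coverage c =", c)
print("P(uncovered), exact vs Poisson:",
      exact_prob_uncovered(n, frag_len, target_len), math.exp(-c))
print("P(depth = 10):", prob_depth(10, c))
```

For these values the exact binomial expression and $e^{-c}$ agree to within about $10^{-7}$, which is why the Poisson form is used throughout.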

Oceans

(uncovered gaps)

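Thm. The expected number of (true) oceans is $\approx n e^{-c} = \frac{L}{\ell}\, c\, e^{-c}$.

Proof. Probability that a given fragment is the last one in an island:

$p_{\text{last}} = \left(1 - \frac{\ell}{L}\right)^{n-1}$

+ approximation as before.
Expectation $= n\, p_{\text{last}}$.

Thm. The expected number of observed oceans is $\approx n e^{-c(1-\theta)} = \frac{L}{\ell}\, c\, e^{-c(1-\theta)}$.

Proof. Probability that a given fragment is the last one in an observed island:

$\left(1 - \frac{(1-\theta)\ell}{L}\right)^{n-1} \approx \left(1 - \frac{c(1-\theta)}{n}\right)^{n} \approx e^{-c(1-\theta)}$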
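A minimal Python sketch of both ocean formulas follows. The function name `expected_oceans` and the example values (100,000 fragments of 500 bp on a 5 Mb target, overlap fraction 0.2) are illustrative assumptions; setting `theta` to 0 recovers the true-ocean count.

```python
import math

def expected_oceans(n, frag_len, target_len, theta=0.0):
    """Expected number of oceans, n * exp(-c * (1 - theta)), with c = n * l / L.

    theta is the minimal detectable overlap as a fraction of the fragment length;
    theta = 0 counts true oceans, theta > 0 counts observed oceans.
    """
    c = n * frag_len / target_len
    return n * math.exp(-c * (1.0 - theta))

n, frag_len, target_len = 100_000, 500, 5_000_000  # coverage c = 10
print("true oceans:    ", round(expected_oceans(n, frag_len, target_len), 2))
print("observed oceans:", round(expected_oceans(n, frag_len, target_len, theta=0.2), 2))
```

Requiring a 20% overlap behaves like lowering the coverage from $c$ to $c(1-\theta)$, so the number of observed gaps grows by a factor $e^{c\theta}$.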

Application : compression and CNV

compression of repeated regions in assembly, or duplicated genomic segments can be recognized by higher coverage

log-likelihood ratio for the number of fragments $k$ in a region of length $\rho\ell$:
- null hypothesis: no compression (coverage $c$)
- alt. hypothesis: there is compression (coverage $2c$)

$\mathrm{LLR} = \log \frac{e^{-2c\rho}(2c\rho)^k/k!}{e^{-c\rho}(c\rho)^k/k!} = \log\left(e^{-c\rho} \cdot 2^k\right) = k \log 2 - \rho c \log e$

⇒ threshold for declaring duplication: LLR > 0 if $k > \rho c/\ln 2 \approx 1.44\,\rho c$
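The test above can be written as a short Python sketch, using natural logarithms so that the LLR is $k \ln 2 - \rho c$. The function names and the example region ($\rho = 10$ fragment lengths at coverage $c = 10$, so 100 fragments are expected under the null) are assumptions for illustration.

```python
import math

def poisson_log_pmf(k, lam):
    """Natural log of the Poisson probability of k events with mean lam."""
    return -lam + k * math.log(lam) - math.lgamma(k + 1)

def duplication_llr(k, rho, c):
    """Closed form of log[Poisson(k; 2*c*rho) / Poisson(k; c*rho)] = k*ln 2 - rho*c."""
    return k * math.log(2) - rho * c

def duplication_threshold(rho, c):
    """Fragment count above which the duplication hypothesis is favoured."""
    return rho * c / math.log(2)

rho, c = 10, 10
print("threshold:", round(duplication_threshold(rho, c), 2))
for k in (100, 150, 200):
    print(k, round(duplication_llr(k, rho, c), 2))
```

The closed form matches the difference of the two Poisson log-probabilities, and the $k!$ term cancels, so the decision depends only on $k$, $\rho$ and $c$.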

Coverage for diagnostics

cell-free DNA in pregnant mother’s blood: can test for aneuploidy in baby

total number of sequence tags obtained for different samples. (From this point forward, “sequence tag density” refers to the normalized value and is used for comparing different samples and for subsequent analysis). The interchromosomal variation within each sample was also consistent among all samples (including genomic DNA control). The mean sequence tag density of each chromosome correlates with the GC content of the chromosome (P < 10^−9) (Fig. S1 A and B). The standard deviation of sequence tag density for each chromosome also correlates with the absolute degree of deviation in chromosomal GC content from the genome-wide GC content (P < 10^−12) (Fig. S1 A and C). The GC content of sequenced tags of all samples (including the genomic DNA control) was, on average, ~10% higher than the value of the sequenced human genome (41%) (21) (Table S1), suggesting that there is a strong GC bias stemming from the sequencing process. We plotted in Fig. 1A the sequence tag density for each chromosome (ordered by increasing GC content) relative to the corresponding value of the genomic DNA control to remove such bias.

Detection of Fetal Aneuploidy. The distribution of chromosome 21 sequence tag density for all nine T21 pregnancies is clearly separated from that of pregnancies bearing disomy 21 fetuses (P < 10^−5, Student’s t test) (Fig. 1 A and B). The coverage of chromosome 21 for T21 cases is ~4–18% higher (average ~11%) than that of the disomy 21 cases. Because the sequence tag density of chromosome 21 for T21 cases should be (1 + ε/2) of that of disomy 21 pregnancies, where ε is the fraction of total plasma DNA originating from the fetus (see SI Appendix for derivations), such increase in chromosome 21 coverage in T21 cases corresponds to a fetal DNA fraction of ~8–35% (average ~23%) (Table S1 and Fig. 2). We constructed a 99% confidence interval of the distribution of chromosome 21 sequence tag density of disomy 21 pregnancies. The values for all nine T21 cases lie outside the upper boundary of the confidence interval, and those for all nine disomy 21 cases lie below the boundary (Fig. 1B). If we used the upper bound of the confidence interval as a threshold value for detecting T21, the minimum fraction of fetal DNA that would be detected is ~2%.
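The relation between overrepresentation and fetal fraction can be made concrete with a small Python sketch. It assumes only what the passage states: in a trisomy 21 pregnancy the chromosome 21 tag density exceeds the disomy value by half the fetal DNA fraction. The function names are illustrative, and the inputs are the rounded bounds quoted in the text rather than per-sample data.

```python
def expected_relative_density(fetal_fraction):
    """Chromosome 21 tag density of a T21 case relative to disomy: 1 + fraction / 2."""
    return 1.0 + fetal_fraction / 2.0

def fetal_fraction_from_density(relative_density):
    """Invert the relation above to estimate the fetal DNA fraction."""
    return 2.0 * (relative_density - 1.0)

# Rounded overrepresentation values quoted in the text: 4% and 18%.
for ratio in (1.04, 1.18):
    print(f"relative density {ratio:.2f} -> fetal fraction {fetal_fraction_from_density(ratio):.0%}")
```

Doubling the rounded bounds gives 8% and 36%, close to the reported range of about 8 to 35%; the small difference is expected because the published figures come from unrounded per-sample values.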

[Figure 1, panel A. x-axis: chromosome, ordered by increasing GC content (4, 13, 5, 6, 3, 18, 8, 2, 7, 12, 21, 14, 9, 11, 10, 1, 15, 20, 16, 17, 22, 19); y-axis: sequence tag density relative to the corresponding value of gDNA control (0.6–1.5). Legend: plasma DNA from woman bearing T21 fetus; plasma DNA from woman bearing normal fetus; plasma DNA from woman bearing T18 fetus; plasma DNA from woman bearing T13 fetus; plasma DNA from normal adult male.]

[Figure 1, panel B. Title: chromosome 21. x-axis groups: trisomy 21 fetuses, disomy 21 fetuses, adult male plasma DNA; y-axis: sequence tag density of chromosome 21 relative to the median value of disomy 21 cases (0.95–1.2).]

Fig. 1. Fetal aneuploidy is detectable by the overrepresentation of the affected chromosome in maternal blood. (A) Sequence tag density relative to the corresponding value of genomic DNA control; chromosomes are ordered by increasing GC content. (B) Chromosome 21 sequence tag density relative to the median chromosome 21 sequence tag density of the normal cases. Note that the values of three disomy 21 cases overlap at 1.0. The dashed line represents the upper boundary of the 99% confidence interval constructed from all disomy 21 samples. Number of disomy 21 samples = 9. Number of trisomy 21 samples = 9.

Fan et al. PNAS | October 21, 2008 | vol. 105 | no. 42 | 16267


Plasma DNA of pregnant women carrying T18 fetuses (two cases) and a T13 fetus (one case) was also directly sequenced. Overrepresentation was observed for chromosomes 18 and 13 in T18 and T13 cases, respectively (Fig. 1A). Although there were not enough positive samples to measure a representative distribution, it is encouraging that all three of these positives are outliers from the distribution of disomy values. The T18 cases are large outliers and are clearly statistically significant (P < 10⁻⁷), whereas the statistical significance of the single T13 case is marginal (P < 0.05). Fetal DNA fraction was also calculated from the overrepresented chromosome as described above (Fig. 2 and Table S1).

Fetal DNA Fraction in Maternal Plasma. Using digital TaqMan PCR for a single locus on chromosome 1, we estimated the average cell-free DNA concentration in the sequenced maternal plasma samples to be ~360 cell equivalents per milliliter of plasma (range: 57–761 cell equivalents per milliliter of plasma) (Table S1), in rough accordance with previously reported values (13). The cohort included 12 male pregnancies (6 normal cases, 4 T21 cases, 1 T18 case, and 1 T13 case) and 6 female pregnancies (5 T21 cases and 1 T18 case). DYS14, a multicopy locus on chromosome Y, was detectable in maternal plasma by real-time PCR in all of the male pregnancies but not in any of the female pregnancies (data not shown). The fraction of fetal DNA in maternal cell-free plasma DNA is usually determined by comparing the amount of a fetal-specific locus (such as the SRY locus on chromosome Y in male pregnancies) to that of a locus on any autosome that is common to both the mother and the fetus by using quantitative real-time PCR (13, 22, 23). We applied a similar duplex assay on a digital PCR platform (see Materials and Methods) to compare the counts of the SRY locus and a locus on chromosome 1 in male pregnancies. The SRY locus was not detectable in any plasma DNA samples from female pregnancies. We found with digital PCR that for the majority of samples, fetal DNA constituted <10% of total DNA in maternal plasma (Table S1), agreeing with previously reported values (13).

The percentage of fetal DNA among total cell-free DNA in maternal plasma can also be calculated from the density of sequence tags of the sex chromosomes for male pregnancies. By comparing the sequence tag density of chromosome Y of plasma DNA from male pregnancies to that of adult male plasma DNA, we estimated fetal DNA percentage to be, on average, ~19% (range: 4–44%) for all male pregnancies (Table S1 and Fig. 2). Because human males have one fewer chromosome X than human females, the sequence tag density of chromosome X in male pregnancies should be (1 − ε/2) times that of female pregnancies, where ε is the fetal DNA fraction (see SI Appendix for derivation). We indeed observed underrepresentation of chromosome X in male pregnancies as compared with that of female pregnancies (Fig. S2). Based on the data from chromosome X, we estimated fetal DNA percentage to be, on average, ~19% (range: 8–40%) for all male pregnancies (Table S1 and Fig. 2). The fetal DNA percentages estimated from chromosomes X and Y for each male pregnancy sample correlated with each other (P = 0.0015) (Fig. S3).
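The two sex-chromosome estimates can be sketched as below, writing the fetal fraction as ε. The chromosome X formula inverts the (1 − ε/2) relation stated in the text. The chromosome Y formula rests on an assumption made here for illustration: adult male plasma carries one Y per genome and the mother none, so the Y density ratio equals ε directly. Function names and input values are hypothetical.

```python
def fetal_fraction_from_chr_y(male_pregnancy_y, adult_male_y):
    """Fetal fraction from chrY density in a male pregnancy.

    Assumes only the fetal share eps of the plasma DNA carries a Y
    chromosome, while adult male plasma carries it throughout, so the
    ratio of the two densities is eps itself.
    """
    return male_pregnancy_y / adult_male_y


def fetal_fraction_from_chr_x(male_pregnancy_x, female_pregnancy_x):
    """Fetal fraction from chrX depletion in a male pregnancy.

    Inverts ratio = 1 - eps / 2, where ratio is the chrX density of a
    male pregnancy relative to that of female pregnancies.
    """
    return 2.0 * (1.0 - male_pregnancy_x / female_pregnancy_x)


# Illustrative densities in arbitrary units (not values from the study).
eps_y = fetal_fraction_from_chr_y(0.19, 1.0)
eps_x = fetal_fraction_from_chr_x(0.905, 1.0)
```

With these illustrative inputs both estimators give a fetal fraction of about 19%, which is the kind of agreement between the X-based and Y-based estimates that the text reports.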

We plotted in Fig. 2 the fetal DNA fraction calculated from the overrepresentation of the trisomic chromosome in aneuploid pregnancies, the underrepresentation of chromosome X, and the presence of chromosome Y for male pregnancies against gestational age. The average fetal DNA fraction for each sample correlates with gestational age (P = 0.0051), a trend that has also been reported previously (13).
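A simple linear regression of fetal fraction on gestational age, together with the squared correlation coefficient, can be computed as in the sketch below. The function name `linear_fit_r2` and the toy data points are hypothetical; they are not the measurements behind Fig. 2.

```python
import math


def linear_fit_r2(xs, ys):
    """Least-squares line y = slope * x + intercept and squared correlation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)  # Pearson correlation coefficient
    return slope, intercept, r * r


# Toy data (hypothetical): gestational age in weeks, fetal fraction in percent.
weeks = [10, 14, 18, 22, 30]
fraction_pct = [8.0, 15.0, 12.0, 25.0, 30.0]
slope, intercept, r2 = linear_fit_r2(weeks, fraction_pct)
```

For a simple linear regression, the squared correlation coefficient equals the share of variance in fetal fraction explained by gestational age, so a value well below 1 indicates a real but noisy trend.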

Size Distribution of Cell-Free Plasma DNA. We analyzed the sequencing libraries with a commercial lab-on-a-chip capillary electrophoresis system. There is a striking consistency in the peak fragment size, as well as the distribution around the peak, for all plasma DNA samples, including those from pregnant women and the male donor. The peak fragment size was, on average, 261 bp (range: 256–264 bp) (Fig. S4). Subtracting the total length of the Solexa adaptors (92 bp) from 261 bp gives 169 bp as the actual peak fragment size. This size corresponds to the length of DNA wrapped in a chromatosome, which is a nucleosome bound to an H1 histone (24). Because the library preparation includes an 18-cycle PCR, there are concerns that the distribution might be biased. To verify that the size distribution observed in the electropherograms is not an artifact of PCR, we also sequenced cell-free plasma DNA from a pregnant woman carrying a male fetus by using the 454 platform. The sample preparation for this system uses emulsion PCR, which does not require competitive amplification of the sequencing libraries and creates product that is largely independent of the amplification efficiency. The size distribution of the reads mapped to unique locations of the human genome resembled those of the Solexa sequencing libraries, with a predominant peak at 176 bp, after subtracting the length of 454 universal adaptors (Fig. 3 and Fig. S5). These findings suggest that the majority of cell-free DNA in the plasma is derived from apoptotic cells, in accordance with previous findings (22, 23, 25, 26).

Of particular interest is the size distribution of maternal and fetal DNA in maternal cell-free plasma. Two groups have previously shown that the majority of fetal DNA falls in the size range of a mononucleosome (<200–300 bp), whereas maternal DNA is longer (22, 23). Because 454 sequencing has a targeted read length of 250 bp, we interpreted the small peak at ~250 bp (Fig. 3 and Fig. S5) as the instrumentation limit from sequencing higher-molecular-mass fragments. We plotted the distribution of all reads and those mapped to the Y chromosome (Fig. 3). We observed a slight depletion of Y-chromosome reads in the higher end of the distribution. Reads <220 bp constitute 94% of Y-chromosome reads and 87% of the total reads. Our results are not in complete agreement with previous findings in that we do not see as dramatic an enrichment of fetal DNA at short lengths (22, 23). Future studies will be needed to resolve this point and to eliminate any potential residual bias in the 454 sample preparation process, but it is worth noting that the ability to sequence single plasma samples permits one to measure the distribution in length enrichments across many individual

[Figure 2. x-axis: gestational age (weeks), 0–40; y-axis: percentage of maternal cell-free DNA that originates from the fetus (%), 0–50; regression annotation: R² = 0.3971. Legend: normal male, estimated from chrX; normal male, estimated from chrY; T21 male, estimated from chrX; T21 male, estimated from chrY; T21, estimated from chr21; T18 male, estimated from chrX; T18 male, estimated from chrY; T18, estimated from chr18; T13 male, estimated from chrX; T13 male, estimated from chrY; T13, estimated from chr13; detection limit.]

Fig. 2. Fetal DNA fraction and gestational age. The fraction of fetal DNA in maternal plasma correlates with gestational age. Fetal DNA fraction was estimated in three different ways: (i) from the additional amount of chromosomes 13, 18, and 21 sequences for T13, T18, and T21 cases, respectively; (ii) from the depletion in amount of chromosome X sequences for male cases; (iii) from the amount of chromosome Y sequences present for male cases. The horizontal dashed line represents the estimated minimum fetal DNA fraction required for the detection of aneuploidy. For each sample, the values of fetal DNA fraction calculated from the data of different chromosomes were averaged. There is a statistically significant correlation between the average fetal DNA fraction and gestational age (P = 0.0051). The dashed line represents the simple linear regression line between the average fetal DNA fraction and gestational age. The R² value represents the square of the correlation coefficient.

16268 | www.pnas.org/cgi/doi/10.1073/pnas.0808319105 Fan et al.

• non-invasive (for the baby)

• gives a better detection rate and fewer false positives

• can be done early

Fan & al PNAS 105:16266 (2008)