Graham Taylor - The future of DNA sequencing technology

The Future of DNA Sequencing Technology Graham Taylor

Melbourne University, Human Variome Project (Australia), Victorian Clinical Genetics Laboratories

Context and Topics

1.  Technology review
•  Idiosyncratic selection of noteworthy developments and trends in NGS hardware and software, from the perspective of Genomic Medicine (so mostly human genome) and in the context of meeting clinical needs
•  Sadly, not covering Transcriptomics, ChIP-seq, non-human genomes

2.  Applications and implications for diagnostics

An Ultimate Goal for Sequence Analysis?

For sequencing
–  Chromosome-length reads
–  Perfect base calling accuracy
–  Each molecule is read
–  Highly parallel

For analysis
–  De novo assembly
–  Well curated reference resources
–  Data integrated with other biological and medical resources

Research, translation and service

Research:
•  Original
•  Surprising
•  >80% accurate
•  Numerator-driven: get publications
•  Bespoke

Service:
•  Proven
•  Predictable
•  >99.99% accurate
•  Denominator-driven (cost sensitive)
•  Standardised

Cost and performance

[Figure: cost per base; Illumina share price]

Now is the winter of our discount tests (unless you are Illumina)

The case for disease-centric analysis

•  $1,000 genomes or 1,000 x $1 interesting regions?
•  How to validate 3.5 x 10^9 tests
•  Sequencing costs are not limiting
•  Quality and accuracy are incomplete
•  Perform tests for a (clinical) reason

Sequence performance and clinical needs

[Figure axes: number of reads; length of reads]

                              Genetics   Tumor-Analysis   Microbiology
Sample/library preparation       3             4               4
Base calling accuracy            5             5               3
De novo assembly                 3             5               4
Detect rare events               3             5               5
Portability                      2             3               4

How many variants per exome?

SNP count             Study
20,000                Choi et al. PNAS 2009
142,000               Mullikin NIH, unpublished 2010
50,000                Clark et al. Nature Biotechnology 2011
125,000               Smith et al. Genome Biology 2011
100,000               Johnston & Biesecker Human Molecular Genetics 2013
200,000 to 400,000    Yang et al. N Engl J Med 2013

•  20-fold range
•  Exome designs vary
•  Likely to be a higher variant count in African populations, as the reference sequence is non-African

Low concordance of multiple variant-calling pipelines
O'Rawe et al. Genome Medicine 2013, 5:28

SNV concordance: 57.4%
Indel concordance: 26.8%
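A minimal sketch of one way to express concordance between pipelines: the share of all distinct calls that every pipeline reports. This is one common definition, not necessarily the calculation used by O'Rawe et al.; the function name and the example calls are hypothetical.

```python
def concordance(call_sets):
    """Fraction of all distinct variants that every pipeline reports."""
    sets = [set(calls) for calls in call_sets]
    union = set().union(*sets)
    if not union:
        return 0.0
    shared = set.intersection(*sets)
    return len(shared) / len(union)


# Hypothetical calls, keyed as (chromosome, position, ref, alt).
pipeline_a = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 50, "G", "A")}
pipeline_b = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr3", 10, "T", "C")}

if __name__ == "__main__":
    print(f"{concordance([pipeline_a, pipeline_b]):.1%}")
```

Keying each call on position plus alleles means that two pipelines reporting different alleles at the same site count as discordant.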

Venn diagrams of selected CNV detection methods in real data processing

Duan J, Zhang J-G, Deng H-W, Wang Y-P (2013) Comparative Studies of Copy Number Variation Detection Methods for Next-Generation Sequencing Technologies. PLoS ONE 8(3): e59128. doi:10.1371/journal.pone.0059128 http://www.plosone.org/article/info:doi/10.1371/journal.pone.0059128

De novo Assembly (the unfinished genome)

•  Huddleston et al. Genome Res. 2014. 24: 688-696
–  Within the human genome, there are >900 annotated genes mapping to large segmental duplications. Such genes are typically missing or misassembled in working draft assemblies of genomes
–  The widespread adoption of next-generation sequencing methods for de novo genome assemblies has complicated the assembly of repetitive sequences and their organization
–  Resolved regions that are complex in a genome-wide context but simple in isolation for a fraction of the time and cost of traditional methods using long-read single molecule, real-time (SMRT) sequencing and assembly technology
–  SMRT sequencing of large-insert clones can significantly improve sequence assembly within complex repetitive regions of genomes

Recent past and future RIP Coming soon?

SBS

•  GnuBIO/BioRad: emulsion microfluidics for targeted sequencing and hotspot analysis of rare variants
•  LaserGen: Lightning Terminators™; increased accuracy, longer reads and faster cycle times. Nucleic Acids Res. Oct 2007; 35(19): 6339–6349. Termination of DNA synthesis by N6-alkylated, not 3′-O-alkylated, photocleavable 2′-deoxyadenosine triphosphate. Weidong Wu et al.
•  Qiagen/Intelligent Biosystems
•  QuantuMDx: Nat Biotechnol. 2005 Oct;23(10):1294-301. Multiplexed electrical detection of cancer markers with nanowire sensor arrays. Zheng G, Patolsky F, Cui Y, Wang WU, Lieber CM

Currently SBS are Market Leaders

•  Illumina
•  Ion Torrent Proton
•  PacBio

PacBio

•  English et al. (2012) Mind the Gap: Upgrading Genomes with Pacific Biosciences RS Long-Read Sequencing Technology. PLoS ONE 7(11): e47768
•  Loomis et al. (2012) Sequencing the unsequenceable: Expanded CGG-repeat alleles of the fragile X gene. Genome Research

Nanopores

•  Electronic BioSciences: developing a system with a single/few pores with a very fast rate of sequencing of ~50kb/second
•  Genia: DNA polymerase to incorporate nucleotides with PEG-based NanoTags. As the bases are incorporated the NanoTags are cleaved, allowing them to travel through the pore where they can be measured, generating sequence-specific information
•  IBM: a solid state nanopore using alternating layers of metal and dielectric material to control the rate of passage through the nanopore
•  NABsys: Modified DNA (e.g. by SBH) read via Nanopore. Not yet sequencing, but very long reads
•  NobleGen: combination of optical detection on nanopores
•  Oxford Nanopore: exonuclease and strand-based nanopore methods

Real long reads: Nanopore sequencing

•  8,476 base single read
•  Total time 273 seconds
•  Not production ready

[Figure: wiggle plot of the read signal; Viterbi algorithm for all trinucleotides]

Electron Microscopy

•  ZS Genetics: directly visualizes the sequence of DNA molecules using electron microscopy. Proof of principle by the use of a dUTP nucleotide with a single mercury atom attached to the nitrogenous base. This modification is small enough to allow very long molecules with labels at each A-U to be seen using annular dark-field scanning transmission electron microscopy (ADF-STEM). Microsc Microanal. 2012 Oct;18(5):1049-53. DNA base identification by electron microscopy. Bell DC, Thomas WK, Murtagh KM, Dionne CA, Graham AC, Anderson JE, Glover WR.
•  Reveo: an atomic force microscopy approach called the Omni Molecular Recognizer Application (OmniMoRA) will use arrays of nano-knife edge probes to measure the vibrational characteristics of individual bases on DNA molecules that have been stretched and immobilized on a surface

Electron Microscopy

Progress toward an aberration-corrected low energy electron microscope for DNA sequencing and surface analysis. Mankos M, Shadman K, N'diaye AT, Schmid AK, Persson HH, Davis RW. J Vac Sci Technol B Nanotechnol Microelectron. 2012 Nov;30(6):6F402

Imaging of reduced 5′-/5DTPA/C-20mer on Au substrate: (a) (b) AFM images at two magnifications, (c) height profile along line shown in (a), (d) height profile along line shown in (b), and (e) LEEM images at three different landing energies.

Aiming for 50 megabase reads with phred 60
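To put the target in numbers: a Phred score Q corresponds to a per-base error probability of 10^(-Q/10), so Q60 is one error per million bases. The sketch below applies that standard relationship to a 50 megabase read; the function names are illustrative only.

```python
import math


def phred_to_error_probability(q):
    """Per-base error probability for a Phred quality score."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred quality score for a per-base error probability."""
    return -10 * math.log10(p)


# Expected number of miscalled bases in a single 50 Mb read at Q60.
read_length = 50_000_000
expected_errors = read_length * phred_to_error_probability(60)

if __name__ == "__main__":
    print(f"Expected errors per read: {expected_errors:.0f}")
```

Even at Q60, a read of this length would be expected to carry roughly 50 errors.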

Hardware Trends

•  Clonal sequencing
–  Increasing accuracy
–  Increasing read lengths
–  Increasing read counts

•  Single molecule sequencing
–  PacBio
–  Oxford Nanopore

Increasing read counts via patterned flow cells

•  Patterned flow-cells useful for nucleic acid analysis US 20120316086 A1
•  Kinetic exclusion amplification of nucleic acid libraries WO 2013188582 A1
–  (i) capturing the different target nucleic acids at the amplification sites at an average capture rate, and
–  (ii) amplifying the target nucleic acids captured at the amplification sites at an average amplification rate, wherein the average amplification rate exceeds the average capture rate.

Patterned flow cells, super Poisson kinetics
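Why "super Poisson" matters can be shown with a short calculation. If templates land on amplification sites at random, the number per site follows a Poisson distribution, and the fraction of sites holding exactly one template peaks at 1/e (about 37%) when the mean loading is one per site. Kinetic exclusion aims to beat that ceiling by letting the first captured template fill the site before a second arrives. The sketch below only illustrates the Poisson limit; it does not model the patented chemistry, and the function names are hypothetical.

```python
import math


def poisson_pmf(k, lam):
    """Probability of exactly k templates at a site with mean loading lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def single_occupancy_fraction(lam):
    """Fraction of sites seeded by exactly one template under random loading."""
    return poisson_pmf(1, lam)


if __name__ == "__main__":
    for lam in (0.5, 1.0, 2.0):
        print(f"mean loading {lam}: {single_occupancy_fraction(lam):.3f} singly occupied")
```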

Pseudo-long reads via "Moleculo"

Genome informatics example:

•  Does Moleculo's technology have both a wet lab and a bioinformatics aspect?

•  Yes, it's about 50:50. One doesn't make sense without the other. There are two components: first, there is a molecular biology kit and protocol that takes in genomic DNA and turns it into a sequencer-compatible library. After modifying and tagging the DNA, this allows the second component, the algorithmic part, to take the short reads and reconstruct long reads using those tags. Those are two separate parts. We developed both on campus, and improved upon them after we started the company last year.

Reducing assembly complexity of microbial genomes with single-molecule sequencing

identifying DNA modification, such as methylation patterns, directly from the single-molecule sequencing data [15]. While adoption of this technology was initially slowed by the low accuracy of the single-pass sequences, recent advancements have demonstrated that this drawback can be algorithmically managed to produce assemblies of unmatched continuity [7,8,16]. Steady improvements to the PacBio technology continue to increase read lengths and yield [17], while future technologies promise to combine accuracy with length using either nanopores [11] or advanced sample preparation [18]. Improved microbial genome assembly is an obvious application of these recent developments in long-read sequencing.

Genome assembly is the process of reconstructing a genome from many shorter sequencing reads [19-21]. It is typically formulated as finding a traversal of a properly defined graph of reads, with the ultimate goal of reconstructing the original genome as faithfully as possible. Repeated sequence in the genome induces complexity in the graph and poses the greatest challenge to all assembly algorithms [22]. In addition, repeats are often the focus of analysis [23-25], making their correct assembly critical for subsequent studies. However, repeats can only be resolved by a spanning read or read pair that is uniquely anchored on both sides. Read pairs are typically used due to their length potential (tens of kilobase pairs), but introduce additional complexity because they cannot be precisely sized. Alternatively, long-read sequencing promises to more accurately resolve repeats and directly assemble genomes into their constituent replicons. Figure 1 shows the benefit of increasing read length when assembling Escherichia coli K12 MG1655. This genome can only be assembled into a single contig when the read length exceeds the size of the longest repeat in the genome, a multi-copy rDNA operon. The rDNA operon, sized around 5 to 7 kbp, is the largest repeat class in most bacteria and archaea [26]. Therefore, sequencing reads longer than the rDNA operon, such as those produced by single-molecule sequencing, can automatically close most microbial genomes.

ALLPATHS-LG was the first assembler shown to produce complete microbial genomes using single-molecule sequences [7]. Utilizing a combination of PacBio RS single-molecule reads (2 to 3 kbp), short-range Illumina read pairs (<300 bp insert), and long-range Illumina read pairs (3 to 10 kbp insert), ALLPATHS-LG assembles the Illumina reads first using a de Bruijn graph and incorporates PacBio reads afterwards to patch coverage gaps and resolve repeats. Ribeiro et al. [7] tested this method on 16 genomes and consensus accuracy was measured at 99.9999% on 3 genomes with an available reference. Four of the sixteen genomes were successfully assembled into a complete genome - the remaining genomes were all highly continuous but left unresolved due to large-scale repeats. These results are promising, especially in terms of consensus accuracy; however, the method requires two different sequencing platforms and three library preparations, which limits its efficiency. In addition, the jumping libraries were observed to be inconsistent at spanning large repeats due to biases in the library construction process.

Ideally, complete genomes could be reconstructed from a single fragment library, minimizing costs. Previously, pair libraries were the only sequencing method capable of spanning large repeats, such as the rDNA operon, but the PacBio RS is now capable of producing single-molecule reads of the same length. Leveraging this recent development, we present an approach for microbial genome closure that relies on overlapping and assembling single-molecule reads de novo rather than

Figure 1 Genome assembly graph complexity is reduced as sequence length increases. Three de Bruijn graphs for E. coli K12 are shown for k of 50, 1,000, and 5,000. The graphs are constructed from the reference and are error-free following the methodology of Kingsford et al. [27]. Non-branching paths have been collapsed, so each node can be thought of as a contig with edges indicating adjacency relationships that cannot be resolved, leaving a repeat-induced gap in the assembly. (A) At k = 50, the graph is tangled with hundreds of contigs. (B) Increasing the k-mer size to k = 1,000 significantly simplifies the graph, but unresolved repeats remain. (C) At k = 5,000, the graph is fully resolved into a single contig. The single contig is self-adjacent, reflecting the circular chromosome of the bacterium.
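The effect described in Figure 1 can be reproduced on a toy scale. The sketch below builds an error-free de Bruijn graph from a short, made-up linear sequence containing a 6-base repeat and counts branching nodes: with k shorter than the repeat the graph branches, and once k-1 exceeds the repeat length the branches disappear. The sequence and function names are hypothetical, and this is not the Kingsford et al. methodology, only an illustration of the principle.

```python
from collections import defaultdict


def de_bruijn_edges(sequence, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it."""
    graph = defaultdict(set)
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        graph[kmer[:-1]].add(kmer[1:])
    return graph


def branching_nodes(sequence, k):
    """Count nodes with more than one successor (unresolved repeats)."""
    graph = de_bruijn_edges(sequence, k)
    return sum(1 for successors in graph.values() if len(successors) > 1)


# Toy sequence: the repeat "ACGTAC" occurs twice in different contexts.
toy = "TTG" + "ACGTAC" + "GGA" + "ACGTAC" + "CTT"

if __name__ == "__main__":
    for k in (4, 8):
        print(f"k={k}: {branching_nodes(toy, k)} branching nodes")
```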

Koren et al. Genome Biology 2013, 14:R101 http://genomebiology.com/2013/14/9/R101

Long, single-molecule reads are sufficient for the complete assembly of most known microbial genomes. The assemblies presented here have good likelihood and finished-grade consensus accuracy exceeding 99.9999%.

Koren et al. Genome Biology 2013, 14:R101

Clinical Drivers

•  Manageable workflow
•  Cost efficiency
•  Sensitivity and specificity
•  Referring to the clinical question
•  Depth vs. breadth of coverage

Adapting NGS to purpose

[Figure: control vs. expansion; repeat alleles (GGCCCC)4, (GGCCCC)5, (GGCCCC)6]

Merging forward and reverse reads

Use a HiFi polymerase

CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA

CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATTGATAGCATTTGCA

CAGAAAAAGTAGAAAATGGAAGTTTATGTGATCAAGAAATCGATAGCATTTGCA

CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGTATTTGCA

CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGTA

CAGAAAAAGTAGAAAATGGAAGTCTATGTGATTAAGAAATCGATAGCATTTGCA

TAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA

CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATCTGCA

CAGAAAAAGTAGGAAATGGAAGTCTATGTGATCAAGAAATTGATAGCATTTGCA

CAGAAAGAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA

Discarding rare (wrong) reads

Detecting allele expansion

[Figure: count vs. nucleotide base pair length (110 to 128 bp); series: Control, CG11 (MI095)]

Sample  Case   Gene   Genomic coordinates       RefSeq       Variant (c.)        Variant (p.)          Allele % (Amplivar)  Allele % (MiSeq Reporter)
1       22029  BRCA2  chr13_329116280G>T        NM_000059.3  c.3136G>T           p.Glu1046Ter          26.03%               22.37%
2       22814  BRCA1  chr17_412464430insGA      NM_007300.3  c.1105_1106insTC    p.Asp369ValfsTer6     76.27%               75.65%
3       23074  BRCA2  chr13_329144380delT       NM_000059.3  c.5946delT          p.Ser1982ArgfsTer22   74.36%               74.27%
4       23162  BRCA1  chr17_412760430delCT      NM_007300.3  c.68_69delAG        p.Glu23ValfsTer17     76.04%               78.69%
5       23165  BRCA2  chr13_329140660delAATT    NM_000059.3  c.5574_5577delAATT  p.Ile1859LysfsTer3    100.00%              95.53%
5       23165  BRCA1  chr17_412444380del75      NM_007300.3  c.3005_3079del75    p.Asn1002_Ile1027del  48.71%               Not found
6       23179  BRCA1  chr17_41215948_G>A        NM_007300.3  c.5095C>T           p.Arg1699Trp          85.24%               87.14%
7       23210  BRCA2  chr13_329688360insT       NM_000059.3  c.9266_9267insT     p.Val3091ArgfsTer20   60.06%               59.64%
8       23815  BRCA1  chr17_412562060insA       NM_007300.3  c.374dupT           p.Gln126ProfsTer16    100.00%              94.23%
9       23824  BRCA1  chr17_4125691550delA      NM_007300.3  c.271delT           p.Cys91ValfsTer28     81.62%               83.32%
10      23828  BRCA2  chr13_329128870delATTAC   NM_000059.3  c.4395_4399delATTAC p.Leu1466PhefsTer2    93.20%               91.76%

MSI by NGS

Genotyping

Library Construction

Covaris optimising

Lane   Sample
M      E-gel Quant ladder
1      gDNA
2      A
3      B
4      C
5      50bp ladder

Peak at 200bp; 50 bp ladder, 350bp bright band

B: 213.4 bp   B: 208.8 bp   C: 201.4 bp

Nextera method of Library Construction

Higher coverage, greater reproducibility

[Figure: coefficient of variation vs. coverage]

Can we capture coverage report dosage to diagnostic standards?

[Figure: coverage by sample and target, for autosomal targets and chrX targets]

Inter-sample variation is low, but low coverage prevents dosage estimation

Chr X is a good first pass test for dosage

XX vs. XY

8 Female cases and 16 Male cases showing reproducibility of coverage of X loci within each group. Loci with higher SDs were associated with reduced coverage.
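A minimal sketch of how an X-chromosome dosage check might be computed from per-locus read depth, assuming depth is first normalised to each sample's own autosomal mean and then compared with an XX reference. The locus names, depths and function names are hypothetical; this is not the pipeline used for the data shown.

```python
from statistics import mean


def x_to_autosome_ratio(coverage, x_loci):
    """Mean depth over X loci divided by mean depth over the remaining loci."""
    x_depth = mean(coverage[locus] for locus in x_loci)
    autosomal_depth = mean(
        depth for locus, depth in coverage.items() if locus not in x_loci
    )
    return x_depth / autosomal_depth


def x_dosage(sample, reference_xx, x_loci):
    """Sample X dosage relative to an XX reference (about 1.0 for XX, 0.5 for XY)."""
    return x_to_autosome_ratio(sample, x_loci) / x_to_autosome_ratio(reference_xx, x_loci)


# Hypothetical read depths per target.
x_loci = {"X1", "X2"}
reference_xx = {"auto1": 200, "auto2": 100, "X1": 150, "X2": 150}
sample_xy = {"auto1": 400, "auto2": 200, "X1": 150, "X2": 150}

if __name__ == "__main__":
    print(f"X dosage relative to XX: {x_dosage(sample_xy, reference_xx, x_loci):.2f}")
```

Normalising within each sample first removes differences in overall sequencing depth, which is why the XY sample above still reads as half dosage despite having twice the autosomal coverage.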

[Figure: average coverage of X loci, Average XX vs. Average XY, shown as absolute values (0 to 80) and as percentages (0% to 80%)]

EPCAM exon 9 to MSH2 exon 8 and EPCAM exon 9 to MSH2 exon 1 deletions

[Figure: coverage across target regions on chr2: EPCAM (chr2:47596595-47613802), MSH2 (chr2:47630281-47710138), MSH6 (chr2:48010323-48034049)]

Integrating data handling with sequencing operation

•  Syncing MiSeq (miseq_rsync.sh): copies the MiSeq run directory to the server (every hour)
•  Monitor data (monitor_data.sh): partly auto-fills the metadata file (based on the samplesheet metadata file name)
•  Monitor run status (monitor_run_status_miseq.sh): checks if the run finishes with FASTQs generated on board
•  Monitor metadata (monitor_metadata.sh): checks if the analysis is complete, then if the metadata contains sufficient information for FASTQ renaming. Also checks and starts the analysis workflow in the samplesheet
•  Rename FASTQs (rename_files.py): renames FASTQ files
•  Workflow (/storage/local/sw/system_automagic/workflows): depending on the workflow defined in the samplesheet (Project)

Automatic data population for sample sheets

HiSeq, MiSeq

•  Production: Taffy, CentOS, 11TB
•  Development: S'Box, CentOS (1 TB Systems, 3TB Sandbox)
•  Production: MiSeq Reporter PC, Windows
•  Backup: Biocube, CentOS, RAID6 20TB

•  Image systems at install and after update
•  Snapshots: 14 12-hourly, 4 weekly, 12 monthly, 3 yearly
•  Snapshots (System, Scripts + Software only)
•  Transfer data as needed
•  Transfer new runs and backup analysis
•  Backup: ITS Tape (2 copies); most recent snapshot every 6 months for 3 years rolling

Required additional resources:
1.  MiSeq Reporter Desktop PC
2.  New Desktop PC for Seb
3.  ITS Account
4.  (NAS to expand Biocube)

Costs estimate:
1.  2x Desktop PCs: 3 – 4k
2.  ITS tapes: max 11k/3 years for 20TB
3.  (NAS: approx 10k for 28TB)

Run raw data to keep:
1.  MiSeq: Complete Run Folder
2.  HiSeq: Run Folder excl. Bcl, image files etc. (fastq = raw data)
3.  Compress data after 1 month (excl. fastq)
4.  Discard external data after 2 months

NGS-‐DB

•  Projects: Project_ID, Project_Description, User
•  Experiments: Exp_ID, Project_ID, Exp_Description, Prep_Method, Kit_ID (if applicable), Pipeline(s), LibOperator
•  Samples [LIMS?]: Sample_ID, Patient_ID, Family_ID, Sample_Type, Sample_Source, Sample_Conc, QC_Score
•  Libraries: Library_ID, Sample_ID, Exp_ID, Barcode_1, Barcode_2, LibConc, QC_P/F, Pool_IDs, MiSeqRun_IDs, HiSeqRun_Ids (FC+Lane)
•  MiSeqRuns: MiSeqRun_ID, MiCartridge_ID, MiSeq_LoadConc, SeqOperator, MiFC_ID, MiSeq_RunDate, MiSeq_ClustDens, MiSeq_PFClustDens, MiSeq_Reads, MiSeq_PFReads, ...
•  HiSeqRuns: HiSeqRun_ID, HiFC_ID, SeqOperator, HiSeq_LoadConc, SBS_RGT, Cluster_RGT, cBot_RGT, HiSeq_RunDate, HiSeq_ClustDens, HiSeq_PFClustDens, HiSeq_Reads, HiSeq_PFReads, ...
•  HiSeq_Flowcells: HiFC_ID, HiFC_CatNo, HiFC_LotNo, HiFC_Type, HiFC_expiry, HiSeqRun_ID
•  MiSeq_Flowcells: MiFC_ID, MiFC_CatNo, MiFC_LotNo, MiFC_Type, MiFC_expiry, MiSeqRun_ID
•  HiSeq_SBS (clust, cBot): SBS_RGT, SBS_expiry, SBS_CatNo, SBS_LotNo, SBS_Type, HiSeqRun_ID
•  MiSeq_Cartridges: Cart_ID, Cart_CatNo, Cart_LotNo, Cart_TypeNo, Cart_expiry, MiSeqRun_ID
•  LibraryPools: Pool_ID, Library_IDs, Exp_IDs, Barcodes_1, Barcodes_2, PoolConc, QC_P/F, MiSeqRun_ID(s), HiSeqRun_ID(s), HiSeqRun_Ids (FC+Lane)
•  User: Name, Institution, Contact Info, Billing Info, ...
•  Others: Pipelines, LibOperators, SeqOperators, Kits, PrepMethods

Alternatives to read mapping and alignment

•  Grouped read testing
–  Amplivar, AmbiVert
•  Tiled matching
–  MIST
•  kmer subtraction
–  Diamund
•  Detection of allele expansions

[Workflow diagram: grab amplicons; sort by locus; group amplicons; edit distance. Outputs: coverage statistics; each amplicon read sorted by primers; grouped amplicon variants]

Read Counts, Read Distribution and Analysis

Using the amplimer sequence to grab each amplicon is an alternative to querying the entire sequence output, with the advantage that each set of reads should be more homogeneous and amenable to grouping. Groups above a certain abundance (corresponding to the detection limit selected) can then be compared in detail with the canonical sequence using string comparison tools such as the Levenshtein (edit) distance, or by Smith-Waterman alignment. Using this approach we have confirmed that variants can be identified de novo, but with more interference from sequence errors than by grouped read typing. We have also shown that the current TruSeq Cancer Panel kit co-amplifies a region of chromosome 22 containing a perfect match to the pathogenic KIT Exon 11 c.1669T>A mutation. Artifactual data from the duplicated region risks the reporting of specious variants as false positive results.
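The comparison step described above can be sketched as follows: groups of identical reads above an abundance threshold are compared with the canonical amplicon sequence by Levenshtein distance. This is a minimal illustration, not the Amplivar implementation; the function names, the threshold and the toy sequences are assumptions.

```python
def levenshtein(a, b):
    """Edit distance between two strings (substitutions, insertions, deletions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]


def variant_groups(groups, canonical, min_fraction):
    """Return {sequence: (count, edit distance)} for non-canonical groups
    whose share of all reads is at least min_fraction."""
    total = sum(groups.values())
    return {
        seq: (count, levenshtein(seq, canonical))
        for seq, count in groups.items()
        if seq != canonical and count / total >= min_fraction
    }


# Hypothetical grouped reads for one amplicon: sequence -> read count.
canonical = "ACGTACGT"
groups = {"ACGTACGT": 90, "ACGAACGT": 9, "ACGTACGA": 1}

if __name__ == "__main__":
    print(variant_groups(groups, canonical, min_fraction=0.05))
```

With a 5% detection limit, the group seen 9 times in 100 reads is reported with an edit distance of 1, while the singleton is treated as probable sequencing error and dropped.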

[Figure: average read count per amplicon +/- SEM, for amplicons in NRAS, PIK3CA, KIT, EGFR, BRAF, PTEN and KRAS]

Smith-Waterman

363 reads; common mutation: p.G12A; chr12:25,398,290C>G c.35G>C
TGTATCGTCAAGGCACTCTTGCCTACGCCAGCAGCTCCAACTACCACAAGTTTATATTCAGTCATTTTCAGCAGGCCTTATAATAAAAATAATGAAAATGTGACTATATTAGAACATGTCACACATAAGGTTA

220 reads; wildtype
TGTATCGTCAAGGCACTCTTGCCTACGCCACCAGCTCCAACTACCACAAGTTTATATTCAGTCATTTTCAGCAGGCCTTATAATAAAAATAATGAAAATGTGACTATATTAGAACATGTCACACATAAGGTTA

64 reads; non-coding chr12:25,398,329C>T
TGTATCGTCAAGGCACTCTTGCCTACGCCACCAGCTCCAACTACCACAAGTTTATATTCAGTCATTTTCAGCAGGCTTTATAATAAAAATAATGAAAATGTGACTATATTAGAACATGTCACACATAAGGTTA

39 reads; two errors, non adjacent, one corresponding to c.35G>C
TGTATTGTCAAGGCACTCTTGCCTACGCCAGCAGCTCCAACTACCACAAGTTTATATTCAGTCATTTTCAGCAGGCCTTATAATAAAAATAATGAAAATGTGACTATATTAGAACATGTCACACATAAGGTTA

27 reads; two errors, non adjacent, one corresponding to c.35G>C
TGTATCGTCAAGGCACTCTTGCCTACGCCAGCAGCTCCAACTACCACAAGTTTATATTCAGTCATTTTCAGCAGGCCTTATAATAAAAATAATGAAAATGTGATTATATTAGAACATGTCACACATAAGGTTA

26 reads; two errors, non adjacent, one corresponding to c.35G>C
TGTATCGTCAAGGCACTCTTGCCTACGCCAGCAGCTCCAACTACCACAAGTTTATATTCAGTCATTTTCAGTAGGCCTTATAATAAAAATAATGAAAATGTGACTATATTAGAACATGTCACACATAAGGTTA

17 reads; two non-coding errors, non adjacent
TGTATCGTCAAGGCACTCTTGCCTACGCCACCAGCTCCAACTACCACAAGTTTATATTCAGTCATTTTCCGCAGGCTTTATAATAAAAATAATGAAAATGTGACTATATTAGAACATGTCACACATAAGGTTA

17 reads; corresponding to c.35G>A
TGTATCGTCAAGGTACTCTTGCCTACGCCAGCAGCTCCAACTACCACAAGTTTATATTCAGTCATTTTCAGCAGGCCTTATAATAAAAATAATGAAAATGTGACTATATTAGAACATGTCACACATAAGGTTA

>1141_>EGFR9_78 chr7 55242418-55242511 1141GACTCTGGATCCCAGAAGGTGAGAAAGTTAAAATTCCCGTCGCTATCAAGGAATTAAGAGAAGCAACATCTCCGAAAGCCAACAAGGAAATCCTCGATGTGAGTTTCTGCTTTGCTGTGTGGGGGTCCATGGCT>646_>EGFR9_78 chr7 55242418-55242511 646GACTCTGGATCCCAGAAGGTGAGAAAGTTAAAATTCCCGTCGCTATCAAGACATCTCCGAAAGCCAACAAGGAAATCCTCGATGTGAGTTTCTGCTTTGCTGTGTGGGGGTCCATGGCT>60_>EGFR9_78 chr7 55242418-55242511 60GACTCTGGATCCCAGAAGGTGAGAAAGTTAAAATTCCCGTCGCTATCAAGGAATTAAGAGAAGCAACATCTCCGAAAGCCAACAAGGAAATCCTCGATGTGAGTTTCTGCTTTGCTGTGTGGGGGTTCATGGCT>57_>EGFR9_78 chr7 55242418-55242511 57GACTCTGGATCCCAGAAGGTGAGAAAGTTAAAATTCCCGTCGCTATCAAGGAATTAAGAGAAGCAACATCTCCGAAAGCCAACAAGGAAATCCTCGATGTGAGTTTCTGTTTTGCTGTGTGGGGGTCCATGGCT>54_>EGFR9_78 chr7 55242418-55242511 54GACTTTGGATCCCAGAAGGTGAGAAAGTTAAAATTCCCGTCGCTATCAAGGAATTAAGAGAAGCAACATCTCCGAAAGCCAACAAGGAAATCCTCGATGTGAGTTTCTGCTTTGCTGTGTGGGGGTCCATGGCT>51_>EGFR9_78 chr7 55242418-55242511 51GACTCTGGATCCCAGAAGGTGAGAAAGTTAAAATTCCCGTCGCTATCAAGGAATTAAGAGAAGCAACATCTCCGAAAGCCAACAAGGAAATCTTCGATGTGAGTTTCTGCTTTGCTGTGTGGGGGTCCATGGCT

GACTCTGGATCCCAGAAGGTGAGAAAGTTAAAATTCCCGTCGCTATCAAGGAATTAAGAGAAGCAACATCTCCGAAAGCCAACAAGGAAATCCTCGATGTGAGTTTCTGCTTTGCTGTGTGGGGGTCCATGGCTGACTCTGGATCCCAGAAGGTGAGAAAGTTAAAATTCCCGTCGCTATCAAG---------------ACATCTCCGAAAGCCAACAAGGAAATCCTCGATGTGAGTTTCTGCTTTGCTGTGTGGGGGTCCATGGCT

KRAS point mutation

EGFR 15 base deletion

EGFR 15 base deletion with Smith-Waterman alignment

Amplicon Sequencing: treating reads as groups

Amplivar vs. Alignment

Amplivar                               Alignment (e.g. BWA)
Groups reads                           Uses individual reads
Designed for amplicons                 Designed for randomly sheared fragments
Works with FASTA after filtering       Works with FASTQ
Matches against target list            Aligns against whole genome
Alignment is an optional late stage    Alignment is a required early stage

quality_filter

•  Hard coded quality filters
•  Output FASTA and qcore files
•  FASTA files (.fna) and quality files (.csv) written to merged folder

fastq and fasta

FASTA:

>MISEQ-2:20:000000000-A61NM:1:1101:12299:1738 1:N:0:some_name
TGCGTCATCATCTTTGTCATCGTGTACTACGCCCTGATGGCTGGTGTGGTTTGGTTTGTGGTC

FASTQ:

@MISEQ-2:20:000000000-A61NM:1:1101:12299:1738 1:N:0:some_name
TGCGTCATCATCTTTGTCATCGTGTACTACGCCCTGATGGCTGGTGTGGTTTGGTTTGTGGTC
+
AAAAADAFFFFFGGGFGGFGGFHFGFHHFGAEGIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
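A minimal sketch of the filter-and-convert step, assuming four-line FASTQ records with Phred+33 quality encoding and a mean-quality cut-off. The slides say only that the filters are hard coded, so the threshold of 30 and all function names here are assumptions, not Amplivar's actual settings.

```python
def fastq_records(lines):
    """Yield (name, sequence, quality) from four-line FASTQ records."""
    lines = [line.rstrip("\n") for line in lines]
    for i in range(0, len(lines) - 3, 4):
        header, sequence, _plus, quality = lines[i:i + 4]
        yield header[1:], sequence, quality


def mean_quality(quality, offset=33):
    """Mean Phred score of a quality string (Phred+33 assumed)."""
    return sum(ord(char) - offset for char in quality) / len(quality)


def fastq_to_fasta(lines, min_mean_quality=30):
    """Return FASTA records for reads whose mean quality passes the filter."""
    fasta = []
    for name, sequence, quality in fastq_records(lines):
        if quality and mean_quality(quality) >= min_mean_quality:
            fasta.append(f">{name}\n{sequence}")
    return fasta


# Two hypothetical reads: one high quality, one very low quality.
example = ["@r1", "ACGT", "+", "IIII", "@r2", "ACGT", "+", "!!!!"]

if __name__ == "__main__":
    print("\n".join(fastq_to_fasta(example)))
```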

group_fasta_reads

•  @file_list = glob "$merged_dir/*fna" ;

Grouped reads

3 AGACAACTGTTCAAACTGATGGGACCCACTCCATCGAGATTTCACTGTAGCTAGACCAAAATAG
1 ACCACTTTTGGAGGGAGATTTCGCTCCTGAAGAAAATTCGACAGCTTTGTGCCTGGCTAATTCT
527 AGTGTATCCATTTTCTTCTCTCTGACCTTTGGCCCCCTACATCGACCATTCTGCAAGGTTAACA
1 CTCACCCCCAGACTGGGTTTTTAGGTCTCGGTTTACAAGTTTCTTATGCTGATGCTGAAAAAAA
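Grouping amounts to counting identical sequences, which yields the count-plus-sequence listing shown above. The sketch below is a Python illustration of that idea, not the original Perl script; the function names are hypothetical, and the directory helper simply mirrors the `*fna` glob from the slide.

```python
from collections import Counter
from pathlib import Path


def fasta_sequences(lines):
    """Yield the sequence of each FASTA record, joining wrapped lines."""
    parts = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if parts:
                yield "".join(parts)
            parts = []
        elif line:
            parts.append(line)
    if parts:
        yield "".join(parts)


def group_reads(lines):
    """Count how many reads share each exact sequence."""
    return Counter(fasta_sequences(lines))


def group_directory(merged_dir):
    """Group reads across every *fna file in a directory."""
    groups = Counter()
    for path in sorted(Path(merged_dir).glob("*fna")):
        with open(path) as handle:
            groups.update(fasta_sequences(handle))
    return groups


if __name__ == "__main__":
    demo = [">a", "ACGT", ">b", "ACGT", ">c", "TTTT"]
    for sequence, count in group_reads(demo).most_common():
        print(count, sequence)
```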

Usual suspects file

4 column tab separated text file with Unix line endings
•  Column 1: RefSeq identifier
•  Column 2: cDNA HGVS nomenclature
•  Column 3: codon change HGVS nomenclature
•  Column 4: sequence to match
•  Usual suspects files available for TruSeq cancer panel and for PCRbrary

RefSeq             cDNA description       codon change   sequence
BRAF_NM_004333.4   c.1798G                V600           CTCCATCGAGATTTCACTGTAGCTAGACCAAA
BRAF_NM_004333.4   c.1798_1799delinsAA    V600K          CTCCATCGAGATTTCTTTGTAGCTAGACCAAA
BRAF_NM_004333.4   c.1798_1799delinsAG    V600R          CTCCATCGAGATTTCCTTGTAGCTAGACCAAA
BRAF_NM_004333.4   c.1798G>A              V600K          CTCCATCGAGATTTCATTGTAGCTAGACCAAA
BRAF_NM_004333.4   c.1799T                V600           CTCCATCGAGATTTCACTGTAGCTAGACCAAA
BRAF_NM_004333.4   c.1799_1800delinsAA    V600E          CTCCATCGAGATTTTTCTGTAGCTAGACCAAA
BRAF_NM_004333.4   c.1799_1800delinsAT    V600D          CTCCATCGAGATTTTACTGTAGCTAGACCAAA
BRAF_NM_004333.4   c.1799T>A              V600E          CTCCATCGAGATTTCTCTGTAGCTAGACCAAA
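Genotyping against a usual suspects table can be sketched as a substring search: each grouped read that contains a listed sequence adds its read count to that genotype. The two target sequences below are taken from the table above; the flanking bases, read counts and function name are hypothetical, and only the strand as listed is searched.

```python
def genotype(grouped_reads, usual_suspects):
    """Sum read counts for every grouped read containing each target sequence."""
    counts = {name: 0 for name in usual_suspects}
    for read, read_count in grouped_reads.items():
        for name, target in usual_suspects.items():
            if target in read:
                counts[name] += read_count
    return counts


# Target sequences from the usual suspects table.
usual_suspects = {
    "BRAF V600 wt": "CTCCATCGAGATTTCACTGTAGCTAGACCAAA",
    "BRAF c.1799T>A V600E": "CTCCATCGAGATTTCTCTGTAGCTAGACCAAA",
}

# Hypothetical grouped reads: made-up flanks around each target, with counts.
grouped_reads = {
    "GGGACCCA" + usual_suspects["BRAF V600 wt"] + "ATAG": 300,
    "GGGACCCA" + usual_suspects["BRAF c.1799T>A V600E"] + "ATAG": 100,
    "ACCACTTTTGGAGGGAGATTTCGC": 5,  # unrelated amplicon, matches nothing
}

if __name__ == "__main__":
    counts = genotype(grouped_reads, usual_suspects)
    matched = sum(counts.values())
    for name, count in counts.items():
        print(f"{name}: {count} reads ({100 * count / matched:.1f}% of matched)")
```

Reporting each genotype as a percentage of matched reads is one possible convention; the slides do not say how the percentages in the genotype table are normalised.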

Genotype table

Columns: BRAF 600 wt V600; BRAF 600 c.1799T>A V600E; KRAS 12 & 13 wt G12/G13; KRAS 12 c.34G>A G12S; KRAS 12 c.34G>C G12R; KRAS 12 c.34G>T G12C; KRAS 12 c.35G>A G12D; KRAS 12 c.35G>C G12A; KRAS 12 c.35G>T G12V; KRAS 13 c.38G>A G13D

library              V600wt  V600E  G12/G13wt  G12S  G12R  G12C  G12D  G12A   G12V  G13D
DL130016FTGx120036   100.0   0.0    100.0      0.0   0.0   0.0   0.0   74.2   0.0   0.0
DL130028FTGx120036   100.0   0.0    100.0      0.0   0.0   0.0   0.0   30.6   0.0   0.0
DL130040FTGx120036   100.0   0.0    100.0      0.0   0.0   0.0   0.0   50.0   0.0   0.0
DL130052FTGx120036   100.0   0.0    100.0      0.0   0.0   0.0   0.0   130.8  0.0   0.0
DL130064FTGx120036   100.0   0.4    100.0      0.0   0.0   0.0   0.0   300.0  0.0   0.0
DL130076FTGx120036   100.0   0.0    100.0      0.0   0.0   0.0   0.0   126.5  0.0   0.0
DL130018FTGx120041   100.0   0.0    100.0      0.0   0.0   0.0   0.0   0.0    0.0   0.0
DL130030FTGx120041   100.0   0.0    100.0      0.0   0.0   0.0   0.0   0.0    0.0   0.0
DL130042FTGx120041   100.0   0.0    100.0      1.5   0.0   0.0   0.0   0.0    0.0   0.0
DL130054FTGx120041   100.0   0.0    100.0      0.0   0.0   0.0   0.0   0.0    0.0   0.0
DL130066FTGx120041   100.0   0.2    100.0      0.0   0.0   0.0   0.0   0.0    0.0   0.0
DL130078FTGx120041   100.0   0.0    100.0      0.0   0.0   0.0   0.0   0.0    0.0   0.0
DL130015FTGx120044   100.0   49.9   100.0      0.0   0.0   0.0   0.0   0.0    0.0   3.4
DL130027FTGx120044   100.0   45.9   100.0      0.0   0.0   0.0   0.0   0.0    0.0   0.0
DL130039FTGx120044   100.0   54.9   100.0      0.0   0.0   0.0   0.0   0.0    0.0   0.0
DL130051FTGx120044   100.0   32.5   100.0      0.0   0.0   0.0   0.0   0.0    0.0   0.0
DL130063FTGx120044   100.0   45.3   100.0      0.0   0.0   0.0   0.7   0.0    0.0   0.0
DL130075FTGx120044   100.0   43.2   100.0      0.0   0.0   0.0   0.0   0.0    0.0   0.0

Amplicon flank file

4 column tab separated text file with Unix line endings
•  Column 1: amplicon identifier with unique number
•  Column 2: size (not currently calculated)
•  Column 3: co-ords
•  Column 4: flanks with (.*) to capture the sequence between amplimers
•  Flank files available for PCRbrary, TruSeq Cancer Panel and Olga's TruSeq panel

ID          Size   Co-ords                      Sequence
1_MPL1_2    175    chr1:43815006-43815137       GCCGTAGGTGCGCACG(.*)TCAGCAGCAGCAGG
2_NRAS1_7   175    chr1:115256526-115256653     GCATTCCCTGTGGTTTT(.*)AGAGTACAGTGCCATG
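The fourth column is already a regular expression, so grabbing an amplicon reduces to a regex search whose first capture group is the sequence between the two amplimers. A minimal sketch using the MPL pattern from the table; the read and the function name are made up for illustration.

```python
import re


def grab_amplicon(read, flank_pattern):
    """Return the sequence captured between the two flanks, or None if absent."""
    match = re.search(flank_pattern, read)
    return match.group(1) if match else None


# Flank pattern for amplicon 1_MPL1_2, as listed in the flank file.
mpl_flanks = "GCCGTAGGTGCGCACG(.*)TCAGCAGCAGCAGG"

# Hypothetical read: left flank, a made-up insert, right flank.
read = "GCCGTAGGTGCGCACG" + "AAACCCGGGTTT" + "TCAGCAGCAGCAGG"

if __name__ == "__main__":
    print(grab_amplicon(read, mpl_flanks))
```

Because `(.*)` is greedy, a read containing the right flank more than once would be captured up to its last occurrence.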

Grouping Reads

amplivar -i * -j * -k *

-i /storage/local/sandbox/working_directory
-j usual_suspects.txt
-k flanking_primers

•  Merges read pairs, quality filters, converts to fasta, groups by sequence
•  Counts reads corresponding to each amplicon
•  Genotypes according to the usual suspects table with read counts
•  Groups reads by amplicon for mutation scanning

AMPLIVAR

1.  SeqPrep: remove adapters & merge reads
2.  Filter reads by quality
3.  Convert fastq2fasta
4.  Group fasta reads
5.  Genotype grouped reads
6.  Grab reads by flanks
7.  Sort reads by locus

AMPLIVAR WRAPPER

1.  Create symbolic links for fastqs
2.  Create subdirectories for each fastq file pair (R1 & R2)
3.  Run amplivar (the AMPLIVAR steps listed above)
4.  Run sort amplicons
5.  Run blat on grouped, sorted reads
6.  Convert blat psl2sam, sam2bam
7.  Run bamleftalign
8.  Inflate bam
9.  Run VarScan
10.  Run VEP

AmpliVar Required tools

•  SeqPrep (C)
•  Blat (C)
•  Samtools (C)
•  Bamleftalign (C++)
•  VarScan (java)
•  Bash
•  Perl
•  Python

>1_MPL1_2(175(chr1:43815006443815137 GGCGGTGGACGGAGATCTGGGGTCACAGAGCGAACCAAGAATGCCTGTTTACAGGCCTTCGGCTCCACCTGGTCCACCGCCAGTCTCCTGCCTGGCGGGGGCGGTACCTGTAGTGTGCAGGAAACTGCCACC734>1_MPL1_2(175(chr1:43815006443815137 GGCGGTGGACGGAGATCTGGGGTCACAGAGCGAACAAAGAATGCCTGTTTACAGGCCTTCGGCTCCACCTGGTCCACCGCCAGTCTCCTGCCTGGCGGGGGCGGTACCTGTAGTGTGCAGGAAACTGCCACC4>1_MPL1_2(175(chr1:43815006443815137 GGCGGTGGACGGAGATCTGGGGTCACAGAGCGAACCAAGAATGCCTGCTTACAGGCCTTCGGCTCCACCTGGTCCACCGCCAGTCTCCTGCCTGGCGGGGGCGGTACCTGTAGTGTGCAGGAAACTGCCACC4>1_MPL1_2(175(chr1:43815006443815137 GGCGGTGGACGGAGATCTGGGGTCACAGAGCGAACCAAGAATGCCTGTTTACAGGCCTTCGGCTCCACCTGGTCCACCGCCAGTCTCCTACCTGGCGGGGGCGGTACCTGTAGTGTGCAGGAAACTGCCACC4>1_MPL1_2(175(chr1:43815006443815137 GGCGGTGGACGGAGATCTGGGGTCACAGAGCGAACCAAGAATGCCTGTTTACAGGCCTTCGGCTCCACCTGGTCCACCGCCAGTCTCCTGCCTGGCAGGGGCGGTACCTGTAGTGTGCAGGAAACTGCCACC4>1_MPL1_2(175(chr1:43815006443815137 GGCGGTGGACGGAGATCTGGGGTCACAGAGCGAACCAAGAATGCCTGTTTACAGGCCTTCGGCTCCACCTGGTCCACCGCCAGTCTCCTGCCTGGCGGAGGCGGTACCTGTAGTGTGCAGGAAACTGCCACC4>1_MPL1_2(175(chr1:43815006443815137 GGCGGTGGACGGAGATCTGGGGTCACAGAGCGAGCCAAGAATGCCTGTTTACAGGCCTTCGGCTCCACCTGGTCCACCGCCAGTCTCCTGCCTGGCGGGGGCGGTACCTGTAGTGTGCAGGAAACTGCCACC4>1_MPL1_2(175(chr1:43815006443815137 GGCGGTGGACGGAGGTCTGGGGTCACAGAGCGAACCAAGAATGCCTGTTTACAGGCCTTCGGCTCCACCTGGTCCACCGCCAGTCTCCTGCCTGGCGGGGGCGGTACCTGTAGTGTGCAGGAAACTGCCACC4>1_MPL1_2(175(chr1:43815006443815137 GGCGGTGGACGGAGATCAGGGGTCACAGAGCGAACCAAGAATGCCTGTTTACAGGCCTTCGGCTCCACCTGGTCCACCGCCAGTCTCCTGCCTGGCGGGGGCGGTACCTGTAGTGTGCAGGAAACTGCCACC3>1_MPL1_2(175(chr1:43815006443815137 GGCGGTGGACGGAGATCTGGGGTCAAAGAGCGAACCAAGAATGCCTGTTTACAGGCCTTCGGCTCCACCTGGTCCACCGCCAGTCTCCTGCCTGGCGGGGGCGGTACCTGTAGTGTGCAGGAAACTGCCACC3>1_MPL1_2(175(chr1:43815006443815137 GGCGGTGGACGGAGATCTGGGGTCACAGAGCAAACCAAGAATGCCTGTTTACAGGCCTTCGGCTCCACCTGGTCCACCGCCAGTCTCCTGCCTGGCGGGGGCGGTACCTGTAGTGTGCAGGAAACTGCCACC3>1_MPL1_2(175(chr1:43815006443815137 
GGCGGTGGACGGAGATCTGGGGTCACAGAGCGAACCAAGAATGCCTGTCTACAGGCCTTCGGCTCCACCTGGTCCACCGCCAGTCTCCTGCCTGGCGGGGGCGGTACCTGTAGTGTGCAGGAAACTGCCACC3

Sorted, locus (amplicon)-based files

Sample | Amplicon | Read | % of major allele
MS0318_1164 Blood | 1030_BRCA1_exon14 175 chr17:41226334-41226449 | AAGAGGAGCTCATTAAGGTTGTTGATGTGGAGGAGCAACAGCTGGAAGAGTCTGGGCCACACGATTTGACGGAAACATCTTACTTGCCAAGGCAAGATCTAGGTAATATTTCATCT | 100%
MS0323_1164 Frozen | 1030_BRCA1_exon14 175 chr17:41226334-41226449 | AAGAGGAGCTCATTAAGGTTGTTGATGTGGAGGAGCAACAGCTGGAAGAGTCTGGGCCACACGATTTGACGGAAACATCTTACTTGCCAAGGCAAGATCTAGGTAATATTTCATCT | 100%
MS0313_1164 FFPE | 1030_BRCA1_exon14 175 chr17:41226334-41226449 | AAGAGGAGCTCATTAAGGTTGTTGATGTGGAGGAGCAACAGCTGGAAGAGTCTGGGCCACACGATTTGACGGAAACATCTTACTTGCCAAGGCAAGATCTAGGTAATATTTCATCT | 100%
MS0313_1164 FFPE | 1030_BRCA1_exon14 175 chr17:41226334-41226449 | AAGAGGAGCTTATTAAGGTTGTTGATGTGGAGGAGCAACAGCTGGAAGAGTCTGGGCCACACGATTTGACGGAAACATCTTACTTGCCAAGGCAAGATCTAGGTAATATTTCATCT | 28.40%
MS0313_1164 FFPE | 1030_BRCA1_exon14 175 chr17:41226334-41226449 | AAGAGGAGCTCATTAAGGTTGTTGATGTGGAGGAGCAACAGCTGGAAGAGTCTGGGCCACACGATTTGACGGAAACATCTTACTTGCCAAGGCAAGATTTAGGTAATATTTCATCT | 26.20%

Most reads are error free. FFPE contaminates the evidence.

Clustal alignment & phylogeny of errors

Merged forward and reverse reads

PandaSeq and SeqPrep can merge overlapping read pairs into single reads that are longer and more accurate. In an unselected run of 100-base read pairs enriched for hereditary cancer genes, over 20% of the pairs could be merged. With longer reads and a suitable experimental design this fraction could be increased.

Pairs Processed: 18,456,760
Pairs Merged: 4,029,383
Pairs With Adapters: 32,899
Pairs Discarded: 646
Percent merged: 21.83
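The merging idea can be illustrated with a deliberately simplified sketch. The function names, the toy fragment and the exact-match overlap rule are assumptions made for illustration; PandaSeq and SeqPrep themselves score overlaps using base qualities and tolerate mismatches. The last line recomputes the merge rate from the run statistics quoted above.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T, N only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq))


def merge_pair(read1, read2, min_overlap=10):
    """Merge a forward read with its reverse mate if they overlap exactly.

    Simplification (assumption): only perfect overlaps are accepted and the
    longest one wins. Returns the merged sequence, or None when no overlap
    of at least min_overlap bases exists.
    """
    mate = reverse_complement(read2)  # put the mate on the same strand as read1
    longest = min(len(read1), len(mate))
    for overlap in range(longest, min_overlap - 1, -1):
        if read1[-overlap:] == mate[:overlap]:
            return read1 + mate[overlap:]
    return None


# Hypothetical 40-base fragment sequenced as two 25-base reads.
fragment = "GGCGGTGGACGGAGATCTGGGGTCACAGAGCGAACCAAGA"
read1 = fragment[:25]
read2 = reverse_complement(fragment[-25:])
merged = merge_pair(read1, read2)

# Merge rate from the run statistics: pairs merged / pairs processed.
percent_merged = 100 * 4_029_383 / 18_456_760
```

In this toy case the two reads share a 10-base overlap, so the merged read spans the whole fragment.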

locus | 1 total | 2 total
chr17 41223060 41223115 | 2631 | 2777
chr17 41223076 41223130 | 2452 | 2674
chr17 41223101 41223146 | 2223 | 2501

[Chart: 1 total and 2 total for each of the three loci above]

Probability of a given length read as a subset of a longer read in a normal distribution of longer reads: the “minimum substring problem”

This approach might make sense with longer reads

Average read length from 101 to 150 bases

Orthogonal validation without Sanger

[Figure: genome browser view of chr17 (hg19), positions 41,223,070 to 41,223,105, with tracks: Your Sequence from Blat Search; RefSeq Genes; Simple Nucleotide Polymorphisms (dbSNP 137) Found in >= 1% of Samples; Duplications of >1000 Bases of Non-RepeatMasked Sequence; Repeating Elements by RepeatMasker]

AGTCATCATACTCGTCGTCGACCTGAGACCCGTCTAAGA flanks

CCAGCAGTATCAGTA(.*)AGATTCTGCAACTTT

TATGAGCAGCAGCTG(.*)CAATTGGGGAACTTT

AGATTCTGCAACTTT(.*)CAATGCAGAGGTTGAG YourSeq

rs1799966

GCCCAGAGTCCAGCTGCTGCTCATAC
GCCCAG*G*GTCCAGCTGCTGCTCATAC

Forward reads inc rs1799966
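The flank-and-wildcard patterns shown on this slide can be applied directly to reads with a regular expression. The sketch below is an illustration only: the flanks are taken from the two rs1799966 sequences on the slide, while the reads and the function name are hypothetical. It counts which base each forward read carries between the flanks.

```python
import re
from collections import Counter

# Flanks from the two sequences on the slide; the captured base is the
# single position at which those two sequences differ.
PATTERN = re.compile(r"GCCCAG(.)GTCCAGCTGCTGCTCATAC")


def count_alleles(reads):
    """Count the base found between the two flanks in each matching read."""
    counts = Counter()
    for read in reads:
        match = PATTERN.search(read)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical forward reads built from the two sequences on the slide.
reads = [
    "TT" + "GCCCAGAGTCCAGCTGCTGCTCATAC" + "AA",
    "GCCCAGGGTCCAGCTGCTGCTCATAC",
    "GCCCAGGGTCCAGCTGCTGCTCATAC" + "TT",
    "ACGTACGTACGT",  # does not span the locus, so it is ignored
]
allele_counts = count_alleles(reads)
```

Reads from the reverse strand would need to be reverse-complemented first; that step is omitted here because the slide refers to forward reads.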

Downstream processing of FASTA files: BWA, BLAT, annotation

A schematic of the workflow used by MiST.

Subramanian S et al. Nucl. Acids Res. 2013;41:e154

Identification of potentially paralogous read pairs.

Motivation for the use of Geoseq in variant calling.

Comparison of MiST and GATK. Each box has three sets of numbers, from left to right they are variant calls, (i) unique to MiST, (ii) common to both platforms and (iii) unique to GATK.

Filters are applied to remove calls occurring in public databases like dbSNP (17), 1000 Genomes (18) and a collection of already known private variants.

DIAMUND: Direct Comparison of Genomes to Detect Mutations

Figure 1. Outline of initial steps in the Diamund algorithm, which identifies all k-mers unique to an affected proband and missing from both unaffected parents. The first step identifies k-mers, after which the proband data are filtered to remove k-mers resulting from sequencing errors. Intersecting all three sets identifies k-mers that are unique to the proband.

sequencing, where the number of true but clinically irrelevant variants will be 50 times greater.

Here, we introduce a new method, DIAMUND (direct alignment for mutation discovery), which takes a different approach to exome and whole-genome analysis, and as a result produces dramatically smaller sets of candidate mutations. Rather than aligning all samples to the reference genome, we align the sequences directly to one another. This method is designed primarily for two types of analyses: (1) self-comparisons, where diseased tissue is compared with normal tissue from the same individual, and (2) family studies, where the differences among the DNA sequences from the subjects are far fewer than the differences between any subject and the reference genome.

Our method does not require that the raw sequencing reads, usually numbering 100 million or more for a whole exome, be aligned to the GRC37 reference genome, nor does it require a complex genome assembly or an all-versus-all alignment of these large data sets. As we explain in detail below, we use a more efficient algorithm that allows us to quickly find sequences that are unique to any sample.

We have implemented and tested DIAMUND on exomes representing two types of analysis problem. First, we considered self-comparisons, in which DNA from primary cultured fibroblasts derived from diseased tissue in an affected individual was compared with DNA from nondiseased primary cultured fibroblasts from the same individual. For the analysis of tumor cells or other somatic mosaic genetic abnormalities, this direct comparison should yield a smaller set of variants than an analysis that first compares all sequences to the reference genome. Second, we looked at three parent–child trios in which a de novo mutation in the child was suspected to be causing disease. The standard algorithm would compare all three individuals to the reference genome, generating very large lists of variants, many of which are shared by the child and a parent. By comparing the child’s DNA directly to both parents, we can quickly identify all de novo mutations, without losing sensitivity and without detecting family-specific variants that add noise to the process. For each of these problems, the number of true de novo mutations is very small, obviating the need for the aggressive filters that exome and whole-genome pipelines use, which might eliminate the true variant of interest.

De novo mutations may account for a high proportion of Mendelian disorders. Yang et al. recently reported [Yang et al., 2013] on exome sequencing of 250 probands and their families, among which they identified 33 patients with autosomal dominant and nine with X-linked diseases. Of these, 83% of the autosomal dominant and 40% of the X-linked mutations occurred de novo.

In addition to generating fewer false positives, direct comparison between samples within a family, or between affected and unaffected tissue, allows for detection of mutations in regions that are entirely missing from the reference genome. It has already been shown that some human populations have large shared genomic regions, often spanning many megabases [Li et al., 2010], which are missing entirely from the human reference genome. These include novel segmental duplications [Schuster et al., 2010] as well as entirely novel sequences. If a mutation of interest happens to fall in one of these regions, then conventional methods will be guaranteed to miss it. Our direct comparison algorithm, in contrast, includes these regions and is quite capable of finding mutations within them.

An important caveat is that DIAMUND is not intended to solve the more general problem of variant detection in any sample. It is designed to take advantage of very closely related samples where direct between-sample comparisons can more effectively identify mutations present in just one or a subset of the samples.

Methods

DIAMUND begins with two or more sets of DNA sequences, or “reads,” generated by a sequencing instrument. Here, we describe the algorithm as applied to three trios consisting of an affected individual (or proband) and two unaffected parents. Specializing the algorithm to two samples, where one is normal and the other is diseased (e.g., cancerous) tissue from the same individual, is straightforward.

One way of directly comparing two or more genomes is to assemble each data set de novo, using any of several next-generation sequence assemblers [Schatz et al., 2010], and then compare the assemblies using a whole-genome alignment algorithm such as MUMmer [Delcher et al., 1999; Kurtz et al., 2004]. However, whole-genome assembly is computationally costly and can produce erroneous assemblies, which in turn might create even larger problems than aligning all reads to the reference genome. Instead, DIAMUND uses a direct approach in which we count all sequences of length k in all the reads, for some fixed value of k, and then compare these k-mers to one another. Here, we outline the 10 major steps of the algorithm; the initial steps are illustrated in Figure 1.

284 HUMAN MUTATION, Vol. 35, No. 3, 283–288, 2014

Filtering statistics

Table 1. Illustration of the Data Reduction at Each Step from Raw Reads to a Final Set of Mutated Loci (data remaining at the end of each step)

Filtering step | Disease/normal pair | Family trio BH1019 | Family trio BH2041 | Family trio BH2688
Number of reads from proband/diseased tissue | 118,414,556 | 84,201,820 | 75,877,750 | 103,527,644
Number of 27-mers in proband/diseased tissue | 911,738,627 | 795,477,167 | 517,272,851 | 1,088,610,020
Number of k-mers with count >10 | 77,903,885 | 61,805,320 | 64,719,150 | 113,066,951
Remove vector sequence | 77,898,848 | 61,800,798 | 64,713,995 | 113,062,417
Eliminate k-mers found in reference GRC37 exome | 17,821,359 | 9,385,347 | 10,730,208 | 50,535,681
Eliminate k-mers found in parent exomes/normal tissue | 10,568 | 65,352 | 20,130 | 2,006
Identify reads containing k-mers | 32,829 reads | 148,496 | 46,454 | 4,404
Remove reads containing vector | 15,260 | 125,648 | 38,799 | 2,760
Number of contigs after assembly | 2,147 | 13,189 | 3,755 | 359
Number of contigs with >3 reads after merging contigs | 279 contigs | 1,437 | 701 | 71
Identify variants covered by reads from normal tissue | 55 contigs | 5 | 6 | 2
Keep variants with >5% coverage | 42 variants | 5 | 6 | 2
Find variants in coding regions | 14 variants | 3 | 3 | 1
Remove synonymous SNPs | 10 variants | 2 | 3 | 1

Step 1: We utilize an efficient parallel algorithm, Jellyfish [Marcais and Kingsford, 2011], for the k-mer counting step. This first step converts the reads for each exome (or genome) to a set of k-mers, which should in theory be a much smaller data set: the number of k-mers in an exome is equivalent to the length of the exome, 50–60 Mbp using current exome capture kits. However, the initial set is dramatically larger, due primarily to sequencing errors, which we address below. We sort each set of k-mers to allow for efficient intersection operations in subsequent steps. Sorting N k-mers requires O(N log N) time, after which computing the intersection with another set of k-mers requires only O(N) time.

Step 2: The second step in the DIAMUND algorithm removes all k-mers from the proband (but not from the unaffected samples) that are likely to represent sequencing errors. Note that every sequencing error introduces k new k-mers. If k is sufficiently large, then virtually all of these k-mers will be unique, i.e., they will not occur in the genome or elsewhere in the reads. Combined with the fact that exome coverage is usually very deep, we can safely assume that any k-mer that occurs just once represents an error.
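The claim that one sequencing error introduces k new k-mers can be checked on a toy example. The read, the error position and the small value of k below are hypothetical choices for illustration (Table 1 uses 27-mers); an error close to the end of a read, or one that happens to recreate an existing k-mer, would yield fewer than k.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


k = 5  # tiny k for illustration only
true_read = "ACGTTGCAAGGCTTACGATC"  # hypothetical error-free read
# Introduce a single substitution (G to T) at index 10.
error_read = true_read[:10] + "T" + true_read[11:]

# k-mers that exist only because of the error.
novel = kmers(error_read, k) - kmers(true_read, k)
```

Every k-mer window that covers the substituted base is new, which is why error k-mers inflate the initial k-mer set so strongly.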

After empirical observations of multiple exomes, we observed that even k-mers occurring more than once are usually errors. Due to biases in sequencing technology, exome data sets may contain erroneous k-mers that occur 10 or more times, particularly for regions that contain very deep coverage (which can exceed 1000-fold for some exonic targets). For the exomes we have analyzed, average coverage is approximately 80–100×, which means that a novel, heterozygous mutation should have 40–50× coverage. Even in regions with lower coverage, novel mutations should have 20 or more reads (and k-mers) covering them. Note that in the case of mosaicism, a much lower proportion than 50% of the reads might contain the mutation; the software can be adjusted to report such cases.

Given these observations, at this stage, we discard all k-mers that occur fewer than 10 times. We tested different values before choosing 10 as the default value, and this can easily be adjusted for data sets with lower or higher coverage. In our tests, a minimum value of 10 excluded an extremely small number of true k-mers.
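A count threshold of this kind can be sketched in a few lines. This is a pure-Python stand-in for illustration, not how Jellyfish or DIAMUND are implemented; the function names and toy reads are assumptions, and only the default threshold of 10 comes from the text.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring across a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def solid_kmers(counts, min_count=10):
    """Keep k-mers seen at least min_count times; rarer ones are treated as errors."""
    return {kmer for kmer, n in counts.items() if n >= min_count}


# Hypothetical toy data: 12 copies of a correct read plus one read with an error.
reads = ["ACGTTGCAAGGC"] * 12 + ["ACGTTGTAAGGC"]
kept = solid_kmers(count_kmers(reads, k=5))
```

The five k-mers created by the single erroneous read each occur once and are dropped, while the eight k-mers of the correct read survive.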

Step 3: After removing likely sequencing errors, some k-mers may remain due to vector contamination. We precompute all k-mers in known vectors, taken from the UniVec database (www.ncbi.nlm.nih.gov/tools/vecscreen/univec), and remove these from the exome representing the proband (or the diseased tissue, in the case of normal vs. diseased tissue comparisons).

We also observe that any k-mer that occurs in the reference genome is probably not the cause of disease. We precompute all k-mers from the targeted regions of the GRC37 genome, and remove these “normal” k-mers from the proband’s data. Note that this set can easily be expanded to include a larger set of variants known to be harmless.

Step 4: After computing all k-mers in the reads from the proband and both parents, this step computes the intersection between proband and mother, and separately between proband and father (Fig. 1). We collect all k-mers unique to the proband but missing from the mother, and repeat this step for the father. We then intersect the two resulting files to give us a single file that contains all k-mers found in the proband but missing from both unaffected parents. These form our initial set that should contain any de novo mutations in the affected individual.
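In set terms, Step 4 is two set differences followed by an intersection. The sketch below uses in-memory Python sets and invented 4-mers purely to illustrate the logic; DIAMUND itself works on sorted k-mer files, and the function name is an assumption.

```python
def proband_specific(proband, mother, father):
    """Return k-mers present in the proband but absent from both parents.

    Follows the two-stage description: remove the mother's k-mers, remove
    the father's k-mers, then intersect the two results.
    """
    not_in_mother = proband - mother
    not_in_father = proband - father
    return not_in_mother & not_in_father


# Hypothetical k-mer sets (k = 4) standing in for filtered exome k-mers.
proband = {"ACGT", "CGTA", "GTAC", "TTTT", "GGGA"}
mother = {"ACGT", "CGTA", "TTTT"}
father = {"ACGT", "GTAC", "TTTT"}
candidates = proband_specific(proband, mother, father)
```

Only the k-mer seen in neither parent remains as a candidate.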

Step 5: At this point, DIAMUND usually has reduced the initial set of k-mers over 10,000-fold, leaving between 2,000 and 65,000 k-mers (Table 1). For the fifth step, we collect the reads containing these k-mers. This requires us to align the k-mers back to the original reads, because the Jellyfish k-mer counter does not keep track of the source of each k-mer. DIAMUND can use either of two efficient alignment systems for this step: MUMmer [Delcher et al., 1999; Kurtz et al., 2004], a suffix tree-based algorithm that rapidly finds exact matches; or Kraken [Wood and Salzberg, 2013], a fast sequence classifier that we modified to provide the output needed by our system. Kraken is the default choice because it is significantly faster. In our experiments, the number of reads identified in this step ranged from 4,400 to 148,000 (Table 1).
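Conceptually, Step 5 selects every read that contains at least one surviving k-mer. The sketch below does this by exact lookup in a Python set; it is a naive stand-in for MUMmer or Kraken, with hypothetical reads and a hypothetical function name.

```python
def reads_with_kmers(reads, target_kmers, k):
    """Return the reads that contain at least one of the target k-mers."""
    selected = []
    for read in reads:
        windows = (read[i:i + k] for i in range(len(read) - k + 1))
        if any(window in target_kmers for window in windows):
            selected.append(read)
    return selected


# Hypothetical reads; "GGGA" plays the role of a proband-specific k-mer.
reads = ["AAACGGGATT", "AAACGTACTT", "CCGGGACC"]
hits = reads_with_kmers(reads, {"GGGA"}, k=4)
```

Because set membership is a constant-time lookup on average, the cost is roughly proportional to the total number of bases scanned.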

Step 6: Despite every effort to screen reads for contamination, some small fragments of vector sequences often still remain in the reads. If these vectors happen to contaminate only the proband (or affected) data set, they will appear to be novel mutations. We eliminate these by comparing the reads identified in the previous step to the UniVec database (www.ncbi.nlm.nih.gov/tools/vecscreen/univec) using the vecscreen program, and removing any reads with vector sequence. Note that running vecscreen on the original data would be extremely demanding computationally, but because the number of reads at this step has been reduced approximately 1,000-fold, it is relatively fast.


Panagopoulos et al. PLoS ONE 2014; 9(6): e99439

The “Grep” Command But Not FusionMap, FusionFinder or ChimeraScan Captures the CIC-DUX4 Fusion Gene from Whole Transcriptome Sequencing Data on a Small Round Cell Tumor with t(4;19)(q35;q13)

Three fusion-finder programs, FusionMap, FusionFinder, and ChimeraScan, generated a plethora of fusion transcripts but not the biologically important and cancer-specific fusion gene, the CIC-DUX4 chimeric transcript. It was necessary to use the “grep” command-line utility to sift out the latter from the many data produced by the automated algorithms. Cytogenetic, FISH, and clinico-pathologic tumor features hinted at the presence of the said fusion, but it was eventually found only after the manual “grep” function had been used.
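A grep-style search for a suspected fusion junction is easy to reproduce. The sketch below is an assumption-laden illustration: the junction string is invented and is not the real CIC or DUX4 sequence, and the function names are hypothetical. It searches both strands, which plain grep would not do on its own.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T, N only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq))


def grep_reads(reads, query):
    """Return reads containing query on either strand."""
    query_rc = reverse_complement(query)
    return [read for read in reads if query in read or query_rc in read]


# HYPOTHETICAL junction: last bases of one gene joined to the first bases
# of its partner. These strings are invented for illustration.
junction = "ACCTGAAGGT" + "TTCGGACTCA"
reads = [
    "GGG" + junction + "GGG",                     # spans the junction, forward
    reverse_complement("TT" + junction + "TT"),   # spans the junction, reverse
    "GGG" + junction[:10] + "AAAAAAAAAA",         # first partner only
]
hits = grep_reads(reads, junction)
```

An exact search like this only finds reads that cross the breakpoint without mismatches, so a short query centred on the junction is usually the practical choice.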

Simple is good

2. For each maximal repetition Y, identify the minimum unit U such that U is not a repetition and Y is a concatenation of multiple occurrences of U and a prefix of U. For example, when Y = (CAG)6CA, U = CAG.

3. An approximate repetition is a substring such that its alignment with repetition (U)m is decomposed into series of exact matches of length |U| or more, and neighboring series must have only one mismatch, one insertion, or one deletion between them in the alignment, where |U| indicates the length of U. We calculate an approximate repetition by extending a maximal (exact) repetition in both directions in a greedy manner. For example, given

CGCCCGCAGCGCAT(CAG)6CATCAGGGA,

we can extend repetition (CAG)6CA to the underlined substring,

CGCCCGCAGC-GCAT(CAG)6CATCAGGGA,

where bold letters represent mismatches and “-” indicates a deletion. In this way, we retrieve an approximate STR that is not necessarily an exact repeat of the minimum unit U, but may contain mismatches and indels.

4. A read may contain multiple overlapping STRs with the same unit. If two overlap, eliminate the shorter one. If both are of the same length, select one arbitrarily.
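The minimum unit U in step 2 is the shortest period of the repetition Y. The sketch below finds it by brute force; the function name is an assumption and the approach is chosen for clarity, not taken from the paper's implementation. Requiring "multiple occurrences" is interpreted here as at least two full copies of U.

```python
def minimum_unit(repetition):
    """Return the shortest unit U such that `repetition` is U repeated at
    least twice followed by a (possibly empty) prefix of U, else None."""
    n = len(repetition)
    for period in range(1, n // 2 + 1):
        # A string has period p when every base equals the base p positions earlier.
        if all(repetition[i] == repetition[i - period] for i in range(period, n)):
            return repetition[:period]
    return None


# The example from the text: Y = (CAG)6CA, written out in full.
unit = minimum_unit("CAG" * 6 + "CA")
```

Trying periods from shortest to longest guarantees that the returned unit is not itself a repetition of something shorter.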

The algorithm is able to process ten million reads of length 100 bases in ~1700 s on a Xeon X5690 with a clock rate of 3.47 GHz (Supplementary Fig. S1). As the computational time is proportional to the number of reads, ~47 hours is required to process 1 billion 100-bp reads, confirming the practicality of the method for processing real human resequencing data.

Fig. 1. Sensing and locating short tandem repeats (STRs) in short reads. (A) An original short read. (B) An approximate STR (AGAGGC)n (n=6) in the short read. The central four copies of AGAGGC are an exact STR with no mutations, while the flanking copies contain the mutations shown in bold letters. If one of the regions (black) surrounding the STR aligns in a unique position, the STR can be located in the genome. (C) A read occupied by an approximate STR. (D) Sensing STRs from frequency distributions of (AGAGCC)n in NA12877 (father of the HapMap CEU trio), NA12878 (mother), and NA18507 (an African male). The x-axis is the lengths of STR occurrences detected in a read, and the y-axis is the frequency of reads containing STR occurrences of the length indicated on the x-axis. Note that 100-bp-long STR occurrences are frequent in NA12877, while no STR occurrences of length >70 bp are observed in samples NA12878 and NA18507. (E) When a read is filled with an STR (red), we attempt to anchor the other end read (blue) to a unique position unambiguously. (F, G) An STR is located easily if its location can be sandwiched using information on paired-end reads. The length of an STR of length < 100 bp is easily estimated (F), while determining the length of a much longer STR is nontrivial (G). We need to use third-generation sequencers, such as PacBio RS, with the capability of reading DNA fragments having a length of thousands of bases.


Rapid detection of expanded short tandem repeats in personal genomics using hybrid sequencing

Koichiro Doi, Taku Monjo, Pham H. Hoang, Jun Yoshimura, Hideaki Yurino, Jun Mitsui, Hiroyuki Ishiura, Yuji Takahashi, Yaeko Ichikawa, Jun Goto, Shoji Tsuji and Shinichi Morishita

University of Tokyo

Standards

Human Variome Project

Prototype NGS database

Report

Sharing Experience with TruSight One

•  In partnership with Illumina, RCPA and the HGSA, Kim Flintoff (Wellington Regional Genetics Laboratory) is leading an evaluation of exon sequencing using Illumina’s TruSight One panel. Two Coriell family trios will be sequenced by New Zealand Genomics Limited and the data will be shared on an HVPA database.

•  The VCF file will be available on the HVPA LOVD database and performance stats will also be made available.

Next Steps

•  Robust standards for genomic medicine
•  Databases and data content
– Access to identified and de-identified data (consent and confidentiality)
– Database accreditation process in prep with RCPA
– Defining the performance of various aligners, variant callers and annotation programs
– Clinical grade Variant Call Format (VCF)
– Metafile covering data trail: what was tested, what was not tested

Data quality classes

Differentiate between three classes of data:
•  Clinically Reported: the class of data that the HVP Australian Node was originally designed to collect and share, that is, data generated in a NATA-accredited Australian diagnostic laboratory that can be included in a clinical report.
•  Unreported Clinical Quality: data generated in a NATA-accredited diagnostic laboratory that cannot be included in a clinical report. This class would consist primarily of next-generation sequencing (NGS) data.
•  Unaccredited: data generated by an Australian laboratory that has not been NATA accredited.

A new filtering option would be made available to allow users to view only data of a certain class.

Standards for Accreditation of DNA Sequence Variation Databases

Quality Use of Pathology Program (QUPP): a national project for the Development of Standards for Accreditation of DNA Sequence Variation Databases has been jointly initiated by the Royal College of Pathologists of Australasia (RCPA) and the Human Variome Project (HVP).

Background
•  There is a rapidly increasing volume, spectrum, and complexity of genetic tests emerging within diagnostic pathology laboratories. In particular, high-throughput sequencing methods such as targeted panel, exome (WES), and whole genome sequencing (WGS) are producing an increasing quantity of genetic data requiring analysis and interpretation, forming a substantial proportion of the workload.

•  Currently, there is a plethora of online mutation databases to refer to; however, there is a distinct lack of databases that meet the stringent accuracy and reproducibility requirements of the clinical diagnostic environment. Additionally, the current databases are “fractured”, with varied access to and sharing of the data within, and variable quality due to errors and inaccurate data posting, all of which is a clear risk to the quality of patient care. With more widespread, secure sharing of variants and associated phenotypes, the value of cumulative variant information will accelerate the delivery of accurate, actionable, and efficient clinical reports.

•  There are currently no standards or equivalent mechanisms for accreditation of databases to ensure the accuracy and quality of uploaded data into any central repository to meet the needs of the clinical diagnostics environment.

Pathogenicity

1.  “Deleterious- and Disease-Allele Prevalence in Healthy Individuals: Insights from Current Predictions, Mutation Databases, and Population-Scale Resequencing” Yali Xue, Yuan Chen, Qasim Ayub, Ni Huang, Edward V. Ball, Matthew Mort, Andrew D. Phillips, Katy Shaw, Peter D. Stenson, David N. Cooper, Chris Tyler-Smith, and the 1000 Genomes Project Consortium. Am J Hum Genet 91, 1022–1032, 2012

2.  “Amino Acid Changes in Disease-Associated Variants Differ Radically from Variants Observed in the 1000 Genomes Project Dataset” Tjaart A. P. de Beer, Roman A. Laskowski, Sarah L. Parks, Botond Sipos, Nick Goldman, Janet M. Thornton. PLOS Comp Biol 9, 1-15, 2013

3.  “Large Numbers of Genetic Variants Considered to be Pathogenic are Common in Asymptomatic Individuals” Christopher A. Cassa, Mark Y. Tong, and Daniel M. Jordan. HuMu 34(9), 1216–1220, 2013

4.  “Integrated sequence analysis pipeline provides one-stop solution for identifying disease-causing mutations” Hao Hu, Thomas F Wienker, Luciana Musante, Vera MM Kalscheuer, Peter N Robinson, H Hilger Ropers. HuMu, under review

Table 1. List of selected CNV detection methods.

Duan J, Zhang J-G, Deng H-W, Wang Y-P (2013) Comparative Studies of Copy Number Variation Detection Methods for Next-Generation Sequencing Technologies. PLoS ONE 8(3): e59128. doi:10.1371/journal.pone.0059128 http://www.plosone.org/article/info:doi/10.1371/journal.pone.0059128

Summary

•  Current sequencing technology has plenty of room for improvement w.r.t. read length and accuracy

•  Many informatics challenges relate to managing poor quality data or technological limitations and will go away with longer, more accurate reads

•  Annotation, data sharing and integrating variant data with clinical and phenotypic data are the high value healthcare deliverables

Acknowledgments

•  Genomic Medicine & Centre for Translational Pathology, University of Melbourne: Arthur Hsu, Olga Kondrashova, Sebastian Lunke, Clare Love, Renate Marquis-Nicholson, Kym Pham, Paul Waring

•  Human Variome Project: Tim Smith, Alan Lo, Dick Cotton

•  Melbourne Genomics Health Alliance: Clara Gaff, Kathryn North, Doug Hilton, Stephen Smith

Targeted Tumour Sequencing:

BWA Enrichment, Version 1.0.0.1

Enrichment Sequencing Report

Sample Information

Sample ID: TL140380

Sample Name: TL140380

Total PF Reads: 77,538,750

Percent Q30: 78.6%

Adapters Trimmed: Yes

Median Read Length: 151 bp

Enrichment Summary

Target Manifest: TruSight One v1.0
Total Length of Targeted Reference: 11,946,514 bp
Padding Size: 150 bp

Note: All enrichment values are calculated without padding (sequence immediately upstream and downstream) unless otherwise stated.

Read Level Enrichment

Total Aligned Reads: 75,690,682
Percent Aligned Reads: 97.6%
Targeted Aligned Reads: 49,355,753
Read Enrichment: 65.2%
Padded Target Aligned Reads: 57,751,970
Padded Read Enrichment: 76.3%

Base Level Enrichment

Total Aligned Bases: 10,449,595,970
Targeted Aligned Bases: 4,976,953,404
Base Enrichment: 47.6%
Padded Target Aligned Bases: 7,636,013,223
Padded Base Enrichment: 73.1%

Coverage Summary

Mean Region Coverage Depth: 416.6X
Uniformity of Coverage (Pct > 0.2*mean): 96.1%
Target Coverage at 1X: 99.4%
Further target coverage values (column labels not preserved): 99.1%, 98.9%, 97.9%
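The headline percentages in this report can be reproduced from the raw counts, which is a useful sanity check when comparing runs. The formulas below are inferred from the numbers themselves and are an assumption, not a statement of how the reporting software defines its metrics; the variable names are hypothetical.

```python
# Values copied from the report for sample TL140380.
total_pf_reads = 77_538_750
total_aligned_reads = 75_690_682
targeted_aligned_reads = 49_355_753
total_aligned_bases = 10_449_595_970
targeted_aligned_bases = 4_976_953_404
target_length_bp = 11_946_514


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


percent_aligned = percent(total_aligned_reads, total_pf_reads)
read_enrichment = percent(targeted_aligned_reads, total_aligned_reads)
base_enrichment = percent(targeted_aligned_bases, total_aligned_bases)

# Assumed definition: mean depth = on-target bases / target length.
mean_depth = round(targeted_aligned_bases / target_length_bp, 1)
```

Under these assumed definitions the results agree with the reported 97.6%, 65.2%, 47.6% and 416.6X.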


Constitutional Frozen

Small Variants Summary

SNVs Insertions Deletions

Total Passing 8,113 192 230

Percent Found in dbSNP 98.8% 87.5% 74.3%

Het/Hom Ratio 1.7 1.8 2.5

Ts/Tv Ratio 3.1 - -

Variants by Sequence Context

Number in Genes 8,206 187 225

Number in Exons 6,927 80 107

Number in Coding Regions 6,587 50 64

Number in UTR Regions 340 30 43

Number in Splice Site Regions 742 54 69

Genes include exons, introns and UTR regions. Exons include coding and UTR regions. UTR regions include 5' and 3' UTR regions. Splice site regions include regions annotated as splice acceptor, splice donor, splice site or splice region.

Variants by Consequence

Frameshift - 20 23

Non-synonymous 2,886 30 40

Synonymous 3,676 - -

Stop Gained 19 0 0

Stop Lost 6 0 0

Variation consequences are calculated following the guidelines at http://uswest.ensembl.org/info/genome/variation/predicted_data.html#consequences

Coverage Summary

2555.9X 94.6% 99.5% 99.3% 99.3% 99.1%


Small Variants Summary

Total Passing 8,244 184 255

Percent Found in dbSNP 98.8% 88.6% 70.6%

Het/Hom Ratio 1.7 1.5 2.9

Ts/Tv Ratio 3.0 - -

Variants by Sequence Context

Number in Genes 8,336 182 250

Number in Exons 7,033 79 101

Number in Coding Regions 6,685 49 59

Number in UTR Regions 348 30 42

Number in Splice Site Regions 762 51 89

Genes include exons, introns and UTR regions. Exons include coding and UTR regions. UTR regions include 5' and 3' UTR regions. Splice site regions include regions annotated as splice acceptor, splice donor, splice site or splice region.

Variants by Consequence

Frameshift - 22 24

Non-synonymous 2,952 27 34

Synonymous 3,705 - -

Stop Gained 22 0 0

Stop Lost 6 0 0

Variation consequences are calculated following the guidelines at http://uswest.ensembl.org/info/genome/variation/predicted_data.html#consequences

Sample Information

Sample ID: WES001FR1

Sample Name: WES001FR1

Total PF Reads: 454,879,338

Percent Q30: 81.9%

Adapters Trimmed: Yes

Median Read Length: 151 bp

Enrichment Summary

Target Manifest: TruSight One v1.0
Total Length of Targeted Reference: 11,946,514 bp
Padding Size: 150 bp

Note: All enrichment values are calculated without padding (sequence immediately upstream and downstream) unless otherwise stated.

Read Level Enrichment

Total Aligned Reads: 445,404,466
Percent Aligned Reads: 97.9%
Targeted Aligned Reads: 305,571,144
Read Enrichment: 68.6%
Padded Target Aligned Reads: 341,517,122
Padded Read Enrichment: 76.7%

Base Level Enrichment

Total Aligned Bases: 60,040,931,299
Targeted Aligned Bases: 30,533,612,135
Base Enrichment: 50.9%
Padded Target Aligned Bases: 44,910,193,400
Padded Base Enrichment: 74.8%

Called Variants

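[Figure: called variants compared between samples; set labels FFPE and Frozen; counts as extracted: 16,711; 2,095; 154; 3,641; 9,368; 2,455 (assignment of counts to regions not recoverable)]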

Gene List

techniques allow for the rapid detection of EGFR mutations with high sensitivity and specificity. However, confirmation of mutations via direct sequencing is still necessary.27,76,77 Though not of any current clinical use, an assay that provides a rapid assessment of EGFR mutation status in as little as 30 min using a ‘smart amplification process’ has been described. These may one day provide greatly improved turnaround times for this analysis.78

Formalin-fixed and paraffin-embedded tissue is perfectly suitable for fluorescence in situ hybridization (FISH) and DNA-based tests, but tissue preservation is critical for a successful test. Decalcified and ethanol-fixed tissue, as well as tissues containing abundant necrosis, should be avoided.

The ability to detect multiple driver mutations in lung adenocarcinoma has revolutionized the medical management of this disease and multiplexed testing for all common driver mutations will provide physicians with a more precise guide for therapy.9

Recently, Kris et al79 identified 10 driver mutations in tumor samples from 1000 lung adenocarcinoma patients enrolled in the National Cancer Institute Lung Cancer Mutation Consortium. The mutations, involving KRAS, EGFR, ERBB2 (HER2), BRAF, PIK3CA, AKT1, MAP2K1, and NRAS, were screened using standard multiplexed assays and FISH. Driver mutations were detected in 60% of tumors. The incidences of mutations were as follows: KRAS 25%, EGFR 23%, ALK rearrangements 6%, BRAF 3%, PIK3CA 3%, MET amplifications 2%, ERBB2 1%, MAP2K1 0.4%, NRAS 0.2%, and AKT1 0% (Figure 3).12,67–71 It is noteworthy that 95% of molecular lesions were mutually exclusive.79

EGFR mutations are responsible for the constitutive activation of the tyrosine kinase receptor. These mutations are also most frequently associated with either sensitivity or resistance to EGFR TKIs (Figure 2).6,80–84 The response-associated mutations are linked with response rates of >70% in patients treated with either erlotinib or gefitinib.85,86 However, up to 25% of patients with TKI resistance-associated mutations will also respond to the therapy.67 Pao et al7 analyzed EGFR mutation of exons 18–24 in tumors from 10 gefitinib-responsive and from 7 erlotinib-responsive patients. The results demonstrated that EGFR mutations were present in 7 of 10 (70%) gefitinib-responsive and in 5 of 7 (71%) erlotinib-responsive tumors.

EGFR genotype was more useful than clinical characteristics for selection of appropriate patients for consideration of first-line therapy with an EGFR TKI.85 EGFR mutations are generally associated with sensitivity to TKI therapy.71,87 Both retrospective and prospective studies have demonstrated that lung adenocarcinoma patients carrying such an EGFR mutation and who were treated with TKIs had significantly higher response rates and longer progression-free survival than patients without an EGFR mutation,5–7,25,29,71,83,85,87,88 with some patients experiencing rapid, complete, or partial responses that were persistent.55 Jackman et al85 studied 223 chemotherapy-naïve patients with advanced lung cancer of non-small cell type, among which 86% were adenocarcinomas. Sensitizing EGFR mutations were found in 84 carcinomas, 89% of which were adenocarcinomas. The mutations were associated with a 67% response rate, with a time to progression of 11.8 months, and overall survival of 23.9 months.85 Exon 19 deletions were associated with a relatively longer median time to progression and overall survival compared with L858R (exon 21) mutations. Wild-type EGFR was found in 139 patients (62%), and this finding was associated with poor outcomes (response rate, 3%; time to progression, 3.2 months), irrespective of KRAS status.

EGFRvIII Mutation

EGFR variant III (EGFRvIII), a mutation resulting from an in-frame deletion of exons 2–7 of the coding sequence (amino acids 6–273), has been associated with a subset of squamous cell lung cancers.89–91 A number of functional differences between EGFRvIII and EGFR have been characterized.90,91 EGFRvIII has been identified in an array of human solid tumors, including glioblastoma, breast cancer, ovarian cancer, prostate cancer, and lung cancer. Although EGFRvIII fails to bind EGF, its intracellular tyrosine

Figure 3 Frequency of major driver mutations in signaling molecules in lung adenocarcinomas. About 64% of all adenocarcinoma cases harbor somatic driver mutations. According to the National Cancer Institute Lung Cancer Mutation Consortium data,79 ~23% of lung adenocarcinomas harbor EGFR mutations. The EGFR mutation status of the cancer is associated with its responsiveness or resistance to EGFR TKI therapy. KRAS mutations are more frequently found in adenocarcinomas (25%) and are mutually exclusive with EGFR mutations. Mutations in KRAS have been proposed as one of the mechanisms of primary resistance to gefitinib and erlotinib therapy. A subset of adenocarcinoma cases harbors a transforming fusion gene, EML4–ALK (6%), which mainly involves adenocarcinomas from non-smokers with wild-type EGFR and KRAS. The mutation frequency of BRAF is 3%, PIK3CA 3%, MET amplifications 2%, ERBB2 (Her2/neu) 1%, MAP2K1 0.4%, and NRAS 0.2%. Each of the molecular alterations has a role in the signal pathways, activating important cell functions, including cell proliferation and survival. Approximately 36.4% of lung adenocarcinomas do not harbor currently detectable mutations.
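As a quick arithmetic check on the caption, the listed frequencies can be summed to see how the "about 64%" and "approximately 36.4%" figures relate. This is a minimal sketch that assumes the alterations are counted as non-overlapping, so that the percentages simply add; the caption states mutual exclusivity only for EGFR and KRAS, and the dictionary name `driver_freq_pct` is hypothetical.

```python
# Driver alteration frequencies (%) as listed in the Figure 3 caption.
# EGFR is given only approximately (about 23%) in the caption.
driver_freq_pct = {
    "EGFR": 23.0,
    "KRAS": 25.0,
    "EML4-ALK": 6.0,
    "BRAF": 3.0,
    "PIK3CA": 3.0,
    "MET amplification": 2.0,
    "ERBB2": 1.0,
    "MAP2K1": 0.4,
    "NRAS": 0.2,
}

# Assumption: no tumour carries two of these alterations, so percentages add.
with_driver = round(sum(driver_freq_pct.values()), 1)
without_driver = round(100.0 - with_driver, 1)

print(with_driver, without_driver)  # prints: 63.6 36.4
```

Under that assumption the listed frequencies add up to 63.6%, which is consistent with both rounded figures quoted in the caption.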

L Cheng et al, Molecular pathology of lung cancer. Modern Pathology (2012) 25, 347–369 (page 350).

Filtering Variants

All variants

Sample    None    Qual    Not in Blood
Blood     9828    8551    NA
Frozen    9920    8736    126
FFPE      9709    8163    199

Variants in Gene List

Sample    None    Qual    Not in Blood
Blood     27      18      NA
Frozen    27      23      2 (EGFR)
FFPE      25      19      3 (EGFR, ROS)
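The two tables read as successive filters: a quality filter, removal of tumour variants that are also seen in the matched blood sample, and restriction to a gene list. The following is a minimal sketch of that logic on toy records, not the pipeline used for the slides. The `Variant` fields, the `MIN_QUAL` cut-off, and the placeholder genes `GENE_A` to `GENE_C` are all assumptions; only the two EGFR changes and the gene list come from the slides.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    gene: str
    hgvs: str     # coding-level description, used with the gene as the variant key
    qual: float   # caller quality score (the scale is an assumption)


MIN_QUAL = 30.0   # hypothetical cut-off; the slides do not state one


def passes_quality(v: Variant) -> bool:
    return v.qual >= MIN_QUAL


def tumour_only(tumour: list[Variant], blood: list[Variant]) -> list[Variant]:
    """Quality-passing tumour variants that are absent from the blood sample."""
    # Design choice: any blood call, whatever its quality, marks a variant as germline.
    germline = {(v.gene, v.hgvs) for v in blood}
    return [v for v in tumour
            if passes_quality(v) and (v.gene, v.hgvs) not in germline]


GENE_LIST = {"AKT1", "ALK", "BRAF", "EGFR", "HER2", "KRAS",
             "MAP2K1", "MET", "NRAS", "PIK3CA", "ROS1"}

# Toy data: GENE_A..GENE_C are placeholders, not real calls.
tumour = [
    Variant("EGFR", "c.2573T>G", 80.0),    # p.L858R
    Variant("EGFR", "c.2369C>T", 60.0),    # p.T790M
    Variant("GENE_A", "c.100A>G", 70.0),   # also in blood, so germline
    Variant("GENE_B", "c.200C>T", 10.0),   # fails the quality filter
    Variant("GENE_C", "c.300G>A", 90.0),   # tumour-only but not on the gene list
]
blood = [Variant("GENE_A", "c.100A>G", 65.0)]

somatic = tumour_only(tumour, blood)
reportable = [v for v in somatic if v.gene in GENE_LIST]
print([f"{v.gene} {v.hgvs}" for v in reportable])
```

Treating every blood call as germline is the conservative option: it may discard a true somatic variant, but it avoids reporting an inherited one as tumour-specific.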

EGFR p.L858R

EGFR p.T790M

Confirmation by PCR

[Figure: bar charts of normalised signal (%) per PCR assay, with axis ticks from 100.0 to 250.0.

EGFR normalised %, assays on EGFR NM_005228.3: T790 WT; c.2350T>C, p.S784P; c.2351C>T, p.S784F; c.2354C>T, p.T785I; c.2356G>A, p.V786M; c.2368A>G, p.T790A; c.2369C>T, p.T790M; 828 & 861 wt; c.2572C>A, p.L858M; c.2573_2574delinsGT; c.2573T>A, p.L858Q; c.2573T>G, p.L858R; c.2579A>T, p.K860I; c.2582T>A, p.L861Q; c.2582T>G, p.L861R.

KRAS normalised %, assays on KRAS NM_033360.2: c.34G>A, p.G12S; c.34G>C, p.G12R; c.34G>T, p.G12C; c.35G>A, p.G12D; c.35G>C, p.G12A; c.35G>T, p.G12V; c.37G>A, p.G13S; c.37G>C, p.G13R; c.37G>T, p.G13C; c.38G>A, p.G13D; c.38G>C, p.G13A; c.38G>T, p.G13V.]

Gene list summary

Gene     Locus                            Covered by capture panel?   Observations           Notes
AKT1     chr14:105,235,761-105,262,116    Y                           wt                     activating point
ALK      chr2:29,415,410-30,146,821       Y                           wt                     rearrangements and secondary resistance mutations
BRAF     chr7:140,433,813-140,624,564     Y                           wt                     activating point
EGFR     chr7:55,248,979-55,259,567       Y                           L858R; T790M           activating points, indels, resistance point
HER2     chr17:37,844,393-37,884,915      Y                           wt                     amplification, activating point
KRAS     chr12:25,386,768-25,403,863      Y                           ?                      activating point
MAP2K1   chr15:66,679,211-66,783,882      Y                           changed allele ratio
MET      chr7:116,312,459-116,409,963     Y                           changed allele ratio   mutation and amplification
NRAS     chr1:115,247,085-115,259,515     Y                           wt                     activating point
PIK3CA   chr3:178,866,311-178,952,497     Y                           wt                     activating point
ROS1     chr6:117,609,530-117,747,018     Y                           wt                     rearrangements
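A locus table like this can be used to check whether a called position falls inside a captured region. The sketch below parses the locus strings from the table and looks up a position. It assumes 1-based inclusive intervals and that all coordinates share one genome build (the slide names neither). The function names and the query positions are illustrative and are not taken from the slides.

```python
import re

# Loci as given in the gene list summary (genome build not stated on the slide).
PANEL_LOCI = {
    "AKT1": "chr14:105,235,761-105,262,116",
    "ALK": "chr2:29,415,410-30,146,821",
    "BRAF": "chr7:140,433,813-140,624,564",
    "EGFR": "chr7:55,248,979-55,259,567",
    "HER2": "chr17:37,844,393-37,884,915",
    "KRAS": "chr12:25,386,768-25,403,863",
    "MAP2K1": "chr15:66,679,211-66,783,882",
    "MET": "chr7:116,312,459-116,409,963",
    "NRAS": "chr1:115,247,085-115,259,515",
    "PIK3CA": "chr3:178,866,311-178,952,497",
    "ROS1": "chr6:117,609,530-117,747,018",
}


def parse_locus(text: str) -> tuple[str, int, int]:
    """Turn 'chr7:55,248,979-55,259,567' into ('chr7', 55248979, 55259567)."""
    match = re.fullmatch(r"(chr\w+):([\d,]+)-([\d,]+)", text)
    if match is None:
        raise ValueError(f"unrecognised locus: {text!r}")
    chrom, start, end = match.groups()
    return chrom, int(start.replace(",", "")), int(end.replace(",", ""))


PARSED = {gene: parse_locus(locus) for gene, locus in PANEL_LOCI.items()}


def genes_at(chrom: str, pos: int) -> list[str]:
    """Panel genes whose locus contains the position (1-based, inclusive)."""
    return sorted(gene for gene, (c, start, end) in PARSED.items()
                  if c == chrom and start <= pos <= end)


# Illustrative query positions, not taken from the slides.
print(genes_at("chr7", 55_259_515))   # inside the EGFR locus
print(genes_at("chr7", 100))          # outside every panel locus
```

A linear scan is adequate for an eleven-gene panel; a larger panel would usually call for an interval index instead.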
