Post on 16-Apr-2017
The Future of DNA Sequencing Technology Graham Taylor
Melbourne University, Human Variome Project (Australia), Victorian Clinical
Genetics Laboratories
Context and Topics
1. Technology review & idiosyncratic selection of noteworthy developments and trends in NGS hardware and software from the perspective of Genomic Medicine (so mostly human genome) in the context of meeting clinical needs. Sadly, not covering transcriptomics, ChIP-seq, non-human genomes
2. Applications and implications for diagnostics
An Ultimate Goal for Sequence analysis?
For sequencing – Chromosome-length reads – Perfect base calling accuracy – Each molecule is read – Highly parallel
For analysis – De novo assembly – Well curated reference resources – Data integrated with other biological and medical resources
Research, translation and service
• Original • Surprising • >80% accurate • Numerator-driven: get publications • Bespoke
• Proven • Predictable • >99.99% accurate • Denominator-driven (cost sensitive) • Standardised
Cost and performance cost per base Illumina share price
Now is the winter of our discount tests (unless you are Illumina)
The case for disease-centric analysis
• $1,000 genomes or 1,000 x $1 interesting regions? • How to validate 3.5 x 10^9 tests • Sequencing costs are not limiting
• Quality and accuracy are incomplete • Perform tests for a (clinical) reason
Sequence performance and clinical needs
[Figure axes: number of reads, length of reads]
                              Genetics   Tumor-Analysis   Microbiology
Sample/library preparation    3          4                4
Base calling accuracy         5          5                3
De novo assembly              3          5                4
Detect Rare Events            3          5                5
Portability                   2          3                4
How many variants per exome? SNP count Study
20,000 Choi et al. PNAS 2009
142,000 Mullikin NIH, unpublished 2010
50,000 Clark et al. Nature biotechnology 2011
125,000 Smith et al. Genome Biology 2011
100,000 Johnston & Biesecker Human Molecular Genetics 2013
200,000 to 400,000 Yang et al. N Engl J Med 2013
• 20-fold range • Exome designs vary • Likely to be higher variant count in African populations as the reference sequence is non-African
Low concordance of multiple variant-calling pipelines O’Rawe et al. Genome Medicine 2013, 5:28
SNV concordance: 57.4% Indel concordance: 26.8%
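As a rough illustration of what a concordance figure like this can mean, the sketch below scores a set of pipelines by the share of all called variants that every pipeline reports. This intersection-over-union definition is an assumption for illustration and may differ from the exact measure used by O’Rawe et al.; the function name and the toy variant labels are hypothetical.

```python
def concordance(call_sets):
    """Fraction of the union of variant calls that all pipelines agree on.

    call_sets: one iterable of variant identifiers per pipeline.
    Assumption: concordance = |intersection| / |union|.
    """
    sets = [set(calls) for calls in call_sets]
    union = set().union(*sets)
    if not union:
        return 0.0
    shared = set.intersection(*sets)
    return len(shared) / len(union)


# Toy example: three hypothetical pipelines calling variants at one locus.
pipelines = [
    {"chr1:100A>G", "chr1:200C>T", "chr1:300delA"},
    {"chr1:100A>G", "chr1:200C>T"},
    {"chr1:100A>G", "chr1:400insT"},
]
print(round(concordance(pipelines), 2))
```

Only one of the four distinct calls is shared by all three toy pipelines, so the score is a quarter.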
Venn diagrams of selected CNV detection methods in real data processing
Duan J, Zhang J-G, Deng H-W, Wang Y-P (2013) Comparative Studies of Copy Number Variation Detection Methods for Next-Generation Sequencing Technologies. PLoS ONE 8(3): e59128. doi:10.1371/journal.pone.0059128 http://www.plosone.org/article/info:doi/10.1371/journal.pone.0059128
De novo Assembly (the unfinished genome)
• Genome Res. 2014. 24: 688-696 Huddleston et al. – Within the human genome, there are >900 annotated genes mapping to large segmental duplications. Such genes are typically missing or misassembled in working draft assemblies of genomes
– The widespread adoption of next-generation sequencing methods for de novo genome assemblies has complicated the assembly of repetitive sequences and their organization
– resolved regions that are complex in a genome-wide context but simple in isolation for a fraction of the time and cost of traditional methods using long-read single molecule, real-time (SMRT) sequencing and assembly technology
– SMRT sequencing of large-insert clones can significantly improve sequence assembly within complex repetitive regions of genomes
Recent past and future RIP Coming soon?
SBS
• GnuBIO/BioRad: emulsion microfluidics for targeted sequencing and hotspot analysis of rare variants
• LaserGen: Lightning Terminators™; increased accuracy, longer reads and faster cycle-times. Nucleic Acids Res. Oct 2007; 35(19): 6339–6349. Termination of DNA synthesis by N6-alkylated, not 3′-O-alkylated, photocleavable 2′-deoxyadenosine triphosphate. Weidong Wu et al.
• Qiagen/Intelligent Biosystems
• QuantuMDx: Nat Biotechnol. 2005 Oct;23(10):1294-301. Multiplexed electrical detection of cancer markers with nanowire sensor arrays. Zheng G, Patolsky F, Cui Y, Wang WU, Lieber CM
Currently SBS are Market Leaders
• Illumina • Proton Torrent • PacBio
PacBio
• English et al. (2012) Mind the Gap: Upgrading Genomes with Pacific Biosciences RS Long-Read Sequencing Technology. PLoS ONE 7(11): e47768
• Loomis et al. (2012) Sequencing the unsequenceable: Expanded CGG-repeat alleles of the fragile X gene. Genome Research
Nanopores
• Electronic BioSciences: developing a system with a single/few pores with a very fast rate of sequencing of ~50kb/second
• Genia: DNA polymerase to incorporate nucleotides with PEG-based NanoTags. As the bases are incorporated the NanoTags are cleaved, allowing them to travel through the pore where they can be measured, generating sequence-specific information
• IBM: a solid state nanopore using alternating layers of metal and dielectric material to control the rate of passage through the nanopore
• NABsys: Modified DNA (e.g. by SBH) read via Nanopore. Not yet sequencing, but very long reads
• NobleGen: combination of optical detection on nanopores
• Oxford Nanopore: exonuclease and strand-based nanopore methods
Real long reads Nanopore sequencing
8,476 base single read
Not production ready
[Figure: wiggle plot of mean signal (picoamps) over a total time of 273 seconds]
Viterbi algorithm for all trinucleotides
Electron Microscopy
• ZS Genetics: directly visualizes the sequence of DNA molecules using electron microscopy. Proof of principle by the use of a dUTP nucleotide with a single mercury atom attached to the nitrogenous base. This modification is small enough to allow very long molecules with labels at each A-U to be seen using annular dark-field scanning transmission electron microscopy (ADF-STEM). Microsc Microanal. 2012 Oct;18(5):1049-53. DNA base identification by electron microscopy. Bell DC, Thomas WK, Murtagh KM, Dionne CA, Graham AC, Anderson JE, Glover WR.
• Reveo: atomic force microscopy called the Omni Molecular Recognizer Application (OmniMoRA), will use arrays of nano-knife edge probes to measure the vibrational characteristics of individual bases on DNA molecules that have been stretched and immobilized on a surface
Electron Microscopy
Progress toward an aberration-corrected low energy electron microscope for DNA sequencing and surface analysis. Mankos M, Shadman K, N'diaye AT, Schmid AK, Persson HH, Davis RW. Vac Sci Technol B Nanotechnol Microelectron. 2012 Nov;30(6):6F402
Imaging of reduced 5′-/5DTPA/C-20mer on Au substrate: (a) (b) AFM images at two magnifications, (c) height profile along line shown in (a), (d) height profile along line shown in (b), and (e) LEEM images at three different landing energies.
Aiming for 50 megabase reads with phred 60
Hardware Trends
• Clonal sequencing – Increasing accuracy – Increasing read lengths – Increasing read counts
• Single molecule sequencing – PacBio – Oxford Nanopore
Increasing read counts via patterned flow cells
• Patterned flow-cells useful for nucleic acid analysis US 20120316086 A1
• Kinetic exclusion amplification of nucleic acid libraries WO 2013188582 A1
– (i) capturing the different target nucleic acids at the amplification sites at an average capture rate, and
– (ii) amplifying the target nucleic acids captured at the amplification sites at an average amplification rate, wherein the average amplification rate exceeds the average capture rate.
Patterned flow cells, super Poisson kinetics
Pseudo-long reads via “Moleculo”
Genome informatics example
• Does Moleculo’s technology have both a wet lab and a bioinformatics aspect?
• Yes, it’s about 50:50. One doesn’t make sense without the other. There are two components: first, there is a molecular biology kit and protocol that takes in genomic DNA and turns it into a sequencer-compatible library. After the DNA has been modified and tagged, the second component, the algorithmic part, takes the short reads and reconstructs long reads using those tags. Those are two separate parts. We developed both on campus, and improved upon them after we started the company last year.
Reducing assembly complexity of microbial genomes with single-molecule sequencing
identifying DNA modification, such as methylation patterns, directly from the single-molecule sequencing data [15]. While adoption of this technology was initially slowed by the low accuracy of the single-pass sequences, recent advancements have demonstrated that this drawback can be algorithmically managed to produce assemblies of unmatched continuity [7,8,16]. Steady improvements to the PacBio technology continue to increase read lengths and yield [17], while future technologies promise to combine accuracy with length using either nanopores [11] or advanced sample preparation [18]. Improved microbial genome assembly is an obvious application of these recent developments in long-read sequencing.
Genome assembly is the process of reconstructing a genome from many shorter sequencing reads [19-21]. It is typically formulated as finding a traversal of a properly defined graph of reads, with the ultimate goal of reconstructing the original genome as faithfully as possible. Repeated sequence in the genome induces complexity in the graph and poses the greatest challenge to all assembly algorithms [22]. In addition, repeats are often the focus of analysis [23-25], making their correct assembly critical for subsequent studies. However, repeats can only be resolved by a spanning read or read pair that is uniquely anchored on both sides. Read pairs are typically used due to their length potential (tens of kilobase pairs), but introduce additional complexity because they cannot be precisely sized. Alternatively, long-read sequencing promises to more accurately resolve repeats and directly assemble genomes into their constituent replicons. Figure 1 shows the benefit of increasing read length when assembling Escherichia coli K12 MG1655. This genome can only be assembled into a single contig when the read length exceeds the size of the longest repeat in the genome, a multi-copy rDNA operon. The rDNA operon, sized around 5 to 7 kbp, is the largest repeat class in most bacteria and archaea [26]. Therefore, sequencing reads longer than the rDNA operon, such as those produced by single-molecule sequencing, can automatically close most microbial genomes.
ALLPATHS-LG was the first assembler shown to produce complete microbial genomes using single-molecule sequences [7]. Utilizing a combination of PacBio RS single-molecule reads (2 to 3 kbp), short-range Illumina read pairs (<300 bp insert), and long-range Illumina read pairs (3 to 10 kbp insert), ALLPATHS-LG assembles the Illumina reads first using a de Bruijn graph and incorporates PacBio reads afterwards to patch coverage gaps and resolve repeats. Ribeiro et al. [7] tested this method on 16 genomes and consensus accuracy was measured at 99.9999% on 3 genomes with an available reference. Four of the sixteen genomes were successfully assembled into a complete genome - the remaining genomes were all highly continuous but left unresolved due to large-scale repeats. These results are promising, especially in terms of consensus accuracy; however, the method requires two different sequencing platforms and three library preparations, which limits its efficiency. In addition, the jumping libraries were observed to be inconsistent at spanning large repeats due to biases in the library construction process.
Ideally, complete genomes could be reconstructed from a single fragment library, minimizing costs. Previously, pair libraries were the only sequencing method capable of spanning large repeats, such as the rDNA operon, but the PacBio RS is now capable of producing single-molecule reads of the same length. Leveraging this recent development, we present an approach for microbial genome closure that relies on overlapping and assembling single-molecule reads de novo rather than
Figure 1 Genome assembly graph complexity is reduced as sequence length increases. Three de Bruijn graphs for E. coli K12 are shown for k of 50, 1,000, and 5,000. The graphs are constructed from the reference and are error-free following the methodology of Kingsford et al. [27]. Non-branching paths have been collapsed, so each node can be thought of as a contig with edges indicating adjacency relationships that cannot be resolved, leaving a repeat-induced gap in the assembly. (A) At k = 50, the graph is tangled with hundreds of contigs. (B) Increasing the k-mer size to k = 1,000 significantly simplifies the graph, but unresolved repeats remain. (C) At k = 5,000, the graph is fully resolved into a single contig. The single contig is self-adjacent, reflecting the circular chromosome of the bacterium.
Koren et al. Genome Biology 2013, 14:R101 http://genomebiology.com/2013/14/9/R101
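The effect described in the figure caption can be reproduced on a toy scale. The sketch below is not the Kingsford et al. methodology; it is a minimal illustration, assuming an error-free circular sequence, that counts the branching nodes of a de Bruijn graph for several values of k. The 31-base toy genome (containing one 8-base repeat) and the function name are made up for the example.

```python
from collections import defaultdict


def branching_nodes(genome, k):
    """Count nodes with more than one successor in the de Bruijn graph
    of a circular genome. Nodes are (k-1)-mers, edges are k-mers."""
    circular = genome + genome[:k - 1]  # wrap around the circle
    successors = defaultdict(set)
    for i in range(len(genome)):
        kmer = circular[i:i + k]
        successors[kmer[:-1]].add(kmer[1:])
    return sum(1 for nxt in successors.values() if len(nxt) > 1)


# Hypothetical toy genome: the 8-base repeat ACGTTGCA occurs twice.
repeat = "ACGTTGCA"
toy_genome = "GGATC" + repeat + "TTAGG" + repeat + "CCTAG"

for k in (4, 9, 10):
    print(k, branching_nodes(toy_genome, k))
```

Once k - 1 exceeds the repeat length (here k = 10), no node branches and the graph collapses to a single cycle, which mirrors panel (C) of the figure.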
Long, single-molecule reads are sufficient for the complete assembly of most known microbial genomes. The assemblies presented here have good likelihood and finished-grade consensus accuracy exceeding 99.9999%.
Koren et al. Genome Biology 2013, 14:R101
Clinical Drivers
• Manageable workflow • Cost efficiency • Sensitivity and specificity • Referring to the clinical question • Depth vs. breadth of coverage
Adapting NGS to purpose
[Figure: control vs. expansion, y-axis 0 to 1200, series (GGCCCC)4, (GGCCCC)5 and (GGCCCC)6]
Merging forward and reverse reads
Use a HiFi polymerase
CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA
CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATTGATAGCATTTGCA
CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA
CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATTGATAGCATTTGCA
CAGAAAAAGTAGAAAATGGAAGTTTATGTGATCAAGAAATCGATAGCATTTGCA
CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA
CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGTATTTGCA
CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGTA
CAGAAAAAGTAGAAAATGGAAGTCTATGTGATTAAGAAATCGATAGCATTTGCA
CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA
TAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA
CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA
CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATTGATAGCATTTGCA
CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA
CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA
CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA
CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA
CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA
CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA
CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATTGATAGCATTTGCA
CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATTGATAGCATTTGCA
CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATTGATAGCATTTGCA
CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATTGATAGCATTTGCA
CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATCTGCA
CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA
CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATTGATAGCATTTGCA
CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATTGATAGCATTTGCA
CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATTGATAGCATTTGCA
CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATTGATAGCATTTGCA
CAGAAAAAGTAGGAAATGGAAGTCTATGTGATCAAGAAATTGATAGCATTTGCA
CAGAAAGAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA
Discarding rare (wrong) reads
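The idea of discarding rare reads can be sketched as follows: identical reads are counted, and any sequence seen in less than a chosen fraction of the reads is dropped as a probable error. This is a minimal sketch, not the code used for the slides; the `min_fraction` threshold and the function name are assumptions.

```python
from collections import Counter


def discard_rare_reads(reads, min_fraction=0.1):
    """Group identical reads and keep only groups whose share of all
    reads is at least min_fraction (a hypothetical cut-off)."""
    counts = Counter(reads)
    total = sum(counts.values())
    return {seq: n for seq, n in counts.items() if n / total >= min_fraction}


# Toy example: two real alleles plus two singleton error reads.
reads = ["GAAATCGAT"] * 12 + ["GAAATTGAT"] * 6 + ["GAAATCGAC", "TAAATCGAT"]
print(discard_rare_reads(reads))
```

In the toy input the two abundant sequences survive and the two singletons (5% of reads each) are removed.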
Detecting allele expansion
[Figure: Count (0 to 8000) against Nucleotide base pair length (110 to 128) for Control and CG11 (MI095)]
Sample  CASE   Gene   Genomic Coordinates        RefSeq       Variant c            Variant p            Allele % (Amplivar)  Allele % (MiSeq Reporter)
1       22029  BRCA2  chr13_329116280G>T         NM_000059.3  c.3136G>T            p.Glu1046Ter         26.03%               22.37%
2       22814  BRCA1  chr17_412464430insGA       NM_007300.3  c.1105_1106insTC     p.Asp369ValfsTer6    76.27%               75.65%
3       23074  BRCA2  chr13_329144380delT        NM_000059.3  c.5946delT           p.Ser1982ArgfsTer22  74.36%               74.27%
4       23162  BRCA1  chr17_412760430delCT       NM_007300.3  c.68_69delAG         p.Glu23ValfsTer17    76.04%               78.69%
5       23165  BRCA2  chr13_329140660delAATT     NM_000059.3  c.5574_5577delAATT   p.Ile1859LysfsTer3   100.00%              95.53%
5       23165  BRCA1  chr17_412444380del75       NM_007300.3  c.3005_3079del75     p.Asn1002_Ile1027del 48.71%               Not found
6       23179  BRCA1  chr17_41215948_G>A         NM_007300.3  c.5095C>T            p.Arg1699Trp         85.24%               87.14%
7       23210  BRCA2  chr13_329688360insT        NM_000059.3  c.9266_9267insT      p.Val3091ArgfsTer20  60.06%               59.64%
8       23815  BRCA1  chr17_412562060insA        NM_007300.3  c.374dupT            p.Gln126ProfsTer16   100.00%              94.23%
9       23824  BRCA1  chr17_4125691550delA       NM_007300.3  c.271delT            p.Cys91ValfsTer28    81.62%               83.32%
10      23828  BRCA2  chr13_329128870delATTAC    NM_000059.3  c.4395_4399delATTAC  p.Leu1466PhefsTer2   93.20%               91.76%
MSI by NGS
Genotyping
Library Construction
Covaris optimising
lane  Sample
M     E-gel Quant ladder
1     gDNA
2     A
3     B
4     C
5     50bp ladder
Peak at 200bp
50 bp ladder, 350bp bright band
B: 213.4 bp B: 208.8 bp C : 201.4 bp
NextEra method of Library Construction
Higher coverage greater reproducibility
Coverage Coefficient of variation
Can we capture coverage report dosage to diagnostic standards?
[Figure: coverage of samples against targets, with autosomal targets and chrX targets shown separately]
Inter-sample variation is low, but low coverage prevents dosage estimation
Chr X is a good first pass test for dosage
XX vs. XY
8 Female cases and 16 Male cases showing reproducibility of coverage of X loci within each group. Loci with higher SDs were associated with reduced coverage.
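A comparison of this kind can be sketched in a few lines: each sample's X-locus coverage is scaled by its own mean autosomal coverage, and the mean and standard deviation are then taken per locus within a group. This is a minimal sketch under that normalisation assumption, which is not stated on the slide; the function names and the toy depths are hypothetical.

```python
from statistics import mean, stdev


def normalise(coverage, autosomal_coverage):
    """Scale per-locus coverage by the sample's mean autosomal coverage."""
    scale = mean(autosomal_coverage)
    return [depth / scale for depth in coverage]


def locus_mean_and_sd(samples):
    """Per-locus (mean, SD) across samples; each sample is a list of
    normalised coverage values in the same locus order."""
    return [(mean(values), stdev(values)) for values in zip(*samples)]


# Toy depths for two X loci in two XX and two XY samples.
xx = [normalise([100, 90], [100, 100]), normalise([210, 200], [200, 200])]
xy = [normalise([50, 45], [100, 100]), normalise([100, 110], [200, 200])]
print(locus_mean_and_sd(xx))
print(locus_mean_and_sd(xy))
```

With these toy numbers the XY group averages about half the normalised coverage of the XX group at both loci, which is the single-copy signal a first-pass dosage test looks for.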
[Figure: two plots of Average XX and Average XY, x-axis 0 to 80]
EPCAM exon 9 to MSH2 exon 8 and EPCAM exon 9 to MSH2 exon 1 deletions
[Figure: y-axis 0 to 1.6 across EPCAM, MSH2 and MSH6 targets on chr2, from chr2:47596595 to chr2:48034049]
Integrating data handling with sequencing operation
Partly auto-‐fills metadata file (based on samplesheet metadata file name)
Checks if the run finishes with FASTQs generated on board
Checks if the analysis is complete, then if metadata contains sufficient information for FASTQ renaming. Also checks and starts analysis workflow in samplesheet
Renames FASTQ files
Monitor data monitor_data.sh
Monitor run status monitor_run_status_miseq.sh
Monitor metadata monitor_metadata.sh
Rename FASTQs rename_files.py
Syncing MiSeq
miseq_rsync.sh
Copies miseq run directory to server (every hour)
Workflow /storage/local/sw/system_automagic/workflows
Depending on the workflow defined in samplesheet ( Project)
Automatic data population for sample sheets
HiSeq MiSeq
Production: Taffy Centos 11TB
Development: S’Box Centos
1 TB Systems 3TB Sandbox
Production: MiSeq Reporter
PC Windows
Backup: Biocube Centos
RAID6 20TB
Image systems at install and after update
Snapshots: 14 12-‐hourly 4 weekly 12 monthly 3 yearly
Snapshots (System, Scripts + Software only)
Transfer Data as needed
Transfer new runs and backup analysis
Backup: ITS Tape (2 Copies)
Most recent Snapshot every 6 months for 3 years rolling
Required additional Resources: 1. MiSeq Reporter Desktop PC 2. New Desktop PC for Seb 3. ITS Account 4. (NAS to expand Biocube)
Costs Estimate: 1. 2x Desktop PCs: 3 – 4k 2. ITS tapes: max 11k/3years for 20TB 3. (NAS: approx 10k for 28TB)
Run Raw Data to keep: 1. MiSeq: Complete Run Folder 2. HiSeq: Run Folder excl. Bcl, image files etc. (fastq = raw data) 3. Compress data after 1 month (excl. fastq) 4. Discard external data after 2 months
NGS-DB
Projects Project_ID Project_Description User
Experiments Exp_ID Project_ID Exp_Description Prep_Method Kit_ID (if applicable) Pipeline(s) LibOperator
Samples [LIMS?] Sample_ID Patient_ID Family_ID Sample_Type Sample_Source Sample_Conc QC_Score
Libraries Library_ID Sample_ID Exp_ID Barcode_1 Barcode_2 LibConc QC_P/F Pool_IDs MiSeqRun_IDs HiSeqRun_Ids (FC+Lane)
MiSeqRuns MiSeqRun_ID MiCartridge_ID MiSeq_LoadConc SeqOperator MiFC_ID MiSeq_RunDate MiSeq_ClustDens MiSeq_PFClustDens MiSeq_Reads MiSeq_PFReads ...
HiSeqRuns HiSeqRun_ID HiFC_ID SeqOperator HiSeq_LoadConc SBS_RGT Cluster_RGT cBot_RGT HiSeq_RunDate HiSeq_ClustDens HiSeq_PFClustDens HiSeq_Reads HiSeq_PFReads ...
HiSeq_Flowcells HiFC_ID HiFC_CatNo HiFC_LotNo HiFC_Type HiFC_expiry HiSeqRun_ID
MiSeq_Flowcells MiFC_ID MiFC_CatNo MiFC_LotNo MiFC_Type MiFC_expiry MiSeqRun_ID
HiSeq_SBS (clust,cBot) SBS_RGT SBS_expiry SBS_CatNo SBS_LotNo SBS_Type HiSeqRun_ID
MiSeq_Cartridges Cart_ID Cart_CatNo Cart_LotNo Cart_TypeNo Cart_expiry MiSeqRun_ID
LibraryPools Pool_ID Library_IDs Exp_IDs Barcodes_1 Barcodes_2 PoolConc QC_P/F MiSeqRun_ID(s) HiSeqRun_ID(s) HiSeqRun_Ids (FC+Lane)
User Name Institution Contact Info Billing Info ...
Others Pipelines LibOperators SeqOperators Kits PrepMethods
Alternatives to read mapping and alignment
• Grouped read testing – Amplivar, AmbiVert
• Tiled matching – MIST
• kmer subtraction – Diamund
• Detection of allele expansions
coverage statistics
each amplicon read sorted by primers
grouped amplicon variants
grab amplicons
sort by locus
group amplicons
Edit Distance
Read Counts, Read Distribution and Analysis
Using the amplimer sequence to grab each amplicon is an alternative to querying the entire sequence output, with the advantage that each set of reads should be more homogeneous and amenable to grouping. Groups above a certain abundance (corresponding to the detection limit selected) can then be compared in detail with the canonical sequence using string comparison tools such as the Levenshtein (edit) distance, or by Smith-Waterman alignment. Using this approach we have confirmed that variants can be identified de novo, but with more interference from sequence errors than by grouped read typing. We have also shown that the current TruSeq Cancer Panel kit co-amplifies a region of chromosome 22 containing a perfect match to the pathogenic KIT Exon 11 c.1669T>A mutation. Artifactual data from the duplicated region risks the reporting of specious variants as false positive results.
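The Levenshtein (edit) distance mentioned above can be computed with a standard dynamic-programming recurrence. The sketch below is a textbook implementation for illustration, not the tool used in the analysis; the function name is an assumption.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# A grouped read compared with a canonical sequence (toy strings).
print(levenshtein("GCCACCAGCTCC", "GCCAGCAGCTCC"))
```

A grouped read at distance 1 from the canonical sequence is a candidate single-base variant or a single sequencing error; abundance is what separates the two.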
[Figure: Average Read Count per Amplicon +/- SEM, y-axis 0 to 6000, x-axis listing amplicons in NRAS, PIK3CA, KIT, EGFR, BRAF, PTEN and KRAS]
Smith-Waterman
BLAST
363 reads; common mutation: p.G12A; chr12:25,398,290C>G c.35G>C
TGTATCGTCAAGGCACTCTTGCCTACGCCAGCAGCTCCAACTACCACAAGTTTATATTCAGTCATTTTCAGCAGGCCTTATAATAAAAATAATGAAAATGTGACTATATTAGAACATGTCACACATAAGGTTA
220 reads; wildtype
TGTATCGTCAAGGCACTCTTGCCTACGCCACCAGCTCCAACTACCACAAGTTTATATTCAGTCATTTTCAGCAGGCCTTATAATAAAAATAATGAAAATGTGACTATATTAGAACATGTCACACATAAGGTTA
64 reads; non-coding chr12:295,398,329C>T
TGTATCGTCAAGGCACTCTTGCCTACGCCACCAGCTCCAACTACCACAAGTTTATATTCAGTCATTTTCAGCAGGCTTTATAATAAAAATAATGAAAATGTGACTATATTAGAACATGTCACACATAAGGTTA
39 reads; two errors, non adjacent, one corresponding to c.35G>C
TGTATTGTCAAGGCACTCTTGCCTACGCCAGCAGCTCCAACTACCACAAGTTTATATTCAGTCATTTTCAGCAGGCCTTATAATAAAAATAATGAAAATGTGACTATATTAGAACATGTCACACATAAGGTTA
27 reads; two errors, non adjacent, one corresponding to c.35G>C
TGTATCGTCAAGGCACTCTTGCCTACGCCAGCAGCTCCAACTACCACAAGTTTATATTCAGTCATTTTCAGCAGGCCTTATAATAAAAATAATGAAAATGTGATTATATTAGAACATGTCACACATAAGGTTA
26 reads; two errors, non adjacent, one corresponding to c.35G>C
TGTATCGTCAAGGCACTCTTGCCTACGCCAGCAGCTCCAACTACCACAAGTTTATATTCAGTCATTTTCAGTAGGCCTTATAATAAAAATAATGAAAATGTGACTATATTAGAACATGTCACACATAAGGTTA
17 reads; two non-coding errors, non adjacent
TGTATCGTCAAGGCACTCTTGCCTACGCCACCAGCTCCAACTACCACAAGTTTATATTCAGTCATTTTCCGCAGGCTTTATAATAAAAATAATGAAAATGTGACTATATTAGAACATGTCACACATAAGGTTA
17 reads; corresponding to c.35G>A
TGTATCGTCAAGGTACTCTTGCCTACGCCAGCAGCTCCAACTACCACAAGTTTATATTCAGTCATTTTCAGCAGGCCTTATAATAAAAATAATGAAAATGTGACTATATTAGAACATGTCACACATAAGGTTA
>1141_>EGFR9_78 chr7 55242418-55242511 1141GACTCTGGATCCCAGAAGGTGAGAAAGTTAAAATTCCCGTCGCTATCAAGGAATTAAGAGAAGCAACATCTCCGAAAGCCAACAAGGAAATCCTCGATGTGAGTTTCTGCTTTGCTGTGTGGGGGTCCATGGCT>646_>EGFR9_78 chr7 55242418-55242511 646GACTCTGGATCCCAGAAGGTGAGAAAGTTAAAATTCCCGTCGCTATCAAGACATCTCCGAAAGCCAACAAGGAAATCCTCGATGTGAGTTTCTGCTTTGCTGTGTGGGGGTCCATGGCT>60_>EGFR9_78 chr7 55242418-55242511 60GACTCTGGATCCCAGAAGGTGAGAAAGTTAAAATTCCCGTCGCTATCAAGGAATTAAGAGAAGCAACATCTCCGAAAGCCAACAAGGAAATCCTCGATGTGAGTTTCTGCTTTGCTGTGTGGGGGTTCATGGCT>57_>EGFR9_78 chr7 55242418-55242511 57GACTCTGGATCCCAGAAGGTGAGAAAGTTAAAATTCCCGTCGCTATCAAGGAATTAAGAGAAGCAACATCTCCGAAAGCCAACAAGGAAATCCTCGATGTGAGTTTCTGTTTTGCTGTGTGGGGGTCCATGGCT>54_>EGFR9_78 chr7 55242418-55242511 54GACTTTGGATCCCAGAAGGTGAGAAAGTTAAAATTCCCGTCGCTATCAAGGAATTAAGAGAAGCAACATCTCCGAAAGCCAACAAGGAAATCCTCGATGTGAGTTTCTGCTTTGCTGTGTGGGGGTCCATGGCT>51_>EGFR9_78 chr7 55242418-55242511 51GACTCTGGATCCCAGAAGGTGAGAAAGTTAAAATTCCCGTCGCTATCAAGGAATTAAGAGAAGCAACATCTCCGAAAGCCAACAAGGAAATCTTCGATGTGAGTTTCTGCTTTGCTGTGTGGGGGTCCATGGCT
GACTCTGGATCCCAGAAGGTGAGAAAGTTAAAATTCCCGTCGCTATCAAGGAATTAAGAGAAGCAACATCTCCGAAAGCCAACAAGGAAATCCTCGATGTGAGTTTCTGCTTTGCTGTGTGGGGGTCCATGGCT
GACTCTGGATCCCAGAAGGTGAGAAAGTTAAAATTCCCGTCGCTATCAAG---------------ACATCTCCGAAAGCCAACAAGGAAATCCTCGATGTGAGTTTCTGCTTTGCTGTGTGGGGGTCCATGGCT
KRAS point mutation
EGFR 15 base deletion
EGFR 15 base deletion with Smith-Waterman alignment
Amplicon Sequencing: treating reads as groups
Amplivar vs. Alignment
Amplivar                              Alignment (e.g. BWA)
Groups reads                          Uses individual reads
Designed for amplicons                Designed for randomly sheared fragments
Works with FASTA after filtering      Works with FASTQ
Matches against target list           Aligns against whole genome
Alignment is an optional late stage   Alignment is a required early stage
quality_filter Hard coded quality filters Output FASTA and qcore files
FASTA files (.fna) and quality files (.csv) written to merged folder
fastq and fasta
FASTA
>MISEQ-2:20:000000000-A61NM:1:1101:12299:1738 1:N:0:some_name
TGCGTCATCATCTTTGTCATCGTGTACTACGCCCTGATGGCTGGTGTGGTTTGGTTTGTGGTC
FASTQ
@MISEQ-2:20:000000000-A61NM:1:1101:12299:1738 1:N:0:some_name
TGCGTCATCATCTTTGTCATCGTGTACTACGCCCTGATGGCTGGTGTGGTTTGGTTTGTGGTC
+
AAAAADAFFFFFGGGFGGFGGFHFGFHHFGAEGIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
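The relationship between the two formats can be shown with a small converter: each FASTQ record's header and sequence are kept, and the separator and quality lines are dropped. This is a sketch, not Amplivar's own fastq2fasta step, and it assumes strict four-line records with unwrapped sequences; the function name is hypothetical.

```python
def fastq_to_fasta(lines):
    """Convert FASTQ text lines to FASTA text lines.

    Assumes four lines per record: @header, sequence, +, quality.
    """
    records = [line.rstrip("\n") for line in lines]
    fasta = []
    for i in range(0, len(records) - 3, 4):
        header, sequence, separator, _quality = records[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        fasta.append(">" + header[1:])
        fasta.append(sequence)
    return fasta


fastq = [
    "@read_1 1:N:0:some_name",
    "TGCGTCATCA",
    "+",
    "AAAAADAFFF",
]
print("\n".join(fastq_to_fasta(fastq)))
```

Because the quality line is discarded here, any quality filtering has to happen before this step.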
group_fasta_reads
• @file_list = glob "$merged_dir/*fna" ;
Grouped reads
3   AGACAACTGTTCAAACTGATGGGACCCACTCCATCGAGATTTCACTGTAGCTAGACCAAAATAG
1   ACCACTTTTGGAGGGAGATTTCGCTCCTGAAGAAAATTCGACAGCTTTGTGCCTGGCTAATTCT
527 AGTGTATCCATTTTCTTCTCTCTGACCTTTGGCCCCCTACATCGACCATTCTGCAAGGTTAACA
1   CTCACCCCCAGACTGGGTTTTTAGGTCTCGGTTTACAAGTTTCTTATGCTGATGCTGAAAAAAA
Usual suspects file
4 column tab separated text file with Unix line endings
• Column 1: RefSeq identifier
• Column 2: cDNA HGVS nomenclature
• Column 3: codon change HGVS nomenclature
• Column 4: sequence to match
• Usual suspects files available for TruSeq cancer panel and for PCRbrary
RefSeq            cDNA description      codon change  sequence
BRAF_NM_004333.4  c.1798G               V600          CTCCATCGAGATTTCACTGTAGCTAGACCAAA
BRAF_NM_004333.4  c.1798_1799delinsAA   V600K         CTCCATCGAGATTTCTTTGTAGCTAGACCAAA
BRAF_NM_004333.4  c.1798_1799delinsAG   V600R         CTCCATCGAGATTTCCTTGTAGCTAGACCAAA
BRAF_NM_004333.4  c.1798G>A             V600K         CTCCATCGAGATTTCATTGTAGCTAGACCAAA
BRAF_NM_004333.4  c.1799T               V600          CTCCATCGAGATTTCACTGTAGCTAGACCAAA
BRAF_NM_004333.4  c.1799_1800delinsAA   V600E         CTCCATCGAGATTTTTCTGTAGCTAGACCAAA
BRAF_NM_004333.4  c.1799_1800delinsAT   V600D         CTCCATCGAGATTTTACTGTAGCTAGACCAAA
BRAF_NM_004333.4  c.1799T>A             V600E         CTCCATCGAGATTTCTCTGTAGCTAGACCAAA
Genotype table
Columns: library; BRAF 600 wt V600; BRAF 600 c.1799T>A V600E; KRAS 12 & 13 wt G12/G13; KRAS 12 c.34G>A G12S; KRAS 12 c.34G>C G12R; KRAS 12 c.34G>T G12C; KRAS 12 c.35G>A G12D; KRAS 12 c.35G>C G12A; KRAS 12 c.35G>T G12V; KRAS 13 c.38G>A G13D
DL130016FTGx120036 100.0 0.0 100.0 0.0 0.0 0.0 0.0 74.2 0.0 0.0
DL130028FTGx120036 100.0 0.0 100.0 0.0 0.0 0.0 0.0 30.6 0.0 0.0
DL130040FTGx120036 100.0 0.0 100.0 0.0 0.0 0.0 0.0 50.0 0.0 0.0
DL130052FTGx120036 100.0 0.0 100.0 0.0 0.0 0.0 0.0 130.8 0.0 0.0
DL130064FTGx120036 100.0 0.4 100.0 0.0 0.0 0.0 0.0 300.0 0.0 0.0
DL130076FTGx120036 100.0 0.0 100.0 0.0 0.0 0.0 0.0 126.5 0.0 0.0
DL130018FTGx120041 100.0 0.0 100.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
DL130030FTGx120041 100.0 0.0 100.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
DL130042FTGx120041 100.0 0.0 100.0 1.5 0.0 0.0 0.0 0.0 0.0 0.0
DL130054FTGx120041 100.0 0.0 100.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
DL130066FTGx120041 100.0 0.2 100.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
DL130078FTGx120041 100.0 0.0 100.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
DL130015FTGx120044 100.0 49.9 100.0 0.0 0.0 0.0 0.0 0.0 0.0 3.4
DL130027FTGx120044 100.0 45.9 100.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
DL130039FTGx120044 100.0 54.9 100.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
DL130051FTGx120044 100.0 32.5 100.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
DL130063FTGx120044 100.0 45.3 100.0 0.0 0.0 0.0 0.7 0.0 0.0 0.0
DL130075FTGx120044 100.0 43.2 100.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
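Genotyping against a usual suspects list can be sketched as a plain substring search over the grouped reads, summing the counts of every group that contains a listed sequence. This is an illustration only and not Amplivar's implementation; how the tool turns counts into the figures in the table above is not stated in the slides, so the sketch stops at raw counts. The function name and toy reads are hypothetical.

```python
def genotype(grouped_reads, usual_suspects):
    """Count reads supporting each usual-suspect sequence.

    grouped_reads: dict mapping read sequence -> number of identical reads.
    usual_suspects: dict mapping a label -> sequence to match.
    """
    support = {label: 0 for label in usual_suspects}
    for read, count in grouped_reads.items():
        for label, target in usual_suspects.items():
            if target in read:
                support[label] += count
    return support


# Two entries taken from the usual suspects table above.
suspects = {
    "BRAF V600 (wt)": "CTCCATCGAGATTTCACTGTAGCTAGACCAAA",
    "BRAF c.1799T>A V600E": "CTCCATCGAGATTTCTCTGTAGCTAGACCAAA",
}
# Toy grouped reads: the target sequence with short made-up flanks.
grouped = {
    "GG" + suspects["BRAF V600 (wt)"] + "AT": 30,
    "GG" + suspects["BRAF c.1799T>A V600E"] + "AT": 10,
    "GGACCACTTTTGGAGGGAGATTTCGCTCCTGAAGAT": 5,
}
print(genotype(grouped, suspects))
```

Exact matching means a read with a sequencing error inside the target window supports neither allele, which is one reason the target sequences are kept short.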
Amplicon flank file
4 column tab separated text file with Unix line endings
• Column 1: amplicon identifier with unique number
• Column 2: size (not currently calculated)
• Column 3: co-ords
• Flanks with (.*) to capture the sequence between amplimers
• Flank files available for PCRbrary, TruSeq Cancer Panel and Olga’s TruSeq panel
ID         Size  Co-ords                    Sequence
1_MPL1_2   175   chr1:43815006-43815137     GCCGTAGGTGCGCACG(.*)TCAGCAGCAGCAGG
2_NRAS1_7  175   chr1:115256526-115256653   GCATTCCCTGTGGTTTT(.*)AGAGTACAGTGCCATG
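The flank entries are regular expressions, so grabbing the sequence between two amplimers can be sketched with Python's `re` module: the `(.*)` group captures whatever lies between the two flanks. This is a minimal sketch rather than Amplivar's code; the function name and the toy read are hypothetical.

```python
import re


def grab_amplicon(read, flank_pattern):
    """Return the sequence between the two flanks, or None if the
    read does not contain both flanks in order."""
    match = re.search(flank_pattern, read)
    return match.group(1) if match else None


# Flank pattern for 1_MPL1_2, taken from the flank file above.
mpl_flanks = "GCCGTAGGTGCGCACG(.*)TCAGCAGCAGCAGG"
# Toy read: left flank, a made-up insert, right flank, plus stray ends.
toy_read = "TT" + "GCCGTAGGTGCGCACG" + "ACGTACGT" + "TCAGCAGCAGCAGG" + "AA"

print(grab_amplicon(toy_read, mpl_flanks))
print(grab_amplicon("ACGTACGT", mpl_flanks))
```

Note that `(.*)` is greedy, so if the right flank occurred twice in a read the capture would extend to the last occurrence.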
Grouping Reads
amplivar -i * -j * -k *
-i /storage/local/sandbox/working_directory
-j usual_suspects.txt
-k flanking_primers
Merges read pairs, quality filters, converts to fasta, groups by sequence
Counts reads corresponding to each amplicon
Genotypes according to the usual suspects table with read counts
Groups reads by amplicon for mutation scanning
AMPLIVAR SeqPrep:
Remove adapters & Merge reads
Filter reads by quality
Convert fastq2fasta
Group fasta reads
Genotype grouped reads
Grab reads by flanks
Sort reads by locus
AMPLIVAR WRAPPER
AMPLIVAR SeqPrep:
Remove adapters & Merge reads
Filter reads by quality
Convert fastq2fasta
Group fasta reads
Genotype grouped reads
Grab reads by flanks
Sort reads by locus
Create symbolic links for fastqs
Create subdirectories for each fastq file pair
(R1 &R2)
Run amplivar
Run sort amplicons
Run blat on grouped, sorted reads
Convert blat psl2sam, sam2bam
Run bamleftalign
Inflate bam
Run VarScan
Run VEP
AmpliVar Required tools
• SeqPrep (C)
• Blat (C)
• Samtools (C)
• Bamleftalign (C++)
• VarScan (Java)
• Bash
• Perl
• Python
>1_MPL1_2(175(chr1:43815006443815137 GGCGGTGGACGGAGATCTGGGGTCACAGAGCGAACCAAGAATGCCTGTTTACAGGCCTTCGGCTCCACCTGGTCCACCGCCAGTCTCCTGCCTGGCGGGGGCGGTACCTGTAGTGTGCAGGAAACTGCCACC734>1_MPL1_2(175(chr1:43815006443815137 GGCGGTGGACGGAGATCTGGGGTCACAGAGCGAACAAAGAATGCCTGTTTACAGGCCTTCGGCTCCACCTGGTCCACCGCCAGTCTCCTGCCTGGCGGGGGCGGTACCTGTAGTGTGCAGGAAACTGCCACC4>1_MPL1_2(175(chr1:43815006443815137 GGCGGTGGACGGAGATCTGGGGTCACAGAGCGAACCAAGAATGCCTGCTTACAGGCCTTCGGCTCCACCTGGTCCACCGCCAGTCTCCTGCCTGGCGGGGGCGGTACCTGTAGTGTGCAGGAAACTGCCACC4>1_MPL1_2(175(chr1:43815006443815137 GGCGGTGGACGGAGATCTGGGGTCACAGAGCGAACCAAGAATGCCTGTTTACAGGCCTTCGGCTCCACCTGGTCCACCGCCAGTCTCCTACCTGGCGGGGGCGGTACCTGTAGTGTGCAGGAAACTGCCACC4>1_MPL1_2(175(chr1:43815006443815137 GGCGGTGGACGGAGATCTGGGGTCACAGAGCGAACCAAGAATGCCTGTTTACAGGCCTTCGGCTCCACCTGGTCCACCGCCAGTCTCCTGCCTGGCAGGGGCGGTACCTGTAGTGTGCAGGAAACTGCCACC4>1_MPL1_2(175(chr1:43815006443815137 GGCGGTGGACGGAGATCTGGGGTCACAGAGCGAACCAAGAATGCCTGTTTACAGGCCTTCGGCTCCACCTGGTCCACCGCCAGTCTCCTGCCTGGCGGAGGCGGTACCTGTAGTGTGCAGGAAACTGCCACC4>1_MPL1_2(175(chr1:43815006443815137 GGCGGTGGACGGAGATCTGGGGTCACAGAGCGAGCCAAGAATGCCTGTTTACAGGCCTTCGGCTCCACCTGGTCCACCGCCAGTCTCCTGCCTGGCGGGGGCGGTACCTGTAGTGTGCAGGAAACTGCCACC4>1_MPL1_2(175(chr1:43815006443815137 GGCGGTGGACGGAGGTCTGGGGTCACAGAGCGAACCAAGAATGCCTGTTTACAGGCCTTCGGCTCCACCTGGTCCACCGCCAGTCTCCTGCCTGGCGGGGGCGGTACCTGTAGTGTGCAGGAAACTGCCACC4>1_MPL1_2(175(chr1:43815006443815137 GGCGGTGGACGGAGATCAGGGGTCACAGAGCGAACCAAGAATGCCTGTTTACAGGCCTTCGGCTCCACCTGGTCCACCGCCAGTCTCCTGCCTGGCGGGGGCGGTACCTGTAGTGTGCAGGAAACTGCCACC3>1_MPL1_2(175(chr1:43815006443815137 GGCGGTGGACGGAGATCTGGGGTCAAAGAGCGAACCAAGAATGCCTGTTTACAGGCCTTCGGCTCCACCTGGTCCACCGCCAGTCTCCTGCCTGGCGGGGGCGGTACCTGTAGTGTGCAGGAAACTGCCACC3>1_MPL1_2(175(chr1:43815006443815137 GGCGGTGGACGGAGATCTGGGGTCACAGAGCAAACCAAGAATGCCTGTTTACAGGCCTTCGGCTCCACCTGGTCCACCGCCAGTCTCCTGCCTGGCGGGGGCGGTACCTGTAGTGTGCAGGAAACTGCCACC3>1_MPL1_2(175(chr1:43815006443815137 
GGCGGTGGACGGAGATCTGGGGTCACAGAGCGAACCAAGAATGCCTGTCTACAGGCCTTCGGCTCCACCTGGTCCACCGCCAGTCTCCTGCCTGGCGGGGGCGGTACCTGTAGTGTGCAGGAAACTGCCACC3
Sorted, locus (amplicon)-based files

Sample | Amplicon | Read | % of major allele
MS0318_1164 Blood | 1030_BRCA1_exon14 175 chr17:41226334-41226449 | AAGAGGAGCTCATTAAGGTTGTTGATGTGGAGGAGCAACAGCTGGAAGAGTCTGGGCCACACGATTTGACGGAAACATCTTACTTGCCAAGGCAAGATCTAGGTAATATTTCATCT | 100%
MS0323_1164 Frozen | 1030_BRCA1_exon14 175 chr17:41226334-41226449 | AAGAGGAGCTCATTAAGGTTGTTGATGTGGAGGAGCAACAGCTGGAAGAGTCTGGGCCACACGATTTGACGGAAACATCTTACTTGCCAAGGCAAGATCTAGGTAATATTTCATCT | 100%
MS0313_1164 FFPE | 1030_BRCA1_exon14 175 chr17:41226334-41226449 | AAGAGGAGCTCATTAAGGTTGTTGATGTGGAGGAGCAACAGCTGGAAGAGTCTGGGCCACACGATTTGACGGAAACATCTTACTTGCCAAGGCAAGATCTAGGTAATATTTCATCT | 100%
MS0313_1164 FFPE | 1030_BRCA1_exon14 175 chr17:41226334-41226449 | AAGAGGAGCTTATTAAGGTTGTTGATGTGGAGGAGCAACAGCTGGAAGAGTCTGGGCCACACGATTTGACGGAAACATCTTACTTGCCAAGGCAAGATCTAGGTAATATTTCATCT | 28.40%
MS0313_1164 FFPE | 1030_BRCA1_exon14 175 chr17:41226334-41226449 | AAGAGGAGCTCATTAAGGTTGTTGATGTGGAGGAGCAACAGCTGGAAGAGTCTGGGCCACACGATTTGACGGAAACATCTTACTTGCCAAGGCAAGATTTAGGTAATATTTCATCT | 26.20%

Most reads are error free
FFPE contaminates the evidence
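The "% of major allele" column is not defined on the slide. One plausible reading, assumed here, is that each distinct read's count is expressed as a percentage of the most abundant read's count for that amplicon. The sketch below illustrates that assumption only; `percent_of_major` is a hypothetical helper and the short keys stand in for full read sequences.

```python
def percent_of_major(counts):
    """Express each read count as a percentage of the largest count.

    `counts` maps a distinct read sequence to its read count and must be
    non-empty. The definition used here is an assumption, not taken from the slides.
    """
    major = max(counts.values())
    return {seq: round(100.0 * n / major, 2) for seq, n in counts.items()}

# Counts borrowed from the grouped FASTA example (734 for the major read, 4 for a variant).
table = percent_of_major({"major_read": 734, "variant_read": 4})
```

Under this reading, the FFPE rows at 28.40% and 26.20% would be variant reads that are far more abundant, relative to the major read, than the error reads seen in the grouped FASTA example.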
Clustal alignment & phylogeny of errors
Merged forward and reverse reads
PandaSeq and SeqPrep can merge overlapping read pairs to make them longer and more accurate. In an unselected 100-base read-pair run enriched for hereditary cancer genes, over 20% of the reads could be merged. With longer reads and suitable experimental design this fraction could be increased.

Pairs Processed: 18,456,760
Pairs Merged: 4,029,383
Pairs With Adapters: 32,899
Pairs Discarded: 646
Percent merged: 21.83

locus | 1 total | 2 total
chr17 41223060 41223115 | 2631 | 2777
chr17 41223076 41223130 | 2452 | 2674
chr17 41223101 41223146 | 2223 | 2501
Figure: bar chart of "1 total" and "2 total" read counts (y-axis 0 to 3000) for loci chr17 41223060 41223115, chr17 41223076 41223130 and chr17 41223101 41223146.
Probability of a given length read as a subset of a longer read in a normal distribution of longer reads: the “minimum substring problem”
This approach might make sense with longer reads
Average read length from 101 to 150 bases
Orthogonal validation without Sanger
Figure: UCSC Genome Browser view (hg19, chr17, positions 41,223,070 to 41,223,105, scale bar 10 bases). Tracks shown: flanks; Your Sequence from Blat Search (YourSeq); RefSeq Genes; Simple Nucleotide Polymorphisms (dbSNP 137) Found in >= 1% of Samples (rs1799966); Duplications of >1000 Bases of Non-RepeatMasked Sequence; Repeating Elements by RepeatMasker.
Sequence shown: AGTCATCATACTCGTCGTCGACCTGAGACCCGTCTAAGA
Flank patterns shown:
CCAGCAGTATCAGTA(.*)AGATTCTGCAACTTTTATGAGCAGCAGCTG(.*)CAATTGGGGAACTTT
AGATTCTGCAACTTT(.*)CAATGCAGAGGTTGAG
Forward reads including rs1799966
Figure: bar chart of read counts (y-axis 0 to 500) for GCCCAGAGTCCAGCTGCTGCTCATAC and GCCCAG*G*GTCCAGCTGCTGCTCATAC.
Downstream processing of FASTA files: BWA, BLAT, annotation
VEP
MIST
A schematic of the workflow used by MiST.
Subramanian S et al. Nucl. Acids Res. 2013;41:e154
Identification of potentially paralogous read pairs.
Subramanian S et al. Nucl. Acids Res. 2013;41:e154
Motivation for the use of Geoseq in variant calling.
Subramanian S et al. Nucl. Acids Res. 2013;41:e154
Comparison of MiST and GATK. Each box has three sets of numbers, from left to right they are variant calls, (i) unique to MiST, (ii) common to both platforms and (iii) unique to GATK.
Filters are applied to remove calls occurring in public databases like dbSNP (17), 1000 Genomes (18) and a collection of already known private variants.
Subramanian S et al. Nucl. Acids Res. 2013;41:e154
DIAMUND: Direct Comparison of Genomes to Detect Mutations
Figure 1. Outline of initial steps in the Diamund algorithm, which identifies all k-mers unique to an affected proband and missing from both unaffected parents. The first step identifies k-mers, after which the proband data are filtered to remove k-mers resulting from sequencing errors. Intersecting all three sets identifies k-mers that are unique to the proband.
sequencing, where the number of true but clinically irrelevant variants will be 50 times greater.
Here, we introduce a new method, DIAMUND (direct alignment for mutation discovery), which takes a different approach to exome and whole-genome analysis, and as a result produces dramatically smaller sets of candidate mutations. Rather than aligning all samples to the reference genome, we align the sequences directly to one another. This method is designed primarily for two types of analyses: (1) self-comparisons, where diseased tissue is compared with normal tissue from the same individual, and (2) family studies, where the differences among the DNA sequences from the subjects are far fewer than the differences between any subject and the reference genome.
Our method does not require that the raw sequencing reads, usually numbering 100 million or more for a whole exome, be aligned to the GRC37 reference genome, nor does it require a complex genome assembly or an all-versus-all alignment of these large data sets. As we explain in detail below, we use a more efficient algorithm that allows us to quickly find sequences that are unique to any sample.
We have implemented and tested DIAMUND on exomes representing two types of analysis problem. First, we considered self-comparisons, in which DNA from primary cultured fibroblasts derived from diseased tissue in an affected individual was compared with DNA from nondiseased primary cultured fibroblasts from the same individual. For the analysis of tumor cells or other somatic mosaic genetic abnormalities, this direct comparison should yield a smaller set of variants than an analysis that first compares all sequences to the reference genome. Second, we looked at three parent–child trios in which a de novo mutation in the child was suspected to be causing disease. The standard algorithm would compare all three individuals to the reference genome, generating very large lists of variants, many of which are shared by the child and a parent. By comparing the child’s DNA directly to both parents, we can quickly identify all de novo mutations, without losing sensitivity and without detecting family-specific variants that add noise to the process. For each of these problems, the number of true de novo mutations is very small, obviating the need for the aggressive filters that exome and whole-genome pipelines use, which might eliminate the true variant of interest.
De novo mutations may account for a high proportion of Mendelian disorders. Yang et al. recently reported [Yang et al., 2013] on exome sequencing of 250 probands and their families, among which they identified 33 patients with autosomal dominant and nine with X-linked diseases. Of these, 83% of the autosomal dominant and 40% of the X-linked mutations occurred de novo.
In addition to generating fewer false positives, direct comparison between samples within a family, or between affected and unaffected tissue, allows for detection of mutations in regions that are entirely missing from the reference genome. It has already been shown that some human populations have large shared genomic regions, often spanning many megabases [Li et al., 2010], which are missing entirely from the human reference genome. These include novel segmental duplications [Schuster et al., 2010] as well as entirely novel sequences. If a mutation of interest happens to fall in one of these regions, then conventional methods will be guaranteed to miss it. Our direct comparison algorithm, in contrast, includes these regions and is quite capable of finding mutations within them.
An important caveat is that DIAMUND is not intended to solve the more general problem of variant detection in any sample. It is designed to take advantage of very closely related samples where direct between-sample comparisons can more effectively identify mutations present in just one or a subset of the samples.
Methods
DIAMUND begins with two or more sets of DNA sequences, or “reads,” generated by a sequencing instrument. Here, we describe the algorithm as applied to three trios consisting of an affected individual (or proband) and two unaffected parents. Specializing the algorithm to two samples, where one is normal and the other is diseased (e.g., cancerous) tissue from the same individual, is straightforward.
One way of directly comparing two or more genomes is to assemble each data set de novo, using any of several next-generation sequence assemblers [Schatz et al., 2010], and then compare the assemblies using a whole-genome alignment algorithm such as MUMmer [Delcher et al., 1999; Kurtz et al., 2004]. However, whole-genome assembly is computationally costly and can produce erroneous assemblies, which in turn might create even larger problems than aligning all reads to the reference genome. Instead, DIAMUND uses a direct approach in which we count all sequences of length k in all the reads, for some fixed value of k, and then compare these k-mers to one another. Here, we outline the 10 major steps of the algorithm; the initial steps are illustrated in Figure 1.
284 HUMAN MUTATION, Vol. 35, No. 3, 283–288, 2014
Filtering statistics
Table 1. Illustration of the Data Reduction at Each Step from Raw Reads to a Final Set of Mutated Loci
Data remaining at the end of step

Filtering step | Disease/normal pair | Family trio BH1019 | Family trio BH2041 | Family trio BH2688
Number of reads from proband/diseased tissue | 118,414,556 | 84,201,820 | 75,877,750 | 103,527,644
Number of 27-mers in proband/diseased tissue | 911,738,627 | 795,477,167 | 517,272,851 | 1,088,610,020
Number of k-mers with count >10 | 77,903,885 | 61,805,320 | 64,719,150 | 113,066,951
Remove vector sequence | 77,898,848 | 61,800,798 | 64,713,995 | 113,062,417
Eliminate k-mers found in reference GRC37 exome | 17,821,359 | 9,385,347 | 10,730,208 | 50,535,681
Eliminate k-mers found in parent exomes/normal tissue | 10,568 | 65,352 | 20,130 | 2,006
Identify reads containing k-mers | 32,829 reads | 148,496 | 46,454 | 4,404
Remove reads containing vector | 15,260 | 125,648 | 38,799 | 2,760
Number of contigs after assembly | 2,147 | 13,189 | 3,755 | 359
Number of contigs with >3 reads after merging contigs | 279 contigs | 1,437 | 701 | 71
Identify variants covered by reads from normal tissue | 55 contigs | 5 | 6 | 2
Keep variants with >5% coverage | 42 variants | 5 | 6 | 2
Find variants in coding regions | 14 variants | 3 | 3 | 1
Remove synonymous SNPs | 10 variants | 2 | 3 | 1
Step 1: We utilize an efficient parallel algorithm, Jellyfish [Marcais and Kingsford, 2011], for the k-mer counting step. This first step converts the reads for each exome (or genome) to a set of k-mers, which should in theory be a much smaller data set: the number of k-mers in an exome is equivalent to the length of the exome, 50–60 Mbp using current exome capture kits. However, the initial set is dramatically larger, due primarily to sequencing errors, which we address below. We sort each set of k-mers to allow for efficient intersection operations in subsequent steps. Sorting N k-mers requires O(N log N) time, after which computing the intersection with another set of k-mers requires only O(N) time.
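To make the counting and sorted-intersection idea concrete, here is a small in-memory Python sketch. It is not Jellyfish and does not reflect how DIAMUND is implemented; the function names are hypothetical, and a real exome would need a disk-backed or parallel counter.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every substring of length k across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def intersect_sorted(a, b):
    """Intersect two sorted lists of distinct k-mers in one linear pass."""
    i = j = 0
    shared = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            shared.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return shared

# Toy data: sort once (O(N log N)), then intersect in O(N).
proband = sorted(count_kmers(["ACGTA"], 3))
parent = sorted(count_kmers(["CGTAC"], 3))
common = intersect_sorted(proband, parent)
```

The merge-style pass only works because both inputs are sorted, which is why the sort is paid for once up front and reused for every later comparison.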
Step 2: The second step in the DIAMUND algorithm removes all k-mers from the proband (but not from the unaffected samples) that are likely to represent sequencing errors. Note that every sequencing error introduces k new k-mers. If k is sufficiently large, then virtually all of these k-mers will be unique, i.e., they will not occur in the genome or elsewhere in the reads. Combined with the fact that exome coverage is usually very deep, we can safely assume that any k-mer that occurs just once represents an error.
After empirical observations of multiple exomes, we observed that even k-mers occurring more than once are usually errors. Due to biases in sequencing technology, exome data sets may contain erroneous k-mers that occur 10 or more times, particularly for regions that contain very deep coverage (which can exceed 1000-fold for some exonic targets). For the exomes we have analyzed, average coverage is approximately 80–100×, which means that a novel, heterozygous mutation should have 40–50× coverage. Even in regions with lower coverage, novel mutations should have 20 or more reads (and k-mers) covering them. Note that in the case of mosaicism, a much lower proportion than 50% of the reads might contain the mutation; the software can be adjusted to report such cases.
Given these observations, at this stage, we discard all k-mers that occur fewer than 10 times. We tested different values before choosing 10 as the default value, and this can easily be adjusted for data sets with lower or higher coverage. In our tests, a minimum value of 10 excluded an extremely small number of true k-mers.
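Two points from this step can be checked on toy data: a single substitution creates up to k new k-mers, and a count threshold removes rare k-mers. The Python sketch below is illustrative only; the sequences are made up, `kmer_set` and `drop_rare` are hypothetical names, and the threshold follows the text ("fewer than 10 times" are discarded).

```python
def kmer_set(seq, k):
    """Return the set of distinct k-mers in one sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def drop_rare(counts, min_count=10):
    """Keep only k-mers seen at least `min_count` times."""
    return {kmer: n for kmer, n in counts.items() if n >= min_count}

# A made-up 20-base reference and the same sequence with one substitution (C to A at index 10).
reference = "ACGTTGCAAGCTTAGGCATC"
with_error = "ACGTTGCAAGATTAGGCATC"
k = 5

# Every window overlapping the changed base yields a k-mer absent from the reference.
new_kmers = kmer_set(with_error, k) - kmer_set(reference, k)

# Toy counts: only k-mers at or above the threshold survive.
kept = drop_rare({"AAAAA": 9, "CCCCC": 10, "GGGGG": 50})
```

Near the ends of a read fewer than k windows overlap the error, so k is an upper bound per error rather than an exact figure.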
Step 3: After removing likely sequencing errors, some k-mers may remain due to vector contamination. We precompute all k-mers in known vectors, taken from the UniVec database (www.ncbi.nlm.nih.gov/tools/vecscreen/univec), and remove these from the exome representing the proband (or the diseased tissue, in the case of normal vs. diseased tissue comparisons).
We also observe that any k-mer that occurs in the reference genome is probably not the cause of disease. We precompute all k-mers from the targeted regions of the GRC37 genome, and remove these “normal” k-mers from the proband’s data. Note that this set can easily be expanded to include a larger set of variants known to be harmless.
Step 4: After computing all k-mers in the reads from the proband and both parents, the third step computes the intersection between proband and mother, and separately between proband and father (Fig. 1). We collect all k-mers unique to the proband but missing from the mother, and repeat this step for the father. We then intersect the two resulting files to give us a single file that contains all k-mers found in the proband but missing from both unaffected parents. These form our initial set that should contain any de novo mutations in the affected individual.
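The set logic of this step can be sketched in a few lines of Python. The sketch uses in-memory sets rather than the sorted files described in the paper, and `proband_unique` is a hypothetical name.

```python
def proband_unique(proband, mother, father):
    """Return k-mers present in the proband but absent from both parents."""
    missing_from_mother = proband - mother
    missing_from_father = proband - father
    # Intersecting the two differences leaves k-mers absent from both parents.
    return missing_from_mother & missing_from_father

# Toy k-mer sets: only "TTT" is present in the proband and in neither parent.
candidates = proband_unique(
    proband={"AAA", "CCC", "GGG", "TTT"},
    mother={"AAA", "CCC"},
    father={"AAA", "GGG"},
)
```

The result is the same as subtracting the union of the parental k-mers from the proband set; computing it as two differences followed by an intersection mirrors the order described in the text.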
Step 5: At this point, DIAMUND usually has reduced the initial set of k-mers over 10,000-fold, leaving between 2,000 and 65,000 k-mers (Table 1). For the fifth step, we collect the reads containing these k-mers. This requires us to align the k-mers back to the original reads, because the Jellyfish k-mer counter does not keep track of the source of each k-mer. DIAMUND can use either of two efficient alignment systems for this step: MUMmer [Delcher et al., 1999; Kurtz et al., 2004], a suffix tree-based algorithm that rapidly finds exact matches; or Kraken [Wood and Salzberg, 2013], a fast sequence classifier that we modified to provide the output needed by our system. Kraken is the default choice because it is significantly faster. In our experiments, the number of reads identified in this step ranged from 4,400 to 148,000 (Table 1).
Step 6: Despite every effort to screen reads for contamination, some small fragments of vector sequences often still remain in the reads. If these vectors happen to contaminate only the proband (or affected) data set, they will appear to be novel mutations. We eliminate these by comparing the reads identified in the previous step to the UniVec database (www.ncbi.nlm.nih.gov/tools/vecscreen/univec) using the vecscreen program, and removing any reads with vector sequence. Note that running vecscreen on the original data would be extremely demanding computationally, but because the number of reads at this step has been reduced approximately 1,000-fold, it is relatively fast.
Panagopoulos et al. PLoS ONE 2014 Volume 9 (6) e99439
The “Grep” Command But Not FusionMap, FusionFinder or ChimeraScan Captures the CIC-DUX4 Fusion Gene from Whole Transcriptome Sequencing Data on a Small Round Cell Tumor with t(4;19)(q35;q13)
Three fusion-finder programs, FusionMap, FusionFinder, and ChimeraScan, generated a plethora of fusion transcripts but not the biologically important and cancer-specific fusion gene, the CIC-DUX4 chimeric transcript. It was necessary to use the “grep” command-line utility to sift out the latter from the many data produced by the automated algorithms. Cytogenetic, FISH, and clinico-pathologic tumor features hinted at the presence of the said fusion, but it was eventually found only after the manual “grep” function had been used.
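The grep approach amounts to an exact substring search of the reads for a sequence expected to span the fusion junction. A minimal Python sketch of that idea follows; it also checks the reverse complement, which is an addition not described on the slide. The query and reads are invented for illustration and are not the CIC-DUX4 junction.

```python
def revcomp(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]

def grep_reads(reads, query):
    """Return reads containing the query or its reverse complement."""
    reverse = revcomp(query)
    return [read for read in reads if query in read or reverse in read]

# Hypothetical junction-spanning query and three toy reads.
query = "ACCGGA"
reads = ["TTACCGGATT", "GGTCCGGTAA", "AAAAAAAAAA"]
hits = grep_reads(reads, query)
```

An exact search of this kind tolerates no mismatches, so in practice a short query spanning the expected breakpoint is more forgiving than a long one.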
Simple is good
2. For each maximal repetition Y, identify the minimum unit U
such that U is not a repetition and Y is a concatenation of
multiple occurrences of U and a prefix of U. For example,
when Y = (CAG)6CA, U = CAG.
3. An approximate repetition is a substring such that its
alignment with repetition (U)m is decomposed into series of
exact matches of length |U| or more, and neighboring series
must have only one mismatch, one insertion, or one deletion
between them in the alignment, where |U| indicates the length
of U. We calculate an approximate repetition by extending a
maximal (exact) repetition in both directions in a greedy
manner. For example, given
CGCCCGCAGCGCAT(CAG)6CATCAGGGA,
we can extend repetition (CAG)6CA to the underlined
substring,
CGCCCGCAGC-GCAT(CAG)6CATCAGGGA,
where bold letters represent mismatches and “-” indicates a
deletion. In this way, we retrieve an approximate STR that is
not necessarily an exact repeat of the minimum unit U, but
may contain mismatches and indels.
4. A read may contain multiple overlapping STRs with the
same unit. If two overlap, eliminate the shorter one. If both
are of the same length, select one arbitrarily.
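The minimum unit in step 2 can be illustrated by finding the smallest period of the repetition. The Python sketch below covers only the condition that Y is a concatenation of copies of U followed by a prefix of U; it is a simple quadratic check written for clarity, not the authors' algorithm, and `minimum_unit` is a hypothetical name.

```python
def minimum_unit(y):
    """Return the shortest prefix U of y such that y consists of
    repeated copies of U followed by a (possibly empty) prefix of U."""
    n = len(y)
    for period in range(1, n + 1):
        # y has this period if every base equals the base one period earlier.
        if all(y[i] == y[i - period] for i in range(period, n)):
            return y[:period]
    return y  # only reached for the empty string

# The example from the text: Y = (CAG)6CA.
unit = minimum_unit("CAG" * 6 + "CA")
```

A sequence with no shorter period is returned unchanged, so the function can also be used to test whether a candidate substring is a repetition at all.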
The algorithm is able to process ten million reads of length 100
bases in ~1700 s on a Xeon X5690 with a clock rate of 3.47-GHz
(Supplementary Fig. S1). As the computational time is
proportional to the number of reads, ~47 hours is required to
process 1 billion 100-bp reads, confirming the practicality of the
method for processing real human resequencing data.
Fig. 1. Sensing and locating short tandem repeats (STRs) in short reads. (A) An original short read. (B) An approximate STR (AGAGGC)n (n=6) in the
short read. The central four copies of AGAGGC are an exact STR with no mutations, while the flanking copies contain the mutations shown in bold letters.
If one of the regions (black) surrounding the STR aligns in a unique position, the STR can be located in the genome. (C) A read occupied by an approximate
STR. (D) Sensing STRs from frequency distributions of (AGAGCC)n in NA12877 (father of the HapMap CEU trio), NA12878 (mother), and NA18507 (an
African male). The x-axis is the lengths of STR occurrences detected in a read, and the y-axis is the frequency of reads containing STR occurrences of the
length indicated on the x-axis. Note that 100-bp-long STR occurrences are frequent in NA12877, while no STR occurrences of length >70 bp are observed
in samples NA12878 and NA18507. (E) When a read is filled with an STR (red), we attempt to anchor the other end read (blue) to a unique position
unambiguously. (F, G) An STR is located easily if its location can be sandwiched using information on paired-end reads. The length of an STR of length <
100 bp is easily estimated (F), while determining the length of a much longer STR is nontrivial (G). We need to use third-generation sequencers, such as
PacBio RS, with the capability of reading DNA fragments having a length of thousands of bases.
Rapid detection of expanded short tandem repeats in personal genomics using hybrid sequencing
Koichiro Doi, Taku Monjo, Pham H. Hoang, Jun Yoshimura, Hideaki Yurino, Jun Mitsui, Hiroyuki Ishiura, Yuji Takahashi, Yaeko Ichikawa, Jun Goto, Shoji Tsuji and Shinichi Morishita
University of Tokyo
Standards
Human Variome Project
Prototype NGS database
Report
Sharing Experience with TruSight One
• In partnership with Illumina, RCPA and the HGSA, Kim Flintoff (Wellington Regional Genetics Laboratory) is leading an evaluation of exon sequencing using Illumina’s TruSight One panel. Two Coriell family trios will be sequenced by New Zealand Genomics Limited and the data will be shared on an HVPA database.
• The VCF file will be available on the HVPA LOVD database and performance stats will also be made available.
Next Steps
• Robust standards for genomic medicine
• Databases and data content
– Access to identified and de-identified data (consent and confidentiality)
– Database accreditation process in prep with RCPA
– Defining the performance of various aligners, variant callers and annotation programs
– Clinical grade Variant Call Format (VCF)
– Metafile covering data trail: what was tested, what was not tested
Data quality classes
Differentiate between three classes of data:
• The Clinically Reported data label would denote the class of data that the HVP Australian Node was originally designed to collect and share: data that has been generated in a NATA accredited Australian diagnostic laboratory and is able to be included in a clinical report.
• Unreported Clinical quality data would denote data that has been generated in a NATA accredited diagnostic laboratory, but is not capable of being included in a clinical report. This class would consist primarily of next-generation sequencing (NGS) type data.
• Unaccredited data would be used to denote data that has been generated by an Australian laboratory that has not been NATA accredited.
A new filtering option would be made available to allow users to view only data of a certain class.
Standards for Accreditation of DNA Sequence Variation Databases

Quality Use of Pathology Program (QUPP), a national project for the Development of Standards for Accreditation of DNA Sequence Variation Databases, has been jointly initiated by the Royal College of Pathologists of Australasia (RCPA) and the Human Variome Project (HVP).
Background
• There is a rapidly increasing volume, spectrum, and complexity of genetic tests emerging within diagnostic pathology laboratories. In particular, high throughput sequencing methods such as targeted panel, exome (WES), and whole genome sequencing (WGS) are producing an increasing quantity of genetic data requiring analysis and interpretation, forming a substantial proportion of the workload.
• Currently, there is a plethora of online mutation databases to refer to; however, there is a distinct lack of databases that meet the stringent accuracy and reproducibility that the clinical diagnostic environment demands. Additionally, the current databases are “fractured”, with varied access and sharing of the data within, and variable quality due to errors or inaccurate data posting, all of which is a clear risk to the quality of patient care. With more widespread, secure sharing of variants and associated phenotypes, the value of cumulative variant information will accelerate the delivery of accurate, actionable, and efficient clinical reports.
• There are currently no standards or equivalent mechanisms for accreditation of databases to ensure the accuracy and quality of uploaded data into any central repository to meet the needs of the clinical diagnostics environment.
Pathogenicity
1. “Deleterious- and Disease-Allele Prevalence in Healthy Individuals: Insights from Current Predictions, Mutation Databases, and Population-Scale Resequencing” Yali Xue, Yuan Chen, Qasim Ayub, Ni Huang, Edward V. Ball, Matthew Mort, Andrew D. Phillips, Katy Shaw, Peter D. Stenson, David N. Cooper, Chris Tyler-Smith, and the 1000 Genomes Project Consortium Am J Hum Genet 91, 1022–1032 2012
2. “Amino Acid Changes in Disease-Associated Variants Differ Radically from Variants Observed in the 1000 Genomes Project Dataset” Tjaart A. P. de Beer*, Roman A. Laskowski, Sarah L. Parks, Botond Sipos, Nick Goldman, Janet M. Thornton PLOS Comp Biol, 9 1-15 2013
3. “Large Numbers of Genetic Variants Considered to be Pathogenic are Common in Asymptomatic Individuals” Christopher A. Cassa, Mark Y. Tong, and Daniel M. Jordan HuMu 34. 9 1216–1220, 2013
4. “Integrated sequence analysis pipeline provides one-stop solution for identifying disease-causing mutations” Hao Hu, Thomas F Wienker, Luciana Musante, Vera MM Kalscheuer, Peter N Robinson, H Hilger Ropers HuMu under review
Table 1. List of selected CNV detection methods.
Duan J, Zhang J-G, Deng H-W, Wang Y-P (2013) Comparative Studies of Copy Number Variation Detection Methods for Next-Generation Sequencing Technologies. PLoS ONE 8(3): e59128. doi:10.1371/journal.pone.0059128 http://www.plosone.org/article/info:doi/10.1371/journal.pone.0059128
Summary
• Current sequencing technology has plenty of room for improvement with respect to read length and accuracy
• Many informatics challenges relate to managing poor quality data or technological limitations and will go away with longer, more accurate reads
• Annotation, data sharing and integrating variant data with clinical and phenotypic data are the high value healthcare deliverables
Acknowledgments
• Genomic Medicine & Centre for Translational Pathology, University of Melbourne: Arthur Hsu, Olga Kondrashova, Sebastian Lunke, Clare Love, Renate Marquis-Nicholson, Kym Pham, Paul Waring
• Human Variome Project: Tim Smith, Alan Lo, Dick Cotton
• Melbourne Genomics Health Alliance: Clara Gaff, Kathryn North, Doug Hilton, Stephen Smith
Targeted Tumour Sequencing:
© 2014 Illumina, Inc. All rights reserved.
BWA Enrichment, Version 1.0.0.1
Enrichment Sequencing Report
Page 2
Sample Information
Sample ID: TL140380
Sample Name: TL140380
Total PF Reads: 77,538,750
Percent Q30: 78.6%
Adapters Trimmed: Yes
Median Read Length: 151 bp
Enrichment Summary
Target Manifest: TruSight One v1.0
Total Length of Targeted Reference: 11,946,514 bp
Padding Size: 150 bp
Note: All enrichment values are calculated without padding (sequence immediately upstream and downstream) unless otherwise stated.
Read Level Enrichment
Total Aligned Reads: 75,690,682
Percent Aligned Reads: 97.6%
Targeted Aligned Reads: 49,355,753
Read Enrichment: 65.2%
Padded Target Aligned Reads: 57,751,970
Padded Read Enrichment: 76.3%
Base Level Enrichment
Total Aligned Bases: 10,449,595,970
Targeted Aligned Bases: 4,976,953,404
Base Enrichment: 47.6%
Padded Target Aligned Bases: 7,636,013,223
Padded Base Enrichment: 73.1%
Coverage Summary
Mean Region Coverage Depth: 416.6X
Uniformity of Coverage (Pct > 0.2*mean): 96.1%
Target Coverage at 1X: 99.4%
Target Coverage at 10X: 99.1%
Target Coverage at 20X: 98.9%
Target Coverage at 50X: 97.9%
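The report does not state how its percentages are derived, but the figures for sample TL140380 are consistent with simple ratios of the reported counts. The Python sketch below checks that reading; the formulas are an inference from the numbers, not Illumina's documented definitions, and `pct` is a hypothetical helper.

```python
def pct(numerator, denominator):
    """Return a ratio as a percentage rounded to one decimal place."""
    return round(100.0 * numerator / denominator, 1)

# Counts copied from the TL140380 report.
total_pf_reads = 77_538_750
total_aligned_reads = 75_690_682
targeted_aligned_reads = 49_355_753
padded_target_aligned_reads = 57_751_970
total_aligned_bases = 10_449_595_970
targeted_aligned_bases = 4_976_953_404
padded_target_aligned_bases = 7_636_013_223
target_length_bp = 11_946_514

percent_aligned = pct(total_aligned_reads, total_pf_reads)
read_enrichment = pct(targeted_aligned_reads, total_aligned_reads)
padded_read_enrichment = pct(padded_target_aligned_reads, total_aligned_reads)
base_enrichment = pct(targeted_aligned_bases, total_aligned_bases)
padded_base_enrichment = pct(padded_target_aligned_bases, total_aligned_bases)

# Assumed definition: mean depth = targeted aligned bases / target length.
mean_depth = round(targeted_aligned_bases / target_length_bp, 1)
```

On this reading, roughly half of the aligned bases fall outside the unpadded target, which is worth keeping in mind when comparing enrichment figures between panels.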
Constitutional Frozen
Small Variants Summary
SNVs Insertions Deletions
Total Passing 8,113 192 230
Percent Found in dbSNP 98.8% 87.5% 74.3%
Het/Hom Ratio 1.7 1.8 2.5
Ts/Tv Ratio 3.1 - -
Variants by Sequence Context
SNVs Insertions Deletions
Number in Genes 8,206 187 225
Number in Exons 6,927 80 107
Number in Coding Regions 6,587 50 64
Number in UTR Regions 340 30 43
Number in Splice Site Regions 742 54 69
Genes include exons, introns and UTR regions. Exons include coding and UTR regions. UTR regions include 5' and 3' UTR regions. Splice site regions include regions annotated as splice acceptor, splice donor, splice site or splice region.
Variants by Consequence
SNVs Insertions Deletions
Frameshift - 20 23
Non-synonymous 2,886 30 40
Synonymous 3,676 - -
Stop Gained 19 0 0
Stop Lost 6 0 0
Variation consequences are calculated following the guidelines at http://uswest.ensembl.org/info/genome/variation/predicted_data.html#consequences
Coverage Summary
Mean Region Coverage Depth: 2555.9X
Uniformity of Coverage (Pct > 0.2*mean): 94.6%
Target Coverage at 1X: 99.5%
Target Coverage at 10X: 99.3%
Target Coverage at 20X: 99.3%
Target Coverage at 50X: 99.1%
Small Variants Summary
SNVs Insertions Deletions
Total Passing 8,244 184 255
Percent Found in dbSNP 98.8% 88.6% 70.6%
Het/Hom Ratio 1.7 1.5 2.9
Ts/Tv Ratio 3.0 - -
Variants by Sequence Context
SNVs Insertions Deletions
Number in Genes 8,336 182 250
Number in Exons 7,033 79 101
Number in Coding Regions 6,685 49 59
Number in UTR Regions 348 30 42
Number in Splice Site Regions 762 51 89
Genes include exons, introns and UTR regions. Exons include coding and UTR regions. UTR regions include 5' and 3' UTR regions. Splice site regions include regions annotated as splice acceptor, splice donor, splice site or splice region.
Variants by Consequence
SNVs Insertions Deletions
Frameshift - 22 24
Non-synonymous 2,952 27 34
Synonymous 3,705 - -
Stop Gained 22 0 0
Stop Lost 6 0 0
Variation consequences are calculated following the guidelines at http://uswest.ensembl.org/info/genome/variation/predicted_data.html#consequences
Sample Information
Sample ID: WES001FR1
Sample Name: WES001FR1
Total PF Reads: 454,879,338
Percent Q30: 81.9%
Adapters Trimmed: Yes
Median Read Length: 151 bp
Enrichment Summary
Target Manifest: TruSight One v1.0
Total Length of Targeted Reference: 11,946,514 bp
Padding Size: 150 bp
Note: All enrichment values are calculated without padding (sequence immediately upstream and downstream) unless otherwise stated.
Read Level Enrichment
Total Aligned Reads: 445,404,466
Percent Aligned Reads: 97.9%
Targeted Aligned Reads: 305,571,144
Read Enrichment: 68.6%
Padded Target Aligned Reads: 341,517,122
Padded Read Enrichment: 76.7%
Base Level Enrichment
Total Aligned Bases: 60,040,931,299
Targeted Aligned Bases: 30,533,612,135
Base Enrichment: 50.9%
Padded Target Aligned Bases: 44,910,193,400
Padded Base Enrichment: 74.8%
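The percentages in this report can be reproduced from the raw counts. The sketch below assumes that each enrichment value is the targeted count divided by the total aligned count, that Percent Aligned Reads uses Total PF Reads as its denominator, and that mean depth is targeted aligned bases divided by target length. These formulas are my reading of the labels, not a documented definition, although the results agree with the reported figures.

```python
# Raw counts copied from the report.
total_pf_reads = 454_879_338
total_aligned_reads = 445_404_466
targeted_aligned_reads = 305_571_144
padded_target_aligned_reads = 341_517_122
total_aligned_bases = 60_040_931_299
targeted_aligned_bases = 30_533_612_135
padded_target_aligned_bases = 44_910_193_400
target_length_bp = 11_946_514


def pct(part, whole):
    """Percentage of `whole` made up by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1)


percent_aligned = pct(total_aligned_reads, total_pf_reads)
read_enrichment = pct(targeted_aligned_reads, total_aligned_reads)
padded_read_enrichment = pct(padded_target_aligned_reads, total_aligned_reads)
base_enrichment = pct(targeted_aligned_bases, total_aligned_bases)
padded_base_enrichment = pct(padded_target_aligned_bases, total_aligned_bases)

# Assumed definition: average number of aligned bases per targeted position.
mean_depth = round(targeted_aligned_bases / target_length_bp, 1)

print(percent_aligned, read_enrichment, padded_read_enrichment)
print(base_enrichment, padded_base_enrichment, mean_depth)
```

Under these assumptions the computed mean depth also agrees with the 2555.9X Mean Region Coverage Depth in the Coverage Summary.
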
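Called Variants
[Figure: Venn diagram of called variants in the FFPE, Frozen and Blood samples. Region counts as extracted (their assignment to regions is not recoverable): 16,711; 2,095; 154; 979; 3,641; 9,368; 2,455]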
Gene List
techniques allow for the rapid detection of EGFR mutations with high sensitivity and specificity. However, confirmation of mutations via direct sequencing is still necessary.27,76,77 Though not of any current clinical use, an assay that provides a rapid assessment of EGFR mutation status in as little as 30 min using a ‘smart amplification process’ has been described. These may one day provide greatly improved turnaround times for this analysis.78
Formalin-fixed and paraffin-embedded tissue is perfectly suitable for fluorescence in situ hybridization (FISH) and DNA-based tests, but tissue preservation is critical for a successful test. Decalcified and ethanol-fixed tissue, as well as tissues containing abundant necrosis, should be avoided.
The ability to detect multiple driver mutations in lung adenocarcinoma has revolutionized the medical management of this disease and multiplexed testing for all common driver mutations will provide physicians with a more precise guide for therapy.9
Recently, Kris et al79 identified 10 driver mutations in tumor samples from 1000 lung adenocarcinoma patients enrolled in the National Cancer Institute Lung Cancer Mutation Consortium. The mutations, involving KRAS, EGFR, ERBB2 (HER2), BRAF, PIK3CA, AKT1, MAP2K1, and NRAS, were screened using standard multiplexed assays and FISH. Driver mutations were detected in 60% of tumors. The incidences of mutations were as follows: KRAS 25%, EGFR 23%, ALK rearrangements 6%, BRAF 3%, PIK3CA 3%, MET amplifications 2%, ERBB2 1%, MAP2K1 0.4%, NRAS 0.2%, and AKT1 0% (Figure 3).12,67–71 It is noteworthy that 95% of molecular lesions were mutually exclusive.79
EGFR mutations are responsible for the constitutive activation of the tyrosine kinase receptor. These mutations are also most frequently associated with either sensitivity or resistance to EGFR TKIs (Figure 2).6,80–84 The response-associated mutations are linked with response rates of >70% in patients treated with either erlotinib or gefitinib.85,86 However, up to 25% of patients with TKI resistance-associated mutations will also respond to the therapy.67 Pao et al7 analyzed EGFR mutation of exons 18–24 in tumors from 10 gefitinib-responsive and from 7 erlotinib-responsive patients. The results demonstrated that EGFR mutations were present in 7 of 10 (70%) gefitinib-responsive and in 5 of 7 (71%) erlotinib-responsive tumors.
EGFR genotype was more useful than clinical characteristics for selection of appropriate patients for consideration of first-line therapy with an EGFR TKI.85 EGFR mutations are generally associated with sensitivity to TKI therapy.71,87 Both retrospective and prospective studies have demonstrated that lung adenocarcinoma patients carrying such an EGFR mutation and who were treated with TKIs had significantly higher response rates and longer progression-free survival than patients without an EGFR mutation,5–7,25,29,71,83,85,87,88 with some patients experiencing rapid, complete, or partial responses that were persistent.55 Jackman et al85 studied 223 chemotherapy-naïve patients with advanced lung cancer of non-small cell type, among which 86% were adenocarcinomas. Sensitizing EGFR mutations were found in 84 carcinomas, 89% of which were adenocarcinomas. The mutations were associated with a 67% response rate, with a time to progression of 11.8 months, and overall survival of 23.9 months.85 Exon 19 deletions were associated with a relatively longer median time to progression and overall survival compared with L858R (exon 21) mutations. Wild-type EGFR was found in 139 patients (62%), and this finding was associated with poor outcomes (response rate, 3%; time to progression, 3.2 months), irrespective of KRAS status.
EGFRvIII Mutation
EGFR variant III (EGFRvIII), a mutation resulting from an in-frame deletion of exons 2–7 of the coding sequence (amino acids 6–273), has been associated with a subset of squamous cell lung cancers.89–91 A number of functional differences between EGFRvIII and EGFR have been characterized.90,91 EGFRvIII has been identified in an array of human solid tumors, including glioblastoma, breast cancer, ovarian cancer, prostate cancer, and lung cancer. Although EGFRvIII fails to bind EGF, its intracellular tyrosine
Figure 3 Frequency of major driver mutations in signaling molecules in lung adenocarcinomas. About 64% of all adenocarcinoma cases harbor somatic driver mutations. According to the National Cancer Institute Lung Cancer Mutation Consortium data,79 ~23% of lung adenocarcinomas harbor EGFR mutations. The EGFR mutation status of the cancer is associated with its responsiveness or resistance to EGFR TKI therapy. KRAS mutations are more frequently found in adenocarcinomas (25%), which are mutually exclusive with EGFR mutations. Mutations in KRAS have been proposed as one of the mechanisms of primary resistance to gefitinib and erlotinib therapy. A subset of adenocarcinoma cases harbors a transforming fusion gene, EML4–ALK (6%), which mainly involves adenocarcinoma from non-smokers with wild-type EGFR and KRAS mutations. The mutation frequency of BRAF is 3%, PIK3CA 3%, MET amplifications 2%, ERBB2 (Her2/neu) 1%, MAP2K1 0.4%, and NRAS 0.2%. Each of the molecular alterations has a role in the signal pathways, activating important cell functions, including cell proliferation and survival. Approximately 36.4% of lung adenocarcinomas do not harbor currently detectable mutations.
Source: L Cheng et al, Molecular pathology of lung cancer, Modern Pathology (2012) 25, 347–369 (page 350)
Filtering Variants
All variants
Filter    None    Qual    Not in Blood
Blood     9828    8551    NA
Frozen    9920    8736    126
FFPE      9709    8163    199
Variants in Gene List
Filter    None    Qual    Not in Blood
Blood     27      18      NA
Frozen    27      23      2 (EGFR)
FFPE      25      19      3 (EGFR, ROS)
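The two filtering steps in these tables (a quality filter, then removal of variants also seen in the matched blood sample) can be sketched as below. This is an illustration under stated assumptions, not the pipeline used for the slides: the `Variant` fields, the `min_qual` threshold, the choice to subtract all blood calls regardless of their quality, and the example coordinates are all hypothetical.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str
    qual: float


def site_key(v):
    """Identity of a variant, ignoring its quality score."""
    return (v.chrom, v.pos, v.ref, v.alt)


def tumor_only(tumor, blood, min_qual):
    """Keep tumor variants that pass quality and are absent from blood."""
    # Design choice (assumption): subtract every blood call, even low-quality
    # ones, so that a weakly supported germline variant is not reported.
    blood_keys = {site_key(v) for v in blood}
    passing = [v for v in tumor if v.qual >= min_qual]
    return [v for v in passing if site_key(v) not in blood_keys]


# Invented example data; the coordinates do not refer to real variants.
blood_calls = [Variant("chrA", 100, "A", "G", 200.0)]
tumor_calls = [
    Variant("chrA", 100, "A", "G", 180.0),  # also in blood: germline
    Variant("chrA", 250, "T", "G", 150.0),  # tumor only, passes quality
    Variant("chrA", 300, "C", "T", 5.0),    # tumor only, fails quality
]
candidates = tumor_only(tumor_calls, blood_calls, min_qual=30.0)
print(candidates)
```
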
EGFR p.L858R
EGFR p.T790M
Confirmation by PCR
[Figure: bar chart "EGFR normalised %", axis 0 to 250. Assays on EGFR NM_005228.3, as labelled:]
T790 WT
784: c.2350T>C, p.S784P
784: c.2351C>T, p.S784F
785: c.2354C>T, p.T785I
786: c.2356G>A, p.V786M
790: c.2368A>G, p.T790A
790: c.2369C>T, p.T790M
828: wt
858: c.2572C>A, p.L858M
858: c.2573_2574delinsGT,
858: c.2573T>A, p.L858Q
858: c.2573T>G, p.L858R
860: c.2579A>T, p.K860I
861: c.2582T>A, p.L861Q
861: c.2582T>G, p.L861R
[Figure: bar chart "KRAS normalised %", axis 0 to 2.0. Assays on KRAS NM_033360.2, as labelled:]
12: c.34G>A, p.G12S
12: c.34G>C, p.G12R
12: c.34G>T, p.G12C
12: c.35G>A, p.G12D
12: c.35G>C, p.G12A
12: c.35G>T, p.G12V
13: c.37G>A, p.G13S
13: c.37G>C, p.G13R
13: c.37G>T, p.G13C
13: c.38G>A, p.G13D
13: c.38G>C, p.G13A
13: c.38G>T, p.G13V
Gene list summary
gene     locus                            covered by capture panel?   Observations           Notes
AKT1     chr14:105,235,761-105,262,116    Y                           wt                     activating point
ALK      chr2:29,415,410-30,146,821       Y                           wt                     rearrangements and secondary resistance mutations
BRAF     chr7:140,433,813-140,624,564     Y                           wt                     activating point
EGFR     chr7:55,248,979-55,259,567       Y                           L858R; T790M           activating points, indels, resistance point
HER2     chr17:37,844,393-37,884,915      Y                           wt                     amplification, activating point
KRAS     chr12:25,386,768-25,403,863      Y                           ?                      activating point
MAP2K1   chr15:66,679,211-66,783,882      Y                           changed allele ratio
MET      chr7:116,312,459-116,409,963     Y                           changed allele ratio   mutation and amplification
NRAS     chr1:115,247,085-115,259,515     Y                           wt                     activating point
PIK3CA   chr3:178,866,311-178,952,497     Y                           wt                     activating point
ROS1     chr6:117,609,530-117,747,018     Y                           wt                     rearrangements