Next Generation DNA Sequencing - Forensic Science at Penn State
PENN STATE FORENSIC SCIENCE
The Role of Next Generation DNA Sequencing in Forensic mtDNA Analysis
Mitchell M. Holland, Ph.D.
Associate Professor, Biochemistry & Molecular Biology
Director, Forensic Science Program
Pennsylvania State University, University Park, PA
PowerPoint will be posted at www.forensics.psu.edu
Annual Meeting, 8 November 2012
History of Forensic mtDNA Analysis
• First applied in the late 1980s to reunite grandchildren with grandparents in Argentina
Orrego C, Wilson AC, King MC. 1988. Identification of maternally-related individuals by amplification and direct sequencing of a highly polymorphic, noncoding region of mitochondrial DNA. Amer J Human Genet 43:A219
• AFDIL began using mtDNA analysis to identify military service members in 1991
• The FBI started using mtDNA analysis in casework in 1996
Holland MM, Fisher DL, Mitchell LG, Rodriguez WC, Canik JJ, Merril CR, Weedn VW. 1993. Mitochondrial DNA sequence analysis of human skeletal remains: identification of remains from the Vietnam War. J Forensic Sci 38:542
Mitochondrial DNA: State of Tennessee v. Paul Ware, by C. Leland Davis, ADA, District Attorney’s Office, Chattanooga, TN
History of Forensic mtDNA Analysis
• Romanov Case 1994-2009
[Figure: Romanov case publications: Nature Genetics 1994 (FSS), Nature Genetics 1996 (AFDIL), 2009 (AFDIL)]
History of Forensic mtDNA Analysis
• Vietnam Unknown Soldier 1998
History of DNA Sequencing
“In the early 1970’s one person would struggle to complete 100 bases of sequence in one year. Then two very similar techniques were developed, one by Allan Maxam and Walter Gilbert in the United States and the other by Frederick Sanger and his coworkers that made it possible for one person to sequence thousands of base pairs in a year.”
Mapping the Human Genome: DNA Sequencing, Los Alamos Science 1992
History of DNA Sequencing
“Between 1975 and the present (1992), the number of base pairs of published sequence data grew from roughly 25,000 to almost 100 million. During that time longer and longer contiguous stretches of DNA have been sequenced. In 1991 the longest sequence to be completed was that of the cytomegalovirus genome, which is 229,354 base pairs.”
Mapping the Human Genome: DNA Sequencing, Los Alamos Science 1992
History of DNA Sequencing
“By 1992 a cooperative effort in Europe had sequenced an entire chromosome of yeast, chromosome III, which is 315,357 base pairs. And now efforts are underway to sequence million-base stretches of DNA. Accomplishing such large-scale sequencing projects is among the goals for the first five years of the Genome Project.”
Mapping the Human Genome: DNA Sequencing, Los Alamos Science 1992
History of DNA Sequencing
“In 1990, when the plans for the Genome Project were being made, the estimated cost of sequencing was $2 to $5 per base.”
Mapping the Human Genome: DNA Sequencing, Los Alamos Science 1992
Translating into ~$6-15 billion to sequence the first human genome … the actual cost was around $1-3 billion
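The $6-15 billion range follows from multiplying the quoted per-base cost by the size of the genome. A minimal sketch of that arithmetic, assuming a round figure of 3 billion bases for one human genome (the genome size is not stated on the slide):

```python
# Assumption: one human genome is taken as roughly 3 billion bases.
GENOME_SIZE_BASES = 3_000_000_000


def genome_cost_usd(cost_per_base_usd, genome_size=GENOME_SIZE_BASES):
    """Total cost of sequencing every base once at a flat per-base price."""
    return cost_per_base_usd * genome_size


low = genome_cost_usd(2)   # $2 per base (1990 estimate, lower bound)
high = genome_cost_usd(5)  # $5 per base (1990 estimate, upper bound)
print(f"${low / 1e9:.0f}-{high / 1e9:.0f} billion")
```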
History of DNA Sequencing
Technology Advancements
• S35-labeled DNA fragments run through a polyacrylamide gel and exposed to x-ray film
• Fluorescently labeled DNA fragments run through a CE and detected by a CCD camera
Future of DNA Sequencing
Next Generation Sequencing = NGS
Second Generation Sequencing = SGS
Massive Parallel Sequencing = MPS
Deep Sequencing = DS
Scientific American
Key Questions
What are the issues that need to be addressed when introducing NGS in forensic DNA labs?
• Legal Challenges
• Questioned Samples v. Reference Samples
• Target Loci
• Technology Transfer
• Bioinformatics
SNPs
• Morphological SNP Markers
• Geoprofiling, eye color, skin pigmentation, etc.
• Kinship SNP Markers
STRs
[Figure: STR sequencing publications and presentations from 2011 and 2012, including Promega 2012]
STRs
Mixture Deconvolution
Same “Issues” as Normal CE Analysis
Forensic Applications
Where do we start??
mtDNA Sequencing
Low-Hanging Fruit
Historical Relevance
RFLP: ~1987 to late 1990s
STRs: ~1991 to present
We DIDN’T go directly from RFLP to STRs
Historical Relevance
RFLP: ~1987 to late 1990s
Fast Track: DQA1/PM, ~1990 to late 1990s
Slower Track: STRs, ~1991 to present
Historical Relevance
DQA1/PM testing provided an easier path for the admissibility of PCR in courts of law … paving the way for STRs
Can History Repeat Itself??
mtDNA testing may provide a similar path for the admissibility of NGS in courts of law … paving the way again for STRs, and also for SNPs
Searched NGS + mtDNA = >100 journal articles
Penn State Group
Forensic mtDNA NGS
Promega 2009: mtDNA & STRs
Promega 2012
Massive Parallel Sequencing (MPS)
The Players
GS Junior: early 2010
MiSeq: late 2011
Ion PGM: early 2011
Instrument Comparison
Nature Biotechnology 2012
• Highest Throughput = 1.6 Gb/run; Lowest Error Rate
• Moderate Throughput = 100s of Mb/run; Fastest Output = 80-100 Mb/hour
• Lowest Throughput = 70 Mb/run; Longest Reads = 500-600 bp
Experimental Design
• Sample sources were blood and saliva only
• PCR primer and reaction conditions, along with 454 GS Junior procedures, are provided in our CMJ paper from 2011
www.isabs.hr
www.cmj.hr
www.forensics.psu.edu
Our Objectives
• Can we develop a 454 NGS approach for forensic mtDNA analysis? … from both pristine (references) and challenged samples (old bone & hair shafts)
• Can we report out low level control region mtDNA heteroplasmy using the 454 NGS approach? … with the goal of increasing the discrimination potential of the testing results
Our Objectives
• Assuming that we can report out low level control region mtDNA heteroplasmy using the 454 NGS approach …
• … what are the criteria we need to address in order to report out reliable results?
• … what are the important considerations when answering the first question?
Considerations
• Reproducibility
• Concordance
• Threshold Definitions/Interpretation Criteria
• In addition to the normal quality filters available in the analysis software, what other filters are necessary or desirable?
Table 3, Holland et al, CMJ 2011
Evaluated 30 individuals from 25 different mtDNA lineages
Sample | Sanger mtDNA Profile | Sanger Percent of Minor Heteroplasmy & Site | 454 GS Junior mtDNA Profile | 454 GS Junior Percent of Minor Heteroplasmy & Site
F2 | 16069T, 16093C, 16126C, 16261T, 16274A, 16355T | 16311 – 18.4% C | 16069T, 16093C, 16126C, 16261T, 16274A, 16355T | 16093 – 3.71% T; 16261 – 1.29% C; 16311 – 20.14% C
F3 | 16069T, 16126C, 16145A, 16172C, 16261T | Not Detected | 16069T, 16126C, 16145A, 16172C, 16261T | Not Detected
F4 | No polymorphisms | Not Detected | No polymorphisms | Not Detected
F5 | 16129A, 16172C, 16223T, 16311C | Not Detected | 16129A, 16172C, 16223T, 16311C | 16129 – 0.51% G; 16311 – 0.33% T
F7, F12-13, M13-14 | 16192T, 16256T, 16270T | Not Detected | 16192T, 16256T, 16270T | 16192 – 2.64-4.50% C
F8 | 16223T, 16362C | Not Detected | 16223T, 16362C | 16223 – 1.86% C
F9 | 16356C | Not Detected | 16356C | Not Detected
F10 | 16298C | Not Detected | 16298C | 16298 – 0.45% T
F16 | 16126C, 16239T, 16294T, 16296T, 16304C | Not Detected | 16126C, 16239T, 16294T, 16296T, 16304C | Not Detected
F25 | 16343G | Not Detected | 16343G | Not Detected
F26 | 16093C | Not Detected | 16093C | Not Detected
F27 | 16172C, 16278T | Not Detected | 16172C, 16278T | Not Detected
M3 | 16355T | Not Detected | 16355T | Not Detected
M4 | 16111T | Not Detected | 16111T | 16111 – 0.52% C
0.33% or 1/300
Sanger versus 454 GS Junior Heteroplasmy Detection
Figure 2, Holland et al, CMJ 2011
[Figure: Sanger and 454 panels showing 3.71% C/T heteroplasmy, 1.29% T/C heteroplasmy and 20.14% C/T heteroplasmy]
Sample | Sanger mtDNA Profile | Sanger Percent of Minor Heteroplasmy & Site | 454 GS Junior mtDNA Profile | 454 GS Junior Percent of Minor Heteroplasmy & Site
M5 | 16114A, 16129A, 16192T, 16213A, 16223T, 16278T, 16355T, 16362C | Not Detected | 16114A, 16129A, 16192T, 16213A, 16223T, 16278T, 16355T, 16362C | 16192 – 3.18% C
M7 | 16129A, 16223T, 16264T | Not Detected | 16129A, 16223T, 16264T | Not Detected
M8 | 16224C, 16311C | Not Detected | 16224C, 16311C | Not Detected
M9 | 16301T, 16343G, 16356C | Not Detected | 16301T, 16343G, 16356C | Not Detected
M10 | 16304C | Not Detected | 16304C | 16209 – 2.62% C; 16222 – 2.30% T; 16304 – 2.99% T
M11 | 16129A, 16223T | Not Detected | 16129A, 16223T | Not Detected
M12 | 16069T, 16126C | Not Detected | 16069T, 16126C | 16126 – 1.14% T
M15 | 16093C, 16224C, 16311C | Not Detected | 16093C, 16224C, 16311C | 16093 – 3.04% T
M17 | 16126C, 16294T, 16296T | Not Detected | 16126C, 16294T, 16296T | Not Detected
M18 | 16278T, 16304C, 16311C | Not Detected | 16278T, 16304C, 16311C | 16128 – 0.52% T; 16278 – 0.77% C; 16293 – 0.77% G; 16304 – 1.00% T
M19, F22 | 16069T, 16126C, 16222T | Not Detected | 16069T, 16126C, 16222T | Not Detected
11/25 (44%) of the lineages showed some level of heteroplasmy, with 19 sites of heteroplasmy observed across the 11 lineages
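The 11/25 and 19-site totals can be re-derived from the 454 GS Junior column of Table 3. A minimal sketch, where the per-lineage counts are transcribed by hand from the table rows above and the dictionary name is only illustrative:

```python
# Number of heteroplasmic sites reported by 454 for each lineage
# (hand-transcribed from the Table 3 rows; 0 means "Not Detected").
het_sites_454 = {
    "F2": 3, "F3": 0, "F4": 0, "F5": 2, "F7/F12-13/M13-14": 1,
    "F8": 1, "F9": 0, "F10": 1, "F16": 0, "F25": 0,
    "F26": 0, "F27": 0, "M3": 0, "M4": 1, "M5": 1,
    "M7": 0, "M8": 0, "M9": 0, "M10": 3, "M11": 0,
    "M12": 1, "M15": 1, "M17": 0, "M18": 4, "M19/F22": 0,
}

lineages = len(het_sites_454)
lineages_with_het = sum(1 for n in het_sites_454.values() if n > 0)
total_sites = sum(het_sites_454.values())
percent = 100 * lineages_with_het / lineages

print(f"{lineages_with_het}/{lineages} lineages ({percent:.0f}%), "
      f"{total_sites} heteroplasmic sites")
```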
Issues to Consider
• Are the observed heteroplasmic positions consistent with past studies on other samples?
• Yes – for example, Parsons et al, Nat Genet 1997 & Tully et al, Am J Hum Genet 2000
• Are the sequence changes real, or are they simply artifacts/errors of the PCR and/or sequencing process?
Addressing the Issue, Circa 2011
• Are the sequence changes real, or are they simply artifacts/errors of the PCR and/or sequencing process?
• First, each of the reported heteroplasmic sequences had coverage rates of at least 40 reads, with the vast majority having more than 100 reads
• EXAMPLE: if a variant was observed in 0.5% of the sequences (1/200), there were at least 40 reads of the heteroplasmic sequence in 8,000 total reads
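The 8,000-read figure in the example is the 40-read minimum scaled by the variant frequency. A minimal sketch of that relationship, using a "1 in N reads" frequency so the arithmetic stays in whole numbers (the function name is illustrative):

```python
MIN_VARIANT_READS = 40  # minimum reads of the heteroplasmic sequence (from the slide)


def min_total_reads(one_in, min_variant_reads=MIN_VARIANT_READS):
    """Total reads needed so a variant seen in 1 of every `one_in` reads
    is supported by at least `min_variant_reads` reads."""
    return min_variant_reads * one_in


total_at_half_percent = min_total_reads(200)   # 0.5% = 1/200
total_at_third_percent = min_total_reads(300)  # 0.33% is roughly 1/300
print(total_at_half_percent, total_at_third_percent)
```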
Addressing the Issue, Circa 2011
• Are the sequence changes real, or are they simply artifacts/errors of the PCR and/or sequencing process?
• Second, a subset of samples was run in triplicate or duplicate
• EXAMPLE: F5 (0.51% 16129 G, 0.33% 16311 T)
• 16129: Exp 1 = 0.51%, Exp 2 = 1.06%, Exp 3 = 0.36%
• 16311: Exp 1 = 0.33%, Exp 2 = 1.09%, Exp 3 = 1.82%
Interpretation Criteria, Circa 2011
• “Rules” for accepting heteroplasmy data
• Is the coverage rate for the variant site at or above 40 reads?
• Is the ratio of forward-to-reverse reads consistent with the total forward-to-reverse read-ratio?
• Is the variant site observed in other data associated with the sequencing run, and is it a plausible site?
Interpretation Criteria, Circa 2011
• “Rules” for accepting heteroplasmy data
• Bottom Line
• If one or more of the “rules” is broken, we did not report the observation of the heteroplasmic variant
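The "all rules must hold" logic can be sketched as a simple check. This is only an illustration under stated assumptions: the 40-read minimum comes from the slides, but the strand-ratio tolerance of 0.2, the field names, and the example counts are hypothetical, and the third rule is represented as a yes/no answer supplied by the analyst rather than computed.

```python
from dataclasses import dataclass

MIN_VARIANT_READS = 40   # stated on the slide
STRAND_TOLERANCE = 0.2   # assumption: allowed gap between forward-read fractions


@dataclass
class Candidate:
    variant_forward: int      # forward reads carrying the variant
    variant_reverse: int      # reverse reads carrying the variant
    total_forward: int        # all forward reads covering the site
    total_reverse: int        # all reverse reads covering the site
    passes_site_review: bool  # analyst's answer to the third rule


def forward_fraction(forward: int, reverse: int) -> float:
    total = forward + reverse
    return forward / total if total else 0.0


def report_variant(c: Candidate) -> bool:
    """Report only if no rule is broken."""
    enough_reads = (c.variant_forward + c.variant_reverse) >= MIN_VARIANT_READS
    gap = abs(forward_fraction(c.variant_forward, c.variant_reverse)
              - forward_fraction(c.total_forward, c.total_reverse))
    strand_consistent = gap <= STRAND_TOLERANCE
    return enough_reads and strand_consistent and c.passes_site_review


# Hypothetical counts for illustration only.
balanced = report_variant(Candidate(98, 103, 4000, 4100, True))
skewed = report_variant(Candidate(47, 5, 4000, 4100, True))
too_few = report_variant(Candidate(10, 12, 4000, 4100, True))
print(balanced, skewed, too_few)
```

A single failed rule is enough to withhold the variant, which mirrors the bottom line on the slide.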
Interpretation Criteria, Circa 2012
• “Expanded Rules” for accepting heteroplasmy data
• Is the observed sequence variant a transition, transversion or INDEL, and what effect might that have on the interpretation process? … especially from a reporting perspective
Interpretation Criteria, Circa 2012
• “Expanded Rules” for accepting heteroplasmy data
• What is the empirical rate of sequencing errors across HV1, especially for INDELs?
• Can thresholds or filters be applied?
• What effect do these answers have on the reporting of low-level mtDNA heteroplasmy?
Defining Errors
Where do errors occur in the NGS process?
PCR, emPCR, PyroSequencing, Analysis
Addressing the Issue of Errors, Circa 2012
Example of a polymorphism identified when the sequence is compared to the rCRS … observed in the vast majority of sequence reads
NextGENe® Software from SoftGenetics, Inc
Addressing the Issue of Errors, Circa 2012
A “random” error in the sequence observed in a small minority of sequence reads … or heteroplasmy if it meets the criteria previously described
PCR, emPCR, PyroSequencing, Analysis
Addressing the Issue of Errors, Circa 2012
INDELs – homopolymer stretches produce the vast majority of the indel errors, especially between nucleotide positions 16160-16200
Addressing the Issue of Errors, Circa 2012
• We looked at >200,000 reads from multiple 454 runs covering ~75,000,000 nucleotides of HV1 mtDNA sequence
• Measured the type and number of errors at each position
• Assessed the data with the goal of establishing a reporting threshold for low-level mtDNA heteroplasmy
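The read and nucleotide totals are mutually consistent with the ~340 bp read length quoted on the substitution error slide. A rough consistency check, treating every read as exactly 340 bp (a simplifying assumption):

```python
TOTAL_NUCLEOTIDES = 75_000_000  # approximate total from the slide
READ_LENGTH_BP = 340            # approximate read length from the error-rate slide

# Number of reads implied if every read were exactly READ_LENGTH_BP long.
implied_reads = TOTAL_NUCLEOTIDES / READ_LENGTH_BP
print(f"about {implied_reads:,.0f} reads")  # should exceed the >200,000 stated
```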
Observations
• MORE READS = MORE ERRORS … however, the errors are
• … typically repeatable and consistent
• … typically observed in the same “poor” ratios in relation to forward/reverse reads
• … and there appears to be a direct linear relationship between the two (more reads = more errors)
Error Tally for Total Reads in a Single Run
Run 1 MID 9
Columns: Site, Perfect, Hetero, A, C, G, T, Ins, Del, Site Pattern
16187 +
16188 11;0 29;3 7673;6 C ccccC
16189 20;0 9;14 0;2 4;37 C 22;0 T cccccT
16190 2;14 5;0 0;17 A 7;0 C cccctCc
16191 2;0 2;0 24;5 tcCtc
16192 C 98;103 3;1 1;0 1;50 T tccTc
16193 1;0 4;1 8;27 T 47;5 C cctCat
16194 +
16195 3;0 0;9 C caTgc
NOTE: A heteroplasmic site can be seen at 16192
Substitution Error Rates
On average, 1 in every 4 reads has a single random substitution error
Read = ~340 bp
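These two numbers imply an approximate per-base substitution rate. A minimal sketch, assuming the errors are spread evenly over all positions of a read (the slides do not say this; it is a simplification for the arithmetic):

```python
READS_PER_ERROR = 4   # one substitution error per 4 reads, on average
READ_LENGTH_BP = 340  # approximate read length

# One error per (4 reads x 340 bases) = one error per 1,360 sequenced bases.
per_base_rate = 1 / (READS_PER_ERROR * READ_LENGTH_BP)
print(f"{per_base_rate:.4%} per base")
```

Under that assumption the average per-base rate sits below the 0.2% reporting threshold introduced on the later slides.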
Substitution Error Rates
• Ignore sites w/out substitutions
• Determine the average # of substitutions per nucleotide type (A, G, T or C) per read
• Normalize each of those values for read coverage
• This exercise allows for an assessment of the threshold for baseline “noise” in relation to error-based artifacts
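One possible reading of these steps, sketched in code. The tallies, field names, and the choice to normalize by dividing each count by the coverage at its site are all assumptions made for illustration; the slides do not give the exact calculation.

```python
from collections import defaultdict

# Hypothetical per-site tallies (made-up numbers, not data from the study).
site_tallies = [
    {"site": 16188, "coverage": 8000, "subs": {"A": 11, "G": 29}},
    {"site": 16189, "coverage": 8000, "subs": {}},  # no substitutions
    {"site": 16191, "coverage": 8000, "subs": {"A": 2, "C": 2}},
    {"site": 16195, "coverage": 4000, "subs": {"A": 3}},
]


def average_substitution_rates(tallies):
    """Mean coverage-normalized substitution rate per nucleotide type."""
    rates = defaultdict(list)
    for tally in tallies:
        if not tally["subs"]:
            continue  # ignore sites without substitutions
        for base, count in tally["subs"].items():
            rates[base].append(count / tally["coverage"])  # normalize for coverage
    return {base: sum(values) / len(values) for base, values in rates.items()}


noise = average_substitution_rates(site_tallies)
for base in sorted(noise):
    print(base, f"{noise[base]:.5f}")
```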
Reliable Reporting Threshold of 0.2% (0.002)
[Figure: normalized substitution frequency (y-axis, 0 to 0.03) by nucleotide position (x-axis, 16000 to 16400), plotted separately for A, C, G and T]
Zoom In: Reliable Reporting Threshold of 0.2% (0.002)
[Figure: same plot with the y-axis zoomed to 0 to 0.006; x-axis 16000 to 16400; series A, C, G and T]
APPLY the Reporting Threshold of 0.2% (0.002)
[Figure: zoomed plot (y-axis 0 to 0.006; x-axis 16000 to 16400; series A, C, G and T) with the 0.2% threshold applied]
What About INDELs?
The vast majority of indels are in homopolymer stretches, primarily in the range of 16180-16193 (i.e., AAAACCCCCTCCCC)
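The homopolymer runs in that stretch can be located programmatically. A minimal sketch using the sequence quoted on the slide, assuming its first base corresponds to position 16180 and treating a run of 3 or more identical bases as a homopolymer (the minimum length is an arbitrary choice for illustration):

```python
from itertools import groupby


def homopolymer_runs(seq, start_pos=1, min_len=3):
    """Return (start position, base, length) for each run of identical bases."""
    runs = []
    pos = start_pos
    for base, group in groupby(seq):
        length = len(list(group))
        if length >= min_len:
            runs.append((pos, base, length))
        pos += length
    return runs


runs = homopolymer_runs("AAAACCCCCTCCCC", start_pos=16180)
for start, base, length in runs:
    print(f"{start}-{start + length - 1}: {base} x {length}")
```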
Further Assessments
• How do we coalesce the 2011 interpretation criteria with the new threshold to establish a new set of criteria?
• We cannot assume that 0.2% of 500 reads, or a single observation of a substitution, is enough to report out a heteroplasmic site
• Therefore, a coverage threshold will also be necessary - we’re in the process of evaluating the size of that threshold
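The interaction between a frequency threshold and coverage can be made concrete. In this sketch the 0.2% threshold is from the slides, while the 40-read minimum is borrowed from the 2011 criteria purely for illustration, since the final coverage threshold was still being evaluated:

```python
import math
from fractions import Fraction

REPORTING_THRESHOLD = Fraction(2, 1000)  # 0.2%, kept exact to avoid rounding surprises


def variant_reads_at_threshold(coverage):
    """Variant reads that correspond to exactly the threshold at a given coverage."""
    return coverage * REPORTING_THRESHOLD


def coverage_for_min_reads(min_variant_reads):
    """Coverage at which a threshold-level variant has min_variant_reads reads."""
    return math.ceil(min_variant_reads / REPORTING_THRESHOLD)


reads_at_500 = variant_reads_at_threshold(500)  # a single read
coverage_for_40 = coverage_for_min_reads(40)    # illustrative 2011-style minimum
print(reads_at_500, coverage_for_40)
```

At 500 reads the threshold corresponds to one observation, which is why a frequency threshold alone is not sufficient.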
Future Direction
MiSeq
Bridge Amplification Cluster Generation
Sequencing By Synthesis, TruSeq Reversible Terminators
• Better w/ Homopolymer Stretches
• Cheaper Chemistry
• Greater Throughput
• Better Support
Different Chemistries
MiSeq: Bridge Amplification Cluster Generation; Sequencing By Synthesis, TruSeq Reversible Terminators
Junior: Emulsion PCR; PyroSequencing
PGM: Emulsion PCR; Sequencing By Synthesis, Solid-State pH Meter
Homopolymer Stretches
Sequencing By Synthesis, TruSeq Reversible Terminators: BEST
Sequencing By Synthesis, Solid-State pH Meter: GOOD
PyroSequencing: WORST
Thanks!!
Liam Phillips
Manfred Kayser (Erasmus MC, Netherlands) – Liam’s outside committee member
Cedric Neumann – Liam’s Penn State committee member
Katie O’Hanlon – generated the 454 data used by Liam
454 Life Sciences/Roche – GS Junior
Illumina – MiSeq: Cydne Holt, Kathy Stephens
SoftGenetics – NextGENe®: John Fosnacht, Teresa Snyder-Leiby
AIBiotech