Next Generation DNA Sequencing - Forensic Science at Penn State

PENN STATE FORENSIC SCIENCE


Mitchell M. Holland, Ph.D.
Associate Professor, Biochemistry & Molecular Biology
Director, Forensic Science Program
Pennsylvania State University, University Park, PA

PowerPoint will be posted at www.forensics.psu.edu

The Role of Next Generation DNA Sequencing in Forensic mtDNA Analysis

Annual Meeting, 8 November 2012


History of Forensic mtDNA Analysis

• First applied in the late 1980s to reunite grandchildren with grandparents in Argentina

Orrego C, Wilson AC, King MC. 1988. Identification of maternally-related individuals by amplification and direct sequencing of a highly polymorphic, noncoding region of mitochondrial DNA. Amer J Human Genet 43:A219

• AFDIL began using mtDNA analysis to identify military service members in 1991

Holland MM, Fisher DL, Mitchell LG, Rodriguez WC, Canik JJ, Merril CR, Weedn VW. 1993. Mitochondrial DNA sequence analysis of human skeletal remains: identification of remains from the Vietnam War. J Forensic Sci 38: 542

• The FBI started using mtDNA analysis in casework in 1996

Mitochondrial DNA: State of Tennessee v. Paul Ware
By C. Leland Davis, ADA, District Attorney’s Office, Chattanooga, TN


History of Forensic mtDNA Analysis

• Romanov Case 1994-2009

Nature Genetics 1994 (FSS)

Nature Genetics 1996 (AFDIL)

2009 (AFDIL)


History of Forensic mtDNA Analysis

• Vietnam Unknown Soldier 1998


History of DNA Sequencing

“In the early 1970’s one person would struggle to complete 100 bases of sequence in one year. Then two very similar techniques were developed, one by Allan Maxam and Walter Gilbert in the United States and the other by Frederick Sanger and his coworkers, that made it possible for one person to sequence thousands of base pairs in a year.”

Mapping the Human Genome: DNA Sequencing, Los Alamos Science 1992


“Between 1975 and the present (1992), the number of base pairs of published sequence data grew from roughly 25,000 to almost 100 million. During that time longer and longer contiguous stretches of DNA have been sequenced. In 1991 the longest sequence to be completed was that of the cytomegalovirus genome, which is 229,354 base pairs.”



“By 1992 a cooperative effort in Europe had sequenced an entire chromosome of yeast, chromosome III, which is 315,357 base pairs. And now efforts are underway to sequence million-base stretches of DNA. Accomplishing such large-scale sequencing projects is among the goals for the first five years of the Genome Project.”




“In 1990, when the plans for the Genome Project were being made, the estimated cost of sequencing was $2 to $5 per base.”



Translating into ~$6-15 Billion to sequence the first human genome … the actual cost was around $1-3 Billion
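The arithmetic behind the ~$6-15 Billion estimate can be checked with a short sketch. It assumes a haploid human genome of roughly 3 billion bases, a figure that is not stated on the slide; the function name is only illustrative.

```python
# Assumption (not on the slide): the haploid human genome is ~3 billion bases.
GENOME_SIZE_BASES = 3_000_000_000


def genome_cost_range(cost_low_per_base, cost_high_per_base,
                      genome_size=GENOME_SIZE_BASES):
    """Return the (low, high) total cost in dollars for sequencing every base once."""
    return cost_low_per_base * genome_size, cost_high_per_base * genome_size


# 1990 estimate from the slide: $2 to $5 per base.
low, high = genome_cost_range(2, 5)
print(f"${low / 1e9:.0f}-{high / 1e9:.0f} billion")  # $6-15 billion
```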


Technology Advancements

S35 labeled DNA fragments run through a polyacrylamide gel and exposed to x-ray film

Fluorescently labeled DNA fragments run through a CE and detected by a CCD camera


Future of DNA Sequencing

Next Generation Sequencing = NGS

Second Generation Sequencing = SGS

Massive Parallel Sequencing = MPS

Deep Sequencing = DS

Scientific American


Key Questions

What are the issues that need to be addressed when introducing NGS in forensic DNA labs??

Legal Challenges

Questioned Samples v. Reference Samples

Target Loci

Technology Transfer

Bioinformatics


SNPs

• Morphological SNP Markers
  • Geoprofiling, eye color, skin pigmentation, etc.

• Kinship SNP Markers


STRs

[Publications shown: 2011; 2012; Promega 2012; 2012]


STRs

Mixture Deconvolution

Same “Issues” as Normal CE Analysis


Forensic Applications

Where do we start??


mtDNA Sequencing

Low Hanging Fruit


Historical Relevance

RFLP ~1987-Late 1990s

STRs ~1991-Present

We DIDN’T go directly from RFLP to STRs



RFLP ~1987 to Late 1990s

STRs ~1991 to Present (Slower Track)

DQA1/PM ~1990 to Late 1990s (Fast Track)


DQA1/PM testing provided an easier path for the admissibility of PCR in courts of law … paving the way for STRs


Can History Repeat Itself??

mtDNA testing may provide a similar path for the admissibility of NGS in courts of law … paving the way again for STRs, and also for SNPs


Searched NGS + mtDNA = >100 journal articles



Penn State Group


Promega 2009

mtDNA & STRs

Forensic mtDNA NGS


Promega 2012

Forensic mtDNA NGS


Massive Parallel Sequencing (MPS)

The Players

GS Junior (Early 2010)

Ion PGM (Early 2011)

MiSeq (Late 2011)


Instrument Comparison

Nature Biotechnology 2012

MiSeq: Highest Throughput = 1.6 Gb/run; Lowest Error Rate

Ion PGM: Moderate Throughput = 100s of Mb/run; Fastest Output = 80-100 Mb/hour

GS Junior: Lowest Throughput = 70 Mb/run; Longest Reads = 500-600 bp


Experimental Design

• Sample sources were blood and saliva only

• PCR primer and reaction conditions, along with 454 GS Junior procedures, are provided in our CMJ paper from 2011

www.isabs.hr
www.cmj.hr
www.forensics.psu.edu


Our Objectives

• Can we develop a 454 NGS approach for forensic mtDNA analysis? … from both pristine (references) and challenged samples (old bone & hair shafts)

• Can we report out low level control region mtDNA heteroplasmy using the 454 NGS approach? … with the goal of increasing the discrimination potential of the testing results


Our Objectives

• Assuming that we can report out low level control region mtDNA heteroplasmy using the 454 NGS approach …
  • … what are the criteria we need to address in order to report out reliable results?
  • … what are the important considerations when answering the first question?


Considerations

• Reproducibility

• Concordance

• Threshold Definitions/Interpretation Criteria
  • In addition to the normal quality filters available in the analysis software, what other filters are necessary or desirable?


Table 3, Holland et al, CMJ 2011

Evaluated 30 individuals from 25 different mtDNA lineages

Sample | Sanger mtDNA Profile | Sanger Percent of Minor Heteroplasmy & Site | 454 GS Junior mtDNA Profile | 454 GS Junior Percent of Minor Heteroplasmy & Site
F2 | 16069T, 16093C, 16126C, 16261T, 16274A, 16355T | 16311 – 18.4% C | 16069T, 16093C, 16126C, 16261T, 16274A, 16355T | 16093 – 3.71% T; 16261 – 1.29% C; 16311 – 20.14% C
F3 | 16069T, 16126C, 16145A, 16172C, 16261T | Not Detected | 16069T, 16126C, 16145A, 16172C, 16261T | Not Detected
F4 | No polymorphisms | Not Detected | No polymorphisms | Not Detected
F5 | 16129A, 16172C, 16223T, 16311C | Not Detected | 16129A, 16172C, 16223T, 16311C | 16129 – 0.51% G; 16311 – 0.33% T
F7, F12-13, M13-14 | 16192T, 16256T, 16270T | Not Detected | 16192T, 16256T, 16270T | 16192 – 2.64-4.50% C
F8 | 16223T, 16362C | Not Detected | 16223T, 16362C | 16223 – 1.86% C
F9 | 16356C | Not Detected | 16356C | Not Detected
F10 | 16298C | Not Detected | 16298C | 16298 – 0.45% T
F16 | 16126C, 16239T, 16294T, 16296T, 16304C | Not Detected | 16126C, 16239T, 16294T, 16296T, 16304C | Not Detected
F25 | 16343G | Not Detected | 16343G | Not Detected
F26 | 16093C | Not Detected | 16093C | Not Detected
F27 | 16172C, 16278T | Not Detected | 16172C, 16278T | Not Detected
M3 | 16355T | Not Detected | 16355T | Not Detected
M4 | 16111T | Not Detected | 16111T | 16111 – 0.52% C

Lowest minor variant shown: 0.33% or 1/300


Sanger versus 454 GS Junior Heteroplasmy Detection

Figure 2, Holland et al, CMJ 2011

[Figure: Sanger traces compared with 454 data for three heteroplasmic sites: 3.71% C/T heteroplasmy, 1.29% T/C heteroplasmy, and 20.14% C/T heteroplasmy]


Evaluated 30 individuals from 25 different mtDNA lineages

Sample | Sanger mtDNA Profile | Sanger Percent of Minor Heteroplasmy & Site | 454 GS Junior mtDNA Profile | 454 GS Junior Percent of Minor Heteroplasmy & Site
M5 | 16114A, 16129A, 16192T, 16213A, 16223T, 16278T, 16355T, 16362C | Not Detected | 16114A, 16129A, 16192T, 16213A, 16223T, 16278T, 16355T, 16362C | 16192 – 3.18% C
M7 | 16129A, 16223T, 16264T | Not Detected | 16129A, 16223T, 16264T | Not Detected
M8 | 16224C, 16311C | Not Detected | 16224C, 16311C | Not Detected
M9 | 16301T, 16343G, 16356C | Not Detected | 16301T, 16343G, 16356C | Not Detected
M10 | 16304C | Not Detected | 16304C | 16209 – 2.62% C; 16222 – 2.30% T; 16304 – 2.99% T
M11 | 16129A, 16223T | Not Detected | 16129A, 16223T | Not Detected
M12 | 16069T, 16126C | Not Detected | 16069T, 16126C | 16126 – 1.14% T
M15 | 16093C, 16224C, 16311C | Not Detected | 16093C, 16224C, 16311C | 16093 – 3.04% T
M17 | 16126C, 16294T, 16296T | Not Detected | 16126C, 16294T, 16296T | Not Detected
M18 | 16278T, 16304C, 16311C | Not Detected | 16278T, 16304C, 16311C | 16128 – 0.52% T; 16278 – 0.77% C; 16293 – 0.77% G; 16304 – 1.00% T
M19, F22 | 16069T, 16126C, 16222T | Not Detected | 16069T, 16126C, 16222T | Not Detected

11/25 (44%) of the lineages showed some level of heteroplasmy, with 19 sites of heteroplasmy observed across the 11 lineages
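The summary figures can be re-derived from the 454 GS Junior heteroplasmy column of the two table slides. The sketch below is only a tally of the values transcribed from those slides; the variable names are illustrative.

```python
# Heteroplasmic sites reported by the 454 GS Junior, transcribed from the
# table slides. Lineages not listed here had no heteroplasmy detected.
sites_454 = {
    "F2": [16093, 16261, 16311],
    "F5": [16129, 16311],
    "F7, F12-13, M13-14": [16192],
    "F8": [16223],
    "F10": [16298],
    "M4": [16111],
    "M5": [16192],
    "M10": [16209, 16222, 16304],
    "M12": [16126],
    "M15": [16093],
    "M18": [16128, 16278, 16293, 16304],
}
TOTAL_LINEAGES = 25  # from the slide

lineages_with_heteroplasmy = len(sites_454)
total_sites = sum(len(sites) for sites in sites_454.values())
percent = 100 * lineages_with_heteroplasmy / TOTAL_LINEAGES

# Prints: 11/25 (44%) lineages, 19 sites
print(f"{lineages_with_heteroplasmy}/{TOTAL_LINEAGES} ({percent:.0f}%) lineages, "
      f"{total_sites} sites")
```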


Issues to Consider

• Are the observed heteroplasmic positions consistent with past studies on other samples?
  • Yes – for example, Parsons et al, Nat Genet 1997 & Tully et al, Am J Hum Genet 2000

• Are the sequence changes real, or are they simply artifacts/errors of the PCR and/or sequencing process?




Addressing the Issue, Circa 2011

• First, each of the reported heteroplasmic sequences had coverage rates of at least 40 reads, with the vast majority having more than 100 reads
  • EXAMPLE: if a variant was observed in 0.5% of the sequences (1/200), there were at least 40 reads of the heteroplasmic sequence in 8,000 total reads
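The coverage example is simple proportional arithmetic, sketched here under the assumption that the 40-read minimum applies to reads carrying the variant. The function name is hypothetical, and the variant level is passed as "1 in N" to keep the calculation in whole numbers.

```python
MIN_VARIANT_READS = 40  # minimum reads of the heteroplasmic sequence (from the slide)


def total_reads_needed(one_in, min_variant_reads=MIN_VARIANT_READS):
    """Total reads needed to expect `min_variant_reads` copies of a variant
    that appears in 1 of every `one_in` reads."""
    return min_variant_reads * one_in


print(total_reads_needed(200))  # 0.5% variant (1/200)  -> 8000
print(total_reads_needed(300))  # 0.33% variant (1/300) -> 12000
```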




Addressing the Issue, Circa 2011

• Second, a subset of samples were run in triplicate or duplicate
  • EXAMPLE: F5 (0.51% 16129 G, 0.33% 16311 T)
    • 16129: Exp 1 = 0.51%, Exp 2 = 1.06%, Exp 3 = 0.36%
    • 16311: Exp 1 = 0.33%, Exp 2 = 1.09%, Exp 3 = 1.82%
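One way to look at the spread across the F5 replicates is to summarize each site by its mean and range. This is only an illustrative sketch of the replicate values on the slide, not a statistic the authors report.

```python
from statistics import mean

# Minor-variant percentages for sample F5 across three experiments (from the slide).
f5_replicates = {
    16129: [0.51, 1.06, 0.36],
    16311: [0.33, 1.09, 1.82],
}


def summarize(values):
    """Return the mean (rounded to 2 decimals), minimum and maximum of the replicates."""
    return {"mean": round(mean(values), 2), "min": min(values), "max": max(values)}


for site, values in f5_replicates.items():
    print(site, summarize(values))
```

In both cases the variant is seen in every replicate, although its measured level varies several-fold between runs.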


Interpretation Criteria, Circa 2011

• “Rules” for accepting heteroplasmy data
  • Is the coverage rate for the variant site at or above 40 reads?
  • Is the ratio of forward-to-reverse reads consistent with the total forward-to-reverse read-ratio?
  • Is the variant site observed in other data associated with the sequencing run, and is it a plausible site?


• “Rules” for accepting heteroplasmy data
  • Bottom Line
    • If one or more of the “rules” is broken, we did not report the observation of the heteroplasmic variant
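The three 2011 "rules" and the all-or-nothing bottom line could be expressed as a single check, sketched below. Only the 40-read minimum comes from the slides. The strand-balance tolerance, the class and function names, and the idea of passing the third rule as an analyst's yes/no judgment are assumptions made for illustration.

```python
from dataclasses import dataclass

MIN_VARIANT_READS = 40   # from the 2011 criteria
STRAND_TOLERANCE = 0.20  # hypothetical: largest allowed gap between forward-read fractions


@dataclass
class VariantObservation:
    variant_forward: int     # forward reads carrying the variant
    variant_reverse: int     # reverse reads carrying the variant
    total_forward: int       # all forward reads covering the site
    total_reverse: int       # all reverse reads covering the site
    passes_run_review: bool  # analyst's answer to the third rule (run context, plausibility)


def forward_fraction(forward, reverse):
    """Fraction of reads in the forward direction (0.0 if there are no reads)."""
    total = forward + reverse
    return forward / total if total else 0.0


def report_variant(obs, tolerance=STRAND_TOLERANCE):
    """Return True only if every rule is satisfied; one broken rule means no report."""
    variant_reads = obs.variant_forward + obs.variant_reverse
    coverage_ok = variant_reads >= MIN_VARIANT_READS
    gap = abs(forward_fraction(obs.variant_forward, obs.variant_reverse)
              - forward_fraction(obs.total_forward, obs.total_reverse))
    strand_ok = gap <= tolerance
    return coverage_ok and strand_ok and obs.passes_run_review


# Hypothetical observations, not data from the study.
balanced = VariantObservation(98, 103, 4000, 4000, True)
skewed = VariantObservation(50, 1, 4000, 4000, True)
low_coverage = VariantObservation(10, 10, 4000, 4000, True)
print(report_variant(balanced), report_variant(skewed), report_variant(low_coverage))
```

Comparing forward-read fractions rather than raw ratios avoids division by zero when a variant is seen in only one direction.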


• “Expanded Rules” for accepting heteroplasmy data
  • Is the observed sequence variant a transition, transversion or INDEL, and what effect might that have on the interpretation process? … especially from a reporting perspective


• “Expanded Rules” for accepting heteroplasmy data
  • What is the empirical rate of sequencing errors across HV1, especially for INDELs?
  • Can thresholds or filters be applied?
  • What effect do these answers have on the reporting of low-level mtDNA heteroplasmy?


Defining Errors

Where do errors occur in the NGS process?

PCR -> emPCR -> PyroSequencing -> Analysis


Addressing the Issue of Errors, Circa 2012

Example of a polymorphism identified when the sequence is compared to the rCRS … observed in the vast majority of sequence reads

NextGENe® Software from SoftGenetics, Inc



A “random” error in the sequence observed in a small minority of sequence reads … or heteroplasmy if it meets the criteria previously described

PCR -> emPCR -> PyroSequencing -> Analysis



INDELs – homopolymer stretches produce the vast majority of the indel errors, especially between nucleotide positions 16160-16200


• We looked at >200,000 reads from multiple 454 runs covering ~75,000,000 nucleotides of HV1 mtDNA sequence

• Measured the type and number of errors at each position

• Assessed the data with the goal of establishing a reporting threshold for low-level mtDNA heteroplasmy


Observations

• MORE READS = MORE ERRORS … however, the errors are
  • … typically repeatable and consistent
  • … typically observed in the same “poor” ratios in relation to forward/reverse reads
  • … and there appears to be a direct linear relationship between the two (more reads = more errors)


Error Tally for Total Reads in a Single Run

Run 1 MID 9

Site | Perfect | Hetero | A | C | G | T | Ins | Del | Site Pattern

16187 +
16188 11;0 29;3 7673;6 C ccccC
16189 20;0 9;14 0;2 4;37 C 22;0 T cccccT
16190 2;14 5;0 0;17 A 7;0 C cccctCc
16191 2;0 2;0 24;5 tcCtc
16192 C 98;103 3;1 1;0 1;50 T tccTc
16193 1;0 4;1 8;27 T 47;5 C cctCat
16194 +
16195 3;0 0;9 C caTgc

NOTE: A heteroplasmic site can be seen at 16192


Substitution Error Rates

On average, 1 in every 4 reads has a single random substitution error

Read = ~340 bp
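The per-read figure can be converted into a rough per-base rate. The sketch assumes substitution errors are spread evenly along the read, which is a simplification; the function name is illustrative.

```python
READS_PER_ERROR = 4   # one substitution error per four reads, on average (from the slide)
READ_LENGTH_BP = 340  # approximate read length (from the slide)


def per_base_substitution_rate(reads_per_error, read_length_bp):
    """Average substitution errors per sequenced base, assuming an even spread."""
    return 1 / (reads_per_error * read_length_bp)


rate = per_base_substitution_rate(READS_PER_ERROR, READ_LENGTH_BP)
print(f"{rate:.5f} ({rate * 100:.3f}%)")  # 0.00074 (0.074%)
```

That is roughly one substitution error per 1,360 sequenced bases.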


Substitution Error Rates

Ignore sites w/out substitutions

Determine the average # of substitutions per nucleotide type (A, G, T or C) per read

Normalize each of those values for read coverage

This exercise allows for an assessment of the threshold for baseline “noise” in relation to error-based artifacts
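A minimal sketch of the steps on this slide follows, under one reading of them: drop sites with no substitutions, then divide each per-base substitution count by the read coverage at that site. The data layout, function names, and example numbers are hypothetical, not taken from the study.

```python
def normalized_substitution_rates(substitution_counts, coverage):
    """Convert raw substitution counts into coverage-normalized frequencies.

    substitution_counts: {site: {substituted_base: count}}
    coverage:            {site: total reads covering the site}
    Sites without any substitutions are ignored.
    """
    rates = {}
    for site, by_base in substitution_counts.items():
        observed = {base: n for base, n in by_base.items() if n > 0}
        if not observed:
            continue  # ignore sites w/out substitutions
        rates[site] = {base: n / coverage[site] for base, n in observed.items()}
    return rates


def highest_rate(rates):
    """Largest normalized frequency across all sites and bases (0.0 if none)."""
    return max((r for by_base in rates.values() for r in by_base.values()),
               default=0.0)


# Hypothetical counts for three HV1 positions, each covered by 8,000 reads.
counts = {16100: {"A": 3, "G": 0}, 16101: {}, 16150: {"T": 8}}
depth = {16100: 8000, 16101: 8000, 16150: 8000}
rates = normalized_substitution_rates(counts, depth)
print(rates)
print(highest_rate(rates))
```

The highest normalized frequency in a data set gives a sense of where baseline "noise" sits relative to any candidate reporting threshold.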


Reliable Reporting Threshold of 0.2% (0.002)

[Plot: substitution frequency (vertical axis, 0 to 0.03) by HV1 nucleotide position (horizontal axis, 16000 to 16400), one series per base: A, C, G, T]


Zoom In

Reliable Reporting Threshold of 0.2% (0.002)

[Plot: substitution frequency (vertical axis, 0 to 0.006) by HV1 nucleotide position (horizontal axis, 16000 to 16400), one series per base: A, C, G, T]


APPLY the Reporting Threshold of 0.2% (0.002)

[Plot: substitution frequency (vertical axis, 0 to 0.006) by HV1 nucleotide position (horizontal axis, 16000 to 16400), one series per base: A, C, G, T]


What About INDELs?

The vast majority of indels are in homopolymer stretches, primarily in the range of 16180-16193 (i.e., AAAACCCCCTCCCC)


Further Assessments

• How do we coalesce the 2011 interpretation criteria with the new threshold to establish a new set of criteria?
  • We cannot assume that 0.2% of 500 reads, or a single observation of a substitution, is enough to report out a heteroplasmic site
  • Therefore, a coverage threshold will also be necessary - we’re in the process of evaluating the size of that threshold
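The point about 0.2% of 500 reads can be made concrete with a sketch that combines the frequency threshold with a minimum number of variant reads. The 0.2% threshold is from the slides. The minimum read count below simply reuses the 2011 value of 40 as a placeholder, since the slides say the size of the new coverage threshold was still being evaluated; the function names are hypothetical.

```python
FREQUENCY_THRESHOLD = 0.002  # 0.2% reporting threshold (from the slides)
MIN_VARIANT_READS = 40       # placeholder: 2011 value; the new size was under evaluation


def variant_reads_at_threshold(total_reads, threshold=FREQUENCY_THRESHOLD):
    """Number of variant reads that corresponds to the threshold at a given coverage."""
    return total_reads * threshold


def reportable(variant_reads, total_reads,
               threshold=FREQUENCY_THRESHOLD, min_variant_reads=MIN_VARIANT_READS):
    """Require both a frequency at or above the threshold and enough variant reads."""
    if total_reads <= 0:
        return False
    frequency_ok = variant_reads / total_reads >= threshold
    coverage_ok = variant_reads >= min_variant_reads
    return frequency_ok and coverage_ok


print(round(variant_reads_at_threshold(500)))  # 1 read: a single observation
print(reportable(1, 500))                      # False: meets 0.2% but only one read
print(reportable(50, 20000))                   # True: 0.25% with 50 variant reads
```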


Future Direction

MiSeq

Bridge Amplification Cluster Generation

Sequencing By Synthesis, TruSeq Reversible Terminators

Better w/ Homopolymer Stretches

Cheaper Chemistry

Greater Throughput

Better Support


Different Chemistries

MiSeq: Bridge Amplification Cluster Generation; Sequencing By Synthesis, TruSeq Reversible Terminators

Junior: Emulsion PCR; PyroSequencing

PGM: Emulsion PCR; Sequencing By Synthesis, Solid-State pH Meter


Homopolymer Stretches

Sequencing By Synthesis, TruSeq Reversible Terminators: BEST

Sequencing By Synthesis, Solid-State pH Meter: GOOD

PyroSequencing: WORST


Thanks!!

Liam Phillips

Manfred Kayser (Erasmus MC, Netherlands) – Liam’s outside committee member

Cedric Neumann – Liam’s Penn State committee member

Katie O’Hanlon – generated the 454 data used by Liam

454 LifeSciences/Roche – GS Junior

Illumina – MiSeq: Cydne Holt, Kathy Stephens

SoftGenetics – NextGENe®: John Fosnacht, Teresa Snyder-Leiby

AIBiotech
