Highly Heterozygous STR Markers for Enhanced DNA Mixture Deconvolution

Nicole M.M. Novroski, PhD

Assistant Professor, Forensic Science Program

Department of Anthropology

University of Toronto Mississauga


Web: www.nicolenovroski.com

Thursday, September 26th, 2019

1


Short Tandem Repeats (STRs)

• Also known as microsatellites

• Highly polymorphic genetic markers

• Repeat units composed of 2-7 nucleotides

• Vary in length → 2 or more repeats

• Heterozygosity makes STRs useful for forensic DNA typing

2

Short Tandem Repeats (STRs)

• Simple: [TCTA]n → TCTATCTATCTATCTA…

• Compound: [TCTA]a [TCTG]b [TCTA]c → TCTATCTATCTGTCTGTCTATCTATCTA…

• Complex: [TCTA]a [TCTG]b TC [TCTA]c TCA [TCTA]d → TCTATCTGTCTCTATCTATCATCTA…

Diversity of STRs is also affected by the flanking region in the amplicon!
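The bracket notation for simple, compound, and complex STRs can be expanded mechanically into a nucleotide string. The sketch below is only an illustration: the function name `expand_str` and the concrete repeat counts (standing in for the symbolic subscripts a, b, c, d) are assumptions chosen so that the output reproduces the example sequences shown on the slide.

```python
import re

# A token is either a bracketed motif with a numeric repeat count, e.g. "[TCTA]3",
# or a bare interrupting sequence, e.g. "TCA".
TOKEN = re.compile(r"\[([ACGT]+)\](\d+)|([ACGT]+)")


def expand_str(notation: str) -> str:
    """Expand bracketed repeat notation into the literal nucleotide sequence."""
    parts = []
    for motif, count, bare in TOKEN.findall(notation):
        # Bracketed motifs are repeated; bare sequences are copied as-is.
        parts.append(motif * int(count) if motif else bare)
    return "".join(parts)


# Hypothetical repeat counts chosen to match the slide's example sequences.
simple = expand_str("[TCTA]4")
compound = expand_str("[TCTA]2 [TCTG]2 [TCTA]3")
complex_ = expand_str("[TCTA]1 [TCTG]1 TC [TCTA]2 TCA [TCTA]1")

print(simple)
print(compound)
print(complex_)
```

Two alleles with the same total length can therefore differ in the expanded string, which is the property sequence-based typing exploits.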

3

Overview of Forensic DNA Typing

DNA Extraction

DNA Quantitation

PCR Amplification of Multiple STR loci

Biology

Technology

Separation and Detection of PCR Products (STR Alleles)

DNA Sample Genotype Determination

Application

Compare sample genotype to reference/crime scene samples

Compare DNA profiles to population databases

Generate Reports using Statistics (RMP, CPI, LR)

4

Capillary Electrophoresis (CE)

Typical DNA Profile

http://www.promega.com/products/genetic-identity/str-amplification/powerplex-fusion-str-kits/

• A homozygote genotype is made of two alleles that are the same!

• A heterozygote genotype is made of two alleles that are different!

Each DNA STR locus (location) has 2 alleles (one from each parent) → where the combination of alleles is a genotype!

5

DNA Profile Comparison – an example!

6

Image taken from: http://dnaproject.co.za/dna-database

DNA Mixtures

DNA evidence

7

Mixture?

Summary of Casework Mixtures

http://www.cstl.nist.gov/biotech/strbase/training/AAFS2008_1_CaseworkSurvey.pdf

8

DNA Mixture Interpretation

How many people are in the DNA mixture?

Whose DNA is whose?

Which alleles belong to which contributor?

How much of each person’s DNA is in the sample?

9

Example of a 2-person resolvable mixture

+

10

Example of an unresolvable mixture

?

11

Current State of Forensic Genetics

• Crime labs use PCR-CE to amplify and separate STR alleles by size from biological samples of victims, suspects, and crime scene evidence

  • Comparisons of DNA profiles (genotyped alleles at each locus) look for similarities and differences between allele lengths at each STR locus

• Why is this problematic?

  • Individuals can have the same genotype at any given locus

  • The length of the STR does not use all of the genetic information

  • Mixtures are TRICKY!

12

Massively Parallel Sequencing

• A high-throughput approach to DNA Sequencing

• Also known as Next Generation Sequencing (NGS)

• Improved alternative to PCR-CE methods

  • Large amount of DNA data

  • Minimal input DNA

• General process involves sequencing a large number of fragments in parallel

  • Spatially separated, clonally amplified DNA templates

• Targeted amplification of STRs can be transferred to MPS

  • ForenSeq™

  • Custom Amplicon Designs

13

PCR-CE versus MPS

• PCR-CE relies solely on length to determine STR genotype

• MPS is capable of revealing both size and exact sequence of STR alleles

  • Provides opportunity to identify sequence variation

Image taken from Novroski et al. (2016) Supplemental Table 2

14

Sequence Variation

Figure labels: Allele 1, Allele 2, Allele 3

Using MPS for a given STR X can reveal:

• 14 allele has (at least) 3 motifs

  • (TCTA)2 TCTG (TCTA)11

  • TCTA TCTG (TCTA)12

  • (TCTA)14

• 15 allele has (at least) 5 motifs

  • (TCTA)2 TCTG (TCTA)12

  • TCTA (TCTG)2 (TCTA)12

  • TCTA TCTG TGTA (TCTA)12

  • (TCTA)15

  • TCTA TCTG (TCTA)13

Using CE for the same STR:

• 14 allele is simply a 14

• 15 allele is simply a 15
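To make the CE versus MPS contrast concrete, the sketch below counts tetranucleotide units in each of the eight motif patterns listed for STR X and groups them by length. The helper name `repeat_count` and the parsing approach are assumptions made for illustration; only the motif strings come from the slide.

```python
import re
from collections import defaultdict

# A token is either "(MOTIF)count", e.g. "(TCTA)12", or a single bare motif, e.g. "TCTG".
TOKEN = re.compile(r"\(([ACGT]+)\)(\d+)|([ACGT]+)")


def repeat_count(allele: str, unit: int = 4) -> int:
    """Return the length-based allele call: total bases divided by the repeat unit size."""
    total_bases = 0
    for motif, count, bare in TOKEN.findall(allele):
        total_bases += len(motif) * int(count) if motif else len(bare)
    return total_bases // unit


# Motif patterns for STR X as listed on the slide.
sequences = [
    "(TCTA)2 TCTG (TCTA)11",
    "TCTA TCTG (TCTA)12",
    "(TCTA)14",
    "(TCTA)2 TCTG (TCTA)12",
    "TCTA (TCTG)2 (TCTA)12",
    "TCTA TCTG TGTA (TCTA)12",
    "(TCTA)15",
    "TCTA TCTG (TCTA)13",
]

by_length = defaultdict(list)
for seq in sequences:
    by_length[repeat_count(seq)].append(seq)

for length, variants in sorted(by_length.items()):
    print(f"CE allele {length}: {len(variants)} sequence-based alleles")
```

CE reports only the dictionary keys (14 and 15), while MPS distinguishes every entry in each list.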

15

Sequence-based allele labels: 14, 14a, 14b; 15, 15a, 15b, 15c, 15d

How does sequence variation change interpretation?

• Melissa and Marie are both a [14,15] at STR X

(TCTA)14 and (TCTA)15 → Genotype: [14,15]

(TCTA)14 and TCTA TCTG TGTA (TCTA)12 → Genotype: [14,15a]

If at least one motif differs in the [14,15] genotype between Melissa and Marie, their (CE) length-based genotypes would NO LONGER be the same using MPS!

16

How does sequence variation change interpretation?

• A mixture of Melissa and Marie’s alleles at STR X would now provide more information:

Length-based Profile (PCR-CE)

14 15

Potential options for this mixture include:

• Melissa and Marie are both heterozygotes: [14,15]

• Melissa and Marie are both homozygotes: [14,14] and [15,15]

• One is a heterozygote and one is a homozygote: [14,15] and [15,15]

Sequence-based Profile (MPS)

14 15

Potential options for this mixture include:

• Melissa and Marie are both heterozygotes: [14,15]

• One is a heterozygote and one is a homozygote: [14,14] and [15,15a]

The analyst can eliminate a possibility!
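A small sketch of the Melissa and Marie example shows how the set of alleles observed in a mixture changes between length-based and sequence-based typing. The slides do not say which person carries which genotype, so `person_1` and `person_2` are placeholder names, and the helper functions are assumptions made for illustration.

```python
def length_label(allele: str) -> str:
    """Drop the sequence-variant suffix to get the length-based call: '15a' -> '15'."""
    return allele.rstrip("abcdefghijklmnopqrstuvwxyz")


def mixture_alleles(*genotypes, by_sequence=True):
    """Return the sorted distinct alleles visible in a mixture of the given genotypes."""
    alleles = {allele for genotype in genotypes for allele in genotype}
    if not by_sequence:
        # Length-based typing cannot tell sequence variants of the same length apart.
        alleles = {length_label(allele) for allele in alleles}
    return sorted(alleles)


# Genotypes from the slides; the assignment to individuals is an assumption.
person_1 = ("14", "15")
person_2 = ("14", "15a")

print("PCR-CE:", mixture_alleles(person_1, person_2, by_sequence=False))
print("MPS:   ", mixture_alleles(person_1, person_2, by_sequence=True))
```

With length alone the mixture shows two alleles; with sequence it shows three, which narrows the set of genotype combinations an analyst has to consider.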

17

18

MPS is a promising technology for forensic genetics and using MPS for STR typing has revealed more variation for some STR loci!

Extensive Characterization of STR Loci

• Analysis of STR sequence variation for 59 STR loci

  • 27 autosomal, 7 X chromosome and 25 Y chromosome STRs

  • Includes ALL CODIS core loci (n=20; CODIS is the national DNA database)

• 777 individuals in four major population groups

  • African American, Hispanic, Caucasian, Asian

• Library preparation and sequencing performed using the MiSeq FGx Forensic Genomics System (Illumina)

• Data analyzed using the ForenSeq software and advanced bioinformatic tools, including STRait Razor and in-house Excel workbooks

19

• Four general categories of sequence variation:

  • 1 – Increase in alleles due to repeat region variation

    • e.g., D2S1338, D12S391

  • 2 – Increase in alleles due to flanking region variation

    • e.g., D7S820, D13S317

  • 3 – Increase in alleles due to both repeat region and flanking region variation

    • e.g., D18S51, DXS10135

  • 4 – No increase in alleles beyond length-based methods

    • e.g., TPOX, Y-GATA-H4

20

Extensive Characterization Results

Image taken from Novroski et al. (2016)

21


• The diversity of STR allele variation continues to increase with expanded population studies

  • An increase of 644 alleles using a sequence-based approach

  • Over 400 novel variants discovered in 777 population samples across 4 populations

• Population genetic analyses revealed increased heterozygosity and discrimination power for some loci!

  • Increased heterozygosity will improve discrimination power

• A few loci did not show an increase in diversity or heterozygosity with a sequence-based approach

22


• Complex DNA mixtures are commonplace in forensic biological evidence

  • Despite best efforts for technological and statistical improvements, many mixtures remain unresolvable

• The diversity of STR allele variation can increase with a sequence-based approach

  • HOWEVER… a few loci had no observable increase in diversity or heterozygosity

    • No gain in information

    • No improved DNA mixture deconvolution capabilities

  • Core loci were not selected based on sequence variation

23

Problem(s)

24

There are highly polymorphic STRs in the human genome that may be better suited than currently applied markers (based on sequence variation and allele length spread) to facilitate deconvolution of component contributors in a mixed DNA sample

• Phase 1 – Establish candidate STRs with increased diversity from freely available datasets

  • Focus on STR loci with high heterozygosity, reduced length-based allele spread, and tetranucleotide or larger repeats

• Phase 2 – Preliminary analysis of STR candidates

  • Evaluate population genetic parameters (heterozygosity, allele spread)

  • Evaluate marker performance

• Phase 3 – Develop a comprehensive DNA mixture deconvolution panel

  • Characterize STRs with U.S. population samples

  • Evaluate mixture deconvolution capabilities of each locus in DNA mixtures

25

Approach

Phase 1: STR Candidate Selection

• 1000 Genomes Project (raw sequences, unsorted)

• STR Catalog Viewer

  • Summary of human STR variation compiled using lobSTR software

26

• Tetranucleotides and larger (increase PCR efficiency, reduce artifacts)

• Small length-based allele spread (minimized preferential amplification, diversity of alleles maintained when coupled with >80% heterozygosity)

• >80% heterozygosity (↑ variability of markers for easier differentiation between individuals)
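The three selection criteria can be expressed as a simple filter. This is a minimal sketch, not the study's Perl-based mining code: the `Candidate` record, the locus names, and the example values are all hypothetical, and the spread cutoff of 10 is taken from the panel summary later in the deck.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str              # hypothetical locus label
    motif_length: int      # nucleotides per repeat unit
    heterozygosity: float  # fraction between 0 and 1
    allele_spread: int     # largest minus smallest length-based allele


def passes(c: Candidate, min_motif: int = 4, min_het: float = 0.80, max_spread: int = 10) -> bool:
    """Apply the three criteria: motif size, heterozygosity, and allele spread."""
    return (
        c.motif_length >= min_motif
        and c.heterozygosity > min_het
        and c.allele_spread <= max_spread
    )


# Made-up candidates, one failing each criterion and one passing all three.
candidates = [
    Candidate("locus_A", motif_length=4, heterozygosity=0.91, allele_spread=8),
    Candidate("locus_B", motif_length=2, heterozygosity=0.93, allele_spread=6),
    Candidate("locus_C", motif_length=5, heterozygosity=0.72, allele_spread=7),
    Candidate("locus_D", motif_length=4, heterozygosity=0.88, allele_spread=17),
]

selected = [c.name for c in candidates if passes(c)]
print(selected)
```

Keeping the thresholds as parameters makes it easy to relax one criterion, for example when a highly heterozygous locus has a wider allele spread.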

27


28


29


“77” allele corresponds to a 12 length-based repeat plus flanking information

http://strcat.teamerlich.org/chart/chr1/187550371/187550451

30


Initial Candidate STR Search

Manual STR Mining of LobSTR dataset (n = 1102 candidates)

Perl-based STR Mining of LobSTR dataset (n = 2784 candidates)

Overlap of STR Candidate in both Manual and Perl-based Mining (n = 793 candidates)

DesignStudio Testing using Repeat Region as Targets (n = 337 candidates, off-center)

DesignStudio Testing using Repeat Region AND Flanking Region as Targets (n = 201 candidates)

Inclusion of compatible Phillips et al. STRs (n = 47)

Preliminary Panel using TruSeq Custom Amplicons (n = 248 candidates)

Perl-based STR Mining of 1000 Genomes BAM files (n = 544 candidates)

31


Phase 1 with >3000 Candidates → 793 Candidates → Design and Redesign → 248 Candidates

Illumina TruSeq Custom Amplicon Preliminary Panel (248 Candidates)

Sequencing Performance → STR Classified on Read Depth (Depth of Coverage) and Chemistry Compatibility

From the 248 Candidates → 55 Failures; 58 Poor Performers; 72 Fair Performers; 63 Good Performers

Considering (some) Poor, (all) Fair, (all) Good Performing Candidates → Heterozygosity and Allele Spread Assessed and Ranked

53 STR Candidates that met criteria of high heterozygosity (>80%), reduced length-based allele spread (≤10), and a repeat size of four or more nucleotides

Using top ranked candidates for heterozygosity and chemistry compatibility → 73 Candidates Selected for Refined (DECoDE) Panel

Phase 3: DECoDE Panel Assessment and in silico mixtures

32

Phase 2: Preliminary Panel Summary

• The STR DNA EnhanCed DEconvolution (DECoDE) panel

  • 73 highly heterozygous loci using MPS chemistry

  • 15 loci previously described by Phillips (2016) and others

• 451 unrelated individuals from three U.S. populations

  • Caucasian (CAU; n=155)

  • Hispanic (HIS; n=148)

  • African American (AFA; n=148)

• Each STR locus was characterized and reviewed manually for diversity using in-house Excel workbooks

  • Alleles characterized by length and sequence

  • Population genetics analyses (heterozygosity; Hardy-Weinberg equilibrium (HWE); linkage disequilibrium (LD); random match probabilities (RMP))

33

Phase 3: The STR DECoDE Panel

Where SNP = single nucleotide polymorphism; SB = sequence-based; LB = length-based; Ho = observed heterozygosity; He = expected heterozygosity; Allele spread refers to the length-based difference between the smallest observed length-based allele and the largest observed length-based allele.
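For readers unfamiliar with Ho and He, the sketch below computes both from a list of genotypes. It uses the textbook definitions (Ho is the fraction of heterozygous individuals; He is one minus the sum of squared allele frequencies) without any sample-size correction, which may differ from the estimator used in the study. The toy genotypes are invented for illustration.

```python
from collections import Counter


def observed_heterozygosity(genotypes):
    """Fraction of individuals whose two alleles differ."""
    heterozygotes = sum(1 for a, b in genotypes if a != b)
    return heterozygotes / len(genotypes)


def expected_heterozygosity(genotypes):
    """1 - sum(p_i^2), with p_i the sample frequency of allele i (no bias correction)."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


# Hypothetical sequence-based genotypes at a single locus.
toy_genotypes = [("14", "15"), ("14", "14"), ("15", "15a"), ("14", "15a")]

ho = observed_heterozygosity(toy_genotypes)
he = expected_heterozygosity(toy_genotypes)
print(f"Ho = {ho:.3f}, He = {he:.3f}")
```

Collapsing "15a" into "15" in the same data lowers both values, which is why sequence-based typing can raise heterozygosity at loci with same-length variants.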

34


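Columns: Chromosome; Locus Name; GRCh38 Reference Allele; Repeat Length; Locus Type; Motif; 5' SNPs; 3' SNPs; SB He; SB Ho; Allele Spread

D3S2406 (chromosome 3): GRCh38 reference allele 33; repeat length 4; locus type COMPLEX; SB He 0.9855; SB Ho 0.9889; allele spread 17
  Motif: [TATC]a [TGTC]b [CGTC]c [CATC]d
  5' SNPs: rs2035580; rs533349040; rs551975676; rs566835150; rs146488521; rs555445819; rs567665857; rs535707905; rs71625920
  3' SNPs: rs573290642; rs543848107; rs190000499; rs143356890

D3A57 (chromosome 3): GRCh38 reference allele 21; repeat length 4; locus type COMPLEX; SB He 0.9717; SB Ho 0.9823; allele spread 15
  Motif: [TCTT]a [TCGT]b T [TCTT]c [TCCT]d [TCTT]e [TCCT]f
  5' SNPs: rs2694124
  3' SNPs: rs1451872314; rs1229265964

D8A26 (chromosome 8): GRCh38 reference allele 19; repeat length 4; locus type COMPLEX; SB He 0.9700; SB Ho 0.9491; allele spread 20
  Motif: [TTCC]a N16 [CTTT]b C [CTTT]c
  5' SNPs: rs116668567
  3' SNPs: rs545041953

D8S1132 (chromosome 8): GRCh38 reference allele 20; repeat length 4; locus type COMPLEX; SB He 0.9588; SB Ho 0.9292; allele spread 12
  Motif: [TCTA]a TCA [TCTA]b
  5' SNPs: rs139381851; rs142372169
  3' SNPs: rs568433577

D15S822 (chromosome 15): GRCh38 reference allele 17; repeat length 4; locus type COMPOUND; SB He 0.9571; SB Ho 0.9513; allele spread 15
  Motif: [TATC]a [TCTA]b
  5' SNPs: none observed
  3' SNPs: rs117117801; rs8041848

D3N61 (chromosome 3): GRCh38 reference allele 17; repeat length 4; locus type COMPLEX; SB He 0.9566; SB Ho 0.9178; allele spread 15
  Motif: [TTTC]a TTC [TTTC]b TAT [TA]c [TTTC]d
  5' SNPs: rs17026573; rs9841195
  3' SNPs: rs4855796

D8A29 (chromosome 8): GRCh38 reference allele 11; repeat length 5; locus type SIMPLE; SB He 0.9545; SB Ho 0.9222; allele spread 16
  Motif: [AAAGG]a
  5' SNPs: none observed
  3' SNPs: rs11166830

D11N29 (chromosome 11): GRCh38 reference allele 14; repeat length 5; locus type SIMPLE; SB He 0.9508; SB Ho 0.9111; allele spread 16
  Motif: [GAGAA]a
  5' SNPs: rs538065917; rs6590431; rs200846422
  3' SNPs: none observed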

• Summary:

  • 71 STRs had heterozygosities above 80%

  • 17 STRs had allele spreads of 10 length-based alleles or less

    • Where 43 had between 11-15, and 13 had spreads of 16 or greater

  • All STRs met the criteria of tetranucleotide motifs or larger

    • 10 were simple; 6 were compound; and 57 were complex

  • HWE, LD and RMP seemed to meet expectations

• Some loci revealed operationally problematic characteristics

  • Dinucleotide patterns

  • Two distinct repeat regions

  • Homopolymer stretches

35


• A subset of 20 DECoDE loci selected for comparison to the CODIS core loci

  • The current requirement is 20 CODIS core loci for upload into the national DNA database

  • High heterozygosity (>90%)

  • Operationally problematic loci (even if heterozygosity >90%) were excluded

• 443 U.S. population samples

  • African American (AFA; n=140, 8 incomplete profile samples removed)

  • Caucasian (CAU; n=155)

  • Hispanic (HIS; n=148)

36

Phase 3: In silico Mixtures

CODIS Panel Loci    DECoDE Panel Loci   BEST Panel Loci
D2S1338             D3S2406             D3S2406
D12S391             D2S1360             D2S1360
D1S1656             D7S3048             D7S3048
D21S11              D8S1132             D8S1132
D8S1179             D11S2368            D11S2368
vWA                 D15S822             D15S822
D3S1358             D2N2                D2N2
D18S51              D1N10               D1N10
FGA                 D12N15              D12N15
D19S433             D1N16               D1N19
D13S317             D1N19               D1N21
D5S818              D1N21               D8N23
D16S539             D8N23               D15N26
D22S1045            D15N26              D14N56
D7S820              D14N56              D3N61
D2S441              D3N61               D12S391
CSF1PO              D4N70               D4N70
D10S1248            D11N52              D2S1338
TPOX                D17N32              D1S1656
TH01                D2N43               D11N52

Orange cells reflect the CODIS core loci.

37

38

Phase 3: Panel Comparisons

Summary of length-based and sequence-based heterozygosities (He) in descending rank order for three U.S. populations (AFA=African American; CAU=Caucasian; HIS=Hispanic) for the CODIS core loci (n=20) and a subset (n=20; Supplemental Table 6) of the DECoDE panel loci.

Locus (in descending rank order of (average) heterozygosity)

39


• Simulations using empirical data

  • 443 individuals

  • In silico two-person mixtures for all possible pairwise comparisons (97,903)

• 1 - Evaluate each locus (n=20 per panel) for ability to observe four alleles in each mixture

• 2 - Evaluate each panel (n=3) for ability to fully resolve (observe four alleles at every locus) each mixture

  • CODIS core loci

  • DECoDE loci (n=20)

  • BEST loci (mixture of CODIS and DECoDE loci; n=17 DECoDE and n=3 CODIS)

40

Phase 3: In silico Mixtures

• For each locus:

  • 443 individuals → 97,903 two-person mixtures!

  • Total comparisons per panel (20 loci × 97,903 mixtures): 1,958,060

• 3 panels × 2 types of alleles (length-based and sequence-based) × 1,958,060 comparisons per panel = 11,748,360 total in silico mixture comparisons
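The counts on this slide follow from the number of unordered pairs that can be drawn from 443 individuals. The short check below reproduces them with Python's standard library; the variable names are illustrative only.

```python
from math import comb

individuals = 443
loci_per_panel = 20
panels = 3          # CODIS, DECoDE, BEST
allele_types = 2    # length-based and sequence-based

# Number of distinct two-person mixtures: C(443, 2) = 443 * 442 / 2
mixtures = comb(individuals, 2)

# Locus-level comparisons for one panel and one allele type
per_panel = mixtures * loci_per_panel

# Locus-level comparisons across all panels and both allele types
total = per_panel * panels * allele_types

print(f"{mixtures:,} mixtures; {per_panel:,} per panel; {total:,} in total")
```

Because the mixtures are unordered pairs, pairing individual A with B is counted once, not twice.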

41

Phase 3: In silico Mixtures: Locus Performance


• For each locus:

  • Consider the following possibilities:

    • Each person can be homozygote or heterozygote at a locus

      • By length, sequence, or both!

    • Two individuals can share one or both alleles (or ideally, NO alleles)
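The locus-level evaluation can be sketched as a loop over every pair of individuals, counting how many distinct alleles each two-person mixture would show. Four distinct alleles appear only when both contributors are heterozygous and share no allele. The function names and the four-person toy population are assumptions for illustration, not the study's pipeline.

```python
from itertools import combinations


def distinct_alleles(genotype_1, genotype_2):
    """Number of different alleles visible when two genotypes are mixed."""
    return len(set(genotype_1) | set(genotype_2))


def four_allele_count(genotypes):
    """Return (number of pairwise mixtures showing four alleles, number of pairs)."""
    pairs = list(combinations(genotypes, 2))
    resolved = sum(1 for g1, g2 in pairs if distinct_alleles(g1, g2) == 4)
    return resolved, len(pairs)


# Hypothetical genotypes for four individuals at one locus.
population = [("14", "15"), ("16", "17"), ("14", "14"), ("15", "16")]

resolved, n_pairs = four_allele_count(population)
print(f"{resolved} of {n_pairs} two-person mixtures show four alleles")
```

Running the same loop once with length-based labels and once with sequence-based labels gives the two figures compared for each locus.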

42


Comparison of the proportion per category of resolved alleles in in silico two-person mixtures, presented as a summary of all loci for each of the three panels (C=CODIS; D=DECoDE; B=BEST; 443 (n) individuals comprising three U.S. populations (n=140 African American; n=155 Caucasian; and n=148 Hispanic); two-person (k) mixtures; 97,903 (x) comparisons per locus (n=20) in each panel; total comparisons (N) = 1,958,060 for each DNA profile type (LB=length-based; and SB=sequence-based)).

43


• Simulations of each panel’s ability to fully resolve two-person DNA mixtures across full 20-locus DNA profiles

  • i.e., observe four alleles per locus for all loci in mixed DNA profile

• 443 individuals comprising three U.S. populations

• 97,903 comparisons per panel per allele type (length-based and sequence-based)

• Total: 587,418 two-person in silico mixtures

44

Phase 3: In silico Mixtures: Panel Performance

For each panel (CODIS, DECoDE, BEST):

How many loci can be fully resolved in a DNA profile?

45


Full Resolution           CODIS (LB)  DECoDE (LB)    BEST (LB)   CODIS (SB)  DECoDE (SB)    BEST (SB)
0 Loci                           155            0            0            3            0            0
1 Locus                         1330            6            7          104            0            0
2 Loci                          5001           27           76          663            0            0
3 Loci                         11715          211          359         2541            0            1
4 Loci                         18200          839         1181         6333            8            4
5 Loci                         20574         2320         2885        12056           14           41
6 Loci                         17881         5242         5971        17179          128          139
7 Loci                         12141         9592        10134        18791          440          496
8 Loci                          6514        14203        14249        16609         1204         1275
9 Loci                          2889        17073        16079        11656         2801         2682
10 Loci                         1063        16806        15882         6966         5379         5230
11 Loci                          317        13764        13121         3169         9287         8794
12 Loci                           98         9305         8937         1253        13194        12117
13 Loci                           18         5005         5235          431        16137        15125
14 Loci                            5         2398         2475          119        16747        16270
15 Loci                            2          823          919           22        14580        14779
16 Loci                            0          226          322            6        10125        10755
17 Loci                            0           53           62            1         5266         6560
18 Loci                            0           10            8            1         2005         2765
19 Loci                            0            0            1            0          539          777
20 Loci (Full Profile)             0            0            0            0           49           93
Total                         97,903       97,903       97,903       97,903       97,903       97,903

LB = length-based; SB = sequence-based

N = 443 individuals representing three U.S. populations (n=140 African American; n=155 Caucasian; and n=148 Hispanic samples)
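The minimum, mode, and maximum number of fully resolved loci can be read directly off a column of this table. The sketch below does so for two columns, CODIS (LB) and DECoDE (SB), copied from the table; the `summarize` helper is an illustrative assumption.

```python
# Counts of mixtures by number of fully resolved loci (index 0 to 20), from the table.
codis_lb = [155, 1330, 5001, 11715, 18200, 20574, 17881, 12141, 6514, 2889,
            1063, 317, 98, 18, 5, 2, 0, 0, 0, 0, 0]
decode_sb = [0, 0, 0, 0, 8, 14, 128, 440, 1204, 2801,
             5379, 9287, 13194, 16137, 16747, 14580, 10125, 5266, 2005, 539, 49]


def summarize(counts):
    """Return total mixtures plus the min, mode, and max number of resolved loci."""
    observed = [loci for loci, n in enumerate(counts) if n > 0]
    mode = max(range(len(counts)), key=counts.__getitem__)
    return {"total": sum(counts), "min": observed[0], "mode": mode, "max": observed[-1]}


codis_summary = summarize(codis_lb)
decode_summary = summarize(decode_sb)
print("CODIS (LB): ", codis_summary)
print("DECoDE (SB):", decode_summary)
```

Both columns sum to 97,903, the number of pairwise mixtures, which is a quick consistency check on the transcribed values.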

46


Highlighted in bold is the mode. Mathematically, the mode is the value observed most often in the data. Here, the mode is the number of fully resolved loci (loci at which four alleles were resolved) observed in the greatest number of mixtures.

CODIS Panel Performance - Length

97,903 Mixtures

Minimum: 0 (n=155)

Mode: 5 (n=20,574)

Maximum: 15 (n=2)

47

CODIS Panel Performance - Sequence

97,903 Mixtures

Minimum: 0 (n=3)

Mode: 7 (n=18,791)

Maximum: 18 (n=1)

48

DECoDE Panel Performance - Length

97,903 Mixtures

Minimum: 1 (n=6)

Mode: 9 (n=17,073)

Maximum: 18 (n=10)

49

DECoDE Panel Performance - Sequence

97,903 Mixtures

Minimum: 4 (n=8)

Mode: 14 (n=16,747)

Maximum: 20 (n=49)

50


Current CODIS Minimum: 0 (n=155)

DECoDE Minimum: 4 (n=8)

↑4

51

Current CODIS Maximum: 15 (n=2)

DECoDE Maximum: ALL 20 (n=49)

↑5

52

Current CODIS Mode: 5 Fully Resolved Loci (n=20,574)

DECoDE Mode: 14 Fully Resolved Loci (n=16,747)

↑9

53


54

                                  Current Method (PCR-CE of CODIS loci)     Proposed Method (DECoDE and MPS)
Uses DNA Sequence Information?    NO                                        YES
Heterozygosity >80%?              SOME                                      ALL
Allele spread < 10?               SOME                                      SOME
Four-allele Loci                  25.9 % (510,285 of 1,966,920 mixtures)    67.1 % (1,320,031 of 1,966,920; ↑ 259 %)
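The percentages in the last row can be checked from the raw counts. In the sketch below (variable names are illustrative), the ratio of the two counts comes out at about 2.59, so the 259 % figure reads as the proposed count expressed relative to the current count, which corresponds to an increase of roughly 159 %.

```python
total_comparisons = 1_966_920
current_four_allele = 510_285     # PCR-CE of CODIS loci
proposed_four_allele = 1_320_031  # DECoDE with MPS

current_pct = round(100 * current_four_allele / total_comparisons, 1)
proposed_pct = round(100 * proposed_four_allele / total_comparisons, 1)

# Proposed count relative to the current count
ratio = round(proposed_four_allele / current_four_allele, 2)

print(f"current: {current_pct} %, proposed: {proposed_pct} %, ratio: {ratio}x")
```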

55

Phase 3: Summary and Significance

• Confirmation of the dataset and Concordance Testing

  • Reproducing this study with additional datasets and/or using other platforms is desirable

  • The findings presented herein underwent robust scrutiny, but more data are always better!

• The Stutter Effect

  • Strand slippage during amplification/replication

  • Mixture studies with stutter and minor contributors

• Chemistry and Instrumentation

  • Read length a limitation throughout

  • Exploration of additional markers may be possible

• Loss of Information

  • A portion of candidate markers yielded no data or were not compatible

  • Characterization of additional candidate loci

56

Future Directions

• Center for Human Identification

  • Dr. Bruce Budowle

• Illumina and Verogen

  • Drs. Marty Flores and Bob Kolouch

• Exact Diagnostics

  • Jerry Boonyaratanakornkit and Wahaj Zuberi

• NIJ Grant 2015-DN-BX-K067 “Enhancing Mixture Interpretation with Highly Informative STRs”

• Forensic Sciences Foundation 2015-16 Lucas Grant

• UNTHSC Scholarship Funding

57

Acknowledgements
