Highly Heterozygous STR Markers for Enhanced DNA Mixture
Deconvolution
Nicole M.M. Novroski, PhD
Assistant Professor, Forensic Science Program
Department of Anthropology
University of Toronto Mississauga
Email: [email protected]
Web: www.nicolenovroski.com
Thursday, September 26th, 2019
1
• Also known as microsatellites
• Highly polymorphic genetic markers
• Repeat units composed of 2-7 nucleotides
• Vary in length → 2 or more repeats
• Heterozygosity makes STRs useful for forensic DNA typing
Short Tandem Repeats (STRs)
2
• Simple: [TCTA]n → TCTATCTATCTATCTA…
• Compound: [TCTA]a [TCTG]b [TCTA]c → TCTATCTATCTGTCTGTCTATCTATCTA…
• Complex: [TCTA]a [TCTG]b TC [TCTA]c TCA [TCTA]d → TCTATCTGTCTCTATCTATCATCTA…
Diversity of STRs is also affected by the flanking region in the amplicon!
Short Tandem Repeats (STRs)
3
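The bracket notation for simple, compound and complex repeats can be expanded mechanically. A minimal Python sketch, assuming each bracketed motif is followed by a single lowercase letter that stands for its repeat count; the function name `expand_motif` and the counts passed in are illustrative and not from the talk:

```python
import re

def expand_motif(pattern: str, counts: dict) -> str:
    """Expand notation such as '[TCTA]a [TCTG]b TC [TCTA]c' into a DNA string.

    `counts` maps each repeat-count letter (a, b, c, ...) to an integer.
    Tokens without brackets are treated as non-repeating interruptions.
    """
    parts = []
    for token in pattern.split():
        match = re.fullmatch(r"\[([ACGT]+)\]([a-z])", token)
        if match:
            motif, letter = match.groups()
            parts.append(motif * counts[letter])
        else:
            parts.append(token)  # e.g. the 'TC' or 'TCA' interruptions
    return "".join(parts)

simple = expand_motif("[TCTA]n", {"n": 4})
compound = expand_motif("[TCTA]a [TCTG]b [TCTA]c", {"a": 2, "b": 2, "c": 3})
complex_ = expand_motif(
    "[TCTA]a [TCTG]b TC [TCTA]c TCA [TCTA]d", {"a": 1, "b": 1, "c": 2, "d": 1}
)
print(simple, compound, complex_, sep="\n")
```

With these counts the three strings match the example sequences shown for each repeat class.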
Overview of Forensic DNA Typing
DNA Extraction
DNA Quantitation
PCR Amplification of Multiple STR loci
Biology
Technology
Separation and Detection of PCR Products (STR Alleles)
DNA Sample Genotype Determination
Application
Compare sample genotype to reference/crime scene
samples
Compare DNA profiles to population databases
Generate Reports using Statistics (RMP, CPI, LR)
4
Capillary Electrophoresis (CE)
Typical DNA Profile
http://www.promega.com/products/genetic-identity/str-amplification/powerplex-fusion-str-kits/
• A homozygote genotype is made of two alleles that are the same!
• A heterozygote genotype is made of two alleles that are different!
Each DNA STR locus (location) has 2 alleles (one from each parent) → where the combination of alleles is a genotype!
5
DNA Profile Comparison – an example!
6
Image taken from: http://dnaproject.co.za/dna-database
DNA Mixtures
DNA evidence
7
Mixture? Mixture?
Mixture?
Mixture?
Summary of Casework Mixtures
http://www.cstl.nist.gov/biotech/strbase/training/AAFS2008_1_CaseworkSurvey.pdf
8
DNA Mixture Interpretation
=
How many people are in the DNA mixture?
Whose DNA is whose?
Which alleles belong to which contributor?
How much of each person’s DNA is in the sample?
9
Example of a 2-person resolvable mixture
+
10
Example of an unresolvable mixture
?
11
Current State of Forensic Genetics
• Crime labs use PCR-CE to amplify and separate STR alleles by size from biological samples of victims, suspects, and crime scene evidence
• Comparisons of DNA profiles (genotyped alleles at each locus) look for similarities and differences between allele lengths at each STR locus
• Why is this problematic?
  • Individuals can have the same genotype at any given locus
  • The length of the STR does not use all of the genetic information
  • Mixtures are TRICKY!
12
• A high-throughput approach to DNA Sequencing
• Also known as Next Generation Sequencing (NGS)
• Improved alternative to PCR-CE methods
  • Large amount of DNA data
  • Minimal input DNA
• General process involves sequencing a large number of fragments in parallel
  • Spatially separated, clonally amplified DNA templates
• Targeted amplification of STRs can be transferred to MPS
  • ForenSeq™
  • Custom Amplicon Designs
Massively Parallel Sequencing
13
PCR-CE versus MPS
• PCR-CE relies solely on length to determine STR genotype
• MPS is capable of revealing both size and exact sequence of STR alleles
  • Provides opportunity to identify sequence variation
Image taken from Novroski et al. (2016) Supplemental Table 2
14
Allele 1, Allele 2, Allele 3
Using MPS for a given STR X can reveal:
• 14 allele has (at least) 3 motifs
• (TCTA)2 TCTG (TCTA)11
• TCTA TCTG (TCTA)12
• (TCTA)14
• 15 allele has (at least) 5 motifs
• (TCTA)2 TCTG (TCTA)12
• TCTA (TCTG)2 (TCTA)12
• TCTA TCTG TGTA (TCTA)12
• (TCTA)15
• TCTA TCTG (TCTA)13
Using CE for the same STR:
• 14 allele is simply a 14
• 15 allele is simply a 15
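The collapse of several sequence-based alleles into one length-based allele can be checked directly. A minimal Python sketch using the eight motif strings listed for STR X, assuming every token is a tetranucleotide unit (a bare motif counts once, "(MOTIF)k" counts k times); the helper names are illustrative:

```python
import re
from collections import defaultdict

# Sequence-based alleles for STR X, as listed on the slide.
ALLELES = [
    "(TCTA)2 TCTG (TCTA)11",
    "TCTA TCTG (TCTA)12",
    "(TCTA)14",
    "(TCTA)2 TCTG (TCTA)12",
    "TCTA (TCTG)2 (TCTA)12",
    "TCTA TCTG TGTA (TCTA)12",
    "(TCTA)15",
    "TCTA TCTG (TCTA)13",
]

def repeat_count(allele: str) -> int:
    """Length-based allele call: total number of repeat units."""
    total = 0
    for token in allele.split():
        match = re.fullmatch(r"\(([ACGT]+)\)(\d+)", token)
        total += int(match.group(2)) if match else 1
    return total

def group_by_length(alleles):
    """Map each length-based allele to its sequence-based variants."""
    groups = defaultdict(list)
    for allele in alleles:
        groups[repeat_count(allele)].append(allele)
    return dict(groups)

by_length = group_by_length(ALLELES)
for length, variants in sorted(by_length.items()):
    print(length, len(variants))
```

CE would report only the dictionary keys (14 and 15); MPS distinguishes the variants stored under each key.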
Sequence Variation
15
Sequence-based allele labels (figure): 14, 14a, 14b; 15, 15a, 15b, 15c, 15d
How does sequence variation change interpretation?
• Melissa and Marie are both a [14,15] at STR X
If at least one motif differs in the [14,15] genotype between Melissa and Marie, their (CE) length-based genotypes would NO LONGER be the same using MPS!
(TCTA)14 and (TCTA)15 → Genotype: [14,15]
(TCTA)14 and TCTA TCTG TGTA (TCTA)12 → Genotype: [14,15a]
16
• A mixture of Melissa and Marie’s alleles at STR X would now provide more information:
Length-based Profile (PCR-CE): 14 15
Potential options for this mixture include:
• Melissa and Marie are both heterozygotes: [14,15]
• Melissa and Marie are both homozygotes: [14,14] and [15,15]
• One is a heterozygote and one is a homozygote: [14,15] and [15,15]
Sequence-based Profile (MPS): 14 15
Potential options for this mixture include:
• Melissa and Marie are both heterozygotes: [14,15]
• One is a heterozygote and one is a homozygote: [14,14] and [15,15a]
The analyst can eliminate a possibility!
How does sequence variation change interpretation?
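The elimination of a possibility can be illustrated by enumerating genotype pairs. A minimal Python sketch, assuming exactly two contributors, no allele drop-out or stutter, and that the sequence-based profile shows 14, 15 and 15a (the union of the two genotypes given for STR X); function names are illustrative:

```python
from itertools import combinations_with_replacement

def genotype_pairs(observed):
    """All unordered pairs of genotypes whose combined alleles equal `observed`."""
    alleles = sorted(observed)
    genotypes = list(combinations_with_replacement(alleles, 2))
    return [
        (g1, g2)
        for g1, g2 in combinations_with_replacement(genotypes, 2)
        if set(g1) | set(g2) == set(observed)
    ]

def both_homozygous_possible(observed):
    """True if some explanation has two homozygous contributors."""
    return any(
        g1[0] == g1[1] and g2[0] == g2[1] for g1, g2 in genotype_pairs(observed)
    )

ce_profile = {"14", "15"}          # length-based alleles
mps_profile = {"14", "15", "15a"}  # sequence-based alleles (assumed)

print(genotype_pairs(ce_profile))
print(both_homozygous_possible(ce_profile))
print(both_homozygous_possible(mps_profile))
```

Under these assumptions the two-homozygote explanation is available for the length-based profile but not for the sequence-based one, because two homozygotes can contribute at most two distinct alleles.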
17
18
MPS is a promising technology for forensic genetics and using MPS for STR typing has revealed more
variation for some STR loci!
Extensive Characterization of STR Loci
• Analysis of STR sequence variation for 59 STR loci
  • 27 autosomal, 7 X chromosome and 25 Y chromosome STRs
  • Includes ALL CODIS core loci (n=20; CODIS is the national DNA database)
• 777 individuals in four major population groups
  • African American, Hispanic, Caucasian, Asian
• Library preparation and sequencing performed using the MiSeq FGx Forensic Genomics System (Illumina)
• Data analyzed using the ForenSeq software and additional bioinformatic tools, including STRait Razor and in-house Excel workbooks
19
• Four general categories of sequence variation:
  • 1 – Increase in alleles due to repeat region variation
    • e.g., D2S1338, D12S391
  • 2 – Increase in alleles due to flanking region variation
    • e.g., D7S820, D13S317
  • 3 – Increase in alleles due to both repeat region and flanking region variation
    • e.g., D18S51, DXS10135
  • 4 – No increase in alleles beyond length-based methods
    • e.g., TPOX, Y-GATA-H4
20
Extensive Characterization Results
Image taken from Novroski et al. (2016)
21
Extensive Characterization Results
• The diversity of STR allele variation continues to increase with expanded population studies
  • An increase of 644 alleles using a sequence-based approach
  • Over 400 novel variants discovered in 777 population samples across 4 populations
• Population genetic analyses revealed increased heterozygosity and discrimination power for some loci!
  • Increased heterozygosity will improve discrimination power
• A few loci did not show an increase in diversity or heterozygosity with a sequence-based approach
22
Extensive Characterization Results
• Complex DNA mixtures are commonplace in forensic biological evidence
  • Despite best efforts for technological and statistical improvements, many mixtures remain unresolvable
• The diversity of STR allele variation can increase with a sequence-based approach
  • HOWEVER… a few loci had no observable increase in diversity or heterozygosity
    • No gain in information
    • No improved DNA mixture deconvolution capabilities
• Core loci were not selected based on sequence variation
23
Problem(s)
24
The human genome contains highly polymorphic STRs that may be better suited than currently applied markers (based on sequence variation and allele length spread) to facilitate deconvolution of component contributors in a mixed DNA sample
• Phase 1 – Establish candidate STRs with increased diversity from freely available datasets
  • Focus on STR loci with high heterozygosity, reduced length-based allele spread, and tetranucleotide or larger repeats
• Phase 2 – Preliminary analysis of STR candidates
  • Evaluate population genetic parameters (heterozygosity, allele spread)
  • Evaluate marker performance
• Phase 3 – Develop a comprehensive DNA mixture deconvolution panel
  • Characterize STRs with U.S. population samples
  • Evaluate mixture deconvolution capabilities of each locus using DNA mixtures
25
Approach
Phase 1: STR Candidate Selection
• 1000 Genomes Project (raw sequences, unsorted)
• STR Catalog Viewer• Summary of human STR variation compiled using lobSTR software
26
• Tetranucleotides and larger (increase PCR efficiency, reduce artifacts)
• Small length-based allele spread (minimizes preferential amplification; diversity of alleles is maintained when coupled with >80% heterozygosity)
• >80% heterozygosity (↑ variability of markers for easier differentiation between individuals)
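The three selection criteria amount to a simple filter over candidate loci. A minimal Python sketch, using the thresholds stated in the talk (repeat unit of four or more nucleotides, heterozygosity above 80%, allele spread of 10 or less); the `Candidate` class and the four example loci are hypothetical and only illustrate the logic:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    repeat_unit_len: int   # nucleotides per repeat unit
    heterozygosity: float  # fraction between 0 and 1
    allele_spread: int     # largest minus smallest length-based allele

def passes_criteria(c, min_unit=4, min_het=0.80, max_spread=10):
    """Apply the three candidate-selection criteria."""
    return (
        c.repeat_unit_len >= min_unit
        and c.heterozygosity > min_het
        and c.allele_spread <= max_spread
    )

# Hypothetical loci, not real data.
candidates = [
    Candidate("locusA", 4, 0.91, 8),   # passes all three criteria
    Candidate("locusB", 2, 0.93, 7),   # dinucleotide repeat
    Candidate("locusC", 5, 0.76, 9),   # heterozygosity too low
    Candidate("locusD", 4, 0.88, 14),  # allele spread too wide
]

selected = [c.name for c in candidates if passes_criteria(c)]
print(selected)
```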
27
Phase 1: STR Candidate Selection
28
Phase 1: STR Candidate Selection
29
Phase 1: STR Candidate Selection
“77” allele corresponds to a 12 length-based repeat plus flanking information
http://strcat.teamerlich.org/chart/chr1/187550371/187550451
30
Phase 1: STR Candidate Selection
Initial Candidate STR Search
• Manual STR mining of lobSTR dataset (n = 1102 candidates)
• Perl-based STR mining of lobSTR dataset (n = 2784 candidates)
• Overlap of STR candidates in both manual and Perl-based mining (n = 793 candidates)
• DesignStudio testing using repeat region as targets (n = 337 candidates, off-center)
• DesignStudio testing using repeat region AND flanking region as targets (n = 201 candidates)
• Inclusion of compatible Phillips et al. STRs (n = 47)
• Preliminary panel using TruSeq Custom Amplicons (n = 248 candidates)
• Perl-based STR mining of 1000 Genomes BAM files (n = 544 candidates)
31
Phase 1: STR Candidate Selection
Phase 1 with >3000 Candidates → 793 Candidates → Design and Redesign → 248 Candidates
Illumina TruSeq Custom Amplicon Preliminary Panel (248 Candidates)
Sequencing Performance → STRs Classified on Read Depth (Depth of Coverage) and Chemistry Compatibility
From the 248 Candidates → 55 Failures; 58 Poor Performers; 72 Fair Performers; 63 Good Performers
Considering (some) Poor, (all) Fair, (all) Good Performing Candidates → Heterozygosity and Allele Spread Assessed and Ranked
53 STR Candidates met the criteria of high heterozygosity (>80%), reduced length-based allele spread (≤10), and a repeat size of four or more nucleotides
Using top-ranked candidates for heterozygosity and chemistry compatibility → 73 Candidates Selected for Refined (DECoDE) Panel
Phase 3: DECoDE Panel Assessment and in silico mixtures
32
Phase 2: Preliminary Panel Summary
• The STR DNA EnhanCed DEconvolution (DECoDE) panel
  • 73 highly heterozygous loci using MPS chemistry
  • 15 loci previously described by Phillips (2016) and others
• 451 unrelated individuals from three U.S. populations
  • Caucasian (CAU; n=155)
  • Hispanic (HIS; n=148)
  • African American (AFA; n=148)
• Each STR locus was characterized and reviewed manually for diversity using in-house Excel workbooks
  • Alleles characterized by length and sequence
  • Population genetics analyses (heterozygosity; Hardy-Weinberg equilibrium (HWE); linkage disequilibrium (LD); random match probabilities (RMP))
33
Phase 3: The STR DECoDE Panel
Where SNP = single nucleotide polymorphism; SB = sequence-based; LB = length-based; Ho = observed heterozygosity; He = expected heterozygosity. Allele spread refers to the length-based difference between the smallest and the largest observed length-based alleles.
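For readers less familiar with these statistics, a minimal Python sketch of how Ho, He and allele spread can be computed from a list of genotypes. It assumes the common definition He = 1 − Σ p², with p the allele frequencies (the study may have used a sample-size-corrected estimator), and the four genotypes are hypothetical:

```python
from collections import Counter

def expected_heterozygosity(genotypes):
    """He = 1 - sum of squared allele frequencies (no small-sample correction)."""
    alleles = [allele for genotype in genotypes for allele in genotype]
    n = len(alleles)
    return 1.0 - sum((count / n) ** 2 for count in Counter(alleles).values())

def observed_heterozygosity(genotypes):
    """Ho = fraction of individuals carrying two different alleles."""
    return sum(1 for a, b in genotypes if a != b) / len(genotypes)

def allele_spread(genotypes):
    """Difference between the largest and smallest length-based allele."""
    lengths = [allele for genotype in genotypes for allele in genotype]
    return max(lengths) - min(lengths)

# Hypothetical length-based genotypes for four individuals at one locus.
genotypes = [(14, 15), (14, 14), (15, 16), (13, 15)]

he = expected_heterozygosity(genotypes)
ho = observed_heterozygosity(genotypes)
spread = allele_spread(genotypes)
print(he, ho, spread)
```

The same functions work for sequence-based He and Ho if the alleles are sequence strings; allele spread is defined on length-based alleles only.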
34
Phase 3: The STR DECoDE Panel
Chromosome | Locus Name | GRCh38 Reference Allele | Repeat Length | Locus Type | Motif | 5' SNPs | 3' SNPs | SB He | SB Ho | Allele Spread
3 | D3S2406 | 33 | 4 | COMPLEX | [TATC]a [TGTC]b [CGTC]c [CATC]d | rs2035580; rs533349040; rs551975676; rs566835150; rs146488521; rs555445819; rs567665857; rs535707905; rs71625920 | rs573290642; rs543848107; rs190000499; rs143356890 | 0.9855 | 0.9889 | 17
3 | D3A57 | 21 | 4 | COMPLEX | [TCTT]a [TCGT]b T [TCTT]c [TCCT]d [TCTT]e [TCCT]f | rs2694124 | rs1451872314; rs1229265964 | 0.9717 | 0.9823 | 15
8 | D8A26 | 19 | 4 | COMPLEX | [TTCC]a N16 [CTTT]b C [CTTT]c | rs116668567 | rs545041953 | 0.9700 | 0.9491 | 20
8 | D8S1132 | 20 | 4 | COMPLEX | [TCTA]a TCA [TCTA]b | rs139381851; rs142372169 | rs568433577 | 0.9588 | 0.9292 | 12
15 | D15S822 | 17 | 4 | COMPOUND | [TATC]a [TCTA]b | none observed | rs117117801; rs8041848 | 0.9571 | 0.9513 | 15
3 | D3N61 | 17 | 4 | COMPLEX | [TTTC]a TTC [TTTC]b TAT [TA]c [TTTC]d | rs17026573; rs9841195 | rs4855796 | 0.9566 | 0.9178 | 15
8 | D8A29 | 11 | 5 | SIMPLE | [AAAGG]a | none observed | rs11166830 | 0.9545 | 0.9222 | 16
11 | D11N29 | 14 | 5 | SIMPLE | [GAGAA]a | rs538065917; rs6590431; rs200846422 | none observed | 0.9508 | 0.9111 | 16
• Summary:
  • 71 STRs had heterozygosities above 80%
  • 17 STRs had allele spreads of 10 length-based alleles or less
    • 43 had spreads between 11 and 15, and 13 had spreads of 16 or greater
  • All STRs met the criterion of tetranucleotide motifs or larger
    • 10 were simple; 6 were compound; and 57 were complex
  • HWE, LD and RMP seemed to meet expectations
  • Some loci revealed operationally problematic characteristics
    • Dinucleotide patterns
    • Two distinct repeat regions
    • Homopolymer stretches
35
Phase 3: The STR DECoDE Panel
• A subset of 20 DECoDE loci was selected for comparison to the CODIS core loci
  • The current requirement is 20 CODIS core loci for upload into the national DNA database
  • Loci were selected for high heterozygosity (>90%)
  • Operationally problematic loci (even if heterozygosity >90%) were excluded
• 443 U.S. population samples
  • African American (AFA; n=140, 8 incomplete profile samples removed)
  • Caucasian (CAU; n=155)
  • Hispanic (HIS; n=148)
36
Phase 3: In silico Mixtures
CODIS Panel Loci DECoDE Panel Loci BEST Panel Loci
D2S1338 D3S2406 D3S2406
D12S391 D2S1360 D2S1360
D1S1656 D7S3048 D7S3048
D21S11 D8S1132 D8S1132
D8S1179 D11S2368 D11S2368
vWA D15S822 D15S822
D3S1358 D2N2 D2N2
D18S51 D1N10 D1N10
FGA D12N15 D12N15
D19S433 D1N16 D1N19
D13S317 D1N19 D1N21
D5S818 D1N21 D8N23
D16S539 D8N23 D15N26
D22S1045 D15N26 D14N56
D7S820 D14N56 D3N61
D2S441 D3N61 D12S391
CSF1PO D4N70 D4N70
D10S1248 D11N52 D2S1338
TPOX D17N32 D1S1656
TH01 D2N43 D11N52
Orange cells reflect the CODIS core loci.
37
38
Phase 3: Panel Comparisons
Summary of length-based and sequence-based heterozygosities (He) in descending rank order for three U.S. populations (AFA=African American; CAU=Caucasian; HIS=Hispanic) for the CODIS core loci (n=20) and a subset (n=20; Supplemental Table 6) of the DECoDE panel loci.
Locus (in descending rank order of (average) heterozygosity)
39
Phase 3: Panel Comparisons
• Simulations using empirical data
  • 443 individuals
  • In silico two-person mixtures for all possible pairwise comparisons (97,903)
• 1 – Evaluate each locus (n=20 per panel) for ability to observe four alleles in each mixture
• 2 – Evaluate each panel (n=3) for ability to fully resolve (observe four alleles at every locus) each mixture
  • CODIS core loci
  • DECoDE loci (n=20)
  • BEST loci (mixture of CODIS and DECoDE loci; n=17 DECoDE and n=3 CODIS)
40
Phase 3: In silico Mixtures
• For each locus:
  • 443 individuals → 97,903 two-person mixtures!
  • Total comparisons per panel (20 loci × 97,903 mixtures): 1,958,060
  • 3 panels × 2 types of alleles (length-based and sequence-based) = 11,748,360 total in silico mixtures
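The counts above follow from simple combinatorics: every unordered pair of the 443 individuals forms one two-person mixture. A short Python check of the arithmetic (variable names are illustrative):

```python
from math import comb

individuals = 443
loci_per_panel = 20
panels = 3        # CODIS, DECoDE, BEST
allele_types = 2  # length-based and sequence-based

mixtures = comb(individuals, 2)                          # unordered pairs
per_panel = mixtures * loci_per_panel                    # locus-level comparisons per panel
all_locus_comparisons = per_panel * panels * allele_types
panel_level_mixtures = mixtures * panels * allele_types  # whole-profile comparisons

print(mixtures, per_panel, all_locus_comparisons, panel_level_mixtures)
```

The last value corresponds to the 587,418 whole-profile mixtures reported for the panel-level analysis.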
41
Phase 3: In silico Mixtures: Locus Performance
Mixture
• For each locus:
• Consider the following possibilities:
• Each person can be homozygote or heterozygote at a locus
• By length, sequence, or both!
• Two individuals can share one or both alleles (or ideally, NO alleles)
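A locus is counted as resolved when the two-person mixture shows four distinct alleles. A minimal Python sketch of that check, assuming each allele is stored as a (length, sequence) pair; the genotypes reuse motif strings from the STR X example and are otherwise hypothetical:

```python
def distinct_alleles(person1, person2, by="sequence"):
    """Count distinct alleles in a two-person mixture at one locus.

    Each person is a tuple of two (length, sequence) alleles.
    `by` selects length-based ("length") or sequence-based ("sequence") typing.
    """
    index = 0 if by == "length" else 1
    return len({allele[index] for allele in person1 + person2})

def is_resolved(person1, person2, by="sequence"):
    """Four distinct alleles means no allele is shared or homozygous."""
    return distinct_alleles(person1, person2, by) == 4

# Hypothetical genotypes at one locus.
person_a = ((14, "(TCTA)14"), (15, "(TCTA)15"))
person_b = ((14, "(TCTA)14"), (15, "TCTA TCTG TGTA (TCTA)12"))
person_c = ((14, "TCTA TCTG (TCTA)12"), (15, "(TCTA)2 TCTG (TCTA)12"))

print(distinct_alleles(person_a, person_b, "length"))    # both look like [14,15]
print(distinct_alleles(person_a, person_b, "sequence"))  # one shared allele remains
print(is_resolved(person_a, person_c, "length"))
print(is_resolved(person_a, person_c, "sequence"))
```

The pair `person_a` and `person_c` shows the case the panel is designed for: identical by length, fully separable by sequence.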
42
Phase 3: In silico Mixtures: Locus Performance
Mixture
Comparison of the proportion per category of resolved alleles in in silico two-person mixtures, presented as a summary of all loci for each of the three panels (C=CODIS; D=DECoDE; B=BEST; 443 (n) individuals comprising three U.S. populations (n=140 African American; n=155 Caucasian; and n=148 Hispanic); two-person (k) mixtures; 97,903 (x) comparisons per locus (n=20) in each panel; total comparisons (N) = 1,958,060 for each DNA profile type (LB=length-based; and SB=sequence-based)).
43
Phase 3: In silico Mixtures: Locus Performance
• Simulations of each panel’s ability to fully resolve two-person DNA mixtures across full 20-locus DNA profiles
  • i.e., observe four alleles per locus for all loci in mixed DNA profile
• 443 individuals comprising three U.S. populations
• 97,903 comparisons per panel per allele type (length-based and sequence-based)
• Total: 587,418 two-person in silico mixtures
44
Phase 3: In silico Mixtures: Panel Performance
For each panel (CODIS, DECoDE, BEST):
How many loci can be fully resolved in a DNA profile?
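The panel-level question can be sketched as a tally over all pairs of profiles. A minimal Python example, assuming each profile is a dictionary mapping locus name to a two-allele genotype; the three toy profiles and locus names are hypothetical, and this is not the software used in the study:

```python
from collections import Counter
from itertools import combinations

def fully_resolved_loci(profile1, profile2):
    """Number of loci at which the two-person mixture shows four distinct alleles."""
    return sum(
        1
        for locus in profile1
        if len(set(profile1[locus]) | set(profile2[locus])) == 4
    )

def resolution_histogram(profiles):
    """Tally, over all pairwise mixtures, how many loci are fully resolved."""
    return Counter(fully_resolved_loci(a, b) for a, b in combinations(profiles, 2))

# Hypothetical two-locus profiles for three individuals.
profiles = [
    {"L1": ("10", "11"), "L2": ("7", "8")},
    {"L1": ("12", "13"), "L2": ("8", "9")},
    {"L1": ("14", "15"), "L2": ("5", "6")},
]

histogram = resolution_histogram(profiles)
print(sorted(histogram.items()))
```

Applied to 443 individuals and 20 loci, a tally of this kind has 21 bins (0 to 20 resolved loci) and sums to 97,903 mixtures.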
45
Phase 3: In silico Mixtures: Panel Performance
Full Resolution  CODIS (LB)  DECoDE (LB)  BEST (LB)  CODIS (SB)  DECoDE (SB)  BEST (SB)
0 Loci 155 0 0 3 0 0
1 Locus 1330 6 7 104 0 0
2 Loci 5001 27 76 663 0 0
3 Loci 11715 211 359 2541 0 1
4 Loci 18200 839 1181 6333 8 4
5 Loci 20574 2320 2885 12056 14 41
6 Loci 17881 5242 5971 17179 128 139
7 Loci 12141 9592 10134 18791 440 496
8 Loci 6514 14203 14249 16609 1204 1275
9 Loci 2889 17073 16079 11656 2801 2682
10 Loci 1063 16806 15882 6966 5379 5230
11 Loci 317 13764 13121 3169 9287 8794
12 Loci 98 9305 8937 1253 13194 12117
13 Loci 18 5005 5235 431 16137 15125
14 Loci 5 2398 2475 119 16747 16270
15 Loci 2 823 919 22 14580 14779
16 Loci 0 226 322 6 10125 10755
17 Loci 0 53 62 1 5266 6560
18 Loci 0 10 8 1 2005 2765
19 Loci 0 0 1 0 539 777
20 Loci (Full Profile) 0 0 0 0 49 93
Total 97,903 97,903 97,903 97,903 97,903 97,903
LB = length-based; SB = sequence-based
N = 443 individuals representing three U.S. populations (n=140 African American; n=155 Caucasian; and n=148 Hispanic samples)
46
Phase 3: In silico Mixtures: Panel Performance
Highlighted in BOLD is the mode. Mathematically, the mode is the value observed most often in the data. Here, the mode is the number of fully resolved loci (four alleles observed at each of those loci) that occurs in the greatest number of mixtures.
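The minimum, mode and maximum can be read off a column of the panel performance table programmatically. A short Python sketch using two columns copied from that table (index = number of fully resolved loci, value = number of mixtures); the function name is illustrative:

```python
# Mixture counts for 0..20 fully resolved loci, copied from the table.
COUNTS = {
    "CODIS (LB)": [155, 1330, 5001, 11715, 18200, 20574, 17881, 12141, 6514,
                   2889, 1063, 317, 98, 18, 5, 2, 0, 0, 0, 0, 0],
    "DECoDE (SB)": [0, 0, 0, 0, 8, 14, 128, 440, 1204, 2801, 5379, 9287,
                    13194, 16137, 16747, 14580, 10125, 5266, 2005, 539, 49],
}

def min_mode_max(counts):
    """Smallest, most frequent, and largest number of fully resolved loci observed."""
    observed = [loci for loci, n in enumerate(counts) if n > 0]
    mode = max(range(len(counts)), key=lambda loci: counts[loci])
    return observed[0], mode, observed[-1]

for panel, counts in COUNTS.items():
    print(panel, sum(counts), min_mode_max(counts))
```

Each column sums to 97,903, the number of two-person mixtures.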
Minimum: 0
Mode: 5
Maximum: 15
47
CODIS Panel Performance - Length
97,903 Mixtures
155
20,574
2
Minimum: 0
Mode: 7
Maximum: 18
48
CODIS Panel Performance - Sequence
97,903 Mixtures
3
18,791
1
Minimum: 1
Mode: 9
Maximum: 18
49
DECoDE Panel Performance - Length
97,903 Mixtures
6
17,073
10
Minimum: 4
Mode: 14
Maximum: 20
50
DECoDE Panel Performance - Sequence
97,903 Mixtures
8
16,747
49
Phase 3: Panel Comparisons
Current CODIS Minimum: 0 (n=155)
DECoDE Minimum: 4 (n=8)
51
↑4
Current CODIS Maximum: 15 (n=2)
DECoDE Maximum: ALL 20 (n=49)
52
↑5
Phase 3: In silico Mixtures: Panel Performance
Current CODIS Mode: 5 Fully Resolved Loci (n=20,574)
DECoDE Mode: 14 Fully Resolved Loci (n= 16,747)
53
↑9
Phase 3: In silico Mixtures: Panel Performance
54
Phase 3: In silico Mixtures: Panel Performance
↑9
Criterion | Current Method (PCR-CE of CODIS loci) | Proposed Method (DECoDE and MPS)
Uses DNA Sequence Information? | NO | YES
Heterozygosity >80%? | SOME | ALL
Allele spread < 10? | SOME | SOME
Four-allele Loci | 25.9 % (510,285 of 1,966,920 mixtures) | 67.1 % (1,320,031 of 1,966,920; ↑ 259 %)
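The percentages in the four-allele row can be reproduced from the counts given. A short Python check; reading "↑ 259 %" as the ratio of the two counts (about 2.59 times as many four-allele loci) is an interpretation, not something stated on the slide:

```python
total = 1_966_920        # locus-level comparisons reported on the slide
codis_four = 510_285     # four-allele loci, PCR-CE of CODIS loci
decode_four = 1_320_031  # four-allele loci, DECoDE with MPS

codis_pct = 100 * codis_four / total
decode_pct = 100 * decode_four / total
ratio = decode_four / codis_four

print(round(codis_pct, 1), round(decode_pct, 1), round(ratio, 2))
```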
55
Phase 3: Summary and Significance
• Confirmation of the dataset and concordance testing
  • Reproducing this study with additional datasets and/or using other platforms is desirable
  • The findings presented herein underwent robust scrutiny, but more data are always better!
• The stutter effect
  • Strand slippage during amplification/replication
  • Mixture studies with stutter and minor contributors
• Chemistry and instrumentation
  • Read length was a limitation throughout
  • Exploration of additional markers may be possible
• Loss of information
  • A portion of candidate markers yielded no data or were not compatible
  • Characterization of additional candidate loci
56
Future Directions
• Center for Human Identification
  • Dr. Bruce Budowle
• Illumina and Verogen
  • Drs. Marty Flores and Bob Kolouch
• Exact Diagnostics
  • Jerry Boonyaratanakornkit and Wahaj Zuberi
• NIJ Grant 2015-DN-BX-K067 “Enhancing Mixture Interpretation with Highly Informative STRs”
• Forensic Sciences Foundation 2015-16 Lucas Grant
• UNTHSC Scholarship Funding
57
Acknowledgements