Post on 06-Aug-2015
Calculating Guilt: Using open-source software in forensic DNA testing
Sarah Chenoweth
sarah@dreamwidth.org
@sarahquaint
Disclaimer
• All opinions are my own.
• Dammit, Jim, I’m a chemist, not a programmer.
• …or a statistician.
• slideshare.net/dreamwidth
Gameplan
• Forensic DNA 101
• What sort of profiles do I obtain?
• Statistics: giving weight to those profiles
• Open-or-not software for calculating these statistics
Rosalind was robbed.
• 23 pairs of chromosomes
• >3 billion base pairs
• ~2% is coding DNA (genes)
• ~20-40% is regulatory
• ~50% is highly repetitive
AATGAATGAATGAATGAATGAATGAATG <— 7 repeats
AATGAATGAATGAATGAATGAATGAATGAATGAATGAAT <— 9.3 repeats
STR = Short Tandem Repeat
On chromosome 11, there is an area called TH01, where the STR “AATG” repeats over and over again.
On the chromosome from my mother, it repeats 7 times, and on the one from my father, it repeats 9.3 times.
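The repeat counts above can be reproduced with a few lines of Python. This is only a sketch of the naming convention (whole repeats, then a decimal for leftover bases), following the simplified picture on the slide; the function name `count_str_repeats` is made up for illustration, and real casework sizes fragments by length rather than reading the sequence.

```python
def count_str_repeats(sequence, motif="AATG"):
    """Return an allele name such as '7' or '9.3' for a run of tandem repeats.

    Illustrative only: assumes the sequence starts at the first repeat and
    that any partial repeat sits at the end, as drawn on the slide.
    """
    full_repeats = 0
    pos = 0
    # Count whole copies of the motif from the start of the sequence.
    while sequence.startswith(motif, pos):
        full_repeats += 1
        pos += len(motif)
    leftover = sequence[pos:]
    # Anything left must be an incomplete copy of the motif (e.g. 'AAT').
    if leftover and not motif.startswith(leftover):
        raise ValueError(f"unexpected bases after repeats: {leftover!r}")
    if leftover:
        return f"{full_repeats}.{len(leftover)}"
    return str(full_repeats)


maternal = "AATG" * 7
paternal = "AATG" * 9 + "AAT"
print(count_str_repeats(maternal), count_str_repeats(paternal))
```

The ".3" in 9.3 is not a fraction: it means nine whole repeats plus three extra bases.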
Source: National Human Genome Research Institute
You are not a special snowflake.
• Most of your DNA, including your genes, is “highly conserved”
• All humans are 99.9% identical
• Of course, 0.1% of 3 billion = 3 million base pairs of variation
It’s like an EAN on the back of a book…
• A forensic DNA profile records the lengths of 23 STRs, each 100-500 base pairs long
• <3% of 1% of 1% of your genome
• Unique “barcode”, except for identical siblings.
Included or excluded?
• Single-source profiles are simple. But we mostly see mixtures.
• DNA is the gold standard and carries a lot of weight.
• Must characterize all inclusions with a statistic.
• Make the qualitative statement (excluded, or matches), characterize it with a quantitative statistic, and let the trier of fact evaluate.
• 4 alleles at Penta E: 5,7,9,13
• Say this is an assault. We can assume that the victim is present, and we know the victim is 7,9.
• So: what are the odds that a random person in the population is a 5,13?
Likelihood Ratio (LHR)
• How frequently do we see the 5 allele? About 4%
• How frequently do we see the 13 allele? About 5%
• At this one locus: 360 times more likely it’s Sarah & Robert than Sarah & someone picked at random from the population.
• Calculate this at all 22 loci, and multiply together: 1.6 x 10^23 (160,000,000,000,000,000,000,000)
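A minimal sketch of the arithmetic behind these bullets, assuming a two-person mixture with no drop-out and the usual 2pq frequency for a heterozygous genotype. The function names are hypothetical. With the rounded frequencies quoted above (4% and 5%) the single-locus ratio comes out at 250; the 360 on the slide presumably comes from the unrounded database frequencies, which is an assumption on my part.

```python
from math import prod


def single_locus_lr(freq_a, freq_b):
    """LR when the unexplained alleles must come from one heterozygous person.

    Numerator: the named suspect has exactly those two alleles (probability 1).
    Denominator: a random person is heterozygous for them (2 * p * q).
    """
    return 1.0 / (2.0 * freq_a * freq_b)


def combined_lr(per_locus_lrs):
    """Multiply per-locus LRs, treating the loci as independent."""
    return prod(per_locus_lrs)


# Penta E: victim is 7,9, so the other contributor must be 5,13.
penta_e = single_locus_lr(0.04, 0.05)  # rounded frequencies from the slide
print(f"Penta E LR with rounded frequencies: {penta_e:.0f}")
```

Multiplying 22 ratios of this size is how the total climbs into the 10^23 range.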
“A reasonable degree of scientific certainty.”
• DNA is a living, biological substance = messy
• Our testing procedure is super-sensitive. <10 cells
• The law wants a clear line between guilty and not guilty; science is full of, “Well, maybe; it depends.”
• Our classic statistical tools can’t handle these incomplete mixtures.
…now what?
• Only use the loci where the suspect is present? That’s horribly biased.
• Throw up our hands and refuse to draw conclusions on partial data? Also biased!
• The least awful solution is to only use the loci that we know have complete info: the ones with two minor alleles.
source: my sister, who is the biological mother of this pouty kid.
Understating is just as bad as overstating.
• Well, almost. The justice system is designed to err on the side of caution, and benefit the defendant.
• Take a conservative approach.
• But not using all the data isn’t always conservative: what if that was exculpatory information?
Probabilistic genotyping
Semi-continuous
• Considers the probability of drop out when calculating the LHR.
• Open source. Fast.
• Still doesn’t use all the data (peak height ratios, stutter).
Scenario:
The victim is: 20,20
The suspect is: 19,22
What is the probability the suspect is a contributor, but the 19 dropped out?
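The scenario above can be sketched as a toy semi-continuous calculation. Everything here is an assumption for illustration: the stain is taken to show alleles 20 and 22, the allele frequencies are invented, the victim's alleles never drop out, there is no drop-in, and a homozygote drops out with probability d squared. This is not Lab Retriever's actual model, just the general shape of the idea.

```python
from itertools import combinations_with_replacement


def genotype_prob(genotype, freqs):
    """Hardy-Weinberg genotype frequency: p^2 or 2pq."""
    a, b = genotype
    return freqs[a] ** 2 if a == b else 2 * freqs[a] * freqs[b]


def prob_evidence(observed, known, genotype, d):
    """P(observed alleles | known contributor + this genotype, drop-out d)."""
    # No drop-in: every observed allele needs a donor.
    if not observed <= (known | set(genotype)):
        return 0.0
    p = 1.0
    # Alleles shared with the known contributor are masked, so skip them.
    for allele in set(genotype) - known:
        p_drop = d ** genotype.count(allele)
        p *= (1 - p_drop) if allele in observed else p_drop
    return p


def semi_continuous_lr(observed, known, suspect, freqs, d):
    """LR: suspect is the second contributor vs. a random person is."""
    numerator = prob_evidence(observed, known, suspect, d)
    denominator = sum(
        genotype_prob(g, freqs) * prob_evidence(observed, known, g, d)
        for g in combinations_with_replacement(list(freqs), 2)
    )
    return numerator / denominator


# Hypothetical frequencies; "other" lumps every allele not named here.
FREQS = {19: 0.08, 20: 0.15, 22: 0.18, "other": 0.59}
lr = semi_continuous_lr({20, 22}, {20}, (19, 22), FREQS, d=0.3)
print(f"LR at this locus with 30% drop-out: {lr:.2f}")
```

With d = 0 the ratio collapses to zero, because the missing 19 would exclude the suspect outright; allowing for drop-out is what lets a partial locus still contribute to the statistic.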
Lab Retriever
• if we had a complete mixture = 1.6 x 10^23 = 160,000,000,000,000,000,000,000
• partial mixture, so we only use 4 loci for LHR = 1.4 x 10^6 = 1,400,000
• same partial mixture, semi-continuous LHR = 7.3 x 10^20 = 730,000,000,000,000,000,000
Probabilistic genotyping
Continuous
• Markov-chain Monte Carlo (MCMC) simulations.
• Uses all of the data, with fewer assumptions.
• Doesn’t just give you the best estimate: gives you a range.
Probable genotype of the minor contributor:
AC: 40%
BC: 25%
CC: 20%
CQ: 15%
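To give a feel for how MCMC turns into a table of genotype weights like the one above, here is a toy Metropolis sampler. It is a sketch only: the unnormalised scores in `WEIGHTS` are hypothetical stand-ins for whatever a real peak-height model would compute, chosen so the result lands on the percentages from the slide. It is not how STRmix or any other product is implemented.

```python
import random
from collections import Counter

# Hypothetical unnormalised scores for each candidate minor genotype.
WEIGHTS = {"AC": 8.0, "BC": 5.0, "CC": 4.0, "CQ": 3.0}


def metropolis_genotypes(weights, steps, seed=0):
    """Estimate genotype probabilities by a random walk over genotypes."""
    rng = random.Random(seed)
    states = list(weights)
    current = states[0]
    visits = Counter()
    for _ in range(steps):
        proposal = rng.choice(states)  # symmetric proposal
        # Accept with probability min(1, score ratio).
        if rng.random() < weights[proposal] / weights[current]:
            current = proposal
        visits[current] += 1
    # Fraction of time spent on each genotype approximates its probability.
    return {g: visits[g] / steps for g in states}


posterior = metropolis_genotypes(WEIGHTS, steps=50_000)
for genotype, share in posterior.items():
    print(f"{genotype}: {share:.0%}")
```

The sampler only ever compares two scores at a time, so it never needs the full normalised distribution; that is what makes the approach workable when the space of possible genotype combinations is huge.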
STRMix
• Developed by the ESR (Environmental Science and Research, NZ) and FSSA (Forensic Science South Australia)
• Increasingly becoming the standard
• 20K USD initially, 5K/yr support contract
The justice system does not embrace open source.
• The data is reliable: but is my interpretation?
• But I don’t tell “the whole truth, and nothing but the truth.” I can only answer the questions I’m asked.
• Prosecutor misstatement: “That means there’s a one in a quadrillion chance it’s someone else!”
• Defense misstatement: “She didn’t test the DNA of a quadrillion people, so there’s no way that’s true!”
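One way to see why the prosecutor's version overreaches: a likelihood ratio only says how much the evidence shifts the odds, and it has to be combined with prior odds before it says anything about the chance that it is someone else. The sketch below uses the standard odds form of Bayes' rule; the prior of one in a million is a made-up number for illustration, and setting the prior is the trier of fact's job, not the analyst's.

```python
def posterior_probability(likelihood_ratio, prior_odds):
    """Odds form of Bayes' rule: posterior odds = LR x prior odds."""
    posterior_odds = likelihood_ratio * prior_odds
    return posterior_odds / (1.0 + posterior_odds)


lr = 1e15            # "a quadrillion", as in the misstatement above
prior_odds = 1e-6    # hypothetical: one in a million before the DNA evidence
p = posterior_probability(lr, prior_odds)
print(f"Chance it is someone else: about 1 in {1 / (1 - p):,.0f}")
```

The defense's version fails for a different reason: the ratio is calculated from allele frequencies, not from counting matching people, so nobody needs to test a quadrillion profiles.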
Currently, in forensic DNA:
• Binary statistics: yes
• Semi-continuous: yes
• Continuous: no
• Frequency databases: yes
• Data analysis: no
• CODIS: hell to the no
source: Wikimedia Commons
There is too much. Let me sum up:
• Transparency is the key to credibility.
• I need to document all my observations, results, and calculations so they are reproducible.
• Open software is necessary for independent verification.