Improved Models and Algorithms for Universal DNA Tag Systems
Tejas Iyer, Georgia Tech
David Cash, Georgia Tech
Outline of Part 1: Exposition
Motivation: The bio problem and applications
Formalization: The math problem
Analysis: Bounding the best possible solution
Part 2 (Tejas) is original contribution
Motivation (1): DNA computing
• Methods that exploit the massively parallel and self‐assembling nature of DNA to solve hard computational problems
• Step 1 of DNA computing: encode the problem
• A trivial (ignored) step in most models of computation, e.g. Turing machines, circuit families, random access machines
• But the thermodynamics of DNA gets in the way. Hybridization? Secondary structures? More...
ACTGTTTCATTAAGCGCGTT
⠇
GGTAATTAAC
Motivation (2): SNP microarrays
• Single Nucleotide Polymorphism (SNP) Genotyping
• Detecting variation at a single locus (base) within a population
• Several important applications in medicine: helps explain how single bases affect our reaction to diseases and drugs
• Testing several SNPs is expensive or impossible if done individually
• One solution: SNP microarrays
• Main technical component is mass produced to reduce cost.
• Allows one to run hundreds of thousands of SNPs simultaneously
Motivation (2): SNP microarrays
Tags: TGGATTAAC, GTAATCCAA, GGGTTACAC, TATGACCAG
Anti‐tags: ACCTAATTG, CATTAGGTT, CCCAATGTG, ATACTGGTC
Microarray: one spot per anti‐tag
SNPs to genotype: ?TGAA (four copies)
[Animated figure: copies of ACTT joined to the tags, the SNP strands ?TGAA, and the single bases G, T, A, C; the tagged strands end up on the microarray. Final step: "Observe".]
Motivation (2): SNP microarrays
• Same tags and anti‐tags mass produced and used for SNPs as needed, analogous to general computer hardware.
• Our project focus: choosing tags so that they always “find” the correct anti‐tag
[Animated figure: the strand ACTTTTATGACCAG (ending in the tag TATGACCAG) approaching the microarray spots ACCTAATTG, CATTAGGTT, CCCAATGTG, ATACTGGTC.]
Choosing tags/anti‐tags
• Want to avoid mishybridizations and have as many tags as possible.
• But as we add tags, some will eventually be “too similar” and start hybridizing.
• One approach: choose tags to have high Hamming distance
• i.e. few matches when aligned
• Use techniques from error correcting codes
• Limited success...
• Other ad hoc approaches suggested
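As a small illustration of the Hamming distance approach, the sketch below counts the aligned positions at which two equal-length tags differ. The function name `hamming_distance` is ours, and the two tags are borrowed from the microarray example purely for illustration.

```python
def hamming_distance(u: str, v: str) -> int:
    """Number of aligned positions at which two equal-length strings differ."""
    if len(u) != len(v):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(u, v))


# Two 9-base tags from the microarray example.
print(hamming_distance("TGGATTAAC", "GTAATCCAA"))  # 6 mismatches, 3 matches
```

A high distance means few matches when the tags are aligned position by position; it says nothing about shifted alignments or thermodynamics, which is one reason this approach had limited success.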
Later approach to tag design
• Coding was developed for communications theory; it ignores the thermodynamic properties of DNA that determine hybridization
• Ben‐Dor et al. and Brenner suggested that we assume:
Mishybridization only occurs when two tags contain long common substrings.
• Still very simple and unrealistic, but allows one to formalize the problem and get provably good results for tag sets.
• But how good is it in practice?
• Not addressed in current work!
DNA Thermodynamics (Review)
• melting temperature T_M(U,V): the temperature at which 50% of U,V are in duplex
• Higher implies stronger bond
• Calculating melting temperature:
1. 2‐4 Rule: T_M(U,V) proportional to 2(# A‐T bonds) + 4(# G‐C bonds)
2. Nearest neighbor: look up interactions between adjacent bases in experimental table.
3. Wetmur’s equation: applies to longer strings only.
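The 2‐4 Rule is simple enough to sketch directly. The snippet below assumes a strand bound to its perfect complement, so every A or T contributes one A‐T bond and every G or C one G‐C bond. The name `two_four_score` is ours, and the result is only the quantity the melting temperature is proportional to, not a temperature.

```python
def two_four_score(strand: str) -> int:
    """2-4 rule score for a strand paired with its perfect complement."""
    at = sum(base in "AT" for base in strand)
    gc = sum(base in "GC" for base in strand)
    if at + gc != len(strand):
        raise ValueError("expected only the bases A, C, G, T")
    return 2 * at + 4 * gc


# The tag TGGATTAAC has 6 A/T bases and 3 G/C bases: 2*6 + 4*3 = 24.
print(two_four_score("TGGATTAAC"))
```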
A model for tag design
• Formalized by Ben‐Dor et al.
• Define the weight of a string s as w(s) = (#A/T) + 2(#G/C)
• The application fixes temperatures h, c. An (h,c)‐code satisfies two conditions:
1. Each tag t must have w(t) ≥ h.
2. Any string s such that w(s) ≥ c appears in at most one tag.
• (1) ensures that each tag hybridizes with its anti‐tag strongly.
• (2) is meant to ensure that tags do not bond with the wrong anti‐tag, but it is more subtle.
The Ben‐Dor et al. Model (cont)
2. Any string s of weight ≥ c appears in at most one tag.
Not allowed: ACGCTGTA and TCTGTAATGAC (both contain CTGTA)
• Reflects original assumption that mishybridization occurs only if two tags share a long substring.
• Also incorporates 2‐4 Rule: more G/C bases imply stronger bond
• Allows them to prove an upper bound on the number of tags in an allowed system.
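The two conditions of an (h,c)-code can be checked by brute force for small tag sets. The sketch below is one possible reading of the definition, not the authors' code: the names `weight`, `heavy_substrings` and `is_hc_code` are ours, condition 2 is interpreted as "no substring of weight at least c is shared by two different tags", and the values of h and c are chosen only to make the slide's example pair fail or pass.

```python
def weight(s: str) -> int:
    """w(s) = (#A/T) + 2 * (#G/C)."""
    return sum(2 if base in "GC" else 1 for base in s)


def heavy_substrings(tag: str, c: int) -> set:
    """All substrings of tag whose weight is at least c."""
    return {
        tag[i:j]
        for i in range(len(tag))
        for j in range(i + 1, len(tag) + 1)
        if weight(tag[i:j]) >= c
    }


def is_hc_code(tags, h: int, c: int) -> bool:
    """Check conditions 1 and 2 of an (h,c)-code by exhaustive search."""
    if any(weight(t) < h for t in tags):
        return False  # condition 1 fails: some tag is too light
    seen = set()
    for t in tags:
        heavy = heavy_substrings(t, c)
        if heavy & seen:
            return False  # condition 2 fails: a heavy substring is shared
        seen |= heavy
    return True


pair = ["ACGCTGTA", "TCTGTAATGAC"]    # both contain CTGTA, of weight 7
print(is_hc_code(pair, h=10, c=7))    # False: CTGTA is shared and w = 7 >= c
print(is_hc_code(pair, h=10, c=8))    # True: no shared substring reaches weight 8
```

The search is quadratic in tag length per tag, which is fine for illustration but not meant for large tag sets.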
Upper Bound of Ben‐Dor et al.
Let $G_n$ be the number of strings of weight $n$ (proportional to $(1+\sqrt{3})^n$ by a standard recurrence relation).
Theorem: For any $c$ and $h$, an $(h,c)$‐code may contain at most
$$\frac{2 \cdot G_{c-1} + 6 \cdot G_{c-2} + 8 \cdot G_{c-3}}{h - c + 1}$$
tags.
Remark: Still exponential in $c$, so it allows for quite large codes.
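To make the bound concrete, the sketch below computes the number of strings of weight n and evaluates the theorem's expression. It assumes the recurrence G_n = 2 G_{n-1} + 2 G_{n-2} with G_0 = 1 and G_1 = 2, which follows from the weight definition (two bases of weight 1, two of weight 2) but is not written out on the slide. The function names are ours, and the bound is rounded down because a code holds a whole number of tags.

```python
def count_strings(n: int) -> int:
    """G_n: number of DNA strings of weight exactly n (assumed recurrence)."""
    g = [1, 2]  # G_0: the empty string; G_1: "A" and "T"
    for k in range(2, n + 1):
        # Last base is A/T (weight 1) or G/C (weight 2), two choices each.
        g.append(2 * g[k - 1] + 2 * g[k - 2])
    return g[n]


def tag_upper_bound(h: int, c: int) -> int:
    """Upper bound on the size of an (h,c)-code; needs c >= 3 and h >= c."""
    numerator = (
        2 * count_strings(c - 1)
        + 6 * count_strings(c - 2)
        + 8 * count_strings(c - 3)
    )
    return numerator // (h - c + 1)


print([count_strings(n) for n in range(6)])   # [1, 2, 6, 16, 44, 120]
print(tag_upper_bound(h=10, c=4))             # (2*16 + 6*6 + 8*2) // 7 = 12
print(count_strings(20) / count_strings(19))  # close to 1 + sqrt(3)
```

The ratio of consecutive values approaches 1 + sqrt(3), about 2.732, which is the growth rate quoted on the slide.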
Upper Bound: Proof
Definitions:
Let a c‐token be a string of weight ≥ c that contains no proper suffix of weight ≥ c.
The tail weight of a c‐token is the weight of its last character.
The tail weight of a tag is the sum of tail weights of all of the c‐tokens it contains.
Condition 2 in terms of c‐tokens: any c‐token s (of weight ≥ c) appears in at most one tag.
Strategy:
1. Show that each tag has tail weight ≥ h - c + 1
2. Show that an (h,c)‐code can have total tail weight at most $2 \cdot G_{c-1} + 6 \cdot G_{c-2} + 8 \cdot G_{c-3}$
Upper Bound Proof, Part 1
Claim 1: Each tag has tail weight ≥ h - c + 1
Example: c = 4
Tag: G A C C A A T
c‐tokens      Tail Wt
G A C         2
C C           2
C C A         1
C A A         1
C A A T       1
Observation: every character's weight gets counted, except for a prefix of weight at most c - 1
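The example can be reproduced mechanically. The sketch below takes a c-token to be a string of weight at least c whose proper suffixes all weigh less than c, so at each end position the c-token, if there is one, is the shortest suffix that reaches weight c. The function names are ours.

```python
def weight(s: str) -> int:
    """w(s) = (#A/T) + 2 * (#G/C)."""
    return sum(2 if base in "GC" else 1 for base in s)


def c_tokens(tag: str, c: int) -> list:
    """The c-tokens of tag, one per end position where one exists."""
    tokens = []
    for end in range(1, len(tag) + 1):
        # Grow the suffix leftwards until it first reaches weight c.
        for start in range(end - 1, -1, -1):
            if weight(tag[start:end]) >= c:
                tokens.append(tag[start:end])
                break
    return tokens


def tail_weight(tag: str, c: int) -> int:
    """Sum of the weights of the last characters of all c-tokens in tag."""
    return sum(weight(token[-1]) for token in c_tokens(tag, c))


tag = "GACCAAT"
print(c_tokens(tag, 4))       # ['GAC', 'CC', 'CCA', 'CAA', 'CAAT']
print(tail_weight(tag, 4))    # 2 + 2 + 1 + 1 + 1 = 7
print(weight(tag) - 4 + 1)    # w(tag) - c + 1 = 10 - 4 + 1 = 7
```

For this tag the tail weight equals w(tag) - c + 1 exactly: only the leading G A, of weight 3 = c - 1, goes uncounted.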
Upper Bound Proof, Part 2
Claim 2: Total tail weight ≤ $2 \cdot G_{c-1} + 6 \cdot G_{c-2} + 8 \cdot G_{c-3}$
Note: There are at most $G_c$ c‐tokens by definition, so $2 \cdot G_c$ is trivial.
For this bound, divide them into classes:
Class       Occurrences                                      Total Tail Wt.
<c-2>S      $2 \cdot G_{c-2}$                                $4 \cdot G_{c-2}$
S<c-3>S     $4 \cdot G_{c-3}$                                $8 \cdot G_{c-3}$
<c-1>W      $2 \cdot G_{c-1}$                                $2 \cdot G_{c-1}$
S<c-2>W     $4 \cdot G_{c-2}$ (actually $2 \cdot G_{c-2}$)   $2 \cdot G_{c-2}$
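The class counts in the table can be sanity-checked by enumeration for a small c. The sketch below is our own check, not part of the original proof: it reads S as a base of weight 2 (G or C), W as a base of weight 1 (A or T), and <k> as any string of weight k, and it takes a c-token to be a string of weight at least c whose proper suffixes all weigh less than c. The final line applies the slide's "actually" correction by halving the S<c-2>W count.

```python
from collections import Counter

WEIGHT = {"A": 1, "T": 1, "G": 2, "C": 2}


def weight(s: str) -> int:
    return sum(WEIGHT[base] for base in s)


def strings_of_weight(n: int):
    """Yield every DNA string whose weight is exactly n."""
    if n == 0:
        yield ""
        return
    for base, w in WEIGHT.items():
        if w <= n:
            for rest in strings_of_weight(n - w):
                yield base + rest


def class_counts(c: int) -> Counter:
    """Count c-tokens per class; a c-token has weight c or c + 1."""
    counts = Counter()
    for total in (c, c + 1):
        for s in strings_of_weight(total):
            if weight(s[1:]) >= c:
                continue  # longest proper suffix is already heavy: not a c-token
            strong_tail = s[-1] in "GC"
            if total == c:
                counts["<c-2>S" if strong_tail else "<c-1>W"] += 1
            else:
                counts["S<c-3>S" if strong_tail else "S<c-2>W"] += 1
    return counts


counts = class_counts(4)  # here G_1 = 2, G_2 = 6, G_3 = 16
print(dict(counts))

# Tail weight is 2 for an S tail and 1 for a W tail.
tail_total = (
    2 * counts["<c-2>S"]
    + 2 * counts["S<c-3>S"]
    + counts["<c-1>W"]
    + counts["S<c-2>W"] // 2
)
print(tail_total)  # 84 = 2*16 + 6*6 + 8*2
```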
Part 2