Highly Scalable Algorithms for Robust String Barcoding
Bhaskar DasGupta*
Kishori M. Konwar
Ion Mandoiu
Alex Shvartsman
Computer Science & Engineering Department
University of Connecticut
Storrs, CT
*Department of Computer Science, University of Illinois at Chicago
Chicago, IL
Motivation
• There are many critical situations in which one needs to rapidly identify an unknown genomic sequence from among a given set of known sequences
  - Rapid identification of pathogens in epidemic outbreaks
  - Monitoring of microbial communities, e.g., in environmental studies
  - Fast database search
Possible Approaches
• Sequencing based: sequence the unknown DNA, then use similarity search programs such as BLAST to identify the unknown sequence among the pathogen sequences in the database
  - Sequencing is prohibitively expensive and time consuming
• Hybridization based: identify the unknown sequence by testing for the presence of certain subsequences
  - Subsequence tests can be performed quickly and at low cost using a variety of hybridization-based methods
Sequence Fingerprints
• For each sequence, find a subsequence that appears in that sequence and only in that sequence
GTTGC GTTC CAT
CAGTTGC 1 0 0
CAGTTC 0 1 0
CATGGA 0 0 1
• Sequence barcodes: 0/1 vectors
• When using fingerprints, barcode length = #sequences
String Barcoding
• (Borneman et al.’01, Rash & Gusfield’02): Unique occurrence of tested subsequences not needed, as long as 0/1 barcodes are unique
TG CAGT
CAGTTGC 1 1
CAGTTC 0 1
CATGGA 1 0
• When using non-unique subsequences, barcode length can be much smaller than #sequences
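The barcode idea can be checked mechanically. The following is a minimal Python sketch, not the authors' code; the function names `barcode` and `barcodes_are_unique` are illustrative, and plain substring tests stand in for hybridization experiments.

```python
def barcode(sequence, distinguishers):
    """Return the 0/1 barcode of `sequence`: one bit per distinguisher."""
    return tuple(int(d in sequence) for d in distinguishers)


def barcodes_are_unique(sequences, distinguishers):
    """True if every sequence receives a different barcode."""
    codes = [barcode(s, distinguishers) for s in sequences]
    return len(set(codes)) == len(codes)


sequences = ["CAGTTGC", "CAGTTC", "CATGGA"]

# Two non-unique distinguishers are enough for these three sequences.
short_codes = [barcode(s, ["TG", "CAGT"]) for s in sequences]

# Fingerprints also work, but need one distinguisher per sequence.
fingerprint_codes = [barcode(s, ["GTTGC", "GTTC", "CAT"]) for s in sequences]
```

With the slide's example, `short_codes` reproduces the two-column table above, while the fingerprint barcodes form an identity matrix of width three.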
Overview
• Problem Formulation and Previous Work
• Greedy Setcover Algorithm
• Experimental Results
• Conclusions
Problem Definition
Given:
Genomic sequences $g_1, \ldots, g_n$
Find:
Minimum number of distinguisher strings $t_1, \ldots, t_k$
Such that:
For every pair $g_i \neq g_j$ there exists a distinguisher $t_l$ which is a substring of $g_i$ or $g_j$ but not of both
- At least $\log_2 n$ distinguishers are needed
- $n$ distinguishers are always sufficient
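Both bounds follow from a short counting argument. The sketch below assumes the input sequences are pairwise distinct, which the slide does not state explicitly.

```latex
% Lower bound: k distinguishers give each sequence a barcode in {0,1}^k,
% and the n barcodes must be pairwise different.
2^k \ge n \quad\Longrightarrow\quad k \ge \log_2 n .

% Upper bound: take t_i = g_i for i = 1, ..., n.
% For g_i \ne g_j the two strings cannot both be substrings of each other,
% so at least one of t_i, t_j occurs in exactly one sequence of the pair.
g_i \ne g_j \;\Longrightarrow\; g_i \not\sqsubseteq g_j \ \text{ or } \ g_j \not\sqsubseteq g_i ,
\qquad \text{where } x \sqsubseteq y \text{ means ``$x$ is a substring of $y$''.}
```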
Computational Complexity
• [Berman et al.’04] Cannot be approximated within a factor of $(1-\epsilon)\ln n$, for any $\epsilon > 0$, unless $NP = DTIME(n^{\log\log n})$
Rash & Gusfield Integer Program
• A non-redundant set of candidate distinguishers is generated using a suffix tree
• One variable $v_i$ for each candidate distinguisher $x_i$
  - $v_i = 1$ if $x_i$ is selected
  - $v_i = 0$ if $x_i$ is not selected
Integer Program Example
Minimize
  V_TG + V_ATGGA + V_CAGT + V_TTC + V_GTGC    #objective function
Such that
  V_TG + V_TTC + V_GTGC >= 1                  #constraint to cover pair 1,2
  V_ATGGA + V_CAGT + V_GTGC >= 1              #constraint to cover pair 1,3
  V_TG + V_ATGGA + V_CAGT + V_TTC >= 1        #constraint to cover pair 2,3
Binaries                                      #all variables are 0/1
  V_TG V_ATGGA V_CAGT V_TTC V_GTGC
End
TG ATGGA
1. CAGTGC 1 0
2. CAGTTC 0 0
3. CCATGGA 1 1
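For an instance this small, the integer program can be solved by exhaustive search. The sketch below is only an illustration of what the program optimizes; it is not the Rash & Gusfield solver, which relies on an integer programming package, and the helper names are hypothetical.

```python
from itertools import combinations

sequences = ["CAGTGC", "CAGTTC", "CCATGGA"]
candidates = ["TG", "ATGGA", "CAGT", "TTC", "GTGC"]


def separates(candidate, a, b):
    """A candidate covers a pair if it occurs in exactly one of the two sequences."""
    return (candidate in a) != (candidate in b)


def min_distinguishers(sequences, candidates):
    """Try subsets of increasing size; return the first one covering every pair."""
    pairs = list(combinations(sequences, 2))
    for k in range(1, len(candidates) + 1):
        for subset in combinations(candidates, k):
            if all(any(separates(c, a, b) for c in subset) for a, b in pairs):
                return list(subset)
    return None  # the candidates cannot distinguish all pairs


best = min_distinguishers(sequences, candidates)
```

On this instance the search returns `TG` and `ATGGA`, the two columns of the table above. The running time is exponential in the number of candidates, so this is only usable for toy inputs.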
Limitations of Integer Program Method
• Works only for moderately sized datasets
  - 50-150 sequences
  - Average sequence length ~1000 nucleotides
  - Up to 4 hours needed to come within 20% of optimum
Information Content Heuristic
• [Berman et al. 2004]
  - Keep track of the partition defined by the distinguishers selected so far
  - In every step, choose the candidate that reduces partition entropy by the largest amount
• Theorem: the Information Content Heuristic always finds a number of distinguishers within a factor of 1+ln(n) of the optimum
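The heuristic can be sketched in a few lines of Python. The slide does not define partition entropy; the sketch assumes it is $\log_2 \prod_i |S_i|!$ over the classes $S_i$ of the current partition, which should be checked against Berman et al. before relying on it. The function names are illustrative.

```python
import math


def partition_entropy(classes):
    """Assumed definition: log2 of the product of |S_i|! over all classes."""
    return math.log2(math.prod(math.factorial(len(s)) for s in classes))


def refine(classes, matches):
    """Split every class into the sequences that match and those that do not."""
    refined = []
    for s in classes:
        refined.extend(part for part in (s & matches, s - matches) if part)
    return refined


def information_content_heuristic(sequences, candidates):
    match_sets = {c: frozenset(i for i, s in enumerate(sequences) if c in s)
                  for c in candidates}
    classes = [frozenset(range(len(sequences)))]
    chosen = []
    while any(len(s) > 1 for s in classes):
        # Lowest entropy after refinement means largest entropy reduction.
        best = min(candidates,
                   key=lambda c: partition_entropy(refine(classes, match_sets[c])))
        new_classes = refine(classes, match_sets[best])
        if len(new_classes) == len(classes):
            break  # no candidate refines the partition any further
        classes = new_classes
        chosen.append(best)
    return chosen


ich_result = information_content_heuristic(
    ["CAGTGC", "CAGTTC", "CCATGGA"], ["TG", "ATGGA", "CAGT", "TTC", "GTGC"])
```

The entropy is zero exactly when every class is a singleton, that is, when all barcodes are unique.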
Limitations of ICH
• Real genomic sequences contain degenerate nucleotides (e.g., N for any of {A,T,C,G}) due to sequencing errors and known single nucleotide polymorphisms
• Distinguisher-to-sequence matches:
  - Perfect matches
  - Perfect mismatches
  - Uncertain matches
• Information Content cannot be defined in the presence of uncertain matches
ATCNAT
ATC 1
CCC 0
CCA ?
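The three match types in the table can be computed with a simple window scan. This is a hedged sketch: the `IUPAC` table is deliberately limited to the four bases plus `N`, the candidate is assumed to contain no degenerate symbols, and `classify` is an illustrative name.

```python
# Bases each symbol may stand for; extend with other IUPAC codes as needed.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "N": "ACGT"}


def classify(candidate, sequence):
    """Return 1 (perfect match), 0 (perfect mismatch) or '?' (uncertain match)."""
    uncertain = False
    for start in range(len(sequence) - len(candidate) + 1):
        window = sequence[start:start + len(candidate)]
        if window == candidate:
            return 1  # an exact occurrence settles the question
        if all(c in IUPAC[w] for c, w in zip(candidate, window)):
            uncertain = True  # would match for some resolution of the N's
    return "?" if uncertain else 0


results = {c: classify(c, "ATCNAT") for c in ["ATC", "CCC", "CCA"]}
```

For `ATCNAT` this gives a perfect match for `ATC`, a perfect mismatch for `CCC`, and an uncertain match for `CCA` (the window `CNA` matches if `N` is `C`).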
Other Heuristics
• (Cazalis et al. 2004): greedy setcover, simulated annealing, and genetic algorithms for distinguisher selection
• To achieve practical running time, only a small random subset (2000 candidates) of all candidate distinguishers is considered
  - No data provided on the loss of solution quality due to this restriction
Overview
• Problem Formulation and Previous Work
• Greedy Setcover Algorithm
• Experimental Results
• Conclusions
Setcover Greedy Heuristic
• Phase I: Candidate Generation
  - Generate a representative set of candidate distinguishers from the source sequences
• Phase II: Greedy Distinguisher Selection
  - In every step, choose the candidate that distinguishes the largest number of not yet distinguished pairs
Candidate Generation
• A set of candidate distinguishers guaranteed to contain an optimum solution is generated from the sequences
• We do not generate certain redundant candidates
  - A candidate is redundant if there is another candidate that appears in exactly the same set of sequences
  - For every sequence we generate only one of the substrings that appear exclusively in that sequence
Efficient Candidate Generation
• Our implementation uses simple array data structures
  - We generate candidates in increasing order of length
  - Exact match positions for candidates of length l-1 are used to generate the exact matches for candidates of length l
• Candidates that do not satisfy the given biochemical constraints, such as minimum/maximum length, GC content, and melting temperature, are discarded
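The length-by-length generation can be sketched as follows. This is an illustration under assumptions, not the DNA-BAR implementation: dictionaries replace the array data structures, the parameter names `min_len`, `max_len`, and `gc_range` are hypothetical, and the melting temperature filter is omitted because the slides do not give the formula used.

```python
def gc_content(s):
    """Fraction of G and C characters in s."""
    return (s.count("G") + s.count("C")) / len(s)


def generate_candidates(sequences, min_len=1, max_len=4, gc_range=(0.0, 1.0)):
    # occurrences[c] lists (sequence index, start position) for each string c
    occurrences = {}
    for i, seq in enumerate(sequences):
        for p, ch in enumerate(seq):
            occurrences.setdefault(ch, []).append((i, p))

    seen_match_sets = set()  # keep one candidate per set of matched sequences
    candidates = []
    for length in range(1, max_len + 1):
        for cand, occ in occurrences.items():
            match_set = frozenset(i for i, _ in occ)
            passes = (length >= min_len
                      and gc_range[0] <= gc_content(cand) <= gc_range[1])
            if passes and match_set not in seen_match_sets:
                seen_match_sets.add(match_set)
                candidates.append(cand)
        # Extend every occurrence by one character to get the next length.
        longer = {}
        for cand, occ in occurrences.items():
            for i, p in occ:
                end = p + length
                if end < len(sequences[i]):
                    longer.setdefault(cand + sequences[i][end], []).append((i, p))
        occurrences = longer
    return candidates


example_candidates = generate_candidates(["CAGTTGC", "CAGTTC", "CATGGA"], max_len=5)
```

Because a candidate is kept only when its set of matched sequences has not been seen before, at most one fingerprint per sequence survives, in line with the redundancy rule described earlier.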
Setcover Greedy Heuristic
• Phase I: Candidate Generation
  - Generate a set of candidate distinguishers from the source sequences
• Phase II: Greedy Distinguisher Selection
  - In every step, choose the candidate that distinguishes the largest number of not yet distinguished pairs
Distinguisher Selection as Set Cover
• Set Cover Problem: given a universal set U and a family of subsets, find a minimum number of subsets covering U
• Distinguisher selection is a special case of set cover:
  - Elements to be covered are the pairs of sequences
  - Each candidate distinguisher defines the set of pairs that it separates
• By a classical result, the greedy algorithm has an approximation factor of 1+ln(|U|)
  - With |U| = n(n-1)/2 pairs to cover, setcover greedy has an approximation factor of 2*ln(n) for distinguisher selection with n sequences
Distinguisher Selection
• Start with an empty set D of distinguishers
• While there are pairs of sequences not yet distinguished, do:
  - Compute for each remaining candidate c its coverage gain Δ(c, D), i.e., the number of not yet distinguished pairs of sequences that are distinguished by c
  - Add the candidate with maximum coverage gain to D
• Return D
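The loop above translates almost directly into Python. The sketch below tracks the undistinguished pairs explicitly, which is the simplest (not the fastest) way to compute gains; `greedy_select` and the toy candidate list are illustrative choices.

```python
from itertools import combinations


def greedy_select(sequences, candidates):
    n = len(sequences)
    match = {c: [c in s for s in sequences] for c in candidates}
    uncovered = set(combinations(range(n), 2))  # pairs not yet distinguished
    D = []
    while uncovered:
        def gain(c):
            m = match[c]
            return sum(1 for i, j in uncovered if m[i] != m[j])

        best = max(candidates, key=gain)
        if gain(best) == 0:
            break  # the remaining pairs cannot be distinguished
        D.append(best)
        m = match[best]
        uncovered = {(i, j) for i, j in uncovered if m[i] == m[j]}
    return D


sequences = ["CATCAGA", "TTCAGT", "TAT", "AATAG", "AATCAG"]
selected = greedy_select(sequences, ["TCAG", "AAT", "CAT", "TT", "TAT"])
```

With this candidate list the first two picks are `TCAG` (gain 6) and `AAT` (gain 3); ties are broken by list order, so the third pick is `CAT`.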
Computation of Δ(c, D):
CATCAGA
TTCAGT
TAT
AATAG
AATCAG
D = { }
Computation of Δ(c, D):
CATCAGA
TTCAGT
TAT
AATAG
AATCAG
c=TCAG
D = { }
Computation of Δ(c, D):
CATCAGA
TTCAGT
TAT
AATAG
AATCAG
c=TCAG
D = { }
Δ(c, D) = 3 × (5 − 3) = 6
Computation of Δ(c, D):
CATCAGA
TTCAGT
TAT
AATAG
AATCAG
D = {TCAG}
Computation of Δ(c, D):
CATCAGA
TTCAGT
TAT
AATAG
AATCAG
D = {TCAG}
c=AAT
Computation of Δ(c, D):
CATCAGA
TTCAGT
TAT
AATAG
AATCAG
D = {TCAG}
c=AAT
Δ(c, D) = 1 × (2 − 1) + 1 × (3 − 1) = 3
Computation of Δ(c, D):
CATCAGA
TTCAGT
TAT
AATAG
AATCAG
D = {TCAG,AAT}
Computation of Δ(c, D)
$$\Delta(c, D) = \sum_{i=1}^{k} |S_i \cap M_c| \cdot |S_i \setminus M_c|$$
• $S_1, S_2, \ldots, S_k$ are the subsets in the partition defined by D
• $M_c$ is the set of matches of candidate c
• Using simple data structures, the computation can be done in linear time (in the number of sequences)
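A minimal sketch of the partition-based gain computation follows. The representation is an assumption: `class_of[i]` holds the partition class of sequence `i`, and `class_size[k]` holds the size of class `k`. One pass over the matches of the candidate is then enough.

```python
def coverage_gain(class_of, class_size, matches):
    """Sum over classes of (#matching sequences) * (#non-matching sequences)."""
    hits = {}
    for i in matches:
        hits[class_of[i]] = hits.get(class_of[i], 0) + 1
    # Classes without any match contribute 0 and can be skipped.
    return sum(h * (class_size[k] - h) for k, h in hits.items())


# Sequences 0..4 are CATCAGA, TTCAGT, TAT, AATAG, AATCAG.
# D = {}: a single class; c = TCAG matches sequences 0, 1 and 4.
gain_tcag = coverage_gain([0, 0, 0, 0, 0], {0: 5}, {0, 1, 4})

# D = {TCAG}: classes {0, 1, 4} and {2, 3}; c = AAT matches sequences 3 and 4.
gain_aat = coverage_gain([0, 0, 1, 1, 0], {0: 3, 1: 2}, {3, 4})
```

The two calls reproduce the worked values from the preceding slides, 6 and 3.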
Lazy Update of Gains
• Coverage gains are monotonically non-increasing during the algorithm
• Re-compute coverage gain for a candidate only if last saved gain is higher than the gain of current best candidate
• In practice this speeds up the selection algorithm by a factor of ~2
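One common way to implement lazy updates is a max-heap of saved gains, sketched below. The slides do not say which data structure the authors use, so the heap is an assumption, and `lazy_greedy_select` is an illustrative name. Because gains never increase, a saved gain is an upper bound on the current one, and a candidate whose fresh gain is at least the largest saved gain of the others can be selected immediately.

```python
import heapq
from itertools import combinations


def lazy_greedy_select(sequences, candidates):
    n = len(sequences)
    match = {c: [c in s for s in sequences] for c in candidates}
    uncovered = set(combinations(range(n), 2))

    def gain(c):
        m = match[c]
        return sum(1 for i, j in uncovered if m[i] != m[j])

    # Entries are (-saved gain, insertion order, candidate); heapq is a min-heap.
    heap = [(-gain(c), k, c) for k, c in enumerate(candidates)]
    heapq.heapify(heap)
    D = []
    while uncovered and heap:
        _, k, c = heapq.heappop(heap)
        fresh = gain(c)  # recompute only for the candidate with the top saved gain
        if fresh == 0:
            continue  # gains never increase, so c is useless from now on
        if heap and fresh < -heap[0][0]:
            heapq.heappush(heap, (-fresh, k, c))  # another candidate may be better
            continue
        D.append(c)
        m = match[c]
        uncovered = {(i, j) for i, j in uncovered if m[i] == m[j]}
    return D


sequences = ["CATCAGA", "TTCAGT", "TAT", "AATAG", "AATCAG"]
lazy_selected = lazy_greedy_select(sequences, ["TCAG", "AAT", "CAT", "TT", "TAT"])
```

Each selected candidate still has maximum gain at the time it is chosen, although ties may be broken differently than in the non-lazy loop.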
Algorithm Extensions
• Degenerate bases
  - A pair of sequences is separated by candidate c if c has at least one perfect match with one of the sequences, and c has perfect mismatches at all positions of the other sequence
  - Gain computation is done in $O(n^2)$ time using a simple coverage matrix data structure
• Redundancy r
  - A pair of sequences is counted in the gain function until r distinguishers separate it
• Distinguisher cross-hybridization
  - Minimum edit distance, or maximum common substring weight, bound for every pair of selected distinguishers
  - Candidates incompatible with a selected distinguisher are removed from the candidate list
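The first two extensions can be combined in one quadratic-time gain function. In this sketch, `match[i]` is 1 for a perfect match, 0 for a perfect mismatch, and `"?"` for an uncertain match, and `cover_count[i][j]` (for i < j) counts how many selected distinguishers separate the pair so far. The names and the matrix layout are assumptions made for illustration.

```python
def separated(a, b):
    """A pair is separated only by one perfect match and one perfect mismatch."""
    return {a, b} == {0, 1}


def redundant_gain(match, cover_count, r):
    """Pairs separated by the candidate that are still covered fewer than r times."""
    n = len(match)
    return sum(1
               for i in range(n) for j in range(i + 1, n)
               if separated(match[i], match[j]) and cover_count[i][j] < r)


def add_distinguisher(match, cover_count):
    """Update the coverage matrix after selecting a candidate."""
    n = len(match)
    for i in range(n):
        for j in range(i + 1, n):
            if separated(match[i], match[j]):
                cover_count[i][j] += 1


cover = [[0] * 5 for _ in range(5)]
tcag = [1, 1, 0, 0, 1]  # match vector of TCAG from the earlier example
first_gain = redundant_gain(tcag, cover, r=2)
add_distinguisher(tcag, cover)
second_gain = redundant_gain(tcag, cover, r=2)
```

With r = 2 the same six pairs still count after one selection, and a sequence with an uncertain match contributes to no pair.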
Overview
• Problem Formulation and Previous Work
• Greedy Setcover Algorithm
• Experimental Results
• Conclusions
Testcases
• Randomly generated instances
  - Equal probabilities assigned to each of the four nucleotides
• Microbial genomes extracted from NCBI databases
  - Sequence lengths between 490 Kbases and 4.75 Mbases
  - Small number of degenerate bases
Selection time, L=10k, r=1
basic: $O(n^2)$ computation of gains using the matrix data structure
partition: $O(n)$ computation of gains using the partition-based data structure
Candidate Sampling, n=1000, L=10k, r=1
Comparison to ICH, L=10k, r=1
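| Algo \ n | 10 | 20 | 50 | 100 | 200 | 500 | 1000 |
|---|---|---|---|---|---|---|---|
| $\lceil \log_2 n \rceil$ | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| ICH | 4.0 | 5.0 | 7.0 | 8.0 | 10.0 | 12.2 | 14.1 |
| SGA | 4.0 | 5.0 | 7.0 | 8.0 | 10.0 | 12.3 | 14.1 |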
Varying Redundancy, L=10k
[Plot: #Distinguishers (y-axis, 0 to 80) versus Redundancy (x-axis, 0 to 20), with one curve for each of n=10, 20, 50, 100, 200, 500, 1000]
NCBI testcase
• 20 NCBI microbial genomic sequences
• Distinguisher melting temperature range of 55-60 °C
• GC content range of 40-60%
• Max common subsequence weight bound of 5
  - weight(A)=weight(T)=1, weight(C)=weight(G)=2
NCBI testcase, r=1
Distinguishers:
AACTGTCTCACGACGTTCTGAA
GATTCGAACCCCCGA
GTGGATGCCTTGGCA
GGACTACCAGGGTATCTAATCCTG
AAAGAAGATAGAGCAGCAGCT
AAGCGCGTCGCAAA
CACAAGGAGTGAGTGTTGC
CGGTTTTGTGCTTCATGG
CCATTGACAATTTCAACACC

| Organism | Mb | Barcode |
|---|---|---|
| Nanoarchaeum equitans Kin4-M | 0.49 | 0 0 0 0 0 0 0 0 1 |
| Mycobacterium tuberculosis CDC1551 | 4.40 | 0 0 0 0 0 0 1 0 0 |
| Brucella suis 1330 chromosome 1 | 2.11 | 0 0 0 0 1 1 0 1 0 |
| Leifsonia xyli subsp. xyli str. CTCB07 | 2.58 | 0 0 0 0 0 0 1 0 1 |
| Mannheimia succiniciproducens MBEL55E | 2.31 | 0 0 0 0 1 1 1 0 0 |
| Geobacter sulfurreducens PCA | 3.81 | 0 0 0 1 0 0 0 0 0 |
| Rickettsia typhi str. Wilmington | 1.11 | 0 0 0 0 0 1 1 0 1 |
| Picrophilus torridus DSM 9790 | 1.55 | 0 1 0 0 0 0 0 0 1 |
| Mesoplasma florum L1 | 0.79 | 0 0 0 0 0 0 0 1 1 |
| Methylococcus capsulatus str. Bath | 3.30 | 0 0 0 0 0 1 0 0 1 |
| Propionibacterium acnes KPA171202 | 2.56 | 0 0 0 0 0 0 1 1 0 |
| Mycoplasma mobile 163K | 0.78 | 0 0 0 0 0 1 0 1 1 |
| Mycoplasma hyopneumoniae 232 | 0.89 | 1 0 0 0 0 1 0 1 1 |
| Bacillus licheniformis DSM 13 | 4.22 | 0 0 0 0 0 1 1 1 0 |
| Legionella pneumophila subsp. pneumophila str. Philadelphia 1 | 3.40 | 0 0 0 0 0 1 1 0 0 |
| Onion yellows phytoplasma OY-M DNA | 0.86 | 0 0 0 0 1 1 1 1 0 |
| Staphylococcus aureus subsp. aureus strain MRSA252 | 2.90 | 0 0 1 0 0 1 1 1 1 |
| Staphylococcus aureus strain MSSA476 | 2.80 | 0 0 0 0 0 1 1 1 1 |
| Burkholderia pseudomallei strain K96243 chromosome 1 | 4.07 | 0 0 0 0 0 1 0 0 0 |
| Bartonella henselae strain Houston-1 | 1.93 | 0 0 0 0 0 1 0 1 0 |
| GC (%) | | 60.0 45.5 60.0 50.0 57.1 50.0 52.6 42.9 40.0 |
| Tm (°C) | | 55.6 59.6 55.4 59.3 56.9 58.6 55.1 55.4 56.3 |
Results on 29 Microbial Sequences (76 Mb)
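| Redun | lmin | lmax | MinEdit | Select Time | #Distinguishers |
|---|---|---|---|---|---|
| 1 | 0 | | 0 | 14.2 | 6.0 |
| 1 | 15 | 40 | 6 | 2.6 | 8.0 |
| 5 | 0 | | 0 | 20.3 | 21.0 |
| 5 | 15 | 40 | 6 | 8.7 | 31.0 |
| 10 | 0 | | 0 | 22.9 | 41.0 |
| 10 | 15 | 40 | 6 | 16.4 | 60.0 |
| 20 | 0 | | 0 | 26.8 | 76.0 |
| 20 | 15 | 40 | 6 | 33.4 | 123.0 |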
Overview
• Problem Formulation and Previous Work
• Greedy Setcover Algorithm
• Experimental Results
• Conclusions
Conclusions
• We provided highly scalable algorithms for the robust string barcoding problem, capable of handling whole genomic sequences of up to bacterial size
• Distinguisher selection based on whole genomic sequences results in a number of distinguishers nearly matching the information-theoretic lower bound for the problem
• The software can be used online at http://dna.engr.uconn.edu/~software/DNA-BAR/