Optimal k-mer superstrings for protein identification and DNA assay design. Nathan Edwards Center...
-
Upload
estella-golden -
Category
Documents
-
view
230 -
download
0
Transcript of Optimal k-mer superstrings for protein identification and DNA assay design. Nathan Edwards Center...
Optimal k-mer superstrings
for protein identification
and DNA assay design.
Optimal k-mer superstrings
for protein identification
and DNA assay design.Nathan EdwardsCenter for Bioinformatics and Computational BiologyUniversity of Maryland, College Park
2
k-mer (Sub-)Problems
• Enumerate:• For all (distinct) k-mers, do
• Existence:• ...with respect to exact (& inexact) count ¸ x
• Uniqueness:• ...with respect to exact & inexact match
• Near-neighbors:• ...with respect to inexact match
• Representation:• Represent (distinct) k-mers for other tools• Fast annotation of k-mer counts on original sequences
3
Applications of k-mer sets
• Peptide Identification• Represent all amino-acid 30-mers• ...that occur at least twice in human dbEST
• PCR Primer Design:• Test DNA 20-mer primers for uniqueness• What does it mean to be unique?
• DNA sequencing error / repeat detection• Eliminate mers that are too rare or too frequent
• Pathogen signatures• Near-neighbors imply potential false-positives
4
k-mer Superstring Problem
Given• A set of sequences S = { S1, ..., Sn }
• Sequence database• Word size kFind• A new set of sequences T = { T1, ..., Tm }Such that• Total length of T is minimized, and• T is complete and correct w.r.t. k-mers of S
5
k-mer Superstring Problem
• Completeness• All of the k-mers of S are represented
• Correctness• No additional k-mers are present
• Minimize the total representation length• Correlates with running time
6
Shortest (common) superstring problem
• General strings (arbitrary length)• Single output string• Completeness for input sequences only
• Classical NP-hard problem• Garey and Johnson
• Approximate within ~ 2.5*OPT• Max-SNP hard
• One of the first algorithmic approaches to genome assembly
7
de Bruijn Sequences
de Bruijn sequences represent all words of length k from some alphabet A.
A = {0,1}, k = 3: s = 0001110100
A = {0,1}, k = 4: s = 0000111101011001000
8
de Bruijn Graph: A = {0,1}, k = 4
110
011
100
001
000 010 111101
1 1
11
1
11
1
00
0
00
0
00
9
de Bruijn Sequences & Graphs
de Bruijn graphs (k,A):• Edges represent length k words from A• Each node has
• in degree |A|• out degree |A|
Eulerian tour constructs de Bruijn sequence.
10
Sequencing-by-Hybridization-graph
ACDEFGI, ACDEFACG, DEFGEFGI
11
Compressed SBH-graph
ACDEFGI, ACDEFACG, DEFGEFGI
12
Sequence Databases & CSBH-graphs
• Original sequences correspond to paths
ACDEFGI, ACDEFACG, DEFGEFGI
13
C3 Enumeration
• Complete• All k-mers are present
• Correct• No other k-mers are present
• Compact• No k-mer is present more than once
14
Correct, Complete, Compact (C3) Enumeration
• Set of paths that use each edge exactly once
ACDEFGEFGI, DEFACG
15
Correct, Complete (C2) Enumeration
• Set of paths that use each edge at least once
ACDEFGEFGI, DEFACG
16
Patching the CSBH-graph
• Use artificial edges to fix unbalanced nodes
17
Patching the CSBH-graph
• Use matching-style formulations to choose artificial edges• Optimal C2/C3 enumeration in polynomial time.
• Chinese Postman Problem• Edmonds and Johnson, ’73
• l-tuple DNA sequencing• Pevzner, ’89
• Shortest (Common) Superstring• MAX-SNP-hard, 2.5 approx algorithm
18
Related work
• Chinese Postman Problem• Undirected graph, weighted edges• Shortest path that uses all the edges
• Solvable in polynomial time• Construct minimum weighted matching
between nodes of odd-degree• Add matching to graph and find Eulerian
path• Minimize weight of extra edges used
19
C2 Enumeration
• Chinese postman problem, except:• Directed graph
• Add edges from nodes with surplus in-degree to nodes with surplus out-degree
• Fixed cost teleportation option• Can always “start” a new sequence
• Find optimal set of additional edges• Transportation problem / min cost flow
instance
20
C3 Enumeration
1
3
2
1
3
-2
-1
-4
-1
-2Cost: k
#in-#out #in-#out
21
Reusing Edges
ACD HAC
EHAC
FHAC
GHAC
D
• ACDEHAC, ACDFHAC, ACDGHACD
22
• C3: ACDEHACDFHAC, ACDGHACD
Reusing Edges
ACD HAC
EHAC
FHAC
GHAC
D
$ACD
23
• C2: ACDEHACDFHACDGHAC
Reusing Edges
ACD HAC
EHAC
FHAC
GHAC
D
D
24
C2 Enumeration
1
3
2
1
3
-2
-1
-4
-1
-2
4
7
10
“Shortcut paths”
#in-#out #in-#out
25
C3 Enumeration
1
3
2
1
3
-2
-1
-4
-1
-2
#in-#out #in-#out
Cost: k
0 0
Cost: 0 Cost: 0
26
Sample Preparation for Peptide Identification
Enzymatic Digestand
Fractionation
27
Single Stage MS
MS
m/z
28
Tandem Mass Spectrometry(MS/MS)
Precursor selection
m/z
m/z
29
Tandem Mass Spectrometry(MS/MS)
Precursor selection + collision induced dissociation
(CID)
MS/MS
m/z
m/z
30
Peptide Identification
• For each (likely) peptide sequence1. Compute fragment masses2. Compare with spectrum3. Retain those that match well
• Peptide sequences from protein sequence databases• Swiss-Prot, IPI, NCBI’s nr, ...
• Automated, high-throughput peptide identification in complex mixtures
31
Novel Splice Isoform
• Human Jurkat leukemia cell-line• Lipid-raft extraction protocol, targeting T cells• von Haller, et al. MCP 2003.
• LIME1 gene:• LCK interacting transmembrane adaptor 1
• LCK gene:• Leukocyte-specific protein tyrosine kinase• Proto-oncogene• Chromosomal aberration involving LCK in leukemias.
• Multiple significant peptide identifications
32
Novel Splice Isoform
33
Novel Splice Isoform
34
Novel Mutation
• HUPO Plasma Proteome Project• Pooled samples from 10 male & 10 female
healthy Chinese subjects• Plasma/EDTA sample protocol• Li, et al. Proteomics 2005. (Lab 29)
• TTR gene• Transthyretin (pre-albumin) • Defects in TTR are a cause of amyloidosis.• Familial amyloidotic polyneuropathy
• late-onset, dominant inheritance
35
Novel Mutation
Ala2→Pro associated with familial amyloid polyneuropathy
36
Novel Mutation
37
Compressed EST Peptide Sequence Database
• For all ESTs mapped to a UniGene gene:• Six-frame translation• Eliminate ORFs < 30 amino-acids• Eliminate amino-acid 30-mers observed once• Compress to C2 FASTA database
• Complete, Correct for amino-acid 30-mers
• Inclusive gene-centric peptide sequence database:• Size: < 3% of naïve enumeration, 20774 FASTA entries• Running time: ~ 1% of naïve enumeration search• E-values: ~ 2% of naïve enumeration search results
38
Compressed EST Peptide Sequence Database
• For all ESTs mapped to a UniGene gene:• Six-frame translation• Eliminate ORFs < 30 amino-acids• Eliminate amino-acid 30-mers observed once• Compress to C2 FASTA database
• Complete, Correct for amino-acid 30-mers
• Gene-centric peptide sequence database:• Size: < 3% of naïve enumeration, 20774 FASTA entries• Running time: ~ 1% of naïve enumeration search• E-values: ~ 2% of naïve enumeration search results
39
Sequence Databases & CSBH-graphs
• All k-mers represented by an edge have the same count
2 2
1
2
1
40
CSBH-graph subgraphs
• Quickly determine those that occur twice
2 2
1
2
41
k-mer (Sub-)Problems
• Enumerate:• For all (distinct) k-mers, do
• Existence:• ...with exact (& inexact) count ¸ x
• Uniqueness:• ...exact & inexact match
• Near-neighbors:• ...inexact match
• Representation:• Represent (distinct) k-mers for other tools• Fast annotation of k-mer counts on original sequences
42
Large scale instances!
• CSBH-graph instances• Partition set of all k-mers, determine non-trivial nodes• Days on condor grid (250 CPUs) to construct• ¸ 100,000,000 nodes and edges (sparse & dense)
• Min-cost flow instances• ¸ 500,000 nodes and edges
• Algorithms must be linear in problem size• Out-of-core Eulerian path algorithm?
• Currently testing out-of-core connected-components
43
Grid computing
• Heterogeneous machines• Varying disk/memory/MHz/cores capabilities
• Centralized scheduler• Jobs started asynchronously• Other jobs may preempt current job
• Input files may need to be staged• 250 simultaneous requests for a 3Gb file?• How to guarantee integrity of input files?
• Problem decomposition may be non-trivial• Jobs sizes need to fit the least capable machine• Sometimes need to “game” the scheduler
• Need to ensure the integrity of job output
44
Uniqueness Oracles
• Oracle for uniqueness of 20-mers in the Human genome (size: 3Gb)• Count occurrences in the genome: 0,1,2+• Construct 20-mer superstring for 20-mers with
count 1• Construct 20-mer superstring for 20-mers with
count > 1• Easy(-ish) for exact sequence match: O(n)
• Fast automata, hash tables, suffix trees.
45
Polymerase Chain Reaction
46
Polymerase Chain Reaction
47
Inexact sequence match
• Inexact sequence matching O(n*m*k)• Errors/Mismatches (k): 1,2,3• # distinct 20-mers (m): O(n)
• Achieve expected linear time using a hybrid approach (blastn):• Exact search for short chunks of primers• Expensive alignment only where chunks match• Large chunks ) Fast, but miss occurrences• Small chunks ) Slow, find all matches
48
Baeza-Yates Perleberg: • Correct and O(n) for small k
• At least 1 chunk is observed with no error.• Small k → Large chunks → Fast and correct• Form of locality sensitive hashing
Inexact sequence match
≠ = ≠q
g
49
Locality Sensitive Hashing
• For each primer:• store a (set of) hash(es) in hash-table
• At each position in the genome:• look-up a (set of) hash(es) in hash-table• if any hash is found, do more expensive check
• Need to weigh• sensitivity (false negatives) vs• specificity (false positives)
• Our application requires speed and no false negatives!
50
Random Projection
• Choose T templates of l random “care” positions
q
g
51
Random Projection
• Choose T templates of l random “care” positions
t1
g
t1: 0 1 1 0 1 0 1 0 0 0 1 0 1 0 1 0 0
52
Random Projection
• Choose T templates of l random positions
t1
t2
g
t1: 0 1 1 0 1 0 1 0 0 0 1 0 1 0 1 0 0 t2: 1 0 1 0 0 1 0 1 0 0 0 0 0 1 0 0 1
53
Random Projection
• Choose T templates of l random positions
t1
t2
g
t1: 0 1 1 0 1 0 1 0 0 0 1 0 1 0 1 0 0 t2: 1 0 1 0 0 1 0 1 0 0 0 0 0 1 0 0 1
54
Random Projection
• Choose T templates of l random positions
t1
t2
g
t1: 0 1 1 0 1 0 1 0 0 0 1 0 1 0 1 0 0 t2: 1 0 1 0 0 1 0 1 0 0 0 0 0 1 0 0 1
55
Gapped seed-set design problem
• Given:• mer-size: m ( = 20 )• # errors: k ( = 1,2,3)• # cares: l ( = 10,12,14 )
• Find the smallest set of templates with no false negatives.• Minimize running time.
56
Gapped seed set design formulation (for k = 2)
• Cover the edges of Km with copies of Km-l
• How many triangles to cover K6?(m = 6, k = 2, l = 3)
• Some instances of (m,2,m-3) cover each edge exactly once:• Steiner triple systems
57
How many triangles cover K6?
• 15 edges total
• Is 5 triangles possible?
58
How many triangles cover K6?
• 15 edges total
• Is 5 triangles possible? NO!
59
How many triangles cover K6?
• 15 edges total
• Is 5 triangles possible? NO!
• Each node requires 3 triangles
• Triangles must account for at least 18 “edges”
60
How many triangles cover K6?
• 15 edges total
• Is 5 triangles possible? NO!
• Each node requires 3 triangles
• Triangles must account for at least 18 “edges”
61
Gapped seed set design formulation #2
• Set cover instance:• Ground set: all possible placements of the
k errors (alignments)• Covering sets: all possible placements of
the l care positions
• For (m=20,k=2,l=10),• 190 elements, 184,756 sets!• Greedy approximation algorithm works
62
Gapped seed set design formulation #3
Templates Positions (m)
l Remove any kposition nodes,
at least 1 templatemust have degree l.
63
Gapped seed set design formulation #3
• Polynomial size in terms of number of templates
• Select T in advance and test whether sufficient.
• Greedily add 1,2,3,... templates.• Apply iteratively to achieve feasible
solution
64
Solution for (20,2,10)
.................... Positions
********** t1
********** t2
***** ***** t3
***** ***** t4
***** ***** t5
********** t6• Need > 4 templates, 6 is optimal
65
Remember the application!
• We are checking some templates twice!• We compute hash(es) at each position in
the genome
• Any template that is a shift of another will be computed at some nearby genomic position!
66
Solution for (20,2,10)
.................... Positions
********** t1
********** t2
***** ***** t3
***** ***** t4
***** ***** t5
********** t6• Need at most 3 templates...can we do better?
67
Solution for (20,2,10) w/ shift
.................... Positions
**** ** **** t1
**** * ***** t2
• Optimal is 2 templates...
68
Gapped seed set designSolution strategies
• Randomized algorithms• Greedy algorithm
• Directly to set cover instance• Indirectly to bipartite instance
• Integer programming• On set cover and bipartite instances• Solution of greedy algorithm subproblem• ...in parallel, using COIN-OR SYMPHONY
• Branch-and-bound enumeration• Solution of greedy algorithm subproblem• ...in parallel, using COIN-OR ALPS library
69
What about edit-distance?
• Formulations can be generalized• Similar solution strategies can be applied
• (All) symmetry lost!• This may actually be helpful
• Much harder to solve• Is greedy still good?• Solutions typically require more templates
70
Uniqueness Oracles
• Integrated with CSBH-graph construction algorithm• Ensure edge-count property is preserved
• Sequence database of unique / non-unique 20-mers for small genomes• D. melanogaster, up to edit-distance 2
• Currently working to scale to human...
71
Other Projects / Interests
• HMMs for Peptide Spectrum Matching• with UMd, CS
• Rapid Microorganism Identification Database• www.RMIDb.org
• Pathogen detection using Spectral Matching• with USDA
• Locality sensitive hashing• spectra, peptide sequence
• Statistical techniques• statistical significance • importance sampling
• CSBH-graph applications• genome assembly
• Grid computing• Web-applications• Relational databases
72
Future Research Directions
• Extend k-mer superstring algorithms• Range of word sizes, variable length words• Other sequence properties (Tryptic peptides, Tm)
• Identification of protein isoforms:• Optimize proteomics workflow for isoform detection• Identify splice variants in cancer cell-lines (MCF-7)
and clinical brain tumor samples• Aggressive peptide sequence enumeration• dbPep for genomic annotation• Open, flexible informatics infrastructure for peptide
identification
73
Future Research Directions
• Proteomics for Microorganism Identification• Specificity of tandem mass spectra• Revamp RMIDb prototype• Incorporate spectral matching
• Primer design• Uniqueness oracle for inexact match in human• Integration with Primer3• Tiling, multiplexing, pooling, & tag arrays
74
Acknowledgements
• Chau-Wen Tseng, Xue Wu• UMCP Computer Science
• Catherine Fenselau, Steve Swatkoski• UMCP Biochemistry
• Calibrant Biosystems
• PeptideAtlas, HUPO PPP, X!Tandem
• Funding: National Cancer Institute