Clustering Sequences in a Metric Space
The MoBIoS Project
Rui Mao, Daniel P. Miranker, Jacob N. Sarvela and Weijia Xu
Department of Computer Sciences, University of Texas
Austin, TX 78712, USA
{rmao, miranker}@cs.utexas.edu
Research supported in part by the Texas Higher Education Coordinating Board, Texas Advanced Research Program.
Immediate Goal: Use Metric Space Indexing to Support Homology Search
1. Develop tree-based index structure to speed homology search.
2. Maintain the use of an evolutionary model of similarity.
3. Deliver full Smith-Waterman sensitivity.
Metric Space
A metric space [CPRZ97a] is a pair, M=(D,d), where D is a domain of indexing keys and d is a distance function with the following properties:
d(Ox,Oy) = d(Oy,Ox) (symmetry)
d(Ox,Oy) > 0 for Ox ≠ Oy, d(Ox,Ox) = 0 (non-negativity)
d(Ox,Oy) <= d(Ox,Oz) + d(Oz,Oy) (triangle inequality)
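The three properties can be checked by brute force on a small sample of keys. The sketch below is only an illustration: `is_metric`, the toy peptides, and the use of Hamming distance are assumptions made for the example, not part of MoBIoS.

```python
from itertools import product

def is_metric(points, d, tol=1e-12):
    """Return True if d satisfies the metric properties on all given points."""
    for x, y in product(points, repeat=2):
        if abs(d(x, y) - d(y, x)) > tol:        # symmetry
            return False
        if x == y and abs(d(x, y)) > tol:       # d(x, x) = 0
            return False
        if x != y and d(x, y) <= 0:             # d(x, y) > 0 for distinct keys
            return False
    for x, y, z in product(points, repeat=3):
        if d(x, y) > d(x, z) + d(z, y) + tol:   # triangle inequality
            return False
    return True

def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(ca != cb for ca, cb in zip(a, b))

# Hypothetical toy keys, chosen only for illustration.
peptides = ["ACDE", "ACDF", "GCDF", "GHIK"]
hamming_ok = is_metric(peptides, hamming)                           # True
squared_ok = is_metric(peptides, lambda a, b: hamming(a, b) ** 2)   # False
print(hamming_ok, squared_ok)
```

Squared Hamming distance fails only the triangle inequality (ACDE to GCDF costs 4, but going through ACDF costs 1 + 1), which shows that a function can be symmetric and non-negative and still not be a metric.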
Metric Space Indexing
Metric space indexing exploits intrinsic clustering of the data.
The hierarchical structure avoids a linear scan of the entire database and, in the best case, leads to search time logarithmic in the database size.
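The pruning that lets an index skip parts of the database rests on the triangle inequality. The sketch below shows that rule with a single pivot and a flat table; a tree index applies the same test at each node. The function names, the toy data, and the Hamming distance are assumptions made for the example.

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(ca != cb for ca, cb in zip(a, b))

def build_pivot_table(data, pivot, d):
    """Precompute d(pivot, o) for every object o (done once, at index time)."""
    return [(o, d(pivot, o)) for o in data]

def range_query(table, pivot, d, query, radius):
    """Return (matches, number of distance computations made at query time)."""
    d_qp = d(query, pivot)
    computed = 1
    matches = []
    for obj, d_po in table:
        # Triangle inequality: d(q, o) >= |d(q, p) - d(p, o)|.
        if abs(d_qp - d_po) > radius:
            continue  # o cannot lie within radius, so d(q, o) is never computed
        computed += 1
        if d(query, obj) <= radius:
            matches.append(obj)
    return matches, computed

# Hypothetical toy data, for illustration only.
data = ["AAAA", "AAAC", "AACC", "ACCC", "CCCC", "GGGG"]
pivot = "AAAA"
table = build_pivot_table(data, pivot, hamming)
matches, computed = range_query(table, pivot, hamming, "CCCC", 1)
print(matches, computed)  # ['ACCC', 'CCCC'] 4
```

Here the query needs 4 distance computations instead of the 6 a linear scan would make. The saving grows when the data is well clustered around the pivots.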
Challenges
Local alignment does not form a metric. Local alignment produces a set of answers; a distance function produces a single number.
Popular evolutionary models (PAM) are not metrics.
PAM matrices are based on log-odds
– Negative values
PAM matrices are asymmetric
– Let Pr(x,y) be the probability that amino acid x mutates to amino acid y.
– Pr(x,y) ≠ Pr(y,x)
More similar sequences score higher, not lower
– Identical sequences must be distance 0 apart
From PAM to mPAM - Symmetry
PAM: The PAM matrix is computed from the frequency of one amino acid mutating to another.
mPAM: We model a pair of amino acids, one in each sequence, as having evolved from a common ancestor [Gonnet & Korostensky].
The probability that amino acid y and amino acid z come from the same ancestor amino acid, summed over all candidate ancestors x, is:
Pr(y,z) = Σ_x f(x) Pr(x,y) Pr(x,z)
[Diagram: ancestor X with two descendants, Y and Z]
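A small numeric check makes the symmetry argument concrete. The sketch below assumes the formula sums over every candidate ancestor x and that f(x) is the background frequency of x. The three-letter alphabet and all numbers are made up for illustration and are not PAM data.

```python
def common_ancestor_prob(f, P):
    """Q[y][z] = sum over x of f[x] * P[x][y] * P[x][z]."""
    n = len(f)
    return [[sum(f[x] * P[x][y] * P[x][z] for x in range(n))
             for z in range(n)]
            for y in range(n)]

# Toy values (assumptions): f sums to 1 and each row of P sums to 1.
f = [0.5, 0.3, 0.2]
P = [[0.8, 0.15, 0.05],
     [0.1, 0.7, 0.2],
     [0.3, 0.1, 0.6]]

Q = common_ancestor_prob(f, P)

# The mutation matrix P is asymmetric, but the common-ancestor matrix Q is symmetric.
p_asymmetric = any(P[i][j] != P[j][i] for i in range(3) for j in range(3))
q_symmetric = all(abs(Q[y][z] - Q[z][y]) < 1e-12
                  for y in range(3) for z in range(3))
print(p_asymmetric, q_symmetric)
```

Symmetry follows directly from the form of each term: swapping y and z only swaps the two factors Pr(x,y) and Pr(x,z).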
From PAM to mPAM – Distance vs. Similarity
PAM: Log-odds are computed from the frequency of mutations.
mPAM: Compute the expected time for a particular mutation to occur.
More frequent mutations will occur, on average, in less time.
[Figure: mPAM matrix]
Computing Local Alignments from an Index
Divide the database into small fixed-size pieces.
Build a metric-space index based on global alignment.
Divide the query into small fixed-size pieces.
For each query piece, use the index to find results based on global alignment.
– Like BLAST's hot-spot index, but fully sensitive
Chain the results together.
– Intuitively like BLAST's extension of hot spots
– The best algorithm is the last step of “A Sublinear Algorithm for Approximate Keyword Searching” [Myers94]
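The piece-then-chain flow can be sketched end to end on toy data. Several things here are stand-ins and should be read as assumptions: overlapping windows are used as the "pieces", Hamming distance replaces the mPAM global-alignment distance, a linear scan replaces the metric-space index, and the chaining step is a simple merge along diagonals rather than the algorithm from [Myers94].

```python
from collections import defaultdict

def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(ca != cb for ca, cb in zip(a, b))

def pieces(seq, k):
    """All overlapping length-k windows of seq, with their start offsets."""
    return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]

def search_pieces(db_seq, query, k, radius):
    """Return (query_offset, db_offset) pairs whose pieces are within radius."""
    db_pieces = pieces(db_seq, k)      # stand-in for the metric-space index
    hits = []
    for qi, qp in pieces(query, k):
        for di, dp in db_pieces:       # linear scan instead of an index lookup
            if hamming(qp, dp) <= radius:
                hits.append((qi, di))
    return hits

def chain(hits):
    """Merge hits on the same diagonal (db_offset - query_offset) into runs.

    Returns (diagonal, first_query_offset, last_query_offset) triples.
    """
    by_diag = defaultdict(list)
    for qi, di in hits:
        by_diag[di - qi].append(qi)
    chains = []
    for diag, offsets in by_diag.items():
        offsets.sort()
        start = prev = offsets[0]
        for q in offsets[1:]:
            if q != prev + 1:          # gap: close the current run
                chains.append((diag, start, prev))
                start = q
            prev = q
        chains.append((diag, start, prev))
    return chains

# Hypothetical sequences, for illustration only.
db_seq = "MKTAYIAKQRQISFVK"
query = "AYIAKQ"
result = chain(search_pieces(db_seq, query, k=4, radius=0))
print(result)  # [(3, 0, 2)]
```

The single chain says that query pieces 0 through 2 all match the database three positions further along, which together cover the whole query. Raising `radius` admits inexact pieces, which is where a metric index would replace the inner loop.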
Initial Results: M-Tree
M-tree [CPRZ97b] is an open-source metric-space indexing package.
Results for global alignment of Yeast peptide sequences of length 10.
Compare the M-tree clustering result with the clustering result of a farthest-first traversal bulk load.
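For readers unfamiliar with farthest-first traversal, the sketch below shows the general technique: pick a first center, then repeatedly pick the point farthest from all centers chosen so far. It is not the authors' bulk loader, whose details are not given here. The helper names, the toy peptides, and the Hamming distance are assumptions, and the code assumes `k` does not exceed the number of points.

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(ca != cb for ca, cb in zip(a, b))

def farthest_first(points, d, k):
    """Choose k centers, starting from points[0]."""
    centers = [points[0]]
    # nearest[j] = distance from points[j] to its closest chosen center
    nearest = [d(p, centers[0]) for p in points]
    while len(centers) < k:
        i = max(range(len(points)), key=lambda j: nearest[j])
        centers.append(points[i])
        nearest = [min(nearest[j], d(points[j], points[i]))
                   for j in range(len(points))]
    return centers

def assign(points, centers, d):
    """Assign each point to its closest center (ties go to the earlier center)."""
    clusters = {c: [] for c in centers}
    for p in points:
        clusters[min(centers, key=lambda c: d(p, c))].append(p)
    return clusters

# Hypothetical toy data, for illustration only.
points = ["AAAA", "AAAC", "AACC", "GGGG", "GGGT", "CCCC"]
centers = farthest_first(points, hamming, 3)
clusters = assign(points, centers, hamming)
print(centers)   # ['AAAA', 'GGGG', 'CCCC']
print(clusters)
```

Applying the same step recursively inside each cluster would yield a hierarchy; because each new center is as far as possible from the existing ones, the resulting clusters tend to have small covering radii.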
[Figure: Fraction of leaves visited to all the leaves vs. database size. x-axis: database size (0 to 12,000); y-axis: fraction of leaves visited to all the leaves (0.73 to 0.79)]
For a set of queries, the average fraction of leaves visited during search, relative to the total number of leaves, decreases as the database size grows. This shows that the database is clustered well.
[Figure: Covering radius vs. tree level of M-tree. x-axis: tree level (0 to 8); y-axis: radii of routing objects (0 to 350); series: average radius, min radius, max radius]
The covering radius of the routing objects at each level of the M-tree decreases while descending the tree. This shows that the database is hierarchically clustered well.
[Figure: Covering radius vs. tree level of farthest-first traversal bulk load. x-axis: tree level (0 to 8); y-axis: radii of routing objects (0 to 350); series: average radius, min radius, max radius]
The covering radius of the routing objects at each level of the tree built by farthest-first traversal bulk load decreases while descending the tree. This shows that the database is hierarchically clustered well. The radii here are significantly smaller than those of the M-tree, which suggests that a new index structure better than the M-tree can be built.
Long-term Goal: Biologists Need a New Kind of DBMS
Traditional relational databases:
Data is dynamic.
Workload:
– Regular, exact, periodic queries (billing, customer service)
– Transactional (inventory, bank accounts)
Biological databases:
Data is write-once.
Workload:
– Ad-hoc queries
– Data clustering (mining)
Biological data types are non-relational.
Biological data types do cluster in metric spaces:
– Genomic/proteomic sequences
– Mass-spectrometer signatures
– Molecular models
MoBIoS Architecture (Molecular Biological Information System)
Storage Manager
Metric-space index structure:
– Persistent representation
– Multiple hierarchical trees
– Choice of metric distance functions, including user-defined ones
Results in:
– Efficient clustering of the database
– Search time logarithmic in the database size
Query Engine
MoBIoS SQL (M-SQL)
Built-in biological data types:
– Sequence
– Mass-spectra data
Embodies evolutionary semantics of bioinformatic investigation
Examples:
– Homology look-up
– Gene fusion experiment
M-SQL Program for MS/MS Protein Identification
SELECT Prot.accesion_id, Prot.sequence
FROM protein_sequences Prot, digested_sequences DS, mass_spectra MS
WHERE MS.enzyme = DS.enzyme = E and
    Cosine_Distance(S, MS.spectrum, range1) and
    DS.accession_id = MS.accession_id = Prot.accesion_id and
    DS.ms_peak = P and
    MPAM250(PS, DS.sequence, range2)
// Return proteins in the intersection of recorded spectra sufficiently
// similar, range1, to the measured spectra of the first MS, and proteins
// which have a digested fragment computed to be sufficiently similar in
// sequence to the sequencing determined by the second MS
// Database is loaded with genomic and proteomic information
Create table protein_sequences (accesion_id int, sequence peptide, …,
    primary metrickey(sequence, mPAM250));
Create table digested_sequences (accession_id int, fragment peptide,
    enzyme varchar, ms_peak int…, primary key(enzyme, accession_id));
Create index fragment_sequence on digested_sequences (fragment) metric(mPAM250);
Create table mass_spectra (accession_id int, enzyme varchar,
    spectrum spectrum, primary metrickey(spectrum, cosine_distance));
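The range predicate on spectra can be illustrated outside M-SQL. The slides do not define Cosine_Distance, so the sketch below makes an assumption: it uses the angle between two intensity vectors, which satisfies the triangle inequality, whereas 1 minus the cosine similarity in general does not. The function names, the binned toy spectra, and the radius are hypothetical, and vectors are assumed to be non-zero and of equal length.

```python
import math

def cosine_distance(a, b):
    """Angle in radians between two non-zero intensity vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Clamp to [-1, 1] to guard against rounding error before acos.
    return math.acos(max(-1.0, min(1.0, dot / norm)))

def range_search(spectra, query, radius):
    """Ids of stored spectra within `radius` of the query (linear scan)."""
    return [sid for sid, s in spectra.items()
            if cosine_distance(query, s) <= radius]

# Hypothetical binned spectra, for illustration only.
spectra = {
    "s1": [1.0, 0.0, 0.0, 2.0],
    "s2": [2.0, 0.0, 0.0, 4.0],   # same shape as s1, twice the intensity
    "s3": [0.0, 3.0, 1.0, 0.0],   # no peaks in common with the query
}
found = range_search(spectra, [1.0, 0.0, 0.0, 2.0], radius=0.1)
print(found)  # ['s1', 's2']
```

Because the angle ignores overall intensity, s1 and s2 are both returned. In a metric-space index the linear scan in `range_search` would be replaced by a pruned tree search.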
Mining Engine (Not much idea???)
Primitives for relating clusters to each other:
– Gene expression
– Protein family
Possible syntax
Applications
Homology search
Proteomics
– MS/MS and ion-trap MS need both MS signature and sequence data to analyze results
Gene expression
– Built-in clustering algorithms
Sequence assembly
Properties of Biological Databases
Data is write-once.
Workload:
– Ad-hoc retrieval queries based on evolutionary criteria
– Data clustering and categorization (mining)
Many biological data types are non-relational:
– Genomic/proteomic sequences
– Structural and functional annotations to sequences
– Mass-spectrometer signatures
– Molecular models
Existing Methods
BLAST
– Builds an index of the query sequence, then linearly scans the database
BLAT
– Builds an index of the database, then searches it by exact match of fixed-length segments
SST
– Tree-structured index for vector-space objects
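The database-side indexing by exact fixed-length segments can be sketched with a hash table of k-mers. This shows the general idea only and is not BLAT's actual implementation. The function names and toy sequences are assumptions.

```python
from collections import defaultdict

def build_kmer_index(db, k):
    """Map each length-k substring to the (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for sid, seq in db.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((sid, i))
    return index

def exact_seed_hits(index, query, k):
    """Look up every length-k window of the query.

    Returns (query_offset, sequence id, db_offset) triples.
    """
    hits = []
    for qi in range(len(query) - k + 1):
        for sid, di in index.get(query[qi:qi + k], []):
            hits.append((qi, sid, di))
    return hits

# Hypothetical toy database, for illustration only.
db = {"p1": "MKTAYIAKQR", "p2": "GGAYIAGG"}
index = build_kmer_index(db, k=4)
hits = exact_seed_hits(index, "TAYIAK", k=4)
print(hits)  # [(0, 'p1', 2), (1, 'p1', 3), (1, 'p2', 2), (2, 'p1', 4)]
```

A hash lookup finds only identical segments, so a single substitution inside a window removes that hit. A metric-space index instead returns every segment within a chosen distance of the query segment.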
From PAM matrix to mPAM
• The PAM matrix is one of the most commonly used substitution matrices for computing the similarity between two peptide sequences under an evolutionary model.
• The PAM matrix cannot be used directly for metric distance indexing techniques.
– Similarity scores do not have the reflexivity property.
– There are negative values.
– The scores do not satisfy the triangle inequality.
Figure-2 Log odds matrix for 250 PAMs. (Dayhoff 1978)
Metric-Space Indexing to Speed Homology Search
1. Split the database and build a metric-space index structure
2. Split the query sequence
3. Search the query segments in the metric indexing database
4. Chain the search results
Results
[Figure: Drosophila distance distribution on mPAM. x-axis: distance (0 to 15,000); y-axis: number of pairs (0 to 800)]
[Figure: Max. abs. log-odds (M.A.L.) of sub-sequence lengths. x-axis: length (0 to 80); y-axis: M.A.L. (0 to 8)]
Min length: 3; Max length: 80; Threshold: ln 10 (2.303); Segment number: 1M; Trial number: 1M; Bucket number: 100; Sequential search range: 80
[Figure: Number of distance calculations vs. index size, for length 30. x-axis: size (0 to 40,000); y-axis: number of calculations (0 to 50,000); series: radius 0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50]
References
• [CPRZ97a] P. Ciaccia, M. Patella, F. Rabitti, and P. Zezula. Indexing metric spaces with M-tree. In Atti del Quinto Convegno Nazionale SEBD, Verona, Italy, June 1997.
• [CPRZ97b] P. Ciaccia, M. Patella, and P. Zezula. M-Tree: An Efficient Access Method for Similarity Search in Metric Spaces. Proc. VLDB, 1997.
• [DSO78] M. O. Dayhoff, R. Schwartz, and B. C. Orcutt. Atlas of protein sequence and structure. Vol. 5, Suppl. 3, Ed. M. O. Dayhoff, 1978.
• [MT] The M-Tree Project Homepage, http://www-db.deis.unibo.it/Mtree/index.html
• [SW81] Temple F. Smith and Michael S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147:195-197, 1981.