Clustering Sequences in a Metric Space

The MoBIoS Project

Rui Mao, Daniel P. Miranker, Jacob N. Sarvela and Weijia Xu

Department of Computer Sciences, University of Texas

Austin, TX 78712, USA

{rmao, miranker}@cs.utexas.edu

Research supported in part by the Texas Higher Education Coordinating Board, Texas Advanced Research Program.

Immediate Goal: Use Metric Space Indexing to Support Homology Search

1. Develop a tree-based index structure to speed up homology search.

2. Maintain the use of an evolutionary model of similarity.

3. Deliver full Smith-Waterman sensitivity.

Metric Space

A metric space [CPRZ97a] is a pair, M = (D, d), where D is a domain of indexing keys and d is a distance function with the following properties:

d(Ox,Oy) = d(Oy,Ox) (symmetry)

d(Ox,Oy) > 0 for Ox ≠ Oy, and d(Ox,Ox) = 0 (non-negativity)

d(Ox,Oy) <= d(Ox,Oz) + d(Oz,Oy) (triangle inequality)
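As a minimal sketch, the three properties can be checked empirically on a finite sample of keys. The code below uses Hamming distance over equal-length strings as a stand-in distance function; `hamming`, `is_metric_on`, and the sample peptides are illustrative names and data, not part of MoBIoS.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def is_metric_on(sample, d) -> bool:
    """Check the metric properties of d on every pair and triple of the sample.

    Passing this check does not prove that d is a metric in general;
    a single failure, however, proves that it is not.
    """
    for x, y in product(sample, repeat=2):
        if d(x, y) != d(y, x):                          # symmetry
            return False
        if d(x, y) < 0 or (x == y) != (d(x, y) == 0):   # non-negativity
            return False
    for x, y, z in product(sample, repeat=3):
        if d(x, y) > d(x, z) + d(z, y):                 # triangle inequality
            return False
    return True


peptides = ["ACDE", "ACDF", "WCDF", "WYDF"]
hamming_ok = is_metric_on(peptides, hamming)
# A similarity score (higher = more alike) is not a distance:
# identical strings score 4 here instead of 0.
similarity_ok = is_metric_on(peptides, lambda a, b: 4 - hamming(a, b))
```

The second call illustrates why a similarity score cannot be plugged into a metric-space index unchanged.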

Metric Space Indexing

Metric space indexing exploits intrinsic clustering of the data.

Hierarchical structure:
– avoids a linear scan of the entire database.
– in the best case, leads to search time logarithmic in the database size.
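To illustrate how clustering lets a search skip most of the database, here is a minimal one-level sketch, assuming each cluster stores a center and a covering radius. The triangle inequality allows a whole cluster to be discarded after a single distance computation. The names `build_clusters` and `range_query`, and the use of absolute difference on integers as the metric, are assumptions for the example only.

```python
def build_clusters(points, centers, d):
    """Assign each point to its nearest center and record covering radii."""
    clusters = [{"center": c, "radius": 0, "members": []} for c in centers]
    for p in points:
        best = min(clusters, key=lambda cl: d(p, cl["center"]))
        best["members"].append(p)
        best["radius"] = max(best["radius"], d(p, best["center"]))
    return clusters


def range_query(clusters, q, r, d):
    """Return (matches within distance r of q, number of distance computations)."""
    matches, computations = [], 0
    for cl in clusters:
        computations += 1
        to_center = d(q, cl["center"])
        # Triangle inequality: every member m satisfies
        # d(q, m) >= d(q, center) - radius, so the cluster cannot
        # contain a match when d(q, center) > r + radius.
        if to_center > r + cl["radius"]:
            continue
        for m in cl["members"]:
            computations += 1
            if d(q, m) <= r:
                matches.append(m)
    return matches, computations


def abs_diff(a, b):
    return abs(a - b)


points = [1, 2, 3, 50, 51, 52, 100, 101, 102]
clusters = build_clusters(points, centers=[2, 51, 101], d=abs_diff)
matches, computations = range_query(clusters, q=49, r=2, d=abs_diff)
```

In this toy run two of the three clusters are pruned, so the query needs fewer distance computations than a linear scan over all nine points. A tree applies the same test recursively at every level.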

Challenges

Local alignment does not form a metric.
– Local alignment produces a set of answers; a distance function produces a single number.

Popular evolutionary models (PAM) are not metrics.
– PAM matrices are based on log-odds, which include negative values.
– PAM matrices are asymmetric. Let Pr(x,y) be the probability that amino acid x mutates to amino acid y. In general, Pr(x,y) ≠ Pr(y,x).
– More similar sequences score higher, not lower, whereas identical sequences must be distance 0 apart.

From PAM to mPAM - Symmetry

PAM: The PAM matrix is computed from the frequency of one amino acid mutating to another.

mPAM: We model a pair of amino acids, one in each sequence, as having evolved from a common ancestor [Gonnet & Korostensky].

The probability that amino acids y and z both descend from the same ancestor amino acid x is f(x) Pr(x,y) Pr(x,z), where f(x) is the frequency of x. Summing over all possible ancestors x gives an expression that is symmetric in y and z:

Pr(y,z) = Σ_x f(x) Pr(x,y) Pr(x,z)

[Figure: ancestor amino acid X with two descendants, Y and Z]
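The common-ancestor model can be sketched numerically. The code below assumes the joint probability is the sum over ancestors x of f(x) Pr(x,y) Pr(x,z), and shows that the result is symmetric even when the mutation matrix is not. The three-letter alphabet, the numbers, and the function name `common_ancestor_matrix` are made up for illustration.

```python
def common_ancestor_matrix(freq, mutate):
    """joint[y][z] = sum over x of freq[x] * mutate[x][y] * mutate[x][z]."""
    n = len(freq)
    return [
        [sum(freq[x] * mutate[x][y] * mutate[x][z] for x in range(n))
         for z in range(n)]
        for y in range(n)
    ]


# Toy 3-letter alphabet; all numbers are hypothetical.
freq = [0.5, 0.3, 0.2]            # freq[x]: frequency of ancestor x
mutate = [                        # mutate[x][y]: probability that x mutates to y
    [0.90, 0.08, 0.02],
    [0.10, 0.85, 0.05],
    [0.30, 0.10, 0.60],
]
joint = common_ancestor_matrix(freq, mutate)

input_is_symmetric = all(
    mutate[x][y] == mutate[y][x] for x in range(3) for y in range(3)
)
joint_is_symmetric = all(
    abs(joint[y][z] - joint[z][y]) < 1e-12
    for y in range(3) for z in range(3)
)
# Each row of mutate sums to 1, so the joint probabilities sum to 1 as well.
total = sum(sum(row) for row in joint)
```

Symmetry follows directly from the form of the sum: swapping y and z only swaps the two factors Pr(x,y) and Pr(x,z) inside each term.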

From PAM to mPAM – Distance vs. Similarity

PAM: Computes log-odds based on the frequency of mutations.

mPAM: Computes the expected time for a particular mutation to occur.

More frequent mutations will occur, on average, in less time.

[Figure: mPAM matrix]
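The slides do not give the formula used to turn mutation frequencies into expected times. As one simple illustration only, assume a mutation occurs independently with probability p in each unit of evolutionary time; the waiting time is then geometric, with mean 1/p, so frequent mutations map to small values and rare ones to large values. The function name and the probabilities below are hypothetical, and this is not claimed to be the mPAM construction itself.

```python
def expected_waiting_time(p_per_step: float) -> float:
    """Mean number of steps until an event that has probability p per step.

    Assumes independent steps (geometric waiting time), an assumption
    made for this sketch rather than taken from the slides.
    """
    if not 0.0 < p_per_step <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return 1.0 / p_per_step


# Hypothetical per-step mutation probabilities.
rates = {"frequent": 0.20, "rare": 0.01}
times = {name: expected_waiting_time(p) for name, p in rates.items()}
```

Under this assumption the ordering is reversed relative to a similarity score: the more likely substitution gets the smaller value, which is the direction a distance needs.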

Computing Local Alignments from an Index

Divide the database into small fixed-size pieces.

Build a metric-space index based on global alignment.

Divide the query into small fixed-size pieces.

For each query piece, use the index to find results based on global alignment.
– Like BLAST's hot-spot index, but fully sensitive.

Chain the results together.
– Intuitively like BLAST's extension of hot spots.
– The best algorithm is the last step of "A Sublinear Algorithm for Approximate Keyword Searching" [Myers94].
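The steps above can be sketched end to end under several simplifying assumptions: database pieces are overlapping windows, query pieces are non-overlapping, Hamming distance stands in for an mPAM-based global alignment, a linear scan stands in for the metric-space index, and chaining is a plain merge of adjacent hits on the same diagonal rather than the algorithm of [Myers94]. All function names and sequences are illustrative.

```python
from collections import defaultdict


def hamming(a, b):
    """Mismatch count between two equal-length pieces."""
    return sum(x != y for x, y in zip(a, b))


def db_pieces(seq, k):
    """All overlapping length-k windows of the database sequence."""
    return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]


def query_pieces(seq, k):
    """Consecutive non-overlapping length-k pieces of the query."""
    return [(j, seq[j:j + k]) for j in range(0, len(seq) - k + 1, k)]


def find_hits(database, query, k, radius):
    """(query_offset, db_offset) pairs whose pieces are within radius.

    The inner loop is a linear scan; a metric-space index would answer
    the same range query without visiting every database piece.
    """
    windows = db_pieces(database, k)
    return [(j, i)
            for j, q in query_pieces(query, k)
            for i, w in windows
            if hamming(q, w) <= radius]


def chain(hits, k):
    """Merge hits on the same diagonal whose query pieces are adjacent.

    Returns (query_start, db_start, length) triples, longest first.
    """
    by_diagonal = defaultdict(list)
    for j, i in hits:
        by_diagonal[i - j].append(j)
    chains = []
    for diag, offsets in by_diagonal.items():
        offsets.sort()
        start = prev = offsets[0]
        for j in offsets[1:]:
            if j != prev + k:            # gap: close the current chain
                chains.append((start, start + diag, prev + k - start))
                start = j
            prev = j
        chains.append((start, start + diag, prev + k - start))
    return sorted(chains, key=lambda c: -c[2])


database = "MKTAYIAKQRQISFVKSHFSRQ"
query = "AYIAKQRQLSFV"          # differs from the database in one residue
hits = find_hits(database, query, k=4, radius=1)
best = chain(hits, k=4)[0]
```

Here the three query pieces each match one database window within the radius, and chaining joins them into a single 12-residue local match starting at database offset 3.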

Initial Results: M-Tree

M-tree [CPRZ97b] is an open-source metric-space indexing package.

Results are for global alignment of yeast peptide sequences of length 10.

We compare the M-tree clustering result with the farthest-first traversal bulk-load clustering result.
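For readers unfamiliar with farthest-first traversal, the following is a minimal sketch of the generic technique: start from an arbitrary point, then repeatedly pick the point that is farthest from all centers chosen so far. How the MoBIoS bulk loader applies it (for example, recursively per tree level) is not described in the slides, so the function below and its toy data are assumptions for illustration.

```python
def farthest_first(points, k, d):
    """Pick up to k centers by farthest-first traversal.

    Starts from points[0], then repeatedly takes the point whose distance
    to its nearest chosen center is largest. Returns the centers and the
    resulting covering radius (largest point-to-nearest-center distance).
    """
    centers = [points[0]]
    nearest = [d(p, centers[0]) for p in points]
    while len(centers) < min(k, len(points)):
        idx = max(range(len(points)), key=nearest.__getitem__)
        centers.append(points[idx])
        nearest = [min(old, d(p, points[idx]))
                   for old, p in zip(nearest, points)]
    return centers, max(nearest)


points = [0, 1, 2, 10, 11, 12, 30, 31]
centers, radius = farthest_first(points, k=3, d=lambda a, b: abs(a - b))
```

Because each new center is taken from the currently worst-covered region, the covering radius shrinks quickly as centers are added, which is the property the comparison with M-tree examines.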

[Figure: Fraction of leaves visited to all the leaves vs. database size. x-axis: database size (0 to 12000); y-axis: fraction of leaves visited to all the leaves (0.73 to 0.79)]

For a set of queries, the average fraction of leaves visited during the search, relative to the total number of leaves, decreases as the database size goes up. This shows that the database is clustered well.

[Figure: Covering radius vs. tree level of M-tree. x-axis: tree level (0 to 8); y-axis: radii of routing objects (0 to 350); series: average radius, min radius, max radius]

The covering radii of the routing objects at each level of the M-tree decrease while descending the tree. This shows that the database is hierarchically clustered well.

[Figure: Covering radius vs. tree level of farthest-first traversal bulk load. x-axis: tree level (0 to 8); y-axis: radii of routing objects (0 to 350); series: average radius, min radius, max radius]

The covering radii of the routing objects at each level of the tree created by farthest-first traversal bulk load decrease while descending the tree. This shows that the database is hierarchically clustered well. The radii here are significantly smaller than those of the M-tree, which suggests that we can build a new index structure that is better than M-tree.

Long-term Goal: Biologists Need a New Kind of DBMS

Traditional relational databases:
Data is dynamic.
Workload:
– Regular, exact, periodic queries (billing, customer service)
– Transactional (inventory, bank accounts)

Biological databases:
Data is write-once.
Workload:
– Ad-hoc queries based on evolutionary criteria
– Data clustering (mining)

Biological data types are non-relational.

Biological data types do cluster in metric spaces:
– Genomic/proteomic sequences
– Mass-spectrometer signatures
– Molecular models

MoBIoS Architecture (Molecular Biological Information System)

Storage Manager

Metric-space index structure:
– Persistent representation.
– Multiple hierarchical trees.
– Choice of metric distance functions, including user-defined ones.

Results in:
– Efficient clustering of the database.
– Search time logarithmic in the database size.

Query Engine

MoBIoS SQL (M-SQL)

Built-in biological data types:
– Sequence
– Mass-spectra data

Embodies the evolutionary semantics of bioinformatic investigation.

Examples:
– Homology look-up
– Gene fusion experiment

M-SQL Program for MS/MS Protein Identification

SELECT Prot.accesion_id, Prot.sequence
FROM protein_sequences Prot, digested_sequences DS, mass_spectra MS
WHERE MS.enzyme = DS.enzyme = E and
      Cosine_Distance(S, MS.spectrum, range1) and
      DS.accession_id = MS.accession_id = Prot.accesion_id and
      DS.ms_peak = P and
      MPAM250(PS, DS.sequence, range2)

// Return proteins in the intersection of recorded spectra sufficiently
// similar, range1, to the measured spectra of the first MS, and proteins
// which have a digested fragment computed to be sufficiently similar in
// sequence to the sequencing determined by the second MS

// Database is loaded with genomic and proteomic information
Create table protein_sequences (accesion_id int, sequence peptide, …,
    primary metrickey(sequence, mPAM250));
Create table digested_sequences (accession_id int, fragment peptide,
    enzyme varchar, ms_peak int…, primary key(enzyme, accession_id));
Create index fragment_sequence on digested_sequences (fragment) metric(mPAM250);
Create table mass_spectra (accession_id int, enzyme varchar,
    spectrum spectrum, primary metrickey(spectrum, cosine_distance));
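The slides do not define Cosine_Distance. As a hedged sketch, the code below assumes it acts as a range predicate over spectra represented as intensity vectors, and uses the angle between vectors (the arccosine of cosine similarity) as the distance, since the angular form satisfies the triangle inequality whereas 1 minus cosine similarity does not in general. The spectra, identifiers, and function names are hypothetical.

```python
import math


def angular_distance(u, v):
    """Angle in radians between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("zero vector has no direction")
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cosine)


def spectra_within(measured, recorded, max_distance):
    """Names of recorded spectra within max_distance of the measured one.

    A linear scan for clarity; a metric-space index on the spectrum
    column would serve the same range query.
    """
    return [name for name, spectrum in recorded.items()
            if angular_distance(measured, spectrum) <= max_distance]


# Hypothetical binned peak intensities keyed by accession id.
recorded = {
    "P1": [10.0, 0.0, 5.0],
    "P2": [0.0, 8.0, 0.0],
    "P3": [9.0, 1.0, 5.0],
}
measured = [20.0, 0.0, 10.0]    # same shape as P1, twice the intensity
close = spectra_within(measured, recorded, max_distance=0.2)
```

Because the distance depends only on direction, a spectrum recorded at a different overall intensity still matches, while a spectrum with peaks in different bins does not.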

Mining Engine (Not much idea???)

Primitives for relating clusters to each other.
– Gene expression
– Protein family

Possible syntax

Applications

Homology search

Proteomics
– MS/MS and Ion-Trap MS need both MS signature and sequence data to analyze results.

Gene expression
– Built-in clustering algorithms

Sequence assembly

Properties of Biological Databases

Data is write-once.

Workload:
– Ad-hoc retrieval queries based on evolutionary criteria
– Data clustering and categorization (mining)

Many biological data types are non-relational:
– Genomic/proteomic sequences
– Structural and functional annotations to sequences
– Mass-spectrometer signatures
– Molecular models

Existing Methods

BLAST
– Builds an index of the query sequence and linearly scans the database.

BLAT
– Builds an index of the database and searches it based on exact matches of fixed-length segments.

SST
– Tree-structured index for vector-space objects.

From PAM matrix to mPAM

• The PAM matrix is one of the most commonly used substitution matrices for computing the similarity between two peptide sequences under an evolutionary model.

• The PAM matrix cannot be used directly with metric distance indexing techniques.
– Similarity scores do not have the reflexivity property.
– There are negative values.
– It does not satisfy the triangle inequality.

Figure 2: Log-odds matrix for 250 PAMs (Dayhoff 1978).

Metric-Space Indexing to Speed Homology Search

1. Split the database and build a metric-space index structure.

2. Split the query sequence.

3. Search for the query segments in the metric-indexed database.

4. Chain the search results.

Results

[Figure: Drosophila distance distribution on mPAM. x-axis: distance (0 to 15000); y-axis: #pairs (0 to 800)]

[Figure: Max. Abs. Log-odds of Sub-seq. Lengths. x-axis: length (0 to 80); y-axis: M.A.L. (0 to 8)]

Min length: 3
Max length: 80
Threshold: ln 10 (2.303)
Segment number: 1M
Trial number: 1M
Bucket number: 100
Sequential search range: 80

[Figure: Length 30, #DistCalcu vs. index size. x-axis: size (0 to 40000); y-axis: #calcu (0 to 50000); one series per radius: 0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50]

References

• [CPRZ97a] P. Ciaccia, M. Patella, F. Rabitti, and P. Zezula. Indexing metric spaces with M-tree. In Atti del Quinto Convegno Nazionale SEBD, Verona, Italy, June 1997.

• [CPRZ97b] P. Ciaccia, M. Patella, and P. Zezula. M-tree: An efficient access method for similarity search in metric spaces. In Proc. VLDB, 1997.

• [DSO78] M. O. Dayhoff, R. Schwartz, and B. C. Orcutt. Atlas of Protein Sequence and Structure, Vol. 5, Suppl. 3, Ed. M. O. Dayhoff, 1978.

• [MT] The M-Tree Project Homepage, http://www-db.deis.unibo.it/Mtree/index.html

• [SW81] T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147:195-197, 1981.
