Hagenberg -Linz -Prague-Vienna
iiWAS 2002, 10-12 September, Bandung, Indonesia, Page 1
ε-ISA: AN INCREMENTAL LOWER BOUND APPROACH FOR EFFICIENTLY FINDING APPROXIMATE NEAREST NEIGHBOR OF
COMPLEX VAGUE QUERIES
DANG Tran Khanh, KÜNG Josef, WAGNER Roland
Institute for Applied Knowledge Processing (FAW)
Johannes Kepler University of Linz
Austria
OUTLINE
Complex Vague Queries in the Vague Query System (VQS)
Similarity search problem of the VQS in the conventional DBMSs
Incremental hyper-Sphere Approach (ISA)
Overcome shortcomings of Incremental hyper-Cube Approach (ICA)
ε-ISA: Finding Approximate Nearest Neighbors of Complex
Vague Queries
The issue of the dimensionality curse
The issue of increasing the query condition number
Experimental Results
Conclusions
COMPLEX VAGUE QUERIES IN THE VAGUE QUERY SYSTEM
The VQS:
Introduced by Kueng and Palkoska 1997
Supports similarity search in conventional DBMSs: returns to users the
records that are semantically close to a given query
One of the VQS’s basic ideas:
• NCR-Tables (Numeric-Coordinate-Representation-Tables): keep
numeric semantic information of non-numeric attributes
NCR-Tables – an example
NCR-table "Colors" (Name is the NCR-key; red, green, blue are the NCR-columns):
Name        red  green  blue
black         0      0     0
blue          0      0   255
light blue  173    216   230
dark blue     0      0   139
...         ...    ...   ...

Relation "Car" (Col is the fuzzy field):
Nr      Typ   Col
L-1234  VW    blue
W-5679  Opel  black
...     ...   ...

SELECT FROM Car
WHERE Col IS 'dark blue'
INTO myResultTable;
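The lookup behind this query can be sketched as follows. This is a minimal illustration, not the VQS implementation: it assumes the Euclidean distance over the NCR-columns as the similarity metric, and the names COLORS, CARS, color_distance and closest_cars are hypothetical.

```python
import math

# NCR-table "Colors" from the slide: NCR-key -> NCR-columns (red, green, blue).
COLORS = {
    "black": (0, 0, 0),
    "blue": (0, 0, 255),
    "light blue": (173, 216, 230),
    "dark blue": (0, 0, 139),
}

# Rows of the Car relation shown on the slide: (Nr, Typ, Col).
CARS = [
    ("L-1234", "VW", "blue"),
    ("W-5679", "Opel", "black"),
]


def color_distance(a, b):
    """Euclidean distance between two NCR-keys in RGB space (assumed metric)."""
    return math.dist(COLORS[a], COLORS[b])


def closest_cars(wanted):
    """Rank Car rows by how close their fuzzy field Col is to the wanted color."""
    return sorted(CARS, key=lambda car: color_distance(car[2], wanted))


ranked = closest_cars("dark blue")
print(ranked[0])
```

No car is 'dark blue', so under this assumed metric the blue VW ranks ahead of the black Opel.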
Complex Vague Queries in VQS: A simplified view of the problem
[Figure: a query relation with attributes Attribute 1 … Attribute n (values Value_11 … Value_nk); NCR-Table 1 … NCR-Table n, each with its own index (Index 1 … Index n); and the vague query processing module.]
The issue of the dimensionality curse [Weber et al 1998; Beyer
et al 1999]
NCR-Tables with high-dimensional data:
• The probability of overlaps between a query and data regions is very
high, and thus the performance of multidimensional access methods
(MAMs) is decreased significantly
• A linear scan over the whole data set would perform better than
MAMs
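Approximate nearest neighbor problem: P is a (1+ε)-approximate nearest neighbor of Q if
$\mathrm{dist}(Q, P) \le (1+\varepsilon)\,\mathrm{dist}(Q, P')$ (1)
where P' is the exact nearest neighbor of Q and ε > 0 is the tolerated error
• Studied almost exclusively for single data sets: single-feature nearest neighbor (S-FNN)
queries [Arya et al 1998, Kleinberg 1997, Amato et al 2000, Ciaccia
and Patella 2000, etc.]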
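Condition (1) can be checked by brute force on a small data set. The sketch below assumes Euclidean distance and finds the exact nearest neighbor by a linear scan; the function name is_approximate_nn and the sample points are illustrative only.

```python
import math


def is_approximate_nn(query, candidate, data, eps):
    """Return True if candidate satisfies condition (1) for the given eps."""
    # Distance to the exact nearest neighbor, found by a linear scan.
    exact = min(math.dist(query, p) for p in data)
    return math.dist(query, candidate) <= (1.0 + eps) * exact


points = [(1.0, 0.0), (0.0, 1.2), (3.0, 3.0)]
q = (0.0, 0.0)

print(is_approximate_nn(q, (0.0, 1.2), points, eps=0.5))
print(is_approximate_nn(q, (0.0, 1.2), points, eps=0.1))
```

With ε = 0.5 the point at distance 1.2 is accepted because the exact nearest neighbor lies at distance 1.0; with ε = 0.1 it is rejected.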
Solving Complex Vague Queries in VQS: "Random access" [Fagin 1996] is impossible
[Figure: a query relation with attributes Attr1 and Attr2 and rows (x1, y1), (x1, y2), (x2, y1), …, shown with the NCR-tables that hold the domain values x1, x2, … and y1, y2, … together with their [Values] columns.]
Incremental hyper-Cube Approach (ICA) [Kueng and Palkoska 1999]
Issues with the ICA: see [Dang et al 2002a, Dang et al 2002b] for the details
• How to determine the initial hyper-cubes?
• How to extend the hyper-cubes when necessary?
• Accessing unnecessary disk pages and objects
• Repeated disk accesses
• Only the best match record is returned (not top-k records)
INCREMENTAL HYPER-SPHERE APPROACH (ISA)
Input:
• A query relation/view S
• A complex vague query Q with n query conditions qi (i = 1, 2, …, n)
• Assume each feature space (or NCR-Table) related to Q is managed by a multidimensional index structure Fi
Output: Best match record/tuple Tmin for Q, Tmin ∈ S. Ties are arbitrarily broken.
Step 1: Search on each Fi for the corresponding qi using the adapted incremental algorithm for hyper-sphere range queries.
Step 2: Combine the search results from all qi to find at least one appropriate record in S, i.e. a record that contains the returned NCR-Values with respect to each query condition. If no appropriate record is found, go back to step 1.
Step 3: Compute total distances/scores for the found records using formula 2 below and find a record Tmin with the minimum total distance TDcur. Ties are arbitrarily broken.
Step 4: Compute the maximum searching radius for each qi with respect to TDcur using formula 3 below, and continue the search as in steps 1, 2 and 3 until one of the following two conditions holds: (a) the current searching radius of each qi is greater than or equal to its maximum searching radius; (b) a new appropriate record Tnew is found with total distance TDnew < TDcur.
Step 5: If condition (a) holds, return Tmin as the best match for Q. Otherwise, i.e. condition (b) holds, replace Tmin with Tnew (so TDcur is replaced with the smaller value TDnew) and go back to step 4.
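The control flow of steps 1 to 5 can be sketched as below. Formulas 2 and 3 are not reproduced in this text, so the sketch makes loudly simplified assumptions: the total distance is the plain sum of the per-condition distances, the maximum searching radius of every qi is TDcur itself, all conditions share one radius that grows by a fixed step, and each feature space is one-dimensional and scanned linearly instead of through an index Fi. The name isa_best_match is hypothetical.

```python
def isa_best_match(records, query, step=1.0):
    """Sketch of ISA steps 1-5 under the assumptions stated above.

    records: list of tuples, one numeric NCR-value per query condition.
    query:   tuple with one target value per query condition.
    """
    if not records:
        return None, float("inf")
    n = len(query)
    radius = 0.0
    best, td_cur = None, float("inf")
    max_radius = float("inf")  # unknown until a first record is found
    while radius < max_radius:
        radius += step  # step 1: enlarge every hyper-sphere
        for rec in records:
            dists = [abs(rec[i] - query[i]) for i in range(n)]
            # Step 2: a record is appropriate once every value lies in range.
            if all(d <= radius for d in dists):
                td = sum(dists)  # step 3: assumed stand-in for formula 2
                if td < td_cur:  # condition (b): a better record was found
                    best, td_cur = rec, td
        # Step 4: assumed stand-in for formula 3; no single component of a
        # better record can exceed the current total distance.
        max_radius = td_cur
    return best, td_cur  # step 5, condition (a)


best, td = isa_best_match([(2.0, 2.0), (0.0, 3.0), (9.0, 9.0)], (0.0, 0.0))
print(best, td)
```

In this run the record (2.0, 2.0) is found first with total distance 4.0, and the search continues until (0.0, 3.0) with total distance 3.0 replaces it and the radius reaches the bound.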
Modifying ISA to retrieve top-k records: see [Dang et al 2002b]
High-dimensional feature spaces and/or an increasing number of query conditions:
ISA performance degrades
ε-ISA: FINDING APPROXIMATE NEAREST NEIGHBORS OF COMPLEX
VAGUE QUERIES
CVQ = M-FNN (Multi-Feature Nearest Neighbor) query
Using lower bound total distance (LBTD)
Input:
• A query relation/view S
• A complex vague query Q with n query conditions qi (i = 1, 2, …, n)
• Assume each feature space (or NCR-Table) related to Q is managed by a multidimensional index structure Fi
• A real ε > 0 used as the tolerated error
Output: (1+ε)-approximate NN record/tuple Tapp for Q, Tapp ∈ S. Ties are arbitrarily broken.
Step 1: Search on each Fi for the corresponding qi using the adapted incremental algorithm for hyper-sphere range queries.
Step 2: Combine the search results from all qi to find at least one appropriate record in S, i.e. a record that contains the returned NCR-Values with respect to each query condition. If no appropriate record is found, go back to step 1.
Step 3: Compute total distances/scores for the found records using formula 2 and find a record Tapp with the minimum total distance TDcur. Ties are arbitrarily broken.
Step 4: Let di be the distance from query condition qi to the last NCR-Value returned in the corresponding feature space, which is managed by Fi. Compute LBTD as follows:
$\mathrm{LBTD} = \min\{TD_{cur}, d_i\}, \quad i = 1, 2, \dots, n$ (5)
Step 5: If TDcur ≤ (1+ε)·LBTD, return Tapp as a (1+ε)-approximate NN record for Q. Otherwise, go to step 6.
Step 6: Compute the maximum searching radius for each qi with respect to TDcur using formula 3 and continue the search as in steps 1 to 5 until the algorithm stops at step 5. If the current searching radius of a certain qi is greater than or equal to its maximum searching radius, then searching on Fi is stopped.
See the next slide for an example.
Lower Bound Total Distance - An example
[Figure: a query relation QR over Attr1 and Attr2, and objects A, B, C, D shown relative to the query conditions q1 and q2.]
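The lower-bound stopping rule can be sketched in code. As before, this is an illustration under explicit assumptions rather than the authors' implementation: the total distance is taken as the plain sum of per-condition distances (formula 2 is not reproduced here), TDcur stands in for the maximum searching radius of formula 3, each feature space is one-dimensional, and a sorted list simulates the incremental search on each Fi. The name eps_isa is hypothetical.

```python
def eps_isa(records, query, eps):
    """Sketch of the eps-ISA steps 1-6 under the assumptions stated above."""
    if not records:
        return None, float("inf")
    n = len(query)
    # Simulated ranked input per feature space: (distance to qi, record index).
    ranked = [sorted((abs(r[i] - query[i]), idx) for idx, r in enumerate(records))
              for i in range(n)]
    seen = [set() for _ in range(n)]  # record indexes returned so far per space
    last = [0.0] * n                  # di: distance of the last value returned
    pos = [0] * n
    best, td_cur = None, float("inf")
    while True:
        progressed = False
        for i in range(n):  # step 1: advance the search in each feature space
            # Step 6: stop searching a space once its radius reaches TDcur.
            if pos[i] < len(ranked[i]) and last[i] < td_cur:
                d, idx = ranked[i][pos[i]]
                pos[i] += 1
                last[i] = d
                seen[i].add(idx)
                progressed = True
                if all(idx in s for s in seen):  # step 2: appropriate record
                    td = sum(abs(records[idx][j] - query[j]) for j in range(n))
                    if td < td_cur:              # step 3
                        best, td_cur = records[idx], td
        lbtd = min([td_cur] + last)              # step 4, formula 5
        if td_cur <= (1.0 + eps) * lbtd or not progressed:  # step 5
            return best, td_cur


data = [(2.0, 2.0), (0.0, 3.0), (9.0, 9.0)]
print(eps_isa(data, (0.0, 0.0), eps=1.0))
```

Under the sum assumption, any record not yet returned in every feature space has a total distance of at least the smallest di, which is why LBTD bounds the exact answer from below and the test in step 5 is safe.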
Approximate k-nearest neighbors
See our paper for more details
EXPERIMENTAL RESULTS
Data sets:
Uniformly distributed: 2, 4, and 8 dimensions (100K objects for each
of them)
Real: 9 and 16 dimensions (more than 64K feature vectors of
images, URL: http://kdd.ics.uci.edu/)
Using the SH-tree [Dang et al 2001a] to manage
multidimensional data
Page size: 8KB
100 query points were randomly selected from each
corresponding data set
...
2-condition (4-d and 8-d) NN queries, different ε values
2-condition (4-d) k-NN queries, ε = 0.2
3-condition (2-d) NN queries, different ε values
2-condition NN queries (9-d and 16-d real data sets), ε = 1
ε = 1 means that a tolerated error of up to 100% is permitted
ε-ISA saved about 4.5% and 1% of the affected objects and disk accesses, respectively, for the 16-d data set, while keeping the accuracy at 71%
One notable fact here is that the effective epsilon calculated as introduced in (Arya et al. 1998) is quite low, only 0.23. This is a very promising result.
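A small sketch of how such a figure could be computed, assuming that "effective epsilon" means the average relative error actually observed over the test queries, i.e. the mean of dist(Q, P)/dist(Q, P') - 1. This reading and the function name effective_epsilon are assumptions, and the numbers in the example are made up for illustration, not taken from the experiments.

```python
def effective_epsilon(approx_dists, exact_dists):
    """Mean relative error of approximate answers (assumed definition).

    approx_dists[k] is the distance of the returned record for query k,
    exact_dists[k] the distance of its exact nearest neighbor.
    """
    # Queries whose exact distance is zero are skipped to avoid dividing by zero.
    errors = [a / e - 1.0 for a, e in zip(approx_dists, exact_dists) if e > 0]
    return sum(errors) / len(errors) if errors else 0.0


# Illustrative values only: one query answered 50% worse, one answered exactly.
print(effective_epsilon([1.5, 2.0], [1.0, 2.0]))
```

An effective epsilon well below the permitted ε indicates that the returned records are, on average, much closer to the exact answers than the worst case allowed by condition (1).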
CONCLUSIONS
ε-ISA: An Incremental Lower Bound Approach for Efficiently Finding Approximate Nearest Neighbor of Multi-Feature Queries in VQS
ε-ISA is one of the vanguard solutions for dealing with this problem
ε-ISA is very useful for application domains in which the returned results need not be exact but only similar or approximately similar (within a certain tolerated error) to a given query. The experimental results support this: with a suitable ε value, ε-ISA can save a very high percentage of both the IO-cost and the CPU-cost while still keeping the accuracy of the returned results high
ε-ISA is applicable not only to numeric domains such as NCR-tables, but also to any ranked input
Application areas: TIS (tourist information systems), GIS, digital libraries, multimedia systems, etc.
More information
• URL: http://www.faw.uni-linz.ac.at/
• E-mail: {khanh, jkueng, rwagner}@faw.uni-linz.ac.at
Research related to dealing with complex vague queries
The A0 algorithm [Fagin 1996] (there are some improvements of Fagin's algorithm, see the paper for more details): finds top-k matches for a user query involving several multimedia attributes
Problem: this algorithm assumes that random access is possible in the system. This assumption holds only if the following three conditions hold:
1. there is at least one key for each subsystem,
2. there is a mapping between the keys,
3. and the mapping is guaranteed to be one-to-one
In VQS: condition (1) is always satisfied (each fuzzy field is the key of the corresponding NCR-table), but there is no one-to-one mapping between the fuzzy fields
Hence A0 cannot be applied to our problem
Research related to dealing with complex vague queries (cont.)
Other approaches for multimedia databases: [Ortega et al 1997, Chaudhuri et al 1996, Boehm K. et al 2001] (see our paper)
Chaudhuri et al. 1999 introduced a solution that translates a top-k multi-feature query into a range query that a conventional DBMS can process. This approach employs information from the histograms kept by a relational system
…
ISA and J* algorithm
The ISA:
• The input is ranked with the support of the incremental algorithm adapted for range queries
• Reduces the database access cost first; this cost and the processed states are reduced by taking into account the hyper-sphere range queries and computing the maximum searching radii
• Derived from the ICA, which had been introduced earlier and had the same overall goals as the J* algorithm
The J* algorithm:
• Assumes that the ranked input is available; does not show how to deal with it
• Reduces the processed states first; the database access cost is alleviated by the iterative deepening technique (S. Russell and P. Norvig: Artificial Intelligence: A Modern Approach. Prentice Hall, Inc., 1995)
• Claimed to be the first algorithm that can process "joins" of ranked input and multi-level joins