n-Gram/2L: A Space and Time Efficient
Two-Level n-Gram Inverted Index Structure
Aug. 31, 2005
Min-Soo Kim, Kyu-Young Whang, Jae-Gil Lee, and Min-Jae Lee
Department of Computer Science
Korea Advanced Institute of Science and Technology (KAIST)
VLDB 2005
Aug. 31, 2005 Dept. of Computer Science, KAIST
Contents
Introduction
Motivation and Goals
Structure of the n-Gram/2L Index
Analysis of the n-Gram/2L Index
Performance Evaluation
Conclusions
Inverted Index
- A term-oriented index structure for quickly searching documents containing a given term [BR1999]
- Most actively used for text searching
Classification (depending on the kind of terms) [WMB1999]
- Word-based inverted index
- n-gram inverted index (simply, the n-gram index): the scope of this talk
<Figure: a B+-tree index on terms pointing to the posting lists of the terms; a posting has the form d, [o1, …, of]>
d: document identifier; oi: offset where term t occurs in document d; f: frequency of occurrence of term t in document d
n-Gram Index
Definition of an n-gram: a string of fixed length n
Extraction method
① Sliding a window of length n by one character in the text
② Recording a sequence of characters in the window
(We call it the 1-sliding technique)
Example
<The document collection>
document 0: A B C D D A B B C D
document 1: D A B C D A B C D A
document 2: C D A B B C D D A B
document 3: B C D A B C D A B C
document 4: D D A B C D A B C D
document 5: B B C D A B C D A B

<2-gram inverted index> posting lists of 2-grams:
AB 0, [0, 5] 1, [1, 5] 2, [2, 8] 3, [3, 7] 4, [2, 6] 5, [4, 8]
BB 0, [6] 2, [3] 5, [0]
BC 0, [1, 7] 1, [2, 6] 2, [4] 3, [0, 4, 8] 4, [3, 7] 5, [1, 5]
CD 0, [2, 8] 1, [3, 7] 2, [0, 5] 3, [1, 5] 4, [4, 8] 5, [2, 6]
DA 0, [4] 1, [0, 4, 8] 2, [1, 7] 3, [2, 6] 4, [1, 5] 5, [3, 7]
DD 0, [3] 2, [6] 4, [0]
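The 1-sliding extraction above can be sketched in a few lines of Python (a minimal illustration, not the authors' implementation):

```python
def ngrams(text: str, n: int) -> list[tuple[str, int]]:
    """Extract n-grams by the 1-sliding technique: slide a window of
    length n over the text one character at a time and record the
    window's contents together with its offset."""
    return [(text[i:i + n], i) for i in range(len(text) - n + 1)]

# Document 0 of the example collection.
doc0 = "ABCDDABBCD"
index: dict[str, list[int]] = {}
for gram, offset in ngrams(doc0, 2):
    index.setdefault(gram, []).append(offset)
# index["AB"] == [0, 5] and index["DD"] == [3], matching the posting lists above.
```

Every position of the document starts exactly one n-gram (except the last n-1 positions), which is why the posting lists grow linearly with the text length.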
Pros and Cons of the n-Gram Index [BR1999,MM2003]
Pros
- Language-neutral
  - Allows us to disregard the characteristics of the language
  - Widely used for Asian languages and for DNA and protein databases
- Error-tolerant
  - Allows us to retrieve documents even when the query or the text contains some errors
  - Widely used for applications that allow errors (e.g., approximate matching)
Cons
- The index size tends to be large, and the query performance tends to be bad
Motivation
We note that the large size of the n-gram index is due to the redundancy in the position information: if a subsequence is repeated multiple times in the documents, the relative offsets (within that subsequence) of the n-grams extracted from it are also indexed multiple times
<The document collection>
document 1: … a1 a2 a3 a4 … b1 b2 b3 b4 …   (A at offset o1, B at offset o2)
document 2: … a1 a2 a3 a4 …                 (A at offset o3)
document N: … a1 a2 a3 a4 … b1 b2 b3 b4 …   (A at offset o4, B at offset o5)

<2-gram index> posting lists:
a1a2 1, [o1+0] 2, [o3+0] N, [o4+0]
a2a3 1, [o1+1] 2, [o3+1] N, [o4+1]
a3a4 1, [o1+2] 2, [o3+2] N, [o4+2]
b1b2 1, [o2+0] N, [o5+0]
b2b3 1, [o2+1] N, [o5+1]
b3b4 1, [o2+2] N, [o5+2]

The relative offsets +0, +1, +2 are indexed again for every occurrence of the subsequences A and B
We find that the two-level construction eliminates this redundancy: if the relative offsets of the n-grams extracted from a subsequence are indexed only once, the index size is reduced since the repetition is eliminated
<The document collection> (as before: A = a1 a2 a3 a4 at offsets o1, o3, o4; B = b1 b2 b3 b4 at offsets o2, o5)

<The two-level construction of the 2-gram index>
2-gram posting lists (offsets within subsequences):
a1a2 A, [0]
a2a3 A, [1]
a3a4 A, [2]
b1b2 B, [0]
b2b3 B, [1]
b3b4 B, [2]

subsequence posting lists (offsets within documents):
A 1, [o1] 2, [o3] N, [o4]
B 1, [o2] N, [o5]
Goals
We propose the two-level n-gram inverted index
(simply, n-gram/2L)
We show that the n-gram/2L index significantly reduces
the index size and improves the query performance over
the conventional n-gram index
Structure of the n-Gram/2L Index
Two-level structure
- Back-end index: stores the offsets of m-subsequences within documents
- Front-end index: stores the offsets of n-grams within m-subsequences
(m-subsequence: a subsequence of length m)

<front-end index> B+-tree on n-grams with posting lists of n-grams; a posting: v, [o1, …, of(v,t)]
  v: m-subsequence identifier; oi: offset where n-gram t occurs in m-subsequence v; f(v,t): frequency of occurrence of n-gram t in m-subsequence v
<back-end index> B+-tree on m-subsequences with posting lists of m-subsequences; a posting: d, [o1, …, of(d,s)]
  d: document identifier; oi: offset where m-subsequence s occurs in document d; f(d,s): frequency of occurrence of m-subsequence s in document d
Building of the n-Gram/2L Index
Algorithm
Step 1 (back-end index)
- Extract m-subsequences from the set of documents such that consecutive subsequences overlap with each other by n-1 characters
- Build the back-end index using the m-subsequences
Step 2 (front-end index)
- Extract n-grams from the set of m-subsequences
- Build the front-end index using the n-grams
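The two building steps can be sketched as follows over the slides' running example (a simplified illustration under the stated extraction rule; document lengths are assumed to align with the window stride, as they do in the example):

```python
from collections import defaultdict

def msubsequences(text: str, m: int, n: int) -> list[tuple[str, int]]:
    """Step 1a: extract m-subsequences so that consecutive ones overlap
    by n-1 characters, i.e. the window advances by m-(n-1) each time."""
    step = m - (n - 1)
    return [(text[i:i + m], i) for i in range(0, len(text) - m + 1, step)]

def build(docs: list[str], m: int, n: int):
    # Back-end index: m-subsequence -> {document id -> offsets in the document}.
    back: dict = defaultdict(lambda: defaultdict(list))
    for d, text in enumerate(docs):
        for s, off in msubsequences(text, m, n):
            back[s][d].append(off)
    # Front-end index: n-gram -> {m-subsequence -> offsets in the subsequence}.
    front: dict = defaultdict(lambda: defaultdict(list))
    for s in back:
        for i in range(len(s) - n + 1):
            front[s[i:i + n]][s].append(i)
    return front, back

docs = ["ABCDDABBCD", "DABCDABCDA", "CDABBCDDAB",
        "BCDABCDABC", "DDABCDABCD", "BBCDABCDAB"]
front, back = build(docs, m=4, n=2)
# back["ABCD"] == {0: [0], 3: [3], 4: [6]}; front["AB"]["ABCD"] == [0]
```

Note that the front-end index is built from the set of distinct m-subsequences, so each subsequence's n-grams are indexed only once no matter how often it occurs in the documents.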
Theorem 1: If m-subsequences are extracted such that consecutive
ones overlap with each other by n-1, no n-gram is missed or duplicated
Proof (sketch): Consider the boundary between two consecutive m-subsequences of a document. If the overlap were smaller than n-1 (e.g., n-2), the n-grams spanning the boundary would be missed; if it were larger than n-1 (e.g., n), some n-grams would be duplicated. With an overlap of exactly n-1, every n-gram of the document falls in exactly one m-subsequence
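Theorem 1 can be spot-checked on the example collection by reconstructing every n-gram occurrence from the m-subsequences and comparing against direct 1-sliding extraction (a sketch; it assumes the document length aligns with the window stride, as in the example):

```python
def direct(text: str, n: int) -> list[tuple[str, int]]:
    """All n-gram occurrences obtained by 1-sliding extraction."""
    return sorted((text[i:i + n], i) for i in range(len(text) - n + 1))

def via_subsequences(text: str, m: int, n: int) -> list[tuple[str, int]]:
    """All n-gram occurrences reconstructed from m-subsequences that
    overlap by n-1 characters (offsets translated back to the document)."""
    occs = []
    for i in range(0, len(text) - m + 1, m - (n - 1)):
        s = text[i:i + m]
        occs += [(s[j:j + n], i + j) for j in range(len(s) - n + 1)]
    return sorted(occs)

doc = "ABCDDABBCD"
# via_subsequences(doc, 4, 2) == direct(doc, 2): no n-gram occurrence
# is missed or duplicated.
```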
Query Processing Using the n-Gram/2L Index
Algorithm
Step 1 (front-end index)
Finding the m-subsequences that cover a query string by searching
the front-end index
Step 2 (back-end index)
Finding the documents that have a set of m-subsequences {Si}
containing the query string by searching the back-end index
Definition 1: Cover
S covers Q if an m-subsequence S and a query string Q satisfy one of the following four conditions:
① A suffix of S matches a prefix of Q
② The whole string of S matches a substring of Q
③ A prefix of S matches a suffix of Q
④ A substring of S matches the whole string of Q
Example
<Figure: four example pairs of S and Q, one illustrating each of the conditions ①-④>
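Definition 1 can be turned into a direct string check (a sketch: the paper finds covering m-subsequences via the front-end index rather than by scanning, and overlaps shorter than n characters are ignored here since they cannot be detected through n-grams):

```python
def covers(S: str, Q: str, n: int = 2) -> bool:
    """Definition 1: S covers Q if one of the four conditions holds."""
    if S in Q:                    # (2) the whole string of S matches a substring of Q
        return True
    if Q in S:                    # (4) a substring of S matches the whole string of Q
        return True
    for k in range(n, min(len(S), len(Q))):
        if S.endswith(Q[:k]):     # (1) a suffix of S matches a prefix of Q
            return True
        if S.startswith(Q[-k:]):  # (3) a prefix of S matches a suffix of Q
            return True
    return False

# With the example collection (m = 4) and Q = "BCDDAB":
# "BBCD" covers Q by condition (1), since its suffix "BCD" is a prefix of Q.
```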
Definition 2 (brief): Expand
The expand function expands a sequence of overlapping
character sequences into one character sequence
Definition 3: Contain
A set of m-subsequences {Si} contains a query string Q if {Si}
and Q satisfy the following condition:
Let SlSl+1...Sm be a sequence of m-subsequences overlapping with
each other in {Si}. A substring of expand(SlSl+1...Sm) matches the
whole string of Q
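Definitions 2 and 3 can be sketched directly (assuming, as in the index, that consecutive m-subsequences in a chain overlap by exactly n-1 characters):

```python
def expand(chain: list[str], n: int) -> str:
    """Definition 2: merge a sequence of character sequences, each
    overlapping its predecessor by n-1 characters, into one sequence."""
    out = chain[0]
    for s in chain[1:]:
        assert out[-(n - 1):] == s[:n - 1], "chain must overlap by n-1"
        out += s[n - 1:]
    return out

def contains(chain: list[str], Q: str, n: int) -> bool:
    """Definition 3: {Si} contains Q if a substring of the expanded
    chain matches the whole string of Q."""
    return Q in expand(chain, n)

# Document 0 ("ABCDDABBCD") is the expansion of its 4-subsequence chain:
# expand(["ABCD", "DDAB", "BBCD"], n=2) == "ABCDDABBCD"
```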
Cases of containment
- For Len(Q) ≥ m, case 1: a sequence {Si, Si+1, …, Sj} of consecutive overlapping m-subsequences contains Q
- For Len(Q) < m, case 2: a single m-subsequence {Sk} contains Q; case 3: two overlapping m-subsequences {Sp, Sq} contain Q
Lemma 1: A document that has a set of m-subsequences {Si} containing
the query string Q includes at least one m-subsequence covering Q
Algorithm (revisited)
Step 1 (front-end index)
- Finding the m-subsequences that cover the query string by searching the front-end index: retrieves candidate results satisfying the necessary condition
Step 2 (back-end index)
- Finding the documents that have a set of m-subsequences {Si} containing the query string by searching the back-end index: refines the candidate results
<A necessary condition>
A document d has a set of m-subsequences {Si} containing Q
⇒ A document d has at least one m-subsequence covering Q
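Putting the two steps together, here is a self-contained sketch of query processing over the example collection (the refinement step below simply re-checks the candidate documents for clarity; the paper's algorithm instead verifies containment by merging the offset lists of the back-end postings):

```python
from collections import defaultdict

def query(docs: list[str], Q: str, m: int = 4, n: int = 2) -> list[int]:
    # Back-end postings: m-subsequence -> documents containing it
    # (consecutive m-subsequences overlap by n-1 characters).
    back: dict = defaultdict(set)
    for d, text in enumerate(docs):
        for i in range(0, len(text) - m + 1, m - (n - 1)):
            back[text[i:i + m]].add(d)

    def covers(S: str) -> bool:  # Definition 1, overlaps of length >= n
        return (S in Q or Q in S or
                any(S.endswith(Q[:k]) or S.startswith(Q[-k:])
                    for k in range(n, min(len(S), len(Q)))))

    # Step 1 (necessary condition): documents with a covering m-subsequence.
    candidates = {d for S, ds in back.items() if covers(S) for d in ds}
    # Step 2 (refinement): keep only the documents that really contain Q.
    return sorted(d for d in candidates if Q in docs[d])

docs = ["ABCDDABBCD", "DABCDABCDA", "CDABBCDDAB",
        "BCDABCDABC", "DDABCDABCD", "BBCDABCDAB"]
# query(docs, "BCDDAB") == [0, 2]
```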
Formalization of the n-Gram/2L Index
We observe that the redundancy in the position
information existing in the n-gram index is caused by
non-trivial MultiValued Dependencies (MVDs)
We show that the n-gram/2L index can be derived by
eliminating that redundancy through relational
decomposition to the Fourth Normal Form (4NF)
MultiValued Dependency (MVD)
Definition [Ull1988]
Suppose we are given a relation schema R, and X and Y are subsets of R. X →→ Y holds in R if, whenever r is a relation for R and μ and ν are two tuples in r with μ[X] = ν[X] (that is, μ and ν agree on the attributes of X), then r also contains tuples φ and ψ, where
1. φ[X] = ψ[X] = μ[X] = ν[X]
2. φ[Y] = μ[Y] and φ[R−X−Y] = ν[R−X−Y]
3. ψ[Y] = ν[Y] and ψ[R−X−Y] = μ[R−X−Y]
Non-trivial MVD: Y ⊄ X and X ∪ Y ≠ R
Example (Z denotes R−X−Y)

R            decompose (4NF)    R1 (X, Y)    R2 (X, Z)
X  Y  Z                         a1 b1        a1 c1
a1 b1 c1                        a1 b2        a1 c2
a1 b2 c2                        a2 b3        a2 c3
a1 b1 c2                        a2 b4        a2 c4
a1 b2 c1
a2 b3 c3
a2 b3 c4
a2 b4 c3
a2 b4 c4

The non-trivial MVDs X →→ Y and X →→ Z hold in R; the decomposition (R1, R2) is in 4NF
Relational Representation for Theoretical Analysis
NDO relation
- Converts the n-gram index into a relation that obeys the First Normal Form (1NF)
- Has three attributes N, D, and O
  - N: n-grams; D: document identifiers; O: offsets of n-grams within documents
SNDO1O2 relation
- Adds a new attribute S and splits the attribute O into two attributes O1 and O2
- Has five attributes S, N, D, O1, and O2
  - S: m-subsequences in which n-grams appear
  - O1: offsets of n-grams within m-subsequences
  - O2: offsets of m-subsequences within documents
Example of Relational Representation (n-gram index → NDO relation → SNDO1O2 relation)

The 2-gram index of the example document collection is first normalized into the NDO relation (1NF): each posting d, [o1, …, of] in the posting list of an n-gram t becomes the f tuples (t, d, oi). For example, the posting 0, [0, 5] of AB becomes the tuples (AB, 0, 0) and (AB, 0, 5).

The SNDO1O2 relation is then obtained by adding the attribute S and splitting O into O1 and O2, so that O = O1 + O2 holds for every tuple.

For each m-subsequence, the tuples of the SNDO1O2 relation form a Cartesian product of its NO1 values and its DO2 values. For example, for S = ABCD, which contains the 2-grams AB, BC, CD at offsets 0, 1, 2 and occurs in document 0 at offset 0, document 3 at offset 3, and document 4 at offset 6, the nine tuples are the product of NO1 = {(AB, 0), (BC, 1), (CD, 2)} and DO2 = {(0, 0), (3, 3), (4, 6)}.

We see a Cartesian product of NO1 and DO2 for each m-subsequence in the SNDO1O2 relation: these are the MVDs
Normalization of the n-Gram Index
Lemma 2: Non-trivial MVDs S →→ NO1 and S →→ DO2 hold in the SNDO1O2 relation
Proof (sketch):
- The set of documents in which an m-subsequence occurs and the set of n-grams extracted from that m-subsequence are independent of each other
- Due to this independence, the relation contains the tuples corresponding to all possible combinations of documents and n-grams for a given m-subsequence
Lemma 3: The decomposition (SNO1, SDO2) is in 4NF
Proof: See the paper
Theorem 2: The 4NF decomposition (SNO1, SDO2) of the SNDO1O2
relation is identical to the front-end and back-end indexes of the
n-gram/2L index
Proof: See the paper
Example of Normalization Using Theorem 2

The 4NF decomposition splits the SNDO1O2 relation into the SNO1 relation and the SDO2 relation. Denormalizing the SNO1 relation yields the front-end index, and denormalizing the SDO2 relation yields the back-end index; each m-subsequence is assigned an identifier (0: ABCD, 1: BBCD, 2: BCDA, 3: CDAB, 4: DABC, 5: DDAB) that plays the role of the document identifier in the front-end index.

<The front-end index> 2-gram posting lists:
AB 0, [0] 3, [2] 4, [1] 5, [2]
BB 1, [0]
BC 0, [1] 1, [1] 2, [0] 4, [2]
CD 0, [2] 1, [2] 2, [1] 3, [0]
DA 2, [2] 3, [1] 4, [0] 5, [1]
DD 5, [0]

<The back-end index> 4-subsequence posting lists:
ABCD 0, [0] 3, [3] 4, [6]
BBCD 0, [6] 2, [3] 5, [0]
BCDA 1, [6] 3, [0] 4, [3]
CDAB 1, [3] 2, [0] 5, [6]
DABC 1, [0] 3, [6] 5, [3]
DDAB 0, [3] 2, [6] 4, [0]
Analysis of the n-Gram/2L Index
Notation
- avgdoc: the average number of occurrences of an m-subsequence in the documents
- avgngram: the average number of n-grams extracted from an m-subsequence
- Optimal length mo: the length of the m-subsequence that minimizes the size of the n-gram/2L index
<Figure: an m-subsequence is extracted into avgngram n-grams and occurs avgdoc times in the documents>
Index Size
Space complexities
- n-gram index: O(avgdoc × avgngram)
- n-gram/2L index: O(avgdoc + avgngram)
Properties
- mo is obtained by finding the length m that makes avgdoc = avgngram
- Both avgdoc and avgngram increase as the database size gets larger
Analytical results
- The size of the n-gram/2L index is significantly reduced compared with that of the n-gram index for a large database
- The reduction of the index size becomes more marked as the database size increases
- See the paper for the detailed analysis
Formulas for the index size
(SS: the set of distinct m-subsequences; kngram(s): the number of n-grams extracted from s; kdoc(s): the number of occurrences of s in the documents)

sizengram = Σ_{s∈SS} (kngram(s) × kdoc(s))   (1)
sizefront = Σ_{s∈SS} kngram(s)               (2)
sizeback  = Σ_{s∈SS} kdoc(s)                 (3)

sizengram / (sizefront + sizeback)
  = Σ_{s∈SS} (kngram(s) × kdoc(s)) / Σ_{s∈SS} (kngram(s) + kdoc(s))
  ≈ (|SS| × avgngram(SS) × avgdoc(SS)) / (|SS| × (avgngram(SS) + avgdoc(SS)))   (4)
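Formulas (1)-(4) can be checked on the slides' running example (m = 4, n = 2). Here kngram(s) is always m − n + 1, and only index entries are counted, not byte sizes:

```python
from collections import defaultdict

docs = ["ABCDDABBCD", "DABCDABCDA", "CDABBCDDAB",
        "BCDABCDABC", "DDABCDABCD", "BBCDABCDAB"]
m, n = 4, 2

# k_doc(s): number of occurrences of m-subsequence s in the documents.
k_doc: dict = defaultdict(int)
for text in docs:
    for i in range(0, len(text) - m + 1, m - (n - 1)):
        k_doc[text[i:i + m]] += 1
# k_ngram(s): number of n-grams extracted from s.
k_ngram = {s: len(s) - n + 1 for s in k_doc}

size_ngram = sum(k_ngram[s] * k_doc[s] for s in k_doc)  # formula (1)
size_front = sum(k_ngram.values())                      # formula (2)
size_back = sum(k_doc.values())                         # formula (3)
ratio = size_ngram / (size_front + size_back)           # formula (4)
# size_ngram == 54, size_front == 18, size_back == 18, ratio == 1.5
```

Even on this tiny collection the two-level construction stores 36 entries instead of 54; the product-versus-sum gap in formula (4) is what makes the reduction grow with the database size.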
Query Performance
Time complexities
- n-gram index: O(avgdoc × avgngram)
- n-gram/2L index: O(avgdoc + avgngram)
Analytical results
- The n-gram/2L index significantly improves the query performance over the n-gram index for a large database
- The improvement of the query performance gets better as the database size increases
- The query processing time increases only very slightly as the query length gets longer
  - It has been pointed out that the query performance of the n-gram index for long queries tends to be bad [Wil2003]
- See the paper for the detailed analysis
Formulas for the query performance

timengram = (sizengram / n) × (Len(Q) − n + 1)   (5)

timefront = (sizefront / n) × (Len(Q) − n + 1)   (6)

timeback = (sizeback / m) × ((Len(Q) − m + 1) + 2 Σ_{i=0}^{m−n−1} (m − n − i))          if Len(Q) ≥ m   (7)
timeback = (sizeback / m) × ((m − Len(Q) + 1) + 2 Σ_{i=0}^{Len(Q)−n−1} (Len(Q) − n − i))  if Len(Q) < m

timengram / (timefront + timeback)
  = (sizengram / n)(Len(Q) − n + 1) / ((sizefront / n)(Len(Q) − n + 1) + (sizeback / m)((Len(Q) − m + 1) + c))  if Len(Q) ≥ m   (8)
  = (sizengram / n)(Len(Q) − n + 1) / ((sizefront / n)(Len(Q) − n + 1) + (sizeback / m)((m − Len(Q) + 1) + d))  if Len(Q) < m
  where c = 2 Σ_{i=0}^{m−n−1} (i + 1) and d = 2 Σ_{i=0}^{Len(Q)−n−1} (i + 1)
Experiments
Measures
- Index size
- Query performance: number of page accesses, wall clock time (ms)
Data sets
- PROTEIN-DATA: the set of protein sequence databases used in bioinformatics
- TREC-DATA: the set of English text databases used in information retrieval
Parameters
- Data size = 10 MBytes, 100 MBytes, and 1 GBytes
- n = 3 (n-gram length) [Kuk1992, WZ2002]
- m = 4 ~ 6 (m-subsequence length)
- Len(Q) = 3, 6, 9, 12, 15, and 18 (query length)
index size ratio = (the number of pages allocated for the n-gram index) / (the number of pages allocated for the n-gram/2L index)
Index Size (PROTEIN-DATA)
- The size of the n-gram/2L index is significantly reduced compared with that of the n-gram index: by up to 2.7 times in PROTEIN-1G
- The reduction of the index size becomes more marked as the database size increases: approximately 25% for PROTEIN-DATA as the database size is increased tenfold (10 MBytes → 100 MBytes → 1 GBytes)
<Figure: index size ratio vs. m-subsequence length m (4, 5, 6) for PROT-10M, PROT-100M, and PROT-1G, with the optimal length mo marked>
Query Performance (PROTEIN-DATA)
- The n-gram/2L index significantly improves the query performance over the n-gram index: by up to 13.1 times in wall clock time (PROTEIN-1G)
- The improvement gets better as the database size increases: 1.37 times in PROTEIN-100M; 6.65 times in PROTEIN-1G
- The query processing time increases only very slightly as the query length gets longer: by 53% for the n-gram/2L index as Len(Q) grows from 3 to 18 (cf. 32.9 times for the n-gram index)
<Figures: wall clock time vs. query length Len(Q) and number of page accesses vs. query length for the data set PROTEIN-1G (3-gram index vs. 3-gram/2L index, m = 4); wall clock time vs. data size (10M, 100M, 1G) for Len(Q) = 3~18>
Conclusions
We have shown that the redundancy in the position information
existing in the n-gram index is due to non-trivial MVDs
We have proposed the two-level structure of the n-gram index
We have shown that the n-gram/2L index is derived by the relational
normalization process that decomposes the n-gram index into 4NF
We have provided a formal analysis of the space and time complexities of the n-gram/2L index
Finally, through extensive experiments, we have shown that the n-gram/2L index significantly reduces the index size and improves the query performance compared with the n-gram index
References
[BR1999] Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern Information Retrieval, ACM Press, 1999.
[Coh1997] Jonathan D. Cohen, “Recursive Hashing Functions for n-Grams,” ACM Trans. on Information
Systems, Vol. 15, No. 3, pp. 291-320, July 1997.
[EN2003] Ramez Elmasri and Shamkant B. Navathe, Fundamentals of Database Systems, Addison Wesley,
4th ed., 2003.
[Kuk1992] Karen Kukich, “Techniques for Automatically Correcting Words in Text,” ACM Computing Surveys,
Vol. 24, No. 4, pp. 377-439, Dec. 1992.
[LA1996] Joon Ho Lee and Jeong Soo Ahn, “Using n-Grams for Korean Text Retrieval,” In Proc. Int'l Conf. on
Information Retrieval, ACM SIGIR, Zurich, Switzerland, pp. 216-224, 1996.
[MM2003] James Mayfield and Paul McNamee, “Single N-gram Stemming,” In Proc. Int'l Conf. on Information
Retrieval, ACM SIGIR, Toronto, Canada, pp. 415-416, July/Aug. 2003.
[MSL+2000] Ethan Miller, Dan Shen, Junli Liu, and Charles Nicholas, “Performance and Scalability of
a Large-Scale N-gram Based Information Retrieval System,” Journal of Digital Information 1(5), pp.
1-25, Jan. 2000.
[MZ1996] Alistair Moffat and Justin Zobel, “Self-indexing inverted files for fast text retrieval,” ACM Trans. on
Information Systems, Vol. 14, No. 4, pp. 349-379, Oct. 1996.
[Nav2001] Gonzalo Navarro, “A Guided Tour to Approximate String Matching,” ACM Computing Surveys, Vol.
33, No. 1, pp. 31-88, Mar. 2001.
[Ram1998] Raghu Ramakrishnan, Database Management Systems, McGraw-Hill, 1998.
[SKS2001] Abraham Silberschatz, Henry F. Korth, and S. Sudarshan, Database Systems Concepts, McGraw-Hill, 4th
ed., 2001.
[SWY+2002] Falk Scholer, Hugh E. Williams, John Yiannis and Justin Zobel, “Compression of Inverted
Indexes for Fast Query Evaluation,” In Proc. Int'l Conf. on Information Retrieval, ACM SIGIR, Tampere,
Finland, pp. 222-229, Aug. 2002.
[Ull1988] Jeffrey D. Ullman, Principles of Database and Knowledge-Base Systems Vol. I, Computer Science Press,
USA, 1988.
[Wil2003] Hugh E. Williams, “Genomic Information Retrieval,” In Proc. the 14th Australasian Database Conference,
2003.
[WLL+2005] Kyu-Young Whang, Min-Jae Lee, Jae-Gil Lee, Min-Soo Kim, and Wook-Shin Han, “Odysseus: a
High-Performance ORDBMS Tightly-Coupled with IR Features,” In Proc. the 21st IEEE Int'l Conf. on Data
Engineering (ICDE), Tokyo, Japan, Apr. 2005.
[WMB1999] I. Witten, A. Moffat, and T. Bell, Managing Gigabytes: Compressing and Indexing Documents
and Images, Morgan Kaufmann Publishers, Los Altos, California, 2nd ed., 1999.
[WVT1990] Kyu-Young Whang, Brad T. Vander-Zanden, and Howard M. Taylor, “A Linear-Time Probabilistic Counting
Algorithm for Database Applications,” ACM Trans. on Database Systems, Vol. 15, No. 2, pp. 208-229, June
1990.
[WZ2002] Hugh E. Williams and Justin Zobel, “Indexing and Retrieval for Genomic Databases,” IEEE Trans. on
Knowledge and Data Engineering, Vol. 14, No. 1, pp. 63-78, Jan./Feb. 2002.
[YT1998] Ogawa Yasushi and Matsuda Toru, “Optimizing query evaluation in n-gram indexing,” In Proc. Int'l Conf.
on Information Retrieval, ACM SIGIR, Melbourne, Australia, pp. 367-368, 1998.