N-Gram/2L: A Space and Time Efficient Two-Level n-Gram Inverted Index Structure Aug. 31, 2005...

n-Gram/2L: A Space and Time Efficient

Two-Level n-Gram Inverted Index Structure

Aug. 31, 2005

Min-Soo Kim, Kyu-Young Whang, Jae-Gil Lee, and Min-Jae Lee

Department of Computer Science

Korea Advanced Institute of Science and Technology (KAIST)

VLDB 2005

Aug. 31, 2005 Dept. of Computer Science, KAIST


Contents

Introduction

Motivation and Goals

Structure of the n-Gram/2L Index

Analysis of the n-Gram/2L Index

Performance Evaluation

Conclusions



Inverted Index

A term-oriented index structure for quickly searching documents containing a given term [BR1999]

Most actively used for text searching

Classification (depending on the kind of terms) [WMB1999]

Word-based inverted index

n-gram inverted index (simply, the n-gram index): the scope of this talk

<Figure: a B+-Tree index on terms, pointing to the posting lists of the terms>

A posting: d, [o1, …, of]

d: document identifier
oi: offset where term t occurs in document d
f: frequency of occurrence of term t in document d



n-Gram Index

n-Gram Definition: a string of fixed length n

Extraction method

① Sliding a window of length n by one character in the text

② Recording a sequence of characters in the window

(We call it the 1-sliding technique)

Example

<The document collection> (offsets within a document run from 0 to 9)

document 0: A B C D D A B B C D
document 1: D A B C D A B C D A
document 2: C D A B B C D D A B
document 3: B C D A B C D A B C
document 4: D D A B C D A B C D
document 5: B B C D A B C D A B

<2-gram inverted index>

2-gram   posting list
AB       0, [0, 5]   1, [1, 5]   2, [2, 8]   3, [3, 7]   4, [2, 6]   5, [4, 8]
BB       0, [6]   2, [3]   5, [0]
BC       0, [1, 7]   1, [2, 6]   2, [4]   3, [0, 4, 8]   4, [3, 7]   5, [1, 5]
CD       0, [2, 8]   1, [3, 7]   2, [0, 5]   3, [1, 5]   4, [4, 8]   5, [2, 6]
DA       0, [4]   1, [0, 4, 8]   2, [1, 7]   3, [2, 6]   4, [1, 5]   5, [3, 7]
DD       0, [3]   2, [6]   4, [0]
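The 1-sliding technique can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `build_ngram_index` and the dictionary layout are assumptions made here, and a plain dictionary stands in for the B+-Tree.

```python
from collections import defaultdict

def build_ngram_index(docs, n):
    """Map each n-gram to {document id: [offsets]} (hypothetical layout)."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        # Slide a window of length n over the text, one character at a time.
        for offset in range(len(text) - n + 1):
            gram = text[offset:offset + n]
            index[gram].setdefault(doc_id, []).append(offset)
    return dict(index)

# The six example documents of this slide.
DOCS = ["ABCDDABBCD", "DABCDABCDA", "CDABBCDDAB",
        "BCDABCDABC", "DDABCDABCD", "BBCDABCDAB"]

index = build_ngram_index(DOCS, 2)
total_offsets = sum(len(offs) for posting in index.values() for offs in posting.values())
print(index["BB"])      # posting list of the 2-gram BB
print(total_offsets)    # number of offsets stored in the whole index
```

Each document of length 10 yields nine 2-grams, so the index stores one offset per 2-gram occurrence.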



Pros and Cons of the n-Gram Index [BR1999,MM2003]

Pros:

Language-neutral

Allowing us to disregard the characteristics of the language

Being widely used for Asian languages or DNA and protein databases

Error-tolerant

Allowing us to retrieve documents with some errors in the query result

Being widely used for applications that allow errors (e.g., approximate matching)

Cons:

The size tends to be large, and the query performance tends to be bad



Motivation

We note that the large size of the n-gram index is due to the redundancy in the position information

If a subsequence is repeated multiple times in documents, the relative offsets (within the subsequence) of the n-grams extracted from that subsequence would also be indexed multiple times

<The document collection> (A = a1 a2 a3 a4 and B = b1 b2 b3 b4 are repeated subsequences)

document 1: ... A (at offset o1) ... B (at offset o2) ...
document 2: ... A (at offset o3) ...
...
document N: ... A (at offset o4) ... B (at offset o5) ...

<2-gram index>

2-gram   posting list
a1a2     1, [o1+0]   2, [o3+0]   N, [o4+0]
a2a3     1, [o1+1]   2, [o3+1]   N, [o4+1]
a3a4     1, [o1+2]   2, [o3+2]   N, [o4+2]
b1b2     1, [o2+0]   N, [o5+0]
b2b3     1, [o2+1]   N, [o5+1]
b3b4     1, [o2+2]   N, [o5+2]



We find that the two-level construction eliminates that redundancy

If the relative offsets of n-grams extracted from a subsequence are indexed only once, the index size would be reduced since such repetition is eliminated

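<The document collection> (A = a1 a2 a3 a4 and B = b1 b2 b3 b4 are repeated subsequences)

document 1: ... A (at offset o1) ... B (at offset o2) ...
document 2: ... A (at offset o3) ...
...
document N: ... A (at offset o4) ... B (at offset o5) ...

<The two-level construction of the 2-gram index>

2-gram   posting list (subsequence, [offsets within the subsequence])
a1a2     A, [0]
a2a3     A, [1]
a3a4     A, [2]
b1b2     B, [0]
b2b3     B, [1]
b3b4     B, [2]

subsequence   posting list (document, [offsets within the document])
A             1, [o1]   2, [o3]   N, [o4]
B             1, [o2]   N, [o5]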



Goals

We propose the two-level n-gram inverted index (simply, n-gram/2L)

We show that the n-gram/2L index significantly reduces the index size and improves the query performance over the conventional n-gram index



Structure of the n-Gram/2L Index

Two-level structure

Back-end index: storing the offsets of m-subsequences within documents

Front-end index: storing the offsets of n-grams within m-subsequences

(m-subsequence: a subsequence of length m)

<front-end index>

B+-Tree on n-grams, pointing to the posting lists of n-grams

A posting: v, [o1, …, of(v,t)]

v: m-subsequence identifier
oi: offset where n-gram t occurs in m-subsequence v
f(v,t): frequency of occurrence of n-gram t in m-subsequence v

<back-end index>

B+-Tree on m-subsequences, pointing to the posting lists of m-subsequences

A posting: d, [o1, …, of(d,s)]

d: document identifier
oi: offset where m-subsequence s occurs in document d
f(d,s): frequency of occurrence of m-subsequence s in document d



Building of the n-Gram/2L Index

Algorithm

Step 1 (back-end index)

Extracting m-subsequences from a set of documents such that consecutive subsequences overlap with each other by n-1

Building the back-end index using the m-subsequences

Step 2 (front-end index)

Extracting n-grams from the set of m-subsequences

Building the front-end index using the n-grams
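The two building steps can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code: the name `build_2l_index` is hypothetical, dictionaries stand in for the B+-Trees, the m-subsequence string itself is used as its identifier, and a final subsequence shorter than m is kept as it is (the slides do not say how the tail of a document is handled).

```python
from collections import defaultdict

def build_2l_index(docs, n, m):
    """Return (front_end, back_end) as plain dictionaries (hypothetical layout)."""
    step = m - n + 1  # consecutive m-subsequences overlap by n-1 characters

    # Step 1 (back-end index): m-subsequence -> {document id: [offsets in document]}
    back_end = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for offset in range(0, len(text) - n + 1, step):
            sub = text[offset:offset + m]
            back_end[sub].setdefault(doc_id, []).append(offset)

    # Step 2 (front-end index): n-gram -> {m-subsequence: [offsets in subsequence]}
    front_end = defaultdict(dict)
    for sub in back_end:
        for offset in range(len(sub) - n + 1):
            front_end[sub[offset:offset + n]].setdefault(sub, []).append(offset)

    return dict(front_end), dict(back_end)

DOCS = ["ABCDDABBCD", "DABCDABCDA", "CDABBCDDAB",
        "BCDABCDABC", "DDABCDABCD", "BBCDABCDAB"]

front_end, back_end = build_2l_index(DOCS, n=2, m=4)
print(back_end["ABCD"])   # documents and offsets where ABCD occurs
print(front_end["AB"])    # 4-subsequences and offsets where AB occurs
```

With n = 2 and m = 4 the subsequences start at offsets 0, 3, and 6, so each 10-character example document splits into exactly three 4-subsequences.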



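Theorem 1: If m-subsequences are extracted such that consecutive ones overlap with each other by n-1, no n-gram is missed or duplicated

Proof (sketch):

<Figure: a document, the m-subsequences extracted from it, and the n-grams extracted from those m-subsequences, for overlaps of n-1, n, and n-2>

Overlap of n-1: every n-gram of the document lies in exactly one m-subsequence

Overlap of n: an n-gram lying inside the overlap is extracted from both m-subsequences (duplicated)

Overlap of n-2: an n-gram spanning the boundary lies in neither m-subsequence (missed)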
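Theorem 1 can be checked on a small string. The sketch below compares the n-grams obtained by 1-sliding over a document with those obtained through m-subsequences for three overlap lengths; the function names and the example string are choices made here for illustration.

```python
from collections import Counter

def ngrams_by_sliding(text, n):
    """Multiset of (n-gram, offset) pairs from the 1-sliding technique."""
    return Counter((text[i:i + n], i) for i in range(len(text) - n + 1))

def ngrams_via_subsequences(text, n, m, overlap):
    """Multiset of (n-gram, document offset) pairs extracted through m-subsequences."""
    step = m - overlap
    found = Counter()
    start = 0
    while True:
        sub = text[start:start + m]
        for i in range(len(sub) - n + 1):
            found[(sub[i:i + n], start + i)] += 1
        if start + m >= len(text):  # this subsequence reaches the end of the text
            break
        start += step
    return found

text, n, m = "ABCDDABBCD", 2, 4
direct = ngrams_by_sliding(text, n)
exact = ngrams_via_subsequences(text, n, m, overlap=n - 1)
duplicated = ngrams_via_subsequences(text, n, m, overlap=n) - direct
missed = direct - ngrams_via_subsequences(text, n, m, overlap=n - 2)

print(exact == direct)      # overlap n-1: same multiset as 1-sliding
print(sorted(duplicated))   # overlap n: n-grams extracted twice
print(sorted(missed))       # overlap n-2: n-grams never extracted
```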



Query Processing Using the n-Gram/2L Index

Algorithm

Step 1: Finding the m-subsequences that cover a query string by searching the front-end index

Step 2: Finding the documents that have a set of m-subsequences {Si} containing the query string by searching the back-end index



Definition 1: Cover

S covers Q if an m-subsequence S and a query string Q satisfy one of the following four conditions:

① A suffix of S matches a prefix of Q

② The whole string of S matches a substring of Q

③ A prefix of S matches a suffix of Q

④ A substring of S matches the whole string of Q

Example

<Figure: one pair of an m-subsequence S and a query string Q for each of the four conditions, numbered 1 to 4>
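The four conditions of Definition 1 translate into a short predicate. This is a sketch, not the authors' code: the function name `covers` and the `min_len` parameter are assumptions introduced here. The definition on the slide does not state a minimum match length, so the default of 1 takes it literally; an n-gram-based implementation would presumably pass `min_len=n`. The example strings are also chosen here.

```python
def covers(s, q, min_len=1):
    """True if m-subsequence s covers query string q (Definition 1)."""
    # Condition 2 (all of s matches a substring of q) or
    # condition 4 (a substring of s matches all of q).
    if s in q or q in s:
        return True
    # Proper overlaps of length k, shorter than both strings.
    for k in range(min_len, min(len(s), len(q))):
        if s[-k:] == q[:k]:   # condition 1: a suffix of s matches a prefix of q
            return True
        if s[:k] == q[-k:]:   # condition 3: a prefix of s matches a suffix of q
            return True
    return False

print(covers("ABBC", "BCDDAB"))    # condition 1: suffix BC of S, prefix BC of Q
print(covers("BCDD", "ABBCDDAB"))  # condition 2: S is a substring of Q
print(covers("DABB", "BCDDA"))     # condition 3: prefix DA of S, suffix DA of Q
print(covers("CDDA", "DD"))        # condition 4: Q is a substring of S
print(covers("AAAA", "BBBB"))      # no condition holds
```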



Definition 2 (brief): Expand

The expand function expands a sequence of overlapping character sequences into one character sequence

Definition 3: Contain

A set of m-subsequences {Si} contains a query string Q if {Si} and Q satisfy the following condition:

Let S_l S_{l+1} ... S_m be a sequence of m-subsequences in {Si} overlapping with each other. A substring of expand(S_l S_{l+1} ... S_m) matches the whole string of Q



Cases of containment

For Len(Q) ≥ m:

case 1: {Si, Si+1, ..., Sj} contains Q

For Len(Q) < m:

case 2: {Sk} contains Q

case 3: {Sp, Sq} contains Q

<Figure: the query string Q aligned against the m-subsequences of a document in each of the three cases>



Lemma 1: A document that has a set of m-subsequences {Si} containing the query string Q includes at least one m-subsequence covering Q

Algorithm (revisited)

Step 1: Finding the m-subsequences that cover a query string by searching the front-end index, for retrieving candidate results satisfying the necessary condition

Step 2: Finding the documents that have a set of m-subsequences {Si} containing the query string by searching the back-end index, for refining candidate results

A document d has a set of m-subsequences {Si} containing Q ⇒ A document d has at least one m-subsequence covering Q <a necessary condition>
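The two query-processing steps can be sketched end to end. This is an illustrative sketch only, and several things in it are assumptions rather than the authors' method: all function names are hypothetical, the m-subsequence string serves as its own identifier, a cover must match at least n characters, the query has at least n characters, and step 2 decides containment by expanding runs of covering m-subsequences that sit at consecutive extraction offsets.

```python
from collections import defaultdict

def build_2l_index(docs, n, m):
    step = m - n + 1
    back_end = defaultdict(dict)    # m-subsequence -> {doc id: [offsets]}
    for doc_id, text in enumerate(docs):
        for offset in range(0, len(text) - n + 1, step):
            back_end[text[offset:offset + m]].setdefault(doc_id, []).append(offset)
    front_end = defaultdict(dict)   # n-gram -> {m-subsequence: [offsets]}
    for sub in back_end:
        for offset in range(len(sub) - n + 1):
            front_end[sub[offset:offset + n]].setdefault(sub, []).append(offset)
    return front_end, back_end

def covers(s, q, min_len):
    if s in q or q in s:
        return True
    return any(s[-k:] == q[:k] or s[:k] == q[-k:]
               for k in range(min_len, min(len(s), len(q))))

def search(query, front_end, back_end, n, m):
    step = m - n + 1
    # Step 1: candidate m-subsequences share an n-gram with the query
    # (front-end lookup); keep only those that cover the query.
    candidates = set()
    for i in range(len(query) - n + 1):
        candidates.update(front_end.get(query[i:i + n], {}))
    covering = {s for s in candidates if covers(s, query, n)}

    # Step 2: place the covering m-subsequences in each document (back-end lookup).
    placed = defaultdict(dict)      # doc id -> {offset: m-subsequence}
    for s in covering:
        for doc_id, offsets in back_end[s].items():
            for offset in offsets:
                placed[doc_id][offset] = s

    results = set()
    for doc_id, subs in placed.items():
        expanded, prev = "", None
        for offset in sorted(subs):
            if prev is not None and offset - prev == step:
                expanded += subs[offset][n - 1:]   # expand: skip the n-1 overlap
            else:
                if query in expanded:
                    results.add(doc_id)
                expanded = subs[offset]            # start a new run
            prev = offset
        if query in expanded:
            results.add(doc_id)
    return sorted(results)

DOCS = ["ABCDDABBCD", "DABCDABCDA", "CDABBCDDAB",
        "BCDABCDABC", "DDABCDABCD", "BBCDABCDAB"]
FRONT, BACK = build_2l_index(DOCS, n=2, m=4)
print(search("BCDD", FRONT, BACK, n=2, m=4))   # documents containing BCDD
```

Step 1 only applies the necessary condition of Lemma 1, so it may keep m-subsequences from documents that do not contain the query; the substring test on the expanded run in step 2 removes those documents.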



Formalization of the n-Gram/2L Index

We observe that the redundancy in the position information existing in the n-gram index is caused by non-trivial MultiValued Dependencies (MVDs)

We show that the n-gram/2L index can be derived by eliminating that redundancy through relational decomposition to the Fourth Normal Form (4NF)



MultiValued Dependency (MVD)

Definition [Ull1988]

Suppose we are given a relation schema R, and X and Y are subsets of R. X →→ Y holds in R if, whenever r is a relation for R, and μ and ν are two tuples in r with μ[X] = ν[X] (that is, μ and ν agree on the attributes of X), then r also contains tuples φ and ψ, where

1. φ[X] = ψ[X] = μ[X] = ν[X]

2. φ[Y] = μ[Y] and φ[R-X-Y] = ν[R-X-Y]

3. ψ[Y] = ν[Y] and ψ[R-X-Y] = μ[R-X-Y]

Non-trivial MVD: Y ⊈ X and X ∪ Y ≠ R

Example

R (the MVDs X →→ Y and X →→ R-X-Y hold)

X    Y    R-X-Y
a1   b1   c1
a1   b2   c2
a1   b1   c2
a1   b2   c1
a2   b3   c3
a2   b3   c4
a2   b4   c3
a2   b4   c4

decompose R into R1 and R2 (4NF)

R1

X    Y
a1   b1
a1   b2
a2   b3
a2   b4

R2

X    R-X-Y
a1   c1
a1   c2
a2   c3
a2   c4
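The example can be checked mechanically. By a standard result on multivalued dependencies, X →→ Y holds in a relation exactly when the relation equals the natural join of its projections on (X, Y) and (X, R-X-Y). The sketch below applies that test to the eight tuples of R; the function name `satisfies_mvd` and the fixed three-column layout are assumptions made here for illustration.

```python
def satisfies_mvd(relation):
    """Check X ->-> Y for 3-column tuples (x, y, z) via the lossless-join test."""
    r1 = {(x, y) for x, y, _ in relation}          # projection on X, Y
    r2 = {(x, z) for x, _, z in relation}          # projection on X, R-X-Y
    rejoined = {(x, y, z) for x, y in r1 for x2, z in r2 if x == x2}
    return rejoined == relation

R = {
    ("a1", "b1", "c1"), ("a1", "b2", "c2"), ("a1", "b1", "c2"), ("a1", "b2", "c1"),
    ("a2", "b3", "c3"), ("a2", "b3", "c4"), ("a2", "b4", "c3"), ("a2", "b4", "c4"),
}
R1 = {(x, y) for x, y, _ in R}
R2 = {(x, z) for x, _, z in R}

print(satisfies_mvd(R))                               # the MVD holds in R
print(len(R), len(R1), len(R2))                       # 8 tuples versus 4 + 4
print(satisfies_mvd(R - {("a1", "b1", "c2")}))        # removing one tuple breaks it
```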



Relational Representation for Theoretical Analysis

NDO relation

Converting the n-gram index so that it obeys the First Normal Form (1NF)

Having three attributes N, D, and O

N: n-grams
D: document identifiers
O: offsets of n-grams within documents

SNDO1O2 relation

Adding a new attribute S and splitting the attribute O into two attributes O1 and O2

Having five attributes S, N, D, O1, and O2

S: m-subsequences in which n-grams appear
O1: offsets of n-grams within m-subsequences
O2: offsets of m-subsequences within documents

n-gram index → NDO relation → SNDO1O2 relation

Example of Relational Representation

N: n-grams, D: document identifiers, O: offsets

The 2-gram index of the six example documents (document 0 to document 5) is normalized into the NDO relation: every offset in every posting becomes one tuple.

<NDO relation (1NF)> (54 tuples, written as (D, O) pairs for each N)

N    (D, O)
AB   (0,0) (0,5) (1,1) (1,5) (2,2) (2,8) (3,3) (3,7) (4,2) (4,6) (5,4) (5,8)
BB   (0,6) (2,3) (5,0)
BC   (0,1) (0,7) (1,2) (1,6) (2,4) (3,0) (3,4) (3,8) (4,3) (4,7) (5,1) (5,5)
CD   (0,2) (0,8) (1,3) (1,7) (2,0) (2,5) (3,1) (3,5) (4,4) (4,8) (5,2) (5,6)
DA   (0,4) (1,0) (1,4) (1,8) (2,1) (2,7) (3,2) (3,6) (4,1) (4,5) (5,3) (5,7)
DD   (0,3) (2,6) (4,0)

<SNDO1O2 relation> (54 tuples)

S: m-subsequences, O1: offsets of n-grams, O2: offsets of m-subsequences

Each NDO tuple gains the 4-subsequence S in which the 2-gram appears, and O is split so that O = O1 + O2. For example, (BC, 0, 7) becomes (BBCD, BC, 0, 1, 6), since 7 = 1 + 6.

Tuples for S = ABCD:

S      N    D   O1   O2
ABCD   AB   0   0    0
ABCD   AB   3   0    3
ABCD   AB   4   0    6
ABCD   BC   0   1    0
ABCD   BC   3   1    3
ABCD   BC   4   1    6
ABCD   CD   0   2    0
ABCD   CD   3   2    3
ABCD   CD   4   2    6

We see a Cartesian product of NO1 and DO2 in the SNDO1O2 relation (MVD). For every S, the tuples are all combinations of its (N, O1) pairs with its (D, O2) pairs:

S      (N, O1)                  (D, O2)
ABCD   (AB,0) (BC,1) (CD,2)     (0,0) (3,3) (4,6)
BBCD   (BB,0) (BC,1) (CD,2)     (0,6) (2,3) (5,0)
BCDA   (BC,0) (CD,1) (DA,2)     (1,6) (3,0) (4,3)
CDAB   (CD,0) (DA,1) (AB,2)     (1,3) (2,0) (5,6)
DABC   (DA,0) (AB,1) (BC,2)     (1,0) (3,6) (5,3)
DDAB   (DD,0) (DA,1) (AB,2)     (0,3) (2,6) (4,0)



Normalization of the n-Gram Index

Lemma 2: Non-trivial MVDs S →→ NO1 and S →→ DO2 hold in the SNDO1O2 relation

Proof (sketch):

The set of documents where an m-subsequence occurs and the set of n-grams extracted from that m-subsequence are independent of each other

Due to this independence, there exist tuples corresponding to all possible combinations of documents and n-grams for a given m-subsequence

Lemma 3: The decomposition (SNO1, SDO2) is in 4NF

Proof: See the paper

Theorem 2: The 4NF decomposition (SNO1, SDO2) of the SNDO1O2 relation is identical to the front-end and back-end indexes of the n-gram/2L index

Proof: See the paper

Example of Normalization Using Theorem 2

The number in parentheses after an m-subsequence is the m-subsequence identifier.

<SNO1 relation> (18 tuples, written as (S, O1) pairs for each N, with S given by its identifier)

N    (S, O1)
AB   (0,0) (3,2) (4,1) (5,2)
BB   (1,0)
BC   (0,1) (1,1) (2,0) (4,2)
CD   (0,2) (1,2) (2,1) (3,0)
DA   (2,2) (3,1) (4,0) (5,1)
DD   (5,0)

denormalize

<The front-end index>

2-gram   posting list
AB       0, [0]   3, [2]   4, [1]   5, [2]
BB       1, [0]
BC       0, [1]   1, [1]   2, [0]   4, [2]
CD       0, [2]   1, [2]   2, [1]   3, [0]
DA       2, [2]   3, [1]   4, [0]   5, [1]
DD       5, [0]

<SDO2 relation> (18 tuples, written as (D, O2) pairs for each S)

S          (D, O2)
ABCD (0)   (0,0) (3,3) (4,6)
BBCD (1)   (0,6) (2,3) (5,0)
BCDA (2)   (1,6) (3,0) (4,3)
CDAB (3)   (1,3) (2,0) (5,6)
DABC (4)   (1,0) (3,6) (5,3)
DDAB (5)   (0,3) (2,6) (4,0)

denormalize

<The back-end index>

4-subsequence   posting list
ABCD            0, [0]   3, [3]   4, [6]
BBCD            0, [6]   2, [3]   5, [0]
BCDA            1, [6]   3, [0]   4, [3]
CDAB            1, [3]   2, [0]   5, [6]
DABC            1, [0]   3, [6]   5, [3]
DDAB            0, [3]   2, [6]   4, [0]
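The decomposition can be reproduced on the example documents. The sketch below builds the SNDO1O2 relation as a set of tuples, projects it onto (S, N, O1) and (S, D, O2), and checks that joining the two projections on S gives back the original relation. Variable names and the set-of-tuples representation are assumptions made for illustration; m-subsequences are kept as strings rather than numeric identifiers.

```python
DOCS = ["ABCDDABBCD", "DABCDABCDA", "CDABBCDDAB",
        "BCDABCDABC", "DDABCDABCD", "BBCDABCDAB"]
N_LEN, M_LEN = 2, 4
STEP = M_LEN - N_LEN + 1   # consecutive m-subsequences overlap by n-1

# SNDO1O2: (m-subsequence, n-gram, document, offset in subsequence, offset in document)
sndo1o2 = set()
for d, text in enumerate(DOCS):
    for o2 in range(0, len(text) - N_LEN + 1, STEP):
        s = text[o2:o2 + M_LEN]
        for o1 in range(len(s) - N_LEN + 1):
            sndo1o2.add((s, s[o1:o1 + N_LEN], d, o1, o2))

# 4NF decomposition into the two projections.
sno1 = {(s, n, o1) for s, n, _, o1, _ in sndo1o2}   # front-end index content
sdo2 = {(s, d, o2) for s, _, d, _, o2 in sndo1o2}   # back-end index content

# Natural join on S reconstructs the original relation (lossless decomposition).
rejoined = {(s, n, d, o1, o2) for s, n, o1 in sno1 for s2, d, o2 in sdo2 if s == s2}

# Recombining the offsets gives back the NDO relation, with O = O1 + O2.
ndo = {(n, d, o1 + o2) for _, n, d, o1, o2 in sndo1o2}

print(len(sndo1o2), len(sno1), len(sdo2))   # tuples before and after decomposition
print(rejoined == sndo1o2)
```

The two projections together hold 36 tuples where the undecomposed relation holds 54, which is the redundancy the two-level index removes.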



Analysis of the n-Gram/2L Index

Notation

Symbol       Definition
avg_doc      the average number of occurrences of an m-subsequence in the documents
avg_ngram    the average number of the n-grams extracted from an m-subsequence

<Figure: an m-subsequence linked to its avg_doc occurrences in documents and to the avg_ngram n-grams extracted from it>

Optimal length m_o

Length of the m-subsequence that minimizes the size of the n-gram/2L index



Index Size

Space complexities

n-gram index: O(avg_doc × avg_ngram)

n-gram/2L index: O(avg_doc + avg_ngram)

Properties

m_o is obtained by finding the length m that makes avg_doc = avg_ngram

Both avg_doc and avg_ngram increase as the database size gets larger

Analytical results

Size of the n-gram/2L index is significantly reduced compared with that of the n-gram index for a large database

Reduction of the index size becomes more marked as the database size increases

See the paper for the detailed analysis



Formulas for the index size

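$$\mathit{size}_{ngram} = \sum_{s \in SS} \left( k_{ngram}(s) \times k_{doc}(s) \right) \tag{1}$$

$$\mathit{size}_{front} = \sum_{s \in SS} k_{ngram}(s) \tag{2}$$

$$\mathit{size}_{back} = \sum_{s \in SS} k_{doc}(s) \tag{3}$$

$$\frac{\mathit{size}_{ngram}}{\mathit{size}_{front} + \mathit{size}_{back}} = \frac{\sum_{s \in SS} \left( k_{ngram}(s) \times k_{doc}(s) \right)}{\sum_{s \in SS} \left( k_{ngram}(s) + k_{doc}(s) \right)} \approx \frac{|SS| \left( \mathit{avg}_{ngram}(SS) \times \mathit{avg}_{doc}(SS) \right)}{|SS| \left( \mathit{avg}_{ngram}(SS) + \mathit{avg}_{doc}(SS) \right)} \tag{4}$$

Here SS is the set of m-subsequences, k_ngram(s) is the number of n-grams extracted from the m-subsequence s, and k_doc(s) is the number of occurrences of s in the documents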
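Formulas (1) to (4) can be evaluated on the six example documents used earlier in the talk (n = 2, m = 4). The sketch assumes that sizes are counted in stored offsets, that k_ngram(s) is the number of n-grams extracted from m-subsequence s, and that k_doc(s) is the number of occurrences of s in the documents; the variable names are chosen here.

```python
from collections import Counter

DOCS = ["ABCDDABBCD", "DABCDABCDA", "CDABBCDDAB",
        "BCDABCDABC", "DDABCDABCD", "BBCDABCDAB"]
n, m = 2, 4
step = m - n + 1   # consecutive m-subsequences overlap by n-1

# k_doc(s): occurrences of each m-subsequence s in the documents.
k_doc = Counter()
for text in DOCS:
    for offset in range(0, len(text) - n + 1, step):
        k_doc[text[offset:offset + m]] += 1

# k_ngram(s): n-grams extracted from each m-subsequence s.
k_ngram = {s: len(s) - n + 1 for s in k_doc}

size_ngram = sum(k_ngram[s] * k_doc[s] for s in k_doc)   # formula (1)
size_front = sum(k_ngram.values())                        # formula (2)
size_back = sum(k_doc.values())                           # formula (3)
ratio = size_ngram / (size_front + size_back)             # formula (4)

print(size_ngram, size_front, size_back, ratio)
```

In this tiny collection every m-subsequence has k_ngram(s) = 3 and k_doc(s) = 3, so the averaged form of formula (4) is exact here; in general it is an approximation.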



Query Performance

Time complexities

n-gram index: O(avg_doc × avg_ngram)

n-gram/2L index: O(avg_doc + avg_ngram)

Analytical results

The n-gram/2L index significantly improves the query performance over the n-gram index for a large database

The improvement in query performance grows as the database size increases

Query processing time increases only very slightly as the query length gets longer

It has been pointed out that the query performance of the n-gram index for long queries tends to be bad [Wil2003]

See the paper for the detailed analysis



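Formulas for the query performance

$$\mathit{time}_{ngram} = \frac{\mathit{size}_{ngram}}{|\Sigma|^{n}} \left( \mathit{Len}(Q) - n + 1 \right) \tag{5}$$

$$\mathit{time}_{front} = \frac{\mathit{size}_{front}}{|\Sigma|^{n}} \left( \mathit{Len}(Q) - n + 1 \right) \tag{6}$$

$$\mathit{time}_{back} = \begin{cases} \dfrac{\mathit{size}_{back}}{|\Sigma|^{m}} \left( \mathit{Len}(Q) - m + 1 + 2 \sum_{i=0}^{m-n-1} |\Sigma|^{m-n-i} \right) & \text{if } \mathit{Len}(Q) \ge m \\[2ex] \dfrac{\mathit{size}_{back}}{|\Sigma|^{m}} \left( (m - \mathit{Len}(Q) + 1) \, |\Sigma|^{m-\mathit{Len}(Q)} + 2 \sum_{i=0}^{\mathit{Len}(Q)-n-1} |\Sigma|^{m-n-i} \right) & \text{if } \mathit{Len}(Q) < m \end{cases} \tag{7}$$

$$\frac{\mathit{time}_{ngram}}{\mathit{time}_{front} + \mathit{time}_{back}} = \begin{cases} \dfrac{\mathit{size}_{ngram} \left( \mathit{Len}(Q) - n + 1 \right)}{\mathit{size}_{front} \left( \mathit{Len}(Q) - n + 1 \right) + \mathit{size}_{back} \left( \dfrac{\mathit{Len}(Q) - m + 1}{|\Sigma|^{m-n}} + c \right)} & \text{if } \mathit{Len}(Q) \ge m \\[3ex] \dfrac{\mathit{size}_{ngram} \left( \mathit{Len}(Q) - n + 1 \right)}{\mathit{size}_{front} \left( \mathit{Len}(Q) - n + 1 \right) + \mathit{size}_{back} \left( \dfrac{m - \mathit{Len}(Q) + 1}{|\Sigma|^{\mathit{Len}(Q)-n}} + d \right)} & \text{if } \mathit{Len}(Q) < m \end{cases} \tag{8}$$

$$\text{where } c = 2 \sum_{i=0}^{m-n-1} \left( \frac{1}{|\Sigma|} \right)^{i}, \quad d = 2 \sum_{i=0}^{\mathit{Len}(Q)-n-1} \left( \frac{1}{|\Sigma|} \right)^{i}$$

Here |Σ| denotes the size of the alphabet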



Experiments

Measures

Index size

Query performance: number of page accesses, wall clock time (ms)

Data sets

PROTEIN-DATA: the set of protein sequence databases used in bioinformatics

TREC-DATA: the set of English text databases used in information retrieval

Parameters

Data size = 10 MBytes, 100 MBytes, and 1 GByte

n = 3 (n-gram length) [Kuk1992,WZ2002]

m = 4 ~ 6 (m-subsequence length)

Len(Q) = 3, 6, 9, 12, 15, and 18 (query length)

index size ratio = (the number of pages allocated for the n-gram index) / (the number of pages allocated for the n-gram/2L index)



Index Size (PROTEIN-DATA)

The size of the n-gram/2L index is significantly reduced compared with that of the n-gram index

By up to 2.7 times in PROTEIN-1G

The reduction of the index size becomes more marked as the database size increases

Approximately 25% for the PROTEIN-DATA as the database size is increased tenfold (10 MBytes → 100 MBytes → 1 GByte)

<Figure: index size ratio (0 to 3) versus m-subsequence length m (4, 5, 6) for PROT-1G, PROT-100M, and PROT-10M, with the optimal length m_o marked>



Query Performance (PROTEIN-DATA)

n-gram/2L significantly improves the query performance over the n-gram index

Up to 13.1 times in wall clock time (PROTEIN-1G)

Improvement gets better as the database size increases

1.37 times in PROTEIN-100M; 6.65 times in PROTEIN-1G

Query processing time increases only very slightly as the query length gets longer

n-gram/2L index: 53% as Len(Q) goes from 3 to 18 (cf. n-gram index: 32.9 times)

<Figure: three panels comparing the 3-gram index with the 3-gram/2L index (m=4)>

<Query processing time> (Len(Q): 3~18): wall clock time (ms, logarithmic scale) versus data size (10M, 100M, 1G bytes)

<No. of page accesses> (data set: PROTEIN-1G): number of page accesses versus query length Len(Q) (3 to 18)

<Query processing time> (data set: PROTEIN-1G): wall clock time (ms) versus query length Len(Q) (3 to 18)



Conclusions

We have shown that the redundancy in the position information existing in the n-gram index is due to non-trivial MVDs

We have proposed the two-level structure of the n-gram index

We have shown that the n-gram/2L index is derived by the relational normalization process that decomposes the n-gram index into 4NF

We have provided a formal analysis of the space and time complexities of the n-gram/2L index

Finally, through extensive experiments, we have shown that the n-gram/2L index significantly reduces the size and improves the query performance compared with the n-gram index



References

[BR1999] Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern Information Retrieval, ACM Press, 1999.

[Coh1997] Jonathan D. Cohen, “Recursive Hashing Functions for n-Grams,” ACM Trans. on Information Systems, Vol. 15, No. 3, pp. 291-320, July 1997.

[EN2003] Ramez Elmasri and Shamkant B. Navathe, Fundamentals of Database Systems, Addison Wesley, 4th ed., 2003.

[Kuk1992] Karen Kukich, “Techniques for Automatically Correcting Words in Text,” ACM Computing Surveys, Vol. 24, No. 4, pp. 377-439, Dec. 1992.

[LA1996] Joon Ho Lee and Jeong Soo Ahn, “Using n-Grams for Korean Text Retrieval,” In Proc. Int'l Conf. on Information Retrieval, ACM SIGIR, Zurich, Switzerland, pp. 216-224, 1996.

[MM2003] James Mayfield and Paul McNamee, “Single N-gram Stemming,” In Proc. Int'l Conf. on Information Retrieval, ACM SIGIR, Toronto, Canada, pp. 415-416, July/Aug. 2003.

[MSL+2000] Ethan Miller, Dan Shen, Junli Liu, and Charles Nicholas, “Performance and Scalability of a Large-Scale N-gram Based Information Retrieval System,” Journal of Digital Information 1(5), pp. 1-25, Jan. 2000.

[MZ1996] Alistair Moffat and Justin Zobel, “Self-indexing inverted files for fast text retrieval,” ACM Trans. on Information Systems, Vol. 14, No. 4, pp. 349-379, Oct. 1996.

[Nav2001] Gonzalo Navarro, “A Guided Tour to Approximate String Matching,” ACM Computing Surveys, Vol. 33, No. 1, pp. 31-88, Mar. 2001.

[Ram1998] Raghu Ramakrishnan, Database Management Systems, McGraw-Hill, 1998.

[SKS2001] Abraham Silberschatz, Henry F. Korth, and S. Sudarshan, Database System Concepts, McGraw-Hill, 4th ed., 2001.

[SWY+2002] Falk Scholer, Hugh E. Williams, John Yiannis, and Justin Zobel, “Compression of Inverted Indexes for Fast Query Evaluation,” In Proc. Int'l Conf. on Information Retrieval, ACM SIGIR, Tampere, Finland, pp. 222-229, Aug. 2002.

[Ull1988] Jeffrey D. Ullman, Principles of Database and Knowledge-Base Systems Vol. I, Computer Science Press, USA, 1988.

[Wil2003] Hugh E. Williams, “Genomic Information Retrieval,” In Proc. the 14th Australasian Database Conferences, 2003.

[WLL+2005] Kyu-Young Whang, Min-Jae Lee, Jae-Gil Lee, Min-Soo Kim, and Wook-Shin Han, “Odysseus: a High-Performance ORDBMS Tightly-Coupled with IR Features,” In Proc. the 21st IEEE Int'l Conf. on Data Engineering (ICDE), Tokyo, Japan, Apr. 2005.

[WMB1999] I. Witten, A. Moffat, and T. Bell, Managing Gigabytes: Compressing and Indexing Documents and Images, Morgan Kaufmann Publishers, Los Altos, California, 2nd ed., 1999.

[WVT1990] Kyu-Young Whang, Brad T. Vander-Zanden, and Howard M. Taylor, “A Linear-Time Probabilistic Counting Algorithm for Database Applications,” ACM Trans. on Database Systems, Vol. 15, No. 2, pp. 208-229, June 1990.

[WZ2002] Hugh E. Williams and Justin Zobel, “Indexing and Retrieval for Genomic Databases,” IEEE Trans. on Knowledge and Data Engineering, Vol. 14, No. 1, pp. 63-78, Jan./Feb. 2002.

[YT1998] Ogawa Yasushi and Matsuda Toru, “Optimizing query evaluation in n-gram indexing,” In Proc. Int'l Conf. on Information Retrieval, ACM SIGIR, Melbourne, Australia, pp. 367-368, 1998.
