Substring statistics : Algorithm...Substring statistics : Algorithm 0.0001 0.001 0.01 0.1 1...

Substring statistics : Algorithm

0.0001

0.001

0.01

0.1

1

1e-006 1e-005 0.0001 0.001 0.01 0.1 1

df2/

df

df/N

NTCIR2G(any ngrams)

There are N(N-1)/2 substrings in a corpus of lenth N. 　Substring statistics are useful to find word boundary in Japanese

１９

Kyoji Umemura and Kenneth W. Church CICLing 2009

April　１４,　２００９

How it works

0.0001

0.001

0.01

0.1

1

1e-006 1e-005 0.0001 0.001 0.01 0.1 1

df2/

df

df/N

NTCIR2G(any ngrams)

First prepare the table of all statistics. Then obtain each statistics using binary search

１９

Result: PreparationO(N), Access: O(log(N))

Size of Table is O(N), not O(N^2) There are N(N-1)/2 substrings in a corpus of lenth N. But many of them have same statistics.

１９

N(N-1)/2 is too large for memory.

Strings with same statistics are grouped into a class

Saving Preparation Time If each table may required O(N) computation to get value, the preparation time would be O(N^2) because table size (N).

１９

obtaining statistics of a shorter substring using statistics of longer substring which share the begining part.

definition of cf(x)

abc abc ab

ab abc ab

abc

ab ab abc ab

Corpus cf(x) : number of occurrences of string X in a corpus

cf (abc) =5 #4 #5 #6

#1 #2 #3

Ex.

abc abc ab

ab abc ab

abc

ab ab abc ab

Corpus cf(x) : number of occurrences of string (not word) x. The word "abc" contains a string "ab".

cf (abc) =5 #4 #5 #6

#1 #2 #3

Ex.

cf (ab) =11

definition of cf(x)：x is string

definition of tf(d,x)

abc abc ab

ab abc ab

abc

ab ab abc ab

Corpus tf(d,x) : number of occurrence of string x in document # d

tf (1, abc) =2

#4 #5 #6

#1 #2 #3

ex.

definition of tf(d,x):x is string

abc abc ab

ab abc ab

abc

ab ab abc ab

Corpus tf(d,x) : number of occurrence of string x in document # d

tf (1, abc) =2

#4 #5 #6

#1 #2 #3

ex. tf (6, ab) =4

definition of df(x)

abc abc ab

ab abc ab

abc

ab ab abc ab

Corpus df(x) : number of documents which contain the string x, at least once.

df (abc) =4 #4 #5 #6

#1 #2 #3

ex.

definition of df(x): x is string

abc abc ab

ab abc ab

abc

ab ab abc ab

テキストコーパス df(x) : number of documents which contain the string x . The word "abc" contains a string "ab". df (abc) =4 #4 #5 #6

#1 #2 #3

ex. df (ab) =5

definition of df2(x)

abc abc ab

ab abc ab

abc

ab ab abc ab

Corpus df2 (x) : the number of documents which contain string x twice or more.

df2 (abc) =1

#4 #5 #6

#1 #2 #3

ex.

definition of df2(x)　

abc abc ab

ab abc ab

abc

ab ab abc ab

Corpus df2 (x) : the number of documents which contain string x twice or more.

#4 #5 #6

#1 #2 #3

ex. df2 (ab) =3

definition of dfk(x)　

abc abc ab

ab abc ab

abc

ab ab abc ab

Corpus dfk(x) : the number of documents which contain string x k times or more.

#4 #5 #6

#1 #2 #3

ex. df3 (ab) =2

Suffix Array １

a c a b d b b

c c b

a c b a c b b a c a b b a c a b b b a c a b d b b

c c b

a c b

a c b b

a c a b b

a c a b b b

a c a b d b b

suffix[0] suffix[1] suffix[2] suffix[3] suffix[4] suffix[5] suffix[6]

sorted as dictionary

Suffix Array

Suffix Array １

Suffix Array ２

a c a b d b b

c c b

a c b

a c b b

a c a b b

a c a b b b

a c a b d b b

suffix[0] suffix[1] suffix[2] suffix[3] suffix[4] suffix[5] suffix[6]

A suffix is expressed by one integer

suffix[6] suffix[5]

suffix[2]

suffix[0] suffix[1]

suffix[3] suffix[4]

Space 　O(N )

Suffix Array ２

common[i ] : suffix[i ]におけるsuffix[i+1]との共通の先頭部分 outgoing : outgoing(i,j)=max(common[i-1],common[j]) inner : inner (i,j) =min (common[k])

k=1 j -1

Classes: Set of strings which starts with same suffixes

c d b a

c b

a c b a a

b c

a b

a

c a d

d d

b c d b a a c e b c d

a a b e e d a a b e e e e

d d d

suffix[0]

suffix[1]

suffix[2]

suffix[3]

suffix[4]

suffix[5]

The set of string, "aabb", "aabbc", and "aabbcc" is a class because they start with from suffix1 to suffix2 and not other suffix.

"aabb", "aabbc", and "aabbcc"

"aab"

"a" and "aa"

Members of a class shares statistics

For class C, x, y in C satifies following, because they start with same suffixes.

A df (x) = df (y) df2 (x) = df2 (y) dfn(x) = dfn(y)

cf (x) = cf (y)

tf (d,x) = tf (d,y)

A

c d b a

c b

a c b a a

b c

a b

a

c a d

d d



d d d

suffix[0]

suffix[1]

suffix[2]

suffix[3]

suffix[4]

suffix[5]

lcp[0] = 2, since suffix[0] and suffix[1] have common prefix "aa", and the length of "aa" is 2.

LCP(length of Common Prefix)

common[i ] : suffix[i ]におけるsuffix[i+1]との共通の先頭部分 outgoing : outgoing(i,j)=max(common[i-1],common[j]) inner : inner (i,j) =min (common[k])

k=1 j -1

Classes on Suffix Array

c d b a

c b

a c b a a

b c

a b

a

c a d

d d



d d d

suffix[0]

suffix[1]

suffix[2]

suffix[3]

suffix[4]

suffix[5]

A class corresponds to an suffix interval which satisfies ... outgoing(1,4)=max(lcp[0],lcp[4])=2

　　　　 inner(1,4) =min (lcp[k]) =3 k=1 3

[1,4]

[0,0]

[1,1]

[2,2]

[3,3]

[4,4]

[5,5]

[0,4]

[1,2]

[3,4]

Upper and Lower Class

Lower class

Upper class

if C1 contains all C2 suffixes, C1 is upper class of C2 C2 is lower class of C1

suffix array

Property of Class

Lower class

Upper class

Classes form one tree. Therefore total number of classes are less than 2N-1 where N is size of Corpus

suffix array

Corpus

#4 #5 #6

#1 #2 #3

cf (abc) cf (abx) cf (ab) =3+6

=3 =6

=9 abx

abc

abc

abc

abx

abx

abx

abx

abx

number of occurrence of a string are obtained by longer strings

Corpus

#4 #5 #6

#1 #2 #3

number of documents are not

cf (abc) =3 cf (abx) =6 cf (ab) =3+6

=9 df (abc) =3 df (abx) df (ab) =3+4？

=4

=5

abx

abc

abc

abc

abx

abx

abx

abx

abx

numbering string "ab" in a document by suffix order.

Document　#１

abc

abd

abx

aby

suffix[11] suffix[23] suffix[29] suffix[89] suffix[94]

a d b c a a

b a e

a y b a x b

abx suffixes of doc #1 abe

abd

abx aac

for string "ab", the numbering of suffix[89]

(abx..)　is 3.

1

3

4

2

numbering string "a" in a document by suffix order.

Document　#１

abc

abd

abx

aby

suffix[11] suffix[23] suffix[29] suffix[89] suffix[94]

a d b c a a

b a e

a y b a x b

abx suffixes of doc #1 abe

abd

abx aac

for string "a", the numbering of suffix[89]

(abx..)　is 4.

2

4

5

3

1

k-frequency of string x

cfk (x)

tfk (d,x)

:sum of tfk (d,x) in all d of corpus

: number of string x in document d, where numbering of x is k or larger.

For class C, x, y in C satifies following, because they start with same suffixes. tfk (d,x) = tfk (d,y)

Property

k-frequency and document frequency

document　#1 document　#2 document　#3

ab

ab ab

ab

abc

abc

abc

ab

ab

ab

abc abc

abc abc

abc

abc abc

abc

ab

ab ab

dfk(x)=cfk(x)-cfk+1(x) property

cf8(ab)=1

ab ab

ab ab ab

ab ab

ab ab ab

ab ab

ab ab

ab

ab

ab ab

ab ab

ab

1 1 1 2 2

2 3 3 3 4 4

4 5 5

5 6 6

6 7 7 8

cf7(ab)=3

7 7 ab ab

cf6(ab)=6

ab ab ab

6

6

6

df7(ab)=cf7(ab)-cf8(ab)=3-1=2 df6(ab)=cf6(ab)-cf7(ab)=6-3=3

enumerating classes scanning suffix array and LCP

suffix array

suffix[0] suffix[1] suffix[2] suffix[3]

LCP for each suffix from begin to end

1. LCP increase : begining of class

…

stack

2. LCP decrease: end of class

assing start point , and lcp of class push class record to stack

pop class record assign end point of class

suffix[1] suffix[2]

detecting class

suffix[2]

detecting class

class record in stack incomplete class

suffix position

Note: needs more care for the classes sharing start or end

Data structure for k-frequency -document link-

Doc

umen

t　#

234

Doc

umen

t　#

235

abcx abcy

abcz abcy abcy

abcx abcx

abcx

abcx abcz

abcz abcz

abcz

abcx

abcx abcx

abcy

abcy

abcy

12301 12308

12311

12331

12313

12345 12357

12379

12388 12405

12415 12536

12444 12456

12465 12476 12481

12500 12494

abcx

abcy

abcz

-1

-1

abcx abcy

abcz abcy abcy

abcx abcx

abcx

abcx abcz

12301 12345 12357 12379 12388 12415 12465 12481 12308

12331 12336

12500

12444

12311

12476

12313 12305

12456 12494

-1

-1

Doc

umen

t　#

234

abcz abcz

abcz abcy

abcy

abcy abcx

abcx

abcx

Doc

umen

t　#

235

Marking the numbering is 3 or more for "abcx", "abcy" and "abcz"

Doc

umen

t　#

234

Doc

umen

t　#

235

abcx abcy

abcz abcy abcy

abcx abcx

abcx

abcx abcz

abcz abcz

abcz

abcx

abcx abcx

abcy

abcy

abcy

12301 12308

12311

12331

12313

12345 12357

12379

12388 12405

12415 12536

12444 12456

12465 12476 12481

12500 12494

abcx

abcy

abcz

-1

-1

abcx abcy

abcz abcy abcy

abcx abcx

abcx

abcx abcz

12301 12345 12357 12379 12388 12415 12465 12481 12308

12331 12336

12500

12444

12311

12476

12313 12305

12456 12494

-1

-1

Doc

umen

t　#

234

abcz abcz

abcz abcy

abcy

abcy abcx

abcx

abcx

Doc

umen

t　#

235

Marking the numbering is 3 or more for "abc"

Doc

umen

t #

234

Doc

umen

t　#

235

abcx abcy

abcz abcy abcy

abcx abcx

abcx

abcx abcz

abcz abcz

abcz

abcx

abcx abcx

abcy

abcy

abcy

12301 12308

12311

12331

12313

12345 12357

12379

12388 12405

12415 12536

12444 12456

12465 12476 12481

12500 12494

abc

-1

-1

abcx abcy

abcz abcy abcy

abcx abcx

abcx

abcx abcz

12301 12345 12357 12379 12388 12415 12465 12481 12308

12331 12336

12500

12444

12311

12476

12313 12305

12456 12494

-1

-1

Doc

umen

t　#

234

abcz abcz

abcz abcy

abcy

abcy abcx

abcx

abcx

Doc

umen

t　#

235

Counting k-frequency n  Deside Range by Document Link, n  Count up variables in class in stack where start is smaller than range

Suffix Position

number of record in class

Range

Reducing computation by Class*

simple method　:　counting up variables in all incomplete classes where the class contains specified range. 　　　　　　　

→multiple counting for one suffix

There is a way to get same result by single counting with propagation.

Suffixes marked for "abcx" "abcy" and "abcz", are also marked in "abc"

Doc

umen

t　#

234

Doc

umen

t　#

235

abcx abcy

abcz abcy abcy

abcx abcx

abcx

abcx abcz

abcz abcz

abcz

abcx

abcx abcx

abcy

abcy

abcy

12301 12308

12311

12331

12313

12345 12357

12379

12388 12405

12415 12536

12444 12456

12465 12476 12481

12500 12494

abcx

abcy

abcz

abc -1

-1

abcx abcy

abcz abcy abcy

abcx abcx

abcx

abcx abcz

12301 12345 12357 12379 12388 12415 12465 12481 12308

12331 12336

12500

12444

12311

12476

12313 12305

12456 12494

-1

-1

Doc

umen

t　#

234

abcz abcz

abcz abcy

abcy

abcy abcx

abcx

abcx

Doc

umen

t　#

235

Ranges by document link show that they belongs to "abc", and not "abcx", "abcy"...

Doc

umen

t　#

234

Doc

umen

t　#

235

abcx abcy

abcz abcy abcy

abcx abcx

abcx

abcx abcz

abcz abcz

abcz

abcx

abcx abcx

abcy

abcy

abcy

12301 12308

12311

12331

12313

12345 12357

12379

12388 12405

12415 12536

12444 12456

12465 12476 12481

12500 12494

abcx

abcy

abcz

abc -1

-1

abcx abcy

abcz abcy abcy

abcx abcx

abcx

abcx abcz

12301 12345 12357 12379 12388 12415 12465 12481 12308

12331 12336

12500

12444

12311

12476

12313 12305

12456 12494

-1

-1

Doc

umen

t　#

234

abcz abcz

abcz abcy

abcy

abcy abcx

abcx

abcx

Doc

umen

t　#

235

definition Class* of interval n  The class which has smallest interval among

the classes which contains the interval. n  i.e. The lowest class among the class which

contains the interval.

suffix position

number record instack

Range

Reducing computation by Class*

simple method　:　counting up variables in all incomplete classes where the class contains specified range. 　　　　　　　

noting the following property

Class forms a tree structure. if a suffix has number K for one class, the suffix has K or more number in upper class.

Property

counting up variables of class* of range, then propergate count to upper class

→multiple counting for one suffix

How to find Class* n  Class* is a class on stack n  where its start is smaller than range start

and closest one

suffix position

number of stack in a class

Range

How to find Class* n  using binary search, search class of

appropriate start position. Note: class in stack are ordered by start.

suffix position

number of stack in a class

Range

Preparation

abcabcabc abcd abcde bcde

Corpus class[4,8] 　 df=3 df2=1 S=“abc” class[5,6] 　df=1 df2=1 S=“abcabc” class[7,8] 　df=2 df2=0 S=“abcd” class[9,14] 　 df=4 df2=1 S=“bc” class[10,11] 　df=1 df2=1 S=“bcabc” ・・・

Tables

df=3 df2=1 S=“abc”

abc, ab, a class[4,8]

class[5,6] df=1 df2=1 S=“abcabc”

abca,abcab, abcabc

class[7,8] 　df=2 df2=0 S=“abcd”

abcd

class[9,14] df=4 df2=1 S=“bc”

b, bc

bca,bcab, bcabc class[10,11]

df=1 df2=1 S=“bcabc”

Sort the class table by dictionary order of the longest string of Class

Access to table

“abc” ??

abcabcabc abcd abcde bcde

コーパス class[4,8] 　 df=3 df2=1 S=“abc” class[5,6] 　df=1 df2=1 S=“abcabc” class[7,8] 　df=2 df2=0 S=“abcd” class[9,14] 　 df=4 df2=1 S=“bc” class[10,11] 　df=1 df2=1 S=“bcabc” ・・・

df=3 df2=1 S=“abc”

abc, ab, a class[4,8]

class[5,6] df=1 df2=1 S=“abcabc”

abca,abcab, abcabc

class[7,8] 　df=2 df2=0 S=“abcd”

abcd

class[9,14] df=4 df2=1 S=“bc”

b, bc

bca,bcab, bcabc class[10,11]

df=1 df2=1 S=“bcabc”

df=3 df2=1

bBinary search to find first class that maches query. note: other ordering is also possible

Conclusion

n  Various document frequencies can be computed by K-frequency of string

n  K-frequecy of short string is sum of K-frequency of longer strings, that is not the case for document frequency.

n  K-frequency is efficiently computed using the suffix array and document link.

Download & Try

n  http://www.ss.cs.tut.ac.jp/umemura/cicling2009/

n  Try ./show-class for your favorite string. n  suggesting short string with patterns n  verify the various frequencies

Substring statistics : Algorithm...Substring statistics : Algorithm 0.0001 0.001 0.01 0.1 1...

Documents

Transcript of Substring statistics : Algorithm...Substring statistics : Algorithm 0.0001 0.001 0.01 0.1 1...