Substring statistics : Algorithm...Substring statistics : Algorithm 0.0001 0.001 0.01 0.1 1...
Transcript of Substring statistics : Algorithm...Substring statistics : Algorithm 0.0001 0.001 0.01 0.1 1...
![Page 1: Substring statistics : Algorithm...Substring statistics : Algorithm 0.0001 0.001 0.01 0.1 1 1e-0061e-0050.00010.0010.010.11 df2/df df/N NTCIR2G(any ngrams) There are N(N-1)/2 substrings](https://reader034.fdocuments.us/reader034/viewer/2022050715/5f2f38559dbddc027010828f/html5/thumbnails/1.jpg)
Substring statistics : Algorithm
0.0001
0.001
0.01
0.1
1
1e-006 1e-005 0.0001 0.001 0.01 0.1 1
df2/
df
df/N
NTCIR2G(any ngrams)
There are N(N-1)/2 substrings in a corpus of lenth N. Substring statistics are useful to find word boundary in Japanese
19
Kyoji Umemura and Kenneth W. Church CICLing 2009
April 14, 2009
![Page 2: Substring statistics : Algorithm...Substring statistics : Algorithm 0.0001 0.001 0.01 0.1 1 1e-0061e-0050.00010.0010.010.11 df2/df df/N NTCIR2G(any ngrams) There are N(N-1)/2 substrings](https://reader034.fdocuments.us/reader034/viewer/2022050715/5f2f38559dbddc027010828f/html5/thumbnails/2.jpg)
How it works
0.0001
0.001
0.01
0.1
1
1e-006 1e-005 0.0001 0.001 0.01 0.1 1
df2/
df
df/N
NTCIR2G(any ngrams)
First prepare the table of all statistics. Then obtain each statistics using binary search
19
Result: PreparationO(N), Access: O(log(N))
![Page 3: Substring statistics : Algorithm...Substring statistics : Algorithm 0.0001 0.001 0.01 0.1 1 1e-0061e-0050.00010.0010.010.11 df2/df df/N NTCIR2G(any ngrams) There are N(N-1)/2 substrings](https://reader034.fdocuments.us/reader034/viewer/2022050715/5f2f38559dbddc027010828f/html5/thumbnails/3.jpg)
Size of Table is O(N), not O(N^2) There are N(N-1)/2 substrings in a corpus of lenth N. But many of them have same statistics.
19
N(N-1)/2 is too large for memory.
Strings with same statistics are grouped into a class
![Page 4: Substring statistics : Algorithm...Substring statistics : Algorithm 0.0001 0.001 0.01 0.1 1 1e-0061e-0050.00010.0010.010.11 df2/df df/N NTCIR2G(any ngrams) There are N(N-1)/2 substrings](https://reader034.fdocuments.us/reader034/viewer/2022050715/5f2f38559dbddc027010828f/html5/thumbnails/4.jpg)
Saving Preparation Time If each table may required O(N) computation to get value, the preparation time would be O(N^2) because table size (N).
19
obtaining statistics of a shorter substring using statistics of longer substring which share the begining part.
![Page 5: Substring statistics : Algorithm...Substring statistics : Algorithm 0.0001 0.001 0.01 0.1 1 1e-0061e-0050.00010.0010.010.11 df2/df df/N NTCIR2G(any ngrams) There are N(N-1)/2 substrings](https://reader034.fdocuments.us/reader034/viewer/2022050715/5f2f38559dbddc027010828f/html5/thumbnails/5.jpg)
definition of cf(x)
abc abc ab
ab abc ab
abc
ab ab abc ab
Corpus cf(x) : number of occurrences of string X in a corpus
cf (abc) =5 #4 #5 #6
#1 #2 #3
Ex.
![Page 6: Substring statistics : Algorithm...Substring statistics : Algorithm 0.0001 0.001 0.01 0.1 1 1e-0061e-0050.00010.0010.010.11 df2/df df/N NTCIR2G(any ngrams) There are N(N-1)/2 substrings](https://reader034.fdocuments.us/reader034/viewer/2022050715/5f2f38559dbddc027010828f/html5/thumbnails/6.jpg)
abc abc ab
ab abc ab
abc
ab ab abc ab
Corpus cf(x) : number of occurrences of string (not word) x. The word "abc" contains a string "ab".
cf (abc) =5 #4 #5 #6
#1 #2 #3
Ex.
cf (ab) =11
definition of cf(x):x is string
![Page 7: Substring statistics : Algorithm...Substring statistics : Algorithm 0.0001 0.001 0.01 0.1 1 1e-0061e-0050.00010.0010.010.11 df2/df df/N NTCIR2G(any ngrams) There are N(N-1)/2 substrings](https://reader034.fdocuments.us/reader034/viewer/2022050715/5f2f38559dbddc027010828f/html5/thumbnails/7.jpg)
definition of tf(d,x)
abc abc ab
ab abc ab
abc
ab ab abc ab
Corpus tf(d,x) : number of occurrence of string x in document # d
tf (1, abc) =2
#4 #5 #6
#1 #2 #3
ex.
![Page 8: Substring statistics : Algorithm...Substring statistics : Algorithm 0.0001 0.001 0.01 0.1 1 1e-0061e-0050.00010.0010.010.11 df2/df df/N NTCIR2G(any ngrams) There are N(N-1)/2 substrings](https://reader034.fdocuments.us/reader034/viewer/2022050715/5f2f38559dbddc027010828f/html5/thumbnails/8.jpg)
definition of tf(d,x):x is string
abc abc ab
ab abc ab
abc
ab ab abc ab
Corpus tf(d,x) : number of occurrence of string x in document # d
tf (1, abc) =2
#4 #5 #6
#1 #2 #3
ex. tf (6, ab) =4
![Page 9: Substring statistics : Algorithm...Substring statistics : Algorithm 0.0001 0.001 0.01 0.1 1 1e-0061e-0050.00010.0010.010.11 df2/df df/N NTCIR2G(any ngrams) There are N(N-1)/2 substrings](https://reader034.fdocuments.us/reader034/viewer/2022050715/5f2f38559dbddc027010828f/html5/thumbnails/9.jpg)
definition of df(x)
abc abc ab
ab abc ab
abc
ab ab abc ab
Corpus df(x) : number of documents which contain the string x, at least once.
df (abc) =4 #4 #5 #6
#1 #2 #3
ex.
![Page 10: Substring statistics : Algorithm...Substring statistics : Algorithm 0.0001 0.001 0.01 0.1 1 1e-0061e-0050.00010.0010.010.11 df2/df df/N NTCIR2G(any ngrams) There are N(N-1)/2 substrings](https://reader034.fdocuments.us/reader034/viewer/2022050715/5f2f38559dbddc027010828f/html5/thumbnails/10.jpg)
definition of df(x): x is string
abc abc ab
ab abc ab
abc
ab ab abc ab
テキストコーパス df(x) : number of documents which contain the string x . The word "abc" contains a string "ab". df (abc) =4 #4 #5 #6
#1 #2 #3
ex. df (ab) =5
![Page 11: Substring statistics : Algorithm...Substring statistics : Algorithm 0.0001 0.001 0.01 0.1 1 1e-0061e-0050.00010.0010.010.11 df2/df df/N NTCIR2G(any ngrams) There are N(N-1)/2 substrings](https://reader034.fdocuments.us/reader034/viewer/2022050715/5f2f38559dbddc027010828f/html5/thumbnails/11.jpg)
definition of df2(x)
abc abc ab
ab abc ab
abc
ab ab abc ab
Corpus df2 (x) : the number of documents which contain string x twice or more.
df2 (abc) =1
#4 #5 #6
#1 #2 #3
ex.
![Page 12: Substring statistics : Algorithm...Substring statistics : Algorithm 0.0001 0.001 0.01 0.1 1 1e-0061e-0050.00010.0010.010.11 df2/df df/N NTCIR2G(any ngrams) There are N(N-1)/2 substrings](https://reader034.fdocuments.us/reader034/viewer/2022050715/5f2f38559dbddc027010828f/html5/thumbnails/12.jpg)
definition of df2(x)
abc abc ab
ab abc ab
abc
ab ab abc ab
Corpus df2 (x) : the number of documents which contain string x twice or more.
#4 #5 #6
#1 #2 #3
ex. df2 (ab) =3
![Page 13: Substring statistics : Algorithm...Substring statistics : Algorithm 0.0001 0.001 0.01 0.1 1 1e-0061e-0050.00010.0010.010.11 df2/df df/N NTCIR2G(any ngrams) There are N(N-1)/2 substrings](https://reader034.fdocuments.us/reader034/viewer/2022050715/5f2f38559dbddc027010828f/html5/thumbnails/13.jpg)
definition of dfk(x)
abc abc ab
ab abc ab
abc
ab ab abc ab
Corpus dfk(x) : the number of documents which contain string x k times or more.
#4 #5 #6
#1 #2 #3
ex. df3 (ab) =2
![Page 14: Substring statistics : Algorithm...Substring statistics : Algorithm 0.0001 0.001 0.01 0.1 1 1e-0061e-0050.00010.0010.010.11 df2/df df/N NTCIR2G(any ngrams) There are N(N-1)/2 substrings](https://reader034.fdocuments.us/reader034/viewer/2022050715/5f2f38559dbddc027010828f/html5/thumbnails/14.jpg)
Suffix Array 1
a c a b d b b
c c b
a c b a c b b a c a b b a c a b b b a c a b d b b
c c b
a c b
a c b b
a c a b b
a c a b b b
a c a b d b b
suffix[0] suffix[1] suffix[2] suffix[3] suffix[4] suffix[5] suffix[6]
sorted as dictionary
Suffix Array
Suffix Array 1
![Page 15: Substring statistics : Algorithm...Substring statistics : Algorithm 0.0001 0.001 0.01 0.1 1 1e-0061e-0050.00010.0010.010.11 df2/df df/N NTCIR2G(any ngrams) There are N(N-1)/2 substrings](https://reader034.fdocuments.us/reader034/viewer/2022050715/5f2f38559dbddc027010828f/html5/thumbnails/15.jpg)
Suffix Array 2
a c a b d b b
c c b
a c b
a c b b
a c a b b
a c a b b b
a c a b d b b
suffix[0] suffix[1] suffix[2] suffix[3] suffix[4] suffix[5] suffix[6]
A suffix is expressed by one integer
suffix[6] suffix[5]
suffix[2]
suffix[0] suffix[1]
suffix[3] suffix[4]
Space O(N )
Suffix Array 2
![Page 16: Substring statistics : Algorithm...Substring statistics : Algorithm 0.0001 0.001 0.01 0.1 1 1e-0061e-0050.00010.0010.010.11 df2/df df/N NTCIR2G(any ngrams) There are N(N-1)/2 substrings](https://reader034.fdocuments.us/reader034/viewer/2022050715/5f2f38559dbddc027010828f/html5/thumbnails/16.jpg)
common[i ] : suffix[i ]におけるsuffix[i+1]との共通の先頭部分 outgoing : outgoing(i,j)=max(common[i-1],common[j]) inner : inner (i,j) =min (common[k])
k=1 j -1
Classes: Set of strings which starts with same suffixes
c d b a
c b
a c b a a
b c
a b
a
c a d
d d
b c d b a a c e b c d
a a b e e d a a b e e e e
d d d
suffix[0]
suffix[1]
suffix[2]
suffix[3]
suffix[4]
suffix[5]
The set of string, "aabb", "aabbc", and "aabbcc" is a class because they start with from suffix1 to suffix2 and not other suffix.
"aabb", "aabbc", and "aabbcc"
"aab"
"a" and "aa"
![Page 17: Substring statistics : Algorithm...Substring statistics : Algorithm 0.0001 0.001 0.01 0.1 1 1e-0061e-0050.00010.0010.010.11 df2/df df/N NTCIR2G(any ngrams) There are N(N-1)/2 substrings](https://reader034.fdocuments.us/reader034/viewer/2022050715/5f2f38559dbddc027010828f/html5/thumbnails/17.jpg)
Members of a class shares statistics
For class C, x, y in C satifies following, because they start with same suffixes.
A df (x) = df (y) df2 (x) = df2 (y) dfn(x) = dfn(y)
cf (x) = cf (y)
tf (d,x) = tf (d,y)
A
![Page 18: Substring statistics : Algorithm...Substring statistics : Algorithm 0.0001 0.001 0.01 0.1 1 1e-0061e-0050.00010.0010.010.11 df2/df df/N NTCIR2G(any ngrams) There are N(N-1)/2 substrings](https://reader034.fdocuments.us/reader034/viewer/2022050715/5f2f38559dbddc027010828f/html5/thumbnails/18.jpg)
c d b a
c b
a c b a a
b c
a b
a
c a d
d d
b c d b a a c e b c d
a a b e e d a a b e e e e
d d d
suffix[0]
suffix[1]
suffix[2]
suffix[3]
suffix[4]
suffix[5]
lcp[0] = 2, since suffix[0] and suffix[1] have common prefix "aa", and the length of "aa" is 2.
LCP(length of Common Prefix)
![Page 19: Substring statistics : Algorithm...Substring statistics : Algorithm 0.0001 0.001 0.01 0.1 1 1e-0061e-0050.00010.0010.010.11 df2/df df/N NTCIR2G(any ngrams) There are N(N-1)/2 substrings](https://reader034.fdocuments.us/reader034/viewer/2022050715/5f2f38559dbddc027010828f/html5/thumbnails/19.jpg)
common[i ] : suffix[i ]におけるsuffix[i+1]との共通の先頭部分 outgoing : outgoing(i,j)=max(common[i-1],common[j]) inner : inner (i,j) =min (common[k])
k=1 j -1
Classes on Suffix Array
c d b a
c b
a c b a a
b c
a b
a
c a d
d d
b c d b a a c e b c d
a a b e e d a a b e e e e
d d d
suffix[0]
suffix[1]
suffix[2]
suffix[3]
suffix[4]
suffix[5]
A class corresponds to an suffix interval which satisfies ... outgoing(1,4)=max(lcp[0],lcp[4])=2
inner(1,4) =min (lcp[k]) =3 k=1 3
[1,4]
[0,0]
[1,1]
[2,2]
[3,3]
[4,4]
[5,5]
[0,4]
[1,2]
[3,4]
![Page 20: Substring statistics : Algorithm...Substring statistics : Algorithm 0.0001 0.001 0.01 0.1 1 1e-0061e-0050.00010.0010.010.11 df2/df df/N NTCIR2G(any ngrams) There are N(N-1)/2 substrings](https://reader034.fdocuments.us/reader034/viewer/2022050715/5f2f38559dbddc027010828f/html5/thumbnails/20.jpg)
Upper and Lower Class
Lower class
Upper class
if C1 contains all C2 suffixes, C1 is upper class of C2 C2 is lower class of C1
suffix array
![Page 21: Substring statistics : Algorithm...Substring statistics : Algorithm 0.0001 0.001 0.01 0.1 1 1e-0061e-0050.00010.0010.010.11 df2/df df/N NTCIR2G(any ngrams) There are N(N-1)/2 substrings](https://reader034.fdocuments.us/reader034/viewer/2022050715/5f2f38559dbddc027010828f/html5/thumbnails/21.jpg)
Property of Class
Lower class
Upper class
Classes form one tree. Therefore total number of classes are less than 2N-1 where N is size of Corpus
suffix array
![Page 22: Substring statistics : Algorithm...Substring statistics : Algorithm 0.0001 0.001 0.01 0.1 1 1e-0061e-0050.00010.0010.010.11 df2/df df/N NTCIR2G(any ngrams) There are N(N-1)/2 substrings](https://reader034.fdocuments.us/reader034/viewer/2022050715/5f2f38559dbddc027010828f/html5/thumbnails/22.jpg)
Corpus
#4 #5 #6
#1 #2 #3
cf (abc) cf (abx) cf (ab) =3+6
=3 =6
=9 abx
abc
abc
abc
abx
abx
abx
abx
abx
number of occurrence of a string are obtained by longer strings
![Page 23: Substring statistics : Algorithm...Substring statistics : Algorithm 0.0001 0.001 0.01 0.1 1 1e-0061e-0050.00010.0010.010.11 df2/df df/N NTCIR2G(any ngrams) There are N(N-1)/2 substrings](https://reader034.fdocuments.us/reader034/viewer/2022050715/5f2f38559dbddc027010828f/html5/thumbnails/23.jpg)
Corpus
#4 #5 #6
#1 #2 #3
number of documents are not
cf (abc) =3 cf (abx) =6 cf (ab) =3+6
=9 df (abc) =3 df (abx) df (ab) =3+4?
=4
=5
abx
abc
abc
abc
abx
abx
abx
abx
abx
![Page 24: Substring statistics : Algorithm...Substring statistics : Algorithm 0.0001 0.001 0.01 0.1 1 1e-0061e-0050.00010.0010.010.11 df2/df df/N NTCIR2G(any ngrams) There are N(N-1)/2 substrings](https://reader034.fdocuments.us/reader034/viewer/2022050715/5f2f38559dbddc027010828f/html5/thumbnails/24.jpg)
numbering string "ab" in a document by suffix order.
Document #1
abc
abd
abx
aby
suffix[11] suffix[23] suffix[29] suffix[89] suffix[94]
a d b c a a
b a e
a y b a x b
abx suffixes of doc #1 abe
abd
abx aac
for string "ab", the numbering of suffix[89]
(abx..) is 3.
1
3
4
2
![Page 25: Substring statistics : Algorithm...Substring statistics : Algorithm 0.0001 0.001 0.01 0.1 1 1e-0061e-0050.00010.0010.010.11 df2/df df/N NTCIR2G(any ngrams) There are N(N-1)/2 substrings](https://reader034.fdocuments.us/reader034/viewer/2022050715/5f2f38559dbddc027010828f/html5/thumbnails/25.jpg)
numbering string "a" in a document by suffix order.
Document #1
abc
abd
abx
aby
suffix[11] suffix[23] suffix[29] suffix[89] suffix[94]
a d b c a a
b a e
a y b a x b
abx suffixes of doc #1 abe
abd
abx aac
for string "a", the numbering of suffix[89]
(abx..) is 4.
2
4
5
3
1
![Page 26: Substring statistics : Algorithm...Substring statistics : Algorithm 0.0001 0.001 0.01 0.1 1 1e-0061e-0050.00010.0010.010.11 df2/df df/N NTCIR2G(any ngrams) There are N(N-1)/2 substrings](https://reader034.fdocuments.us/reader034/viewer/2022050715/5f2f38559dbddc027010828f/html5/thumbnails/26.jpg)
k-frequency of string x
cfk (x)
tfk (d,x)
:sum of tfk (d,x) in all d of corpus
: number of string x in document d, where numbering of x is k or larger.
For class C, x, y in C satifies following, because they start with same suffixes. tfk (d,x) = tfk (d,y)
Property
![Page 27: Substring statistics : Algorithm...Substring statistics : Algorithm 0.0001 0.001 0.01 0.1 1 1e-0061e-0050.00010.0010.010.11 df2/df df/N NTCIR2G(any ngrams) There are N(N-1)/2 substrings](https://reader034.fdocuments.us/reader034/viewer/2022050715/5f2f38559dbddc027010828f/html5/thumbnails/27.jpg)
k-frequency and document frequency
document #1 document #2 document #3
ab
ab ab
ab
abc
abc
abc
ab
ab
ab
abc abc
abc abc
abc
abc abc
abc
ab
ab ab
dfk(x)=cfk(x)-cfk+1(x) property
cf8(ab)=1
ab ab
ab ab ab
ab ab
ab ab ab
ab ab
ab ab
ab
ab
ab ab
ab ab
ab
1 1 1 2 2
2 3 3 3 4 4
4 5 5
5 6 6
6 7 7 8
cf7(ab)=3
7 7 ab ab
cf6(ab)=6
ab ab ab
6
6
6
df7(ab)=cf7(ab)-cf8(ab)=3-1=2 df6(ab)=cf6(ab)-cf7(ab)=6-3=3
![Page 28: Substring statistics : Algorithm...Substring statistics : Algorithm 0.0001 0.001 0.01 0.1 1 1e-0061e-0050.00010.0010.010.11 df2/df df/N NTCIR2G(any ngrams) There are N(N-1)/2 substrings](https://reader034.fdocuments.us/reader034/viewer/2022050715/5f2f38559dbddc027010828f/html5/thumbnails/28.jpg)
enumerating classes scanning suffix array and LCP
suffix array
suffix[0] suffix[1] suffix[2] suffix[3]
LCP for each suffix from begin to end
1. LCP increase : begining of class
…
stack
2. LCP decrease: end of class
assing start point , and lcp of class push class record to stack
pop class record assign end point of class
suffix[1] suffix[2]
detecting class
suffix[2]
detecting class
class record in stack incomplete class
suffix position
Note: needs more care for the classes sharing start or end
![Page 29: Substring statistics : Algorithm...Substring statistics : Algorithm 0.0001 0.001 0.01 0.1 1 1e-0061e-0050.00010.0010.010.11 df2/df df/N NTCIR2G(any ngrams) There are N(N-1)/2 substrings](https://reader034.fdocuments.us/reader034/viewer/2022050715/5f2f38559dbddc027010828f/html5/thumbnails/29.jpg)
Data structure for k-frequency -document link-
Doc
umen
t #
234
Doc
umen
t #
235
abcx abcy
abcz abcy abcy
abcx abcx
abcx
abcx abcz
abcz abcz
abcz
abcx
abcx abcx
abcy
abcy
abcy
12301 12308
12311
12331
12313
12345 12357
12379
12388 12405
12415 12536
12444 12456
12465 12476 12481
12500 12494
abcx
abcy
abcz
-1
-1
abcx abcy
abcz abcy abcy
abcx abcx
abcx
abcx abcz
12301 12345 12357 12379 12388 12415 12465 12481 12308
12331 12336
12500
12444
12311
12476
12313 12305
12456 12494
-1
-1
Doc
umen
t #
234
abcz abcz
abcz abcy
abcy
abcy abcx
abcx
abcx
Doc
umen
t #
235
![Page 30: Substring statistics : Algorithm...Substring statistics : Algorithm 0.0001 0.001 0.01 0.1 1 1e-0061e-0050.00010.0010.010.11 df2/df df/N NTCIR2G(any ngrams) There are N(N-1)/2 substrings](https://reader034.fdocuments.us/reader034/viewer/2022050715/5f2f38559dbddc027010828f/html5/thumbnails/30.jpg)
Marking the numbering is 3 or more for "abcx", "abcy" and "abcz"
Doc
umen
t #
234
Doc
umen
t #
235
abcx abcy
abcz abcy abcy
abcx abcx
abcx
abcx abcz
abcz abcz
abcz
abcx
abcx abcx
abcy
abcy
abcy
12301 12308
12311
12331
12313
12345 12357
12379
12388 12405
12415 12536
12444 12456
12465 12476 12481
12500 12494
abcx
abcy
abcz
-1
-1
abcx abcy
abcz abcy abcy
abcx abcx
abcx
abcx abcz
12301 12345 12357 12379 12388 12415 12465 12481 12308
12331 12336
12500
12444
12311
12476
12313 12305
12456 12494
-1
-1
Doc
umen
t #
234
abcz abcz
abcz abcy
abcy
abcy abcx
abcx
abcx
Doc
umen
t #
235
![Page 31: Substring statistics : Algorithm...Substring statistics : Algorithm 0.0001 0.001 0.01 0.1 1 1e-0061e-0050.00010.0010.010.11 df2/df df/N NTCIR2G(any ngrams) There are N(N-1)/2 substrings](https://reader034.fdocuments.us/reader034/viewer/2022050715/5f2f38559dbddc027010828f/html5/thumbnails/31.jpg)
Marking the numbering is 3 or more for "abc"
Doc
umen
t #
234
Doc
umen
t #
235
abcx abcy
abcz abcy abcy
abcx abcx
abcx
abcx abcz
abcz abcz
abcz
abcx
abcx abcx
abcy
abcy
abcy
12301 12308
12311
12331
12313
12345 12357
12379
12388 12405
12415 12536
12444 12456
12465 12476 12481
12500 12494
abc
-1
-1
abcx abcy
abcz abcy abcy
abcx abcx
abcx
abcx abcz
12301 12345 12357 12379 12388 12415 12465 12481 12308
12331 12336
12500
12444
12311
12476
12313 12305
12456 12494
-1
-1
Doc
umen
t #
234
abcz abcz
abcz abcy
abcy
abcy abcx
abcx
abcx
Doc
umen
t #
235
![Page 32: Substring statistics : Algorithm...Substring statistics : Algorithm 0.0001 0.001 0.01 0.1 1 1e-0061e-0050.00010.0010.010.11 df2/df df/N NTCIR2G(any ngrams) There are N(N-1)/2 substrings](https://reader034.fdocuments.us/reader034/viewer/2022050715/5f2f38559dbddc027010828f/html5/thumbnails/32.jpg)
Counting k-frequency n Deside Range by Document Link, n Count up variables in class in stack where start is smaller than range
Suffix Position
number of record in class
Range
![Page 33: Substring statistics : Algorithm...Substring statistics : Algorithm 0.0001 0.001 0.01 0.1 1 1e-0061e-0050.00010.0010.010.11 df2/df df/N NTCIR2G(any ngrams) There are N(N-1)/2 substrings](https://reader034.fdocuments.us/reader034/viewer/2022050715/5f2f38559dbddc027010828f/html5/thumbnails/33.jpg)
Reducing computation by Class*
simple method : counting up variables in all incomplete classes where the class contains specified range.
→multiple counting for one suffix
There is a way to get same result by single counting with propagation.
![Page 34: Substring statistics : Algorithm...Substring statistics : Algorithm 0.0001 0.001 0.01 0.1 1 1e-0061e-0050.00010.0010.010.11 df2/df df/N NTCIR2G(any ngrams) There are N(N-1)/2 substrings](https://reader034.fdocuments.us/reader034/viewer/2022050715/5f2f38559dbddc027010828f/html5/thumbnails/34.jpg)
Suffixes marked for "abcx" "abcy" and "abcz", are also marked in "abc"
Doc
umen
t #
234
Doc
umen
t #
235
abcx abcy
abcz abcy abcy
abcx abcx
abcx
abcx abcz
abcz abcz
abcz
abcx
abcx abcx
abcy
abcy
abcy
12301 12308
12311
12331
12313
12345 12357
12379
12388 12405
12415 12536
12444 12456
12465 12476 12481
12500 12494
abcx
abcy
abcz
abc -1
-1
abcx abcy
abcz abcy abcy
abcx abcx
abcx
abcx abcz
12301 12345 12357 12379 12388 12415 12465 12481 12308
12331 12336
12500
12444
12311
12476
12313 12305
12456 12494
-1
-1
Doc
umen
t #
234
abcz abcz
abcz abcy
abcy
abcy abcx
abcx
abcx
Doc
umen
t #
235
![Page 35: Substring statistics : Algorithm...Substring statistics : Algorithm 0.0001 0.001 0.01 0.1 1 1e-0061e-0050.00010.0010.010.11 df2/df df/N NTCIR2G(any ngrams) There are N(N-1)/2 substrings](https://reader034.fdocuments.us/reader034/viewer/2022050715/5f2f38559dbddc027010828f/html5/thumbnails/35.jpg)
Ranges by document link show that they belongs to "abc", and not "abcx", "abcy"...
Doc
umen
t #
234
Doc
umen
t #
235
abcx abcy
abcz abcy abcy
abcx abcx
abcx
abcx abcz
abcz abcz
abcz
abcx
abcx abcx
abcy
abcy
abcy
12301 12308
12311
12331
12313
12345 12357
12379
12388 12405
12415 12536
12444 12456
12465 12476 12481
12500 12494
abcx
abcy
abcz
abc -1
-1
abcx abcy
abcz abcy abcy
abcx abcx
abcx
abcx abcz
12301 12345 12357 12379 12388 12415 12465 12481 12308
12331 12336
12500
12444
12311
12476
12313 12305
12456 12494
-1
-1
Doc
umen
t #
234
abcz abcz
abcz abcy
abcy
abcy abcx
abcx
abcx
Doc
umen
t #
235
![Page 36: Substring statistics : Algorithm...Substring statistics : Algorithm 0.0001 0.001 0.01 0.1 1 1e-0061e-0050.00010.0010.010.11 df2/df df/N NTCIR2G(any ngrams) There are N(N-1)/2 substrings](https://reader034.fdocuments.us/reader034/viewer/2022050715/5f2f38559dbddc027010828f/html5/thumbnails/36.jpg)
definition Class* of interval n The class which has smallest interval among
the classes which contains the interval. n i.e. The lowest class among the class which
contains the interval.
suffix position
number record instack
Range
![Page 37: Substring statistics : Algorithm...Substring statistics : Algorithm 0.0001 0.001 0.01 0.1 1 1e-0061e-0050.00010.0010.010.11 df2/df df/N NTCIR2G(any ngrams) There are N(N-1)/2 substrings](https://reader034.fdocuments.us/reader034/viewer/2022050715/5f2f38559dbddc027010828f/html5/thumbnails/37.jpg)
Reducing computation by Class*
simple method : counting up variables in all incomplete classes where the class contains specified range.
noting the following property
Class forms a tree structure. if a suffix has number K for one class, the suffix has K or more number in upper class.
Property
counting up variables of class* of range, then propergate count to upper class
→multiple counting for one suffix
![Page 38: Substring statistics : Algorithm...Substring statistics : Algorithm 0.0001 0.001 0.01 0.1 1 1e-0061e-0050.00010.0010.010.11 df2/df df/N NTCIR2G(any ngrams) There are N(N-1)/2 substrings](https://reader034.fdocuments.us/reader034/viewer/2022050715/5f2f38559dbddc027010828f/html5/thumbnails/38.jpg)
How to find Class* n Class* is a class on stack n where its start is smaller than range start
and closest one
suffix position
number of stack in a class
Range
![Page 39: Substring statistics : Algorithm...Substring statistics : Algorithm 0.0001 0.001 0.01 0.1 1 1e-0061e-0050.00010.0010.010.11 df2/df df/N NTCIR2G(any ngrams) There are N(N-1)/2 substrings](https://reader034.fdocuments.us/reader034/viewer/2022050715/5f2f38559dbddc027010828f/html5/thumbnails/39.jpg)
How to find Class* n using binary search, search class of
appropriate start position. Note: class in stack are ordered by start.
suffix position
number of stack in a class
Range
![Page 40: Substring statistics : Algorithm...Substring statistics : Algorithm 0.0001 0.001 0.01 0.1 1 1e-0061e-0050.00010.0010.010.11 df2/df df/N NTCIR2G(any ngrams) There are N(N-1)/2 substrings](https://reader034.fdocuments.us/reader034/viewer/2022050715/5f2f38559dbddc027010828f/html5/thumbnails/40.jpg)
Preparation
abcabcabc abcd abcde bcde
Corpus class[4,8] df=3 df2=1 S=“abc” class[5,6] df=1 df2=1 S=“abcabc” class[7,8] df=2 df2=0 S=“abcd” class[9,14] df=4 df2=1 S=“bc” class[10,11] df=1 df2=1 S=“bcabc” ・・・
Tables
df=3 df2=1 S=“abc”
abc, ab, a class[4,8]
class[5,6] df=1 df2=1 S=“abcabc”
abca,abcab, abcabc
class[7,8] df=2 df2=0 S=“abcd”
abcd
class[9,14] df=4 df2=1 S=“bc”
b, bc
bca,bcab, bcabc class[10,11]
df=1 df2=1 S=“bcabc”
Sort the class table by dictionary order of the longest string of Class
![Page 41: Substring statistics : Algorithm...Substring statistics : Algorithm 0.0001 0.001 0.01 0.1 1 1e-0061e-0050.00010.0010.010.11 df2/df df/N NTCIR2G(any ngrams) There are N(N-1)/2 substrings](https://reader034.fdocuments.us/reader034/viewer/2022050715/5f2f38559dbddc027010828f/html5/thumbnails/41.jpg)
Access to table
“abc” ??
abcabcabc abcd abcde bcde
コーパス class[4,8] df=3 df2=1 S=“abc” class[5,6] df=1 df2=1 S=“abcabc” class[7,8] df=2 df2=0 S=“abcd” class[9,14] df=4 df2=1 S=“bc” class[10,11] df=1 df2=1 S=“bcabc” ・・・
df=3 df2=1 S=“abc”
abc, ab, a class[4,8]
class[5,6] df=1 df2=1 S=“abcabc”
abca,abcab, abcabc
class[7,8] df=2 df2=0 S=“abcd”
abcd
class[9,14] df=4 df2=1 S=“bc”
b, bc
bca,bcab, bcabc class[10,11]
df=1 df2=1 S=“bcabc”
df=3 df2=1
bBinary search to find first class that maches query. note: other ordering is also possible
![Page 42: Substring statistics : Algorithm...Substring statistics : Algorithm 0.0001 0.001 0.01 0.1 1 1e-0061e-0050.00010.0010.010.11 df2/df df/N NTCIR2G(any ngrams) There are N(N-1)/2 substrings](https://reader034.fdocuments.us/reader034/viewer/2022050715/5f2f38559dbddc027010828f/html5/thumbnails/42.jpg)
Conclusion
n Various document frequencies can be computed by K-frequency of string
n K-frequecy of short string is sum of K-frequency of longer strings, that is not the case for document frequency.
n K-frequency is efficiently computed using the suffix array and document link.
![Page 43: Substring statistics : Algorithm...Substring statistics : Algorithm 0.0001 0.001 0.01 0.1 1 1e-0061e-0050.00010.0010.010.11 df2/df df/N NTCIR2G(any ngrams) There are N(N-1)/2 substrings](https://reader034.fdocuments.us/reader034/viewer/2022050715/5f2f38559dbddc027010828f/html5/thumbnails/43.jpg)
Download & Try
n http://www.ss.cs.tut.ac.jp/umemura/cicling2009/
n Try ./show-class for your favorite string. n suggesting short string with patterns n verify the various frequencies