
Combinatorics of Finite Words and Suffix Automata

Gabriele Fici

Dipartimento di Informatica e Applicazioni, Università di Salerno (Italy)

CAI 2009 - Thessaloniki

20 May 2009

Gabriele Fici Combinatorics of Finite Words and Suffix Automata


Combinatorics of Finite Words

A is a finite set of letters (the alphabet).

A finite word w is an element of A∗.

Its length |w| is the number of its letters.

The empty word ε has length 0.

Let w = a_1 a_2 ... a_n be a word.

a_1 ... a_i, with 1 ≤ i ≤ n, and ε are the prefixes of w.

a_j ... a_n, with 1 ≤ j ≤ n, and ε are the suffixes of w.

a_j ... a_i, with 1 ≤ j ≤ i ≤ n, and ε are the factors of w.


Combinatorics of Finite Words

Example

A = {a,n,b, c}, w = banana

|banana| = 6

ba is a prefix of banana

nana is a suffix of banana

a, ba, ε, banana are factors of banana
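The definitions above can be tried out with plain string slicing. This is a minimal sketch; the function names `prefixes`, `suffixes` and `factors` are illustrative choices, not notation from the talk.

```python
def prefixes(w):
    """All prefixes of w, from the empty word up to w itself."""
    return [w[:i] for i in range(len(w) + 1)]

def suffixes(w):
    """All suffixes of w, from w itself down to the empty word."""
    return [w[j:] for j in range(len(w) + 1)]

def factors(w):
    """The set of all factors (contiguous blocks) of w, empty word included."""
    return {w[j:i] for j in range(len(w) + 1) for i in range(j, len(w) + 1)}

w = "banana"
print(len(w))                                  # length |w|
print("ba" in prefixes(w))                     # ba is a prefix
print("nana" in suffixes(w))                   # nana is a suffix
print({"a", "ba", "", w} <= factors(w))        # all four are factors
```

The empty Python string plays the role of ε. Enumerating all factors this way is quadratic in |w|, which is fine for small examples but is exactly the cost the suffix automaton avoids.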


Combinatorics of Finite Words

Some famous classes of finite words:

palindromes: w^R = w. Ex. level.

balanced words over two letters (say a and b): all the factors of the same length have the same number of a's (and of b's) up to 1. Ex. abaababaabaab.

differentiable words: words over {1,2} such that their Run Length Encoding is still a word over {1,2}. Ex. 2211212212211

finite prefixes of (right) infinite words: Thue-Morse, Fibonacci, Kolakoski, ...

many many others.

Intersections: 12112121121 is a balanced differentiable palindromic prefix of the Fibonacci word over {1,2}...
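The classes listed above can be tested mechanically on the examples. The sketch below follows the definitions as stated on the slide. In particular, "differentiable" is read literally as "the run-length encoding uses only 1 and 2", and the Fibonacci word is assumed to be generated by the morphism 1 → 12, 2 → 1. All function names are illustrative.

```python
from itertools import groupby

def is_palindrome(w):
    return w == w[::-1]

def is_balanced(w):
    """Binary words only: for every length k, the number of occurrences of a
    fixed letter in the factors of length k varies by at most 1."""
    if not w:
        return True
    letter = w[0]
    for k in range(1, len(w) + 1):
        counts = [w[i:i + k].count(letter) for i in range(len(w) - k + 1)]
        if max(counts) - min(counts) > 1:
            return False
    return True

def run_length_encoding(w):
    """Lengths of the maximal runs of equal letters, left to right."""
    return [len(list(run)) for _, run in groupby(w)]

def is_differentiable(w):
    """Slide definition: w is over {1,2} and so is its run-length encoding."""
    return set(w) <= {"1", "2"} and all(r <= 2 for r in run_length_encoding(w))

def fibonacci_prefix(n):
    """First n letters of the Fibonacci word over {1,2} (assumed morphism
    1 -> 12, 2 -> 1, iterated from the letter 1)."""
    u = "1"
    while len(u) < n:
        u = "".join("12" if c == "1" else "1" for c in u)
    return u[:n]

w = "12112121121"
print(is_palindrome(w), is_balanced(w), is_differentiable(w),
      fibonacci_prefix(len(w)) == w)
```

The balance test is a brute-force check over all factor lengths, chosen for readability rather than speed.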


What’s the target?

Target

Classify the words through their combinatorial properties.


The suffix automaton

Definition (Blumer et al. 1985 - Crochemore 1986)

The suffix automaton of the word w is the minimal deterministic automaton recognizing the suffixes of w.

Example

The suffix automaton of aabbabb:

[Figure: the suffix automaton of aabbabb. States 0 to 7 form the main path spelling a a b b a b b; three further states are labelled 3′, 3′′ and 4′′.]


Algorithmically

Theorem (Blumer et al. 1985 - Crochemore 1986)

The suffix automaton of a word w over a fixed alphabet A can be built in time and space O(|w|).
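To make the linear-time claim concrete, here is a sketch of the commonly used online construction with suffix links and state cloning. It is in the spirit of the cited results, not a transcription of those papers; the class name, attribute names and the `accepts` helper are assumptions made for this illustration.

```python
class SuffixAutomaton:
    """Online construction, one letter at a time (illustrative sketch)."""

    def __init__(self, w):
        self.trans = [{}]    # trans[q][c] = target of the edge labelled c from q
        self.link = [-1]     # suffix link of each state (-1 for the initial state)
        self.length = [0]    # length of the longest factor reaching each state
        self.last = 0        # state reached by the whole word read so far
        for c in w:
            self._extend(c)
        # Final states are those on the suffix-link path starting at `last`.
        self.final = set()
        q = self.last
        while q != -1:
            self.final.add(q)
            q = self.link[q]

    def _new_state(self, length, trans, link):
        self.trans.append(dict(trans))
        self.length.append(length)
        self.link.append(link)
        return len(self.trans) - 1

    def _extend(self, c):
        cur = self._new_state(self.length[self.last] + 1, {}, -1)
        p = self.last
        # Add the missing c-edges along the suffix-link path.
        while p != -1 and c not in self.trans[p]:
            self.trans[p][c] = cur
            p = self.link[p]
        if p == -1:
            self.link[cur] = 0
        else:
            q = self.trans[p][c]
            if self.length[p] + 1 == self.length[q]:
                self.link[cur] = q
            else:
                # Split q: the clone keeps the shorter factors of its class.
                clone = self._new_state(self.length[p] + 1,
                                        self.trans[q], self.link[q])
                while p != -1 and self.trans[p].get(c) == q:
                    self.trans[p][c] = clone
                    p = self.link[p]
                self.link[q] = clone
                self.link[cur] = clone
        self.last = cur

    def accepts(self, s):
        """True if s is a suffix of the word the automaton was built from."""
        q = 0
        for c in s:
            if c not in self.trans[q]:
                return False
            q = self.trans[q][c]
        return q in self.final

sa = SuffixAutomaton("aabbabb")
print(len(sa.trans), sum(len(t) for t in sa.trans))  # number of states, number of edges
```

Transitions are stored in dictionaries, so each step costs expected constant time for a fixed alphabet. The state numbering produced here is the order of creation and does not match the labels 3′, 3′′, 4′′ used in the figure.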


One way to build the SA

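Build a non-deterministic automaton:

w = aabbabb

[Figure: a path of states 0 to 7 whose consecutive edges are labelled a a b b a b b.]

Determinize by subset construction:

[Figure: the resulting deterministic automaton. Its states are the subsets {0, 1, 2, ..., 7}, {1, 2, 5}, {2}, {3}, {4}, {5}, {6}, {7}, {3, 6}, {4, 7} and {3, 4, 6, 7}.]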
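The two steps above can be sketched directly. The sketch assumes the non-deterministic automaton has states 0..|w|, an edge from i to i+1 labelled by the (i+1)-th letter, every state initial and state |w| final. That reading is consistent with the initial subset {0, 1, ..., 7} shown on the slide, but it is an interpretation. The function name is hypothetical.

```python
def suffix_automaton_by_subsets(w):
    """Determinize the path automaton of w in which every state is initial."""
    n = len(w)
    start = frozenset(range(n + 1))
    states = {start}
    edges = {}                      # (subset, letter) -> subset
    todo = [start]
    while todo:
        S = todo.pop()
        for c in sorted(set(w)):
            # Positions i in S followed by the letter c move to i + 1.
            T = frozenset(i + 1 for i in S if i < n and w[i] == c)
            if T:                   # the empty subset (sink) is not kept
                edges[(S, c)] = T
                if T not in states:
                    states.add(T)
                    todo.append(T)
    final = {S for S in states if n in S}
    return start, states, edges, final

start, states, edges, final = suffix_automaton_by_subsets("aabbabb")
for S in sorted(states, key=lambda S: (len(S), sorted(S))):
    print(sorted(S))
```

This route is easy to follow but not linear: the subsets themselves can have total size quadratic in |w|.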


Ending Positions

We associate to each factor v of w the set of ending positions of v in w.

Example

w = a a b b a b b
    1 2 3 4 5 6 7

Endset(b) = {3,4,6,7}, Endset(abb) = Endset(bb) = {4,7}.

We define on Fact(w) the equivalence:

u ∼ v ⇔ Endset(u) = Endset(v)

Then Fact(w)/∼ is the set of states of the SA of w.
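The equivalence can be computed naively by grouping factors on their end-sets. In this sketch positions are 1-based as on the slide, and the empty word is taken to end at every position 0..|w|; the function names are illustrative.

```python
from collections import defaultdict

def endset(v, w):
    """1-based ending positions of the occurrences of v in w."""
    return frozenset(i + len(v)
                     for i in range(len(w) - len(v) + 1)
                     if w.startswith(v, i))

def endset_classes(w):
    """Map each end-set to the set of factors of w having that end-set."""
    classes = defaultdict(set)
    for j in range(len(w) + 1):
        for i in range(j, len(w) + 1):
            v = w[j:i]
            classes[endset(v, w)].add(v)
    return classes

w = "aabbabb"
classes = endset_classes(w)
print(sorted(endset("b", w)), sorted(endset("abb", w)), sorted(endset("bb", w)))
print(len(classes))   # number of classes, i.e. number of states
```

Each class collects the factors that lead to the same state, so counting the keys of `classes` gives the number of states without building any transitions.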


The number of states

The number of states (classes) of the SA is denoted |Q_w|.

The bounds on |Q_w| are well known:

|w| + 1 ≤ |Q_w| ≤ 2|w| − 1

The upper bound is reached for w = ab^(|w|−1), with a ≠ b.

And for the lower bound?

Problem

Characterize the class of words for which |Q_w| = |w| + 1.
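The bounds can be explored by brute force on short binary words. The sketch below counts states as distinct end-sets of factors (a naive method, not the linear construction) and starts at length 2, on the assumption that the upper bound is meant for words of at least two letters. The helper names are hypothetical.

```python
from itertools import product

def num_states(w):
    """|Q_w| computed naively as the number of distinct end-sets of factors."""
    n = len(w)
    factors = {w[j:i] for j in range(n + 1) for i in range(j, n + 1)}
    return len({
        frozenset(i + len(v) for i in range(n - len(v) + 1) if w.startswith(v, i))
        for v in factors
    })

def words(alphabet, n):
    """All words of length n over the given alphabet."""
    return ("".join(t) for t in product(alphabet, repeat=n))

for n in range(2, 7):
    counts = [num_states(w) for w in words("ab", n)]
    # length, smallest count, largest count, count for a b^(n-1)
    print(n, min(counts), max(counts), num_states("a" + "b" * (n - 1)))
```

Printing the minimum and maximum per length makes it easy to compare against |w| + 1 and 2|w| − 1 by eye.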


Special Factors

Definition

v is a left special factor of w if there exist a ≠ b such that av and bv are factors of w.

v is a right special factor of w if there exist a ≠ b such that va and vb are factors of w.

v is a bispecial factor of w if it is both left and right special.

Example (w = aabbabb)

LS = {ε, a, b, ab, abb}, RS = {ε, a, b}, BIS = {ε, a, b}
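The three sets in the example can be recomputed straight from the definition. This is a small brute-force sketch with illustrative function names; the empty string stands for ε.

```python
def factors(w):
    return {w[j:i] for j in range(len(w) + 1) for i in range(j, len(w) + 1)}

def left_special(w):
    """Factors v such that av and bv are factors for two distinct letters a, b."""
    F, A = factors(w), set(w)
    return {v for v in F if sum(a + v in F for a in A) >= 2}

def right_special(w):
    """Factors v such that va and vb are factors for two distinct letters a, b."""
    F, A = factors(w), set(w)
    return {v for v in F if sum(v + a in F for a in A) >= 2}

def bispecial(w):
    return left_special(w) & right_special(w)

w = "aabbabb"
print(sorted(left_special(w), key=len))
print(sorted(right_special(w), key=len))
print(sorted(bispecial(w), key=len))
```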


The number of states

Theorem (Sciortino, Zamboni 2007)

If |A| = 2 then the following conditions are equivalent for a word w over A:

|Q_w| = |w| + 1

Every left special factor of w is a prefix of w

w is a prefix of a standard Sturmian word.

Without restriction on the cardinality of A we have the formula:

Lemma

|Q_w| = |w| + 1 + |D(w)|

where D(w) is the set of left special factors of w which are not prefixes of w.
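The lemma can be checked on the running example, and on all short words over a three-letter alphabet, with a brute-force sketch. Here |Q_w| is again computed as the number of distinct end-sets; the names `num_states` and `D` are illustrative.

```python
from itertools import product

def factors(w):
    return {w[j:i] for j in range(len(w) + 1) for i in range(j, len(w) + 1)}

def num_states(w):
    """|Q_w| as the number of distinct end-sets of factors of w."""
    n = len(w)
    return len({
        frozenset(i + len(v) for i in range(n - len(v) + 1) if w.startswith(v, i))
        for v in factors(w)
    })

def D(w):
    """Left special factors of w that are not prefixes of w."""
    F, A = factors(w), set(w)
    return {v for v in F
            if not w.startswith(v) and sum(a + v in F for a in A) >= 2}

w = "aabbabb"
print(sorted(D(w), key=len), num_states(w), len(w) + 1 + len(D(w)))

# Exhaustive check of the lemma on short ternary words.
ok = all(num_states(u) == len(u) + 1 + len(D(u))
         for n in range(6)
         for u in ("".join(t) for t in product("abc", repeat=n)))
print(ok)
```

For aabbabb the set D(w) is {b, ab, abb}, which accounts for the three states beyond the main path.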


Property LSP

Definition

A word has property LSP if every left special factor is a prefix.

Corollary

|Q_w| = |w| + 1 ⇐⇒ w has property LSP

Problem

Characterize the class of words having the property LSP, over an arbitrary fixed alphabet A.


The binary case

For binary words we have the formula:

|Q_w| = 2|w| − H_w − P_w

H_w is the minimal length of a prefix of w occurring only once,
P_w is the maximal length of a left special prefix of w.

As a corollary we obtain a new characterization of standard Sturmian words:

Corollary

w is a prefix of a standard Sturmian word ⇔ |w| = H_w + P_w + 1.
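The two parameters are easy to compute by brute force, which allows the binary formula to be checked on small cases. In this sketch the function names `H` and `P` mirror the slide's H_w and P_w. Returning −1 from `P` when no prefix is left special (a word using a single letter) is a convention of this sketch, not something stated in the talk.

```python
from itertools import product

def factors(w):
    return {w[j:i] for j in range(len(w) + 1) for i in range(j, len(w) + 1)}

def occurrences(v, w):
    """Number of (possibly overlapping) occurrences of v in w."""
    return sum(w.startswith(v, i) for i in range(len(w) - len(v) + 1))

def H(w):
    """Minimal length of a prefix of w occurring only once in w."""
    return next(k for k in range(len(w) + 1) if occurrences(w[:k], w) == 1)

def P(w):
    """Maximal length of a left special prefix of w (-1 if there is none)."""
    F, A = factors(w), set(w)
    lengths = [k for k in range(len(w) + 1)
               if sum(a + w[:k] in F for a in A) >= 2]
    return max(lengths, default=-1)

def num_states(w):
    """|Q_w| as the number of distinct end-sets of factors of w."""
    n = len(w)
    return len({
        frozenset(i + len(v) for i in range(n - len(v) + 1) if w.startswith(v, i))
        for v in factors(w)
    })

w = "aabbabb"
print(H(w), P(w), 2 * len(w) - H(w) - P(w), num_states(w))

# Check the formula on all binary words of length 2..7 using both letters.
ok = all(num_states(u) == 2 * len(u) - H(u) - P(u)
         for n in range(2, 8)
         for u in ("".join(t) for t in product("ab", repeat=n))
         if len(set(u)) == 2)
print(ok)
```

For the Fibonacci prefix abaab, the same functions give H = 3 and P = 1, so |w| = H_w + P_w + 1 as the corollary predicts.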


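Example (w = aabbabb)

[Figure: the suffix automaton of aabbabb, with states 0 to 7 on the main path and the further states 3′, 3′′ and 4′′.]

H_w = 2 since aa occurs only once, while the shorter prefix a occurs three times.

P_w = 1 since a is left special, while the next prefix aa is not.

|Q_w| = 2 · 7 − 2 − 1 = 11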


The number of edges

What about the number of edges E_w?

The bounds on E_w are well known:

|w| ≤ E_w ≤ 3|w| − 4

For binary words we give the formula:

Lemma

E_w = |Q_w| + |G(w)| − 1

G(w) is the union of the sets of bispecial factors and right special prefixes of w.


Example (w = aabbabb)

[Figure: the suffix automaton of aabbabb, with states 0 to 7 on the main path and the further states 3′, 3′′ and 4′′.]

G(w) = BIS(w) ∪ (Pref(w) ∩ RS(w)) = {ε, a, b} ∪ {ε, a} = {ε, a, b}

|G(w)| = 3 ⇒ E_w = 11 + 3 − 1 = 13.
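The edge count in this example can be reproduced without drawing the automaton. In the sketch below a state is an end-set, and an edge is a pair (end-set of v, letter c) such that vc is still a factor. This is a naive quadratic method, and the function names are illustrative.

```python
from itertools import product

def factors(w):
    return {w[j:i] for j in range(len(w) + 1) for i in range(j, len(w) + 1)}

def endset(v, w):
    return frozenset(i + len(v)
                     for i in range(len(w) - len(v) + 1)
                     if w.startswith(v, i))

def automaton_size(w):
    """Return (|Q_w|, E_w) computed from end-set classes."""
    F, A = factors(w), set(w)
    states = {endset(v, w) for v in F}
    edges = {(endset(v, w), c) for v in F for c in A if v + c in F}
    return len(states), len(edges)

def G(w):
    """Bispecial factors together with right special prefixes of w."""
    F, A = factors(w), set(w)

    def is_left_special(v):
        return sum(a + v in F for a in A) >= 2

    def is_right_special(v):
        return sum(v + a in F for a in A) >= 2

    return {v for v in F
            if is_right_special(v) and (is_left_special(v) or w.startswith(v))}

w = "aabbabb"
q, e = automaton_size(w)
print(q, e, sorted(G(w), key=len), q + len(G(w)) - 1)

# Check the lemma on all binary words of length up to 7.
ok = all(automaton_size(u)[1] == automaton_size(u)[0] + len(G(u)) - 1
         for n in range(8)
         for u in ("".join(t) for t in product("ab", repeat=n)))
print(ok)
```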


Further Research

Problem

Can this approach be applied to other data structures (factor oracle, suffix tree/trie, suffix array, etc.)?

Problem

Characterize the words having property LSP (i.e. every left special factor is a prefix).

Problem

Compute the average size of the SA for particular classes of words.
