Regular expressions that produce parse trees

Post on 21-May-2015


Presenting a regular expression engine that gives parse trees in a single pass by modifying the standard nondeterministic finite-state automaton algorithm. My master's thesis.


Efficient Regular Expressions that produce Parse Trees

Aaron Karper Niko Schwarz

University of Bern

January 7, 2014

Aaron Karper, Niko Schwarz (UniBe) Regex Parse Trees January 7, 2014 1 / 38

Regular expressions so far

Regular expressions

https?://(([a-z]+\.)+([a-z]+))((/[a-z0-9]+)/?)

Here (([a-z]+\.)+([a-z]+)) matches the domain and ((/[a-z0-9]+)/?) matches the path segments.

Example: http://www.reddit.com/r/computerscience/comments/1sg69d/

domain: www, reddit, com

path: /r, /computerscience, /comments, /1sg69d

Regular expressions so far

Regular expressions are greedy by default: (a+)(a?) on "aaa" → "aaa" in group 0 and "" in group 1.
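The greedy behaviour can be checked with Python's built-in re module, used here only as a convenient reference engine (it is a backtracking engine, not the one presented in these slides). Python numbers capture groups from 1, so the slide's group 0 and group 1 correspond to the first and second entry of groups().

```python
import re

# Greedy matching: (a+) takes as many characters as it can,
# so nothing is left for the optional (a?).
match = re.fullmatch(r"(a+)(a?)", "aaa")
first, second = match.groups()
print(repr(first), repr(second))  # 'aaa' ''
```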


Regular expressions so far

Posix gives only one match.

Regular languages are recognized, but parsing with combinatorial parsers takes O(n^3).

Backtracking implementations (Java, Python, Perl, ...) are exponentially slow in the worst case.

Benchmarks

Parsing with https?://(([a-z]+\.)+([a-z]+))((/[a-z0-9]+)/?)

Figure: Posix (capture groups found in http://www.reddit.com/r/computerscience/comments/1sg69d)

Figure: Our approach (capture groups found in the same URL)
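The difference between the two figures can be illustrated with Python's re module. It stands in here for any engine that keeps a single match per group; it is neither a Posix engine nor the authors' tool. A group inside a repetition reports only its last iteration, so earlier domain parts are lost. The name all_domain_parts and the second scan are assumptions made for this illustration; they show the information a parse tree would keep directly.

```python
import re

URL_RE = re.compile(r"https?://(([a-z]+\.)+([a-z]+))((/[a-z0-9]+)/?)")
url = "http://www.reddit.com/r/computerscience/comments/1sg69d"

m = URL_RE.match(url)
# Group 2 is ([a-z]+\.) inside a repetition: only the last iteration survives.
print(m.group(1))  # www.reddit.com
print(m.group(2))  # reddit.   ("www." is gone)

# Workaround without parse trees: scan the domain a second time.
all_domain_parts = re.findall(r"[a-z]+\.", m.group(1))
print(all_domain_parts)  # ['www.', 'reddit.']
```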


Benchmarks

Matching ((a+b)+c)+ against (a^200 bc)^2000.

Tool              Time
JParsec           4,498
java.util.regex   1,992
Ours              5,332

Extract all class names from our project with a complex regular expression [1].

Tool              Time
java.util.regex   11,319
Ours              8,047

[1] (.*?([a-z]+\.)*([A-Z][a-zA-Z]*))*.*?

Benchmarks – Optimizations of the algorithm

Typically most time is spent in long repetitions. We optimize for that case by:

Lazily compiling a deterministic FA.

Avoiding recreating a state if a similar state has been seen.

Using a compressed representation in static repetitions.

Benchmarks NFA interpretation

Example: (a?(a)b)+

Parse (a?(a)b)+ over "aabab", with the characters at positions 0 to 4 written as a0 a1 b2 a3 b4.

Figure: the input a a b a b (positions 0 to 4) with the matched capture groups 1 and 2 marked.

Example: (a?(a)b)+

Reading "a0 a1 b2 a3 b4"

The first frames add one thread after another; each thread is shown as its state followed by its histories. States q5 to q9 hold no thread yet.

q1 [[], [], [], []]

q2 [[0], [], [], []]

q3 [[0], [], [], []]

q4 [[0], [], [0], []]

Threads

Figure: a thread, made up of a state q and an array of histories h1, h2, ..., h6.

A copy of the thread is modified. Copying the array of histories makes reading a character O(m^2).

A faster persistent data structure is needed to get O(m log m).
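One common way to obtain such a persistent structure is a balanced tree with path copying: an update copies only the nodes on the path from the root to the changed entry and shares everything else. The slides do not state which layout the thesis uses, so the following is a minimal sketch under that assumption; Leaf, Node, build, get, and with_entry are hypothetical names.

```python
from typing import NamedTuple


class Leaf(NamedTuple):
    value: object


class Node(NamedTuple):
    left: object
    right: object
    size: int  # number of leaves below this node


def size(tree):
    return 1 if isinstance(tree, Leaf) else tree.size


def build(values):
    """Build a balanced tree over a non-empty list of values."""
    if len(values) == 1:
        return Leaf(values[0])
    mid = len(values) // 2
    return Node(build(values[:mid]), build(values[mid:]), len(values))


def get(tree, index):
    """Read one entry by walking from the root to a leaf."""
    while isinstance(tree, Node):
        left_size = size(tree.left)
        if index < left_size:
            tree = tree.left
        else:
            tree, index = tree.right, index - left_size
    return tree.value


def with_entry(tree, index, value):
    """Return a new tree with one entry replaced; the old tree stays valid."""
    if isinstance(tree, Leaf):
        return Leaf(value)
    left_size = size(tree.left)
    if index < left_size:
        return Node(with_entry(tree.left, index, value), tree.right, tree.size)
    return Node(tree.left, with_entry(tree.right, index - left_size, value), tree.size)


old = build([1, 2, 3, 4, 5, 6, 7, 8])
new = with_entry(old, 1, 20)          # the entry holding 2 becomes 20
print(get(old, 1), get(new, 1))       # 2 20
print(new.right is old.right)         # True: the untouched half is shared
```

Forking a thread then amounts to keeping a second reference to the same root, and each later update costs time proportional to the tree height instead of a full copy.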


Optimized thread forking

Set entry 2 to 20:

Figure: a tree with nodes 1 to 13. In the second frame, a new node 1 and a node 20 appear next to the original tree.

Example: (a?(a)b)+

Reading "a0 a1 b2 a3 b4"

For each character read, threads start hungry and must eat immediately. Only a hungry thread can eat.

Each frame below lists the states that hold a thread, followed by that thread's histories. States that are not listed hold no thread in that frame. Frames that repeat the previous one unchanged are listed once.

Frame 1: q1 [[], [], [], []]; q2 [[0], [], [], []]; q3 [[0], [], [], []]; q4 [[0], [], [0], []]

Frame 2: q1 [[], [], [], []]; q2 [[0], [], [], []]; q3 [[0], [], [], []]; q5 [[0], [], [0], []]

Frame 3: q1 [[], [], [], []]; q2 [[0], [], [], []]; q3 [[0], [], [], []]; q5 [[0], [], [0], []]; q6 [[0], [], [0], [0]]

Frame 4: q1 [[], [], [], []]; q2 [[0], [], [], []]; q5 [[0], [], [0], []]; q6 [[0], [], [0], [0]]

Frame 5: q3 [[0], [], [], []]; q4 [[0], [], [1], []]; q5 [[0], [], [0], []]; q6 [[0], [], [0], [0]]

Frame 6: q4 [[0], [], [1], []]; q6 [[0], [], [0], [0]]

Frame 7: q5 [[0], [], [1], []]; q6 [[0], [], [1], [1]]

Frame 8: q6 [[0], [], [1], [1]]

Frame 9: q1 [[0], [2], [1], [1]]; q2 [[0,2], [2], [1], [1]]; q3 [[0,2], [2], [1], [1]]; q4 [[0,2], [2], [1,3], [1]]; q9 [[0], [2], [1], [1]]; q7 [[0], [], [1], [1]]; q8 [[0], [2], [1], [1]]

Frame 10: q3 [[0,2], [2], [1], [1]]; q4 [[0,2], [2], [1,4], [1]]; q5 [[0,2], [2], [1,3], [1]]; q6 [[0,2], [2], [1,3], [1,3]]

Frame 11: q1 [[0,2], [2,4], [1,3], [1,3]]; q2 [[0,2,5], [2,4], [1,3], [1,3]]; q3 [[0,2,5], [2,4], [1,3], [1,3]]; q4 [[0,2,5], [2,4,5], [1,3], [1,3]]; q9 [[0,2], [2,4], [1,3], [1,3]]; q7 [[0,2], [2], [1,3], [1,3]]; q8 [[0,2], [2,4], [1,3], [1,3]]

Frame 12 (result): q9 [[0,2], [2,4], [1,3], [1,3]]

Figure: the input a a b a b (positions 0 to 4) with the matched capture groups 1 and 2 marked.
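The walk-through can be approximated in code. The sketch below is not the authors' implementation: it is a plain Pike-style NFA simulation in which every thread carries four histories (open and close positions for each of the two groups). The instruction list PROG is a hand-written encoding of (a?(a)b)+, and the names add_thread, parse, and PROG are assumptions. Positions are recorded as half-open offsets, so the numbers differ from those on the slides, and histories are copied naively on every update.

```python
def add_thread(prog, threads, seen, pc, hist, pos):
    """Follow instructions that consume no input; queue threads that wait for input."""
    if pc in seen:            # a higher-priority thread already reached this state
        return
    seen.add(pc)
    op = prog[pc]
    if op[0] == "split":      # op[1] is the preferred branch
        add_thread(prog, threads, seen, op[1], hist, pos)
        add_thread(prog, threads, seen, op[2], hist, pos)
    elif op[0] == "save":     # append the current position to one history
        slot = op[1]
        hist = hist[:slot] + (hist[slot] + (pos,),) + hist[slot + 1:]
        add_thread(prog, threads, seen, pc + 1, hist, pos)
    else:                     # "char" or "match": the thread waits here
        threads.append((pc, hist))


def parse(prog, slots, text):
    """Return the histories of the highest-priority full match, or None."""
    threads = []
    add_thread(prog, threads, set(), 0, ((),) * slots, 0)
    for pos, ch in enumerate(text):
        survivors, seen = [], set()
        for pc, hist in threads:              # list order is priority order
            op = prog[pc]
            if op[0] == "char" and op[1] == ch:
                add_thread(prog, survivors, seen, pc + 1, hist, pos + 1)
        threads = survivors
    for pc, hist in threads:
        if prog[pc][0] == "match":
            return hist
    return None


# (a?(a)b)+ with slots 0/1 = open/close of the outer group, 2/3 = inner group.
PROG = [
    ("save", 0), ("split", 2, 3), ("char", "a"), ("save", 2), ("char", "a"),
    ("save", 3), ("char", "b"), ("save", 1), ("split", 0, 9), ("match",),
]

hist = parse(PROG, 4, "aabab")
outer = list(zip(hist[0], hist[1]))
inner = list(zip(hist[2], hist[3]))
print(outer)  # [(0, 3), (3, 5)]
print(inner)  # [(1, 2), (3, 4)]
```

Both iterations of the outer group survive, which is the information a single-match engine discards.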


Download

https://github.com/nes1983/tree-regex


NFA construction

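Figure: NFA fragments for alternation S1|S2, optional S?, capture group (S), and star operation S*?. Some transitions in the alternation, optional, and star fragments are marked "-".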
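The four fragments can also be written down as a small compiler from a syntax tree to an instruction list, in the style of Thompson's construction with prioritized branches. This is an assumed encoding for illustration, not the representation used in the thesis: the tuple-based syntax tree, the instruction names split, jmp, and save, and the function names compile_regex and _emit are all hypothetical. In a split, the first target is the preferred branch.

```python
def compile_regex(node):
    """Compile a tuple-based syntax tree into a list of instructions."""
    prog = []
    _emit(node, prog)
    prog.append(("match",))
    return prog


def _emit(node, prog):
    kind = node[0]
    if kind == "char":
        prog.append(("char", node[1]))
    elif kind == "cat":                      # concatenation of any number of parts
        for child in node[1:]:
            _emit(child, prog)
    elif kind == "alt":                      # S1|S2, with S1 preferred
        split_at = len(prog)
        prog.append(None)                    # patched once S1's length is known
        _emit(node[1], prog)
        jump_at = len(prog)
        prog.append(None)                    # patched once S2's length is known
        prog[split_at] = ("split", split_at + 1, len(prog))
        _emit(node[2], prog)
        prog[jump_at] = ("jmp", len(prog))
    elif kind == "opt":                      # S?, entering S preferred (greedy)
        split_at = len(prog)
        prog.append(None)
        _emit(node[1], prog)
        prog[split_at] = ("split", split_at + 1, len(prog))
    elif kind == "group":                    # (S): record open and close positions
        prog.append(("save", 2 * node[1]))
        _emit(node[2], prog)
        prog.append(("save", 2 * node[1] + 1))
    elif kind == "lazy_star":                # S*?, leaving the loop preferred
        split_at = len(prog)
        prog.append(None)
        _emit(node[1], prog)
        prog.append(("jmp", split_at))
        prog[split_at] = ("split", len(prog), split_at + 1)
    else:
        raise ValueError(f"unknown node kind: {kind}")


# (a)|(a), with groups numbered from 0
alternation = compile_regex(
    ("alt", ("group", 0, ("char", "a")), ("group", 1, ("char", "a")))
)
for index, instruction in enumerate(alternation):
    print(index, instruction)
```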


Backtracking’s nightmare

(a+a+)+b

against "a^n b"

will backtrack Θ(2^n) times.

Backtracking’s nightmare

Extract the first cell in a CSV that starts with "P" [1]:

^(.*?,)+(P.*?),

failing against "1,2,3,4,5,6,7,8,9,10,11,12,13"

is exponential.

[1] From http://www.regular-expressions.info/catastrophic.html

Thread execution order matters

.*(a?)

Figure: NFA for .*(a?) with start state q1 and states q2 to q5; the transitions are labeled any, τ1↑, a, and τ1↓.

Priority matters

(a)|(a)

Figure: NFA for (a)|(a) with start state q1 and states q2 to q6; one branch is labeled τ1↑, a, τ1↓ and the other τ2↑, a, τ2↓.
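Both expressions are ambiguous, so the result depends on which alternative is tried first. As a reference point, Python's re module (a backtracking engine, used here only for comparison) resolves the ambiguity by preferring the left alternative and the greedy choice; presumably an automaton-based engine has to order its threads so that it reports the same result.

```python
import re

# (a)|(a): both branches match "a"; the first, higher-priority branch wins.
alt_groups = re.fullmatch(r"(a)|(a)", "a").groups()
print(alt_groups)   # ('a', None)

# .*(a?): the greedy .* runs first and leaves nothing for the group.
star_groups = re.fullmatch(r".*(a?)", "aa").groups()
print(star_groups)  # ('',)
```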


Optimization Pipeline

1. Convert to nondeterministic FA.

2. Interpret nondeterministic FA, building deterministic FA lazily.

3. Find similar/mappable states to avoid creating infinite DFA.

4. Run on DFA if possible.

5. Compactify DFA if creation of new states wasn't necessary for a while.
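Step 2 of the pipeline can be sketched for plain recognition. The class below builds DFA states only when the input first requires them and caches each transition; it ignores capture histories entirely, which is exactly the part that makes steps 3 and 5 necessary in the real engine. The class LazyDFA, its fields, and the example automaton for (a|b)*abb are assumptions for illustration.

```python
class LazyDFA:
    """Subset construction performed on demand, one transition at a time."""

    def __init__(self, delta, eps, start, accepting):
        self.delta = delta            # (nfa_state, char) -> set of nfa_states
        self.eps = eps                # nfa_state -> set of nfa_states (no input)
        self.accepting = accepting    # set of accepting nfa_states
        self.start = self._closure({start})
        self.cache = {}               # (dfa_state, char) -> dfa_state

    def _closure(self, states):
        stack, seen = list(states), set(states)
        while stack:
            state = stack.pop()
            for target in self.eps.get(state, ()):
                if target not in seen:
                    seen.add(target)
                    stack.append(target)
        return frozenset(seen)

    def step(self, dfa_state, ch):
        key = (dfa_state, ch)
        if key not in self.cache:     # interpret the NFA only on a cache miss
            moved = set()
            for state in dfa_state:
                moved |= self.delta.get((state, ch), set())
            self.cache[key] = self._closure(moved)
        return self.cache[key]

    def matches(self, text):
        current = self.start
        for ch in text:
            current = self.step(current, ch)
        return bool(current & self.accepting)


# NFA for (a|b)*abb: state 0 loops on a and b, then a, b, b lead to state 3.
delta = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}, (2, "b"): {3}}
dfa = LazyDFA(delta, eps={}, start=0, accepting={3})
print(dfa.matches("abb"), len(dfa.cache))  # True 3
```

Only the three transitions that "abb" needed were built; later inputs reuse them and add new ones as required.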

