Page 1: UPA and Restriction for All- Groups and Numeric Exponents Matthew Fuchs, PhD Westbridge Technology matt@westbridgetech.com Allen Brown, PhD Microsoft.

UPA and Restriction for All-Groups and Numeric Exponents

Matthew Fuchs, PhD
Westbridge Technology
matt@westbridgetech.com

Allen Brown, PhD
Microsoft

Why Bother?

• Numeric exponents were introduced by the W3C XML Schema WG.

• Restriction is a subsumption relation among content models.

• And-groups have long been cherished by the markup community.

• UPA (Unique Particle Attribution) is an old constraint on content models in WXS.

• What is the cost of combining them?

Naïve Algorithms

• Exponential or worse:
  – All-groups: try all exponentially many cases.
  – Numeric exponents: unroll, which is doubly exponential:
    • First unroll: (a{0,3} | b){10, 20} => ((a | aa | aaa | b)…(a |…)….
    • Then determinise.
  – Used by XSV, Xerces, Sun.

• Not trying to do better is simply remiss.

UPA Testing

• Generally just need to check follow sets.

• This is a problem for numeric exponents such as {m,m}.

• For example:
  – (a1,b2){2,2},a3 => ababa
  – ((a1, b2){1,3},a3) => aba or ababa or abababa

• Is a1 in follow(b2)?
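
The question above can be explored by brute force on the two small examples. The Python sketch below is not the authors' algorithm: it enumerates the finite, position-annotated language of each model and reports a conflict whenever the same prefix followed by the same letter is attributed to two different particles. The helper names `unroll` and `upa_conflicts` are hypothetical, and `unroll` is hard-wired to the example (a1,b2){lo,hi},a3.

```python
def unroll(lo, hi):
    """Annotated words of (a1,b2){lo,hi},a3 as lists of (letter, particle)."""
    body = [("a", "a1"), ("b", "b2")]
    return [body * n + [("a", "a3")] for n in range(lo, hi + 1)]


def upa_conflicts(words):
    """Return (letter, particle, other_particle) triples where one prefix
    plus one next letter is claimed by two different particles."""
    seen = {}        # (prefix letters, next letter) -> particle label
    conflicts = set()
    for word in words:
        prefix = ()
        for letter, label in word:
            key = (prefix, letter)
            # Record the first attribution; later ones must agree with it.
            if seen.setdefault(key, label) != label:
                conflicts.add((letter, seen[key], label))
            prefix += (letter,)
    return conflicts


print(upa_conflicts(unroll(2, 2)))  # fixed exponent {2,2}
print(upa_conflicts(unroll(1, 3)))  # range {1,3}
```

With the fixed exponent no conflict is found, even though both a1 and a3 can positionally follow b2; with the range {1,3} the letter a after "ab" can be either a3 or a1.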

Problem for All-groups

• Again, are different branches in each other's follow sets?

• (a & b & c) => follow(a) = {b, c}

• (a & b? & c) => follow(a) = {b, c} union follow(b) => {a, b, c}

• ((a,b?) & b & c) => violates UPA

Five properties of particles

• particles(p) => all particles within p, recursively defined.

• opaque(p) => a particle is opaque if it can’t match the empty string.

• first(p) => particles in p that can match the first letter in a string matching p

• follow(p) => particles in the outer expression that can match a letter in a string after the substring matched by p.

• confusion(p) => particles in p which could conflict with follow(p)
  – (a, b?) => b is in confusion((a, b?))
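
Two of these properties, opaque(p) and first(p), have a direct recursive definition. The sketch below is one possible reading in Python, not the authors' implementation; the class names `Elem`, `Seq`, `Choice`, `All`, and `Repeat` are hypothetical, and first(p) returns element names rather than particle objects to keep the example short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Elem:
    name: str


@dataclass(frozen=True)
class Seq:
    items: tuple


@dataclass(frozen=True)
class Choice:
    items: tuple


@dataclass(frozen=True)
class All:
    items: tuple


@dataclass(frozen=True)
class Repeat:
    body: object
    lo: int
    hi: int


def opaque(p):
    """True if p cannot match the empty string."""
    if isinstance(p, Elem):
        return True
    if isinstance(p, (Seq, All)):
        return any(opaque(c) for c in p.items)
    if isinstance(p, Choice):
        return all(opaque(c) for c in p.items)
    if isinstance(p, Repeat):
        return p.lo > 0 and opaque(p.body)
    raise TypeError(f"unknown particle: {p!r}")


def first(p):
    """Names of the elements that can match the first letter of p."""
    if isinstance(p, Elem):
        return frozenset({p.name})
    if isinstance(p, Seq):
        out = set()
        for c in p.items:
            out |= first(c)
            if opaque(c):  # nothing after an opaque child can come first
                break
        return frozenset(out)
    if isinstance(p, (Choice, All)):
        return frozenset().union(*(first(c) for c in p.items))
    if isinstance(p, Repeat):
        return first(p.body) if p.hi > 0 else frozenset()
    raise TypeError(f"unknown particle: {p!r}")


# ((a?, b){2,3}, c)
model = Seq((Repeat(Seq((Repeat(Elem("a"), 0, 1), Elem("b"))), 2, 3), Elem("c")))
print(sorted(first(model)), opaque(model))
```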

Special Considerations

• follow(p) is restricted as follows:
  – (((a?,b){m,m}),c) => follow(b) = {c}
  – (((a?,b){m, n}),c) => follow(b) = {c, a, b}
  – ((a & b & c), d) => follow(c) = {d}

Sources of UPA Violation

• Consider P in
  – (α, β{0,1}, P, γ)
  – (α, (β | P), γ)
  – (α, (β & P), γ)

• UPA violation requires 2 terminals:
  – One before P, one inside P: need first(P)
  – Both inside P: in a moment
  – One inside P, one after P: need confusion(P)
  – One before P, one after P: opacity(P) is false

Internal Consistency

• P{m, m}: even if P obeys UPA, there is a violation when confusion(P) ∩ first(P) != {}

• If P is (α & β & γ), there is a violation when
  – the first sets overlap, or
  – confusion(α) ∩ (first(β) ∪ first(γ)) != {}
  – and so on for β and γ

UPA Algorithm

UPA(P) holds when:

• P = a: if b_i, b_j in follow(a), then i = j

• P = α{m,n}: UPA(α) and first(α) ∩ confusion(α) = {}

• P = (α_1 | … | α_n): ∩(i=1..n) first(α_i) = {} and UPA(α_i) for all i in 1..n

• P = (α_1 & … & α_n): ∩(i=1..n) first(α_i) = {}, and for all i in 1..n, UPA(α_i) and confusion(α_i) ∩ (∪(j≠i) first(α_j)) = {}

• P = (α_1 , … , α_n): UPA(α_1) and UPA((α_2, …, α_n))

Subsumption for Exponents

• Two steps
  – For fixed exponents
  – For exponent ranges

• Most of the machinery carries over

• We use B or b to refer to the base model, and R or r to refer to the restricted model

Traditional

• Subsumption is traditionally checked by transforming both models into automata.

• Compute the intersection of the automata: R ∩ not(B) should be empty, where not(B) is B with its accepting states inverted.

• Once again, this is far too large when everything is unrolled.
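
The traditional check can be sketched as a search over the product of two deterministic automata: R is subsumed by B if no reachable state pair is accepting in R and non-accepting in B. The Python below is a minimal illustration under assumptions of my own: the DFA encoding (a dict with `start`, `delta`, and `accept`), the implicit `DEAD` sink, and the helper `chain` are all hypothetical, and the two example automata are hand-unrolled.

```python
from collections import deque

DEAD = object()  # implicit sink state used when B has no transition


def subsumed(r, b):
    """True if every word accepted by DFA r is also accepted by DFA b."""
    start = (r["start"], b["start"])
    seen = {start}
    queue = deque([start])
    while queue:
        rs, bs = queue.popleft()
        # A pair accepting in r but not in b witnesses R ∩ not(B) != {}.
        if rs in r["accept"] and bs not in b["accept"]:
            return False
        for (state, letter), r_next in r["delta"].items():
            if state != rs:
                continue
            b_next = b["delta"].get((bs, letter), DEAD)
            pair = (r_next, b_next)
            if pair not in seen:
                seen.add(pair)
                queue.append(pair)
    return True


def chain(word, accept):
    """DFA that walks along `word`; state i means i letters were read."""
    return {
        "start": 0,
        "delta": {(i, ch): i + 1 for i, ch in enumerate(word)},
        "accept": set(accept),
    }


R = chain("abab", {4})          # (a,b){2,2}, unrolled
B = chain("ababab", {2, 4, 6})  # (a,b){1,3}, unrolled

print(subsumed(R, B), subsumed(B, R))
```

Even in this tiny case the automata grow with the exponent values rather than with the size of the expression, which is the cost the slide objects to.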

Our Machines

• Represent regex as graph.

• Forward edges, matching terminals, form a DAG

• Back edges, matching exponents, form connected components.

• Each back edge marked with its arity.

Execution Model

• Letters are matched going forward by edges.

• Machine is “trapped” when a back-edge is entered.

• Can’t leave until obligation (value of back edges) fulfilled.

• Edge constraints are fulfilled in LIFO order.

• Stack maintains current iterations.

Example

• (a,((a,b)^2|b))^2

[Figure: graph of the machine, with forward edges labelled a, a, b, and b, and two back edges of arity 2.]

Subsumption Checking

• Start as usual.

• When entering head of a back edge, add entry to machine’s stack.

• When both reach a repeated state:
  – Tail of a back edge
  – Previously seen in list of traversed states

• Determine if there is a matched component

• Maximally reduce exponents for matched edges

For Example

• (a,(a,b,a,b)^6,b^3,c) <= (a,((a,b)^2|b)^9,c)

(r, b)   let   (r, b)   r      b
(0,0)    a     (1,1)    []     []
(1,1)    a     (2,2)    [0]    [0,0]
(2,2)    b     (3,3)    [0]    [1,0]
(3,1)    a     (4,2)    [0]    [1,0]
(4,2)    b     (5,3)    [1]    [2,0]
(5,3)    X     (5,1)    []     [6]
(5,1)    b     (6,3)    [1]    []
(6,3)    X     (6,3)    []     []
(6,3)    c     (7,4)    []     []

[Figure: machine graphs for the two models; edge labels include the arities 2 and 9.]

Reducing Exponents

• Find cross-product back-edge (start_r and start_b)

• Get d_r and d_b (number of iterations each)

• Get leftover (total_r - start_r) = l_r

• l_r div d_r = quot_r and rem_r, etc.

• new_r = l_r - (d_r * min(quot_r, quot_b)) + start_r

Why So Complicated?

• Compare (a,a,a)^7 and (a,a)^12

• Must go 3 rounds of (a,a) for 2 rounds of (a,a,a).

• l_r = 7, l_b = 12

• d_r = 2, d_b = 3

• l_r div d_r = 3 rem 1; l_b div d_b = 4 rem 0

• new_r = 7 - (2*3) + 0 = 1; new_b = 12 - (3*3) + 0 = 3

• Hence, max 6 rounds of (a,a,a) and 9 of (a,a).
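
The arithmetic above can be replayed in a few lines of Python. This is a sketch under a simplifying assumption that is mine, not the slides': each loop body matches a fixed number of letters, so the rounds per synchronised block (d_r and d_b) follow from the least common multiple of the two body lengths. The function name `reduce_exponents` and its parameters are hypothetical.

```python
from math import gcd


def reduce_exponents(len_r, len_b, left_r, left_b, start_r=0, start_b=0):
    """Absorb as many synchronised blocks as both loops allow.

    len_r, len_b   -- letters matched by one round of each loop body
    left_r, left_b -- leftover rounds (l_r, l_b) of each loop
    Returns (new_r, new_b, absorbed_r, absorbed_b).
    """
    block = len_r * len_b // gcd(len_r, len_b)  # letters per synchronised block
    d_r = block // len_r                        # rounds of r per block
    d_b = block // len_b                        # rounds of b per block
    blocks = min(left_r // d_r, left_b // d_b)  # min(quot_r, quot_b)
    new_r = left_r - d_r * blocks + start_r
    new_b = left_b - d_b * blocks + start_b
    return new_r, new_b, d_r * blocks, d_b * blocks


# (a,a,a)^7 against (a,a)^12
print(reduce_exponents(3, 2, 7, 12))
```

For the example this yields 1 and 3 remaining rounds after absorbing 6 rounds of (a,a,a) and 9 rounds of (a,a).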

Generalized Exponents

• Must keep track of minimum and maximum possible transitions.

• Edges can contribute to min, to max, or to both.

• Can’t exit until max > min allowed.

• Must exit before min > max allowed.

So….

• Generate as few minr/b as possible.
  – If they exceed maxr/b, you’re screwed

• Generate as many maxr/b as possible
  – Means you can use a forward transition
  – Use parsimoniously to maximize the amount matched

More Complex Machinery

• Back edge constraints have min and max.

• Some back edges increment just the max value.

• Other back edges increment both min and max values.

• Max means maximum possible match.

• Min means minimum possible match.

Example

• ((a, b?){3, 5}, c)

[Figure: machine for the expression, with forward edges a, b, and c and two back edges labelled 3,5.]

Four Kinds of Pairs

• When hitting a min-edge/min-edge:
  – Calculate min/min values (prev. algorithm with min exponents)
  – Calculate max/max values (prev. algorithm with max exponents)
  – Move forward when possible
  – If min ever exceeds max, fail.

• When hitting a max-edge/max-edge:
  – Calculate min/min values
  – Calculate max/max values
  – When max > min, you can progress (when leaving a cycle, set min to passing value)
  – Else fail.

• Etc.

• After exiting loop, some iterations remain.

• As all "unabsorbed" transitions are attempted, all possibilities are tried.

• Given (α){m_b, n_b}

• And (α){m'_r, n'_r}, (α){m''_r, n''_r}

• Ensure m'_r + m''_r >= m_b and n'_r + n''_r <= n_b
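
The final bound check can be written down directly. The sketch below assumes the bounds are inclusive (the summed minimum may equal m_b and the summed maximum may equal n_b) and that every restricted segment iterates the same body as the base; the function name `ranges_fit` is hypothetical.

```python
def ranges_fit(base, parts):
    """Check that consecutive restricted ranges over one body stay inside
    the base range.

    base  -- (m_b, n_b) of the base model
    parts -- [(m'_r, n'_r), (m''_r, n''_r), ...] of the restricted model
    """
    m_b, n_b = base
    m_sum = sum(m for m, _ in parts)  # fewest iterations the restriction makes
    n_sum = sum(n for _, n in parts)  # most iterations the restriction makes
    return m_b <= m_sum and n_sum <= n_b


print(ranges_fit((2, 5), [(1, 2), (1, 3)]))  # sums to the range 2..5
print(ranges_fit((3, 5), [(1, 2), (1, 3)]))  # minimum 2 is below 3
```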

• If the "rest of expression" matches after both the longest and the shortest match (i.e., matched m or matched n), then it will match for all iterations.

• Matching longest will try all alternatives.

• Matching shortest will try the fewest alternatives.

• As first sets repeat, UPA shows there must be optionality or iteration.

Nested Exponents

• (α{m,n}){m',n'}

• (a{m,n} | b){m', n'}

• Edges in machine have multiple exponents.

• Depth of n makes 2(n-1) ranges

• Each must be tried

• Requires tracking scope.

• Requires lookahead.

Cost

• Without nesting, algorithm is exponential in number of exponents – each exponent requires testing min and max.

• With nesting, remains exponential, as this doesn’t affect the number of exponents.

• Still a huge improvement over unrolling.

Example

• ((a?,b{8,9}){2,3},c) > (a,(b,b){3,3},(b,b){6,6},c)

• First 6 b’s at level 2, remaining 12 iterate both levels

• At higher levels ranges overlap – need to check all possibilities

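[Figure: base machine with particles a1, b2, and c0 and back edges labelled {8,9}{2,3}; restricted machine reading a b b b b c with exponents 3 and 6.]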

• ((a?,b{8,9}){2,9},c) > (a,(b,b){3,3},(b,b){6,6},c)

• 8*9=72, 9*8=72

• Need to check ending of 8 and start of 9

• Need lookahead to choose.

• Represented as ranges at all levels.
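
The overlap behind "8*9=72, 9*8=72" can be made visible by listing, for a nested exponent b{m,n}{m',n'}, which totals of b are reachable and with how many outer iterations. This Python sketch only illustrates why lookahead is needed; it is not part of the authors' algorithm, and the name `reachable_totals` is hypothetical.

```python
def reachable_totals(inner, outer):
    """Map each reachable total count to the outer iteration counts
    that can produce it, for a particle with exponents {inner}{outer}."""
    lo, hi = inner
    totals = {}
    for k in range(outer[0], outer[1] + 1):
        # k outer iterations give any total from k*lo to k*hi.
        for total in range(k * lo, k * hi + 1):
            totals.setdefault(total, []).append(k)
    return totals


wide = reachable_totals((8, 9), (2, 9))
print(wide[72])  # outer iteration counts that can yield 72 b's

narrow = reachable_totals((8, 9), (2, 3))
print(sorted(narrow))  # totals reachable with {8,9}{2,3}
```

With {8,9}{2,9}, a run of 72 b's fits either 8 outer iterations of 9 or 9 outer iterations of 8, so the machine cannot commit to one reading without looking ahead. With {8,9}{2,3} the ranges 16..18 and 24..27 do not overlap.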

[Figure: base machine with particles a1, b2, and c0 and back edges labelled {8,9}{2,9}, shown with the restricted machine.]

Conclusions

• Numeric exponents are hard to work with for subsumption.

• All-groups are not that difficult.

• Interaction will be even more annoying.

• Need to implement and test.