OPAL: A Passe-partout for Web Forms Poster



DIADEM: domain-centric intelligent automated data extraction methodology

[Figure: OPAL overview. Diagram labels: DOM tree; Field Scope; Fields & Labels; Segment tree; Segment Scope; Segments & Labels; Layout Scope; Layout tree; Visual Labels; Domain Scope; Form Model; Schema tree.]

Figure 3: OPAL Interface

(e). Next, at segment scope, we successfully find the segments for areas (a), (c), and (e). For (a), by recognizing the interleaving pattern between the radio buttons and texts, OPAL associates these texts correctly with the respective radio buttons. However, in (c), no clear text-field pattern arises to guide the label assignment.

Thus, the first two scopes leave the fields in areas (b) and (c) unassigned. OPAL continues with its analysis by exploiting visual information. At the layout scope (see Figure 4b), “Town / City” in area (b) is the only visible text for the drop down list, since it appears in the top left region of the field and does not belong to the visible region of any other field. Similarly, in area (c) we assign “Price” to both drop down lists.

Finally, at domain scope, OPAL classifies the elements and verifies that the obtained results comply with our domain-specific form model. There are no ambiguous or superfluous segments in this case.

Thanks to form interpretation, the filling phase only requires OPAL to pick the closest matches in the select boxes. If no match is found, such as for the town/city here (as Holbrook Moran does not serve the London area), the wildcard value is selected and the user is notified.
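The closest-match step can be sketched as follows. This is only an illustration under assumptions: the source does not say which string similarity OPAL uses, so Python's `difflib` stands in for it, and the function name `pick_option`, the wildcard text "Any", and the cutoff value are hypothetical.

```python
from difflib import get_close_matches

def pick_option(master_value, options, wildcard="Any", cutoff=0.6):
    """Return (chosen option, matched?) for one select box.

    Falls back to the wildcard when no option is similar enough,
    in which case the caller would notify the user.
    """
    # Compare case-insensitively but return the option as written in the form.
    lowered = {opt.lower(): opt for opt in options if opt != wildcard}
    hits = get_close_matches(master_value.lower(), list(lowered), n=1, cutoff=cutoff)
    if hits:
        return lowered[hits[0]], True
    return wildcard, False

options = ["Any", "Oxford", "Banbury", "Witney"]
print(pick_option("oxford", options))
print(pick_option("London", options))
```

The second call mirrors the town/city case above: the master form value has no close counterpart among the options, so the wildcard is chosen.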

Evaluation. The demonstration proceeds with several other examples from the UK real estate and used car domains. Figure 5a shows precision, recall, and F-score (accuracy) of OPAL for 100 forms from each of these domains. The contribution of each analysis scope in this experiment is depicted in Figure 5b. In both cases, OPAL achieves nearly perfect form understanding with F-scores close to 99%. We evaluate the form filling on top of this evaluation and observed no case where OPAL understands a form correctly but does not fill it successfully, except for a few cases where the form changes dynamically or uses heavily scripted UI elements. Figure 5b emphasizes that the combination of the three form labeling scopes and the domain-dependent form interpretation is indeed necessary to achieve such a high accuracy. In particular, it is worth pointing out that though it may appear that we achieve high accuracy with the simple form scope only (> 80% in

Figure 4: Holbrook Moran form and page-scope labeling. Panels: (a) web page, (b) page scope.

Figure 5: OPAL evaluation. Panels: (a) Accuracy (precision, recall, and F-score for the real estate and used car domains); (b) Scopes (field, segment, layout, and domain scope for the real estate and used car domains).

the used car domain), that observation overlooks that the hard task in form understanding is the last 10%. To underline this, we also evaluated OPAL on two publicly available benchmarks, ICQ and TEL-8 (http://metaquerier.cs.uiuc.edu/repository/), using only OPAL’s domain-independent form labeling. Even in this case, OPAL easily outperforms existing form understanding systems with > 95% on average for ICQ (where existing systems such as [1] achieve at best 92%).

A screencast of this demonstration is available at diadem-project.info/opal.

4. REFERENCES

[1] E. C. Dragut, T. Kabisch, C. Yu, and U. Leser. A hierarchical approach to model web query interfaces for web source integration. Proc. VLDB Endow., 2:325–336, 2009.

[2] T. Furche, G. Gottlob, G. Grasso, X. Guo, G. Orsi, and C. Schallhart. Real Understanding for Real Estate Forms. In Proc. of WIMS, 2011.

[3] R. Khare and Y. An. An empirical study on using hidden Markov model for search interface segmentation. In Proc. of CIKM, pages 17–26, 2009.

[4] R. Khare, Y. An, and I.-Y. Song. Understanding Deep Web Search Interfaces: A Survey. SIGMOD Record, 39(1):33–40, 2010.

[5] H. Nguyen, T. Nguyen, and J. Freire. Learning to extract form labels. Proc. VLDB Endow., 1:684–694, 2008.

[6] W. Wu, A. Doan, C. Yu, and W. Meng. Modeling and Extracting Deep-Web Query Interfaces. In Advances in Information & Intelligent Systems, pages 65–90, 2009.

[7] Z. Zhang, B. He, and K. C.-C. Chang. Understanding Web Query Interfaces: Best-Effort Parsing with Hidden Syntax. In Proc. of SIGMOD, 2004.

Figure 4: Example for Segment Scope Labeling

n_s as representative for s (f(s) = n_s). For each segment with regular interleaving of text nodes and field or segment nodes, we use those text nodes as labels for these nodes, preserving any already assigned labels and fields (from field scope). In detail, we iterate over all descendants c of each segment in document order, skipping any nodes that are descendants of another segment or field itself contained in n (line 13). In the iteration, we collect all field or segment nodes in Nodes, and all sets of text nodes between field or segment nodes in Labels, except those text nodes already assigned as labels in field scope (line 14), as we assume that these are outliers in the regular structure of the segment. We assign the i-th text node group to the i-th field if the two lists have the same size (possibly using the first text node as label of the segment, lines 17–19).

Figure 4 illustrates the segment scope labeling with triangles denoting text nodes, diamonds fields, black circles segments, and white circles DOM nodes not in the segment tree. The numbers indicate which text nodes are assigned as labels to which segments or fields. E.g., for the left hand segment, we observe a regular structure of (text node+, field)+ and thus we assign the i-th group of text nodes to the i-th field. For the right hand segment (4), we find a subsegment (5) and field 8 that is already labeled with text node 8 in the field scope. Thus 8 is ignored and only one text node remains directly in 4, which becomes the segment label. In 5, we find one more text node group than fields and thus consider the first text node group as a segment label. The remaining nodes have a regular structure (field, text node+)+ and get assigned accordingly.
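The group-to-field assignment described above can be sketched as follows. This is a simplified illustration, not the algorithm from the paper: a segment is reduced to a flat, document-ordered list of hypothetical `("text", name)` and `("field", name)` items, and nested segments and labels already assigned at field scope are assumed to have been filtered out beforehand.

```python
def label_segment(children):
    """Assign text node groups to fields within one segment.

    children: document-ordered list of ("text", name) or ("field", name).
    Returns (segment_label_group or None, {field: text_group}).
    An empty mapping means no regular structure was found.
    """
    fields, groups, current = [], [], []
    for kind, name in children:
        if kind == "text":
            current.append(name)
        else:
            fields.append(name)
            if current:              # close the text group seen before this field
                groups.append(current)
                current = []
    if current:                      # trailing text group after the last field
        groups.append(current)

    segment_label = None
    if len(groups) == len(fields) + 1:
        # One group too many: treat the first group as the segment label.
        segment_label, groups = groups[0], groups[1:]
    if len(groups) != len(fields):
        return None, {}
    return segment_label, dict(zip(fields, groups))

left = [("text", "t1"), ("field", "f1"), ("text", "t2"), ("text", "t3"), ("field", "f2")]
sub = [("text", "t5"), ("field", "f6"), ("text", "t6"), ("field", "f7"), ("text", "t7")]
print(label_segment(left))
print(label_segment(sub))
```

The first input follows the (text node+, field)+ pattern; the second has one more text group than fields, so its first group becomes the segment label and the rest follows (field, text node+)+.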

3.3 Layout Scope

At layout scope, we further refine the form labeling for each form field not yet labelled in field or segment scope, by exploring the visible text nodes in the west, north-west, or north quadrant, if they are not overshadowed by any other field. To this end, OPAL constructs a layout tree from the CSS box labels of the DOM nodes:

DEFINITION 9. The layout tree of a given DOM P is a tuple (N_P, ⊂, w, nw, n, ne, e, se, s, sw, aligned) where N_P is the set of DOM nodes from P, and ⊂, w, nw, n, . . . are the “belongs to” (containment), west, north-west, north, . . . relations from RCR [12], and aligned(x, y) holds if x and y have the same height and are horizontally aligned.

We call w, nw, . . . the neighbour relations. The layout tree is at most quadratic in the size of a given DOM P and can be computed in O(|P|^2). For convenience, we write, e.g., w-nw-n to denote the union of the relations w, nw, and n.

In cultures with left-to-right reading direction, we observe a strong preference for placing labels in the w-nw-n region from a field. However, forms often have many fields interspersed with field labels and segment labels. Thus we have to carefully consider overshadowing. Intuitively, for a field f, a visible text node t is overshadowed by another field f′ if t is above f′ or also visible from, but closer to, f′. In the particular case of aligned fields, the former would prevent any labeling for these fields and thus we relax the condition.

DEFINITION 10. Given a text node t, a field f′ overshadows another field f if (1) f and f′ are unaligned, w-nw-n(f′, f), and w-nw-n-ne-e(t, f′), or (2) f and f′ are aligned and (i) w(t, f′) or (ii) nw-n(t, f′) and there is a text node t′ not overshadowed by another field with ne-e(t′, f′) and w-nw-n(t′, f).

Figure 5: Layout Scope Labeling. The figure shows fields F1, F2, F3 and text nodes T1, T2, T3, T4 arranged around the compass directions N, NE, E, SE, S, SW, W, NW.

To illustrate this overshadowing, consider the example in Figure 5. For field F1, T2 and T4 are overshadowed by F2 and T3 by F3; only T1 is not overshadowed, as there is no other text node that is south-east or south from T3 not overshadowed by another field.

The layout scope labeling is then produced as follows: For each field f, we collect all text nodes t with w-nw-n(t, f) and add them as labels to f if they are not overshadowed by another field and not contained in a segment that is no ancestor of f. The latter prevents assignment of labels from unrelated form segments.
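A reduced version of this labeling step can be sketched as follows. It rests on several assumptions that are not from the source: boxes are axis-aligned rectangles with y growing downwards, the compass relations are approximated by comparing box edges rather than by the RCR relations, only case (1) of Definition 10 is implemented (all fields are treated as unaligned), and the segment-ancestor check is omitted. The names `Box`, `direction`, and `labels_for` are hypothetical.

```python
from typing import NamedTuple

class Box(NamedTuple):
    left: float
    top: float
    right: float
    bottom: float

def direction(a, b):
    """Approximate compass direction of box a as seen from box b (None if they overlap)."""
    horiz = "w" if a.right <= b.left else "e" if a.left >= b.right else ""
    vert = "n" if a.bottom <= b.top else "s" if a.top >= b.bottom else ""
    return (vert + horiz) or None

W_NW_N = {"w", "nw", "n"}
W_NW_N_NE_E = {"w", "nw", "n", "ne", "e"}

def labels_for(field, fields, texts):
    """Text nodes in the w-nw-n region of `field` that no other field overshadows."""
    target = fields[field]
    result = []
    for name, box in texts.items():
        if direction(box, target) not in W_NW_N:
            continue
        overshadowed = any(
            other != field
            and direction(other_box, target) in W_NW_N      # w-nw-n(f', f)
            and direction(box, other_box) in W_NW_N_NE_E     # w-nw-n-ne-e(t, f')
            for other, other_box in fields.items()
        )
        if not overshadowed:
            result.append(name)
    return result

fields = {"F1": Box(100, 100, 200, 120), "F2": Box(100, 50, 200, 70)}
texts = {"T1": Box(0, 100, 90, 120), "T2": Box(0, 50, 90, 70)}
print(labels_for("F1", fields, texts), labels_for("F2", fields, texts))
```

In this toy layout, T2 lies north-west of F1 but directly west of F2, so F2 overshadows F1 for T2 and each field keeps only the text on its own row.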

4. FORM INTERPRETATION

There is no straightforward relationship between form fields for domain concepts, such as location or price, and their structure within a form. Even seemingly domain-independent concepts, such as price, often exhibit domain-specific peculiarities, such as “guide price”, “current offers in excess”, or payment periods in real estate. OPAL’s domain schemata allow us to cover these specifics. We recall from Section 2 that a form model (F′, t) for a schema S is derived from a form labeling F by extending F with types and restructuring its inner nodes to fit the structural constraints of S.

OPAL performs form interpretation of a form labeling F in two steps: (1) the classification of nodes in F according to the domain types T to obtain a partial typing t_P. This step relies on the annotation schema L and its typing of labels in F; (2) the model repair, where the segmentation structure derived in the segmentation scope (Section 3.2) is aligned with the structure constraints of S.

4.1 Schema Design: OPAL-TL

OPAL provides a template language, OPAL-TL, for easily specifying domain schemata reusing common concepts and their constraints as well as concept templates. To implement a new domain, we only need to provide (1) a set of annotators implementing isLabel_a and isValue_a and (2) an OPAL-TL specification of the domain types and their classification and structural constraints.

OPAL-TL extends Datalog with templates and predefined predicates for convenient querying of annotations and DOM nodes. An OPAL-TL program is executed against a form labeling F and a DOM P. Relations from F and P are mapped in the obvious way to OPAL-TL. We only use child (descendant, resp.) for the child (descendant, resp.) relation in F. We extend document and sibling order from P to F: follows(X, Y) for X, Y ∈ F, if R_following(f(X), f(Y)) ∈ P and no other node in F occurs between X and Y in document order; adjacent(X, Y), if R_next-sibling(f(X), f(Y)) ∈ P or vice versa. Finally, we abbreviate label_l(f(X)) as l(X).

Annotation types and their queries. Annotations (instances of annotation types) are characterised by an external specification of the characteristic functions isLabel_a and isValue_a for each a ∈ A. In the current version of OPAL, these functions are implemented with simple GATE (gate.ac.uk) gazetteers and transducers, that

Segmentation: Find “logical” structure of the form

— segmentation tree with only fields as leaves and

• form segments s as inner nodes such that s has

• at least degree two and all fields in s are style-equivalent

— segmentation labeling: distribute labels

• to fields, if there is a regular structure

• to segments, if single, prefix label

— style-equivalence: two nodes n and n’

• if same class or same type and CSS style

— complexity: linear in document size and depth
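The style-equivalence test named in the list above can be sketched in a few lines. This is a minimal illustration under assumptions not stated on the poster: nodes are plain dictionaries with hypothetical keys `cls`, `type`, and `style`, and the condition is read as "same class, or same type and same CSS style".

```python
def style_equivalent(n1, n2):
    """Hypothetical reading: same class, or same type and same CSS style."""
    if n1.get("cls") is not None and n1.get("cls") == n2.get("cls"):
        return True
    return n1.get("type") == n2.get("type") and n1.get("style") == n2.get("style")

a = {"cls": "price", "type": "select", "style": "width:80px"}
b = {"cls": "price", "type": "text", "style": "width:120px"}
c = {"cls": None, "type": "text", "style": "width:120px"}
print(style_equivalent(a, b), style_equivalent(b, c), style_equivalent(a, c))
```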

Visual heuristics: Find labels in visual proximity of a field

— but: not “overshadowed” by another field

— but: preference for labels preceding a field in reading order

— t visible from a point p if north-west of p

— f overshadows f’ for t if t visible from

• bottom-right corner of f and f, f’ unaligned

• bottom-left corner of f and f, f’ aligned

• bottom-right corner of f, f, f' aligned and f has no other visual label

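[Figure: layout scope example with fields F1, F2, F3 and text nodes T1, T2, T3, T4 around the compass directions N, NE, E, SE, S, SW, W, NW.]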

    TEMPLATE segment<C> {
      segment<C>(G) ⇐ outlier<C>(G), child(N1,G),
          ¬(child(N2,G), ¬(concept<C>(N2) ∨ segment<C>(N2))) }

    TEMPLATE segment_range<C,CM> {
      segment<C>(G) ⇐ outlier<C>(G), concept<CM>(N1), concept<CM>(N2),
          N1 ≠ N2, child(N1,G), child(N2,G) }

    TEMPLATE segment_with_unique<C,U> {
      segment<C>(G) ⇐ outlier<C>(G), child(N1,G), concept<U>(N1,G),
          ¬(child(N2,G), N1 ≠ N2, ¬(concept<C>(N2) ∨ segment<C>(N2))) }

    TEMPLATE outlier<C> {
      outlier<C>(G) ⇐ root(G) ∨ child(G,P), child(G′,P), ¬(segment<C>(G′)) }

Figure 8: OPAL-TL structural constraints

first two templates). It is the only template with two concept template parameters, C and CM, where CM < C is the “minmax” variant of C. The first rule locates adjacent pairs of such nodes, or a single such node and one that is already classified as C. The second rule locates nodes where the second directly follows the first (already classified with C), has a range_connector (e.g., “from” or “to”), and is not annotated with an annotation type with precedence over A. The last rule also locates adjacent pairs of such nodes and classifies them with CM if they carry a combination of min and max annotations.

In addition to these templates, there is also a small number of specific patterns. In the real estate domain, e.g., we use the following rule to describe forms that use links for submission (rather than submit buttons). Identifying such a link (without probing and analysis of JavaScript event handlers) is performed based on an annotation type for typical content, title (i.e., tooltip), or alt attribute of contained images. This is mostly, but not entirely, domain independent (e.g., in real estate a “rent” link).

    concept<LINK_BUTTON>(N1) ⇐ form(F), descendant(N1,F), link(N1), N1@LINK_BUTTON{d},
        ¬(descendant(N2,F), (concept<BUTTON>(N2) ∨ follows(N1,N2)))

4.3 Model Repair

With fields and segments classified, OPAL verifies and repairs the structure of the form according to structural constraints on the segments, such that it fits the patterns prescribed by the domain schema. As for classification constraints, we use OPAL-TL to specify the structural constraints. The actual verification and repair is also implemented in OPAL-TL, but since it is not domain independent, it is not exposed to the user for modification. Here, we first introduce typical structural constraints and their templates and then outline the model repair algorithm, but omit the OPAL-TL rules.

Structural constraints. The structural constraints and templates in the real estate and used car domains are shown in Figure 8 (omitting only the instantiation as in the classification case). All segment templates require that there is an outlier among the siblings of the segment: outlier<C>(G) holds if at least one of G’s siblings is not a C segment. (1) Basic segment. A segment is a C segment if its children are only other segments or concepts typed with C. This is the dominant segmentation rule, used, e.g., for ROOM, PRICE, or PROPERTY_TYPE in the real estate domain. (2) Minmax segment. A segment is a C segment if it has at least two field children typed with CM, where CM < C is the minmax type for C. This is used, e.g., for PRICE and BEDROOM range segments. (3) Segment with mandatory unique. A segment is a C segment if its children are only segments or concepts typed with C except for one (mandatory) field child typed with U, where U ≮ C. This is used for GEOGRAPHY segments where only one RADIUS may occur.
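The three segment patterns can be read as simple predicates over the types of a segment's children. The sketch below is only an illustration of that reading, not OPAL-TL: each child is represented by a hypothetical set of type names, the outlier condition on siblings is left out, and the type name `PRICE_MINMAX` is invented for the example.

```python
def is_basic_segment(children, c):
    """Every child is a concept or segment typed with c."""
    return len(children) > 0 and all(c in types for types in children)

def is_minmax_segment(children, cm):
    """At least two children are typed with the minmax type cm."""
    return sum(1 for types in children if cm in types) >= 2

def is_segment_with_unique(children, c, u):
    """Exactly one child is typed with u; every other child is typed with c."""
    unique = [types for types in children if u in types]
    others = [types for types in children if u not in types]
    return len(unique) == 1 and all(c in types for types in others)

# Each child is given as the set of types assigned to it.
geography = [{"GEOGRAPHY"}, {"GEOGRAPHY"}, {"RADIUS"}]
price_range = [{"PRICE_MINMAX"}, {"PRICE_MINMAX"}]

print(is_segment_with_unique(geography, "GEOGRAPHY", "RADIUS"))
print(is_minmax_segment(price_range, "PRICE_MINMAX"))
print(is_basic_segment(geography, "GEOGRAPHY"))
```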

Repairing form interpretations. The classification yields a form interpretation F that is, however, not necessarily a model under S, and may contain violations of structural constraints. We adapt the types of fields and segments and the segment hierarchy of F with the rewriting rules described below to construct a form model compliant with S. OPAL performs the rewriting in a stratified manner to guarantee termination and introduces at most n new segments, where n is the number of fields in the form.

(1) Under Segmentation: If there is a segment n with type t such that C_T(t) requires additional child segments of type t_1, . . . , t_k ∉ child-T(n), we try to partition the children of n into k+1 partitions P_1, . . . , P_k, P_n such that P_i ⊨ C_T(t_i) and P_n ∪ {t_1, . . . , t_k} ⊨ C_T(t). For each P_i we add a new segment node as child of n, classify it with t_i, and move all nodes assigned to P_i from n to that segment. In practice, few cases of multiple under segmentations occur at the same node and we can limit the search space using a total order on T. Though in general this would require value invention, the number of segments is actually bounded by the number of fields in the form, which is typically between 2 and 10. Therefore, we provide a pool of unused segments in the segmentation.

(2) Over Segmentation: If there is a segment n of type t with children c_1, . . . , c_k such that

    ⋃_i child-T(c_i) ∪ ⋃_{n′ ∈ C} t(n′) ⊨ C_T(t)

where C is the set of children of n without c_1, . . . , c_k, then we move the children of each c_i to n and delete all c_i.

(3) Under Classification: If there is a segment n of type t with untyped children c_1, . . . , c_k and corresponding types t_1, . . . , t_k such that child-T(n) ∪ {t_1, . . . , t_k} ⊨ C_T(t) and, for each c_i, child-T(c_i) ⊨ C_T(t_i) holds, then we type c_i with t_i.

(4) Over Classification: If there is a segment node n of type t with child c typed t_1 and t_2 such that

    {t_1} ∪ ⋃_{c′ ∈ C} t(c′) ⊨ C_T(t)

where C is the set of children of n without c, we drop t_2 from t(c).

(5) Miss Classification: If there is a node n of type t where child-T(n) ⊭ C_T(t), then we delete the classification of n as t.

5. EVALUATION

We perform experiments on several domains across four different datasets. Two datasets are randomly sampled from the UK real estate and UK used-car domains, respectively. We compare with existing approaches via ICQ and TEL-8, two public benchmark sets, on which we only evaluate OPAL’s form labeling for fair comparison to existing approaches, as they only label forms and do not use domain knowledge. Even with these limitations, OPAL outperforms these approaches in most domains by at least 5%. We also perform an introspective analysis of OPAL to show (1) the impact of field, segment, layout, and domain scope and (2) OPAL’s performance and scalability with increasing page size.

For the evaluation, we evaluate the proper assignment of text nodes to form fields using precision, recall, and F1-score (harmonic mean F1 = 2PR/(P+R) of precision and recall). Precision P is defined as the proportion of correctly labeled fields over total labeled fields, while recall R is the fraction of correctly labeled fields over total number of fields. For all considered datasets, we compare the extracted result to a manually constructed gold standard. We evaluate segmentation through its impact on classification, see Figure 10b and the improved performance on the two datasets where we perform form interpretation (UK real estate and used car) versus the ICQ and TEL-8 datasets.
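As a worked illustration of these definitions, the following sketch computes the three measures from field counts. The function name `prf` and the example counts are made up for illustration and are not taken from the evaluation.

```python
def prf(correct, labeled, total):
    """Precision, recall, and F1 from counts of fields.

    correct: fields labeled correctly
    labeled: fields that received any label
    total:   all fields in the gold standard
    """
    p = correct / labeled if labeled else 0.0
    r = correct / total if total else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Hypothetical counts: 95 of 100 fields were labeled, 90 of them correctly.
p, r, f1 = prf(correct=90, labeled=95, total=100)
print(round(p, 3), round(r, 3), round(f1, 3))
```

With these counts, F1 also equals 2 · correct / (labeled + total), which is a convenient sanity check on the harmonic mean.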

Datasets. For the UK real estate domain, we build a dataset by randomly selecting 100 real estate agents from the UK yellow pages (yell.com). Similarly, we randomly pick 100 used-car dealers from the largest UK aggregator website, autotrader.co.uk. The forms in these two domains have significantly different characteris-

Figure 6: Example Form Labeling

are either provided by human domain experts or derived from external sources such as DBPedia and Freebase. The current OPAL version contains a large set of such artefacts for common domain types such as price, location, or date.

DEFINITION 11. Given a form labeling F on a DOM P and an annotation schema L, an OPAL-TL annotation query is an expression of the form X@A{d, p, e}, where X is a first-order variable, A ∈ A, and d, p, and e are annotation modifiers. An annotation query X@A^µ with µ ⊆ {d, p, e} holds for all X ∈ ⟦A^µ⟧ with

    ⟦A^µ⟧ = {n ∈ P : Allow_µ(n) ∩ Match_µ(A) ≠ ∅} \ Block_µ(A)

with Allow_µ(n) set to y(n) for d ∈ µ, and y(n) ∪ y(parent of n) otherwise. Match_µ(A) is set to {l : ⋃_{A′ <* A} isLabel_{A′}(l)} for p ∈ µ, and {l : ⋃_{A′ <* A} (isLabel_{A′}(l) ∨ isValue_{A′}(l))} otherwise. Block_µ(A) equals {n : ∃A′ ≺ A, |Match_µ(A)| < |Match_µ(A′)|} if e ∈ µ, and ∅ otherwise.

Intuitively, an annotation query X@A returns all nodes labeled with a label that is annotated with A. If the modifier d (direct) is not present, we also consider the (direct) segment parents; otherwise only direct labels are considered. If the modifier p (proper) is present, only isLabel_A is used, otherwise also isValue_A. If the modifier e (exclusive) is present, a node that fulfils all other conditions is still not returned if there are more labels with annotations of a type that has precedence over A.

Consider the form labeling of Figure 6 under a schema with C < B and B ≺ A. Labels are denoted with triangles, fields with diamonds, segments with circles. Labels are further annotated with matching annotation types (here always only one). Value labels are drawn as outlines. Then, X@A{} matches 2, 3, 4; X@A{e,d} matches 2, 4, but not 3, as 3 has more labels of B (or one of its subclasses) than of A and the exclusive modifier e is present; X@A{e,p} matches 2, 3, but not 4, as the proper modifier p prevents the value labels in white from being considered. The latter matches 3 despite the presence of e, as we consider also the labels of the parent of 3 (since the direct modifier d is absent) and thus there are two A labels.
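The effect of the three modifiers can be sketched as follows. This is an interpretation rather than the definition itself: the block condition is evaluated per node by comparing how many of the node's allowed labels match A against how many match a type with precedence over A, and the data (label names `a1`, `b1`, `va`, and the node layout) is a hypothetical example loosely modelled on the discussion above, not a reproduction of Figure 6.

```python
def query(a, mods, nodes, labels, parent, is_label, is_value, subtypes, precedes):
    """Nodes matched by X@A{mods}; mods is a string over the letters d, p, e."""
    def allowed(n):
        own = list(labels.get(n, []))
        if "d" in mods or n not in parent:
            return own                                 # direct labels only
        return own + list(labels.get(parent[n], []))   # plus labels of the parent segment

    def count(n, t):
        hits = 0
        for lab in allowed(n):
            for s in subtypes.get(t, {t}):
                proper = lab in is_label.get(s, ())
                value = "p" not in mods and lab in is_value.get(s, ())
                if proper or value:
                    hits += 1
                    break
        return hits

    result = []
    for n in nodes:
        own = count(n, a)
        if own == 0:
            continue
        if "e" in mods and any(count(n, b) > own for b in precedes.get(a, ())):
            continue                                   # blocked by a preceding type
        result.append(n)
    return result

data = dict(
    nodes=[2, 3, 4],
    labels={1: ["a3"], 2: ["a1"], 3: ["a2", "b1", "b2"], 4: ["va"]},
    parent={2: 1, 3: 1, 4: 0},
    is_label={"A": {"a1", "a2", "a3"}, "B": {"b1", "b2"}},
    is_value={"A": {"va"}},
    subtypes={"A": {"A"}, "B": {"B"}},
    precedes={"A": {"B"}},                             # B has precedence over A
)
print(query("A", "", **data), query("A", "ed", **data), query("A", "ep", **data))
```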

OPAL-TL templates. OPAL-TL extends Datalog¬ (Datalog with stratified negation) by templates to define reusable patterns for domain concepts. Examples of such patterns are basic classification patterns that derive a domain type from a conjunction of annotation types, or min-max range patterns where we look for multiple fields with related annotations in a group and some clue that they represent a range. In general, there are two types of template patterns, one for classification constraints, one for structural constraints. The former specify patterns for relationships between domain and annotation types, the latter the abstract structure of domain concepts.

DEFINITION 12. An OPAL-TL template is an expression TEMPLATE N<D1, . . . , Dk> { p ⇐ expr } where N names the template, D1, . . . , Dk are template parameters, p is a template atom, and expr is a conjunction of template atoms and annotation queries. A template atom p<C1, . . . , Ck>(X1, . . . , Xn) consists of a first-order predicate name p, template variables C1, . . . , Ck, and first-order variables X1, . . . , Xn.

Multiple rules with the same head express union as usual. For convenience, we use ∨ and ¬ over conjunctions, which are translated to pure Datalog¬ rules as usual (not affecting data complexity).

     1  TEMPLATE basic_concept<C,A> { concept<C>(N) ⇐ N@A{d,e,p} }
     2
     3  TEMPLATE concept_by_segment<C,A> { concept<C>(N) ⇐ N@A{e,p} }
     4
     5  TEMPLATE concept_minmax<C,CM,A> {
     6    concept<CM>(N1) ⇐ child(N1,G), child(N2,G), adjacent(N1,N2),
     7        N1@A{e,d}, (concept<C>(N2) ∨ N2@A{e,d})
     8    concept<CM>(N2) ⇐ child(N1,G), child(N2,G), follows(N2,N1),
     9        concept<C>(N1), N2@range_connector{e,d}, ¬(A1 ≺ A, N2@A1{d})
    10    concept<CM>(N1) ⇐ child(N1,G), child(N2,G), adjacent(N1,N2),
    11        N1@A{e,p}, N2@A{e,p}, ((N1@min{e,p}, N2@max{e,p})
    12        ∨ (N1@max{e,p}, N2@min{e,p})) }

Figure 7: OPAL-TL classification templates

As an example, the following template defines a family of constraints that associate the domain type D to a node N whenever N is labeled by an exclusive, direct, and proper annotation of type A.

    TEMPLATE basic_concept<D,A> { concept<D>(N) ⇐ N@A{e,d,p} }

A template tpl is instantiated to produce a family of rules where the formal template variables D1, . . . , Dk are instantiated using values v^i_1, . . . , v^i_k from a template instantiation expression of the form

    INSTANTIATE tpl<D1, . . . , Dk> using { <v^1_1, . . . , v^1_k> . . . <v^n_1, . . . , v^n_k> }

For example, the following expression instantiates basic_concept, replacing D with type RADIUS and A with annotation type radius,

    INSTANTIATE basic_concept<D,A> using {<RADIUS, radius>}

and produces the following instantiated rule:

    concept<RADIUS>(N) ⇐ N@radius{e,d,p}
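Instantiation amounts to substituting the template variables in the rule text once per value tuple. The sketch below illustrates this with plain string rewriting; it is not how OPAL-TL is implemented, the function name `instantiate` is hypothetical, `<=` is an ASCII stand-in for the rule arrow, and the rule body is the basic_concept body as written in Figure 7.

```python
import re

def instantiate(params, rule, rows):
    """Produce one rule per value tuple by replacing the template variables."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, params)) + r")\b")
    rules = []
    for row in rows:
        binding = dict(zip(params, row))
        # A single pass, so substituted values are never rewritten again.
        rules.append(pattern.sub(lambda m: binding[m.group(1)], rule))
    return rules

rule = "concept<D>(N) <= N@A{d,e,p}"
rows = [("RADIUS", "radius"), ("PRICE", "price")]
for r in instantiate(["D", "A"], rule, rows):
    print(r)
```

Because each instantiation is a finite textual expansion, the result is an ordinary set of rules, which matches the remark that instantiation reduces to Datalog.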

PROP. 1. OPAL-TL has the same data complexity as Datalog¬.

4.2 Classification

Classification is based on the classification constraints of the domain schema. In OPAL these constraints are specified using OPAL-TL to enable reuse of domain concepts and concept patterns. In the real estate and used car domains, we identify three patterns that suffice to describe nearly all classification constraints. These patterns effectively capture very common semantic entities in forms and are parametrized using domain knowledge. The building blocks are a domain type (or concept) C and an annotation type A that is used to define a classification constraint for C. None of these patterns uses more than one annotation type as template parameter, though many query additional (but fixed) annotation types in their bodies.

Figure 7 shows the classification templates for real estate and used car: (1) Basic concept. The first template captures direct classification of a node N with type C, if N matches X@A{d,e,p}, i.e., has more proper labels of type A than of any other type A′ with A′ ≺ A. This template is used by far most frequently, primarily for concepts with unambiguous proper labels. (2) Concept by segment. The second template relaxes the requirement by considering also indirect labels (i.e., labels of the parent segment). In the real estate and used car domains, this template is instantiated primarily for control fields such as ORDER_BY or DISPLAY_METHOD (grid, list, map) where the possible values of the field are often misleading (e.g., an ORDER_BY field may contain “price”, “location”, etc. as values). (3) Min-max concept. Web forms often show pairs of fields representing min-max values for a feature (e.g., the number of bedrooms of a property). We specify this pattern with three simple rules (lines 5–12) that describe three configurations of segments with fields associated with value labels only (proper labels are captured by the

[Figure: evaluation charts. Axis and legend labels: Real-estate, Used-car; field, segment, layout, domain; Airfare, Auto, Book, Job, US R.E.; UK Real Estate (100), UK Used Car (100), ICQ (98), Tel-8 (436).]

Domain Scope: OPAL-TL

OPAL-TL extends Datalog¬,≠ with templates and annotation queries

— templates = common pattern in web forms, e.g., range specification

— rules = constraints for domain-specific element and segment types

— annotation queries: labels annotated with certain type

• direct? look for direct only or also group labels?

• proper? look for labels (‘price’) or also values (‘£10’)

• exclusive? majority of labels are of the type

— template: limited second-order, same data complexity

• template variables quantify over predicates or types

• instantiation reduces to Datalog

INSTANTIATE basic_concept<D,A> using {<RADIUS, radius>}

Form Filling: A Passe-partout for the Web

[Figure: OPAL interface. ➊ URL; ➋ Visualization Controller; ➌ Web page; ➍ Visualization; ➎ Master form; Ⓐ Field; Ⓑ Segment. Callouts: filled automatically according to master form; provides passe-partout for filling individual forms.]

Master form serves as template for filling individual forms (passe-partout). Currently, free text values are used directly for free text inputs, but approximate matching is used for filling select boxes.

Applications of OPAL

Search; Data Extraction; Assistive Devices; Multi-modal Input (Touch-Screens); Web Automation & Testing

Form filling: the ‘key’ to the web; all types of web automation need it

Experiments

OPAL: Automated Form Understanding for the Deep Web

OPAL combines multi-scope domain-independent analysis with domain knowledge

— multi-scope domain-independent analysis: field, segment, and layout scope

— integration of three scopes yields more robust, simpler heuristics than single scope

— strict preference for disambiguation for quality and performance reasons

OPAL: A Passe-partout for the Web

Domain knowledge on top of the domain-independent analysis for

— classifying form fields and segments according to the domain ontology

— verifying and repairing the form model to be consistent with domain constraints

— Datalog-based template language for easy definition of domain knowledge

Automatic form filling based on domain-specific master form

— automatic (approximate) matching of master form values to values of concrete fields

— visualization of form and segment concepts

— automatically detects forms of the given domain and fills them

— works on nearly any page of the domain

OPAL outperforms previous approaches even without domain knowledge

— with domain knowledge we achieve nearly perfect accuracy in multiple domains

Digital Home: diadem.cs.ox.ac.uk/opal/

Authors: Xiaonan Guo, Jochen Kranzdorf, Tim Furche, Giovanni Grasso, Giorgio Orsi, Christian Schallhart