Ex Information Extraction System
Martin Labsky
KEG seminar, March 2006
Agenda
• Purpose
• Use cases
• Sources of knowledge
• Identifying attribute candidates
• Parsing instance candidates
• Implementation status
Purpose
• Extract objects from documents
  – object = instance of a class from an ontology
  – document = text, possibly with formatting, and other documents from the same source
• Usability
  – make simple things simple
  – keep complex things possible
Use Cases
• Extraction of objects of a known, well-defined class(es)
• From document collections of any size
  – structured, semi-structured, or free text
  – extraction should improve if:
    • documents contain some formatting (e.g. HTML)
    • this formatting is similar within or across documents
• Examples
  – product catalogues (e.g. detailed product descriptions)
  – weather forecast sites (e.g. forecasts for the next day)
  – restaurant descriptions (cuisine, opening hours etc.)
  – emails on a certain topic
  – contact information
Use 3 sources of knowledge
• Ontology
  – the only mandatory source
  – class definitions + IE “hooks” (e.g. regexps)
• Sample instances
  – possibly coupled with referring documents
  – used to learn the typical content and context of extractable items
• Common formatting structure
  – of the presented instances
  – in a single document, or
  – among documents from the same source
Ontology sample
• see monitors.xml
Sample Instances
• see monitors.tsv and *.html
Common Formatting
• If a document or a group of documents has a common or similar regular structure, this structure can be identified by a wrapper and used to improve extraction (especially recall)
Document “understanding”
• Known pattern spotting [4]
• ID of possible wrappers [2]
• ID of attribute candidates [2]
• Parsing attribute candidates [4]
Known pattern spotting (1)
• Sources of known patterns
  – attribute content patterns
    • specified in EOL
    • induced automatically by generalizing attribute contents in sample instances
  – attribute context patterns
    • specified in EOL
    • induced automatically by generalizing attribute context observed in referring documents
Known pattern spotting (2)
• Known phrases and patterns are represented using a single data structure
The phrase "LCD monitor VIEWSONIC VP201s" is stored as a sequence of TokenInfo records:

| token | token ID | lwrcase ID | lemma ID | capitalization | token type |
|---|---|---|---|---|---|
| LCD | 230 | 211 | 211 | UC | AL |
| monitor | 215 | 215 | 215 | LC | AL |
| VIEWSONIC | 567 | 456 | 456 | UC | AL |
| VP201s | 719 | 718 | 718 | MX | AN |

(UC/LC/MX = upper/lower/mixed case; AL/AN = alphabetic/alphanumeric token type.)
Known pattern spotting (3)
• Known phrases and patterns are represented using a single data structure
The PhraseInfo record for the phrase "LCD monitor VIEWSONIC VP201s", with per-attribute counts:

| PhraseInfo field | value |
|---|---|
| phrase ID | 989 |
| lemma phrase ID | 567 |
| cnt as monitor_name content | 3 |
| cnt as monitor_name L-context | 0 |
| cnt as monitor_name R-context | 0 |
| cnt as garbage | 0 |
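These two records are easy to picture in code. Below is a minimal Python sketch of TokenInfo and PhraseInfo as shown on these slides; the field and method names are illustrative, not the system's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class TokenInfo:
    """One token of a known phrase (IDs index into the Vocabulary)."""
    token_id: int        # e.g. 230 for "LCD"
    lwrcase_id: int      # ID of the lowercased form
    lemma_id: int        # ID of the lemma
    capitalization: str  # "UC", "LC" or "MX"
    token_type: str      # e.g. "AL" (alphabetic), "AN" (alphanumeric)

@dataclass
class PhraseInfo:
    """One known phrase with counts of the roles it was observed in."""
    phrase_id: int
    lemma_phrase_id: int
    counts: dict = field(default_factory=dict)  # role -> count

    def role_share(self, role: str) -> float:
        """Fraction of this phrase's occurrences seen in the given role."""
        total = sum(self.counts.values())
        return self.counts.get(role, 0) / total if total else 0.0

p = PhraseInfo(989, 567, {"monitor_name content": 3, "garbage": 0})
print(p.role_share("monitor_name content"))  # -> 1.0
```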
Known pattern spotting (4)
• Pattern generalizing the content of attribute monitor_name
lcd monitor viewsonic AN&MX (last slot repeated 1-2 times)

| slot | token ID | lwrcase ID | lemma ID | capitalization | token type | occurrences |
|---|---|---|---|---|---|---|
| lcd | -1 | -1 | 211 | -1 | -1 | 1 |
| monitor | -1 | -1 | 215 | -1 | -1 | 1 |
| viewsonic | -1 | -1 | 456 | -1 | -1 | 1 |
| (any) | -1 | -1 | -1 | MX | AN | 1-2 |

(-1 acts as a wildcard; the last slot matches any mixed-case alphanumeric token.)
Known pattern spotting (5)
• Pattern generalizing the content of attribute monitor_name
The PhraseInfo record for the pattern "lcd monitor viewsonic AN&MX" (last slot 1-2 times):

| PhraseInfo field | value |
|---|---|
| pattern ID | 345 |
| cnt as monitor_name content | 27 |
| cnt as monitor_name L-context | 0 |
| cnt as monitor_name R-context | 0 |
| cnt as garbage | 0 |
Known pattern spotting (6)
• Data structures
  – all known tokens are stored in a Vocabulary (a character trie) along with their features
  – all known phrases and patterns are stored in a PhraseBook (a token trie), also with their features
• Precision and recall of a known pattern
  – using the stored count features, we obtain the precision and recall of each pattern with respect to each attribute's content, L-context, and R-context:
  – precision = c(pattern & attr_content) / c(pattern)
  – recall = c(pattern & attr_content) / c(attr_content)
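A compact sketch of the PhraseBook as a token trie with per-node role counts, together with the two formulas above, might look like this in Python (names are illustrative):

```python
class PhraseBook:
    """Token trie; each node holds counts of the roles in which the
    phrase ending at that node was observed (a simplified sketch)."""
    def __init__(self):
        self.children = {}  # token ID -> child PhraseBook node
        self.counts = {}    # role -> count, e.g. "monitor_name content"

    def add(self, token_ids, role):
        node = self
        for t in token_ids:
            node = node.children.setdefault(t, PhraseBook())
        node.counts[role] = node.counts.get(role, 0) + 1

def precision(c_pattern_and_attr, c_pattern):
    # precision = c(pattern & attr_content) / c(pattern)
    return c_pattern_and_attr / c_pattern if c_pattern else 0.0

def recall(c_pattern_and_attr, c_attr):
    # recall = c(pattern & attr_content) / c(attr_content)
    return c_pattern_and_attr / c_attr if c_attr else 0.0
```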
Document “understanding”
• Known phrase/pattern spotting [4]
• ID of possible wrappers [2]
• ID of attribute candidates [2]
• Parsing attribute candidates [4]
ID of possible wrappers (1)
• Given a collection of documents from the same source, for each attribute:
  – identify all high-precision phrases (hpp's)
  – apply a wrapper induction algorithm, specifying the hpp's as labeled samples
  – get the n-best wrapper hypotheses
ID of possible wrappers (2)
• Start with a simple wrapper induction algorithm; for each attribute:
  – list the L-contexts, R-contexts, and XPaths (LRPs) leading to the labeled attribute samples
  – find clusters of samples with similar LRPs; for each cluster with |cluster| > threshold:
    • compute the most specific generalization of the LRPs that covers the whole cluster
    • this generalized LRP is hoped to also cover unlabeled attributes
  – the (single) output wrapper is the set of generalized LRPs
• Different wrapper induction algorithms can be plugged in
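The clustering-and-generalization step can be sketched as follows; the bucketing criterion and the wildcarding of XPath steps are deliberate simplifications of whatever similarity measure the real algorithm would use.

```python
from collections import defaultdict

def generalize_xpaths(paths):
    """Most specific generalization of a cluster of XPath step lists:
    keep steps shared by all paths, wildcard the differing ones."""
    return ["*" if len(set(step)) > 1 else step[0] for step in zip(*paths)]

def induce_wrapper(samples, threshold=3):
    """samples: (l_context, r_context, xpath_steps) triples for labeled
    high-precision phrases. Returns one generalized LRP per large cluster."""
    clusters = defaultdict(list)
    for l, r, path in samples:
        # crude clustering: same contexts and same path length
        clusters[(l, r, len(path))].append(path)
    return [(l, r, generalize_xpaths(paths))
            for (l, r, _), paths in clusters.items()
            if len(paths) > threshold]
```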
Document “understanding”
• Known phrase/pattern spotting [4]
• ID of possible wrappers [2]
• ID of attribute candidates [2]
• Parsing attribute candidates [4]
Attribute candidate (CA) generation
For each known phrase P in the document collection:
  – if P is known as the content of some attribute A:
    • create a new CA from this P
  – if P is known as a high-precision L-(R-)context of some attribute A:
    • create new CAs from the phrases P’ to the right (left) of P
    • in each such CA, set the feature has_context_of_attribute_A = 1
For each wrapper WA for attribute A, and each phrase P covered by WA:
  – if P is not already a CA, create a new CA
  – in the CA, set the feature in_wrapper_of_attribute_A = 1
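A sketch of these generation rules in Python; the input encoding (token spans plus roles) is invented here for illustration and is not the system's actual API.

```python
def generate_cas(phrase_hits, wrapper_hits):
    """phrase_hits: (span, role, attr) triples, role being "content",
    "l_context" or "r_context"; wrapper_hits: (span, attr) pairs.
    Returns a feature dict per candidate span."""
    cas = {}
    for span, role, attr in phrase_hits:
        if role == "content":
            cas.setdefault(span, {})["is_known_content_of_" + attr] = 1
        else:
            # the candidate is the neighbouring phrase; we assume the
            # caller already shifted the span right (or left) of the context
            cas.setdefault(span, {})["has_context_of_attribute_" + attr] = 1
    for span, attr in wrapper_hits:
        cas.setdefault(span, {})["in_wrapper_of_attribute_" + attr] = 1
    return cas
```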
Attribute candidates
• Properties
  – many overlapping attribute candidates
  – maximum recall; precision is low

Figure: tokens a..l covered by many overlapping candidates for Att_X, Att_Y and Att_Z.
Document “understanding”
• Known phrase/pattern spotting [4]
• ID of possible wrappers [2]
• ID of attribute candidates [2]
• Parsing attribute candidates [4]
![Page 23: Ex Information Extraction System](https://reader035.fdocuments.us/reader035/viewer/2022062518/56814680550346895db3a02f/html5/thumbnails/23.jpg)
Parsing of attribute candidates
• The candidate table below can be converted to a lattice
• A parse is a single path through the lattice
• Many paths are impossible due to ontology constraints
• Many paths still remain possible; we must determine the most probable one

Figure: tokens a..l with overlapping candidates for Att_X, Att_Y, Att_Z and Garbage.
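Finding the most probable path through such a lattice is a standard Viterbi-style dynamic program. A minimal sketch, assuming each candidate (including Garbage fillers) is an edge with a log-probability:

```python
def best_path(n_tokens, edges):
    """edges: (start, end, label, log_prob) spans over token positions
    0..n_tokens; assumes Garbage edges make every position reachable."""
    NEG = float("-inf")
    best = [NEG] * (n_tokens + 1)
    back = [None] * (n_tokens + 1)
    best[0] = 0.0
    for pos in range(n_tokens):
        if best[pos] == NEG:
            continue
        for start, end, label, lp in edges:
            if start == pos and best[pos] + lp > best[end]:
                best[end] = best[pos] + lp
                back[end] = (start, label)
    path, pos = [], n_tokens
    while pos > 0:                      # walk the back-pointers
        start, label = back[pos]
        path.append((start, pos, label))
        pos = start
    return list(reversed(path))

edges = [(0, 2, "Att_X", -0.7), (0, 1, "Garbage", -1.2),
         (1, 2, "Att_Y", -0.9), (2, 3, "Garbage", -0.1)]
print(best_path(3, edges))  # [(0, 2, 'Att_X'), (2, 3, 'Garbage')]
```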
Sample parse tree
Figure: a sample parse tree — tokens a..n are covered by attribute candidates AX, AY, AZ and Garbage; the attributes combine into instance candidates (ICLASS) under the document root (Doc).
AC parsing algorithm
• Left-to-right, bottom-up parsing
• Decoding phase
  – in each step, the algorithm selects the n most probable non-terminals to become heads of the observed (non-)terminal sequence
  – nested attributes are supported, so some ACs may become heads of other ACs
  – an instance candidate (IC) may become the head of ACs that do not violate ontology constraints
  – the most probable heads are determined using the features of the ACs in the examined AC sequence
    • features = all features assigned directly to the AC or to the underlying phrase
    • features have weights assigned during parser training
AC parser training
• Iterative training
  – initial feature weights are set:
    • based on counts observed in sample instances
    • based on parameters defined in the ontology
  – the document collection is parsed (decoded) with the current features
    • feature weights are modified in the direction that improves the current parsing result
    • repeat until convergence or as time allows
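This loop is close in spirit to the structured perceptron of the Collins papers in the references. A schematic Python version, with the decode and feature-extraction functions left abstract:

```python
def train(docs, gold_parses, decode, features, epochs=10):
    """Sketch: parse with current weights; if the prediction differs from
    the gold parse, move weights toward gold features and away from
    predicted ones (perceptron-style update)."""
    weights = {}
    for _ in range(epochs):
        mistakes = 0
        for doc, gold in zip(docs, gold_parses):
            pred = decode(doc, weights)
            if pred != gold:
                mistakes += 1
                for f in features(doc, gold):
                    weights[f] = weights.get(f, 0.0) + 1.0
                for f in features(doc, pred):
                    weights[f] = weights.get(f, 0.0) - 1.0
        if mistakes == 0:   # converged on the training collection
            break
    return weights
```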
work-in-progress notes
AC Parser – revised
• Attribute candidates (ACs)
  – AC identification by patterns
    • a matching pattern indicates an AC with some probability
    • patterns are given by the user or induced by the trainer
  – assignment of the conditional P(attA | phrase, context)
    • computed from:
      – single-pattern conditional probabilities
      – single-pattern reliabilities (weights)
• AC parsing
  – trellis representation
  – algorithm
Pattern types by area
• Patterns can be defined for an attribute's:
  – content
    • lcd monitor viewsonic ALPHANUM&CAP
    • <FLOAT> <unit>
    • a special case of content pattern: a list of example attribute values
  – L/R context
    • monitor name :
  – content + L/R context (units are better modeled as content)
    • <int> x <int> <unit>
    • <float> <unit>
  – DOM context
    • BLOCK_LEVEL_ELEMENT A
Pattern types by generality
• General patterns
  – expected to appear across multiple websites; used when parsing new websites
• Local (site-specific) patterns
  – all pattern types from the previous slide can have local variants for a specific website
  – we can have several local variants plus a general variant of the same pattern; these will differ in their statistics (esp. pattern precision and weight)
  – local patterns are induced while joint-parsing documents with supposedly similar structure (e.g. from a single website)
  – for example, local DOM context patterns can be more detailed than general DOM context patterns, e.g.
    • TD{class=product_name} A {precision=1.0, weight=1.0}
  – statistics for local patterns are computed only from the local website
  – local patterns are stored per website (similar to a wrapper) and reused when the website is re-parsed. If deleted, they will be induced again the next time the website is parsed.
Pattern match types
• Types of pattern matches
  – exact match,
  – approximate phrase match, if the pattern definition allows it, or
  – approximate numeric match for numeric types (int, float)
• Approximate phrase match
  – can use any general phrase distance or similarity measure
    • phrase distance: dist = f(phrase1, phrase2); 0 ≤ dist < ∞
    • phrase similarity: sim = f(phrase1, phrase2); 0 ≤ sim < 1
  – currently a nested edit distance defined on tokens and their types is used
    • this distance is a black box for now; it returns dist and can compare a phrase to a set of phrases
• Approximate numeric match
  – when searching for values of a numeric attribute, all int or float values found in the analyzed documents are considered, except those violating min or max constraints. The user specifies, or the trainer estimates:
    – a probability function, e.g. a simple value-probability table (for discrete values), or
    – a probability density function (pdf), e.g. a mixture of weighted Gaussians (for continuous values). Each specific number NUM found in a document can then be scored as:
      • pdf(NUM)
      • P(less probable value than NUM | attribute) = ∫_{t: pdf(t) < pdf(NUM)} pdf(t) dt
      • or the likelihood relative to the pdf maximum: lik(NUM | attribute) = pdf(NUM) / max_t pdf(t)
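For the continuous case, all three scores can be computed directly from a Gaussian-mixture pdf. A small sketch (the mixture parameters below are made-up example values):

```python
import math

def gaussian(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def mixture_pdf(x, components):
    """components: (weight, mean, std) triples, weights summing to 1."""
    return sum(w * gaussian(x, m, s) for w, m, s in components)

def lik(num, components, grid):
    """lik(NUM | attribute) = pdf(NUM) / max_t pdf(t); the max is
    approximated over a grid of plausible values."""
    peak = max(mixture_pdf(t, components) for t in grid)
    return mixture_pdf(num, components) / peak

# e.g. monitor sizes in inches, clusters around 17" and 19" (invented)
sizes = [(0.5, 17.0, 0.8), (0.5, 19.0, 0.8)]
grid = [x / 10 for x in range(100, 300)]
print(lik(18.0, sizes, grid))
```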
AC conditional probability – computing P(attA|pat)
• P(attA | phrase, ctx) = Σ_pat w_pat · P(attA | pat), with Σ_pat w_pat = 1
• How do we get P(attA | pat)? (how strongly a pattern indicates an AC)
  – exact pattern matches
    • the pattern's precision is estimated by the user, or
    • P(attA | pat) = c(pat indicates attA) / c(pat) in training data
  – approximate pattern matches
    • train a cumulative probability on held-out data (phrase similarity is trained on training data)
    • P(attA | PHR) = interpolate(examples)
    • examples:
      – scored using similarity to (distance from) the pattern, and
      – classified as positive (examples of attA) or negative
  – approximate numeric matches
    • for a discrete probability distribution: the user estimates precisions for all discrete values as if they were separate exact matches, or they are computed from training data:
    • P(attA | value) = p.d.(value | attA) · P(attA) / P(value)
    • for a continuous pdf (also possible for a discrete p.d.): train a cumulative probability on held-out data (pdfs/p.d. trained on training data)
    • P(attA | NUM) = interpolate(examples)
    • examples:
      – scored using pdf(NUM), or P(less probable value than NUM | attA), or lik(NUM | attA)
      – classified as positive or negative
• (Slide callouts: the quantities highlighted in red must come from training data; the example sets should contain both positives and negatives.)
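The mixing formula itself is a one-liner once pattern weights and precisions are known. A sketch of the matched-patterns-only variant (model A of a later slide), with the weights renormalized on the fly:

```python
def p_att_given_phrase(matches):
    """matches: (w_pat, P(attA | pat)) for each matched pattern.
    Returns sum w_pat * P(attA | pat) with weights renormalized to 1."""
    total_w = sum(w for w, _ in matches)
    if total_w == 0:
        return 0.0
    return sum(w * p for w, p in matches) / total_w

# a content pattern and an L-context pattern matched:
print(p_att_given_phrase([(0.7, 0.9), (0.3, 0.6)]))  # -> 0.81
```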
Approximate matches
• Example distances of phrases P from the pattern for attribute A:

| Phrase P | dist(P, A) |
|---|---|
| LCD | 0.500 |
| View | 0.300 |
| VP | 0.310 |
| LCD Monitor | 0.100 |
| LCD VP201D | 0.121 |
| LCD Viewsonic V800 | 0.011 |
| Monitor VIEWSONIC V906D | 0.061 |
| LCD Monitor Viewsonic VP201D | 0.001 |

• From the above examples, derive a mapping dist → P(attA | dist)

Figure: P(attA | dist) plotted against dist(P, attA), with ticks at dist = 0, 0.06, 0.12, 0.50 and probability levels 0.5 and 1.

• other mappings are possible: we could fit a linear or logarithmic curve, e.g. by least squares
• an analogous approach is taken for numeric approximate matches: pdf(NUM), lik(NUM | attA), or P(less probable value than NUM | attA) replaces dist(P, attA), and the x scale is reversed
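One simple way to realize the "interpolate(examples)" mapping is a cumulative table over distance thresholds, looked up step-wise; a fitted linear or logarithmic curve would work equally well. A sketch using the example distances above (the positive/negative split is assumed for illustration):

```python
import bisect

def fit_p_given_dist(pos_dists, neg_dists, bins):
    """For each distance bin, P(attA | dist <= bin) = share of positive
    held-out examples among all examples within that distance."""
    table = []
    for b in bins:
        pos = sum(1 for d in pos_dists if d <= b)
        neg = sum(1 for d in neg_dists if d <= b)
        table.append((b, pos / (pos + neg) if pos + neg else 0.0))
    return table

def p_given_dist(table, dist):
    keys = [b for b, _ in table]
    i = min(bisect.bisect_left(keys, dist), len(table) - 1)
    return table[i][1]

table = fit_p_given_dist([0.001, 0.061, 0.121], [0.30, 0.31, 0.50],
                         bins=[0.06, 0.12, 0.50])
print(p_given_dist(table, 0.1))  # -> 1.0
```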
AC conditional probability – computing w_pat
• P(attA | phrase, ctx) = Σ_pat w_pat · P(attA | pat), with Σ_pat w_pat = 1
• How do we get w_pat? (it represents pattern reliability)
• For general (site-independent) patterns:
  – the user specifies a pattern "importance", or
  – reliability is initially computed from:
    • the number of pattern examples seen in training data (irrespective of whether the pattern indicates attA or not)
    • the number of different websites showing this pattern with similar site-specific precision for attA (this indicates the pattern's general usefulness)
  – with held-out data from multiple websites, we can re-estimate w_pat using the EM algorithm
    • we could first use the held-out data to update pattern precisions and then, keeping precisions fixed, update pattern weights via EM
    • EM: for each labeled held-out instance, accumulate each pattern's contribution to P(attA | phrase, ctx) via accumulator_pat += w_pat · P(attA | pat). After a single run through the held-out data, the new weights are obtained by normalizing the accumulators.
• For site-specific patterns:
  – since local patterns are established while joint-parsing documents with similar structure, both their w_pat and P(attA | pat) evolve as the joint parse proceeds; w_pat is again based on the number of times the pattern was seen
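The EM accumulator pass described above fits in a few lines. A sketch assuming the precisions have already been fixed:

```python
def em_reestimate(weights, precisions, heldout_matches):
    """One pass: for each labeled held-out instance, add every matched
    pattern's contribution w_pat * P(attA | pat) to its accumulator,
    then renormalize the accumulators into the new weights."""
    acc = {pat: 0.0 for pat in weights}
    for matched in heldout_matches:   # patterns matching one instance
        for pat in matched:
            acc[pat] += weights[pat] * precisions[pat]
    total = sum(acc.values())
    return {pat: a / total for pat, a in acc.items()} if total else weights
```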
Pattern statistics
• Each pattern needs:
  – precision P(attA | pat) = a / (a + b)
  – reliability w_pat
• Maybe we also need:
  – negative precision P(attA | ¬pat) = c / (c + d), or
  – recall P(pat | attA) = a / (a + c) (this could be relatively easy for users to enter)
  – these are somewhat related, e.g. when recall = 1, negative precision = 0

Contingency table of pattern matches vs. attribute occurrences:

|       | attA | ¬attA |
|-------|------|-------|
| pat   | a    | b     |
| ¬pat  | c    | d     |

• Conditional model variants
  – A. P(attA | phrase, ctx) = Σ_{pat matched} w_pat · P(attA | pat), with Σ_{pat matched} w_pat = 1
    • the Σ only goes over patterns that match (phrase, ctx); uses 2 parameters per pattern
  – B. P(attA | phrase, ctx) = Σ_{pat matched} w_pat · P(attA | pat) + Σ_{pat not matched} w_neg,pat · P(attA | ¬pat)
    • Σ_{pat matched} w_pat + Σ_{pat not matched} w_neg,pat = 1
    • the Σ goes over all patterns, using the negative precision for patterns that did not match, together with a negative reliability w_neg,pat (a pattern's negative reliability generally differs from its reliability). This model uses 4 parameters per pattern.
• Generative model (only for contrast)
  – assumes independence among patterns (the naive Bayes assumption, which is never true in our case)
  – P(phrase, ctx | attA) = P(attA) · Π_pat P(pat | attA) / Π_pat P(pat) (the denominator can be ignored in the argmax_A P(phrase, ctx | attA) search; P(attA) is another parameter)
  – however, patterns are typically strongly dependent, so the probability produced by dependent patterns is heavily overestimated (and often > 1 ;-) )
  – smoothing would be necessary, while conditional models (maybe) avoid it
Normalizing weights for conditional models
• Need to ensure Σ_pat w_pat = 1
• Conditional model A (only matching patterns are used)
  – Σ_{pat matched} w_pat = 1
• Conditional model B (all patterns are always used)
  – Σ_{pat matched} w_pat + Σ_{pat not matched} w_neg,pat = 1
• Both models:
  – need an appropriate estimation of pattern reliabilities (weights), and possibly negative reliabilities, so that normalization does no harm
  – it may be problematic that some reliabilities are estimated by users (e.g. on a 1..9 scale) while others are computed from observed pattern frequencies in training documents and across training websites. How shall we integrate this? First, let's look at integrating them separately:
    • if all weights to be normalized are given by the user: w'_pat = w_pat / Σ_patX w_patX
    • if all weights are estimated from training-data counts, then something like: w_pat = log(c_occurrences(pat)) + log(c_documents(pat)) + log(c_websites(pat))
    • and then normalize as usual (including user-estimated reliabilities): w'_pat = w_pat / Σ_patX w_patX
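The count-based estimate and the normalization are straightforward; the add-one inside the logs below is an assumption to avoid log 0, not something the slide specifies:

```python
import math

def weight_from_counts(c_occurrences, c_documents, c_websites):
    # w_pat = log c_occurrences(pat) + log c_documents(pat) + log c_websites(pat)
    return (math.log(c_occurrences + 1) + math.log(c_documents + 1)
            + math.log(c_websites + 1))

def normalize(weights):
    # w'_pat = w_pat / sum_patX w_patX
    total = sum(weights.values())
    return {pat: w / total for pat, w in weights.items()}

w = {"content_pat": weight_from_counts(120, 30, 5),
     "l_context_pat": weight_from_counts(40, 12, 2)}
print(normalize(w))
```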
Parsing
• AC – attribute candidate
• IC – instance candidate (set of ACs)
• the goal is to parse a set of documents into valid instances of the classes defined in the extraction ontology
AC scoring (1)
• The main problem seems to be the integration of:
  – the conditional probabilities P(attA | phrase, ctx) computed on the previous slides, with
  – generative probabilities P(proposition | instance of class C)
    • a proposition can be e.g.:
      – "price_with_tax > price_without_tax",
      – "product_name is the first attribute mentioned",
      – "the text in product_picture's alt attribute is similar to product_name",
      – "price follows name",
      – "the instance has 1 value for attribute price_with_tax"
  – if a proposition is not true, the complementary probability 1-P is used
  – a proposition is taken into account whenever its source attributes are present in the parsed instance candidate (call this proposition set PROPS)
• Combination of proposition probabilities
  – assume that propositions are mutually independent (seems OK)
  – then we can multiply their generative probabilities and normalize by the number of propositions used, obtaining an averaged generative probability of all propositions together:
  – P_AVG(PROPS | instance of C) = (Π_{prop ∈ PROPS} P(prop | instance of C))^{1/|PROPS|}
  – (computed in log space)
AC scoring (2)
• Combination of pattern probabilities
  – view the parsed instance candidate IC as a set of attribute candidates
  – P_AVG(instance of class C | phrases, contexts) = Σ_{A ∈ IC} P(attA | phrase, ctx) / |IC|
  – extension: each P(attA | phrase, ctx) may be further multiplied by the "engaged-ness" of the attribute, P(part_of_instance | attA), since some attributes appear alone (outside of instances) more often than others
• Combination of
  – P_AVG(propositions | instance of class C)
  – P_AVG(instance of class C | phrases, contexts)
  – into a single probability used as the score of the instance candidate
  – intuitively, multiplying seems reasonable, but it is not strictly correct; we must justify it somehow:
    • we use propositions and their generative probabilities to discriminate among possible parse candidates for the assembled instance
    • we need probabilities here to compete with the probabilities given by patterns
    • if P_AVG(propositions | instance of class C) = 0, the result must be 0
    • ultimately we want something like the conditional P(instance of class C | the attributes' phrases, contexts, and the relations between them) as an IC's score
    • so let's take P_AVG(instance of class C | phrases, contexts) as the basis and multiply it by the portion of training instances that exhibit the observed propositions; this lowers the base probability in proportion to the scarcity of the observed propositions
  – result: use multiplication:
    score(IC) = P_AVG(propositions | instance of class C) · P_AVG(instance of class C | phrases, contexts)
  – but experiments are necessary (testable in approx. 1 month)
Parsing algorithm (1)
• a bottom-up parser
• driven by the candidates with the highest current scores (both instance and attribute candidates); not a left-to-right parser
• uses the DOM to guide the search
• joint-parses multiple documents from the same source
• adds/changes local patterns (especially DOM context patterns) as the joint parse continues, recalculating the probabilities/weights of local patterns
• configurable beam width
Parsing algorithm (2)

    treat all documents D from a single source as a single document; identify and score ACs;
    INIT_AC_SET = {}; VALID_IC_SET = {};
    do {
        BAC = the best AC not in INIT_AC_SET (from attributes with card = 1 or >= 1, if any);
        if (BAC's score < threshold) break;
        add BAC to INIT_AC_SET;
        INIT_IC = {BAC};
        IC_SET = {INIT_IC};
        curr_block = parent_block(BAC);
        while (curr_block != top_block) {
            for all AC in curr_block (ordered by linear token distance from BAC) {
                for all IC in IC_SET {
                    if (IC.accepts(AC)) {                       // *
                        create IC2 = {IC + AC};
                        add IC2 to IC_SET;                      // **
                    }
                }
                if (IC_SET contains a valid IC and too many ACs were refused due to ontology constraints) break;
            }
            curr_block = curr_block.next_parent_block();        // ***
        }
        add all valid ICs from IC_SET to VALID_IC_SET;
        find new local patterns in VALID_IC_SET; if found, recompute scores in VALID_IC_SET;
    } while (true);
    find and output the most likely sequence of non-overlapping ICs from VALID_IC_SET;

* accepts() returns true if the IC can accommodate the AC according to ontology constraints and if the AC does not overlap with any other AC already present in the IC, with the exception of being embedded in that AC.
** adding the new IC2 at the end of the list prolongs the loop going through IC_SET.
*** next_parent_block() returns a single parent block for most block elements. For table cells, it returns four aggregates of horizontally and vertically neighboring cells, plus the encapsulating table row and column. Calling next_parent_block() on each of these aggregates yields the next aggregate; the call on the last aggregate returns the whole table body.
Figure sequence (slides 42-50): a walkthrough of the algorithm on the running example. The document has tokens a b c d e f g h i j k l m n ... covered by attribute candidates AX, AY, AZ and Garbage, a block structure of TD cells under TR, TABLE and A, and a class C with constraints: X card=1 (may contain Y), Y card=1..n, Z card=0..n. Starting from the best AC, instance candidates grow as neighboring blocks are taken in: {AX} and {AY} combine into {AXAY} and the embedded variant {AX[AY]}; {AZ} then joins to form candidates such as {AXAZ}, {AXAYAZ} and {AX[AY]AZ}. The final slides keep the surviving candidates {AY}, {AX[AY]}, {AZ} and {AX[AY]AZ}.
Aggregation of overlapping ACs
• Performance and clarity improvement: before parsing, aggregate those overlapping ACs that have the same relation to the ACs of other attributes, and let the aggregate carry the maximum score of its child ACs. This prevents some multiplication of new ICs. An aggregate only breaks apart if features appear during the parse that support only some of its children. At the end of the parse, each remaining aggregate is reduced to its best child.
Focused or global parsing
• The algorithm above is focused: it concentrates in detail on a single AC at a time. All ICs built by the parser in a single loop contain the chosen AC as a member. More complex ICs are built incrementally from existing simpler ICs as a growing neighborhood of the document is taken into account. A stopping criterion is needed for taking in further ACs from more distant parts of the document.
• Alternatively, we may do global parsing: first create a single-member IC = {AC} for each AC in the document; then, in a loop, always choose the best-scoring IC and add the next AC found in the growing context of that IC. Here an IC's score is computed without certain ontological constraints that would penalize partially populated ICs (e.g. missing mandatory attributes). Again, a stopping criterion is needed to prevent high-scoring ICs from growing all over the document. Validity itself is not a good criterion, since (a) valid ICs may still need further attributes, and (b) some ICs will never be valid because they are wrong from the beginning.
Global parsing
• How should IC merging work when extending existing ICs during global parsing? Should we only merge ICs with single-AC ICs? Should the original ICs always be retained for other possible merges?
References
• M. Collins: Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms, 2002.
• M. Collins, B. Roark: Incremental Parsing with the Perceptron Algorithm, 2004.
• D. W. Embley: A Conceptual-Modeling Approach to Extracting Data from the Web, 1998.
• V. Crescenzi, G. Mecca, P. Merialdo: RoadRunner: Towards Automatic Data Extraction from Large Web Sites, 2001.
• F. Ciravegna: LP2, an Adaptive Algorithm for Information Extraction from Web-related Texts, 2001.