
Information Processing and Management 41 (2005) 217–242
www.elsevier.com/locate/infoproman

Information extraction with automatic knowledge expansion

Hanmin Jung *, Eunji Yi, Dongseok Kim, Gary Geunbae Lee

Department of Computer Science and Engineering, Pohang University of Science and Technology,

San 31, Hyoja-dong, Nam-gu, Pohang, Kyungbuk 790-784, South Korea

Received 28 February 2003; accepted 15 July 2003

Available online 6 September 2003

Abstract

POSIE (POSTECH Information Extraction System) is an information extraction system that uses multiple learning strategies, i.e., SmL, user-oriented learning, and separate-context learning, in a question answering framework. POSIE replaces laborious annotation with automatic instance extraction by SmL from structured Web documents, and places the user at the end of the user-oriented learning cycle. Information extraction as question answering simplifies the extraction procedures for a set of slots. We introduce techniques verified on the question answering framework, such as domain knowledge and instance rules, into the information extraction problem. To incrementally improve extraction performance, a sequence of user-oriented learning and separate-context learning produces context rules and generalizes them in both the learning and extraction phases. Experiments on the "continuing education" domain initially show an F1-measure of 0.477 and recall of 0.748 with no user training. However, as the size of the training document set grows, the F1-measure reaches beyond 0.75 with recall of 0.772. We also obtain F-measures of about 0.9 for five out of seven slots in the "job offering" domain.

© 2003 Elsevier Ltd. All rights reserved.

Keywords: Information extraction; Question answering; User-oriented learning; Lexico-semantic pattern; Machine learning

1. Introduction

Information extraction is a process that takes unseen documents as input and produces a tabular structure as output. As Internet growth accelerates, information extraction is attracting considerable attention from the Web intelligence community.

* Corresponding author. Tel.: +82-54-279-5581; fax: +82-54-279-2299.

E-mail addresses: [email protected] (H. Jung), [email protected] (E. Yi), [email protected] (D. Kim),

[email protected] (G.G. Lee).

0306-4573/$ - see front matter © 2003 Elsevier Ltd. All rights reserved.

doi:10.1016/S0306-4573(03)00066-9


Traditional information extraction tasks involve locating specific information in plain text written in a natural language; the task therefore frames information extraction narrowly, as just one area of natural language processing. On the Web, information extraction, as a fundamental front-end technique for knowledge discovery, data mining, and natural language interfaces to databases, has become a major Web application technology (Jung, Lee, Choi, Min, & Seo, 2003; Nahm, 2001; Nahm & Mooney, 2000).

One crucial challenge for information extraction as a Web application technology is domain portability. Since most previous systems require human-annotated data to learn extraction rules or patterns, domain experts must manually annotate the training data. Worse, when a new domain is added, a considerable portion of the time needed to graft the system onto the domain is poured into laborious annotation. To circumvent this problem, recent research has developed weakly supervised and unsupervised learning algorithms. However, these new techniques do not yet satisfy the back-end applications (Eikvil, 1999; Zechner, 1997).

Domain portability is greatly affected by the manner in which Web document types are used and by the point in time at which domain experts or users become involved. Thus, we propose two strategies: first, replacing laborious annotation with automatic knowledge extraction from structured Web documents, 1 and second, placing users at the end of a learning cycle in the deployment phase. To incrementally improve extraction performance, POSIE combines user-oriented learning, which produces context rules, with separate-context learning, which generalizes the rules.

The remainder of the paper is organized as follows. Section 2 reviews important related research on information extraction. Section 3 proposes our information extraction model based on a question answering framework. The detailed architecture and the knowledge of POSIE 2 are described in Sections 4 and 5, respectively. Section 6 explains the techniques to expand this knowledge through user-oriented and separate-context machine learning. Section 7 analyzes experimental results for the practical "continuing education" domain. To conclude, Section 8 discusses the functional characteristics of information extraction systems and future work.

2. Related research

Information extraction (IE) systems using an automatic training approach (Grishman, 1997; Sasaki, 1999; Yangarber & Grishman, 1998) have a common goal: to formulate effective rules to recognize relevant information. They achieve this goal by annotating training data and running a learning algorithm (Knoblock, Lerman, Minton, & Muslea, 2000; Riloff, 1996; Riloff & Jones, 1999; Sudo, Sekine, & Grishman, 2001).

Recent IE research concentrates on the development of trainable information extraction systems for the following reasons. First, annotating texts is simpler and faster than writing rules by hand, and the rapid growth of Web content increases the need for a series of automatic processing steps. Second, automatic training ensures domain portability and full coverage of examples.

1 Documents containing attributes which can be correctly extracted based on some uniform syntactic clues, for example, tables in the form of separated attributes and their contents.
2 POSIE (POSTECH Information Extraction System).


However, training data, which is expensive to acquire, hinders the predominance of trainable IE systems over other approaches.

Core machine learning algorithms that reduce the burden of training data have been adopted in many NLP 3 applications, including information extraction. Weakly supervised learning algorithms such as co-training and co-EM were developed for text categorization (Blum & Mitchell, 1998; Nigam & Ghani, 2000). They reduce the amount of annotated data by using a small set of seed rules and adding unlabeled text with the best score. Despite these efforts, error propagation in the total learning cycle is too severe to obtain refined rules. Another strategy is active learning, a fully supervised learning algorithm in which the user, as a domain expert, is required to locate the examples (RayChaudhuri & Hamey, 1997). Owing to this laborious work, knowledge changes are difficult to incorporate into the system.

Pierce and Cardie indicate some limitations of co-training for natural language processing and propose user-oriented learning (Pierce & Cardie, 2001a, 2001b). They are concerned about the scalability and portability of information extraction systems. During the learning cycle, the user confirms the desirability of new examples. Although the user is an expert neither in machine learning nor in information extraction, he or she is competent, as an end user, to identify the specific information to extract. To leverage the user's ability, user-oriented learning puts the user into the deployment cycle. However, the algorithm still requires the user both to annotate seed training data and to select new candidate examples. Because the user is only sporadically involved in the learning process, users would not concentrate further on handcrafted work.

Recently, some systems have addressed automatic construction of extraction rules without any annotation, which is more desirable in practical-level applications (Jones, McCallum, Nigam, & Riloff, 1999). In the DIPRE system, Brin (1998) uses a bootstrapping method to obtain patterns and relations from Web documents without pre-annotated data. The process is initiated with small samples, such as a relation of (author, title) pairs. Next, DIPRE searches a large corpus for patterns in which one such pair appears. Similarly, Yangarber and Grishman apply automatic bootstrapping to seek patterns for name classification (Yangarber & Grishman, 2000). This method requires a named-entity tagger and a parser to mark all the instances of people's names, companies, and locations. Kim et al. improve these automatic bootstrapping algorithms using the types of the Web documents (Kim, Cha, & Lee, 2002; Kim, Jung, & Lee, 2003). They focus more on declarative-style knowledge, which can be extended with human interaction for practical-level performance in a deployed commercial system. To generate extraction patterns, this model combines declarative DTD-style patterns and an unsupervised learning algorithm, SmL. Eliminating human pre-processing of training documents yields great portability to new domains. However, the model sacrifices a portion of extraction precision to acquire high domain portability. Without a dedicated manual process, a fully automatic extraction system does not always ensure stable results. 4

User-oriented learning is a promising strategy which eliminates the deficiency of task coverage and provides feedback to the extraction system in both the learning and extraction phases.

3 Natural Language Processing.
4 SmL shows above 0.8 F1-measure for the semi-structured "audio–video" domain, but below 0.2 for the more difficult, free-style "continuing education" domain.


On the other hand, the automatic acquisition of rules from structured Web documents is a great benefit on the WWW. To maximize the efficiency of information extraction on the Web, we propose a hybrid of automatic bootstrapping and user-oriented learning: the annotation aspect of user-oriented learning is replaced with bootstrapping, so that the user becomes involved only in the confirmation aspect of learning. The combination of user-oriented learning and separate-context learning incrementally expands domain knowledge. As the framework of our IE system, a question answering system (Lee et al., 2001) is redesigned and adopted in the following four steps: first, automatically extract instances from structured Web documents; second, construct instance rules through sentence-to-LSP 5 transfer; third, confirm context rules by user-oriented learning; finally, generalize the context rules with separate-context learning.

3. Information extraction as question answering

The goal of question answering is to develop a system that retrieves answers rather than documents in response to a question (Lee et al., 2001). As an ordinary procedure, a question answering system focuses on possible answers: how to determine the answer type and how to select the answers for each answer type. The system classifies possible answers, designs a method to determine the answer types, and searches for answer candidates. There are three major steps: question processing, passage selection, and answer processing. Question processing analyzes the input question to understand what the user wishes to find. Passage selection ranks the passages in retrieved documents. Answer processing selects and ranks answer candidates matching the answer type.

Question answering is closely related to information extraction in that its purpose concerns the acquisition of user-requested information. However, information extraction as question answering is much easier than question answering itself, for the following reasons. First, information extraction has a set of pre-selected questions for a target, which removes the need for question processing. Second, a pre-classified document as input is ready to be processed directly, while question answering must identify related documents in unrestricted open domains. Third, the relation between slots is available in information extraction, i.e., pre-defined slots help to determine their instances by using the relation. Thus, recasting question answering as information extraction can produce better performance than state-of-the-art question answering systems 6 (Harabagiu et al., 2000; Moldovan et al., 1999).

Information extraction as question answering also simplifies the extraction processes.

We can easily introduce the techniques verified by question answering, such as domain knowledge and instance rules. Fig. 1 shows the similarities between information extraction and question answering. A slot in information extraction corresponds to a pre-selected question in question answering. Thus, information extraction can exclude the question processing step, which generates many ambiguities for answer types. Domain knowledge, which is common to the two applications, includes a category dictionary, a thesaurus, and collocation information. As a shared feature between the two, instance rules are applied to obtain instance hypotheses or answer candidates from the input document.

5 LSP (lexico-semantic patterns): knowledge encoding techniques used in the SiteQ (Lee et al., 2001) question answering system.
6 0.6–0.7 reciprocal score. (The score for an individual question is the reciprocal of the rank at which the first correct response was found, or 0 if no correct response was found in the top five responses.)


Fig. 1. Information extraction as question answering. Italics mark information extraction as compared with question answering.



The IE model on a question answering framework improves the building process of domain knowledge by treating the types of Web documents separately. Structured Web documents provide a set of instances for each slot. Instance-to-LSP transfer automatically constructs instance rules for IE from the instances obtained by automatic bootstrapping. The following section explains the system architecture based on the IE-as-question-answering model described here.

4. System architecture for extraction

Our system, POSIE, consists of three major phases: building, learning, and extraction. The building phase constructs several classes of extraction knowledge (see Section 5), such as a collocation DB (database) for NE (named entity) tagging and an instance rule DB for instance finding. The learning phase generalizes the rules to enhance extraction coverage (see Section 6). Fig. 2 shows the system architecture for extracting target frames using the knowledge obtained and generalized by the building and learning phases.

4.1. HTML pre-processing and morphological analysis

DQTagger, an HTML pre-processor, removes most HTML tags except <title> and <keyword> from an HTML document (Shim, Kim, Cha, Lee, & Seo, 2002). The pre-processor keeps the layout of tables and determines the boundary of the body. All processes after this pre-processing are performed on HTML tag-removed documents, that is, almost plain text. A morphological analyzer (MA) segments and analyzes Korean sentences. Each eojeol 7 in a sentence produces pairs of morphemes and part-of-speech (POS) tags. MA post-editing restores incorrect morpheme sequences using an error DB (Table 1).

Fig. 2. System architecture in the extraction phase.

Table 1
Examples of the error DB (English italics are approximate translations)

Incorrect morpheme sequences → Correct morpheme sequences
Art practical trainer
Becoming parents

4.2. Category dictionary and thesaurus

SiteQ, a question answering system, uses a category dictionary and a thesaurus to construct lexico-semantic patterns for both questions and retrieved passages (Lee et al., 2001; Kim, Kim, Lee, & Seo, 2001). Since POSIE shares its concept of language processing with a question answering system such as SiteQ, we use a category dictionary and a thesaurus as the main semantic information sources.

7 Segmented phrases and words in Korean that become a spacing unit.


In TREC-10, 8 SiteQ had 66 semantic tags and many user-defined semantic classes. The semantic tags have now been expanded to 83 in POSIE.

The category dictionary has approximately 67,280 entries, each consisting of four components: semantic tag, user-defined semantic class, part-of-speech tag, 9 and lexical form. The structure of the semantic tags is flat. In a lexico-semantic pattern, each semantic tag follows a "@" symbol. User-defined semantic classes are tags for syntactically or semantically similar lexical groups; for example, the user-defined semantic class "%each" groups several similar Korean words.

The thesaurus, which supplements the category dictionary, provides sense codes for general unknown words. The codes are matched against the category sense-code mapping table (CSMT) to acquire the most similar semantic tags. Currently, the thesaurus has about 90,000 words.


Semantic tags

66 tags for Q/A: artificial language, action, artifact, belief, bird, book, building, city, color, company, continent, country, date, direction, disease, drug, event, family, fish, food, game, god, group, language, living thing, location, magazine, mammal, month, mountain, movie, music, nationality, nature, newspaper, ocean, organization, person, phenomenon, planet, plant, position, reptile, school, season, sports, state, status, subject area, substance, team, transport, weekday, unit for area, unit for count, unit for date, unit for length, unit for money, unit for power, unit for rate, unit for size, unit for speed, unit for temperature, unit for time, unit for volume, unit for weight

Extended 17 tags: address, appliance, art, computer, course, deed, examination, hobby, law, level, living part, method, picture, river, room, sex, unit for age

[Thesaurus entries with sense codes]
386DX  03010173091001010o0202
(approval)  010A6M0E090H1a01j0B01010102030c02j0B0K0Q070p0B

[Category sense-code mapping table (CSMT)]
@computer  03010173091001010o02
@action  0B0Ej0B0K0Q062C04j0B0K0Q07

[Mapping results]
386DX → @computer
(approval) → @action

8 TREC: Text Retrieval Conference, http://itl.nist.gov/.
9 We use 32 part-of-speech tags.


Table 2
Example of sentence-to-LSP transfer (English italics are approximate translations)

Phrases: Reading trainer; Fairy tale oral narrator; Fairy tale oral narrator; Recreation coach
Lexico-semantic pattern: @hobby @position (@level)


4.3. Sentence-to-LSP transfer

A lexico-semantic pattern is a structure in which linguistic entries and semantic types can be used in combination to make an abstraction of certain sequences of words in a text (Lee et al., 2001; Mikheev & Finch, 1995). Linguistic entries consist of words, phrases, and part-of-speech tags, such as "YMCA," "Young Men's Christian Association," and "nq_loc." 10 Semantic types include slot name instances, semantic tags (categories), and user-defined semantic classes, for example, "#ce_c_teacher," 11 "@person," and "%each."

Sentence-to-LSP transfer makes a lexico-semantic pattern from a given sentence (Jung et al., 2003). Lexico-semantic patterns enhance the coverage of extraction by information abstraction through many-to-one mapping between phrases and a lexico-semantic pattern (Table 2).
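To make the many-to-one abstraction concrete, the following is a minimal sketch of the dictionary lookup step of sentence-to-LSP transfer. The toy dictionary and whitespace tokenization are assumptions for illustration; POSIE consults its full category dictionary and thesaurus (Section 4.2) instead.

```python
# Minimal sketch: map tokens to semantic types via a toy category
# dictionary, keeping unknown tokens as lexical entries.

CATEGORY_DICT = {             # lexical form -> semantic tag (illustrative)
    "reading": "@hobby",
    "recreation": "@hobby",
    "trainer": "@position",
    "coach": "@position",
}

def to_lsp(tokens: list[str]) -> str:
    return " ".join(CATEGORY_DICT.get(t.lower(), t) for t in tokens)

print(to_lsp("Reading trainer".split()))    # -> "@hobby @position"
print(to_lsp("Recreation coach".split()))   # -> "@hobby @position" (many-to-one)
```

Two different phrases collapse onto the same pattern, which is exactly the abstraction that raises extraction coverage.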

The lexico-semantic patterns obtained from structured Web documents become the left-hand sides of instance rules. The average compression ratio 12 is about 50%, i.e., about two unique sentences are transferred into one lexico-semantic pattern. Results show that the compression ratio deviates strongly across slot names. Experimentally, the type of slot name influences recall and precision (Table 3; see Section 7).
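The compression ratio in Table 3 follows directly from the unique-sentence and LSP counts; a quick check against the $TEACHER row:

```python
# Compression ratio = 1 - (#LSPs / #unique sentences); $TEACHER row of Table 3.
print(f"{1 - 228 / 579:.2%}")   # 60.62%
```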

The transfer consists of two phases: named entity (NE) recognition and NE tagging. NE recognition discovers all possible semantic types for each word by consulting a category dictionary and a thesaurus (Rim, 2001). When a semantic type for a given word does not exist in the category dictionary, we attempt to discover the semantic types using the thesaurus. A category sense-code mapping table converts the sense codes in the thesaurus into the semantic tags used in the category dictionary. The table consists of pairs of semantic tags and sense codes. Each word without a semantic type becomes the key for the thesaurus search. If the search succeeds and some sense codes are retrieved, we calculate the semantic distance between the retrieved codes and the codes contained in each semantic tag in the CSMT.

10 Part-of-speech tag denoting a location or organization.
11 Slot name instance for the "teacher" slot in the "continuing education" domain.
12 The compression ratio suggests the degree of abstraction.


Table 3
Sentence-to-LSP transfer compression ratio from structured Web documents (the slot names are our target slots in the "continuing education" domain)

Slot name   # Sentences  # Unique sentences  # LSPs  Compression ratio
$TEACHER    931          579                 228     60.62%
$NAME       1982         1662                1287    22.56%
$START      226          78                  15      80.77%
$PERIOD     841          158                 32      79.75%
$MONEY      1458         277                 43      84.48%
$TIME       1904         964                 186     80.71%
$NUMBER     834          68                  5       92.65%
Total       8176         3715                1796    51.73%

Table 4
Trigrams as collocation information for tagging

Trigram                    Frequency
NULL num @unit_money       138
@unit_date num @weekday    41
sym_* num @unit_time       25


The semantic distance 13 (similarity) is defined as follows:

Sim(A, B) = 2 × common_level(A, B) / (level(A) + level(B))

13 The current threshold is 0.7.
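A sketch of this semantic distance applied to thesaurus sense codes versus CSMT entries. Treating sense codes as sequences of fixed-width segments is an assumption made for illustration:

```python
# Sketch of the semantic distance (similarity) between sense codes,
# assuming each code is a sequence of fixed-width hierarchy segments.

def level(code: str, width: int = 2) -> int:
    """Depth of a sense code in the thesaurus hierarchy."""
    return len(code) // width

def common_level(a: str, b: str, width: int = 2) -> int:
    """Number of leading segments the two codes share."""
    n = 0
    for i in range(0, min(len(a), len(b)), width):
        if a[i:i + width] != b[i:i + width]:
            break
        n += 1
    return n

def sim(a: str, b: str) -> float:
    return 2 * common_level(a, b) / (level(a) + level(b))

# e.g. the 386DX sense code against the @computer CSMT code:
print(sim("03010173091001010o0202", "03010173091001010o02"))  # ~0.95 > 0.7
```

Under this reading, the 386DX example clears the 0.7 threshold and maps to @computer, matching the mapping results shown above.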

NE tagging selects one semantic type for each word so that a sentence maps into exactly one lexico-semantic pattern. The collocation DB has the form of a trigram and is used for the tagging. The components of a trigram, like those of lexico-semantic patterns, are lexical entries and semantic types. Example trigrams and their frequencies for the "continuing education" domain are given in Table 4.
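A minimal sketch of the collocation-based selection, assuming the collocation DB is a frequency table over (left, type, right) triples and that ties fall back to the first candidate; the real tagger resolves the whole sentence rather than one word at a time.

```python
# Sketch: choose among a word's candidate semantic types by the frequency
# of the surrounding trigram in the collocation DB (cf. Table 4).

COLLOCATION_DB = {
    ("NULL", "num", "@unit_money"): 138,
    ("@unit_date", "num", "@weekday"): 41,
}

def choose_type(left: str, candidates: list[str], right: str) -> str:
    return max(candidates,
               key=lambda t: COLLOCATION_DB.get((left, t, right), 0))
```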

4.4. Instance finding

To find extractable instances, we apply two major features: instance rules and context rules. The instance rules automatically obtained from structured Web documents discover instance hypotheses in a given document (see Section 6.1). The lexico-semantic pattern and the slot name are the components of the instance rules, as follows:

num @unit_date num @unit_date sym_par @living_part sym_par → $START

ukall @position sym_par @position sym_par → $NAME

We match the left-hand sides with a lexico-semantic pattern from the sentence-to-LSP transfer. If matching succeeds, the lexico-semantic pattern becomes an instance hypothesis for the slot name on the right-hand side.



Next, to expand the coverage of instance rules, POSIE merges the instance hypotheses. The instance merge recursively applies the algorithm given below:

Let A and B be instance hypotheses.

[Basic conditions]
1. The two are instance hypotheses of the same slot name.
2. The two are in the same sentence in a document. (We do not consider HTML tags because they were removed during HTML pre-processing.)

[The two are merged into a new hypothesis if]
1. The scope 14 of A includes that of B, or vice versa. 15
2. The scope of A overlaps with that of B, or vice versa.
3. A's end position meets B's start position, or vice versa.
4. There is a symbol between A's end position and B's start position, or vice versa.

The following shows some of the examples of the instance merge:

[Sentence for "course time" slot] 16
Lecture time: Saturday. 09:30–16:30 (1 day/week, 7 hours/day)

[Lexico-semantic pattern]
{#ce_c_time sym_: @weekday sym_. num sym_: num sym_- num sym_: num sym_par @unit_date num @unit_date sym_, num @unit_date num @unit_time sym_par}

[Instance hypotheses]
34 instance hypotheses (8 of them have the correct slot name), including
{@weekday} ($TIME) // Saturday
{num sym_: num sym_- num sym_: num} ($TIME) // 09:30–16:30
{@unit_date num @unit_date} ($TIME) // 1 day/week
{num @unit_date num @unit_time} ($TIME) // 7 hours/day

[The result of instance merges]
{@weekday sym_. num sym_: num sym_- num sym_: num sym_par @unit_date num @unit_date sym_, num @unit_date num @unit_time sym_par} // Saturday. 09:30–16:30 (1 day/week, 7 hours/day)
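Under the assumption that each hypothesis records its slot name, sentence id, and span measured in LSP elements, the merge conditions above reduce to the following sketch (symbol_between is a hypothetical helper standing in for condition 4):

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    slot: str     # e.g. "$TIME"
    sent: int     # sentence id within the document
    start: int    # span positions counted in LSP elements
    end: int

def mergeable(a: Hypothesis, b: Hypothesis, symbol_between) -> bool:
    """Basic conditions plus the four merge conditions of Section 4.4."""
    if a.slot != b.slot or a.sent != b.sent:   # basic conditions 1 and 2
        return False
    if a.start <= b.start and b.end <= a.end:  # 1. inclusion
        return True
    if a.start < b.end and b.start < a.end:    # 2. overlap
        return True
    if a.end == b.start or b.end == a.start:   # 3. adjacency
        return True
    return symbol_between(a, b)                # 4. only a symbol between them

def merge(a: Hypothesis, b: Hypothesis) -> Hypothesis:
    return Hypothesis(a.slot, a.sent, min(a.start, b.start), max(a.end, b.end))
```

Applied repeatedly, the four $TIME fragments above collapse into the single merged hypothesis covering "Saturday. 09:30–16:30 (1 day/week, 7 hours/day)".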

After instance search and merge, we have all possible instance hypotheses. The remaining modules of the extraction phase filter the hypotheses using context rules, dynamic slot grouping, and a slot relation check.

14 The scope means the length (from the start position to the end position) of an instance hypothesis. For example, the instance hypothesis "09:30–16:30" has a length of 7 because its LSP "num sym_: num sym_- num sym_: num" consists of 7 elements.
15 For example, if A is "09:30" and B is "09:30–16:30" in the input sentence "Lecture time: Saturday. 09:30–16:30 (1 day/week, 7 hours/day)," then the scope of B includes that of A. Thus, the two would be merged.
16 "Lecture time: Saturday. 09:30–16:30 (1 day/week, 7 hours/day)."


We have two versions of context rules (see Section 5.2): context rules from user-oriented learning (see Section 6.2) and generalized context rules from separate-context learning (see Section 6.3). These rules verify the extracted instance hypotheses using the left and right context.

4.5. Target frame filling

A context rule represents only one slot instance using the left and right context. WHISK (Soderland, 2001), on the other hand, permits rule descriptions for multi-slots, which is a major reason why WHISK gives accurate results in discovering multiple slots and their relations. However, WHISK requires learning all types of permutations because its rule description depends on the ordering of slots.

Our dynamic slot grouping removes the two major limitations of previous systems such as WHISK, i.e., the number of slots to describe and the learning load of permutation. Two or more context rules are woven into one rule after instance hypotheses are discovered. For example, if the two context rules "{#ce_c_teacher sym_:} $TEACHER {sym_par %picture @action @action sym_, %picture @position sym_par #ce_c_period}" and "{#ce_c_period sym_:} $PERIOD {#ce_c_time}" share the same boundary "#ce_c_period," then slot grouping dynamically combines the two rules into "{#ce_c_teacher sym_:} $TEACHER {sym_par %picture @action @action sym_, %picture @position sym_par #ce_c_period sym_:} $PERIOD {#ce_c_time}." There is no restriction on the number of slots to describe, because slots are freely grouped at run time. This eventually lets POSIE extract multi-slots without any training or rules dedicated to them.

The ordering of slots does not affect the learning load of permutation because the source of learning is a simple context rule, not a combined form of two or more rules. Thus, dynamic slot grouping is a promising algorithm for the multi-slot extraction that other information extractors regard as a burdensome chore. The following example shows our dynamic slot grouping; a short sketch of the chaining step follows the example:

[Input document after HTML pre-processing] 17

Child art medical cure primary class

Teacher: Yoon, Youngok (major on art medical cure, art medical curer)

Lecture period: 15 weeks

Lecture time: Saturday. 09:30–16:30 (1 day/week, 7 hours/day)

Registration fee: 450,000 Won

17 The underlined phrases are slot instances.


[Slot instances and their context]
{NULL} $NAME {#ce_c_teacher 18}
{#ce_c_teacher sym_:} $TEACHER {sym_par %picture @action @action sym_, %picture @position sym_par #ce_c_period}
{#ce_c_period sym_:} $PERIOD {#ce_c_time}
{#ce_c_time sym_:} $TIME {#ce_c_money}
{#ce_c_money sym_:} $MONEY {NULL}

[Dynamic slot grouping with left and right context 19]
Combined rule for multi-slots: {NULL} $NAME {#ce_c_teacher sym_:} $TEACHER {sym_par %picture @action @action sym_, %picture @position sym_par #ce_c_period sym_:} $PERIOD {#ce_c_time sym_:} $TIME {#ce_c_money sym_:} $MONEY {NULL}
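A sketch of the chaining step, assuming rules arrive in document order as (left context, slot, right context) triples of LSP elements; the boundary test below is our simplified reading of the shared-boundary condition.

```python
# Sketch: chain adjacent context rules that share a boundary element,
# as with "#ce_c_period" in the example above.

def share_boundary(prev_rule, next_rule) -> bool:
    right, left = prev_rule[2], next_rule[0]
    return bool(right) and right[-1] in left   # boundary reappears in next left context

def group_rules(rules):
    """rules: list of (left_ctx, slot, right_ctx); returns chained groups."""
    groups, current = [], [rules[0]]
    for prev, nxt in zip(rules, rules[1:]):
        if share_boundary(prev, nxt):
            current.append(nxt)     # weave into the same multi-slot rule
        else:
            groups.append(current)
            current = [nxt]
    groups.append(current)
    return groups
```

Because grouping happens at run time over plain single-slot rules, no permutation of slots ever has to be trained.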

Finally, the slot relation check (SR) determines the number of target frames and the slots to fill by inspecting the groups to which the slots belong, as follows:

Let M be the group with the largest number of slots.
Let slot-num(A) be the number of slots belonging to group A.
Let n be the number of groups where slot-num(A) > 1.
For each group A do
  If slot-num(A) is 1 and the slot name in A is one of the slot names in M, then remove group A.
If n is 1, the number of target frames is also 1.

18 The italic words are the context boundaries to group dynamically.
19 The same patterns in the example above indicate that they share context.
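A runnable rendering of the check, assuming each group is the set of slot names it contains (groups come from dynamic slot grouping):

```python
def slot_relation_check(groups: list[set[str]]):
    """Slot relation check (SR): drop singleton groups whose slot name
    already occurs in the largest group M; with a single multi-slot
    group, exactly one target frame is filled."""
    m = max(groups, key=len)                    # group M with the most slots
    n = sum(1 for g in groups if len(g) > 1)    # groups holding >1 slot
    kept = [g for g in groups
            if not (len(g) == 1 and next(iter(g)) in m)]
    num_frames = 1 if n == 1 else None          # the paper specifies the n == 1 case
    return kept, num_frames
```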

After target frame filling, the user can confirm the extraction results. POSIE provides several types of information, i.e., instance type, position, context score, and group number, to help the user's confirmation. The feedback updates the context rules and then the generalized context rules to incrementally improve extraction performance.

5. Rules for information extraction

Information extraction requires several pieces of knowledge closely related to a pre-defined target. Many systems attempt to minimize domain-dependent knowledge and handcrafted features. Machine learning algorithms such as co-training (Blum & Mitchell, 1998) and inductive learning (Michalski, Carbonell, & Mitchell, 1983) are widely used in this effort. We also follow this mainstream research by reusing semantic information, minimizing manual annotation, and generalizing rules using machine learning.

In contrast to other systems, we use two kinds of rules to extract and filter instances: instance rules and context rules, including a generalized version of the latter. A context rule consists of a slot name and two contexts, left and right. This two-level rule description increases extraction coverage without keeping all the possible permutations between instances and context.

5.1. Instance rules

From each table field in structured Web documents, instance rules are automatically acquired through the sentence-to-LSP transfer process (see Section 6.1). Instance rules, the tool for instance finding, consist of lexico-semantic pattern and slot name pairs. Automatic acquisition of instance rules overcomes a current major barrier of information extraction: achieving domain portability with minimal human intervention while maintaining high extraction performance. POSIE automatically extracts the instances for each slot from structured Web documents and uses them as seed instance examples.

Instance rules are the knowledge required to find slot instance hypotheses. Their role in information extraction resembles that of the rules in question answering, which contain named entities to discover answer candidates. In information extraction, however, the rules consist of slot instances. The rules are applied to test documents to obtain all possible instance hypotheses; instance filtering is then the role of the context rules.

Feedback in user-oriented learning helps to select the instance rules to be added later. With the positive confirmation of the user, new instance rules from the merged instances are added to the instance rule set. Next, the system automatically applies these newly updated rules to extract instance hypotheses. This process assures an incremental improvement of the system through enhanced recall.


Table 5
Examples of context rules

. . . Teacher: David Lee (picture remedy major, picture curer) Lecture period . . .
. . . Lecture time: Saturday: 09:30–16:30 r Registration fee . . .

Left context         Slot name a  Right context
#ce_c_teacher sym_:  $TEACHER     sym_par %picture @action @action sym_, %picture @position sym_par sym_h #ce_c_period
#ce_c_time sym_:     $TIME        sym_h #ce_c_money

a The bold and underlined items in the above examples are slot names; text to their left is the left context, and text to their right is the right context.


5.2. Context rules and generalized context rules

In this section, we describe both the context rules produced by user-oriented learning and the generalized context rules produced by separate-context learning. The two learning algorithms are applied sequentially to produce the two kinds of rules.

Context rules, composed of a left and a right context, are the knowledge representing the context of selected instances. Sample contexts and their context rules are given in Table 5.

The rules proposed by Califf and Mooney (1998) consist of a filler, a pre-filler, and a post-filler. The context rule of POSIE also has three similar parts. However, several differences exist between the two systems' rules. First, the context style and components: their rules consist of part-of-speech tags, semantic classes, and words regarded as independent features, whereas our rules are lexico-semantic patterns tightly coupled with linguistic components. Second, instance representation: they represent the filler in the same format as the pre-filler and post-filler. Their rules are, in a sense, zeroth order, i.e., more rules are required to represent various different contexts. In POSIE, instance rules and context rules are separated: a context rule, a meta-instance rule, has only a slot name. The two-level architecture enhances coverage and reduces the size of the rule set. Third, context range: they define the range as the number of common features between examples, which leaves the context too short to include all of the clue words. POSIE, on the other hand, selects the furthest slot name instance 20 within a pre-defined window size, currently 10, as the context boundary. 21

As described above, POSIE adopts a two-level rule architecture. However, without rule generalization, reliable coverage is not ensured. We propose separate-context learning, a sequential covering algorithm, to produce a generalized version of the context rules. Generalized context rules (Table 6) differ from context rules in format: a context rule consists of three parts (a left context pattern, a slot name, and a right context pattern), whereas a generalized context rule consists of four (slot name, context type, context pattern, and coverage score; see Section 6.3).

As with context rules, the slot name is the upper level of the two-level rule description. The context type carries the direction of the current rule, i.e., left or right, and its affirmativeness, i.e., positive or negative. Negative rules play a role in filtering instance hypotheses.

20 Slot name instances are variations of a given slot name, for example, "professor," "teacher," and "lecturer" for the slot name "$TEACHER."
21 When no slot name instance is found, the context becomes NULL.


Table 6
Examples of generalized context rules

Slot name  Context type  Context pattern                     Coverage score
$TEACHER   LEFT (+)      #ce_c_teacher sym_:                 5/7
$PERIOD    RIGHT (+)     #ce_c_time                          4/6
$START     RIGHT (−)     #ce_c_period @weekday #ce_c_start   2/5


If the current hypothesis matches a negative rule, the hypothesis is discarded. Where no context rules apply, we selectively match against generalized rules according to the context type. The source of a generalized rule is the set of context rules described with lexico-semantic patterns. Occasionally, however, a generalized rule includes an incomplete component, for example "#ce_c_t," derived from "#ce_c_teacher" or "#ce_c_time," because our learning algorithm produces it as the common string among context patterns. The coverage score is the number of contexts covered by the current generalized rule out of all contexts with the same slot name.

6. Incremental expansion of knowledge

To ensure incremental extraction performance, new reliable knowledge should be added to an extraction system as training proceeds. POSIE automatically extracts instances, the source of instance rules, for each slot using mDTD 22 (Kim et al., 2003). Whenever a Web robot gathers documents for a given domain, the mDTD rules extract instances from the structured documents among them. This process gradually increases the number of instance rules through sentence-to-LSP transfer. Further, POSIE incrementally expands domain knowledge using a sequence of user-oriented learning and separate-context learning. We adapt the original user-oriented learning to reduce the user's involvement by replacing manual annotation with automatic bootstrapping. User-oriented learning, a promising algorithm which applies to both the learning and extraction phases, is combined with separate-context learning to produce a generalized version of the context rules confirmed by the user.

Our knowledge expansion over instances and contexts is similar to the work of Jones and colleagues in that they also use two distinct kinds of knowledge: phrases and extraction patterns (Jones et al., 1999). However, we do not use their mutual-bootstrapping-like methodology, because an iterative bootstrapping loop over different knowledge would cause error propagation even though each loop chooses the highest-scoring pattern. We prevent error propagation by excluding iterative mutual learning between the two kinds of knowledge, and we filter instance hypotheses by applying dynamic slot grouping and validation with both instance and context rules.

6.1. Extracting instances from automatic bootstrapping

The instance extractor focuses on declarative-style knowledge, which can be extended with human interaction for practical-level performance in an actual deployed commercial system.

22 mDTD (modified Document Type Definition): an analytical interpretation to identify target information from the textual fragments of Web documents.


The extractor applies a new extraction method that combines declarative DTD-style extraction patterns with a machine learning algorithm, requiring no annotated corpus to generate the extraction patterns.

The DTD concept is generally used for markup languages such as SGML, XML, and HTML. In these documents, the DTD is usually located in an external file and defines the elements which belong to the document type (Flynn, 1998). Using a DTD, SGML documents can encode the elements they include, and those elements can be parsed as they appear in the document. We introduce the concept of mDTD, an extension of the conventional DTD concept of SGML, which we modify for applicability to HTML-based Web document extraction. The background idea of mDTD is similar to DTD usage in SGML: mDTD is used to encode and decode the textual elements of the extraction target. In the learning phase, mDTD rules are learned and added to the set of seed mDTDs for the extraction task. In the extraction phase, the learned mDTD rule set is used as extraction patterns to identify the elements in HTML documents from Web sites. The idea of mDTD gives a more structured encoding ability to an otherwise degenerate HTML document.

A Web robot gathers Web pages according to the seed URL lists for a given domain. The robot downloads only the structured Web documents among the pages. Next, the instance extractor parses seed mDTD rules and then uses token sequences to construct hierarchical mDTD object graphs. On these token sequences, the extraction process is the same as an instance classification task, where each token is classified into an instance of the extraction target based on the HTML table structure and the possible name instances of each class.

The extractor identifies various types of extraction targets, which are defined by a template with slots (like the schema attributes in a relational database), and then fills the empty template slots with the identified instances. Table 7 shows an example of the output template with its filled slots. We perform part-of-speech tagging and rule-matching tests.

Table 7
Example of a slot name and its instances extracted from structured Web documents (English italics are approximate translations)

Slot name: $NAME (course name)
Instances:
C++ programming language
CNC Processing Technology Park, Jongbok
Piping Work Park, Seungri
Confucius's love study
Lecture of scientific investigation
Course of study for tourism
Church music––piano
Beading


Basically, the extractor uses exact matching between the input token, with its POS tag information, and the symbolic rule object. If the input token partially matches the symbolic object, the matching decision depends on the ratio of matched characters to the total length of the input token, where the threshold is set at half the total length.

The POS tag sequence rules are used only for exact matching tests. If none of the symbolic rules matches by lexical similarity, this module evaluates the POS tag rules; otherwise, the POS tag rules are applied to confirm the matching results between the token and the mDTD rules. SmL (Kim et al., 2002; Kim et al., 2003), the forerunner of POSIE, describes the whole process in detail.
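A minimal sketch of the token-versus-symbolic-rule test; the character-alignment count below (a common prefix) is an assumption, since the paper does not spell out how matched characters are counted:

```python
def token_matches(token: str, symbol: str) -> bool:
    """Exact match, or a partial match covering at least half of the
    input token's characters (the threshold described above)."""
    if token == symbol:
        return True
    matched = 0
    for a, b in zip(token, symbol):   # crude alignment: common prefix
        if a != b:
            break
        matched += 1
    return matched >= len(token) / 2
```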

6.2. Adapting user-oriented learning

User-oriented learning, a moderately supervised learning algorithm introduced by Pierce and Cardie, concentrates on two main issues in information extraction: scalability and portability (Pierce & Cardie, 2001b). Real users are deployed to identify the targets they wish to locate and extract. Real users may not be experts in machine learning or text processing, but they may be qualified experts at judging their goals. The authors believe that users can specify their information needs by providing training examples; the user is proficient at judging an information structure as adequate or inadequate. User-oriented learning performs three steps: annotation, location, and confirmation. Users confirm examples as positive, negative, or unconfirmed (no decision) (Fig. 3). Distinct from active learning, the user merely confirms the desirability of new examples. Definitive judgments from the user also differentiate this form of weakly supervised learning. However, user involvement in the final decision step is inevitable to acquire acceptable quality for target extraction, as shown in Fig. 3.

Fig. 3. Extraction results before and after applying the user confirmation.

The following shows the context score (confirmation score) calculation formula:

Context score = (# positive decisions − # negative decisions) / # total decisions
(apply the threshold if # total decisions is 0 (initial condition); the current threshold is −0.25)

Using structured Web documents as annotated sources removes the need for manual annotation (Kim et al., 2002). Users can concentrate on confirmation without manually annotating a training corpus. Thus, the learning steps in POSIE are reduced to two, differing from the original user-oriented learning: location and confirmation. The judgment of the user greatly influences the location of new candidate examples. The context score ranges from −1 to 1.
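A sketch of the score as we read it, with the count of total decisions taken to include unconfirmed (no-decision) examples:

```python
THRESHOLD = -0.25   # scores at or below this behave as negative evidence

def context_score(positive: int, negative: int, total: int) -> float:
    """Confirmation score in [-1, 1]; before any decision exists
    (initial condition), fall back to the threshold."""
    if total == 0:
        return THRESHOLD
    return (positive - negative) / total
```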

6.3. Generalization of context rules

The sequential covering algorithm, a family of algorithms for inductive learning, learns one rule and then removes the examples which that rule covers (Mitchell, 1998). This iteration is called a one-rule-learning-and-discarding process. Using this process, we introduce a separate-context learning algorithm to generalize the characteristics of context rules (Fig. 4).

CalculateMinScore( ) and CalculateMaxScore( ) calculate the minimal (s_min) and maximal (s_max) covering scores for the current context rule (c_i). The covering score represents the number of rules covered by the current rule.

Covering score (for positive) = # positive rules covered by the given context
                              = −1 if some negative rules are covered by the context

Fig. 4. Separate-context learning algorithm.


Table 8
Separate-context learning compared with other sequential covering algorithms

Algorithm                  Example selection                                       Rule scoring         Example types
Separate-context learning  Best context score and frequency (specific-to-general)  Coverage             Positive/negative and left/right
SmL                        No selection                                            Coverage and length  Only positive
CN2                        Best complex (general-to-specific)                      Entropy              Positive and negative


Covering score (for negative) = # negative rules covered by the given context
                              = −1 if some positive rules are covered by the context

GeneralizeContext( ) returns a new context rule that is as long as possible while retaining the maximal (s_max) covering score. The function repeatedly grows the size of a given rule and recalculates its covering score as long as the maximal covering score does not drop; the enlargement stops as soon as it would. Like standard separate-and-conquer algorithms such as IREP (Cohen, 1995), our separate-context learning trains rules in a greedy fashion. However, we attempt to find the best context rule set by strictly holding to the highest context score.
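A compact sketch of this generalization, treating context patterns as plain strings and generalized rules as substrings; POSIE actually operates over sequences of LSP elements, so this is an approximation:

```python
def covering_score(rule: str, positives: list[str], negatives: list[str]) -> int:
    """Number of positive contexts containing the rule, or -1 if the
    rule also matches a context of the opposite polarity."""
    if any(rule in c for c in negatives):
        return -1
    return sum(rule in c for c in positives)

def generalize_context(seed: str, positives: list[str], negatives: list[str]) -> str:
    """Grow the seed inside a covered context, one character at a time,
    while the covering score stays at its maximum; stop when it would drop."""
    best = seed
    best_score = covering_score(seed, positives, negatives)
    grew = True
    while grew:
        grew = False
        for ctx in positives:
            i = ctx.find(best)
            if i < 0:
                continue
            for cand in (ctx[i:i + len(best) + 1],           # extend right
                         ctx[max(i - 1, 0):i + len(best)]):  # extend left
                if len(cand) > len(best) and \
                        covering_score(cand, positives, negatives) >= best_score:
                    best, grew = cand, True
                    break
            if grew:
                break
    return best
```

With positive contexts "#ce_c_teacher sym_:" and "#ce_c_time sym_:", the sketch stops at the seed "#ce_c_t", mirroring the incomplete component noted in Section 5.2.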

Table 8 compares three sequential covering algorithms on three important features, example selection, rule scoring, and example types, to demonstrate the generality of our algorithm. CN2 (Clark & Niblett, 1989) measures the performance of generated rules using information gain, which resembles FOIL (Quinlan, 1990), while sequential mDTD learning (SmL) (Kim et al., 2002) calculates lexical similarity and coverage rate. Our separate-context learning selects examples by both context score and rule frequency. The positive and negative examples for learning are the field instances extracted from structured Web documents. Unlike the other two algorithms, our algorithm learns four sets of examples: positive + left, positive + right, negative + left, and negative + right. Learning stops when each set of generalized rules entirely covers its own examples.

7. Experimental results

A Web robot searched Web sites, such as universities and education centers, which provide information on "continuing education." We manually gathered and filtered 431 Web documents on course information from tens of education-related Korean Web sites such as http://oun.knou.ac.kr/, http://www.ajou.ac.kr/~lifetime/, and http://ncle.kedi.re.kr/. Two hundred and forty-eight of them were semi-structured Web documents 23 and the others were structured Web documents. One thousand seven hundred and ninety-six instance rules were automatically extracted from the 3715 instances in the structured Web documents (Table 3).

23 Documents containing tuples with missing attributes, attributes with multiple values, variant attribute permutations, and exceptions.


Table 9
Incremental user-oriented learning

Measure     Technique  No user training  24 docs  54 docs  78 docs
Recall      CS         0.748             0.756    0.756    0.789
            CS+GC      0.748             0.78     0.78     0.78
            CS+GC+SR   0.748             0.772    0.772    0.772
Precision   CS         0.35              0.489    0.508    0.61
            CS+GC      0.35              0.653    0.681    0.727
            CS+GC+SR   0.35              0.674    0.704    0.731
F1-measure  CS         0.477             0.594    0.608    0.688
            CS+GC      0.477             0.711    0.727    0.753
            CS+GC+SR   0.477             0.72     0.736    0.751


These rules determine instance hypotheses from the semi-structured Web documents in the first extraction phase.

POSIE extracts instances for seven slots: prescribed number ($NUMBER), teacher ($TEACHER), course name ($NAME), start time ($START), period ($PERIOD), tuition fee ($MONEY), and school hours ($TIME). The semi-structured Web documents include several multi-slots to handle. 24 We divide the documents into two sets, a training and a test set. One hundred and seventy of the 248 are randomly selected as the test set. The training set consists of three subsets of 24, 30, and 24 documents, and POSIE measures the extraction performance after learning each training subset. Finally, we applied the context score (CS; see Section 6.2), generalized context (GC; see Section 6.3), and slot relation check (SR; see Section 4.5) strategies. 25

Table 9 shows that user-oriented learning enhances extraction performance. As performance criteria, we measure recall, precision, and F1-measure for three techniques: context score (CS), context score + generalized context (CS+GC), and context score + generalized context + slot relation check (CS+GC+SR). We define the baseline of the system as the performance with no user training, i.e., with no manual confirmation applied. The F1-measure at the baseline is 0.477. Recall is sufficiently high to apply to this domain without human intervention. The high recall and low precision imply that the automatic knowledge construction from structured documents helps to find instance hypotheses but does not provide useful information for selecting among them. As the size of the user training document set grows, the system achieves higher performance, up to 0.75 in F1-measure.

Performance increases rapidly after learning the first 24 documents. With user training documents amounting to less than 10% of the total, precision and F1-measure almost reach the peak of our extraction performance. Almost all of the improvement comes in precision, while recall stays almost completely flat. Only merged instances are added to the instance rule set upon the user's positive confirmation; that is, instance rules completely separate from the existing rule set are not added. Providing a way to consider such separate rules during user-oriented learning would certainly increase recall and overall performance, and this is one of our future works.

24 More than one target frame in a document.
25 All the strategies include the dynamic slot grouping strategy.



The CS+GC+SR strategy is always superior to CS and CS+GC, except at 78 documents in F1-measure; no exception exists in precision. The reason would be that the high performance of CS and GC diminishes SR's effect. Indeed, when CS or GC is omitted, the performance gap due to SR increases distinctly on several randomly selected documents. Generalized context rules ensure high precision: user-oriented learning makes a set of context rules, and separate-context learning generates the generalized version that determines whether a current instance hypothesis extracted from an unseen document is a real instance. From the above results, we can see that the greater the number of user learning documents, the smaller the role of the slot relation check.

We also experimented with CS+GC+SR without the dynamic slot grouping strategy. Recall is the same as in Table 9, while precision drops to 0.721 at 78 docs. The small effect of grouping here is caused by the high performance of the context rules, that is, CS and GC. As expected, without the CS and GC strategies, SR with dynamic slot grouping has a precision of 0.61 at 78 docs, as mentioned above, while SR without dynamic slot grouping has 0.47.

Fig. 5. IE performance on each slot.

Fig. 5 shows the extraction performance for each of the seven slots. For four slots, $NUMBER, $PERIOD, $MONEY, and $TIME, we obtain F1-measures higher than 0.8. On the other hand, the course name and teacher slots have F1-measures in the range 0.5–0.6. These two slots have more variation in their forms than the other slots, which is consistent with their compression ratios being lower than the others.

However, the low performance on some specific slots does not discourage POSIE. In an indirect comparison with other systems on semi-structured documents, WHISK (Soderland, 2001) and SRV (Freitag, 1998a, 1998b), for the teacher slot, 26 POSIE achieves much better performance. For WHISK, the recall for the "speaker" slot is only 0.111 at a precision of 0.526, and for SRV, precision is 0.62. POSIE shows the remarkable result that precision for the "teacher" slot is 0.769 at a recall of 0.435. This result is noteworthy because the "teacher" slot often requires more than one name, occasionally even five or six persons, for one instance in the "continuing education" domain.

26 They call it the "speaker" slot.


Table 10
Added extraction knowledge after user-oriented learning

                       No user training  24 docs  54 docs  78 docs
New instance rules     0                 112      144      200
Context rules          0                 26       317      364
General context rules  0                 16       19       14


The high recall for the "teacher" slot would come from the category dictionary, which is crucial knowledge for answering questions. Even in the worst case, precision and recall are always above 0.4, which supports the reliability of our extraction algorithm. While previous IE systems are weak at extracting persons, locations, and organizations, question answering systems endeavor to discover them. To extract these types, POSIE adopts many methodologies from the question answering system, which eventually ensures higher recall than other IE systems.

User-oriented learning and separate-context learning enrich the extraction knowledge (see Table 10). New instance rules and context rules increase incrementally as learning proceeds. General context rules oscillate in their number due to conflicts between context rules, because our learning produces only correct generalized rules, that is, rules that do not cover any opposite rules.
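The constraint that only correct generalized rules are kept can be sketched as follows; generalize and covers are assumed helpers (pairwise rule merging and rule subsumption), not POSIE's published algorithm.

```python
from itertools import combinations

def generalize_safely(context_rules, opposite_rules, generalize, covers):
    """Keep a generalized rule only if it covers no opposite rule.

    `generalize(a, b)` is an assumed helper that merges two context rules
    into a more general one, or returns None if they cannot be merged;
    `covers(g, r)` is an assumed subsumption test.
    """
    general = []
    for a, b in combinations(context_rules, 2):
        g = generalize(a, b)
        if g is not None and not any(covers(g, o) for o in opposite_rules):
            general.append(g)
    return general
```

Under this scheme a newly learned opposite rule can invalidate a previously safe generalization, so the set of general context rules may shrink as well as grow, which would explain the oscillation in Table 10.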

We also experimented on the "job offering" domain. We manually gathered and filtered 190 Web documents from sites such as http://www.jobkorea.co.kr/, http://www.joblink.co.kr/, and http://www.guinbank.com/.

POSIE extracts instances for seven slots: category, number, age, schooling, salary, area and period. We divide the documents into a training and a test set, as in the above experiment. One hundred and thirty five out of the 190 are randomly selected as the test set. The training set consists of two subsets of 24 and 55 documents. POSIE measures the extraction performance after learning each training subset. Finally, we applied the mixture of the context score, generalized context and slot relation check strategies to the test documents (Table 11).
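The incremental evaluation itself is straightforward; a minimal harness in the style of this experiment might look like the sketch below, where train_and_extract is a hypothetical stand-in for POSIE's learning-and-extraction cycle and gold holds the hand-checked (document, slot, instance) triples.

```python
def evaluate_incremental(train_subsets, test_docs, gold, train_and_extract):
    """Report P/R/F1 after each cumulative training subset (cf. Table 11).

    `train_subsets` pairs a label with the documents learned so far, e.g.
    [("no user training", []), ("24 docs", docs_24), ("55 docs", docs_55)];
    `train_and_extract` is a hypothetical stand-in for POSIE's cycle and
    must return a set of (document_id, slot, instance) triples.
    """
    for label, subset in train_subsets:
        extracted = train_and_extract(subset, test_docs)
        tp = len(extracted & gold)
        p = tp / len(extracted) if extracted else 0.0
        r = tp / len(gold) if gold else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        print(f"{label}: P={p:.3f} R={r:.3f} F1={f1:.3f}")
```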

8. Conclusion

8.1. Characteristics of the information extraction systems

Table 12 summarizes the functional characteristics of well-known information extraction systems (Eikvil, 1999). 27 The first four rows have a background in the wrapper generation communities (Kushmerick, 2000; Muslea, Minton, & Knoblock, 1998), i.e., they generate wrappers for structured Web documents with delimiter-based patterns, while the others come from the traditional information extraction communities. RAPIER (Califf & Mooney, 1998), SRV and WHISK adopt relational learning algorithms to handle a wider range of texts. The last two are systems developed from our recent research. SmL applies automatic bootstrapping to the instances in structured Web documents (Kim et al., 2003). While SmL shows the optimal reduction of human intervention and guides an adequate use of the Web document types, it suffers from unstable extraction results due to the lack of natural language processing capabilities.

27 The table entries except the last two rows are cited from Eikvil's survey.


Table 11
Incremental user-oriented learning on "job offering" (technique: CS+GC+SR)

Measure      Slot        No user training   24 docs   55 docs
Recall       Category    1.000              1.000     1.000
             Number      0.600              0.400     0.400
             Age         0.600              0.556     0.578
             Schooling   1.000              1.000     1.000
             Salary      0.911              0.911     0.911
             Area        1.000              1.000     1.000
             Period      1.000              1.000     1.000
Precision    Category    0.771              0.800     0.821
             Number      0.234              0.909     1.000
             Age         0.639              0.644     0.736
             Schooling   0.699              0.788     0.849
             Salary      0.706              0.869     0.883
             Area        0.663              0.744     0.808
             Period      0.600              0.600     0.957
F1-measure   Category    0.871              0.889     0.902
             Number      0.337              0.556     0.572
             Age         0.619              0.597     0.648
             Schooling   0.823              0.881     0.918
             Salary      0.800              0.890     0.897
             Area        0.797              0.853     0.894
             Period      0.750              0.750     0.978

Table 12
Functional characteristics of information extraction systems (O*: requires all possible permutations to be trained to process; O**: groups related information after extraction)

Name        Structured   Semi-structured   Free text   Multi-slot   Missing   Permutations
            document     document                                   items
ShopBot     O            –                 –           –            –         –
WIEN        O            –                 –           O            –         –
SoftMealy   O            O                 –           –            O         O*
STALKER     O            O                 –           O**          O         O
RAPIER      O            O                 –           –            O         O
SRV         O            O                 –           –            O         O
WHISK       O            O                 O           O            O         O*
SmL         O            O                 –           –            O         O
POSIE       O            O                 O(a)        O            O         O

(a) POSIE can handle free text documents with lexico-semantic patterns although it does not include any syntactic chunker or parser.




POSIE satisfies all the functions listed in Table 12. To handle multi-slot extraction, POSIE links related instances using shared boundaries between context rules and the slot relation check, and detects missing instances, which frequently appear in semi-structured and free text documents. Since POSIE dynamically combines the context rules extracted from a document, 28 a specific ordering of the instances does not degrade its performance.

28 This implies that it does not need to be trained on all possible permutations.
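As a simplified illustration of linking by shared boundaries, the sketch below joins two extracted instances into one record when the span of one ends exactly where the other begins. The .slot, .start, and .end attributes are assumed for illustration, and POSIE's actual linking also exploits the slot relation check.

```python
def link_by_shared_boundary(instances):
    """Link instance pairs whose spans share a boundary (end == start).

    A simplified reading of linking via shared boundaries between
    context rules; the .slot/.start/.end attributes are assumed.
    """
    by_start = {inst.start: inst for inst in instances}
    links = []
    for inst in instances:
        neighbour = by_start.get(inst.end)
        if neighbour is not None and neighbour.slot != inst.slot:
            links.append((inst, neighbour))
    return links
```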

8.2. Discussion

POSIE is an information extraction system that hybridizes automatic bootstrapping with a sequence of learning algorithms on a question answering framework. POSIE uses structured Web documents, dictionaries, and semantic information to seed patterns of instances and context. Minimal human effort is needed to validate these patterns and then iteratively discover new ones. The system has several strong points. First, minimal intervention is required from users, who are domain experts. Second, question answering techniques give high performance with reliable recall of slot instances. Third, a wide linguistic combination from lexical forms to semantic features is employed. Fourth, a sequence of learning algorithms in both the learning and extraction phases ensures incrementally improving extraction performance. Future work includes the following topics: adding new domains to ascertain domain portability, 29 designing a flexible generalization algorithm to obtain maximal coverage of unseen documents, and updating the user-oriented learning interface to add separated instance rules in order to increase recall.

29 We are now preparing "job offering and hunting" domains, and some experiments on them have shown performance similar to the "continuing education" domain.

Acknowledgements

This work was supported by BK21 (Ministry of Education) and mid-term strategic funding (MOCIE, ITEP).

References

Blum, A., & Mitchell, T. (1998). Combining labeled and unlabeled data with co-training. In Proceedings of the conference on computational learning theory.

Brin, S. (1998). Extracting patterns and relations from the World Wide Web. In Proceedings of the international workshop on the Web and databases.

Califf, M., & Mooney, R. (1998). Relational learning of pattern-match rules for information extraction. In Proceedings of the AAAI spring symposium on applying machine learning to discourse processing.

Clark, P., & Niblett, T. (1989). The CN2 induction algorithm. Machine Learning, 3(4).

Cohen, W. (1995). Fast effective rule induction. In Proceedings of the 12th international conference on machine learning.

Eikvil, L. (1999). Information extraction from World Wide Web: A survey. Technical Report 945, Norwegian Computing Center.



Flynn, P. (1998). Understanding SGML and XML tools: Practical programs for handling structured text. Kluwer Academic Publishers.

Freitag, D. (1998a). Information extraction from HTML: Application of a general machine learning approach. In Proceedings of the 15th conference on artificial intelligence.

Freitag, D. (1998b). Toward general-purpose learning for information extraction. In Proceedings of the 17th conference on computational linguistics and the 36th annual meeting of the association for computational linguistics.

Grishman, R. (1997). Information extraction: Techniques and challenges. Materials for information extraction, International Summer School SCIE-97.

Harabagiu, S., Moldovan, D., Pasca, M., Mihalcea, R., Surdeanu, M., Bunescu, R., Gîrju, R., Rus, V., & Morarescu, P. (2000). FALCON: Boosting knowledge for answer engines. In Proceedings of the 9th text retrieval conference.

Jones, R., McCallum, A., Nigam, K., & Riloff, E. (1999). Bootstrapping for text learning tasks. In Proceedings of the IJCAI-99 workshop on text mining: Foundations, techniques and applications.

Jung, H., Lee, G., Choi, W., Min, K., & Seo, J. (2003). Multi-lingual question answering with high portability on relational databases. IEICE Transactions on Information and Systems, E86-D(2).

Kim, D., Cha, J., & Lee, G. (2002). Learning mDTD extraction patterns for semi-structured Web information extraction. Computer Processing of Oriental Languages, 15(1).

Kim, D., Jung, H., & Lee, G. (2003). Unsupervised learning of mDTD extraction patterns for Web text mining. Information Processing and Management, 39(4).

Kim, H., Kim, K., Lee, G., & Seo, J. (2001). A fast and reliable question-answering system based on predictive answer indexing and lexico-syntactic pattern matching. Computer Processing of Oriental Languages, 14(4).

Knoblock, C., Lerman, K., Minton, S., & Muslea, I. (2000). Accurately and reliably extracting data from the Web: A machine learning approach. Data Engineering Bulletin, 23(4).

Kushmerick, N. (2000). Wrapper induction: Efficiency and expressiveness. Artificial Intelligence, 118.

Lee, G., Seo, J., Lee, S., Jung, H., Cho, B., Lee, C., Kwak, B., Cha, J., Kim, D., Ahn, J., Kim, H., & Kim, K. (2001). SiteQ: Engineering high performance QA system using lexico-semantic pattern matching and shallow NLP. In Proceedings of the 10th text retrieval conference.

Michalski, R., Carbonell, J., & Mitchell, T. (1983). Machine learning: An artificial intelligence approach. Tioga Publishing Company.

Mikheev, A., & Finch, S. (1995). Towards a workbench for acquisition of domain knowledge from natural language. In Proceedings of the 7th conference of the European chapter of the association for computational linguistics.

Mitchell, T. (1998). Machine learning. McGraw-Hill.

Moldovan, D., Harabagiu, S., Pasca, M., Mihalcea, R., Goodrum, R., Gîrju, R., & Rus, V. (1999). LASSO: A tool for surfing the answer net. In Proceedings of the 8th text retrieval conference.

Muslea, I., Minton, S., & Knoblock, C. (1998). STALKER: Learning extraction rules for semistructured, Web-based information sources. In Proceedings of the AAAI workshop on AI and information integration.

Nahm, U. (2001). Text mining with information extraction: Mining prediction rules from unstructured text. PhD proposal, The University of Texas at Austin.

Nahm, U., & Mooney, R. (2000). Using information extraction to aid the discovery of prediction rules from text. In Proceedings of the KDD (knowledge discovery in databases) 2000 workshop on text mining.

Nigam, K., & Ghani, R. (2000). Understanding the behavior of co-training. In Proceedings of the KDD (knowledge discovery in databases) 2000 workshop on text mining.

Pierce, D., & Cardie, C. (2001a). Limitations of co-training for natural language learning from large datasets. In Proceedings of the conference on empirical methods in natural language processing.

Pierce, D., & Cardie, C. (2001b). User-oriented machine learning strategies for information extraction: Putting the human back in the loop. In Working notes of the IJCAI workshop on adaptive text extraction and mining.

Quinlan, J. (1990). Learning logical definitions from relations. Machine Learning, 5.

RayChaudhuri, T., & Hamey, L. (1997). Active learning: Approaches and issues. Intelligent Systems, 7.

Riloff, E. (1996). Automatically generating extraction patterns from untagged text. In Proceedings of the 13th national conference on artificial intelligence.

Riloff, E., & Jones, R. (1999). Learning dictionaries for information extraction by multi-level bootstrapping. In Proceedings of the 16th national conference on artificial intelligence.


Rim, H. (2001). Language resources in Korea. In Proceedings of the symposium on language resources in Asia.

Sasaki, Y. (1999). Applying type-oriented ILP to IE rule generation. In Proceedings of the AAAI-99 workshop on machine learning and information extraction.

Shim, J., Kim, D., Cha, J., Lee, G., & Seo, J. (2002). Multi-strategic integrated Web document pre-processing for sentence and word boundary detection. Information Processing and Management, 38(4).

Soderland, S. (2001). Learning information extraction rules for semi-structured and free text. Machine Learning, 34.

Sudo, K., Sekine, S., & Grishman, R. (2001). Automatic pattern acquisition for Japanese information extraction. In Proceedings of the conference on human language technology.

Yangarber, R., & Grishman, R. (1998). Transforming examples into patterns for information extraction. In Proceedings of the TIPSTER text program phase III.

Yangarber, R., & Grishman, R. (2000). Machine learning of extraction patterns from unannotated corpora: Position statement. In Proceedings of the 14th European conference on artificial intelligence workshop on machine learning for information extraction.

Zechner, K. (1997). A literature survey on information extraction and text summarization. Paper for direct reading.