
Definition Extraction from Text:

an experiment using evolving algorithms

Claudia Borg

Department of Artificial Intelligence

University of Malta

January 2008

Abstract

Definition extraction can be useful for the creation of glossaries and in

question answering systems. It is a tedious task to extract such sentences

manually, and thus an automatic system is desirable. In this work we re-

view various attempts at rule-based approaches reported in the literature

and discuss their results. We also propose a novel experiment involving

the use of genetic programming and genetic algorithms, aimed at assisting

the discovery of grammar rules which can be used for the task of definition

extraction.

1 Introduction

The extraction of definitions from text can be useful in various scenarios, including the automatic creation of glossaries for the building of dictionaries and in question answering systems. In this work, we will focus on the use of definition extraction in eLearning, where definitions can help learners conceptualise new terms and help towards the understanding of new concepts encountered in learning material.

eLearning is the process of acquiring knowledge through electronic aids, by providing learners with access to materials that enable them to learn a particular task. A tutor can direct the learning process through a Learning Management System (LMS), where learning material is presented to the student at the tutor's direction. The tutor can also use the LMS to create courses, add new content, structure the layout and presentation of courses, and monitor student performance. Learning material, packaged into units known as learning objects (LOs), normally contains implicit information in natural language, which would require extensive work from the tutor's side to extract manually, and would thus be ideal to automate. Definitions present in LOs are part of this implicit information, and could facilitate a student's learning process if the student were presented with such information. We propose to use linguistic information to automatically identify and extract definitions from LOs, which could then be incorporated into an LMS, for instance to create a glossary. The tutor will then be able to refine this information rather than create it from scratch.

The task of definition extraction from unstructured natural language texts is a challenging one. We are attempting to identify sentences which define a term, rather than simply describe it vaguely or compare it to other terms. A definition is a term together with a description of its meaning or the concept it refers to. Different grammatical constructs are used in natural language to define terms. The most common style of definition is referred to as genus et differentia, where the genus describes the broader concept of the term and the differentia explains the features distinguishing that term from other items within that broad concept. Some definitions also consist of glosses, which present a summary of a meaning, such as the expansion of an acronym. Ideally, definitions should be complete and unambiguous, explaining the meaning of the term clearly and in simple terms. Clearly there is no crisp dividing line between definitions and non-definitions. Furthermore, some sentences can only be classified with added specialised knowledge.
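To make the genus et differentia structure concrete, the following is a minimal sketch of a pattern capturing the term, genus and differentia from a simple English copula sentence. The regex and function name are our own invention for exposition, not part of any system reviewed here, and would miss most real definitional styles:

```python
import re

# Illustrative only: "TERM is a GENUS that DIFFERENTIA."
PATTERN = re.compile(
    r"^(?P<term>[A-Z][\w\s-]*?)\s+is\s+(?:a|an|the)\s+"
    r"(?P<genus>[\w\s-]+?)\s+(?:that|which)\s+(?P<differentia>.+)\.$"
)

def match_definition(sentence: str):
    """Return (term, genus, differentia) if the sentence looks definitional, else None."""
    m = PATTERN.match(sentence.strip())
    return (m.group("term"), m.group("genus"), m.group("differentia")) if m else None
```

Such a pattern illustrates why purely lexical matching is brittle: definitions phrased through apposition, relative clauses or anaphora escape it entirely, which motivates the linguistic-feature approaches reviewed below.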

The aim of this work is to evaluate the use of machine learning techniques (GAs and GP) and their results in learning restricted grammars. The grammars developed through these experiments can then be applied by rule-based techniques to extract definitions. The results of the GP and the GA will be used to discover features which identify certain definitions with a high rate of accuracy, but also other features which, used in combination, can classify less clear-cut definitions.

In section 2 we look at the work carried out in the task of definition extraction for different applications, such as glossary creation or question answering systems. We compare various results achieved and methods used for this task, and propose to use an alternative machine learning technique which has not yet been applied to definition extraction. In our experiment, we propose to use genetic algorithms to learn how to rank the correctness of classified definitions, and genetic programming to learn new grammatical rules to capture definitions. Thus we review work carried out in similar areas using genetic algorithms and genetic programming in section 3 in order to assess the likely success of such an experiment. In section 4 we propose our solution using GAs and GPs, and explain how we intend to combine the two towards a complete definition extractor.

This work is done in collaboration with an EU-funded FP6 project, LT4eL¹. The project is described in more detail in [BR07] and [MLS07], and aims at enhancing LMSs by using language technologies and semantic knowledge.

2 Definition Extraction

The literature review presented in this section looks at various attempts at definition extraction, some of which are aimed at specific applications, such as the automatic creation of glossaries or question answering techniques. The work reviewed tends to concentrate on a specific domain or a particular type of text, with few attempts at definition extraction from generic texts. We analyse the various techniques used, from grammar patterns, cue-phrases and statistical methods to support vector machines, and present a discussion of the different approaches at the end of this section. Although definition extraction is a relatively unexplored area, the review below will help to highlight the issues and difficulties present in the task.

¹ Language Technologies for Learning: www.lt4el.eu

2.1 Grammar and rule-based extraction

Work carried out on the automatic creation of glossaries usually tends to be rule-based, taking into consideration part-of-speech (POS) information as the main linguistic feature. Park et al. [PBB02] extract glosses² from technical texts, using additional linguistic information added to the texts to enable them to identify possible glosses. For this purpose they use several external tools, including POS tagging and morphological analysis, to add the linguistic information to the texts. They manually identify the linguistic sequences which could constitute glosses, and describe the sequences as cascaded finite-state transducers. Since the sequences sometimes also capture non-gloss items, Park et al. include filtering rules to discard certain forms from the candidate set. These include person and place names, special tokens such as URLs, words containing symbols (except for hyphens and dashes) and candidates more than six words in length.

Variants are identified and grouped, with one form set as the canonical form and the others listed as variants (including misspelt items and abbreviations). Finally, all glossary candidates are ranked and presented to human experts, who select which glosses will be part of the final glossary. In the evaluation of their work, three human experts accepted 228 (76%), 217 (72%) and 203 (68%) of the top 300 candidates as valid glossary items. The evaluation does not consider missed glosses which ranked lower than 300, and the inter-annotator agreement between the three experts is not discussed. The choice of the top 300, rather than say 200, glosses seems arbitrary, possibly expecting experts to go through glosses only until non-gloss items become more frequent in the list. They also do not state the length of the texts or how many gloss items on average are present in their texts.
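The filtering stage described above can be sketched as follows. This is a hypothetical reconstruction, not the authors' code: it keeps Park et al.'s published criteria of discarding URL-like tokens, words containing symbols other than hyphens, and candidates longer than six words, while the person/place-name filter (which requires named-entity recognition) is omitted:

```python
import re

def keep_gloss_candidate(candidate: str) -> bool:
    """Return True if a candidate phrase survives Park et al.-style filtering.

    Criteria sketched from the text: at most six words; no URL-like tokens;
    no words containing symbols other than hyphens/dashes.
    """
    words = candidate.split()
    if len(words) > 6:                              # candidates over six words
        return False
    for w in words:
        if re.match(r"https?://|www\.", w):         # special tokens such as URLs
            return False
        if re.search(r"[^\w-]", w.rstrip(".,;")):   # symbols other than hyphens
            return False
    return True
```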

Muresan and Klavans [MK02] propose a rule-based system, DEFINDER, to extract definitions from technical medical texts, which can then be fed into a dictionary. The corpora used consist of consumer-oriented texts, where a medical term is explained in general language in order to provide a paraphrase. They approach the task by identifying linguistic features present in definitions or synonyms through manual observation. They point out that the structure of definitions might not always follow the genus et differentia model and that the different styles of writing can be a challenge for the automatic identification of definitions.

² A gloss provides a short meaning of a term, usually expanding an acronym or specialised terms.



DEFINDER first identifies candidate phrases by looking for cue-phrases such as “is called a”, “is the term for” and “is defined as”, or a set of punctuation marks which are deemed important for this task. A finite-state grammar is then applied to extract the definitions, basing its rules on POS tags and noun phrase chunking to help with the definition identification process. DEFINDER then uses statistical information from a grammar analysis module to improve the results, which the authors claim takes into account the different styles of writing definitions (apposition, relative clauses, anaphora). In this work we see that the automatic identification of definitions is based primarily on the identification of certain phrases, further filtered through rules that reinforce a sentence being a definition (such as its POS structure).
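The cue-phrase stage can be illustrated with a short sketch. The phrases are the ones quoted above, but the function is our own simplification; DEFINDER itself applies a finite-state grammar over POS-tagged, chunked text rather than raw substring matching:

```python
# Cue phrases quoted in the text; the real system also uses punctuation cues.
CUE_PHRASES = ("is called a", "is the term for", "is defined as")

def candidate_definitions(sentences):
    """First-pass filter: keep sentences containing a definitional cue phrase."""
    return [s for s in sentences if any(cue in s.lower() for cue in CUE_PHRASES)]
```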

The evaluation of DEFINDER focuses on comparing the classification of definitions extracted from nine articles against those marked by human annotators, obtaining a precision of 86.95% and a recall of 74.47%. Since the end use of DEFINDER is to provide alternative definitions to medical jargon, a qualitative evaluation focuses on the usefulness and readability of the definitions proposed in comparison to other online medical dictionaries.

Storrer and Wellinghoff [SW06] report work on definition extraction from technical texts using valency frames. Valency frames contain linguistic information indicating which arguments a verb takes, such as object, subject, position and prepositions. Frames are used to match the structures of sentences, and definitions are thus extracted using rules centred on verbs. This is a rule-based, expert-driven approach, with all information being provided by human experts (valency frames, definition categorisation). Such an approach is possible because definitions in technical texts tend to be well-structured, frequently matching crisper rules. Classifying definitions into categories according to their valency frame allows Storrer and Wellinghoff to concentrate on specific patterns per rule, taking a fine-grained approach rather than a generic 'catch-all' definition extractor. The evaluation was carried out on a corpus containing manually identified definitions, and achieved an average of 34% precision and 70% recall over all the rules created.

Walter and Pinkal [WP06] work on definition extraction from German court decisions using parsing based on linguistic information. Their corpus consists of court decisions restricted to environmental law, due to the domain-specific terminology in legal texts. They identify two styles of definitions: (i) normative knowledge is described as text that 'connects legal consequences to descriptions of certain facts', and (ii) terminological knowledge consists in 'definitions of some of the concepts used in the first descriptions'. They discuss issues specific to legal text, where, although the text is fairly structured, the decisions themselves can be controversial, as the terms used in legal texts often need clarification.

Through an initial survey, the common structural elements constituting definitions were recognised as (i) the element that is defined (the term), (ii) the element that fixes the meaning to be given to the term (the definition), (iii) the connector, indicating a relation between the term and the definition, (iv) a qualification specifying a domain area, and finally (v) a signal word that serves to mark a definition (dann in German, for which there is no equivalent term in English). The first three structural elements are applicable to any definition. The fourth is domain-specific, being more applicable to legal texts, and the fifth is present in German but not in English, and is thus language-specific.

Walter and Pinkal also note that definitions contain broad linguistic variations and that simply using keyword or pattern matching would not suffice to extract such sentences. On the other hand, through the recognition of the similarities between the structural elements, and with added linguistic knowledge, the task of definition extraction becomes more effective. Based on these observations, they manually craft 33 extraction rules which retrieve 4716 sentences as definitions from their corpus. Filtering rules are then applied, bringing the quantity down to 1342 sentences (28% of the initial 4716). In order to evaluate the extracted definitions, 473 sentences were given to two annotators to decide on the correctness of the classification, reaching a precision of 44.6% and 48.6% respectively. Precision improves considerably (to over 70%) when only the best 17 of the 33 rules are used. Recall is not calculated, since the corpus is not annotated with definitions.

Fahmi and Bouma [FB06] use the medical pages of the Dutch Wikipedia as their corpus, starting with a rule-based approach to definition extraction and then turning to machine learning techniques to try to improve their results. They start by extracting all sentences containing the verb to be, together with some other grammatical restrictions, as the initial set of possible definition sentences. These sentences are then manually categorised into definitions and non-definitions. From these sentences they observe which features can be important in distinguishing definitions from non-definitional sentences. These features include text properties (n-grams, root forms, and parentheses), sentence position, syntactic properties (the position of the subject of the sentence and whether the subject contains a determiner or not) and named entities (location, organisation and person). These features, and different combinations of them, are then categorised into attribute configuration files which are fed into three supervised learning methods, namely naïve Bayes, maximum entropy and support vector machines. This work evaluates the improvement achieved in definition extraction by using these machine learning methods and the different feature combinations. The precision achieved over the different methods and configuration files varies from 77% to 92%, with the best-performing method being maximum entropy on the configuration using the first three features described above. Precision was calculated over manually annotated definitions within their corpus; recall is not discussed.
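A rough sketch of the kind of feature vector that could be fed to such classifiers, using the feature families listed above. The function and its heuristics are invented for illustration; the original system derives its syntactic features from full parses of Dutch sentences:

```python
def sentence_features(sentence: str, position: int) -> dict:
    """Hypothetical feature extractor in the spirit of Fahmi and Bouma:
    text properties, sentence position, and crude lexical cues. Real systems
    feed such vectors to naive Bayes, maximum entropy or an SVM."""
    tokens = sentence.lower().rstrip(".").split()
    return {
        "starts_with_determiner": tokens[0] in {"a", "an", "the"},
        "contains_is": "is" in tokens,       # the 'to be' restriction from the paper
        "has_parentheses": "(" in sentence,  # a text property
        "first_sentence": position == 0,     # the sentence-position feature
    }
```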

2.2 Definition extraction from the web

Klavans et al. [KPP03] look at the Internet as a corpus, focusing mainly on large government websites and trying to identify definitions by genus using linguistic information. They use POS tagging and noun phrase chunking as part of the linguistic tools to annotate documents, and also use cue phrases to identify possibly important information. In this task, several problems are identified, including the format of the definitions and the context in which they are present. Definitions on the Internet can be ambiguous, uncertain or incomplete. They are also derived from heterogeneous document sources. Another problem encountered is that the Internet is a dynamic corpus: websites can change their information over time. An interesting discussion is presented on how to evaluate a definition extractor, proposing a gold standard for this type of evaluation, based on qualitative evaluation alongside the standard quantitative metrics. In their evaluation, human annotators are given 25 definitions and are asked to identify the head word and the defining phrase. They then present the inter-annotator agreement according to what has been marked. They remark that although there is higher agreement on the marking of head words, there is lower agreement on the defining phrase, with 81% agreement when considering partial phrases (e.g. one annotator marks a full sentence, and another marks only part of it).

Liu et al. [LCN03] apply a rule-based approach to extract definitions of concepts for learning purposes from the Internet, focusing mainly on Computer Science topics. They aim to assist learning by presenting learners with definitions of concepts, and with the sub-topics or salient concepts of the original topic. Their system queries search engines with a concept, and the top 100 ranked results are retrieved. In order to discover salient and sub-topics, they look at layout information in the HTML tags for features such as headings, bold and italics. A rule-based approach is then applied to filter out items which are generally also highlighted in webpages (such as company names, URLs and lengthy descriptions) and to remove stopwords. They then take frequency into consideration and rank the proposed salient or sub-topics.

Definition extraction is then attempted for the concepts and sub-concepts identified in the first phase of their work. The identification is carried out through rule-based patterns (e.g. {concept} {refers to | satisfies}...). Webpages containing definitions are attached to the concept and presented to the user in ranked order. The ranking is based on how many concepts/definitions are present in a webpage (the more available, the higher the ranking, as the page is considered more informative).
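The pattern-based identification can be sketched as below. Only the cue alternatives "refers to" and "satisfies" are taken from the text; "is defined as" and the function names are our own additions for illustration:

```python
import re

def definition_pattern(concept: str) -> re.Pattern:
    """Build a Liu et al.-style pattern of the form {concept} {refers to | satisfies}...
    The third cue alternative is an illustrative addition."""
    cues = r"(?:refers to|satisfies|is defined as)"
    return re.compile(re.escape(concept) + r"\s+" + cues, re.IGNORECASE)

def contains_definition(page_text: str, concept: str) -> bool:
    """True if the page appears to define the given concept."""
    return definition_pattern(concept).search(page_text) is not None
```

Because the concept string is known in advance, no linguistic annotation is needed at this level of processing, which matches the observation below that simple keyword matching suffices for their setting.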

Liu et al. also propose a way of dealing with ambiguity, where the term being learned is too generic and may appear in different contexts (e.g. classification may be used in library classification, product classification, data mining classification, etc.). To differentiate between the contexts, the term is allocated a parent topic, and through the use of document layout structure a hierarchical structure of topics is built accordingly. Mutual reinforcement is also used to provide further evidence for the hierarchy built, by further searching for sub-topics under a particular concept. This generally results in finding information about other salient topics under that same concept, which further confirms that the sub-topics are related and belong to the parent concept. Their evaluation compares the definitions retrieved by their system (achieving 61.33% precision) to the results provided by Google (18.44% precision) and AskJeeves (16.88% precision).

It is interesting to note that the definition extraction phase is preceded by a phase of concept identification, simplifying the task of definition extraction. However, the search for definitions is carried out on particular concepts, and not on all definitions contained in a given text. This might result in definitions of other concepts present in parsed documents being lost. The work also does not make any use of linguistic information, since at their level of processing simple pattern matching on particular keywords is sufficient. The web is a very large corpus and thus can provide many documents containing definitions. However, this can also be a disadvantage affecting the quality of the definitions found, similar to the problems encountered by Klavans et al. [KPP03] described above (i.e. ambiguous, uncertain or incomplete definitions). This is partially mitigated by providing several resulting definitions, with documents containing more definitions (which might be more authoritative) presented first. However, the quality of definitions is important in any learning phase, and can determine the learner's understanding of a concept.

2.3 Answering definitional questions

Question Answering (QA) can be considered part of information retrieval, attempting to provide answers to questions posed in natural language by trying to identify the reply or replies present in a collection of documents. There are several areas on which QA modules focus, including definitional questions, on which the following review concentrates. QA systems use the question posed as a starting point for searching for information, and apply linguistic techniques to filter out unnecessary content. Some systems apply summarisation techniques to provide detailed answers rather than short snippets. In either case, we are interested in analysing the techniques used to identify sentences containing possible definitional information, and the way the sentences are chosen to be presented to a user.

Prager et al. [PRC01] attempt a solution to definitional questions to form part of a full QA system. They identify those hypernyms³ which co-occur with the definition term, and then check with Wordnet how closely related the two terms are. Co-occurring terms with the highest occurrence count and closest relation in Wordnet are ranked highest. They argue that the highest-ranked co-occurring terms are more likely to be considered acceptable hypernyms and can be presented as a definitional answer. However, not all definitional questions can be answered in this way, since sometimes Wordnet does not contain the term being defined, or does not contain useful hypernyms which could be presented as definitions. In some cases, a hypernym is not the ideal answer, as a user might not know what the answer means, and the system risks falling into a cyclic pattern of definitional answering.

Joho et al. [JLS01] observe particular patterns which could extract definitional information or hypernyms not present in Wordnet. Their method extracts all the sentences which the query noun matches, and then ranks the sentences according to (i) the sentence number: the sentence where the query term occurs first in a document is more likely to contain a definition (if a definition is present in the document at all), (ii) the word count of the highest-occurring words in all candidate answers (excluding stop words) and, finally, (iii) a weight given to the pattern itself.

³ A term is said to be a hypernym of a group of terms where the former denotes a class of objects of the latter. The term 'animal' is a hypernym for the terms cat, dog, bird, etc.
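The three ranking signals can be combined into a toy scoring function. The weighting below is invented purely for illustration; Joho et al. tune their own combination:

```python
def rank_candidates(candidates):
    """Rank candidate sentences by the three signals described above.

    candidates: list of dicts with 'sentence_no' (position in document),
    'keyword_count' (shared high-frequency words) and 'pattern_weight'.
    Earlier sentences and stronger patterns rank higher; the coefficients
    are hypothetical.
    """
    def score(c):
        position_score = 1.0 / (1 + c["sentence_no"])  # (i) earlier is better
        return position_score + 0.1 * c["keyword_count"] + c["pattern_weight"]
    return sorted(candidates, key=score, reverse=True)
```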

Miliaraki and Androutsopoulos [MA04] use the work of Prager et al. and Joho et al. described above as the baseline implementation for their work, which they apply to the TREC QA data⁴. Their learning-based method uses a Support Vector Machine (SVM), and they attempt to learn the classification of definitions based on four different configurations. The SVM is given a training set of terms to be defined and all 250-character snippets where the term occurs in the middle, retrieved from the IR engine. Each snippet is converted into a training vector according to the configuration being learned. The snippets are manually classified as acceptable definitions or not (3004 definitions vs. 15469 non-definitions). The SVM learns to generalise the attributes in order to classify unseen vectors.

1. The first configuration is based on the work by Joho et al. The SVM learns the attributes of (i) sentence position, (ii) word count, based on the words appearing most frequently in all snippets, and (iii) a weight for each of the manually crafted patterns, 13 in all.

2. The second configuration takes the above attributes, but also includes an additional binary attribute indicating whether the snippet contains one of the best hypernyms or not.

3. The third configuration adds m extra binary attributes to the above, each corresponding to an automatically acquired pattern. The patterns considered are n-grams extracted from the documents, occurring immediately before or after the query term. The patterns are ranked according to precision, and the best 100 to 300 patterns are used. The advantage of using n-grams is that domain-specific indicators in definitions are also learnt.

4. The fourth configuration discards the binary attribute indicating whether a snippet contains a hypernym or not. This was done to evaluate whether the use of Wordnet facilitates finding definitional answers.

Their results show that the third configuration obtains the best results, with 84% of the questions being answered with acceptable snippets.
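A sketch of how a snippet might be encoded as a training vector under the third configuration. The encoding below is a simplification we constructed from the description above; the actual attribute set uses pattern weights and more refined counts:

```python
def snippet_vector(snippet: str, sent_pos: int, patterns, ngrams):
    """Encode a snippet as [position, word count, pattern flags..., n-gram flags...].

    'patterns' stand in for the 13 manually crafted patterns and 'ngrams' for
    the m automatically acquired ones; both are plain substrings here, which
    is a deliberate simplification of the original attributes.
    """
    text = snippet.lower()
    vec = [sent_pos, len(text.split())]
    vec += [1 if p in text else 0 for p in patterns]  # crafted-pattern attributes
    vec += [1 if g in text else 0 for g in ngrams]    # n-gram attributes
    return vec
```

An SVM trained on such vectors, labelled acceptable/unacceptable, would then classify vectors for unseen snippets.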

Blair-Goldensohn et al. [BGMS04] apply both pattern matching and machine learning techniques to answering definitional questions from generic corpora covering a range of topics. They then apply summarisation techniques to provide long-format answers. They work towards extracting all possible sentences that contain definitional information, based on a pre-defined predicate set as follows:

1. Genus, a generic description of the category the term belongs to

⁴ TREC is the yearly Text REtrieval Conference, with a specific track on QA.



2. Species, specific information which distinguishes the term from other terms in the same category

3. Target Partition, dividing the term into sub-parts

4. Cause — the cause of something

5. History — historical information about the term

6. Etymology — information on the term’s origin

7. Non-Specific Definitional, any type of information related to the term (this can be seen as a superset of all the above categories, and more)

In order to learn the above predicates, 14 terms were selected and 5 documents were retrieved from the web for each term. The texts were manually annotated according to these predicates and used as training data. Two approaches were considered for learning the categorisation of sentences according to the predicates. The first uses machine learning techniques to learn feature-based classification, and the second uses lexicosyntactic patterns manually extracted over the training dataset.

The machine learning technique uses the Incremental Reduced Error Pruning (IREP) learning algorithm, achieving 81% precision on the training data. The rules learnt take into consideration term concentration and position within a document, and the accuracy is considered sufficient since these sentences will not be presented directly to the user, but rather summarised at a later stage. The second technique concentrates on identifying definitions by genus et differentia using lexicosyntactic patterns. Using the 18 manually crafted rules, a precision of 96% over the training set was achieved; recall was not given. Since summarisation is applied to the extracted sentences, the final evaluation focuses on the readability and quality of the summaries provided as answers.

Sang et al. [SBdR05] develop two QA strategies that look specifically at medical texts about RSI (Repetitive Strain Injury) in Dutch, based on layout information and syntactic parsing. By restricting the domain to medical texts, the types of questions asked and the types of answers expected are identified as follows: treatment, symptom, definition, cause, diagnosis, prevention and other. The types of answers required are also identified, as named entities (what does RSI mean?), a list (a list of causes), a paragraph (a more detailed explanation), or a yes/no type of answer.

The first strategy was a layout-based information extraction system, which extracts the first sentence of each article from a medical encyclopedia. Through this exercise it was observed that in larger descriptions in encyclopedias, information is divided into sections with headings containing keywords similar to the question types recognised (treatment, symptoms, causes), and thus further definitions could be extracted from such sections. The second strategy uses syntactic parsing based on dependency relations, with the corpus annotated and stored as dependency parse trees. Patterns were then defined manually to extract information pertaining to definitional, cause and symptom questions. Definitional answers were based on the concept being present as the head word followed by the verb to be, and achieved a precision of 18%. Symptom answers were based on the presence of particular phrases (a sign of, an indication of, is recognizable, points to, manifests itself in, etc.), with a precision of 76%. Cause answers were based on the presence of certain phrases (causes, cause of, results of, arises, leads to), extracting 6600 definitions and achieving 78% precision.

The extracted sentences are then made available to a QA system, and if no answer is found within that set, a generic QA system is used as a fallback option. The results show that these systems do not perform well: the precision of the combined correct and incomplete answers is 14% and 12% for the respective systems. Sang et al. argue that the three main reasons for these results are that (i) these strategies do not include question analysis, (ii) the corpus is too small (e.g. it does not contain any information on the treatment of RSI) and (iii) the fallback generic QA system is not appropriate for medical texts.

2.4 Summary

As this literature review shows, rule-based approaches rely heavily on expert knowledge for hand-crafting and improving rules for definition extraction. Filtering rules are also used to improve results, as described by Park et al. [PBB02], Muresan and Klavans [MK02], and Walter and Pinkal [WP06]. With the exception of Liu et al. [LCN03], all rule-based or machine learning techniques exploit linguistic information such as POS tags or named entity recognition in order to abstract away from specific terms and instead focus on the linguistic structure of definitional sentences. Whilst most focus their corpora on technical, legal or medical texts where definitions are generally well-structured, Blair-Goldensohn et al. [BGMS04] manage to achieve very positive results in extracting definitions from more generic domains. However, QA systems already have the query term as a starting point and thus need only consider sentences containing that term. This feature of QA systems could explain why the identification of definitional sentences produces rather positive results.

Identifying glosses in the work of Park et al. [PBB02] can be seen as a subset of identifying full definitions. Whilst achieving an average precision of 72%, their system is aimed at providing semi-automatic glossary creation, suggesting possible glosses to a human expert. For their task, they consider this result satisfactory. Muresan and Klavans [MK02] point out various difficulties encountered when attempting definition extraction, especially since writing styles can differ. Their system DEFINDER still manages to obtain high precision and recall; however, the system works with medical texts and its evaluation is carried out on a set of 53 definitions, which is considerably small. On the other hand, the results obtained by Storrer and Wellinghoff [SW06] and Walter and Pinkal [WP06] could indicate that relying solely on grammatical rules is not enough. This is further substantiated in various other works, including that of Muresan and Klavans [MK02], who use statistical techniques over and above grammatical information, and Fahmi and Bouma [FB06], who apply machine learning to learning rules for definition extraction.

Fahmi and Bouma manage to increase precision substantially, from 54% to an average of 90%, by introducing an element of machine learning into rule-based definition extraction. One must take into account that the corpus used is a set of Wikipedia articles, which are well structured, with definitions tending to be contained within the first sentence or paragraph of the article. This surely facilitates the task of identifying definitions. They also concentrate on sentences containing the verb to be, thus limiting the learning task.

Extracting definitions from web documents presents further challenges, in that such a system would have to deal with the faults of the web. The content is dynamic, likely to change or become unavailable, and not necessarily authenticated. The last point can cause problems if extracted definitions are incorrect because they state erroneous facts. The problem here is not the definition extractor, but rather the origin of the text, whose reliability cannot be verified with automatic methods.

QA systems rely heavily on lexical information, sentence or term position, and cue-phrases. Sang et al. [SBdR05] approach QA by extracting all first sentences contained in medical articles on RSI. They consider these sentences as definitions which could possibly answer definitional questions. As one could expect, this system does not perform well when compared to other systems, primarily because it limits its answers to this extracted set of definitions. It does not try to extract other definitions contained within the rest of the texts, which might provide more correct answers for their QA system.

In work by Blair-Goldensohn et al. [BGMS04] and Miliaraki and Androutsopoulos [MA04], we see that approaching definition extraction with machine learning techniques increases precision considerably to over 80%. This is a positive result, reflecting also the increase in accuracy by Fahmi and Bouma (92%). Of particular interest is the work by Blair-Goldensohn et al., where they are interested in a wider spectrum of definitions which will finally be summarised into a long-format answer.

From the above we can conclude that the results achieved by rule-based techniques can be improved by including machine learning techniques. However, since definition extraction is still relatively unexplored, not many different machine learning techniques have been attempted on this type of task. The purpose of this work is to explore the use of evolutionary algorithms, specifically Genetic Algorithms and Genetic Programming, for the task of rule learning and the ranking of definitions. The next section therefore reviews work carried out in areas similar to the task of definition extraction using these techniques.

3 Machine Learning

A Genetic Algorithm (GA) [Hol75, Gol89] is a search technique which emulates natural evolution, attempting to find an optimal solution to a problem by mimicking natural selection. By simulating a population of individuals represented as strings (with a particular interpretation), GAs try to evolve better solutions by selecting the best-performing individuals (through the use of a fitness function), allowing them to survive and reproduce. This is done using two operations called crossover and mutation. Crossover takes two individuals (parents), splits them at a random point, and switches the parts over, thus creating two new individuals (children, or offspring). Mutation takes a single individual and modifies it, usually in a random manner. The fitness function measures the performance of each individual, and the GA uses it to decide which individuals should be selected for crossover and mutation, and which should be eliminated from the population. This process mimics the survival of the fittest, with the better-performing individuals remaining within the population and the poorly performing individuals being removed.
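As a minimal illustration of the cycle just described, the following sketch (our own illustration, not code from any of the reviewed systems) evolves bit-string individuals using single-point crossover, pointwise mutation and survival of the fitter half of the population:

```python
import random

def crossover(parent_a, parent_b):
    """Single-point crossover: split both parents at a random
    point and swap the tails, producing two children."""
    point = random.randrange(1, len(parent_a))
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])

def mutate(individual, rate=0.05):
    """Flip each bit with a small probability."""
    return [(bit ^ 1) if random.random() < rate else bit
            for bit in individual]

def evolve(population, fitness, generations=50):
    """Keep the fitter half, refill the population with mutated
    offspring of randomly paired survivors, and return the best."""
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        survivors = population[:len(population) // 2]
        offspring = []
        while len(survivors) + len(offspring) < len(population):
            a, b = random.sample(survivors, 2)
            c1, c2 = crossover(a, b)
            offspring += [mutate(c1), mutate(c2)]
        population = survivors + offspring[:len(population) - len(survivors)]
    return max(population, key=fitness)
```

Run on the classic OneMax problem (fitness is simply the number of 1-bits), such a loop quickly drives random bit strings towards the all-ones optimum.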

Genetic Programming (GP) [Koz92] is based on GAs, with the singular difference that individuals are represented as programs (stored as trees), as opposed to the string representation in GAs. In this manner, 'programs' can be evolved through the use of GP. This technique, although still not as widespread as GAs, retains the advantages of GAs while taking advantage of a tree representation, which is better suited to certain learning problems. The same types of operations are carried out in GP, exchanging subtrees to create new individuals in the crossover phase, and changing one node randomly during the mutation phase. The fitness function serves exactly the same purpose as in GAs, in that it decides which individuals will be selected to proceed further into the lifetime of the population.
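To make the tree representation concrete, here is a small sketch (our own illustration, with nested tuples standing in for GP parse trees) of how a subtree can be located, extracted and swapped during crossover:

```python
import random

# GP individuals as nested tuples: ("op", child, child, ...) for
# internal nodes, plain strings for terminal leaves.
def all_paths(tree):
    """List the path (sequence of child indices) to every node."""
    paths = []
    def walk(node, p):
        paths.append(p)
        if isinstance(node, tuple):
            for i, child in enumerate(node[1:], start=1):
                walk(child, p + (i,))
    walk(tree, ())
    return paths

def get_node(tree, path):
    """Follow a path of child indices down to a node."""
    for i in path:
        tree = tree[i]
    return tree

def replace_node(tree, path, new):
    """Return a copy of the tree with the node at `path` replaced."""
    if not path:
        return new
    i = path[0]
    return tree[:i] + (replace_node(tree[i], path[1:], new),) + tree[i + 1:]

def subtree_crossover(a, b):
    """Swap one randomly chosen subtree between the two parents."""
    pa = random.choice(all_paths(a))
    pb = random.choice(all_paths(b))
    return (replace_node(a, pa, get_node(b, pb)),
            replace_node(b, pb, get_node(a, pa)))
```

Because only whole subtrees are exchanged, the children remain well-formed trees, which is exactly the structural guarantee described above.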

Concerns in both GAs and GP are similar. The fitness function is generally difficult to derive, and experiments are required to determine the effect of different fitness functions. Other parameters, such as population size and how often mutation and crossover are carried out, also have an effect on the results. With respect to GP, the tree representation translates into a larger search space, and crossover and mutation are more difficult to implement. These factors have to be considered, and the literature review below will analyse such design decisions.

Definition extraction is closely related to grammatical inference, in which machine learning techniques are used to learn grammars from data. GAs and, less frequently, GPs are two such techniques which have been applied to grammatical inference. We will thus review work in this area to observe the different techniques used and the results achieved.

3.1 Genetic Algorithms

Lankhorst [Lan94] describes a GA used to infer context-free grammars from positive and negative examples of a language. New, possibly better-performing grammar rules may be discovered by combining parts of different grammar rules. A discussion is presented on the choice of gene representation, between a binary representation and a high-level representation. It is argued that a bit string can represent many more schemata than higher-level representations, yielding better coverage of the search space. A bit representation is chosen, with the lower-order bits encoding the right-hand side of a rule,

whereas the higher-order bits encode the left-hand-side symbol. Selection is based on a stochastic universal sampling algorithm, which helps to prevent premature convergence by 'holding back' super-individuals from taking over the population within a few generations. The best individual is also always allowed to survive to the next generation. Mutation allows a bit in a chromosome to be flipped; however, this operation is given a low probability so as not to change the population too randomly. Reproduction is affected by the chosen encoding scheme: the crossover point is influenced by how the representation of the rule is expressed. Lankhorst chooses a two-point crossover5 system, thus allowing right-hand-side rules to cross over more easily.

The grammars that Lankhorst learns with the GA are intended to classify positive and negative examples of a language correctly. Thus the fitness function takes into consideration the correct classification of positive and negative examples. For several languages, such as matching brackets and '0*1*0*1*', the fitness function allows the population to converge to a solution with reasonable results (the number of generations the GA needed to converge to the best solution varies from 17 to over 1000, depending on the grammar being learnt). However, for a micro-NL language, this fitness function did not result in a correct grammar. A further adjustment, allowing correct classifications of substrings to influence the fitness score, enabled the GA to converge correctly for the micro-NL grammar after just under 2000 generations.

Losee [Los96] applies a GA to learn syntactic rules and tags for the purpose of document retrieval, by providing linguistic meaning to both documents and search terms. Individuals in the GA represent a syntactic rule for parsing, a rule containing a sequence of POS tags, or a combination of the two. Losee limits the number of rules for each nonterminal symbol to 5, to allow the experiment to achieve acceptable results given the computational constraints at the time. He also restricts the GA to produce one offspring during the crossover phase. He maintains that, apart from improving computational performance (since only one child has to be evaluated), this also introduces superior genes into the population more quickly, as it increases the rate of learning. As a fitness function, the GA uses a weighted function of the resulting document ranking and the average maximum parse length, thus measuring performance in terms of how effectively the added linguistic information facilitates document retrieval. The evaluation measures which documents are retrieved and the length of the parse required to retrieve them. The GA is reported to improve the quality of the initially randomly generated syntactic rules and POS tags.

Belz and Eskikaya [BE98] attempt grammatical induction from positive data sets in the field of phonotactics. The grammar is represented as finite state automata, and the paper looks at the use of a GA for this task. Two sets of results are produced, one for German syllables and the other for Russian bisyllabic words. The GA is described in detail, including the type of methods selected and the chromosome representation. The GA used is a fine-grained one, where individuals are placed on a fixed-size matrix and are allowed to mate only with immediate neighbours. The neighbourhood function is defined as though the individuals lie on the surface of a torus.

5 A two-point crossover splits each individual into three parts rather than the standard two.

They argue that an important issue is the representation used for individuals, and present two alternatives: (1) production rules of the form s1 → as2 and s1 → a (where s is a nonterminal symbol and a is a terminal symbol), or (2) a state transition matrix. They argue that production rules give a more fine-grained genotype representation, since the terminals and nonterminals can be represented individually. State transition matrices, on the other hand, can only be seen as a whole, with each transition represented by a single cell. The representation finally chosen for this work is the transition matrix: each individual represents a possible transition matrix, which in turn represents the grammar being induced. In order for a GA to be used, the matrix is flattened into one string (one row after the other).
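As a sketch of this flattening (our own illustration; the entry values and the use of -1 for a missing transition are assumptions, not details from the paper), a transition matrix with one row per state and one column per input symbol becomes a linear chromosome by concatenating its rows:

```python
# A finite-state automaton as a state-transition matrix: rows are
# states, columns are input symbols, and each entry is a successor
# state (here -1 marks "no transition").
def flatten(matrix):
    """Concatenate the rows into a single flat chromosome,
    as done when handing a transition matrix to a GA."""
    return [cell for row in matrix for cell in row]

def unflatten(chromosome, n_symbols):
    """Recover the matrix from the flat chromosome, given the
    number of input symbols (columns)."""
    return [chromosome[i:i + n_symbols]
            for i in range(0, len(chromosome), n_symbols)]
```

Because the flat string is just the matrix in row-major order, a crossover or mutation point in the chromosome always corresponds to a specific (state, symbol) cell, which is why the GA operators must be constrained to keep the matrix structure sound.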

The chosen representation has direct implications for the rest of the GA operations. Crossover and mutation cannot be carried out in the traditional GA sense, and certain knowledge must be built into the GA so as to maintain a sound structure for this flattened matrix. Belz and Eskikaya manage to discover automata for German syllables and, on average, for 94% of the Russian noun dataset, achieving rather promising results.

Keller and Lutz [KL97] attempt to learn context-free grammars through the use of GAs, by learning probabilities for all possible grammar rules. The initial population of the GA is made up of all possible combinations of terminals and nonterminals in rules of the form A → BC and A → a, where A, B and C are nonterminals and a is a terminal. This guarantees that, although the grammar is large, it is finite. There is also no loss of generality, as all possible rules are present in the initial grammar (including those that will not be part of the final solution).

The type of GA chosen for this work also uses a 2D grid in torus formation, where mating occurs only between immediate neighbours. Each individual is encoded as a set of weights, each weight relevant to one parameter. A weight is represented as an n-bit block, so an individual can be viewed as consisting of M blocks of n bits, where M is the number of all possible rules. Since the initial grammar is considerably larger than the final solution, Keller and Lutz try to give more importance to zero-probability assignments. Thus, the initial bit(s) of each n-bit block act as a 'binary switch' determining whether the rest of the block should be taken into consideration or not.

As for crossover, Keller and Lutz achieved better performance with a novel genetic operator, which they call and-or crossover, than with the classical crossover operation. The and-or crossover looks at the parents bit by bit, with one child taking the bits produced by the and-operator (conservative) and the other child taking the bits produced by the or-operator (liberal).
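The operator itself is simple to state over bit strings; the following sketch (our reconstruction from the description above) produces the conservative and liberal children:

```python
def and_or_crossover(parent_a, parent_b):
    """And-or crossover on bit strings: one child keeps only the
    bits set in both parents (conservative, bitwise AND), the
    other keeps the bits set in either parent (liberal, OR)."""
    child_and = [a & b for a, b in zip(parent_a, parent_b)]
    child_or = [a | b for a, b in zip(parent_a, parent_b)]
    return child_and, child_or
```

Unlike point crossover, both children are fully determined by the parents, with the conservative child never enabling a rule that either parent had switched off.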

Their experiments focus on six grammars (equal numbers of as and bs, anbn, balanced brackets, balanced brackets with two bracket symbols, palindromes over {a,b}, and palindromes over {a,b,c}). They ran the GA ten times for each grammar, except for the last language, which ran only three times due to the processing time required. The results show that in the majority of runs the GA was able to learn a grammar which contained all the rules necessary to generate the language. In some instances it would also learn an additional rule with a near-zero probability. In the cases where the GA does not converge to a suitable grammar, it is terminated after a set number of generations. Keller and Lutz suggest that this is due to the presence of local maxima around which the population converges.

Spasic et al. [SNA04] aim at classifying biomedical terms into specific classes, which represent concepts from an ontology in the biomedical domain. In order to derive the possible class of a term, they look at the context surrounding that term. This context is learned through data mining, extracting the contextual patterns surrounding the terms. The patterns contain morpho-syntactic and terminological information and are represented as generalised regular expressions. Each contextual pattern is then given a value indicating its statistical relevance. They use this to remove the top- and bottom-ranked features, since these are considered too general or too rare to play a role in term classification. Class selection for a term is then learned using a GA. For a particular class, the GA tries to learn which of the contextual patterns are relevant. Each individual in the GA is a subset of contextual patterns, and its fitness corresponds to the precision and recall obtained by using these patterns on the training data. Eventually the GA learns a good subset of features which can be used to identify terms of that class.

In their evaluation, Spasic et al. manage to obtain higher precision and recall than three baseline methods (random rule, majority rule and naïve Bayes), with the f-measure standing at 47%. This experiment shows that by introducing linguistic knowledge to the classification problem, they were able to improve results considerably. They suggest that by further exploiting orthographic and lexical features (such as suffixes), it would be possible to improve precision even further.

3.2 Genetic Programming

Smith and Witten [SW95] propose a GP that adapts a population of hypothesis grammars towards a more effective model of language structure. They discuss grammatical inference using statistical methods and the problems encountered in their work. They point out that probabilistic n-gram models allow frequent, well-formed expressions to statistically overwhelm infrequent ungrammatical expressions. There is also the problem of assigning probability to unseen data: the 'zero-frequency problem' entails assigning a small probability to all unseen data, resulting in ungrammatical n-grams becoming as probable as unseen grammatical ones.

In their work, Smith and Witten aim at finding a context-free grammar which is able to parse and generate strings in English. They consider the choice of grammar representation between logical s-expressions and Chomsky Normal Form, basing their discussion on similar work by Koza [Koz92] and Wyard [Wya91]. Koza successfully trained a GP using s-expressions to learn a context-free grammar to detect exons of length 5 within DNA. Wyard, on the other hand, failed to learn the language of equal numbers of as and bs using CNF grammars. Smith and Witten decide to base their work on Koza's representation, and the population is represented as Lisp and-or s-expressions. Initial experiments showed that certain constraints were required in order for the GP to evolve, including a maximum nesting depth and a grammar generator to steer the GP towards more suitable grammars. With these constraints in place, the GP evolves simple grammars even within two generations, forming simple sentences such as 'the dog saw a cat'. However, the GP is left to run over more generations to achieve a broader exploration of the search space and result in a more efficient grammar. In order to evaluate the performance of the grammar being learnt, every 25 generations they introduce new sentences to see if a similar word grouping is achieved. For example, the sentence 'the dog bit a cat' evolved the same grammar as 'the dog saw a cat' within the next cycle, as the authors expected.

3.3 Summary & Overview

We have reviewed work using GAs and GPs in the area of grammar induction, which is closely related to our problem of learning grammars for definition extraction. Work by Lankhorst [Lan94] highlights the importance of the fitness function, which affects the success (convergence) of the algorithm. Lankhorst uses the positive and negative classifications made by a grammar to calculate its performance, eventually arriving at a function that also considers partially correct classifications and the context of a sentence. The different functions were derived through the various experiments carried out, and for different types of grammars; fine-tuning a fitness function will be necessary for any experiment with GAs. Losee [Los96] also bases the fitness function on what is parsed correctly, creating an average parse metric in order to evaluate the fitness of the grammar being learnt. In cases where the problem being learnt is a grammar, the fitness function must take into account what is parsed correctly or classified wrongly. This type of measure is very close to the precision and recall metrics used in information retrieval, and in the work reviewed in section 2. These types of metrics could be reflected in the choice of fitness function when implementing our work.

The issue of how to represent the problem being learnt, i.e. the encoding of the individual, seems at times to be overlooked, or 'easier' representations are chosen. In the work reviewed above, we often note that rather than opting for a tree representation and using a GP for learning a grammar, a flattened bit-string representation is chosen, restricting where the GA can dissect the individual for the crossover and mutation operations.

4 Proposed Approach

We propose to use GAs and GPs for definition extraction from non-technical English texts. The corpus will consist of a collection of LOs gathered as part of the LT4eL project, in a standard XML format derived from the XCES DTD [IS02]. These were converted from several LOs created by different tutors, originally in different formats including PDF, HTML and other proprietary formats. The corpus is annotated with linguistic information, using the Stanford POS tagger [TM00] and PCKimmo's morphological analyser [Ant90].

The definitions in the LT4eL corpus were annotated manually (amounting to 450 definitions) and split into six different categories (described in more detail in [Bor07, BRP07]). To discriminate between definitions and non-definitions automatically, it is important to identify features which are present in one class but not the other. Such features may range from a simple rule stating that a sentence contains the verb to be, to more complex rules, such as POS sequences.

Manually crafted features, based on human observation, obtained 17% precision and 58% recall in the contains the verb to be category, and 34% precision and 32% recall in the other defining verbs category6, when applied to a subset of the corpus. By learning to identify definitions in the separate categories, we reduce the size and complexity of the search space. The proposed machine learning approach aims at extending and improving the results achieved by the manually crafted grammars.

4.1 Experiment One: Genetic Algorithm

From the experience gained through the manual crafting of grammar rules for definition extraction, we noticed that certain rules, or sub-parts of rules, contain more specific or important information than others. This led to the idea of using a GA as a possible technique to learn the importance of the features that can recognise definitions. This can be done by assigning weights to the features and allowing the algorithm to adjust the weights according to performance. A feature is a function which, given a sentence (including its linguistic information), returns a numeric score. An example of a feature would be a POS sequence that may capture a definition, returning a numeric value indicating how closely the given sentence matches that particular sequence (e.g. DT→NN→VBZ→DT→NN→IN→NNS). Another example of a feature is a test of whether the sentence contains the verb 'to be', with the only possible scores being 1 or 0, indicating the presence or absence of the verb. A feature is said to be an effective classifier of definitions if it gives higher scores to sentences which define a term than to other sentences.
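To illustrate, the two features mentioned might be sketched as follows (our own hypothetical implementations over a sentence given as a list of (token, POS-tag) pairs; the word list and the partial-match scoring scheme are assumptions, not part of the proposal):

```python
def has_to_be(sentence):
    """Binary feature: 1 if the sentence contains a form of
    the verb 'to be', 0 otherwise."""
    forms = {"is", "are", "was", "were", "be", "being", "been", "am"}
    return 1 if any(tok.lower() in forms for tok, _tag in sentence) else 0

def pos_sequence_score(sentence, pattern):
    """Graded feature: the longest contiguous prefix of the POS
    pattern matched anywhere in the sentence, as a fraction of
    the full pattern length."""
    tags = [tag for _tok, tag in sentence]
    best = 0
    for start in range(len(tags)):
        length = 0
        while (length < len(pattern) and start + length < len(tags)
               and tags[start + length] == pattern[length]):
            length += 1
        best = max(best, length)
    return best / len(pattern)
```

On the tagged sentence "A/DT computer/NN is/VBZ a/DT machine/NN", the binary feature returns 1 and the pattern DT NN VBZ DT NN scores a full match, while an unrelated pattern scores only a fraction.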

Given a set of n basic features, f1 to fn, and n numeric constants, α1 to αn, one can produce a new feature combining these basic features in a linear fashion:

F(s) = ∑_{i=1}^{n} αi × fi(s)

The problem is now: given a fixed set of features, how can we calculate a good set of weights so as to maximise the effectiveness of the combined features as a definition classifier? We propose to use a GA to learn these weights. The values learnt would then correspond to the relative effectiveness of the individual features as classifiers of definitions. Before starting the experiment, a predefined set of features will be adopted, which will remain static throughout the experiment.

6 This category consists of defining verbs or phrases, such as 'is known as', 'is also called' and 'is defined as'.

A gene will be a list of numbers whose length equals the number of predefined features. Thus, the ith gene (1 ≤ i ≤ population size) will have the following structure:

gi = 〈αi,1, αi,2, . . . , αi,n〉

Note that n corresponds to the number of predefined features. The interpretation of the gene is thus a list of weights which will be applied to the outputs of the predefined features. For instance, αi,1 is the weight that gene gi assigns to the first feature. Such a gene will therefore assign a score to a given sentence s as follows:

scorei(s) = ∑_{j=1}^{n} fj(s) × αi,j

The initial population will consist of genes containing random weights for each feature. The interpretation of a gene is a function that, when applied to a sentence, gives the sum of the feature-function scores multiplied by their respective weights.
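The interpretation of a gene can be written directly from the formula above (our own illustration; the feature outputs are assumed to be precomputed for the sentence):

```python
def gene_score(gene, feature_values):
    """Score a sentence under one gene: the weighted sum of the
    precomputed feature outputs, score_i(s) = sum_j f_j(s) * a_ij."""
    return sum(f * alpha for f, alpha in zip(feature_values, gene))
```

A sentence whose weighted score falls above a learnt threshold would be classified as a definition.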

The fitness function will take an individual and produce a score according to its performance. The score will be calculated by applying the gene to both positive and negative examples, and will be judged according to how well the gene is able to separate the two sets of data from each other.

Crossover and mutation will be carried out in the traditional GA manner. Crossover will take two individuals, split them at a random position and create two new children by interchanging the parts of the parents. Mutation will take a random position in the gene and change its value. If the children perform better than the parents, they will replace them.
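One such step over weight-vector genes might be sketched as follows (our own illustration; the Gaussian perturbation used for mutation is an assumption, since the proposal does not fix how a weight is changed):

```python
import random

def ga_step(parent_a, parent_b, fitness, ):
    """One GA step as described: single-point crossover, mutation
    of one weight per child, and each child replaces its parent
    only if it scores better under the fitness function."""
    point = random.randrange(1, len(parent_a))
    c1 = parent_a[:point] + parent_b[point:]
    c2 = parent_b[:point] + parent_a[point:]

    def mutate(gene):
        gene = list(gene)
        i = random.randrange(len(gene))
        gene[i] += random.gauss(0, 1)   # perturb one weight (assumed scheme)
        return gene

    c1, c2 = mutate(c1), mutate(c2)
    out = []
    for child, parent in ((c1, parent_a), (c2, parent_b)):
        out.append(child if fitness(child) > fitness(parent) else parent)
    return out
```

The replace-only-if-better rule guarantees that the fitness of each slot in the population never decreases across a step.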

Once the population converges, the weights of the best individual would give the clearest separating threshold between definitions and non-definitions. This will also allow us to identify which of the features in the predefined feature set are most important.

4.2 Experiment Two: Genetic Programming

Identifying possible grammar rules for the task of definition extraction not only requires linguistic expertise, but also a degree of trial and error to measure the effectiveness of new grammar rules. This process is ideal to automate using a GP, a technique that uses GA principles to evolve simple programs. The main difference between GAs and GPs lies in the representation of the population and in how the operations of crossover and mutation are carried out. The members of the population are parse trees, usually of computer programs, whose fitness is determined by execution. Crossover and mutation are carried out on subtrees, ensuring that the resulting tree is still of the correct structure.

Whereas the scope of the previous experiment was to learn which are the best-performing features from a predefined set for the task of definition extraction, this experiment aims at identifying new features. The choice of what type of structure we are trying to learn determines the complexity of the search space. In our application, two possible options are regular languages (in the form of regular expressions) or context-free languages (in the form of context-free grammars), the latter having a larger search space than the former. From our observations and current work with lxtransduce [Tob05], regular expressions (extended with a number of constructs) should be sufficient to produce expressions that would, in most cases, correctly identify definitions.

Both the basic and the combined features used in the GA can serve to inject an initial population into the GP. The selection can be made based on the weights learned by the GA, translating those features into extended regular expressions. The extensions being considered are conjunction (possibly only at the top level, due to complexity issues), negation, and operators such as contains subexpression. Note that some of these can already be expressed as plain regular expressions; however, introducing them as single new operators helps the genetic program learn faster.

The population will evolve with the support of a fitness function used to select individuals for mating. The fitness function can apply the extended regular expressions to the given training set and then use measurements such as precision and recall over the captured definitions. Such measurements can indicate the performance of the individuals and will allow us to fine-tune the GP according to the focus of the experiment (where one could emphasise one measurement at a time, or take an average of both). This flexibility will also allow us to obtain different results in the various runs of the experiment where, for instance, in one run we could try to learn over-approximations whereas in another we could learn under-approximations.
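Such a fitness function might be sketched as follows (our own illustration; the F-measure weighting via beta is an assumed way to shift emphasis between precision and recall, and the matcher is a stand-in for an extended regular expression):

```python
def pr_fitness(matcher, sentences, gold, beta=1.0):
    """Fitness of a candidate extraction rule: the F-measure of
    the sentences it flags against the gold definition labels.
    beta < 1 emphasises precision (under-approximation), beta > 1
    emphasises recall (over-approximation)."""
    predicted = [bool(matcher(s)) for s in sentences]
    tp = sum(1 for p, g in zip(predicted, gold) if p and g)
    fp = sum(1 for p, g in zip(predicted, gold) if p and not g)
    fn = sum(1 for p, g in zip(predicted, gold) if not p and g)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

Changing beta between runs then yields the over- and under-approximating grammars mentioned above without altering the rest of the GP.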

Crossover takes two trees and creates two new children by exchanging nodes with a similar structure. If an offspring is able to correctly parse at least one definition, it survives into the next generation; otherwise it is discarded. Parents would normally also be retained in the population, since we would not want to lose good individuals (it is not obvious that their offspring would have the same capability of correctly identifying definitions).

Mutation takes an individual and selects one node at random. If that node is a leaf, it is randomly replaced by a terminal symbol. If it is an internal node, it is randomly replaced by one of the allowed operators. Once again, the new tree is allowed to survive to the next generation only if it is able to capture at least one definition.

Once the GP converges, we expect to have new expressions that capture some aspects of a definition. The application of this program will allow us to extend our current set of grammar rules by deriving new rules from the above operations. Although we do not expect the GP to learn rules on its own, it will help towards the discovery of new rules which might otherwise have been overlooked, thus helping towards a more complete grammar for definition extraction.

The GP will also allow the flexibility of running this experiment separately for each of the categories of definitions identified in section 4.2. This means that the new features being learned will be restricted to one category at a time.

4.3 Combining the Two Experiments

The goal of this work is to develop techniques to extract definitions. The two experiments are independent of each other: the GA takes a set of features and assigns a weight to each feature, whereas the GP learns new features through the evolution of a population of extended regular expressions. We can combine the two experiments by migrating the new features learned by the GP into the feature set used by the GA. Figure 1 shows how these two techniques can be combined.

[Figure 1: Architecture Overview. Original diagram not reproduced; its components: text and linguistic information are processed by LXTransduce into annotated text, which is fed to the genetic algorithm and the genetic program; features/grammars connect the GP to the GA, and a definition filter produces the final definitions.]

In the final definition extractor, one can start by checking whether a given sentence can be confidently classified as a definition using the features learnt by the GP, possibly giving preference to over- or under-approximations. One would then run the weighted sum and threshold as learned by the GA, based on the features we manually identified together with any others the GP may have learned. Clearly, the training of the GA would have to be done on a subset of the training set, removing the confidently classified non-definitions. We believe that this approach will improve the quality of the definition identifier.
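The two-stage decision procedure just described can be sketched as follows. The rules, features, weights and threshold below are invented placeholders standing in for the values the GP and GA would actually learn.

```python
import re

GP_RULES = [re.compile(r"\bis defined as\b")]      # high-confidence GP rules
FEATURES = [r"\bis an?\b", r"\brefers to\b", r"\bmeans\b"]
WEIGHTS = [0.6, 0.5, 0.4]                          # as would be learnt by the GA
THRESHOLD = 0.5                                    # as would be learnt by the GA

def is_definition(sentence):
    """Stage 1: accept outright if any GP-learnt rule fires.
    Stage 2: otherwise compare the GA-learnt weighted feature sum
    against the learnt threshold."""
    if any(rule.search(sentence) for rule in GP_RULES):
        return True
    score = sum(w for w, f in zip(WEIGHTS, FEATURES)
                if re.search(f, sentence))
    return score >= THRESHOLD

print(is_definition("Recursion is defined as self-reference."))
print(is_definition("A lexeme is a unit of lexical meaning."))
```

Sentences accepted at stage 1 would also be removed from the GA's training subset, as noted above.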

4.4 Evaluation

The evaluation of this proposal will focus on the improvement achieved in definition extraction in terms of precision and recall, and will compare these figures to other work described in the literature review. We also plan to compare the weights learnt by the GA against common statistical techniques such as n-grams, so as to judge whether using GAs to learn feature weights is an effective technique. The evaluation of the GP will focus on what new rules have been discovered, comparing these to the existing hand-crafted rules for definition extraction, using precision and recall as metrics. For the overall system, we intend to validate the improvement in performance obtained through the GA and GP, and compare this to other machine learning techniques, such as SVMs, applied to definition extraction.

5 Conclusions

In this paper we have presented a literature review on definition extraction and grammatical inference. We have seen that although some machine learning techniques have been applied to definition extraction, GAs and GPs have not yet been considered. We propose to fuse the two techniques together: discovering possible new grammar rules which could extract definitions, and learning weights for such rules to enable us to rank definitions. We have outlined how these techniques will be combined within a final system. Our final aim is to evaluate the performance of GAs and GPs against the other techniques outlined in the literature review. We believe that this experiment will contribute to both the natural language processing and machine learning fields. The application of GAs and GPs to definition extraction is novel, and there is no literature to indicate that such experiments have been previously carried out. This work will also compare its results to other machine learning techniques, thereby contributing new results achieved through this experiment.
