
The Design of a System for the Automatic Extraction of a Lexical Database Analogous to WordNet from Raw Text

Reinhard Rapp
LIF-CNRS, Aix-Marseille Université

    [email protected]

Michael Zock
LIF-CNRS, Aix-Marseille Université

    [email protected]

ABSTRACT

Constructing a lexical database such as WordNet manually takes several years of work and is very costly. On the other hand, methods for the automatic identification of semantically related words and for computing relational similarities based on large corpora of raw text have reached a considerable degree of maturity, with the results coming close to native speakers' performance. We describe ongoing work which aims at further refining and extending these approaches, thereby making it possible to fully automatically generate a resource similar to WordNet. The developed system will be largely language independent and is to be applied to four European languages, namely English, French, German, and Spanish. This is an outline of the approach: Starting from a raw corpus, we first compute related words by applying an algorithm based on distributional similarity. Next, to identify synsets, an algorithm for unsupervised word sense induction is applied, and each word in the vocabulary is assigned to one or (if ambiguous) several of the synsets. Finally, to determine the relations between words, a method for computing relational similarities is applied.

Categories and Subject Descriptors

[Computing methodologies]: Lexical semantics, Language resources

1. INTRODUCTION

WordNet [8] is a large lexical database of English where nouns, verbs, adjectives and adverbs are grouped into sets of synonyms (synsets), each expressing a distinct concept (see Figure 1 for a sample entry relating to the word motion). Synsets are interlinked by means of conceptual-semantic and lexical relations. WordNet has become an invaluable resource for processing semantic aspects of language, and it has served

This research was supported by a Marie Curie Intra European Fellowship within the 7th European Community Framework Programme.

as a model for similar lexical databases created for other languages. However, creating a WordNet for a new language in the established way takes several years of work. It also involves countless subjective decisions which may be controversial. On the other hand, methods for the automatic identification of semantically related words based on large text corpora, which can be used to identify the synsets, have reached a considerable degree of maturity, and the results

could be shown to come close to human judgement [32]. In a recent paper, Turney & Pantel [45] highlighted the success of Vector Space Models (VSM):

The success of the VSM for information retrieval has inspired researchers to extend the VSM to other semantic tasks in natural language processing, with impressive results. For instance, Rapp (2003) [29] used a vector-based representation of word meaning to achieve a score of 92.5% on multiple-choice synonym questions from the Test of English as a Foreign Language (TOEFL), whereas the average human score was 64.5%. Turney (2006) [42] used a vector-based representation of semantic relations to attain a score of 56% on multiple-choice analogy questions from the SAT college entrance test, compared to an average human score of 57%.

Together with similar work conducted by other researchers, which confirmed these findings, the two papers mentioned in this citation are the basis of the methodological considerations to be described here. Our aim is to further refine and extend these approaches, and to apply them to a new task, namely the fully automatic generation of a resource similar to WordNet. The developed system will be largely language independent and is to be applied to four major European languages, namely English, French, German, and Spanish. This paper is mainly conceptual in nature, aiming at giving an overview of the various parts of a larger project. Although various foundational steps have been completed (see Section 3), this is still work in progress. For this reason, and due to space constraints, further results will be presented in subsequent publications.

Figure 1: WordNet entry for the word motion.

This work, although anchored in computational linguistics, is also related and relevant to the field of ontology learning, which has attracted considerable attention in computer science since the semantic web was announced to be among the foci of the World Wide Web Consortium. In this field, a number of resources sharing some properties with WordNet have been designed and developed in recent years, among them LexInfo [3], LingInfo [4], LexOnto [7], the Linguistic Information Repository (LIR [20]), and the Linguistic Watermark Suite (LWS [25]). However, these resources are too recent to be able to claim a status similar to WordNet. WordNet has been developed and optimized over decades; it has been scrutinized and used by a large community of scientists and, despite some criticism, the validity of its underlying principles is widely acknowledged. This is the main reason why the work described here is based on WordNet rather than on one of the above-mentioned ontologies. Another reason is that WordNet appears to be a more direct representation of native speakers' intuitions about language than most ontologies. Therefore, when working with it, it should be somewhat more straightforward to draw conclusions concerning human cognition. As far as applicable, we nevertheless take experiences from ontology building into account (for an overview see [2]).

The focus of this work is to generate lexical databases as similar as possible to the existing manually created WordNets, in a way that is easily adaptable to other languages for which no WordNets exist yet. The evaluation is conducted in a direct way by taking the existing WordNet and experimental data on word synonymy as a gold standard, and by comparing the generated data to this gold standard. An alternative would be to conduct an indirect evaluation by comparing the performance of the automatically generated WordNet to the performance of existing WordNets in some applicative areas, such as word sense disambiguation and information retrieval. However, applications are too numerous to consider one (or a few) of them as authoritative. For example, Rosenzweig et al. [34] list 868 papers, of which many describe applications of WordNet. For such reasons we suggest here the automatic construction of a general-purpose WordNet which is not optimized with respect to a specific application. However, looking at specific applications would be a logical next step following the work described here.

Let us mention that although there has been quite some work on specific aspects of what is described here, to our knowledge no previous attempt to automatically create a WordNet-like system using state-of-the-art modules for lexical acquisition has been fully completed, with the resulting lexical database being finalized and published.¹ However, variations of a completely different method for automatically creating WordNets have been described and put into practice many times. They are based on the idea that a raw version of a WordNet for a new language can be created by simply translating an existing WordNet of a closely related language. Recent examples are [40] for Thai and WOLF for French.² However, this methodology is rather unrelated to what we investigate here. It does not have as much cognitive plausibility, cannot produce WordNets specific to the genre of a particular corpus, and can only be applied if a WordNet of a related language is available.

¹ Ongoing work for Polish is described in [27].

2. RESEARCH METHODOLOGY AND PREVIOUS WORK

Our approach comprises three major steps which will be described in more detail in the following subsections:

1. Computing word similarities
   Starting from a large part-of-speech-tagged corpus of the respective language, various methods for computing related words, e.g. using syntax parsing or latent semantic analysis, are considered. The results are evaluated by comparing them to a recently published data set comprising the 200,000 human similarity judgments from the Princeton Evocation Project,³ in addition to the commonly used but less adequate 80-item TOEFL dataset which has been the standard so far.

2. Word sense induction
   To identify each word's senses, an algorithm for unsupervised word sense induction is applied which in essence clusters the entire semantic space into synsets. Subsequently, each word in the vocabulary is assigned to one or (if ambiguous) several of the synsets. The WordNet glosses (short descriptions of the synsets) are replaced by concordances of the respective senses.

3. Conceptual relations between words
   To reveal the relations between words (e.g. hyponymy, holonymy, and meronymy) we mainly use the methodology for computing relational similarities introduced by Turney [42] and refined by Pennacchiotti & Pantel [26]. But we will also take into account the results of related work on the construction of ontologies [2].

² http://alpage.inria.fr/sagot/wolf-en.html
³ http://wordnet.cs.princeton.edu/downloads.html


    Figure 2: Words co-occurring with red and blue.

2.1 Computing word similarities

In his seminal paper "Distributional Structure", Zellig S. Harris [10] hypothesized that words occurring in similar contexts tend to have similar meanings. This finding is often referred to as the distributional hypothesis. It was put into practice by Ruge [35], who showed that the semantic relatedness of two words can be computed by looking at the agreement of their lexical neighborhoods. For example, as illustrated in Figure 2, a certain degree of semantic relatedness between the words red and blue can be derived from the fact that they both frequently co-occur with words like color, dress, flower, etc., although there are other context words occurring only with one of them. If, on the basis of a large text corpus, a matrix of word co-occurrences is compiled, then the semantic similarities between words can be determined by comparing the vectors in the matrix. This can be done using any of the standard vector similarity measures, such as the cosine coefficient.
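As a concrete illustration, the scheme just described — compile co-occurrence counts within a small window and compare the resulting vectors with the cosine coefficient — can be sketched in a few lines. The toy corpus, the window size, and all word choices below are invented for illustration:

```python
import math
from collections import Counter

def cooccurrence_vectors(corpus, window=2):
    """For each word, count how often every other word appears
    within +/- `window` positions of it."""
    vectors = {}
    for sent in corpus:
        for i, word in enumerate(sent):
            context = sent[max(0, i - window):i] + sent[i + 1:i + 1 + window]
            vectors.setdefault(word, Counter()).update(context)
    return vectors

def cosine(u, v):
    """Cosine coefficient between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

toy_corpus = [
    ["red", "dress", "with", "a", "flower"],
    ["blue", "dress", "with", "a", "flower"],
    ["red", "color", "of", "the", "car"],
    ["blue", "color", "of", "the", "sky"],
    ["red", "wine"],
]
vecs = cooccurrence_vectors(toy_corpus)
# red and blue share the context words "dress", "with", "color", "of"
print(round(cosine(vecs["red"], vecs["blue"]), 2))  # -> 0.89
```

On a realistic corpus the same comparison is simply run over much larger count vectors, optionally after the weighting and smoothing steps discussed below.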

Since Ruge's pioneering work, many researchers, e.g. Schütze [38], Rapp [28], and Turney [42], have used this type of distributional analysis as a basis to determine semantically related words. An important characteristic of some algorithms (e.g. as used by Ruge [35], Grefenstette [9], and Lin [17]) is that they parse the corpus and only consider co-occurrences of word pairs showing a specific relationship, e.g. a head-modifier, verb-object, or subject-object relation. Others do not parse, but perform a singular value decomposition (SVD) on the co-occurrence matrix, which also improves results ([14], [15]). As an alternative to the SVD, Sahlgren [37] uses random indexing, which is computationally less demanding.
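The SVD-based variant can be sketched as follows, assuming NumPy is available; the miniature co-occurrence matrix and its counts are invented. Keeping only the largest singular values yields a low-rank, smoothed version of the matrix whose row vectors are then compared:

```python
import numpy as np

# Toy symmetric word-by-word co-occurrence matrix (invented counts).
words = ["red", "blue", "color", "dress", "tree"]
M = np.array([
    [0, 0, 3, 2, 0],   # red
    [0, 0, 3, 2, 0],   # blue
    [3, 3, 0, 1, 0],   # color
    [2, 2, 1, 0, 0],   # dress
    [0, 0, 0, 0, 1],   # tree
], dtype=float)

# Truncated SVD: keep only the k largest singular values, as in LSA [14].
U, s, Vt = np.linalg.svd(M, full_matrices=False)
k = 2
M_k = (U[:, :k] * s[:k]) @ Vt[:k, :]   # rank-k smoothed reconstruction

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The rows for "red" and "blue" are identical in M, so their smoothed
# vectors remain maximally similar after the reduction.
print(round(cos(M_k[0], M_k[1]), 2))  # -> 1.0
```

The choice of k (300 dimensions in the experiments reported in Section 3) controls the strength of the generalization effect.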

Our aim here is to systematically compare some of the best algorithms, among them the ones described in [23] and [32], and to come up with an improved version which combines the advantages of all. In doing so, we hope to be able to provide at least partial answers to questions such as the following: Does the analysis of syntax help in determining semantic similarities between words? Is it true that a dimensionality reduction of the semantic space using singular value decomposition uncovers latent semantic structures between words? Does the optimal number of dimensions as found empirically reflect cognitive processes, or can a similar behavior be achieved by simply applying various strengths of smoothing?

To be able to come up with good answers to these questions, an accurate evaluation method is necessary which allows one to analyze fine-grained distinctions, e.g. with regard to word frequency, ambiguity, saliency, or part of speech. For this purpose, many possibilities can be thought of and have been applied in the past. For example, Grefenstette [9] used available dictionaries as a gold standard, Lin [17] compared his results to WordNet, and Landauer & Dumais [14] used experimental data taken from the synonym portion of the Test of English as a Foreign Language (TOEFL).

Table 1: Performances for the TOEFL synonym test

  Description                                                Score    Reference
  Random guessing, four alternatives of which one correct    25.00%   Rapp [30]
  Average non-English US college applicant taking TOEFL      64.50%   Landauer & Dumais [14]
  Non-native speakers of English living in Australia         86.75%   Rapp [30]
  Native speakers of English living in Australia             97.75%   Rapp [30]

As it more directly reflects human judgements, the TOEFL data has often been preferred over dictionaries or lexical databases for the purpose of evaluation in the literature. As pointed out by Turney [42], another advantage of the TOEFL data is that it has gained considerable acceptance among researchers.

The TOEFL is an obligatory test for non-native speakers of English who intend to study at universities with English as the teaching language. The data used by Landauer & Dumais had been acquired from the Educational Testing Service and comprises 80 test items. Each item consists of a problem word embedded in a sentence and four alternative words, from which the test taker is asked to choose the one with the meaning most similar to the problem word. For example, given the test sentence "Both boats and trains are used for transporting the materials" and the four alternative words planes, ships, canoes, and railroads, the subject would be expected to choose the word ships, which is supposed to be the one most similar to boats.

Table 1 [32] shows a number of relevant baselines for the TOEFL test, as obtained either by random guessing or by three groups of test persons with different levels of language proficiency. As can be seen, with 97.75% correct answers, the performance of the native speakers is more than 30 percentage points better than that of language learners intending to apply for admission at a US college.

Systems capable of computing semantically related words can also answer the TOEFL synonym questions. Usually the sentence context is simply discarded and the system is only offered the test word. Then the similarity scores between the test word and the four alternative words are determined, and the word with the best score is considered to be the system's answer.
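This answering procedure amounts to an argmax over the four alternatives. A minimal sketch, with the similarity scores hard-coded as hypothetical stand-ins for the corpus-derived values a real system would compute:

```python
def answer_toefl_item(similarity, problem_word, alternatives):
    """Pick the alternative with the highest similarity score to the
    problem word; the sentence context is discarded, as described above."""
    return max(alternatives, key=lambda alt: similarity(problem_word, alt))

# Hypothetical scores standing in for corpus-derived similarities.
toy_scores = {
    ("boats", "ships"): 0.82, ("boats", "canoes"): 0.63,
    ("boats", "planes"): 0.41, ("boats", "railroads"): 0.22,
}
sim = lambda a, b: toy_scores.get((a, b), 0.0)
print(answer_toefl_item(sim, "boats", ["planes", "ships", "canoes", "railroads"]))
# -> ships
```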

Let us now look at the state of the art in automatically solving the standard TOEFL synonym test comprising 80 items. In the literature, there are essentially three basic approaches. One is lexicon-based, another is corpus-based, and the third, which is usually referred to as hybrid, is a mixture of the first two.

With the lexicon-based approaches (used e.g. by Leacock & Chodorow [16], Hirst & St-Onge [11], and Jarmasz & Szpakowicz [12]), a given word is looked up in a large lexicon (or lexical database) of synonyms, and it is determined whether there is a match between any of the retrieved synonyms and the four alternative words presented in the TOEFL question. If there is a match, the respective word is considered to be the solution to the question. Otherwise, the procedure can be extended to indirect matches, e.g. involving synonyms of synonyms.
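The lookup with its fall-back to indirect matches can be sketched as a breadth-first search over synonym links; the miniature lexicon below is invented for illustration:

```python
def lexicon_answer(lexicon, problem_word, alternatives, depth=2):
    """Breadth-first search over synonym links: try direct matches first,
    then synonyms of synonyms, up to `depth` hops. Returns None when no
    alternative is reachable (a corpus-based fall-back could then apply)."""
    frontier = {problem_word}
    seen = set(frontier)
    for _ in range(depth):
        frontier = {s for w in frontier for s in lexicon.get(w, [])} - seen
        hits = [alt for alt in alternatives if alt in frontier]
        if hits:
            return hits[0]
        seen |= frontier
    return None

# Hypothetical miniature synonym lexicon.
lex = {"boats": ["vessels"], "vessels": ["ships", "craft"]}
print(lexicon_answer(lex, "boats", ["planes", "ships", "canoes", "railroads"]))
# -> ships (matched as a synonym of a synonym)
```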

This procedure works rather well if the lexicon has good coverage of the respective vocabulary. In the literature, typically WordNet [8] has been used, and performances of up to 78.75% on the TOEFL task have been reported [12]. On the other hand, both the TOEFL questions and the lexicons are handcrafted and therefore reflect human intuitions. So it is not surprising that a high correspondence between these two closely related types of human intuitions can be observed. In our setting, as the purpose of our work is to generate a lexical database similar to WordNet, it would be contradictory to presuppose WordNet for the similarity computations.

Therefore we concentrate here on the second method. This is a corpus-based machine learning approach which appears to be more interesting from a cognitive perspective, as it potentially better captures the relevant aspects of human vocabulary acquisition. Table 2 (derived from [32] and the ACL Wiki) gives an overview of the current state of the art with regard to performance figures on the TOEFL synonym test. With 90.9% and 92.5% correct answers, the best performances were achieved by Pantel & Lin [23] and Rapp [32]. This is why we will concentrate here on combining these two rather different approaches, with the first being syntax-based and the second using singular value decomposition for dimensionality reduction of the semantic space. The intention is to introduce some amount of syntax into the second approach by operating it on a part-of-speech-tagged rather than a raw text corpus. This should not only lead to better results, but is also necessary to obtain WordNet-like entries which distinguish between parts of speech.

The third approach is hybrid ([33], [13], [18], [12], [44]) and is basically a fall-back strategy for the first approach: that is, by default the lexicon-based approach is used, as its results tend to be more reliable. However, if the relevant words cannot be found in the lexicon, then it is of course better to use a corpus-based approach rather than to guess randomly. With a performance of up to 97.5% on the TOEFL synonym test [44], the results of the hybrid approach are the best. However, it is nevertheless inappropriate for our research because, like the lexicon-based approach, it presupposes readily available lexical knowledge.

Although the scores from the 80-item TOEFL synonym test, which has been the standard so far, give some idea of the overall performance of an algorithm, it can be argued that this test set is rather small and therefore prone to statistical variation. Also, this test was not designed to measure the strengths and weaknesses of various algorithms concerning particular properties of the input words, e.g. their frequency, saliency, part of speech, or ambiguity. We will therefore base our future evaluation on a much larger data set, namely the 200,000 sense-specific human similarity judgments that were collected in the Princeton Evocation Project. Such a large-scale data set will allow a much more detailed analysis of the behavior of the algorithms, and we would like to see this as the future gold standard for such comparisons.

Table 2: Comparison of corpus-based approaches

  Characterization of algorithm            Score    Ref.
  Latent semantic analysis                 64.38%   [14]
  Raw co-occurrences and city-block        69.00%   [28]
  Dependency space                         73.00%   [22]
  Pointwise mutual information (MI)        73.75%   [41]
  PairClass                                76.25%   [43]
  Pointwise mutual information             81.25%   [39]
  Context window overlapping               82.55%   [36]
  Positive pointwise MI with cosine        85.00%   [5]
  Generalized latent semantic analysis     86.25%   [19]
  Similarities between parsed relations    90.90%   [30][23]
  Modified latent semantic analysis        92.50%   [32]

2.2 Word sense induction

The previous step of introducing a similarity measure that generates judgements of word relatedness akin to human intuition lays the foundation for the next steps, which are to find out about possible senses (called synsets in the WordNet terminology) and to assign at least one of them to each word. This is what we call word sense induction.

Pantel & Lin [23] have conducted such work with considerable success. Using a specially designed clustering algorithm (clustering by committee), they divided the semantic space into thousands of basic concepts which can be seen in analogy to WordNet's synsets. As the clustering is based on similarities between global co-occurrence vectors, i.e. vectors that are based on the co-occurrence counts from an entire corpus, we call this global clustering. Since (by looking at differential vectors) their algorithm allows a word to belong to more than one cluster, each cluster a word is assigned to can be considered as one of its senses. However, there is a potential problem with this approach, as it allows only as many senses as there are clusters, thereby limiting the granularity of the meaning space. This problem is avoided by Neill [21], who uses local instead of global clustering. In this case, to find the senses of a given word, only its close associations are clustered, i.e. for each word new clusters will be found.

Concerning the type of co-occurrence vectors used, most approaches to word sense induction (including [23] and [1]) that have been published so far rely on global co-occurrence vectors based on string identity. Since most words are semantically ambiguous, this means that these vectors reflect the sum of the contextual behavior of a word's underlying senses, i.e. they are mixtures of all senses occurring in the corpus.

However, since reconstructing the sense vectors from such mixtures is difficult, the question arises whether we really need to base our work on the mixtures, or whether there is not a direct way to observe the contextual behavior of the senses, thereby avoiding the mixtures right from the beginning. Here we suggest comparing Pantel & Lin's approach [23] to a method outlined in [31] which looks at local rather than global co-occurrence vectors. As can be seen from human performance, in almost all cases the local context of an ambiguous word is sufficient to disambiguate its sense. This means that if we consider words within their local context, they are hardly ever ambiguous.

The basic idea is now that we do not cluster the global co-occurrence vectors of the words (based on an entire corpus), but local ones which are derived from the various contexts of a single word. That is, the computations are based on the concordance of a word. Also, we do not consider a term/term but a term/context matrix. This means that for each word to be analyzed we get an entire matrix.

Let us illustrate this using the ambiguous word palm, which can refer to a tree or to a part of the hand. If we assume that our corpus contains six occurrences of palm, i.e. that there are six local contexts, then we can derive six local co-occurrence vectors for palm. Considering only strong associations to palm, these vectors could, for example, look as shown in Table 3.

Table 3: Term/context matrix for the word palm

             c1   c2   c3   c4   c5   c6
  arm        •         •              •
  beach           •         •    •
  coconut         •         •    •
  finger     •         •              •
  hand       •         •              •
  shoulder   •         •              •
  tree            •         •    •

The dots in the matrix indicate whether the respective word occurs in a particular context or not. We use binary vectors since we assume short contexts, where words usually occur only once. The matrix reveals that the contexts c1, c3, and c6 seem to relate to the hand sense of palm, whereas the contexts c2, c4, and c5 relate to its tree sense. These intuitions can be replicated by using a method for computing vector similarities, such as the cosine coefficient. If we then apply an appropriate clustering algorithm to the context vectors, we should obtain the two expected clusters. Each of the two clusters corresponds to one of the senses of palm, and the words closest to the geometric centers of the clusters should be good descriptors of each sense.
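A minimal sketch of this local clustering step, encoding the binary context vectors as sets whose contents follow the hand/tree split just described (the exact membership pattern is illustrative), and using a deliberately simple greedy clustering in place of a production algorithm:

```python
import math

# Binary local context vectors for six occurrences of "palm": c1, c3, c6
# carry hand-related words, c2, c4, c5 tree-related words (illustrative).
contexts = {
    "c1": {"arm", "finger", "hand", "shoulder"},
    "c2": {"beach", "coconut", "tree"},
    "c3": {"arm", "finger", "hand", "shoulder"},
    "c4": {"beach", "coconut", "tree"},
    "c5": {"beach", "coconut", "tree"},
    "c6": {"arm", "finger", "hand", "shoulder"},
}

def cosine(a, b):
    """Cosine coefficient for binary vectors encoded as sets."""
    return len(a & b) / math.sqrt(len(a) * len(b)) if a and b else 0.0

def cluster(items, threshold=0.5):
    """Greedy clustering: add each context to the first cluster whose
    members are all similar enough, otherwise open a new cluster."""
    clusters = []
    for name in items:
        for members in clusters:
            if all(cosine(items[name], items[m]) >= threshold for m in members):
                members.append(name)
                break
        else:
            clusters.append([name])
    return clusters

print(cluster(contexts))  # -> [['c1', 'c3', 'c6'], ['c2', 'c4', 'c5']]
```

Each resulting cluster stands for one induced sense of palm; its member contexts form the concordance that replaces the WordNet gloss.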

However, as matrices of the above type can be extremely sparse, clustering is a difficult task, and common algorithms often produce sub-optimal results. Fortunately, the sparsity problem can be minimized by reducing the dimensionality of the matrix. An appropriate algebraic method which has the capability to reduce the dimensionality of a rectangular or square matrix in an optimal way is singular value decomposition. As shown by Schütze [38], by reducing the dimensionality a generalization effect can be achieved which often yields improved results. The approach that we suggest here involves reducing the number of columns (contexts) and then applying a clustering algorithm to the row vectors (words) of the resulting matrix. This should work well, as it is one of the strengths of SVD to reduce the effects of sampling errors and to close gaps in the data. Although SVD is computationally demanding, previous experience shows that it is feasible to deal with matrices of several hundred thousand dimensions [32].

In summary, we will compare two fundamental types of algorithms for word sense induction, one being based on global and the other on local clustering of words. If the results are similar, we will give preference to global clustering, as it better matches the WordNet approach. On the other hand, local clustering makes it easier to provide contexts for each sense, which will be used as replacements for the WordNet glosses. Empirical verification of these issues may give us important arguments to question some underlying principles of WordNet.

2.3 Conceptual relations between words

WordNet distinguishes a number of relations between words, e.g. hyponymy, holonymy, and meronymy. It has long been unclear how such relations could be automatically extracted from a corpus. Caraballo & Charniak [6] did some pioneering work, and so did Turney [42], who further elaborated on the concept of so-called relational similarities. The basic idea is to consider pairs of co-occurring content words, and to assume that two pairs have the same relation if the content words constituting a pair are separated by the same sequence of function words. For example, the sequence "in the" may indicate a part-of relationship (holonymy), whereas "and" is more likely to express the idea of coordination between two similar terms, hence synonymy. The current state of the art in this respect is the Espresso system [26] and follow-up work [24]. What we suggest below is roughly along these lines. But we will also take related work from computer science into account, dealing with association rule mining, social annotations, formal concept analysis, and ontology learning [2].

An unsupervised approach would involve clustering all possible pairs of content words co-occurring within a distance of about four words, according to their separating word sequences, and manually assigning some meaning to the most salient clusters. Alternatively, as the aim is to replicate the relations as defined in WordNet, a weakly supervised bootstrapping approach is suggested. For each of the WordNet relations, a few typical pairs will be manually chosen and taken as seeds. Next, their behavior with regard to typical separating word sequences is quantitatively analyzed. The typical separators would then serve as samples to find more word pairs with the same relations. That is, using this bootstrapping mechanism, the set of seeds is extended by assigning the appropriate type of relation to each word pair. Work along these lines has already been successfully conducted for Polish and is described in detail by Piasecki et al. [27].
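The bootstrapping idea can be sketched as two passes over the corpus: harvest separator patterns from the seed pairs, then match those separators elsewhere to propose new pairs. The corpus sentences, the seed pair, and its relation label below are invented; real systems such as Espresso [26] additionally score and filter the patterns:

```python
import re
from collections import defaultdict

def harvest_patterns(corpus, seed_pairs):
    """Collect the word sequences separating known seed pairs; each
    separator becomes a candidate pattern for the seed's relation."""
    patterns = defaultdict(set)
    for sentence in corpus:
        for (a, b), relation in seed_pairs.items():
            m = re.search(rf"\b{a}\b(.+?)\b{b}\b", sentence)
            if m:
                patterns[relation].add(m.group(1).strip())
    return dict(patterns)

def propose_pairs(corpus, patterns):
    """Match the harvested separators elsewhere to propose new pairs."""
    proposals = defaultdict(set)
    for sentence in corpus:
        for relation, separators in patterns.items():
            for sep in separators:
                for m in re.finditer(rf"(\w+) {re.escape(sep)} (\w+)", sentence):
                    proposals[relation].add((m.group(1), m.group(2)))
    return dict(proposals)

corpus = [
    "the keys in the drawer were lost",
    "the engine in the car started",
]
seeds = {("keys", "drawer"): "holonymy"}          # illustrative seed pair
patterns = harvest_patterns(corpus, seeds)        # {'holonymy': {'in the'}}
new_pairs = propose_pairs(corpus, patterns)["holonymy"]
print(sorted(new_pairs))  # -> [('engine', 'car'), ('keys', 'drawer')]
```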

We intend to do so not only for several languages, but also to go beyond this: to improve results, in a further step an analogous procedure will be applied to a word-sense-disambiguated corpus. This means that instead of considering the relations between pairs of words, we will consider the relations between pairs of word senses. The reasoning is that different senses of an ambiguous word may well have different types of relations with another word (or another word's sense). Words represent mixes of senses, and looking at these mixes leads to blurred results. To avoid this, we must first perform a word sense disambiguation on the entire corpus, and then apply the procedure for relation detection. From the previous step (of word sense induction) we already have the possible word senses readily available, so that using available software (including some of our own) it is relatively straightforward to conduct a word sense disambiguation. It can be expected that in the disambiguated corpus the relations between word senses are more salient than they would be for words. Nevertheless, it will be necessary to optimize the algorithm in a procedure of stepwise refinement by comparing its results to a representative subset of the relations found in WordNet.
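A minimal sketch of the disambiguation step, assigning each occurrence of an ambiguous word to the sense cluster whose descriptor words overlap most with the local context. The sense inventory and descriptor sets are hypothetical stand-ins for the output of the induction step in Section 2.2:

```python
def disambiguate(tokens, i, sense_descriptors, window=3):
    """Tag occurrence i of an ambiguous word with the sense whose
    descriptor words overlap most with the local context."""
    context = set(tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window])
    return max(sense_descriptors,
               key=lambda s: len(context & sense_descriptors[s]))

# Hypothetical sense clusters with invented descriptor words.
senses = {
    "palm_hand": {"arm", "finger", "hand", "shoulder"},
    "palm_tree": {"beach", "coconut", "tree"},
}
tokens = "a coconut fell from the palm on the beach".split()
print(disambiguate(tokens, tokens.index("palm"), senses))  # -> palm_tree
```

Running this over every occurrence yields the sense-tagged corpus on which the relation-detection procedure then operates.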

3. RESULTS

The focus of this paper is to give an overview of the AutoWordNet project. As the project is ongoing (and also due to space constraints), detailed results concerning the various aspects of the project cannot be presented here, but will be published separately. However, in order to give the reader an idea of the outcome, let us summarize here the results concerning one of the fundamental aspects of the project, namely the computation of thesauri of related words.

Although this work has been completed for several languages, we will confine our description here to the English version, as more information can be found in [32]. For the other languages, the procedure is essentially the same. As our underlying textual basis we used the British National Corpus (BNC). While it is considerably smaller than more recent corpora (e.g. the WaCky or the LDC Gigaword corpora, which were used for some of the other languages), our experience is that it leads to somewhat better results for this task, as it is well balanced, whereas the other corpora have a stronger tendency to produce idiosyncrasies. In a pre-processing step, we lemmatized this corpus and removed the function words (for details concerning this step see [32]). Based on a window size of 2 words, we then computed a co-occurrence matrix comprising all of the approximately 375,000 lemmas occurring in the BNC. The raw co-occurrence counts were converted to association strengths using the entropy-based association measure described in [32]. Inspired by Latent Semantic Analysis [14], in a further step we applied a singular value decomposition to the association matrix, thereby reducing the dimensionality of the semantic space to 300 dimensions. This dimensionality reduction has a generalization and smoothing effect which could be shown to improve the results of the subsequent similarity computations [30].
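The conversion of raw counts to association strengths can be illustrated with positive pointwise mutual information as a stand-in; the paper's actual entropy-based measure is defined in [32] and is not reproduced here, and the counts below are invented. The point of any such weighting is that selective co-occurrences outweigh merely frequent ones:

```python
import math

def ppmi(counts):
    """Turn raw co-occurrence counts into association strengths using
    positive pointwise mutual information (a common stand-in for the
    entropy-based measure of [32])."""
    total = sum(n for row in counts.values() for n in row.values())
    word_total = {w: sum(row.values()) for w, row in counts.items()}
    ctx_total = {}
    for row in counts.values():
        for c, n in row.items():
            ctx_total[c] = ctx_total.get(c, 0) + n
    weights = {}
    for w, row in counts.items():
        weights[w] = {}
        for c, n in row.items():
            pmi = math.log2(n * total / (word_total[w] * ctx_total[c]))
            weights[w][c] = max(pmi, 0.0)  # keep only positive associations
    return weights

# Invented counts: "the" co-occurs often with everything, while "color"
# is selectively associated with "red".
counts = {"red": {"color": 8, "the": 20}, "blue": {"color": 2, "the": 30}}
weights = ppmi(counts)
print(weights["red"]["color"] > weights["red"]["the"])  # -> True
```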

Given the resulting dimensionality-reduced matrix, word similarities were computed by comparing word association vectors using the standard cosine similarity measure. This led to results like the ones shown in Table 4 (the lists are ranked according to decreasing cosine values).

    For a quantitative evaluation we used the system for solvingthe TOEFL synonym test (see section 2.1) and comparedthe results to the correct answers as provided by the Ed-ucational Testing Service. Remember that in this test thesubjects had to choose the word most similar to a given stim-ulus word from a list of four alternatives. In the simulation,

    Table 4: Sample lists of related words as computed by the system

    enormously:  greatly (0.52), immensely (0.51), tremendously (0.48), considerably (0.48), substantially (0.44), vastly (0.38), hugely (0.38), dramatically (0.35), materially (0.34), appreciably (0.33)

    flaw:        shortcomings (0.43), defect (0.42), deficiencies (0.41), weakness (0.41), fault (0.36), drawback (0.36), anomaly (0.34), inconsistency (0.34), discrepancy (0.33), fallacy (0.31)

    issue:       question (0.51), matter (0.47), debate (0.38), concern (0.38), problem (0.37), topic (0.34), consideration (0.31), raise (0.30), dilemma (0.29), discussion (0.28)

    build:       building (0.55), construct (0.48), erect (0.39), design (0.37), create (0.37), develop (0.36), construction (0.34), rebuild (0.34), exist (0.29), brick (0.27)

    discrepancy: disparity (0.44), anomaly (0.43), inconsistency (0.43), inaccuracy (0.40), difference (0.36), shortcomings (0.35), variance (0.34), imbalance (0.34), flaw (0.33), variation (0.33)

    essentially: primarily (0.50), largely (0.49), purely (0.48), basically (0.48), mainly (0.46), mostly (0.39), fundamentally (0.39), principally (0.39), solely (0.36), entirely (0.35)

    we assumed that the system made the right decision if the correct answer was ranked best among the four alternatives. This was the case for 74 of the 80 test items, which gives an accuracy of 92.5%. In comparison, the performance of human subjects had been 97.75% for native speakers and 86.75% for highly proficient non-native speakers (see Table 1). This means that our program's performance lies between these two levels, with roughly equal margins on both sides.
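The scoring procedure can be sketched as follows. The item structure (stimulus, four alternatives, gold answer) is an assumed representation, since the actual TOEFL data belongs to the Educational Testing Service; `similarity` is any word-pair similarity function, such as the cosine measure above.

```python
def toefl_accuracy(items, similarity):
    """Score a TOEFL-style synonym test: an item counts as correct
    if the gold answer receives the highest similarity to the
    stimulus among the four alternatives."""
    correct = 0
    for stimulus, alternatives, answer in items:
        best = max(alternatives, key=lambda alt: similarity(stimulus, alt))
        correct += (best == answer)
    return correct / len(items)
```

With the 80 TOEFL items, 74 correctly top-ranked answers yield 74/80 = 92.5%.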

    An interesting observation is that in Table 4 most words listed are of the same part of speech as the stimulus word. This is surprising insofar as the simulation system never obtained any information concerning part of speech, but apparently determines it implicitly in the process of computing term relatedness. This observation is consistent with other work (e.g. [14]).

    As mentioned above, the method has also been applied to other languages, namely French, German, Spanish and Russian [32]. Apart from corpus pre-processing (e.g. segmentation and lemmatization) the algorithm remained unchanged, but nevertheless delivered similarly good results.

    As an outcome, large thesauri of related words (analogous to the samples shown in Table 4), each comprising on the order of 50,000 entries, are available for these languages.

    4. SUMMARY, CONCLUSIONS, OUTLOOK

    Our methodology builds on previous work concerning the three steps described above, namely the computation of word similarities, word sense induction, and the identification of conceptual relations between words. Our aim is to advance the state of the art for each of these tasks, and to combine them into an overall system. For the computation of word similarities, as described in the previous


    section, human intuitions have been successfully replicated via an automatic system building on previous studies such as [14], [23], and [32].

    In word sense induction, current methods can make rough sense distinctions, but are far from reaching the sophistication of human judgements. Here our current work focuses on comparing methods based on local versus global co-occurrence vectors, as well as local versus global clustering. There are deep theoretical questions behind these choices which also correlate with some design principles of WordNet. We intend to compare three existing systems which can be seen as prototypical for different choices, namely the ones described by Pantel & Lin [23], Rapp [31], and Bordag [1]. By providing empirical evidence, this should enable us to at least partially answer these questions. By combining the best choices we hope to come up with an improved algorithm.
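To make the design space concrete, here is a minimal sketch of one prototypical choice: inducing the senses of an ambiguous word by clustering the averaged context vectors of its individual occurrences (local vectors with a simple hard k-means clustering). This illustrates the general idea only; it is not a reimplementation of any of the three cited systems, and the number of senses k is assumed known.

```python
import numpy as np

def induce_senses(contexts, embeddings, k=2, iters=20, seed=0):
    """Cluster the occurrences of an ambiguous word: each occurrence
    is represented by the mean embedding of its context words, and
    the occurrence vectors are grouped with a basic k-means."""
    X = np.array([np.mean([embeddings[w] for w in ctx], axis=0)
                  for ctx in contexts])
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each occurrence to its nearest centre
        labels = np.argmin(
            ((X[:, None, :] - centres[None, :, :]) ** 2).sum(-1), axis=1)
        # move each centre to the mean of its assigned occurrences
        for j in range(k):
            if np.any(labels == j):
                centres[j] = X[labels == j].mean(axis=0)
    return labels
```

Occurrences assigned to the same cluster are then taken to share a sense, which is the precondition for grouping words into synsets.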

    Concerning the identification of conceptual relations holding between words, the field is still at an early stage, and it is unclear whether the aim of automatically replicating WordNet's relations through unsupervised learning from raw text is realistic. However, attempting to do so is certainly of interest. On one hand, it is still rather unclear what the empirical basis for these relations is, and how they can be extracted from a corpus. On the other hand, WordNet provides such relations and can therefore be used as a gold standard for the iterative refinement of an algorithm. As a possible outcome, it may well turn out that the empirical support for WordNet's conceptual relations is not equally strong for all types. This would raise the question whether the choices underlying WordNet were sound, and what the most salient alternative relations would be. Also, there may be interesting findings within each category, as most categories are only applicable to certain subsets of words (e.g. holonymy cannot easily be applied to abstract terms).

    Although the envisaged advances concerning the three steps are of a more evolutionary nature, their combined effect is expected to yield a time-saving and largely language-independent algorithm for the automatic extraction of a WordNet-like resource from a large text corpus.

    The work is also of interest from a cognitive perspective, as WordNet is a collection of different types of human intuitions, namely intuitions on word similarity, on word senses, and on word relations. The question is whether all of these intuitions find their counterpart in corpus evidence. Should this be the case, it would support the view that human language acquisition can be explained by unsupervised learning

    (i.e. low-level statistical mechanisms) on the basis of perceived spoken and/or written language. If not, other sources of information available for language learning would have to be identified, which may include, for example, knowledge derived from visual perception, world knowledge as postulated in Artificial Intelligence, or some inherited high-level mechanisms such as Pinker's language instinct or Chomsky's language acquisition device.

    Although the suggested methodology is unlikely to completely replace current manual techniques of compiling lexical databases in the near future, it should at least be useful

    to aggregate relevant information for subsequent human inspection, thereby making the manual work more efficient. This is of particular importance as the suggested methods should in principle be applicable to all languages, so that the potential savings multiply.

    Another aspect is that automatic methods will in principle allow generating WordNets for particular genres, domains, or dialects by simply running the algorithm on a large text corpus of the respective type. Such specialization would not be easy to achieve manually, as human intuitions tend to be based on the sum of lifetime experience, so that it is difficult to concentrate on specific aspects.

    Let us conclude by citing from Piasecki et al. [27]: "A language without a wordnet is at a severe disadvantage. ... Language technology is a signature area of ... the Internet, ... including increasingly clever search engines and more and more adequate machine translation. A wordnet, a rich repository of knowledge about words, is a key element of ... language processing."

    5. REFERENCES

    [1] S. Bordag. Word sense induction: triplet-based clustering and automatic evaluation. Proc. of EACL, 2006.

    [2] P. Buitelaar, P. Cimiano (eds.). Ontology Learning and Population: Bridging the Gap between Text and Knowledge - Selected Contributions to Ontology Learning and Population from Text. IOS Press, 2008.

    [3] P. Buitelaar, P. Cimiano, P. Haase, M. Sintek. Towards linguistically grounded ontologies. Proceedings of the 6th ESWC, Heraklion, Greece, 111-125, 2009.

    [4] P. Buitelaar, T. Declerck, A. Frank, S. Racioppa, M. Kiesel, M. Sintek, R. Engel, M. Romanelli, D. Sonntag, B. Loos, V. Micelli, R. Porzel, P. Cimiano. LingInfo: Design and applications of a model for the integration of linguistic information in ontologies. Proc. of the OntoLex Workshop, Genoa, Italy, 28-34, 2006.

    [5] J.A. Bullinaria, J.P. Levy. Extracting semantic representations from word co-occurrence statistics: A computational study. Behavior Research Methods, 39, 510-526, 2007.

    [6] S.A. Carabello, E. Charniak. Determining the specificity of nouns from text. Proc. of EMNLP-VLC, 63-70, 1999.

    [7] P. Cimiano, P. Haase, M. Herold, M. Mantel, P. Buitelaar. LexOnto: A model for ontology lexicons for ontology-based NLP. Proceedings of the OntoLex07 Workshop at ISWC07, South Korea, 2007.

    [8] C. Fellbaum (ed.). WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press, 1998.

    [9] G. Grefenstette. Explorations in Automatic Thesaurus Discovery. Dordrecht: Kluwer, 1994.

    [10] Z.S. Harris. Distributional structure. Word, 10(2-3), 146-162, 1954.

    [11] G. Hirst, D. St-Onge. Lexical chains as representation of context for the detection and correction of malapropisms. In: C. Fellbaum (ed.): WordNet: An Electronic Lexical Database, Cambridge: MIT Press, 305-332, 1998.

    [12] M. Jarmasz, S. Szpakowicz. Roget's thesaurus and semantic similarity. Proc. of RANLP, Borovets, Bulgaria, September, 212-219, 2003.

    [13] J.J. Jiang, D.W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. Proceedings of the International Conference on Research in Computational Linguistics, Taiwan, 1997.

    [14] T.K. Landauer, S.T. Dumais. A solution to Plato's problem: the latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2), 211-240, 1997.

    [15] T.K. Landauer, D.S. McNamara, S. Dennis, W. Kintsch (eds.). Handbook of Latent Semantic Analysis. Lawrence Erlbaum, 2007.

    [16] C. Leacock, M. Chodorow. Combining local context and WordNet similarity for word sense identification. In: C. Fellbaum (ed.). WordNet: An Electronic Lexical Database. Cambridge: MIT Press, 265-283, 1998.

    [17] D. Lin. Automatic retrieval and clustering of similar words. Proc. of COLING-ACL, Montreal, Vol. 2, 768-773, 1998.

    [18] D. Lin. An information-theoretic definition of similarity. Proc. of the 15th International Conference on Machine Learning (ICML-98), Madison, WI, 296-304, 1998.

    [19] I. Matveeva, G. Levow, A. Farahat, C. Royer. Generalized latent semantic analysis for term representation. Proc. of RANLP, Borovets, Bulgaria, 2005.

    [20] E. Montiel-Ponsoda, W. Peters, G. Auguado de Cea, M. Espinoza, A. Gomez Perez, M. Sini. Multilingual and Localization Support for Ontologies. Technical report, D2.4.2 Neon Project Deliverable, 2008.

    [21] D.B. Neill. Fully Automatic Word Sense Induction by Semantic Clustering. Cambridge University, Master's Thesis, M.Phil. in Computer Speech, 2002.

    [22] S. Pado, M. Lapata. Dependency-based construction of semantic space models. Computational Linguistics, 33(2), 161-199, 2007.

    [23] P. Pantel, D. Lin. Discovering word senses from text. Proc. of ACM SIGKDD, Edmonton, 613-619, 2002.

    [24] P. Pantel, M. Pennacchiotti. Automatically harvesting and ontologizing semantic relations. In: P. Buitelaar, P. Cimiano (eds.) Ontology Learning and Population: Bridging the Gap between Text and Knowledge - Selected Contributions to Ontology Learning and Population from Text, IOS Press, 2008.

    [25] M.T. Pazienza, A. Stellato. Exploiting Linguistic Resources for building linguistically motivated ontologies in the Semantic Web. Proc. of the 2nd OntoLex Workshop, 2006.

    [26] M. Pennacchiotti, P. Pantel. A bootstrapping algorithm for automatically harvesting semantic relations. Proceedings of Inference in Computational Semantics (ICoS), Buxton, England, 87-96, 2006.

    [27] M. Piasecki, S. Szpakowicz, B. Broda. A WordNet from the Ground Up. Oficyna Wydawnicza Politechniki Wroclawskiej, 2009.

    [28] R. Rapp. The computation of word associations: comparing syntagmatic and paradigmatic approaches. Proc. of the 19th COLING, Taipei, ROC, Vol. 2, 821-827, 2003.

    [29] R. Rapp. Word sense discovery based on sense descriptor dissimilarity. Proceedings of the Ninth MT Summit, 315-322, 2003.

    [30] R. Rapp. A freely available automatically generated thesaurus of related words. Proceedings of the 4th LREC, Lisbon, Vol. II, 395-398, 2004.

    [31] R. Rapp. A practical solution to the problem of automatic word sense induction. Proc. of the 42nd Meeting of the ACL, Comp. Vol., 195-198, 2004.

    [32] R. Rapp. The automatic generation of thesauri of related words for English, French, German, and Russian. International Journal of Speech Technology, 11(3), 147-156, 2009.

    [33] P. Resnik. Using information content to evaluate semantic similarity. Proc. of the 14th International Joint Conference on Artificial Intelligence (IJCAI), Montreal, 448-453, 1995.

    [34] J. Rosenzweig, R. Mihalcea, A. Csomai. WordNet bibliography. Web page: a bibliography referring to research involving the WordNet lexical database. URL http://lit.csci.unt.edu/wordnet/, 2007.

    [35] G. Ruge. Experiments on linguistically based term associations. Information Processing and Management, 28(3), 317-332, 1992.

    [36] M. Ruiz-Casado, E. Alfonseca, P. Castells. Using context-window overlapping in synonym discovery and ontology extension. Proc. of RANLP, Borovets, Bulgaria, 2005.

    [37] M. Sahlgren. Vector-based semantic analysis: representing word meanings based on random labels. In: A. Lenci, S. Montemagni, V. Pirrelli (eds.): Proceedings of the ESSLLI Workshop on the Acquisition and Representation of Word Meaning, Helsinki, 2001.

    [38] H. Schütze. Ambiguity Resolution in Language Learning: Computational and Cognitive Models. Stanford: CSLI Publications, 1997.

    [39] E. Terra, C.L.A. Clarke. Frequency estimates for statistical word similarity measures. Proceedings of HLT/NAACL, Edmonton, Alberta, 244-251, 2003.

    [40] S. Thoongsup, K. Robkop, C. Mokarat, T. Sinthurahat, T. Charoenporn, V. Sornlertlamvanich, H. Isahara. Thai WordNet construction. Proc. of the 7th Workshop on Asian Language Resources at ACL-IJCNLP, Suntec, Singapore, 139-144, 2009.

    [41] P.D. Turney. Mining the Web for synonyms: PMI-IR versus LSA on TOEFL. Proc. of the Twelfth European Conference on Machine Learning, Freiburg, Germany, 491-502, 2001.

    [42] P.D. Turney. Similarity of semantic relations. Computational Linguistics, 32(3), 379-416, 2006.

    [43] P.D. Turney. A uniform approach to analogies, synonyms, antonyms, and associations. Proceedings of the 22nd COLING, Manchester, UK, 905-912, 2008.

    [44] P.D. Turney, M.L. Littman, J. Bigham, V. Shnayder. Combining independent modules to solve multiple-choice synonym and analogy problems. Proc. of RANLP, Borovets, Bulgaria, 482-489, 2003.

    [45] P.D. Turney, P. Pantel. From frequency to meaning: vector space models of semantics. Journal of Artificial Intelligence Research, 37, 141-188, 2010.