



Taxonomy Induction Using Hierarchical Random Graphs

Trevor Fountain and Mirella Lapata
Institute for Language, Cognition and Computation

School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB

[email protected], [email protected]

Abstract

This paper presents a novel approach for inducing lexical taxonomies automatically from text. We recast the learning problem as that of inferring a hierarchy from a graph whose nodes represent taxonomic terms and edges their degree of relatedness. Our model takes this graph representation as input and fits a taxonomy to it via a combination of a maximum likelihood approach with a Monte Carlo sampling algorithm. Essentially, the method works by sampling hierarchical structures with probability proportional to the likelihood with which they produce the input graph. We use our model to infer a taxonomy over 541 nouns and show that it outperforms popular flat and hierarchical clustering algorithms.

1 Introduction

The semantic knowledge encoded in lexical resources such as WordNet (Fellbaum, 1998) has been proven beneficial for several applications including question answering (Harabagiu et al., 2003), document classification (Hung et al., 2004), and textual entailment (Geffet and Dagan, 2005). As the effort involved in creating such resources manually is prohibitive (cost, consistency, and coverage are often cited problems) and has to be repeated for new languages or domains, recent years have seen increased interest in automatic taxonomy induction. The task has assumed several guises, such as term extraction — finding the concepts of the taxonomy (Kozareva et al., 2008; Navigli et al., 2011),

term relation discovery — learning whether any two terms stand in a semantic relation such as IS-A or PART-OF (Hearst, 1992; Berland and Charniak, 1999), and taxonomy construction — creating the taxonomy proper by organizing its terms hierarchically (Kozareva and Hovy, 2010; Navigli et al., 2011). Previous work has also focused on the complementary task of augmenting an existing taxonomy with missing information (Snow et al., 2006; Yang and Callan, 2009).

In this paper we propose an unsupervised approach to taxonomy induction. Given a corpus and a set of terms, our algorithm jointly induces their relations and their taxonomic organization. We view taxonomy learning as an instance of the problem of inferring a hierarchy from a network or graph. We create this graph from unstructured text simply by drawing an edge between distributionally similar terms. Next, we fit a Hierarchical Random Graph model (HRG; Clauset et al. (2008)) to the observed graph data based on maximum likelihood methods and Markov chain Monte Carlo sampling. The model essentially works by sampling hierarchical structures with probability proportional to the likelihood with which they produce the input graph. This is advantageous as it allows us to consider the ensemble of random graphs that are statistically similar to the original graph, and through this to derive a consensus hierarchical structure from the ensemble of sampled models. The approach differs crucially from hierarchical clustering in that it explicitly acknowledges that most real-world networks have many plausible hierarchical representations of roughly equal likelihood, and does not seek a single hierarchical representation for a given network.

Page 2: Taxonomy Induction Using Hierarchical Random Graphs Induction... · model then employs a simple Markov chain Monte Carlo (MCMC) process in order to explore the space of possible binary

This feature also sits well with the nature of lexical taxonomies: there is no uniquely correct taxonomy for a set of terms; rather, different taxonomies are likely to be appropriate for different tasks and different taxonomization criteria.

Our contributions in this paper are three-fold: we adapt the HRG model to the taxonomy induction task and show that its performance is superior to alternative methods based on either flat or hierarchical clustering; we analyze the requirements of the algorithm with respect to the input graph and the semantic representation of its nodes; and we introduce new ways of evaluating the fit of an automatically induced taxonomy against a gold standard. In the following section we provide an overview of related work. Next, we describe our HRG model in more detail (Section 3) and present the resources and evaluation methodology used in our experiments (Section 4). We conclude the paper by presenting and discussing our results (Sections 4.1–4.4).

2 Related Work

The bulk of previous work has focused on term relation discovery, following essentially two methodological paradigms: pattern-based bootstrapping and clustering. The former approach (Hearst, 1992; Roark and Charniak, 1998; Berland and Charniak, 1999; Girju et al., 2003; Etzioni et al., 2005; Kozareva et al., 2008) utilizes a few hand-crafted seed patterns representative of taxonomic relations (e.g., IS-A, PART-OF, SIBLING) to extract instances from corpora. These instances are then used to extract new patterns, which are in turn used to find new instances, and so on. Clustering-based approaches have been mostly employed to discover IS-A and SIBLING relations (Lin, 1998; Caraballo, 1999; Pantel and Ravichandran, 2004). A common assumption is that words are related if they occur in similar contexts, and thus clustering algorithms group words together if they share contextual features. Most of these algorithms aim at inducing flat clusters rather than taxonomies, with the exception of Brown et al. (1992), whose method induces binary trees.

Contrary to the plethora of algorithms developed for relation discovery, methods dedicated to taxonomy learning have been few and far between. Caraballo (1999) was the first to induce a taxonomy from a corpus using a combination of clustering and pattern-based methods. Specifically, nouns are organized into a tree using a bottom-up clustering algorithm, and internal nodes of the resulting tree are labeled with hypernyms from the nouns clustered underneath using patterns such as "B is a kind of A".

Kozareva et al. (2008) and Navigli et al. (2011) both develop systems that create taxonomies end-to-end, i.e., discover the terms, their relations, and how these are hierarchically organized. The two approaches are conceptually similar: they both use the web and pattern-based methods for finding domain-specific terms. Additionally, in both approaches the acquired knowledge is represented as a graph from which a taxonomy is induced using task-specific algorithms such as graph pruning, edge weighting, and so on.

Our work also addresses taxonomy learning, however without the term discovery step — we assume we are given the terms for which to create a taxonomy. Similarly to Kozareva et al. (2008) and Navigli et al. (2011), our model operates over a graph whose nodes represent terms and edges their relationships. We construct this graph from a corpus simply by taking account of the distributional similarity of the terms in question. Our taxonomy induction algorithm is conceptually simpler and more general; it fits a taxonomy to the observed network data using the tools of statistical inference, combining a maximum likelihood approach with a Monte Carlo sampling algorithm. The technique allows us to sample hierarchical random graphs with probability proportional to the likelihood that they generate the observed network. The induction algorithm can operate over any kind of (undirected) graph, and thus does not have to be tuned specifically for different inputs. We should also point out that our formulation of the inference problem utilizes very little corpus-external knowledge other than the set of input terms, and could thus be easily applied to domains or languages where lexical resources are scarce.

The Hierarchical Random Graph model (Clauset et al., 2008) has been applied to construct hierarchical decompositions from three sets of network data: a bacterial metabolic network, a food web among grassland species, and the network of associations among terrorist cells. The only language-related application we are aware of concerns word sense induction. Klapaftis and Manandhar (2010) create a graph of contexts for a polysemous target word and use the HRG to organize them hierarchically, under the assumption that different tree heights correspond to different levels of sense granularity.


[Figure 1: (a) input graph over nodes A–F; (b) binary tree over leaves A–F with maximum-likelihood θ values (1.00, 1.00, 1.00, 0.50, 0.11) at its internal nodes; (c) the collapsed n-ary hierarchy; (d) the flat clusters.]

Figure 1: Flow of information through the Hierarchical Random Graph algorithm. From a semantic network (1a), the model constructs a binary tree (1b). Edges in the semantic network are then used to compute the θ parameters for internal nodes in the tree; the maximum-likelihood estimate of the θ parameter for an internal node indicates the density of edges between its children. This tree is then resampled using the θ parameters (1b) until the MCMC process converges, at which point it can be collapsed into an n-ary hierarchy (1c). The same collapsing process can also be used to identify a flat clustering (1d).


Figure 2: Any internal node with subtrees A, B, and C can be permuted to one of two possible alternate configurations. Shaded nodes represent internal nodes which are unmodified by such a permutation.

3 The Hierarchical Random Graph Model

A HRG consists of a binary tree and a set of likelihood parameters, and operates on input organized into a semantic network: an undirected graph in which nodes represent terms and edges between nodes indicate a relationship between pairs of terms (Figure 1a). From this representation, the model constructs a binary tree whose leaves correspond to nodes in the semantic network (Figure 1b); the model then employs a simple Markov chain Monte Carlo (MCMC) process in order to explore the space of possible binary trees and derives a consensus hierarchical structure from the ensemble of sampled models (Figure 1c).

3.1 Representing a Hierarchical Structure

Formally, we denote a semantic network S = (V, E), where V = {v_1, v_2, ..., v_n} is the set of vertices, one per term, and E is the set of edges between terms, in which E_{a,b} indicates the presence of an edge between v_a and v_b.

Given a network S, we construct a binary tree D whose n leaves correspond to V and whose n − 1 internal nodes denote a hierarchy over V. Because the leaves remain constant for a given S, we define D as the set of internal nodes D = {D_1, D_2, ..., D_{n−1}} and associate each edge E_{a,b} ∈ E with an internal node D_i, namely the lowest common ancestor of a, b ∈ V. The core assumption underlying the HRG model is that edges in S have a non-uniform and independent probability of existing. Each possible edge E_{a,b} exists with probability θ_i, where θ_i is associated with the corresponding internal node D_i.

For a given internal node D_i, let L_i and R_i be the number of leaves in D_i's left and right subtrees, respectively, and let E_i be the number of edges in E associated with D_i (colloquially, the number of edges in S between leaves in D_i's left and right subtrees). For each D_i ∈ D, the maximum likelihood estimate of the corresponding θ_i is θ_i = E_i / (L_i R_i). The likelihood L(D, θ | S) of an HRG over a given semantic network S is then given by:

L(D, θ | S) = ∏_{i=1}^{n−1} θ_i^{E_i} (1 − θ_i)^{L_i R_i − E_i}    (1)
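To make the estimate concrete, here is a minimal sketch (our illustration, not the authors' code) that computes the maximum-likelihood θ_i at every internal node and the resulting log of Equation (1), assuming a tree encoded as nested 2-tuples and a network encoded as a set of edges:

```python
from itertools import product
from math import log

def log_likelihood(tree, edges):
    """Log of Equation (1) for a fixed binary tree.

    tree  -- nested 2-tuples with term strings at the leaves,
             e.g. ((("A", "B"), "C"), ("D", ("E", "F")))
    edges -- the semantic network as a set of frozenset({a, b}) pairs
    """
    def leaves(t):
        return [t] if isinstance(t, str) else leaves(t[0]) + leaves(t[1])

    total, stack = 0.0, [tree]
    while stack:
        node = stack.pop()
        if isinstance(node, str):        # a leaf carries no theta parameter
            continue
        stack.extend(node)
        left, right = leaves(node[0]), leaves(node[1])
        # E_i: edges of S crossing between the left and right subtrees
        e_i = sum(frozenset((a, b)) in edges for a, b in product(left, right))
        l_r = len(left) * len(right)     # L_i * R_i
        theta = e_i / l_r                # maximum-likelihood estimate
        if 0.0 < theta < 1.0:            # theta in {0, 1} contributes log 1 = 0
            total += e_i * log(theta) + (l_r - e_i) * log(1.0 - theta)
    return total

# A small made-up six-node network, purely for demonstration:
network = {frozenset(p) for p in [("A", "B"), ("A", "C"), ("B", "C"),
                                  ("D", "E"), ("E", "F")]}
print(log_likelihood(((("A", "B"), "C"), ("D", ("E", "F"))), network))
```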

3.2 Markov Chain Monte Carlo Sampling

Given a representation for a HRG H(D, θ) and a method for estimating the likelihood of a given D and θ, we can focus on obtaining the binary tree D which best fits (or most plausibly explains) a given semantic network. Because the space of possible binary trees over V is super-exponential with respect to |V|, we employ an MCMC process to sample from the space of binary trees.

Page 4: Taxonomy Induction Using Hierarchical Random Graphs Induction... · model then employs a simple Markov chain Monte Carlo (MCMC) process in order to explore the space of possible binary


During each iteration of this process we randomly select a node within the tree and permute it according to Figure 2. If this permutation improves the overall likelihood of the dendrogram we accept it as a transition; otherwise it is accepted with a probability proportional to the degree to which it decreases the overall likelihood (i.e., standard Metropolis acceptance). This procedure is described in more detail in Algorithm 1.

Algorithm 1: MCMC Sampling
1. Compute the likelihood L(D, θ) of the current binary tree.
2. Pick a random internal node D_i ∈ D.
3. Randomly permute D_i according to Figure 2.
4. Compute the likelihood L(D′, θ′) of the modified binary tree.
5. if L(D′, θ′) > L(D, θ) then
6.    accept the transition;
7. else
8.    accept with probability L(D′, θ′)/L(D, θ) (i.e., standard Metropolis acceptance);
9. end
10. repeat.
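As a concrete illustration of Algorithm 1, the following sketch implements one Metropolis step in log-space; the callables `log_likelihood` and `propose` are placeholders for the likelihood above and the Figure 2 permutation, not part of the original paper:

```python
import math
import random

def metropolis_step(tree, edges, log_likelihood, propose):
    """One iteration of Algorithm 1 (a sketch, assuming log-likelihoods).

    propose(tree) should return a copy of the tree in which one randomly
    chosen internal node has been permuted into an alternate configuration
    (cf. Figure 2).
    """
    candidate = propose(tree)
    current_ll = log_likelihood(tree, edges)
    candidate_ll = log_likelihood(candidate, edges)
    # Accept improvements outright ...
    if candidate_ll > current_ll:
        return candidate
    # ... otherwise accept with probability
    # L(candidate)/L(current) = exp(candidate_ll - current_ll).
    if random.random() < math.exp(candidate_ll - current_ll):
        return candidate
    return tree
```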

3.3 Consensus Hierarchy

Once the MCMC process has converged, the model is left with a binary tree over the terms from the input semantic network. As in standard hierarchical clustering, however, this imposes an arbitrary structure which may or may not correspond to the observed data — the tree at convergence will be similar to an ideal tree given the graph, but may not be the most plausible structure. Indeed, for taxonomy induction it is quite unlikely that a binary tree will provide the most appropriate categorization.

To avoid encoding such bias we employ a model averaging technique to produce a consensus hierarchy. For a set of binary trees sampled after convergence, we first identify the set of possible clusters encoded in each tree; e.g., the binary tree in Figure 1b encodes the clusters {AB, ABC, EF, D, DEF, ABCDEF}. As in Clauset et al. (2008), each cluster instance is then weighted according to the likelihood of the originating HRG (Equation 1); we then sum the weights for each distinct cluster across all resampled trees and discard those whose aggregate weight is lower than 50% of the total observed weight. The remaining clusters are then used to


reconstruct a hierarchy in which each subtree appears in the majority of trees observed after the sampling process has reached convergence, hence the term consensus hierarchy.
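A sketch of the majority-weighting step, assuming each post-convergence sample arrives as the set of clusters its tree encodes together with that tree's likelihood (the encoding is our choice, not prescribed by the paper):

```python
from collections import defaultdict

def consensus_clusters(samples):
    """Clusters supported by a weighted majority of sampled trees.

    samples -- iterable of (clusters, weight) pairs, where clusters is a
               set of frozensets of leaf terms encoded by one sampled tree
               and weight is that tree's likelihood under Equation (1)
    """
    score, total = defaultdict(float), 0.0
    for clusters, weight in samples:
        total += weight
        for cluster in clusters:
            score[cluster] += weight
    # Discard clusters whose aggregate weight falls below 50% of the total;
    # the survivors are mutually nested and reassemble into an n-ary tree.
    return {c for c, w in score.items() if w >= 0.5 * total}
```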

3.4 Obtaining Flat Clusters

For evaluation purposes we may want to compare the groupings created by the HRG to a simpler non-hierarchical clustering algorithm (see Section 4 for details). We thus defined a method of converting the tree produced by the HRG into a flat (hard) clustering. This can be done in a relatively straightforward, principled fashion using the HRG's θ parameters. For a given H(D, θ) we identify internal nodes whose θ_k likelihood is greater than the mean likelihood θ̄ = mean(θ) and which possess no parent node whose θ_k likelihood is also greater than the mean. Each such node is the root of a densely-connected subtree; each such subtree is then assumed to represent a single discrete cluster of related items (illustrated in Figure 1d). This procedure is explained in greater detail in Algorithm 2.

Algorithm 2: Flat Clusters
1. Let D_k be the root node of D.
2. if θ_k > θ̄ then
3.    output the leaves of the subtree rooted at D_k as a cluster;
4. else
5.    repeat from step 2 with the left and right children of D_k;
6. end
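Algorithm 2 amounts to a single top-down pass; a recursive sketch over the same hypothetical tree encoding used above:

```python
from statistics import mean

def flat_clusters(tree, theta):
    """Algorithm 2: cut the tree at densely connected subtrees (a sketch).

    tree  -- nested 2-tuples with term strings at the leaves
    theta -- dict mapping each internal node (a tuple) to its MLE theta_k
    """
    threshold = mean(theta.values())      # theta-bar = mean(theta)

    def leaves(t):
        return [t] if isinstance(t, str) else leaves(t[0]) + leaves(t[1])

    clusters = []
    def descend(node):
        if isinstance(node, str):
            clusters.append([node])       # a lone leaf becomes its own cluster
        elif theta[node] > threshold:
            clusters.append(leaves(node)) # densely connected subtree: one cluster
        else:
            descend(node[0])
            descend(node[1])
    descend(tree)
    return clusters
```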

4 Evaluation

Data We evaluated our taxonomy induction algorithm using McRae et al.'s (2005) dataset, which consists of 541 basic-level nouns (e.g., DOG and TABLE). Each noun is associated with features (e.g., has-legs, is-flat, and made-of-wood for TABLE) collected from human participants in multiple studies over several years. The original norming study does not include class labels for these nouns; however, we were able to exploit a clustering provided by Fountain and Lapata (2010), in which a set of online participants annotated each of the McRae et al. nouns with basic category labels.

The nouns and their class labels were further taxonomized using WordNet (Fellbaum, 1998). Specifically, we first identified the full hypernym path in WordNet for each noun in McRae et al.'s (2005) dataset,


e.g., APPLE > PLANT STRUCTURE > NATURAL OBJECT > PHYSICAL OBJECT > ENTITY (a total of 493 concepts appear in both). These hypernym paths were then combined to yield a full taxonomy over McRae et al.'s nouns; internal nodes having only a single child were recursively removed to produce a final, compact taxonomy¹ containing 186 semantic classes (e.g., ANIMALS, WEAPONS, FRUITS) organized into varying levels of granularity (e.g., SONGBIRDS > BIRDS > ANIMALS).

¹ The taxonomy and flat cluster labels are available from http://homepages.inf.ed.ac.uk/s0897549/data.

Evaluation measures Evaluation of taxonomically organized information is notoriously hard (see Hovy (2002) for an extensive discussion on this topic). This is due to the nature of the task, which is inherently subjective and application specific (e.g., a dolphin can be a Mammal to a biologist, but a Fish to a fisherman or someone visiting an aquarium). Nevertheless, we assessed the taxonomies produced by the HRG against the WordNet-like taxonomy described above using two measures: one that simply evaluates the grouping of the nouns into classes without taking account of their position in the taxonomy, and one which evaluates the taxonomy directly.

To evaluate a flat clustering into classes we use the F-score measure introduced in the SemEval-2007 task (Agirre and Soroa, 2007); it is the harmonic mean of precision and recall, defined as the number of correct members of a cluster divided by the number of items in the cluster and the number of items in the gold-standard class, respectively. Although informative, evaluation based solely on F-score puts the HRG model at a comparative disadvantage, as the task of taxonomy induction is significantly more difficult than simple clustering. To overcome this disadvantage we propose an automatic method of evaluating taxonomies directly, by first computing the walk distance between pairs of terms that share a gold-standard category label within a gold-standard and a candidate taxonomy, and then computing the pairwise correlation between distances in each tree (Lapointe, 1995). This captures the intuition that a 'good' hierarchy is one in which items appearing near one another in the gold taxonomy also appear near one another in the induced one. It is also conceptually similar to the task-based IS-A evaluation (Snow et al., 2006) which has traditionally been used to evaluate taxonomies.
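For reference, a simplified sketch of this cluster F-score: each gold class is matched with its best-scoring induced cluster, and the per-class F-scores are averaged with weights proportional to class size (this is our reading of the SemEval-2007 measure, so treat the details as an approximation):

```python
def cluster_fscore(clusters, gold_classes):
    """Size-weighted best-match F-score between clusters and gold classes.

    clusters, gold_classes -- lists of lists of terms
    """
    n = sum(len(g) for g in gold_classes)
    total = 0.0
    for gold in gold_classes:
        best = 0.0
        for cluster in clusters:
            correct = len(set(cluster) & set(gold))
            if correct:
                precision = correct / len(cluster)   # correct / cluster size
                recall = correct / len(gold)         # correct / class size
                best = max(best, 2 * precision * recall / (precision + recall))
        total += (len(gold) / n) * best
    return total
```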

Formally, let G = {g_{0,1}, g_{0,2}, ..., g_{n,n−1}}, where g_{a,b} indicates the walk distance between terms a and b in the gold-standard hierarchy.


Similarly, let C = {c_{0,1}, c_{0,2}, ..., c_{n,n−1}}, where c_{a,b} is the distance between a and b in the candidate hierarchy. The tree-height correlation between G and C is then given by Spearman's ρ correlation coefficient between the two sets. All tree-height correlations reported in our experiments were computed using the WordNet-based gold-standard taxonomy over McRae et al.'s (2005) nouns.
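Given trees stored as graphs, the measure reduces to a Spearman correlation over pairwise path lengths; a sketch using networkx and scipy (the library choices are ours, not the paper's):

```python
from scipy.stats import spearmanr
import networkx as nx

def tree_height_correlation(gold_tree, candidate_tree, pairs):
    """Spearman's rho between pairwise walk distances in two trees.

    gold_tree, candidate_tree -- networkx Graphs over nodes that include
                                 the evaluated terms as leaves
    pairs -- (a, b) term pairs that share a gold-standard category label
    """
    g = [nx.shortest_path_length(gold_tree, a, b) for a, b in pairs]
    c = [nx.shortest_path_length(candidate_tree, a, b) for a, b in pairs]
    rho, _ = spearmanr(g, c)   # second value is the p-value
    return rho
```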

Baselines We compared the HRG output against three baselines. The first is Chinese Whispers (CW; Biemann (2006)), a randomized graph-clustering algorithm which, like the HRG, also takes as input a graph with weighted edges. It produces a hard (flat) clustering over the nodes in the graph, where the number of clusters is determined automatically. Our second baseline is Brown et al.'s (1992) agglomerative clustering algorithm, which induces a mapping from word types to classes. It starts with K classes for the K most frequent word types and then proceeds by alternately adding the next most frequent word to the class set and merging the two classes which result in the least decrease in the mutual information between class bigrams. The result is a class hierarchy with word types at the leaves. Additionally, we compare against standard agglomerative clustering (Sokal and Michener, 1958), which produces a binary dendrogram in a bottom-up fashion by recursively identifying concepts or clusters with the highest pairwise similarity.

In the following, we present our taxonomy induction experiments (Sections 4.1–4.3). Since HRGs provide a means of inducing a hierarchy over a graph-based representation, which may be constructed in an arbitrary fashion, our experiments were designed to investigate how the topology and quality of the input graph influence the algorithm's performance. We thus report results when the semantic network is created from data sources of varying quality and granularity.

4.1 Experiment 1: Taxonomy Induction from Feature Norms

Method We first considered the case where the input graph is of high semantic quality, and constructed a semantic network from the feature norms collected by McRae et al. (2005). Each noun was represented as a vector with dimensions corresponding to the possible features generated by participants of the norming study.


Method   F-score   Tree Correlation
HRG      0.507     0.168
CW       0.464     —
Agglo    0.352     0.137

Table 1: Cluster F-score and tree-height correlation evaluation; a semantic network constructed over McRae et al.'s (2005) nouns and features is given as input to the algorithms.

The value of a term along a dimension was taken to be the frequency with which participants generated the corresponding feature when given the term. For each pair of terms an edge was added to the semantic network if the cosine similarity between their vector representations exceeded a fixed threshold T (set to 0.15).
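The construction itself is a thresholded cosine-similarity graph; a sketch of the recipe just described (only the threshold T = 0.15 comes from the paper; the data layout is our assumption):

```python
import numpy as np

def build_network(vectors, threshold=0.15):
    """Semantic network over nouns via thresholded cosine similarity.

    vectors -- dict mapping each noun to a NumPy vector of feature
               (or context-word) frequencies
    Returns the edge set as frozenset({a, b}) pairs.
    """
    def cosine(u, v):
        return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

    terms = sorted(vectors)
    return {frozenset((a, b))
            for i, a in enumerate(terms)
            for b in terms[i + 1:]
            if cosine(vectors[a], vectors[b]) > threshold}
```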

The resulting network was then provided as input to the HRG, which was resampled until convergence. The binary tree at convergence was collapsed into a hierarchy over clusters using the procedure described in Section 3.4; this hierarchy was evaluated by computing the cluster F-score between its constituent clusters and those of a gold-standard (human-produced) clustering. The resulting consensus hierarchy was evaluated by computing the tree-height correlation between it and the gold-standard (WordNet-derived) hierarchy.

Results Our results are summarized in Table 1. We only give the tree correlation for the HRG and agglomerative methods (Agglo), as CW does not induce a hierarchical clustering. In addition, we do not compare against Brown et al. (1992), as the input to this algorithm is not vector-based. When evaluated using F-score, the HRG algorithm produces better quality clusters compared to CW, in addition to being able to organize them hierarchically. It also outperforms agglomerative clustering by a large margin. A similar pattern emerges when the HRG and Agglo are evaluated on tree correlation. The taxonomies produced by the HRG are a better fit against the WordNet-based gold standard; the difference in performance is statistically significant (p < 0.01) using a t-test (Cohen and Cohen, 1983).

4.2 Experiment 2: Taxonomy Induction from the British National Corpus

Method The results of Experiment 1 can be considered as an upper bound of what can be achieved by the HRG when the input graph is constructed from highly accurate semantic information.

Feature norms provide detailed knowledge about meaning which would be very difficult, if not close to impossible, to obtain from a corpus. Nevertheless, it is interesting to explore how well we can induce taxonomies using a lower quality semantic network. We therefore constructed a network based on co-occurrence statistics computed from the British National Corpus (BNC, 2007) and provided the resulting semantic network as input to the HRG, CW, and Agglo models; additionally, we employed the algorithm of Brown et al. (1992) to induce a hierarchy over the target terms directly from the corpus. Unfortunately, this algorithm requires the number of desired output clusters to be specified in advance; in all trials this parameter was set to the number of clusters in the gold-standard clustering (41), thus providing the Brown-induced clusterings with a slight oracle advantage.

Again, nouns were represented as vectors in semantic space. We used a context window of five words on either side of the target word and 5,000 vector components corresponding to the most frequent non-stopwords in the BNC. Raw frequency counts were transformed using pointwise mutual information (PMI). An edge was added to the semantic network between a pair of nouns if their similarity exceeded a predefined threshold (the same as in Experiment 1). The similarity of two nouns was defined as the cosine distance between their corresponding vectors.
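A sketch of the PMI transformation over a term-by-context count matrix (the zero-count convention below is a common choice, not something the paper specifies):

```python
import numpy as np

def pmi_transform(counts):
    """Pointwise mutual information over raw co-occurrence counts.

    counts -- 2-D NumPy array (terms x context words); the paper uses the
              5,000 most frequent non-stopwords within a 5-word window
    """
    p_joint = counts / counts.sum()
    p_term = p_joint.sum(axis=1, keepdims=True)
    p_context = p_joint.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_joint / (p_term * p_context))
    pmi[~np.isfinite(pmi)] = 0.0   # zero counts: PMI set to 0 by convention
    return pmi
```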

The HRG algorithm was used to produce a taxonomy from this network and was also compared against Brown et al. (1992). The latter induces a hierarchy from a corpus directly, without the intermediate graph representation. All resulting taxonomies were evaluated against gold-standard flat and hierarchical clusterings, again as in Experiment 1.

Results Results are shown in Table 2. With regard to flat clustering (the F-score column in the table), the HRG has a slight advantage over CW and Brown et al.'s (1992) algorithm (Brown). However, the differences in performance are not statistically significant. Agglomerative clustering is the worst performing method, leading to a decrease in F-score of approximately 0.15. With regard to tree correlation, the output of the HRG is comparable to Brown (the difference between the two is not statistically significant). Both algorithms are significantly better (p < 0.01) than Agglo.


[Figure 3: three renderings of the BNC network at (a) s = 0.0, (b) s = 0.5, and (c) s = 1.0.]

Figure 3: The original semantic network as derived from the BNC (a) and the same network re-weighted using a flat clustering produced by CW (b). As s approaches 1.0 the network exhibits an increasingly strong small-world property, eventually reconstructing the input clustering only (c).

Method   F-score   Tree Correlation
HRG      0.276     0.104
CW       0.274     —
Brown    0.258     0.124
Agglo    0.122     0.077

Table 2: Cluster F-score and tree-height correlation evaluation for taxonomies inferred over McRae et al.'s (2005) nouns; all algorithms are run on the BNC.

Performance of the HRG is better when the semantic network is based on feature norms (compare Tables 1 and 2), both in terms of tree-height correlation and F-score. This suggests that the algorithm is highly dependent on the quality of the semantic network used as input. In particular, HRGs are known to be more appropriate for so-called small-world networks: graphs composed of densely-connected subgraphs with relatively sparse connections between them (Klapaftis and Manandhar, 2010). Indeed, inspection of the semantic network produced from the BNC (see Figure 3a) shows that our corpus-derived graph is emphatically not a small-world graph, yet the HRG is able to recover some taxonomic information from such a densely-connected network.

In the following experiments we first assess the difficulty of the taxonomy induction task, to get a feel for how well the algorithms are performing in comparison to humans, and then investigate ways of rendering the BNC-based graph more similar to a small-world network.

4.3 Experiment 3: Human Upper Bound

Method The previous experiments evaluated the performance of the HRG against a gold-standard hierarchy derived from WordNet. For any set of concepts there will exist multiple valid taxonomies, each representing an accurate if differing organization of identical concepts using different criteria; for the set of concepts used in Experiments 1–2 the WordNet hierarchy represents merely one of many valid hierarchies. Noting this, it is interesting to explore how well the hierarchies output by the model fit within the set of possible, valid taxonomies over a given set of concepts.

We thus conducted an experiment in which human participants were asked to organize words into arbitrary hierarchies. To render the task feasible, they were given a small subset of 12 words rather than the full set of 541 nouns over which the HRG operates. We first selected a sub-hierarchy of the WordNet tree ('living things') along with its subtrees (e.g., 'animals', 'plants'), and chose target concepts from within these trees in order to produce a taxonomy in which some items were differentiated at a high level (e.g., 'python' vs. 'dog') and others at a fine-grained level (e.g., 'lion' vs. 'tiger'). The experiment was conducted using Amazon Mechanical Turk² and involved 41 participants, all self-reported native English speakers. No guidelines as to what features participants were to use when organizing these concepts were provided. Participants were presented with a web-based, graphical, mouse-driven interface for constructing a taxonomy over the chosen set of concepts.

² http://mturk.com


Method      Tree Correlation   Min      Max     Std
HRG         0.412              -0.039   0.799   0.166
Brown       0.181              0.006    0.510   0.121
Agglo       0.274              -0.056   0.603   0.121
Agreement   0.511              -0.109   1.000   0.267

Table 3: Model performance on a subset of the target words used in Experiments 1–2, applied to a subset of the semantic network used in Experiment 2. Instead of a WordNet-derived hierarchy, models were evaluated against hierarchies manually produced by participants in an online study. Tree correlation values are means; we also report the minimum (Min), maximum (Max), and standard deviation (Std) of the mean.

To evaluate the HRG, along with the baselines from Experiment 2, against the resulting hierarchies, we constructed a semantic network over the subset of concepts using similarities derived from the BNC; this network was a subgraph of that used in Experiment 2. We compute inter-annotator agreement as the mean pairwise tree-height correlation between the hierarchies our participants produced. We also report for each model the mean tree-height correlation between the hierarchy it produced and those created by human annotators.


Results As shown in Table 3, participants achieve a mean pairwise tree correlation of 0.511. This indicates that there is a fair amount of agreement with respect to the taxonomic organization of the words in question. The HRG comes close, achieving a mean tree correlation of 0.412, followed by Agglo and Brown. In general, we observe that the HRG manages to produce hierarchies that resemble those generated by humans to a larger extent than competing algorithms. The results in Table 3 also hint at the fact that the taxonomy induction task is relatively hard, as participants do not achieve perfect agreement despite the fact that they are asked to taxonomize only 12 words.

4.4 Experiment 4: Taxonomy Induction from a Small-world Network

Method In Experiment 2 we hypothesized that a small-world input graph would be more advantageous for the HRG. In order to explore this further, we imposed something of a small-world structure on the BNC semantic network, using a combination of the baseline clustering methods evaluated in Experiment 2.

Method        F-score   Tree Correlation
HRG           0.276     0.104
HRG + CW      0.291     0.161
HRG + Brown   0.255     0.173

Table 4: Cluster F-score and tree-height correlation evaluation for taxonomies inferred by the HRG using a semantic network derived from the BNC and re-weighted using CW and Brown.

Specifically, we first obtain a (flat) clustering using either CW or Brown, which we then use to re-weight the BNC graph given as input to the HRG.³ Note that, as the clustering algorithms used are unsupervised, this procedure does not introduce any outside supervision into the overall taxonomy induction task.

³ We omit agglomerative clustering as it performed poorly on the BNC; see Table 2.

The modified weight W′_{A,B} between a pair of terms A, B was computed according to Equation (2), where s indicates the proportion of edge weight drawn from the clustering, W_{A,B} is the edge weight in the original (BNC) semantic network, and C_{A,B} is a binary value indicating whether A and B belong to the same cluster (i.e., C_{A,B} = 1 if A and B share a cluster; C_{A,B} = 0 otherwise).

W′_{A,B} = (1 − s) W_{A,B} + s C_{A,B}    (2)

The value of the s parameter was tuned empirically on held-out development data and set to s = 0.4 for both the CW and Brown algorithms. Each re-weighted network was then used as input to an HRG, and the resulting taxonomies were evaluated in the same manner as in Experiments 1 and 2.
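Equation (2) is a straightforward interpolation; a sketch, assuming the clustering is given as a term-to-cluster-id mapping:

```python
def reweight(weights, clustering, s=0.4):
    """Re-weight network edges per Equation (2) (a sketch).

    weights    -- dict {frozenset({a, b}): cosine weight from the BNC graph}
    clustering -- dict mapping each term to a flat cluster id (from CW or Brown)
    s          -- proportion of weight drawn from the clustering (0.4 here)
    """
    def c(edge):
        # C_{A,B}: 1 if A and B share a cluster, 0 otherwise
        a, b = tuple(edge)
        return 1.0 if clustering[a] == clustering[b] else 0.0

    return {edge: (1 - s) * w + s * c(edge) for edge, w in weights.items()}
```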

Results Table 4 shows results for cluster F-score and tree-height correlation for the HRG when using a graph derived from the BNC without any modifications, and two re-weighted versions using the CW and Brown clustering algorithms, respectively. As can be seen, re-weighting improves tree-height correlation substantially: HRG with CW and Brown is significantly better than HRG on its own (p < 0.05). In the case of CW, cluster F-score also yields a slight improvement. Interestingly, the tree-height correlations obtained with CW and Brown are comparable to those attained by the HRG when using the human-produced feature norms (differences in correlations are not statistically significant). An excerpt of an HRG-induced taxonomy is shown in Figure 4.



[Figure 4: an induced subtree with internal nodes numbered 0–4 over the leaf terms bed, cushion, pillow, sofa; bow, jeans, mittens, veil, blouse, coat, gown, pants, trousers, leotards, dress, swimsuit, shawl, scarf; jacket, sweater, tie, vest; bra, camisole, nylons.]

Figure 4: An excerpt from a hierarchy induced by the HRG, using the BNC semantic network with Brown re-weighting. The HRG does not provide category labels for internal nodes of the hierarchy, but subtrees within this excerpt correspond roughly to (0) TEXTILES, (1) CLOTHING, (2) GENDERED CLOTHING, (3) MEN'S CLOTHING, and (4) WOMEN'S CLOTHING.


5 Conclusions

In this paper we have presented a novel method for automatically inducing lexical taxonomies based on Hierarchical Random Graphs. The approach is conceptually simple, taking a graph representation as input and fitting a taxonomy via a combination of a maximum likelihood approach with a Monte Carlo sampling algorithm. Importantly, the approach does not operate on corpora directly; instead it relies on an abstract, interim representation (a semantic network) which we argue is advantageous, as it allows us to easily encode additional information in the input. Furthermore, the model presented here is largely parameter-free, as both the input graph and the inferred taxonomy are derived empirically in an unsupervised manner (minimal tuning of the parameter s is required when graph re-weighting is employed).

Our experiments have shown that both the input semantic network and the representation of its nodes influence the quality of the induced taxonomy. Representing the terms of the taxonomy as vectors in a human-produced feature space yields more coherent semantic classes compared to a corpus-based vector representation (see the F-scores in Tables 1 and 4). This is not surprising, as feature norms provide more detailed and accurate knowledge about semantic representations than often noisy and approximate corpus-based distributions.⁴

⁴ Note that as multiple participants are required to create a representation for each word, norming studies typically involve a small number of items, consequently limiting the scope of any computational model based on normed data.

It may be possible to obtain better performance when considering more elaborate representations. We have only experimented with a simple semantic space; however, variants that utilize syntactic information (e.g., Pado and Lapata (2007)) may be more appropriate for the taxonomy induction task. Our experiments have also shown that the topology of the input semantic network is critical for the success of the HRG. In particular, edge re-weighting plays an important role and generally improves performance. We have adopted a simple method based on flat clustering; it may be interesting to compare how this fares with more involved weighting schemes such as those described in Navigli et al. (2011). Finally, we have shown that naive participants are able to perform the taxonomy induction task relatively reliably and that the HRG approximates human performance in a small-scale experiment. We have evaluated model output using F-score and tree-height correlation, which we argue are complementary and allow us to assess hierarchical clustering more rigorously.

Avenues for future work are many and varied. Besides exploring the performance of our algorithm on more specialized domains (e.g., mathematics or geography), we would also like to create an incremental version that augments an existing taxonomy with missing information. Additionally, the taxonomies inferred with the HRG do not currently admit term ambiguity, which we could remedy by modifying our technique for constructing a consensus hierarchy to reflect the sampled frequency of observed subtrees.



References

Eneko Agirre and Aitor Soroa. 2007. SemEval-2007 task 02: Evaluating word sense induction and discrimination systems. In Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval-2007), pages 7–12, Prague, Czech Republic, June.

Matthew Berland and Eugene Charniak. 1999. Finding parts in very large corpora. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 57–64, College Park, Maryland.

Chris Biemann. 2006. Chinese Whispers - an efficient graph clustering algorithm and its application to natural language processing problems. In Proceedings of TextGraphs: the 1st Workshop on Graph Based Methods for Natural Language Processing, pages 73–80, New York City.

BNC. 2007. The British National Corpus, version 3 (BNC XML Edition). Distributed by Oxford University Computing Services on behalf of the BNC Consortium.

Peter F. Brown, Vincent J. Della Pietra, Peter V. de Souza, Jenifer C. Lai, and Robert L. Mercer. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18:467–479.

Sharon A. Caraballo. 1999. Automatic construction of a hypernym-labeled noun hierarchy from text. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 120–126, College Park, Maryland.

Aaron Clauset, Cristopher Moore, and M. E. J. Newman. 2008. Hierarchical structure and the prediction of missing links in networks. Nature, 453:98–101, February.

J. Cohen and P. Cohen. 1983. Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences. Erlbaum, Hillsdale, NJ.

Oren Etzioni, Michael Cafarella, Doug Downey, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, and Alexander Yates. 2005. Unsupervised named-entity extraction from the web: An experimental study. Artificial Intelligence, 165(1):91–134.

Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database (Language, Speech, and Communication). The MIT Press.

Trevor Fountain and Mirella Lapata. 2010. Meaning representation in natural language categorization. In Stellan Ohlsson and Richard Catrambone, editors, Proceedings of the 31st Annual Conference of the Cognitive Science Society, pages 1916–1921, Portland, Oregon. Cognitive Science Society.

Maayan Geffet and Ido Dagan. 2005. The distributional inclusion hypotheses and lexical entailment. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 107–114, Ann Arbor, Michigan.

Roxana Girju, Adriana Badulescu, and Dan Moldovan. 2003. Learning semantic constraints for the automatic discovery of part-whole relations. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pages 80–87, Edmonton, Canada.

Sanda M. Harabagiu, Steven J. Maiorano, and Marius A. Pasca. 2003. Open-domain textual question answering techniques. Natural Language Engineering, 9(3):1–38.

Marti A. Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th Conference on Computational Linguistics, pages 539–545, Nantes, France.

Eduard Hovy. 2002. Comparing sets of semantic relationships in ontologies. In Rebecca Green, Carol A. Bean, and Sun Hyon Myaeng, editors, The Semantics of Relationships: An Interdisciplinary Perspective, pages 91–110. Kluwer Academic Publishers, The Netherlands.

Chihli Hung, Stefan Wermter, and Peter Smith. 2004. Hybrid neural document clustering using guided self-organization and WordNet. IEEE Intelligent Systems, 19(2):68–77.

Ioannis Klapaftis and Suresh Manandhar. 2010. Word sense induction and disambiguation using hierarchical random graphs. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 745–755, Cambridge, MA.

Zornitsa Kozareva and Eduard Hovy. 2010. Learning arguments and supertypes of semantic relations using recursive patterns. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1482–1491, Uppsala, Sweden, July.

Zornitsa Kozareva, Ellen Riloff, and Eduard Hovy. 2008. Semantic class learning from the web with hyponym pattern linkage graphs. In Proceedings of ACL-08: HLT, pages 1048–1056, Columbus, Ohio, June.

Francois-Joseph Lapointe. 1995. Comparison tests for dendrograms: A comparative evaluation. Journal of Classification, 12:265–282.

Dekang Lin. 1998. Automatic retrieval and clustering of similar words. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 2, pages 768–774, Montreal, Quebec, Canada.

Ken McRae, George S. Cree, Mark S. Seidenberg, and Chris McNorgan. 2005. Semantic feature production norms for a large set of living and non-living things. Behavioral Research Methods, Instruments & Computers, 37(4):547–559.

Roberto Navigli, Paola Velardi, and Stefano Faralli. 2011. A graph-based algorithm for inducing lexical taxonomies from scratch. In Proceedings of the 22nd International Joint Conference on Artificial Intelligence, pages 1872–1877, Barcelona, Spain.

Sebastian Pado and Mirella Lapata. 2007. Dependency-based construction of semantic space models. Computational Linguistics, 33(2):161–199.

Patrick Pantel and Deepak Ravichandran. 2004. Automatically labeling semantic classes. In Daniel Marcu, Susan Dumais, and Salim Roukos, editors, HLT-NAACL 2004: Main Proceedings, pages 321–328, Boston, Massachusetts.

Brian Roark and Eugene Charniak. 1998. Noun-phrase co-occurrence statistics for semi-automatic semantic lexicon construction. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 2, pages 1110–1116, Montreal, Quebec.

Rion Snow, Daniel Jurafsky, and Andrew Y. Ng. 2006. Semantic taxonomy induction from heterogenous evidence. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 801–808, Sydney, Australia.

Robert Sokal and Charles Michener. 1958. A statistical method for evaluating systematic relationships. University of Kansas Science Bulletin, 38:1409–1438.

Hui Yang and Jamie Callan. 2009. A metric-based framework for automatic taxonomy induction. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 271–279, Suntec, Singapore.