
Constructing Task-Specific Taxonomies for Document Collection Browsing

Hui Yang
Department of Computer Science
Georgetown University
37th and O Street, NW
Washington, DC, 20057
[email protected]

Abstract

Taxonomies can serve as browsing tools for document collections. However, given an arbitrary collection, pre-constructed taxonomies cannot easily adapt to the specific topic/task present in the collection. This paper explores techniques to quickly derive task-specific taxonomies supporting browsing in arbitrary document collections. The supervised approach directly learns semantic distances from users to propose meaningful task-specific taxonomies. The approach aims to produce globally optimized taxonomy structures by incorporating path consistency control and user-generated task specification into the general learning framework. A comparison to state-of-the-art systems and a user study jointly demonstrate that our techniques are highly effective.

1 Introduction

Taxonomies are widely used for knowledge standardization, knowledge sharing, and inferencing in natural language processing (NLP) tasks (Harabagiu et al., 2003; Szpektor et al., 2004). However, another common function of taxonomies, browsing, has received little attention in the NLP community. Browsing is the task of exploring and accessing information through a structure, e.g. a hierarchy, built upon a given document collection. In fact, taxonomies serve as browsing tools in many venues, including the Library of Congress Subject Headings (LCSH, 2011) for the U.S. Library of Congress and the Open Directory Project (ODP, 2011) for about 5% of the entire Web. We call taxonomies that support browsing browsing taxonomies.

When used for browsing, concepts1 in taxonomies are linked to documents containing them, and the taxonomic structure is navigated to find particular documents. Users can navigate through a browsing taxonomy to explore the documents in the collection. A browsing taxonomy benefits information access by providing a corpus overview for a document collection and by allowing more focused reading, since documents about the same concept are presented together.

Most existing browsing taxonomies, such as LCSH and ODP, are manually constructed to support large collections in general domains. Not only are their constructions expensive and slow, but their structures are also static and difficult to adapt to specific tasks. In situations where document collections are given ad hoc, such as search result organization (Carpineto et al., 2009), email collection exploration (Yang and Callan, 2008), and literature investigation (Chau et al., 2011), existing taxonomies may not even be able to provide the right coverage of concepts. It is necessary to explore ad-hoc (semi-)automatic techniques to quickly derive task-specific browsing taxonomies for arbitrary document collections.

Hovy (2002) pointed out that one key challenge in taxonomy construction is the multiple perspectives embedded in concepts and relations. One cause of multiple perspectives is the inherent facets in concepts; e.g., jewelry can be organized by price or by gemstone type. Another cause is task specification or even personalization. For example, when building a taxonomy for search results of the query trip to

1 English terms or entities; usually nouns or noun phrases.


DC, Jane may organize the concepts based on places of interest while Tom may organize them based on visit dates. Typically, a taxonomy conveys only one or two perspectives out of many choices. It is difficult to decide which perspective should be presented. One realistic solution is to leave the decision to the constructor, regardless of whether the ambiguity comes from facets, task specification, or personalization.

When multiple perspectives are present in the same taxonomy, it is not uncommon that the perspectives are mixed. For example, along the path financial institute→bank→river bank, financial institute→bank shows one perspective and bank→river bank shows another. We call this problem path inconsistency. Many approaches to automatic taxonomy construction suffer from this problem because their focus is on accurately identifying local relations between concept pairs (Etzioni et al., 2005; Pantel and Pennacchiotti, 2006) instead of on global control over the entire taxonomic structure. More recently, approaches have attempted to build the full taxonomy structure (Snow et al., 2006; Yang and Callan, 2009; Kozareva and Hovy, 2010); however, few have looked into how to incorporate task specifications into taxonomy construction.

In this paper, we extend an existing taxonomy construction approach (Yang and Callan, 2009) to build task-specific taxonomies for document collection browsing. The extension comes in two parts: handling path consistency and incorporating specifications from users. We uniquely employ pair-wise semantic distance as an entry point to incrementally build browsing taxonomies. A supervised distance learning algorithm not only allows us to incorporate multiple semantic features to evaluate the proximity between concepts, but also allows us to learn the metric function from personal preferences. Users can thus manually modify the taxonomies and, to some extent, teach the algorithm to predict their way of organizing the concepts. Moreover, by minimizing the overall semantic distances among concepts and restricting minimal semantic distances along a path, we find the best hierarchical structure to serve as the browsing taxonomy.

Our contributions include:

- A supervised learning mechanism to capture task-specific or personalized requirements for organizing a browsing taxonomy;

- A strategy to address path inconsistency due to word sense ambiguity and/or mixed perspectives;

- A general scheme to capture user inputs in taxonomy construction;

- A user study to evaluate the effectiveness of task-specific taxonomies for browsing activities.

2 Related Work

Document collection browsing has been studied by the Information Retrieval (IR) community as an alternative to the ranked-list representation of search results. The popular IR approaches include clustering (Cutting et al., 1992) and monothetic concept hierarchies (Sanderson and Croft, 1999; Lawrie et al., 2001; Kummamuru et al., 2004; Carpineto et al., 2009). Clustering approaches hierarchically cluster documents in a collection and label the clusters. Monothetic approaches organize the concepts into hierarchies and link documents to related concepts. Both approaches are mainly based on pure statistics, such as document frequency (Sanderson and Croft, 1999) and conditional probability (Lawrie et al., 2001). The major drawback of these purely statistical approaches is their neglect of the semantics among concepts. As a consequence, they often fail to produce semantically meaningful taxonomies.

The NLP community has extensively studied automatic taxonomy construction. Although traditional research on taxonomy construction focuses on extracting local relations between concept pairs (Hearst, 1992; Berland and Charniak, 1999; Ravichandran and Hovy, 2002; Girju et al., 2003; Etzioni et al., 2005; Pantel and Pennacchiotti, 2006; Kozareva et al., 2008), more recent efforts have been made to build full taxonomies. For example, Snow et al. (2006) proposed to estimate taxonomic structure by maximizing the overall likelihood of a taxonomy. Kozareva and Hovy (2010) proposed to connect local concept pairs by finding the longest path in a subsumption graph. Yang and Callan proposed the Minimum Evolution (ME) framework to model the semantic distance d(c_x, c_y) between concepts c_x and c_y as a weighted combination of various lexical, statistical, and semantic features, d(c_x, c_y) = Σ_j weight_j · feature_j(c_x, c_y), and to estimate the taxonomic structure by minimizing the overall semantic distances.


Researchers have also attempted to carve taxonomies out of existing ones. For example, Stoica and Hearst (2007) extracted a browsing taxonomy from hypernym relations within WordNet (Fellbaum, 1998).

To support browsing in arbitrary collections, in this paper we propose to incorporate task specification in a taxonomy. One way to achieve this is to define task-specific distances among concepts. Moreover, by controlling the distance scores among concepts, we can enforce path consistency in taxonomies. For example, when the distance between financial institute and river bank is large, the path financial institute→bank→river bank will be pruned and the concepts will be repositioned. Inspired by ME, we take a distance learning approach to deal with path consistency (Section 3) and task specification (Section 4) in taxonomy construction.

3 Build Structure-Optimized Taxonomies

This section presents how to automatically build taxonomies. We take two steps to build a browsing taxonomy for a given document collection: the first step extracts the concepts and the second organizes them. For concept extraction, we take a simple but effective approach: (1) We first parse the document collection and exhaustively extract nouns, noun phrases, and named entities that occur >5 times in the collection. (2) We then filter out part-of-speech errors and typos by a Web-based frequency test: we search for each candidate concept with the Google search engine and remove a candidate if it appears <4 times within the top 10 Google snippets. (3) We finally cluster similar concept candidates into groups by Latent Semantic Analysis (Bellegarda et al., 1996) and select the candidate with the highest tfidf value within a group to form the concept set C. Although our extraction algorithm is very effective, with 95% precision and 80% recall in a manual evaluation, C may still miss some important concepts for the collection. This can later be corrected by users interactively through adding new concepts (Section 4).
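For illustration only, a minimal Python sketch of this extraction step is given below. It is not the paper's implementation: it assumes spaCy and scikit-learn are available, clusters candidates by surface tf-idf similarity as a rough stand-in for the LSA-based grouping, and stubs out the Web-based frequency test (which in the paper queries Google snippets).

from collections import Counter
import spacy
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

nlp = spacy.load("en_core_web_sm")

def passes_web_test(candidate):
    # Hypothetical placeholder for the snippet-frequency filter (step 2).
    return True

def extract_concepts(documents, min_freq=5, n_groups=50):
    # (1) Exhaustively collect nouns, noun phrases, and named entities.
    counts = Counter()
    for doc in map(nlp, documents):
        for tok in doc:
            if tok.pos_ == "NOUN":
                counts[tok.text.lower()] += 1
        for chunk in doc.noun_chunks:
            counts[chunk.text.lower()] += 1
        for ent in doc.ents:
            counts[ent.text.lower()] += 1
    candidates = [c for c, n in counts.items() if n > min_freq]
    # (2) Filter out POS errors and typos with the (stubbed) Web frequency test.
    candidates = [c for c in candidates if passes_web_test(c)]
    # (3) Group similar candidates and keep the highest-tfidf member per group.
    tfidf = TfidfVectorizer().fit_transform(candidates)
    n_comp = max(1, min(100, tfidf.shape[1] - 1))
    reduced = TruncatedSVD(n_components=n_comp).fit_transform(tfidf)
    labels = KMeans(n_clusters=min(n_groups, len(candidates))).fit_predict(reduced)
    concepts = []
    for g in set(labels):
        members = [i for i, lab in enumerate(labels) if lab == g]
        best = max(members, key=lambda i: tfidf[i].sum())
        concepts.append(candidates[best])
    return concepts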

To organize the concepts in C into taxonomic structures, we extend the incremental clustering framework proposed by ME (Yang and Callan, 2009). In ME, concepts are inserted one at a time. At each insertion, a concept c_z is tried at the parent (or child) position of every existing node in the current taxonomy. The evaluation of the best position depends on the semantic distance between c_z and its temporary child (or parent) node and the semantic distances among all other concepts in the taxonomy. An advantage of ME is that it allows incorporating various constraints on the taxonomic structure. For example, ME can handle concept generality-specificity by learning different semantic distance functions for general concepts, which are located at upper levels, and specific concepts, which are located at lower levels in a taxonomy.

In this section, we introduce a new semantic distance learning method (Section 3.1) and extend ME by controlling path consistency (Section 3.2).

3.1 Estimating Semantic Distances

Pair-wise semantic distances among concepts build the foundation for taxonomy construction. ME models the semantic distance d(c_x, c_y) between concepts c_x and c_y as a linear combination of underlying feature functions. Similar to ME, we also assume that "there are some underlying feature functions that measure semantic dissimilarity for concepts and a good semantic distance is a combination of these features". Different from ME, we model the semantic distance d(c_x, c_y) between concepts (c_x, c_y) as a Mahalanobis distance (Mahalanobis, 1936):

d(c_x, c_y) = √( Φ(c_x, c_y)^T W^{-1} Φ(c_x, c_y) ),

where Φ(c_x, c_y) is the set of underlying feature functions {φ_k : (c_x, c_y)} with k = 1, ..., |Φ|, and W is the weight matrix, whose diagonal values weigh the various feature functions. We use the same set of features as proposed in ME.

The Mahalanobis distance is a general parametric function widely used in distance metric learning (Yang, 2006). It measures the dissimilarity between two random vectors of the same distribution with a covariance matrix W, which scales the data points from their original values by W^{1/2}. When only the diagonal values of W are taken into account, W is equivalent to assigning weights to different axes of the random vectors.

We choose the Mahalanobis distance for two reasons. (1) It is in a parametric form, so it allows us to learn a distance function by supervised learning and provides an opportunity to assign a different weight to each type of semantic feature. (2) When W is properly constrained to be positive semi-definite (PSD) (Bhatia, 2006), a Mahalanobis-formatted distance is guaranteed to satisfy non-negativity and the triangle inequality, which was not addressed in ME. As long as these two conditions are satisfied, one may learn other forms of distance functions to represent a semantic distance.
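As a small illustration (not from the paper), the Mahalanobis form can be computed as follows, assuming the feature functions have already been evaluated into a vector φ = Φ(c_x, c_y) and W is an invertible PSD weight matrix, e.g. a diagonal matrix of feature weights.

import numpy as np

def mahalanobis_distance(phi, W):
    # d(cx, cy) = sqrt( phi^T W^{-1} phi ) for the feature vector phi = Φ(cx, cy).
    phi = np.asarray(phi, dtype=float)
    return float(np.sqrt(phi @ np.linalg.solve(W, phi)))

# A diagonal W simply weighs each semantic feature axis separately.
W = np.diag([2.0, 0.5, 1.0])
print(mahalanobis_distance([0.3, 0.1, 0.6], W))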

We can estimate W by minimizing the squared errors between the training semantic distances d and the expected values d̂. We also need to constrain W to be PSD to satisfy the triangle inequality and non-negativity. The objective function for semantic distance estimation is:

min_W Σ_{x=1}^{|C|} Σ_{y=1}^{|C|} ( d(c_x, c_y) − √( Φ(c_x, c_y)^T W^{-1} Φ(c_x, c_y) ) )^2,  subject to W ⪰ 0.   (1)

In our implementation, we used SeDuMi (Sedumi, 2011) and YALMIP (Yalmip, 2011) to solve the resulting semi-definite program (SDP).
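For readers without a MATLAB/SeDuMi setup, a rough Python sketch of the learning step is shown below. It is only an approximation of Eq. (1): it learns M = W^{-1} directly, fits squared distances so that each residual is affine in M (a convex surrogate of the paper's exact objective), and relies on cvxpy with a conic solver; the function and variable names are ours, not the paper's.

import numpy as np
import cvxpy as cp

def learn_metric(pairs, k):
    # pairs: (phi, d) tuples with phi a length-k feature vector Φ(cx, cy) and
    # d the training semantic distance for that concept pair.
    M = cp.Variable((k, k), PSD=True)          # M plays the role of W^{-1}
    residuals = [d ** 2 - np.asarray(phi, float) @ M @ np.asarray(phi, float)
                 for phi, d in pairs]
    cp.Problem(cp.Minimize(cp.sum_squares(cp.hstack(residuals)))).solve()
    return M.value   # predict new distances as sqrt(phi^T M phi)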

To generate the training semantic distances, we collected 100 hypernym taxonomy fragments from WordNet (Fellbaum, 1998) and ODP. The semantic distance for a concept pair (c_x, c_y) in a training taxonomy fragment is generated by assuming every edge has weight 1 and summing the edge weights along the shortest path from c_x to c_y in the fragment. In Section 4, we will show how to use user inputs as training data to capture task specifications in taxonomy construction.
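A minimal sketch of this training-distance generation, assuming each fragment is given as a list of parent-child edges and using networkx for shortest paths (an illustrative choice, not the paper's code):

import itertools
import networkx as nx

def training_distances(edges):
    # edges: (parent, child) pairs from a WordNet/ODP taxonomy fragment.
    # Every edge has weight 1, so the training distance of a concept pair is
    # the length of the shortest path between them in the fragment.
    g = nx.Graph(edges)
    lengths = dict(nx.all_pairs_shortest_path_length(g))
    return {(a, b): lengths[a][b] for a, b in itertools.combinations(g.nodes, 2)}

# training_distances([("bird", "penguin"), ("bird", "sparrow")]) gives
# d(bird, penguin) = 1, d(bird, sparrow) = 1, d(penguin, sparrow) = 2.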

3.2 Enforcing Path Consistency

In ME, the main taxonomy structure optimization framework is based on minimizing the overall semantic distance among all concepts in the taxonomy, under the minimum evolution assumption. We extend the framework by introducing another optimization objective: path consistency. The idea is that on any root-to-leaf path in a taxonomy, all concepts should be about the same topic or the same perspective. Within a root-to-leaf path, the concepts need to be coherent no matter how far apart they are. This suggests that the sum of semantic distances along a good path should be small.

Algorithm: Automatic Taxonomy Optimization.
W = argmin_W Σ_x Σ_{y=1}^{|N(c^tr_x)|} ( d(c^tr_x, c^tr_y) − √( Φ(c^tr_x, c^tr_y)^T W^{-1} Φ(c^tr_x, c^tr_y) ) )^2;
foreach c_z ∈ C \ S
    S ← S ∪ {c_z};
    if W ⪰ 0
        d(c_z, ·) = √( Φ(c_z, ·)^T W^{-1} Φ(c_z, ·) );
        R ← R ∪ { argmin_{R(c_z, ·)} ( λ obj_ME + (1 − λ) obj_path ) };
Output T(S, R)

Figure 1: An algorithm for taxonomy structure optimization with path consistency control. C denotes the entire concept set, S the current concept set, and R the current relation set. N(c^tr_x) is the neighborhood of a training concept c^tr_x, including its parent and child(ren). R(c_z, ·) indicates the set of relations between a new concept c_z and all other existing concepts. T is the taxonomy with concept set S and relation set R.

Therefore, we propose to minimize the sum of semantic distances along a root-to-leaf path. In particular, when adding a new concept c_z to an existing browsing hierarchy T, we try it at different positions in T. At each temporary position, we calculate the sum of the semantic distances along the root-to-leaf path P_{c_z} that contains the new concept c_z. The path consistency objective is given by:

obj_path = min_{P_{c_z}} Σ_{c_x, c_y ∈ P_{c_z}, x<y} d(c_x, c_y)   (2)

where x < y fixes an ordering of the concepts to avoid counting the same pair-wise distance twice.

To model path consistency in taxonomy construction, we introduce a Pareto coefficient λ ∈ [0, 1] to control the contributions from obj_ME, the overall semantic distance minimization objective proposed in ME, and obj_path, the path distance minimization objective. The optimization is:

min λ obj_ME + (1 − λ) obj_path   (3)

where obj_ME = | Σ_{c_x, c_y ∈ C_n ∪ {c_z}, x<y} d(c_x, c_y) − Σ_{c_x, c_y ∈ C_n, x<y} d(c_x, c_y) |, 0 ≤ λ ≤ 1, and C_n is the concept set after the nth concept has been added. Empirically, we set λ = 0.8.

Figure 1 outlines our greedy algorithm for building taxonomies with path consistency control. Each time a new concept arrives, the algorithm first estimates its semantic distances based on the W learned from the training data, then finds the optimal position for the concept by minimizing the overall semantic distances and path inconsistency, and gradually grows the structure into a full taxonomy.

The order in which concepts are added may affect the final taxonomy structure. We hence insert concepts in an arbitrary order, use 10 random restarts with different initial concepts, and pick the taxonomy that minimizes both objectives among all candidate structures.
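The following Python sketch illustrates the flavor of this greedy construction. It is a simplification, not the paper's algorithm: it only tries attaching each new concept as a child of an existing node (the paper also tries parent positions), scores candidate positions with λ·obj_ME + (1−λ)·obj_path, and keeps the best of several random insertion orders; dist(c_x, c_y) is assumed to come from the learned metric.

import random

def build_taxonomy(concepts, dist, lam=0.8, restarts=10):
    # Greedy insertion with random restarts (cf. Figure 1, simplified).
    best_tree, best_score = None, float("inf")
    for _ in range(restarts):
        order = random.sample(list(concepts), len(concepts))
        root, rest = order[0], order[1:]
        parent, score = {root: None}, 0.0
        for cz in rest:
            cand = min(parent, key=lambda p: position_cost(parent, p, cz, dist, lam))
            score += position_cost(parent, cand, cz, dist, lam)
            parent[cz] = cand
        if score < best_score:
            best_tree, best_score = parent, score
    return best_tree

def position_cost(parent, cand, cz, dist, lam):
    obj_me = sum(dist(cz, c) for c in parent)          # added overall distance
    obj_path = path_cost(parent, cand, cz, dist)       # root-to-cz path coherence
    return lam * obj_me + (1 - lam) * obj_path

def path_cost(parent, cand, cz, dist):
    # Sum of pair-wise distances along the root-to-leaf path ending at cz.
    path, node = [cz], cand
    while node is not None:
        path.append(node)
        node = parent[node]
    return sum(dist(a, b) for i, a in enumerate(path) for b in path[i + 1:])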

4 Incorporating Task Specification

This section studies how to incorporate user-defined task specifications in taxonomy construction. Although the automatic algorithm proposed in Section 3 is able to organize most concepts well for a given document collection, it does not yet address the issue of mixed perspectives in taxonomy construction. For concepts with multiple perspectives, we need to decide which perspective is more appropriate for the browsing taxonomy. This task-specific requirement can only be captured by the user/constructor who builds and uses the browsing taxonomy. Moreover, the automatic algorithm relies on training data from WordNet and ODP, which are known for imperfect term organizations such as unbalanced granularity among terms at the same level. To correct the wrong relations learned from imperfect training data, we propose to utilize user inputs in the learning process.

In particular, we formulate taxonomy construction as a user-teaching-machine-learning process. To guide how the concepts are organized, a user trains the supervised distance learning model via a taxonomy construction interface that allows the user to intuitively modify a taxonomy. The interface supports editing actions such as dragging and dropping, adding, deleting, and renaming nodes. When a user puts c_x under c_y, i.e. c_x → c_y, this action indicates that the user wants the relation represented by c_x → c_y to be true in this taxonomy. We do not expect users to make all the edits. In a human-computer-interaction cycle, a user is not restricted to giving a certain number of edits. Based on our user study (Section 5.5), the average number of edits per interaction is 3.6, which can be achieved with ease by most users.

The algorithm shown in Figure 2 provides the

Algorithm: Interactive Taxonomy Construction.
1. T(S, R) = CreateInitialTaxonomy();
2. U(0) = {unmodified concepts} = C \ S, G(0) = {modified concepts} = S, M(0) = ∅, i = 0;
3. while (not Satisfied) or U(i) ≠ ∅
4.     M(i) = CollectManualGuidance(G(i), U(i));
5.     W(i) = LearnDistanceMetricFunction(M(i));
6.     D(i) = PredictDistanceScores(W(i), U(i));
7.     (G(i+1), U(i+1)) = UpdateTaxonomy(D(i), U(i), G(i));
8.     i = i + 1;
9. Output G(i) as the taxonomy.

Figure 2: Interactive taxonomy construction procedure.

pseudocode for the interactive taxonomy construction procedure. It starts with automatic construction of an initial taxonomy using the techniques presented in Section 3 (Line 1). We then capture the user inputs as manual guidance (Line 4) and use it to adjust the distance learning model (Line 5), make new predictions for the semantic distances of other concepts (Line 6), and organize those concepts to agree with the user, updating the taxonomy accordingly (Line 7). Line 2 initializes three variables, the unmodified concepts U, the modified concepts G, and the manual guidance M, indexed by the iteration number i. The process iterates until the user is satisfied with the taxonomy's organization (Line 3).

Learning and predicting distances have been presented in Section 3.1. In this section, we present how to capture manual guidance (Section 4.1) and how to update the taxonomy accordingly (Section 4.2).

4.1 Manual Guidance as the Training Data

Taxonomies are tree-structured. It is not trivial to model a taxonomy, especially changes in a taxonomy, and feed that into a learning algorithm. In this section, we propose a general scheme to capture changes, i.e., user inputs during interactions, in taxonomy construction.

We propose to convert a taxonomy into matrices of neighboring nodes. We compare the changes between a series of snapshots of the changing taxonomy to identify the user inputs. Specifically, before a user starts editing in an interaction cycle, we represent the organization of concepts as a before matrix; likewise, after the user finishes all edits in one cycle, we represent the new organization of concepts as an after matrix. For both matrices, the (x, y)th entry indicates whether (or how confidently) a relation r(c_x, c_y) is true. r could be any type of relation


Figure 3: An example taxonomy before and after human edits (concepts unchanged; relation type = sibling). [Figure: the before and after trees over the concepts Movie awards, Oscars, Best supporting actors, Best pictures, and George Clooney, together with the corresponding before and after matrices.]

between the concepts. Figure 3 shows an example taxonomy's before and after matrices.

We define the manual guidance M as the submatrix consisting of those entries of the after matrix B at which the before matrix A and the after matrix B differ. Formally,

M = B[r; c],  r = {i : b_ij − a_ij ≠ 0},  c = {j : b_ij − a_ij ≠ 0}   (4)

where a_ij is the (i, j)th entry in A, b_ij is the (i, j)th entry in B, r indicates the rows, and c indicates the columns.

Note that manual guidance is not simply the matrix difference between A and B. It is part of the after matrix, because it is the after matrix that indicates where the user wants the concept hierarchy to develop. The manual guidance for the example shown in Figure 3 is M = B[2, 3, 4; 2, 3, 4]:

                  Oscars   Best supporting   Best picture
Oscars               1            0                0
Best supporting      0            1                1
Best picture         0            1                1

When the user adds or deletes concepts, we expand the rows and columns of A and B, filling 0 for non-diagonal entries and 1 for diagonal entries. The expanded before and after matrices A′ and B′ are used in the calculation.

For taxonomies with concept changes, we define the manual guidance with concept set change, M_change, as the submatrix consisting of those entries of the after matrix B at which the expanded before matrix A′ and the expanded after matrix B′ differ. Note that the concepts corresponding to these entries must exist in the unexpanded set of concepts. Formally,

M_change = B[r′; c′],  r′ = {i : b′_ij − a′_ij ≠ 0, concept c_i ∈ C_B},  c′ = {j : b′_ij − a′_ij ≠ 0, concept c_j ∈ C_B}   (5)

where a′_ij is the (i, j)th entry in A′, b′_ij is the (i, j)th entry in B′, C_B is the set of concepts in the unexpanded after matrix B, r′ indicates the rows, and c′ indicates the columns.

Based on the manual guidance M, we can create training data for the supervised distance learning algorithm (Section 3.1). In particular, we transform the manual guidance into a distance matrix D = 1 − M, which is used as the training data. The learning algorithm is then able to learn a model which best preserves the regularity defined by the task and the user. The difference is that the training data here is derived from manual guidance, whereas the automatic algorithm uses training data from WordNet and ODP.
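A small numpy sketch of this step (our illustration, assuming the concept set is unchanged between the two snapshots): it extracts the manual-guidance submatrix M of Eq. (4) from the before and after matrices and converts it into the training distance matrix D = 1 − M.

import numpy as np

def manual_guidance(A, B):
    # A, B: before/after relation matrices over the same concepts; entry (x, y)
    # is 1 if relation r(cx, cy) holds in that snapshot and 0 otherwise.
    A, B = np.asarray(A), np.asarray(B)
    changed = (B - A) != 0
    rows = np.where(changed.any(axis=1))[0]
    cols = np.where(changed.any(axis=0))[0]
    M = B[np.ix_(rows, cols)]     # Eq. (4): a submatrix of the *after* matrix
    D = 1 - M                     # training distances for the metric learner
    return rows, cols, M, D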

4.2 Update the Taxonomy

According to the algorithm shown in Figure 2, after learning W(i), the weight matrix at the ith iteration, from the manual guidance, we can use it to predict the pair-wise semantic distances for the unmodified concepts and further group them in the taxonomy.

When the pair-wise distance score for a concept pair (c_l, c_m) is small (<0.5), we consider the relation between the concepts to be true; when it is large (≥0.5), false. How to organize concepts whose relations are true is decided, again, by the relation type in the distance matrix. If r is "sibling", c_l and c_m are put under the same parent. If r is "is-a", c_m is put under c_l as one of c_l's children. The user interface then presents the updated taxonomy to the user and waits for the next round of manual guidance.
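A sketch of this update rule, with illustrative (hypothetical) taxonomy-editing helpers; predict(c_l, c_m) is the distance predicted by the learned metric:

def update_taxonomy(taxonomy, unmodified_pairs, predict, relation_type):
    # A predicted distance < 0.5 means the relation between the pair is taken
    # to be true; the relation type then decides how the concepts are placed.
    for cl, cm in unmodified_pairs:
        if predict(cl, cm) >= 0.5:
            continue                                  # relation judged false
        if relation_type == "sibling":
            taxonomy.put_under_same_parent(cl, cm)    # hypothetical helper
        elif relation_type == "is-a":
            taxonomy.add_child(parent=cl, child=cm)   # hypothetical helper
    return taxonomy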

Since only a few changes are made during each human-computer interaction, the learning model may suffer from overfitting and the taxonomic structure may change too rapidly. To avoid these issues caused by too little manual guidance, we employ background training taxonomy fragments from WordNet and ODP to smooth the learning models and achieve lower variance.


5 Evaluation

We conduct experiments and a user study to evaluate the effectiveness of our approach. We have two goals for the evaluation. One is to evaluate how the browsing taxonomies constructed by our approach compare with those constructed by other baseline systems. The other is to investigate how well our system learns task specifications from user supervision.

5.1 Datasets

The datasets used in the evaluation are collections of Web documents crawled for complex search tasks. For each task, we created the dataset by submitting 4 to 5 queries to two search engines, bing.com and google.com, and collecting the returned Web documents. For example, the queries "trip to DC", "Washington DC", "DC", and "Washington" were submitted for the task "planning a trip to DC". In total, we created 50 Web datasets on topics such as find a good kindergarten, purchase a used car, plan a trip to DC, how to make a cake, find a good wedding videographer, write a survey paper for health care systems, find the best deals for a Mother's day gift, write a survey paper for social network, write a survey paper for EU's finance, and write a survey paper for information technology.

Around 1000 Web documents are collected for each dataset. We filter out spam and advertisements and then retrieve more relevant Web documents to bring the total back to 1000. However, not all topics can retrieve 1000 documents; over the 50 datasets, the average number of documents is 988.5. The average number of unique words in a dataset is 698,875.

5.2 Comparing with Baseline Systems

We compare the following 5 systems.

• Subsumption: the automatic algorithm proposed by (Sanderson and Croft, 1999), the most effective state-of-the-art browsing hierarchy construction technique as reported by (Lawrie et al., 2001).

• KH: the automatic taxonomy construction algorithm proposed by (Kozareva and Hovy, 2010).

• ME: the automatic taxonomy construction algorithm proposed by (Yang and Callan, 2009). This framework performs neither path consistency control nor learning from users.

• DistOpt: our automatic taxonomy construction algorithm with path consistency control.

• PDistOpt: our interactive approach with human supervision. The process starts from a flat list of concepts; the user builds the browsing taxonomy from the list in a user study (Section 5.5).

5.3 Browsing Effectiveness

A popular measure for evaluating the quality of browsing taxonomies is the expected mutual information measure (EMIM (Lawrie et al., 2001)). It calculates the mutual information between the language model of a taxonomy T and the language model of a document collection Z. It is defined as:

I(C; V) = Σ_{c ∈ C, v ∈ V} P(c, v) log ( P(c, v) / (P(c) P(v)) ),

where P(c, v) = Σ_{d ∈ Z} P(d) P(c|d) P(v|d), C is the set of concepts in T, V is the set of non-stopwords in Z, and d is a document in Z. EMIM only evaluates the content of a browsing taxonomy, not its structure. However, it is still popularly used to indicate how representative a browsing taxonomy is of a document collection.
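For concreteness, a numpy sketch of the EMIM computation (our illustration): it assumes P(c|d) and P(v|d) are given as matrices, weights documents uniformly so that P(d) = 1/|D|, and takes P(c) and P(v) as marginals of the joint.

import numpy as np

def emim(p_c_given_d, p_v_given_d):
    # p_c_given_d: |D| x |C| matrix of P(c|d); p_v_given_d: |D| x |V| of P(v|d).
    n_docs = p_c_given_d.shape[0]
    p_cv = (p_c_given_d.T @ p_v_given_d) / n_docs          # P(c, v)
    p_c = p_cv.sum(axis=1, keepdims=True)                  # P(c)
    p_v = p_cv.sum(axis=0, keepdims=True)                  # P(v)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(p_cv > 0, p_cv * np.log(p_cv / (p_c * p_v)), 0.0)
    return float(terms.sum())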

Table 1 shows the EMIM of the browsing taxonomies constructed by the five systems under evaluation. Based on the mean EMIM over the 50 datasets, we can rank the systems in descending order of EMIM as PDistOpt >> DistOpt >> ME > KH >> Subsumption.2 It shows that DistOpt is the best performing automatic algorithm for generating browsing taxonomies. DistOpt is 109% more effective than ME (statistically significant; p-value<.001, t-test), 159% more effective than KH (p-value<.001, t-test), and 17 times more effective than Subsumption (p-value<.001, t-test). It strongly suggests that our techniques are

2 >> indicates a statistically significant difference between the left and right hand sides (p < .001, t-test), and > indicates moderate statistical significance between the left and right hand sides (p < .05, t-test). We will use the same symbols throughout the remainder of this paper.


Table 1: Expected Mutual Information (in 1000*EMIM).

Example dataset    Subs.   KH      ME     DistOpt   PDistOpt
kindergarten       0.4     3.8     3.9    5.6       7.3
health care        0.5     2.8     3.1    7.8       8.3
used car           0.1     0.2     0.1    2.8       3.6
trip to DC         0.2     4.3     4.5    6.4       6.8
finance            0.01    0.01    0.1    0.6       0.6
gift               0.2     1.2     1.2    3.8       4.7
social network     0.1     1.5     1.3    2.4       3.2
information        0.3     1.9     2.3    3.5       4.9
cake               0.2     1.2     3.1    6.6       6.8
videographer       0.4     1.8     1.6    4.9       5.6
Mean of 50 sets    0.24    1.7     2.1    4.4       5.2

more effective than the state-of-the-art systems in constructing browsing taxonomies.

Moreover, Table 1 shows that the PDistOpt taxonomies are 18% more effective than the DistOpt taxonomies in terms of EMIM. The result is also statistically significant (p-value<.01, t-test). It indicates that incorporating user preferences in browsing taxonomy construction produces even more effective browsing taxonomies than all automated methods.

Another popular evaluation measure3 for browsing effectiveness is reach time (Carpineto et al., 2009). It is defined as:

t_reach = (1 / |R|) Σ_{d_i ∈ R} ( L(c_i) + p_i ),

where R is the set of relevant documents, c_i is the concept that connects to a relevant document d_i, L(c_i) is the path length from the root to reach c_i, and p_i is the position at which d_i appears in the document cluster associated with c_i. Reach time evaluates both the content and the structure of a browsing taxonomy. This measure needs relevance judgements about a query for the documents organized by the taxonomies. We obtained the relevance judgements by taking the majority votes from a user study involving 29 subjects, followed by expert reviews. Three experts manually examined the majority votes and reached agreement on all relevance judgements.
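A direct sketch of this computation, assuming we know, for each relevant document, the depth of the concept node it is linked to and its position in that node's document cluster:

def reach_time(relevant_docs, depth_of_concept, rank_in_cluster):
    # t_reach = (1/|R|) * sum over relevant documents of ( L(c_i) + p_i ).
    total = sum(depth_of_concept[d] + rank_in_cluster[d] for d in relevant_docs)
    return total / len(relevant_docs)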

Table 2 reports the reach time for each system. Based on the mean reach time over the 50 datasets, we obtain a ranking of the systems similar to that suggested by EMIM. The ranking based on reach time

3 Other proposed measures include coverage and compactness (Kummamuru et al., 2004).

Table 2: Reach time.

Example dataset    Subs.   KH      ME     DistOpt   PDistOpt
kindergarten       14.4    9.8     9.9    8.7       4.3
health care        12.3    9.8     6.8    4.5       3.3
used car           15.4    12.4    10.2   8.7       7.6
trip to DC         11.2    10.3    9.8    8.7       5.8
finance            24.5    18.3    19.7   18.7      15.6
gift               11.2    8.4     7.7    5.6       5.4
social network     14.3    9.8     7.8    7.6       6.8
information        10.6    9.5     8.8    8.9       6.7
cake               8.9     4.8     4.5    3.4       3.2
videographer       9.5     8.8     7.6    6.9       4.5
Mean of 50 sets    14.2    12.2    9.8    7.2       5.2

Figure 4: Path error rate. [Figure: path error rates on a 0-1 scale for Subsumption, ME, KH, and DistOpt.]

is: PDistOpt >> DistOpt >> ME > KH >> Subsumption. It shows that the best performing automatic system is DistOpt, which on average produces taxonomies in which a relevant document can be reached by visiting only 7.2 nodes: 5.2 non-leaf concepts and 2 documents in the leaf cluster on average. For finding all relevant documents in a collection of around 1000 documents, this reach time is very fast. The interactive PDistOpt unsurprisingly gives an even better reach time, 5.2 nodes on average.

5.4 Path Consistency

To evaluate how well path consistency is handled, we compare the path error rate produced by our approach and by the other baseline systems. This evaluation is only applied to the automatic algorithms.

The path error is defined as the average number of wrong ancestor-descendant pairs in a taxonomy. It only applies to concepts that are not immediately connected. It is judged and calculated as follows. Three human assessors manually evaluated the path errors by (1) gathering the paths by performing a depth-first traversal of the taxonomy from the root concept; (2) along each path, counting the number of wrong ancestor-descendant pairs; and (3) summing up


Figure 5: Perceived browsing effectiveness. [Figure: mean ratings on a 1-5 scale for Subsumption, KH, ME, DistOpt, and PDistOpt.]

the errors that all assessors agree on and normalizing the sum by the taxonomy size.

Figure 4 shows the path error rates of all the automatic algorithms under evaluation. We can see that DistOpt produces the fewest path errors. The algorithms can be ranked by their ability to handle path consistency as DistOpt >> ME >> KH >> Subsumption. DistOpt statistically significantly reduces path errors relative to ME, which uses no path consistency control, by 500% (p-value<.001, t-test). It strongly indicates that our technique is effective in maintaining path consistency. We conclude that DistOpt best handles path consistency among all the systems under evaluation.
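A rough sketch of the path error computation under our reading of this procedure: it assumes the taxonomy is given as a child-to-parent map and that wrong_pairs already contains the ancestor-descendant pairs all assessors agreed were wrong; only pairs that are not immediately connected are counted.

def path_error_rate(parent, wrong_pairs, taxonomy_size):
    # parent: child -> parent map; the root maps to None (or is absent).
    errors = 0
    for node in parent:
        ancestor = parent.get(parent.get(node))   # skip the immediate parent
        while ancestor is not None:
            if (ancestor, node) in wrong_pairs:
                errors += 1
            ancestor = parent.get(ancestor)
    return errors / taxonomy_size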

5.5 User Study

Besides the objective evaluations, we conducted a user study consisting of two parts: a qualitative comparison of the systems under evaluation, and interactive construction of personalized browsing taxonomies using our taxonomy construction user interface.

Twenty-nine graduate and undergraduate students (thirty subjects initially; one was excluded because of incomplete data entry) from various majors at two universities participated in the study. They were all familiar with the use of computers and highly proficient in English. Each user study session lasted 4 hours.

In the first half of the user study, the participants were first introduced to the taxonomy construction user interface for about 10 minutes to get familiar with its functions. After that, the participants performed an exercise task which lasted about 5 minutes and then started the real tasks. For each dataset, the participants were asked to interactively work with PDistOpt to build browsing taxonomies.

Once the real tasks were done, the participants spent 5 minutes answering a questionnaire regarding their experience and opinions.

In the second half of the user study, we asked the participants to use and compare the provided browsing taxonomies with the following task in mind.

Imagine you have a task [task name]. You use a browsing taxonomy designed for the collection of Web documents about this task. Use the browsing taxonomy to find all useful topics for your task. Identify at least one document for each topic.

For each dataset, we asked the participants to rate the browsing taxonomies built by the systems under evaluation by answering the following question about perceived browsing effectiveness: "How well did the browsing taxonomy help you to complete the task?". Ratings on a 5-point Likert-type scale, ranging from "very good" (5), "good" (4), "fair" (3), "bad" (2), to "trash" (1), are used to rate the browsing effectiveness perceived by the participants.

5.5.1 Perceived Browsing Effectiveness

Figure 5 shows the mean and 95% confidence interval of the perceived browsing effectiveness for the browsing taxonomies constructed by the systems under evaluation. The perceived browsing effectiveness can be ranked in descending order as PDistOpt >> DistOpt > ME >> KH > Subsumption. PDistOpt shows the highest mean perceived browsing effectiveness, as high as 4.4. Such a high rating shows that browsing taxonomies with personalization can satisfy users' information needs well and are perceived by users as very effective for browsing.

5.5.2 Accuracy of System Predictions

When a user provided manual guidance to the interactive system, during each human-computer interaction cycle the system made predictions based on the user's edits. He or she could directly judge the correctness of these machine-predicted modifications on the fly by selecting "yes" or "no" from the "Accept the change?" menu. Note that these were personalized tasks and the predictions were evaluated by the user according


Table 3: Accuracy of system predictions.

                                  Max     Min     Avg
Accuracy of system predictions    0.98    0.92    0.94

Table 4: Perceived learning ability.

                              Max           Min       Avg
Perceived learning ability    4.2           2.8       3.61
Which dataset                 health care   finance   -

to his/her own standard. We calculate the accuracy of system predictions as:

Accuracy = (1 / I) Σ_{i=1}^{I} (number of accepted predictions in the ith cycle) / (number of predictions in the ith cycle),

where I is the total number of human-computer interaction cycles when constructing a browsing taxonomy. A high accuracy indicates that the system learns well from user edits. This evaluation is only applied to PDistOpt.

Table 3 shows that for all datasets, the mean accuracy of the system predictions is above 0.92. The average value is 0.94. This high accuracy clearly demonstrates that the system successfully learns from a user and makes highly accurate predictions about how the user would organize the concepts.

5.5.3 Perceived Learning Ability

Immediately after constructing a browsing taxonomy, a participant was asked to rate how well the system learned from his/her edits. The question was "How well did the system appear to learn your method of organizing the concepts?". We also used the 5-point Likert-type scale to rate this perceived system learning ability. This evaluation is only applied to PDistOpt.

Table 4 shows the maximum, minimum, and average responses for perceived system learning ability. The mean perceived learning ability is 3.61, with a standard deviation of 0.45. It suggests that the learning ability of the system was only perceived as moderately good. This result contradicts the conclusion that we drew based on the more objective measure, accuracy of system predictions (Section 5.5.2).

We further investigated why the participants were only moderately satisfied with the system's learning ability. From the after-session questionnaire, we found that participants thought that some datasets, such as "finance", were more difficult than others, such as "health care". For example, the dataset "finance" was considered by all participants as "very difficult" while "health care" was considered "very easy". The participants also complained that they were not familiar with the difficult datasets. It is interesting that when a dataset is less familiar to the users, the system was perceived as performing badly too. This may suggest that when people are not familiar with the tasks, they provide lower-quality edits, the system learns from lower-quality training data, and in the end the users perceive the output as poor system learning ability.

6 Conclusion

Document collection browsing is another common use of taxonomies. Given an arbitrary collection, a taxonomy must suit the specific domain in order to support browsing. This paper explores techniques to quickly derive task-specific taxonomies supporting browsing in arbitrary document sets. In particular, we uniquely employ pair-wise semantic distance as an entry point to incrementally build browsing taxonomies. The supervised distance learning algorithm not only allows us to incorporate multiple semantic features to evaluate the proximity between concepts, but also allows us to learn the metric function from personal preferences. Users can thus manually modify the taxonomies and, to some extent, teach the algorithm to predict their way of organizing the concepts. Moreover, by minimizing the overall semantic distances among concepts and restricting minimal semantic distances along a path, we find the best hierarchical structure to serve as the browsing hierarchy. This guarantees that semantically close concepts are put together, so users have a good idea of why the concepts are grouped. It greatly increases the interpretability of a constructed browsing hierarchy compared with existing approaches, and makes our approach more flexible and more general for effectively creating browsing taxonomies that support more complicated and more realistic tasks, such as Web information triage.

Acknowledgments

The author sincerely thanks Prof. Jamie Callan for in-depth discussions about the research and the anonymous reviewers for valuable comments on this paper.


References

J. R. Bellegarda, J. W. Butzberger, Yen-Lu Chow, N. B. Coccaro, and D. Naik. 1996. A novel word clustering algorithm based on latent semantic analysis. In Proceedings of the 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '96), Volume 01, pages 172–175, Washington, DC, USA. IEEE Computer Society.

Matthew Berland and Eugene Charniak. 1999. Finding parts in very large corpora. In Proceedings of the 27th Annual Meeting for the Association for Computational Linguistics (ACL 1999).

Rajendra Bhatia. 2006. Positive Definite Matrices (Princeton Series in Applied Mathematics). Princeton University Press, December.

Claudio Carpineto, Stefano Mizzaro, Giovanni Romano, and Matteo Snidero. 2009. Mobile information retrieval with search results clustering: Prototypes and evaluations. Journal of the American Society for Information Science and Technology (JASIST), pages 877–895.

Duen Horng Chau, Aniket Kittur, Jason I. Hong, and Christos Faloutsos. 2011. Apolo: making sense of large network data by combining rich user interaction and machine learning. In CHI, pages 167–176.

Douglass R. Cutting, David R. Karger, Jan O. Pedersen, and John W. Tukey. 1992. Scatter/Gather: A cluster-based approach to browsing large document collections. In Proceedings of the Fifteenth Annual ACM Conference on Research and Development in Information Retrieval (SIGIR 1992).

Oren Etzioni, Michael Cafarella, Doug Downey, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, and Alexander Yates. 2005. Unsupervised named-entity extraction from the web: an experimental study. Artificial Intelligence, 165(1):91-134, June.

Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. MIT Press.

Roxana Girju, Adriana Badulescu, and Dan Moldovan. 2003. Learning semantic constraints for the automatic discovery of part-whole relations. In Proceedings of the Human Language Technology Conference/Annual Conference of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL 2003).

Sanda M. Harabagiu, Steve J. Maiorano, and Marius A. Pasca. 2003. Open-domain textual question answering techniques. Natural Language Engineering 9(3): 1-38.

Marti A. Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th International Conference on Computational Linguistics (COLING 1992).

E. H. Hovy. 2002. Comparing sets of semantic relations in ontologies. In R. Green, C. A. Bean, and S. H. Myaeng, editors, The Semantics of Relationships: An Interdisciplinary Perspective. Dordrecht: Kluwer.

Zornitsa Kozareva and Eduard Hovy. 2010. A semi-supervised method to learn and construct taxonomies using the web. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1110–1118, Cambridge, MA, October. Association for Computational Linguistics.

Zornitsa Kozareva, Ellen Riloff, and Eduard Hovy. 2008. Semantic class learning from the web with hyponym pattern linkage graphs. In Proceedings of the 46th Annual Meeting for the Association for Computational Linguistics (ACL 2008).

Krishna Kummamuru, Rohit Lotlikar, Shourya Roy, Karan Singal, and Raghu Krishnapuram. 2004. A hierarchical monothetic document clustering algorithm for summarization and browsing search results. In Proceedings of the 13th Conference on World Wide Web (WWW 2004), page 658.

Dawn Lawrie, W. Bruce Croft, and Arnold Rosenberg. 2001. Finding topic words for hierarchical summarization. In Proceedings of the 24th Annual ACM Conference on Research and Development in Information Retrieval (SIGIR 2001), pages 349–357.

LCSH. 2011. Library of Congress Subject Headings. http://www.loc.gov/.

P. C. Mahalanobis. 1936. On the generalised distance in statistics. In Proceedings of the National Institute of Sciences of India, 2(1): 49–55.

ODP. 2011. Open Directory Project. http://www.dmoz.org/.

Patrick Pantel and Marco Pennacchiotti. 2006. Espresso: Leveraging generic patterns for automatically harvesting semantic relations. In Proceedings of the 44th Annual Meeting for the Association for Computational Linguistics (ACL 2006).

Deepak Ravichandran and Eduard Hovy. 2002. Learning surface text patterns for a question answering system. In Proceedings of the 40th Annual Meeting for the Association for Computational Linguistics (ACL 2002).

Mark Sanderson and W. Bruce Croft. 1999. Deriving concept hierarchies from text. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 1999).

Sedumi. 2011. http://sedumi.mcmaster.ca.

Rion Snow, Daniel Jurafsky, and Andrew Y. Ng. 2006. Semantic taxonomy induction from heterogenous evidence. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (ACL/COLING 2006).

Emilia Stoica and Marti A. Hearst. 2007. Automating creation of hierarchical faceted metadata structures. In Proceedings of the Human Language Technology Conference (NAACL-HLT).

Idan Szpektor, Hristo Tanev, Ido Dagan, and Bonaventura Coppola. 2004. Scaling web-based acquisition of entailment relations. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2004).

Yalmip. 2011. http://users.isy.liu.se/johanl/yalmip.

Hui Yang and Jamie Callan. 2008. Ontology generation for large email collections. In Proceedings of the 8th National Conference on Digital Government Research (Dg.O 2008).

Hui Yang and Jamie Callan. 2009. A metric-based framework for automatic taxonomy induction. In Proceedings of the 47th Annual Meeting for the Association for Computational Linguistics (ACL 2009).

Liu Yang. 2006. Distance metric learning: A comprehensive survey. http://www.cs.cmu.edu/~liuy/frame_survey_v2.pdf.

Ka-Ping Yee, Kirsten Swearingen, Kevin Li, and Marti Hearst. 2003. Faceted metadata for image search and browsing. In Human Factors in Computing Systems. ACM.