
Recognizing Citations in Public Comments

Journal of Information Technology & Politics, Vol. 5(1) 2008. Available online at http://jitp.haworthpress.com

© 2008 by The Haworth Press. All rights reserved. doi:10.1080/19331680802153683


Jaime Arguello, Jamie Callan, and Stuart Shulman

ABSTRACT. Notice and comment rulemaking is central to how U.S. federal agencies craft new regulation. E-rulemaking, the process of soliciting and considering public comments that are submitted electronically, poses a challenge for agencies. The large volume of comments received makes it difficult to distill and address the most substantive concerns of the public. This work attempts to alleviate this burden by applying existing machine learning techniques to the problem of recognizing citation sentences. A citation in this context is defined as a statement in which the author of the public comment references an external source of factual information that is associated with a specific person or organization. The problem is formulated as a binary classification problem: Is a specific person or organization mentioned in a sentence being referenced as an external source of information? We show that our definition of a citation is reproducible by human judges and that citations can be detected using machine learning techniques with some success. Casting this as a machine learning problem requires selecting an appropriate representation of the sentence. Several feature sets are evaluated individually and in combination. Superior results are obtained by combining feature sets. Syntactic features, which characterize the structure of the sentence rather than its content, significantly improve accuracy when combined with other features, but not when used in isolation. Although prediction error rate is adequate, coverage could be improved. An error analysis enumerates short-term and long-term challenges that must be overcome to improve recall.

Jaime Arguello is a Ph.D. student at the Language Technologies Institute at Carnegie Mellon University. His work focuses on text data mining, information retrieval, and natural language processing.

Jamie Callan is a Professor at the Language Technologies Institute, a graduate department in Carnegie Mellon University's School of Computer Science. His research and teaching focus on text-based information retrieval. His recent research studies advanced search engine architectures, federated search across groups of search engines, adaptive information filtering, text analysis and organization, text mining, and automatic collection of instructional materials for an intelligent reading tutor. His earlier IR research included first-generation Web-search systems, integration of text search with relational database systems, and information literacy in K-12 education.

Dr. Stuart W. Shulman is Director of the Sara Fine Institute in the School of Information Sciences at the University of Pittsburgh. He is also the founder and Director of the Qualitative Data Analysis Program (QDAP) at Pitt's University Center for Social and Urban Research, which is a fee-for-service coding lab working on projects funded by the National Science Foundation, the National Institutes of Health, DARPA, and other funding agencies. He has been Principal Investigator and Project Director on related National Science Foundation–funded research projects focusing on electronic rulemaking, human language technologies, coding across the disciplines, digital citizenship, and service-learning efforts in the United States. Dr. Shulman is the Editor-in-Chief of the Journal of Information Technology & Politics.

This work was supported in part by the eRulemaking project and NSF grants IIS-0240334 and IIS-0429293. We are grateful to the U.S. EPA and the U.S. Department of the Interior's FWS for providing the public comment data that made this research possible. We thank the coders at the Qualitative Data Analysis Program (QDAP) for their time and effort in producing our gold standard data and for providing the feedback that guided the evolution of our coding scheme. We also thank the anonymous reviewers for their comments. Any opinions, findings, conclusions, and recommendations expressed in this paper are the authors' and do not necessarily reflect those of the sponsors.

Address correspondence to: Jaime Arguello, Carnegie Mellon University, Pittsburgh, PA (E-mail: [email protected]).



KEYWORDS. Citation analysis, public comments, e-rulemaking, text mining, information extraction, machine learning

RECOGNIZING CITATIONS IN PUBLIC COMMENTS

U.S. law and standard regulatory practice require U.S. regulatory agencies to give notice of a proposed rule and then to respond to substantive comments from lobbyists, companies, trade organizations, special interest groups, and the general public before issuing a final rule. Traditionally, public comments were submitted in paper form. However, during the last few years, the government has begun to allow comments to be submitted electronically in some cases. It is now easier for the public to examine and comment on proposed regulations. High-profile rules attract hundreds of thousands of e-mail comments. Finding the most substantive comments among these large corpora is a difficult task.

One criterion by which a policy maker could decide that a public comment is substantive is whether it references an external source of information. Policy makers might want to focus on evidence-bearing comments for several reasons. First, policy makers might want to examine the external information source cited, or at least be aware of it. They might care to know where the public gets its facts. Second, a reference to an external source of information could be a proxy for the value of the author's comments. Possibly, an author who references an external source of information is better informed than one who does not. It may be more likely that his or her claims and opinions are based on factual evidence. Finally, citing external sources may be a mechanism for the author to establish credibility (Ziman, 1968). The presence of a citation may indicate the author's intent to write a constructive comment, rather than a substanceless rant. Thus, if comments need to be prioritized for the policy maker's consideration, which is likely the case when comment volume is high, focusing on comments containing references to external evidence might be useful.

In this work, we define a citation as a reference to an external source of information that is associated with a specific person or organization. The problem is framed as a binary classification problem, where the classification question is: "Given a mention of a specific person or organization, is it being referenced as (or associated with) an external source of information in the sentence?" The unit of classification is the person or organization mentioned. The evidence that informs the classification decision is limited to the sentence mentioning the person or organization. We show that our definition of a citation is reproducible by humans and detectable by automated methods. We also present a comparative evaluation of different machine learning settings for detecting citations.

From a technical perspective, we explored the following research questions:

• Can citations be detected automatically?
• If so, what features are most effective and under what assumptions?

Framing this as a text classification problem requires selecting an appropriate representation of the sentence. For example, under a traditional bag-of-words representation, the classification decision is informed by the words in the sentence (e.g., Does the word reported occur in the sentence?). Another representation may focus on the syntactic relations between the person or organization in question and its surrounding context (e.g., Is the entity the subject of the verb reported?).

In our experiments, we held the classification algorithm constant and evaluated different feature sets individually and in combination. In particular, we explored whether syntactically motivated features are required for detecting citations. This question is important because generating syntactic features requires more work than working under a bag-of-words representation. Furthermore, evidence can be borrowed from different parts of the sentence. Under a bag-of-words representation, features can originate from the person/organization potentially being cited or elsewhere. We explored the effects on classification accuracy of focusing on different parts of the sentence and discuss the implications of using different feature sets. Finally, we show that combining different representations (i.e., feature sets) improves classification accuracy. We explored two alternatives for feature set combination: (a) incorporating different feature sets into a single classifier and (b) combining the output of individual classifiers, each trained on a separate set of features. We show superior results under condition (b). Our best performing model catches about 66% of all true citations and gets about 80% of citation predictions correct. Our 66% coverage estimate is optimistic for reasons stated later.

The remainder of this article is organized as follows. The next section introduces the public comment corpora used in this work. This is done early on because the numerous examples presented throughout the article originate from these corpora, and knowing the context of each example may be informative to the reader. Then we define what is and is not a citation in this work. We next discuss the human annotation process and trace how our definition of a citation evolved until it produced adequate intercoder agreement. Data preprocessing is discussed in the following section. Our algorithms and experimental results are discussed next, followed by an in-depth error analysis, which suggests possible next steps. Related work and concluding remarks make up the final two sections.

CORPORA

The data used in this evaluation originates from public comments submitted to the U.S. Department of the Interior's Fish and Wildlife Service's (FWS) proposal to list the polar bear as "threatened" under the Endangered Species Act of 1973 (USDOI-FWS-2007-0008). This corpus (the Polar Bears corpus) was selected because preliminary analysis revealed sufficient references to external sources of evidence such as reports, research studies, media distributions, etc. The Polar Bears corpus contains more than 540,000 comments that were submitted to the FWS by e-mail. These comments tend to focus on the deterioration of the polar bear's habitat, primarily due to global warming and hunting. Examples originating from the Polar Bears corpus are marked with a PB.

Some of the examples presented in this article originate from comments submitted to the Environmental Protection Agency (EPA) in response to its 2004 proposal for new emission standards for hazardous air pollutants from coal- and oil-fired utility plants (USEPA-OAR-2002-0056). This corpus (the Mercury corpus) contains more than 530,000 comments that were submitted to the EPA by e-mail. These comments tend to focus on the negative effects of mercury contamination from coal-fired power plants. Examples originating from the Mercury corpus are marked with a MR. Both corpora are available for research use.1

DEFINITIONS

What Is a Citation?

We define a citation as a mention of a specific person or organization (also referred to as named entity or just entity from here forward) that is referenced as an external source of information. We impose the restriction that a citation must reference a specific person or organization to avoid ambiguous references, which were frequent in the corpora we examined. For instance, an ambiguous reference such as in Studies show polar bears are resorting to cannibalism because they are starving.PB would not be considered a citation under this definition because the studies are not associated with a specific person/organization. On the other hand, the following example would be considered a citation. In all examples, mentions of a specific organization are marked with [·]org and mentions of a specific person are marked with [·]per.

A study released in 2006 by the [US Geological Survey's Alaska Science Center]org suggests that bears in the southern Beaufort Sea may be turning to cannibalism as a result of the shrinking ice cover cutting off access to seal prey.PB

This would be a citation because the study is associated with a specific organization, the U.S. Geological Survey's Alaska Science Center. References to ambiguous information sources are not called citations based on the assumption that policy makers care primarily about specific data items. Note that it is possible for an ambiguous reference to be disambiguated somewhere else in the comment (e.g., imagine the first example being followed by the sentence One such study was produced by the [USGS's Alaska Science Center]org). Since we focus on individual sentences in isolation, we would miss such cases.

In the clearest case, a citation is an instance in which the author attributes concrete information (e.g., that mercury increases the risk of kidney damage) to a specific external source (e.g., the World Health Organization):

• The [World Health Organization]org, in 1991, concluded that urinary mercury increases the risk of kidney damage.MR

• An [Atomic Energy Commission]org's document published in 1952 notes that fish concentrate 150,000 times the poisons they ingest.MR

• Furthermore, news ([NY Times]org, April 7, p. A4) shows that text drawn from a 2000-page National Academy of Science report was edited in a way that minimizes the risk associated with mercury exposure.MR

Other types of sentences also qualify as citations. The author can state that he or she examined some external source of information without mentioning the claims made by the source (e.g., After watching this week's edition of NOW on [PBS]org, I am more educated on this issue.MR). The author can explicitly ask the policy maker to consider some external information source (e.g., Please read the latest [Pentagon]org report before rolling back any pollution control.MR). Finally, the author can simply mention that the information source exists (e.g., [Judith Bluestone]per's remarkable study, The Fabric of Autism, will be out in a few months and has an impressive array of research citations.MR). All of these qualify as citations.

What Is Not a Citation?

Not every sentence where the author mentions a person or organization while (possibly) supporting an argument is a citation. For example, mentions of someone's actions to support some claim are not citations (e.g., The [Bush]per administration's refusal to curb global warming is a primary cause of the polar bear's melting habitat and their terrible plight.PB). Such cases are not citations because the claim being made is derived by the author from the named entity's action, not from the named entity itself. It is the author who is making the connection between the named entity and the argument. A related case, also not a citation, is a mention of a person or organization's reputation (e.g., Given [EPA]org's tarnished record of improving air quality in our parks, any effort to further . . . will take us in exactly the wrong direction.MR). The EPA's tarnished record is not information that the EPA produced or distributed. Rather, it is the author's or the public's perception. Thus, it is not a reference to external evidence.

Another confusable case, not necessarily a citation, is when the author quotes (directly or indirectly) an external entity. One would think that when the author quotes, he/she is implying that the person or organization being quoted has produced information worth the policy maker's consideration. However, consider the following two quotes:

• In the northern territories, where temperatures have risen an average of four degrees since 1950, wildlife experts such as Mr. [Mitch Taylor]per say that bears have never been healthier and more plentiful.PB

• [Ghandi]per said that you can really tell about a people from the way they treat their animals.PB


The first statement is considered a citation, but not the second. The difference is subtle. With the second quote, rather than telling the policy maker about some external source of information, the author is requesting that the policy maker "do the right thing." The author is probably not seriously asking the policy maker to reference Gandhi in the final rule. This distinction is subjective. Therefore, perfect agreement among human annotators for this task is unrealistic.

A Note About Citations in Scientific Text

When one first thinks of a citation, academic literature comes to mind. In a scientific paper, a citation is a meaningful link between the work that is being presented and work done in the past. In a public comment, the notion of a citation is different. In academic literature authors cite for different reasons. In the annotation scheme of academic citation function presented by Teufel, Siddharthan, and Tidhar (2006a), citations fulfill four general functions: (a) to point out weaknesses in the cited approach, (b) to compare/contrast the work presented with the work cited, (c) to show compatibility with the cited work, and (d) other. This breakdown of why authors cite in academic work does not fit comfortably into the context of public comments. Authors of public comments are not presenting or defending work they have done. Rather, they are defending their stance on the proposed regulation. Thus, given this difference in authors' intent, our definition of a citation deviates from what is called a citation in academic text. The Related Work section surveys prior work on automatic citation analysis in academic literature.

HUMAN ANNOTATION

This section describes the human annotation process that produced the gold standard data used for training and testing. Human annotators were presented with a set of sentences, each containing one noun phrase marked as either person ([·]per) or organization ([·]org). Human annotators were instructed to label a sentence as a citation if the marked person or organization is associated with the production or distribution of specific information, and otherwise to label it as non-citation. The unit of classification was the named entity marked. Annotators were shown sentences in isolation, outside the context of their public comment. Sentences mentioning more than one person or organization were duplicated and a different named entity was marked in each copy. For example, the sentence The [Center for Disease Control and Prevention]org has reported that 1 out of every 6 women of childbearing age has mercury levels that are higher than the limits established by the [EPA]org.MR was presented to the annotators (and our algorithms) as two separate instances:

• The [Center for Disease Control and Prevention]org has reported that 1 out of every 6 women of childbearing age has mercury levels that are higher than the limits established by the EPA.MR

• The Center for Disease Control and Prevention has reported that 1 out of every 6 women of childbearing age has mercury levels that are higher than the limits established by the [EPA]org.MR

The first instance would be a citation, but not the second, since the CDC and not the EPA is the organization responsible for reporting this statistic.

During the final coding process, two annotators (called coder A and coder B from here forward) from the Qualitative Data Analysis Program at the University of Pittsburgh2 annotated a set of 6,000 sentences. Agreement between coders was evaluated on 20 percent of the data (1,200 sentences). The remaining 5,800 sentences were divided evenly between the two coders. Intercoder agreement was evaluated in terms of Cohen's kappa (κ), defined as

κ = (Pr(a) − Pr(e)) / (1 − Pr(e)),

where Pr(e) is the expected probability of agreement due to chance, and Pr(a) is the actual or observed relative agreement between the coders. κ ranges between +1, when agreement is perfect, and −1, when agreement is perfectly negatively correlated. When κ = 0, agreement is at the level that would be reached if each coder annotated at random, consistent with the empirical class distribution. The 1,200-sentence set used to measure intercoder agreement was divided into two sets of 600 sentences each (Spre and Spost). Agreement was evaluated on set Spre prior to the full annotation task and on set Spost after the full annotation task. These two tests for intercoder agreement can be seen as pre-/posttests, where the post-test measures agreement after each coder annotated his or her half of the 5,800-sentence set. Intercoder agreement with respect to κ is shown in Table 1. What value of κ qualifies as high agreement depends on the task. However, Carletta (1996, p. 256) notes that "content analysis researchers generally think that κ > .80 is good reliability, with .67 < κ < .80 allowing tentative conclusions to be drawn." Thus, intercoder agreement on both sets is considered good. The higher agreement on the post set, Spost, suggests this may be a task human coders can improve on with practice.
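To make the agreement computation concrete, the following is a minimal sketch in Python of computing Cohen's κ from two coders' parallel label lists. It is illustrative only: the paper does not describe its tooling, and the function and variable names are our own.

from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' parallel label lists."""
    n = len(labels_a)
    # Observed agreement Pr(a): fraction of items labeled identically.
    pr_a = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement Pr(e): sum over classes of the product of each
    # coder's marginal probability for that class.
    dist_a, dist_b = Counter(labels_a), Counter(labels_b)
    pr_e = sum((dist_a[c] / n) * (dist_b[c] / n) for c in set(dist_a) | set(dist_b))
    return (pr_a - pr_e) / (1 - pr_e)

# Toy example with the two classes used in the final coding round.
coder_a = ["citation", "non-citation", "non-citation", "citation"]
coder_b = ["citation", "non-citation", "citation", "citation"]
print(cohens_kappa(coder_a, coder_b))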

Coding Manual Evolution

The coding manual and our definition of a citation underwent two major modifications prior to the final coding manual used to produce the gold standard data. The next two subsections describe these modifications. This section discusses what we did wrong and is included for researchers interested in similar annotation tasks. Readers less interested in the human annotation aspect of this work can skip to the next section without loss of continuity.

Coding Manual Version 1.0

The main difference between our first and final attempts was that initially a distinction was drawn between sentences that attribute a fact to an external source and sentences that reference an external source without attributing a fact to it. Coders were instructed to label sentences into three classes: citation, non-citation, and boundary. A citation was defined as a sentence that attributes specific facts to the person or organization in question. For example, in the sentence An [Atomic Energy Commission]org document published in 1952 notes that fish concentrate up to 150,000 times the poisons they ingest.MR, a specific fact (i.e., fish concentrate up to 150,000 times the poisons they ingest) is attributed to the Atomic Energy Commission. The boundary class covered the following cases, not covered by the citation class because a specific fact is not attributed to the source:

1. The author gives his or her opinion about information produced or distributed by some external source (e.g., Or did nobody read the [Pentagon]org report on climate change, which will be devastating for us all.MR).

2. The author tells the policy maker that some external information source is worth considering (e.g., Read the recent [Pentagon]org report and wake up!MR).

3. The author states that he or she examined some information produced or distributed by an external source, without mentioning the facts (e.g., After watching this week's edition of Now on [PBS]org, I am more educated on this issue.MR).

The non-citation class covered the remaining cases.

Three hundred sentences from the EPA's Mercury corpus were coded by six coders. For comparison with the final round of coding, we focus on agreement between coders A and B. Agreement between A and B was not high enough (κ = .625). Table 2 shows the contingency matrix between both coders for round one. As is shown in the table, citation and non-citation were confused only 4 times (2+2). The boundary class was confused with citation 9 times (3+6) and with non-citation 22 times (13+9), indicating that its definition was problematic for coders.

TABLE 1. Intercoder Agreement

Set      κ
Spre     0.871
Spost    0.920


An error analysis revealed that coders had difficulty deciding if a claim attributed to an external source is concrete enough to call the instance a citation rather than boundary. A "fact" attributed to a person or organization in a citation can range from a concrete fact (e.g., The [CDC]org has cited scientific evidence that levels this high increase the risk of brain damage in newborns.MR) to claims that either lack credibility, are subjective, or belong to the author rather than the external source (e.g., A recent [PBS]org report claimed that the Amazon forest will be completely wiped out.MR). For the second round of coding, the boundary class was merged with the citation class.

Coding Manual Version 2.0

In this round, the citation class was divided into primary and secondary. Primary covered cases where the person or organization produced the information, and secondary covered cases where the person or organization distributed information produced by someone else (e.g., a newspaper article covering a scientist's work). Both primary and secondary were defined to include cases where no facts are attributed to the source referenced. A sentence had only to make it clear that some information was produced or distributed by the named entity. Six hundred sentences from the FWS's Polar Bears corpus were coded by five coders. Again, we focus on the two coders used in the final round of coding. Agreement was much worse than agreement for round one (κ = 0.407). Table 3 shows the contingency matrix between coders A and B for round two.

The first meaningful result was that although the coders expressed their preference for distinguishing between primary and secondary references to external data, secondary references were very uncommon. Coder A found one and coder B found two. For the final round of coding, the primary and secondary classes were again merged into a single citation class. The second meaningful result was that coder A coded 27 citations that B called non-citation. Inspecting the data revealed that coder A made many false positive mistakes and B many false negative ones with respect to the citation class.

Some comments mention the advocacy group that the author represents, for example, Attached are the comments by the [Safari Club Foundation]org on the proposed listing of the polar bear as a threatened species [ . . . ].PB Coder A considered such references as citations, while B considered them non-citations. Coders were instructed to ignore organizations that the author represents, unless they are mentioned as a source of information. Also, a number of statements cited Al Gore's documentary An Inconvenient Truth as an external source (the PB corpus relates to global warming). Coder A considered some references to this documentary as citations, while B coded all references to it as non-citation. Coders were reminded that a media distribution can be an external source of evidence if it is relevant and presented by the author as a source of factual information.

DATA PREPROCESSING


TABLE 2. Round-One Coding: Contingency Matrix (Total = 300)

                                    Coder A
                        citation   boundary   non-citation
Coder B   citation            17          3              2
          boundary             6         15             13
          non-citation         2          9            233

TABLE 3. Round-Two Coding: Contingency Matrix (Total = 600)

                                    Coder A
                        primary   secondary   non-citation
Coder B   primary            10           1              1
          secondary           2           0              0
          non-citation       27           0            559


As previously mentioned, the Polar Bears corpus was used to produce our evaluation data. The raw Polar Bears corpus contains 546,900 comments, which are a mixture of completely original comments as well as edited and unedited form letters. A form letter is a message written by an advocacy group that someone can either submit "as is" or after personalizing it. Duplicate text in either edited or unedited form letters was flagged using the Durian duplicate detection tool (Yang & Callan, 2006). Public comments containing no text flagged unique were completely ignored. The remaining text was sentence-segmented using OpenNLP,3 yielding 131,839 unique sentences. To filter text originating from the e-mail's footer, all sentences with fewer than ten tokens (words + punctuation) were removed, resulting in 68,090 sentences. These were then named entity tagged using BBN Identifinder (Bikel, Schwartz, & Weischedel, 1999). A named entity tagger marks all proper nouns corresponding to a person or organization. Sentences without a mention of a specific person or organization were filtered out. Finally, each sentence mentioning more than one person or organization was duplicated and a different named entity was marked in each copy, as previously illustrated. From this final set, 6,000 sentences were randomly sampled. Each instance poses the question: Is the tagged person/organization being cited?
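As a concrete illustration of the last two steps, here is a minimal Python sketch of the short-sentence filter and the per-entity duplication. It is not the pipeline the authors ran (Durian, OpenNLP, and BBN IdentiFinder are external tools that are not reproduced here), and the function and field names are our own.

import re

def make_instances(sentence, entities):
    """Duplicate one sentence per tagged entity, marking a different entity
    in each copy. `entities` is a list of (surface_string, type) pairs from
    a named entity tagger (the paper used BBN IdentiFinder)."""
    # Drop footer-like sentences with fewer than ten tokens (words + punctuation).
    tokens = re.findall(r"\w+|[^\w\s]", sentence)
    if len(tokens) < 10 or not entities:
        return []
    instances = []
    for surface, entity_type in entities:
        tag = "org" if entity_type == "ORGANIZATION" else "per"
        marked = sentence.replace(surface, "[" + surface + "]" + tag, 1)
        instances.append({"text": marked, "entity": surface, "type": entity_type})
    return instances

sentence = ("The Center for Disease Control and Prevention has reported that 1 out of "
            "every 6 women of childbearing age has mercury levels that are higher than "
            "the limits established by the EPA.")
entities = [("Center for Disease Control and Prevention", "ORGANIZATION"),
            ("EPA", "ORGANIZATION")]
for instance in make_instances(sentence, entities):
    print(instance["text"])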

It should be noted that during the annotation process, coders were instructed to handle named entity tagging errors as follows. If the named entity tagged is off-center, such as in As recently reported in the media, an official with the [Canadian Department of Fisheries]org and Oceans said: We've noticed that the ice over the past 4 to 5 years has been deteriorating and it's giving us some concern.PB, then the coder should consider the full named entity Canadian Department of Fisheries and Oceans. However, if the named entity is completely incorrect, such as in [Agriculture]org and forest studies have evidenced the importance of biodiversity in maintaining a healthy planet.PB, then the coder should consider the instance a non-citation, regardless of the context.

ALGORITHMS

The goal of any text classification algorithm is to learn from annotated examples a model that can be applied to a previously unseen example to predict its class, in our case citation or non-citation. This is the typical supervised learning framework. A support vector machine (SVM) classifier (for details see Vapnik, 1995) was used in all experiments. SVM classifiers have produced good results in a wide range of text classification tasks. The SVMlight implementation (Joachims, 1998) was used with a linear kernel. This SVM classifier learns from training data a linear boundary that separates positive from negative examples (i.e., citations from non-citations) by the widest margin, while making as few misclassifications on the training data as possible (the training data may not be linearly separable).
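A minimal sketch of the same kind of setup follows: a linear SVM over a simple bag-of-words representation. The paper used SVMlight; scikit-learn's LinearSVC is substituted here as a stand-in, and the toy sentences and labels are illustrative only.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny illustrative training set; the real experiments used 6,000 annotated sentences.
sentences = [
    "The World Health Organization concluded that urinary mercury increases the risk of kidney damage.",
    "The Bush administration's refusal to curb global warming is a primary cause of the polar bear's plight.",
    "A study released by the US Geological Survey suggests bears may be turning to cannibalism.",
    "Given EPA's tarnished record of improving air quality, this will take us in the wrong direction.",
]
labels = ["citation", "non-citation", "citation", "non-citation"]

# A linear SVM over a bag-of-words representation (roughly the ALL feature set).
model = make_pipeline(CountVectorizer(lowercase=True, stop_words="english"), LinearSVC())
model.fit(sentences, labels)
print(model.predict(["The IPCC predicts a 30 percent reduction in polar bear numbers."]))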

Several feature sets were evaluated individually and in combination. Each feature set is a different way of representing the contents and/or syntactic structure of the sentence. A single feature in a feature set is an atomic unit of evidence, which a classifier leverages to predict the unknown class. One way to combine feature sets is simply to incorporate them into a single classifier. In the best case, the classifier learns how to leverage these potentially different sources of evidence to produce a reliable predictive model. A different way to combine features is via an ensemble method, depicted in Figure 1. In an ensemble method, each base classifier makes a prediction and then a meta-classifier takes as input the base classifiers' predictions and outputs a final prediction. The base classifiers, C1, C2, . . . , Cn, are said to form a committee. The meta-classifier, Cmeta, somehow aggregates the predictions of its committee to produce a final prediction. A meta-classifier, for instance, might simply predict the majority class based on the predictions of its committee.

FIGURE 1. Typical meta-classification setup. The example to be predicted, xi, is input to all base classifiers, C1, C2, . . . , Cn. The base classifiers' individual predictions, yC1, yC2, . . . , yCn, are the input to the meta-classifier, Cmeta, which makes the final prediction, yi.



In our setup, the base classifiers and the meta-classifier are all SVM classifiers. Base classifiers differ from one another only in the features that each uses. Each base classifier is allowed to focus (and possibly become an expert) on a particular set of features. Their output was aggregated by training the meta-classifier to predict the right output based on the base classifiers' collective predictions. This ensemble method is known as stacking (Wolpert, 1990). Separating features into complementary views or feature sets to improve accuracy and reduce the burden of producing training data is also used in a technique called co-training (Blum & Mitchell, 1998).
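A minimal sketch of this kind of stacking is shown below, assuming the training labels and one list of feature-view strings per base classifier are already available. It uses scikit-learn's cross_val_predict to produce out-of-fold base predictions, which only approximates the paper's two-tier cross-validation; all names are illustrative.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def train_stacked(views, labels, n_folds=9):
    """views: one list of strings per feature view (e.g., ONLY_NE text,
    ALL_BUT_NE text), each aligned with `labels`. Returns the fitted base
    classifiers and the meta-classifier."""
    base_models, meta_columns = [], []
    for texts in views:
        base = make_pipeline(CountVectorizer(), LinearSVC())
        # Out-of-fold predictions keep the meta-classifier from being trained
        # on base-classifier outputs fit to the same examples.
        fold_preds = cross_val_predict(base, texts, labels, cv=n_folds)
        meta_columns.append((fold_preds == "citation").astype(float))
        base_models.append(base.fit(texts, labels))
    meta = LinearSVC().fit(np.column_stack(meta_columns), labels)
    return base_models, meta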

The next two subsections describe our feature sets. In the next subsection, we focus on our bag-of-words feature sets. Then, we focus on our semantically motivated syntactic features. Bag-of-words features focus on the words in the sentence. Our semantically motivated syntactic features focus on the sentence's structure. More specifically, they characterize how the person or organization potentially being cited fits syntactically into the sentence.

Bag-of-Words Features

As mentioned previously, one major research question explored in this work is whether syntactic information is at all required to determine that a person/organization in a sentence is being cited. To explore this question, four bag-of-words feature sets were evaluated. A bag-of-words model is one that ignores all ordering and relation information between features (the words in the sentence, in our case). These four bag-of-words feature sets differ exclusively in the parts of the sentence from where their features originate and are referred to as ALL, ONLY_NE, ALL_BUT_NE, and ALL_BUT_NE + ONLY_NE:

• ALL: All the words in the sentence are used as features.

• ONLY_NE: Only the words within the tagged named entity are used as features.

• ALL_BUT_NE: All the words in the sentence are used as features, except those within the tagged named entity.

• ALL_BUT_NE + ONLY_NE: All the words in the sentence are used as features. However, a word occurring within the tagged named entity is treated as a different feature than the same word occurring outside the tagged named entity.

Individually, each bag-of-words classifier sheds light on the nature of the problem. If ALL performs well enough, then the extra machinery required to characterize how the named entity in question fits syntactically into the sentence may not be justified. The performance of ONLY_NE depends partly on the extent to which authors reference the same people or organizations in citations. If authors consistently cite the same sources, then ONLY_NE should perform well. Conversely, if authors cite a wide range of sources, but do so in a consistent manner, then ALL_BUT_NE should perform well. ALL and ALL_BUT_NE + ONLY_NE are related. The difference is that ALL_BUT_NE + ONLY_NE distinguishes whether a word occurs inside or outside the named entity.

For the implementation of these classifiers, all text was downcased and tokenized by splitting on white space and punctuation. Stopwords (i.e., topic-general function words) were removed and each term was stemmed (i.e., suffix-stripped) using the Porter stemmer (Porter, 1980). Term stems occurring only once, thus having no predictive power, were removed, resulting in a vocabulary (i.e., feature space) of 3,651 term stems (double this number for ALL_BUT_NE + ONLY_NE).
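The sketch below builds the four bag-of-words views for a single instance whose tagged entity is wrapped in [ ... ]org or [ ... ]per, as in the examples above. It assumes NLTK is available for Porter stemming, uses a tiny illustrative stopword list rather than the authors' list, and all function names are our own.

import re
from nltk.stem import PorterStemmer

STOPWORDS = {"the", "a", "an", "of", "to", "in", "that", "is", "and", "by", "has"}
stemmer = PorterStemmer()

def bag(text, prefix=""):
    tokens = re.findall(r"[a-z]+", text.lower())
    return [prefix + stemmer.stem(t) for t in tokens if t not in STOPWORDS]

def bow_views(marked_sentence):
    """Build the four bag-of-words views for one instance; the tagged entity
    is assumed to be present and wrapped as [ ... ]org or [ ... ]per."""
    match = re.search(r"\[(.+?)\](?:org|per)", marked_sentence)
    entity = match.group(1)
    rest = marked_sentence.replace(match.group(0), " ")
    only_ne = bag(entity, prefix="NE_")   # entity words, kept distinct
    all_but_ne = bag(rest)                # everything outside the entity
    return {
        "ALL": bag(entity) + bag(rest),
        "ONLY_NE": only_ne,
        "ALL_BUT_NE": all_but_ne,
        "ALL_BUT_NE + ONLY_NE": all_but_ne + only_ne,
    }

example = ("A study released in 2006 by the [US Geological Survey's Alaska Science "
           "Center]org suggests that bears may be turning to cannibalism.")
for name, features in bow_views(example).items():
    print(name, features)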

Semantically Motivated Syntactic Features

Generating meaningful syntactic features required two component technologies: (a) frame semantics and (b) dependency-tree parsing. Each component technology is described and motivated in the next two subsections. The third subsection describes how these two technologies were used together to generate useful features for classification.

Frame Semantics

The goal of frame semantics is to manually enumerate the full range of semantic and syntactic combinatory possibilities of the words in a language (Ruppenhofer, Ellsworth, Petruck, Johnson, & Scheffczyk, 2006). A frame can be thought of as some meaningful event (e.g., death), action (e.g., forgive), or condition (e.g., certainty). FrameNet (Ruppenhofer et al., 2006) is a lexical database that maps certain words to frames and specifies for each frame its constituent semantic roles. A frame has two main ingredients: (a) a fixed set of lexical units (LUs) that individually mark the presence of the frame in discourse and (b) a set of core or optional frame elements (FEs) (a.k.a. semantic roles). For example, the apply_heat frame is associated with the lexical units bake, boil, brown, cook, and simmer and the frame elements cook, food, container, heating_instrument, and duration. A frame's lexical unit can be thought of as the central word or phrase that triggers the frame in text. Frame elements are the semantic roles associated with the frame. For example, the apply_heat frame requires the element that is applying the heat (the cook) and the element that heat is being applied to (the food).

FrameNet release 1.3 was used in this work. It contains 795 frames, 10,195 lexical units, and 7,124 frame elements. This release also contains a large set of annotated sentences with the frame, LU, and FEs each marked and labeled. For example, the sentence Jim boiled the egg for 3 minutes. exhibits the apply_heat frame, where the lexical unit boiled triggers the frame. Jim is the cook, the egg is the food, and for 3 minutes fills the optional duration role. A lexical unit need not be a verb. The statement frame is associated with verb lexical units such as acknowledge, address, and say, as well as noun lexical units such as report, denial, and explanation.

Unfortunately, though not surprisingly, no frame in FrameNet maps directly to what we call a citation, though several frames seem relevant. For example, consider the following snippets from sentences known to be citations:

1. . . . with the UN report underscoring . . .

2. . . . what the U.S. Humane Society estimates to be . . .

3. . . . and models by the IPCC predict that . . .

4. . . . according to a book by the Museum of Natural History . . .

Sentence (1) exhibits the statement frame, in which the UN is acting as the speaker. Sentence (2) exhibits the estimation frame, in which the U.S. Humane Society is acting as the cognizer. Sentence (3) exhibits the expectation frame, in which the models by the IPCC is acting as the speaker. Sentence (4) exhibits the attribute_information frame, in which a book by the Museum of Natural History is acting as the text. It seems reasonable that, given a new sentence, if the entity in question is filling a similar frame-specific semantic role, then it is also being cited. A classifier could leverage rules such as "If a sentence exhibits the estimation frame and the entity in question is acting as the speaker, then the sentence is a citation." Thus, what is needed is a mechanism to determine that an entity (e.g., the UN) is filling a certain role (e.g., the speaker role) in a particular instance of a frame (e.g., the statement frame in Sentence 1, above).


The annotated sentences distributed with FrameNet exemplify how a frame-specific frame element may fit syntactically into a sentence that exhibits that frame. Specifically, annotated sentences exemplify possible syntactic relations between the frame element and the lexical unit. For example, annotations may show that in a statement frame, the speaker can occur as the subject of the verb reported, such as in different studies reported that . . ., or as the object of the prepositional phrase reported by, such as in . . . reported by different studies. These syntactic relations can be captured by the dependency tree path between the FE and the LU. The next section describes dependency trees and how they are used to produce features for training a classifier.

Dependency Parse Trees

A dependency tree is a representation of a sentence that encodes the syntactic relations between the words in the sentence. Figure 2 shows the dependency parse tree for the sentence In September 2005, the U.S. NSIDC reported that summer Arctic sea ice had shrunk to low levels.PB Each word in the sentence is related to one parent, except the main verb, reported. Dependencies between child and parent are labeled according to a fixed set of syntactic relationships, such as nsubj (subject), dobj (direct object), and amod (adjectival modifier). A dependency tree is a connected graph, so there is always a path from any word to any other word in the sentence. A word pair can share more than one connecting path, in which case choosing the shortest path has been shown to be a sensible heuristic for selecting the one that encodes the most meaningful syntactic relationship (Bunescu & Mooney, 2005; Erkan, Ozgur, & Radev, 2007).
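To illustrate, the sketch below extracts the shortest labeled path between two words of a hand-coded dependency tree (a rough rendering of the Figure 3 sentence). The paper does not say what tooling it used for this step; networkx is used here purely as a stand-in.

import networkx as nx

# Dependency edges (child, relation, head), hand-coded to roughly follow the
# tree in Figure 3 for "They reported no arrests during or after the match".
edges = [
    ("They", "nsubj", "reported"),
    ("arrests", "dobj", "reported"),
    ("no", "det", "arrests"),
    ("during", "dep", "reported"),
    ("or", "cc", "during"),
    ("after", "conj", "during"),
    ("match", "conj", "during"),
    ("the", "det", "match"),
]
graph = nx.Graph()
for child, relation, head in edges:
    graph.add_edge(child, head, rel=relation)

# The shortest path between the lexical unit and a candidate role filler
# encodes their syntactic relationship, here reported -(nsubj)-> They.
path = nx.shortest_path(graph, "reported", "They")
relations = [graph.edges[a, b]["rel"] for a, b in zip(path, path[1:])]
print(list(zip(path[1:], relations)))   # [('They', 'nsubj')]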

Suppose we are given the sentence in Figure 2 and we wish to know if there is a statement being made and, if so, (a) who is making the statement and (b) what the statement is. Suppose we have the dependency tree representation of the sentence. One annotated sentence in FrameNet exhibiting the statement frame is They reported no arrests during or after the match., labeled with They as the speaker, reported as the lexical unit, and no arrests as the message.

The dependency tree representation of this sentence is shown in Figure 3, along with the spans of text filling the speaker and message roles. This tree shows they, the speaker, being the subject of the verb reported and no arrests, the message, being the direct object of the verb reported. We more concisely denote these syntactic rules as reported -(nsubj)-> speaker and reported -(dobj)-> message. These syntactic rules can then be applied to the dependency tree in Figure 2 to determine that the U.S. NSIDC is acting as the speaker and that summer Arctic sea ice had shrunk to low levels is the statement.

FIGURE 2. Dependency tree representation of "In September 2005, the U.S. NSIDC reported that summer Arctic sea ice had shrunk to low levels."



The next subsection describes how the pieces were put together to generate our FrameNet-based features.

Generating Frame-Target-Role Triples

All FrameNet-annotated sentences were dependency parsed using the Stanford parser.5 Each FrameNet sentence is typically associated with one semantic frame, but may exhibit more than one. Each frame annotation shows the span of text that is the lexical unit and each span of text that corresponds to a frame element or semantic role. For each marked frame element (i.e., semantic role), we extracted the dependency tree path from the LU to the FE.6

In total, we were able to process 134,471 example sentences exhibiting 592 frames (127 sentences per frame on average). Some annotated sentences were excluded due to parsing failure. A total of 85,093 syntactic patterns were found. Each syntactic pattern [e.g., reported -(nsubj)-> speaker] describes how a frame-specific semantic role (e.g., the speaker in a statement frame) may fit syntactically in a sentence exhibiting that particular frame (e.g., the speaker is the subject of the verb reported). It describes the syntactic relationship between the semantic role and the lexical unit.

Using syntactic patterns directly as features posed a challenge. As mentioned, the number of syntactic-pattern features was 85,095, and our entire evaluation set was 6,000 sentences. In text classification, when the number of distinct features greatly outnumbers the number of instances in the training data, this is a problem. Features from the training set are less likely to reoccur in the test set, which reduces the predictive power of a learned model. Our solution was to resolve each syntactic pattern to its corresponding frame-LU-role triple (called frame-target-role triple from here forward). This reduced the feature set size from 85,095 patterns to 5,027 frame-target-role triples.

For each instance in the training and test set, its associated frame-target-role triples were generated by applying each of the 85,095 patterns to the sentence. A pattern is said to positively fit the sentence if the sentence's named entity (i.e., the marked person or organization; recall there is only one per sentence) falls into the role slot of the syntactic pattern. If a syntactic pattern fit the sentence, then the pattern's corresponding frame-target-role triple was included as a feature of the sentence. For example, in the sentence The [IUCN]org is predicting a 30 percent reduction in the polar bear numbers in the next 45 years.PB, the pattern predicting -(nsubj)-> speaker fits this sentence because the marked organization, the IUCN, is the noun subject of the verb predicting. This pattern corresponds to the frame-target-role triple statement-predicting-speaker. Thus, this triple would be added as one feature of this instance. Table 4 shows a few frame-target-role triples with some of their associated syntactic patterns.
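A minimal sketch of this matching step follows. The pattern store, the dependency-edge format, and the single-step path check are all simplifications of the paper's 85,095 extracted patterns, and every name in the code is illustrative.

# One illustrative syntactic pattern, keyed by its lexical unit and a
# single-relation dependency path, mapped to its frame-target-role triple.
PATTERNS = {
    ("predicting", "nsubj"): "statement-predicting-speaker",
}

def frame_target_role_features(dep_edges, entity_head):
    """dep_edges: (child, relation, head) triples for one sentence.
    entity_head: head word of the marked person or organization."""
    features = {}
    for (lexical_unit, relation), triple in PATTERNS.items():
        # The pattern fits if the entity head fills the role slot, i.e. it
        # attaches to the lexical unit with the required relation.
        fits = any(child == entity_head and head == lexical_unit and rel == relation
                   for child, rel, head in dep_edges)
        if fits:
            features[triple] = 1.0
    return features

dep_edges = [("IUCN", "nsubj", "predicting"), ("reduction", "dobj", "predicting")]
print(frame_target_role_features(dep_edges, "IUCN"))
# {'statement-predicting-speaker': 1.0}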


FIGURE 3. Dependency tree of the FrameNet sentence "They reported no arrests during or after the match.", where They is marked as the speaker and no arrests is marked as the message.


It is possible for a syntactic pattern to resolve to multiple frame-target-role triples. For example, the pattern report -(nn)-> speaker and the pattern report -(nn)-> topic describe the same syntactic pattern. The ambiguity stems from the fact that X in a noun phrase the X report can be the speaker of the statement (e.g., the IPCC report) or the topic of the statement (e.g., the polar bear report). In cases where a syntactic pattern resolves to multiple frame-target-role triples, each triple was added as a feature, weighted in inverse proportion to the level of ambiguity. In this case, both triples would be added with a weight of .5 because the ambiguity is between two frame-target-role triples.
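As a tiny illustration of the 1/k weighting just described, the snippet below spreads unit weight evenly over the triples an ambiguous pattern resolves to; the pattern key and triple names are made up for the example.

# An ambiguous noun-modifier pattern resolving to two triples: each is
# added with weight 1/2.
AMBIGUOUS_PATTERNS = {
    ("report", "nn"): ["statement-report-speaker", "statement-report-topic"],
}

def weighted_triples(pattern_key):
    triples = AMBIGUOUS_PATTERNS[pattern_key]
    weight = 1.0 / len(triples)
    return {triple: weight for triple in triples}

print(weighted_triples(("report", "nn")))
# {'statement-report-speaker': 0.5, 'statement-report-topic': 0.5}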

EXPERIMENTAL RESULTS

As previously mentioned, two humans coded a set of 6,000 sentences for training/testing. Two hundred seventy-nine (4.7%) sentences were labeled citation, and the remaining were labeled non-citation. The frequency distribution of entities referenced in citations was very heavy-tailed, as shown in Figure 4. Along the x axis, entities are ranked in descending order of the number of times they were referenced in a citation. As the figure shows, a few persons and organizations were cited very frequently and many were cited only once or twice. The most frequently cited entity was the Discovery Channel, cited 39 times (14% of all citations). The second most frequently cited entity was Al Gore, cited 30 times (11% of all citations). The third most cited entity was the IPCC, cited ten times. Beyond the second-most cited entity, the number of citations per person/organization drops significantly. Eighteen entities were cited twice (36 references, accounting for 13% of citations). Most entities were cited only once (106 references, accounting for 38% of all citations).


TABLE 4. Example Frame-Target-Role Triples and Some of Their Associated Syntactic Patterns

Frame-Target-Role Triple                     Syntactic Pattern
statement-reported-speaker                   reported -(nsubj)-> speaker
                                             reported -(prep)-> by -(dep)-> speaker
                                             reported -(prep)-> from -(dep)-> speaker
evidence-revealed-support                    revealed -(nsubj)-> support
                                             revealed -(prep)-> by -(dep)-> support
                                             revealed -(prep)-> in -(dep)-> support
attributed_information-according-speaker     according -(dep)-> to -(dep)-> speaker
emotion_directed-pleased-experiencer         pleased <-(acomp)- feel -(nsubj)-> experiencer
                                             pleased <-(acomp)- was -(nsubj)-> experiencer
                                             pleased <-(acomp)- seemed -(nsubj)-> experiencer
emotion_directed-pleased-topic               pleased <-(acomp)- sounded -(dep)-> about -(dep)-> topic

FIGURE 4. Frequency distribution of persons/organizations cited, in descending order of frequency-based rank.



Results are presented in terms of precision (P), recall (R), and F-measure (F1), calculated by micro- and macro-averaging, for reasons motivated below. Precision (P) measures prediction accuracy with respect to the target class, in this case the citation class: Of the instances predicted citation, what percentage were correct? Recall (R) measures coverage: Of the instances that are true citations, what percentage were correctly predicted citation? F-measure (F1) is the harmonic mean of precision and recall. Micro-averaged P, R, and F1 are defined as

Pmicro = (# correct citation predictions) / (# citation predictions),
Rmicro = (# correct citation predictions) / (# true citations),
F1micro = (2 × Pmicro × Rmicro) / (Pmicro + Rmicro).

Macro-averaged P, R, and F1 are defined as

Pmacro = (1 / |Spred|) × Σ_{x ∈ Spred} (# correct citation predictions that reference x) / (# citation predictions that reference x),
Rmacro = (1 / |Strue|) × Σ_{x ∈ Strue} (# correct citation predictions that reference x) / (# true citations that reference x),
F1macro = (2 × Pmacro × Rmacro) / (Pmacro + Rmacro),

where Spred defines the set of unique named entities in predicted citations and Strue defines the set of unique named entities in true citations.

Micro-averaging treats all citations as equally useful. Correctly predicting a citation that references a frequently cited entity affects micro-averaged P, R, and F1 the same as correctly predicting a citation that references a rarely cited entity. Micro-averaged metrics are dominated by what happens with the frequent cases (e.g., Discovery Channel, Al Gore, IPCC). In contrast, in computing Pmacro, precision is calculated for each entity x in a predicted citation (x ∈ Spred) and averaged across entities. Similarly, in computing Rmacro, recall is calculated for each entity x in a true citation (x ∈ Strue) and averaged across entities. Correctly predicting a citation that references a rarely cited entity increases macro-averaged P, R, and F1 more than correctly predicting a citation that references a frequently cited entity. Macro-averaged metrics are dominated by what happens with the rare cases. Both micro- and macro-averaged P, R, and F1 are presented because the distribution of entities referenced in true citations was very heavy-tailed. A classifier that catches all references to the Discovery Channel and Al Gore catches 25% of all citations (i.e., Rmicro = 0.25). However, a different classifier that also catches 25% of all citations but finds a wider variety of entities cited may be preferred in some cases.
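To make the per-entity averaging concrete, here is a minimal sketch of the macro-averaged precision and recall defined above, assuming each test instance is available as a small record with its entity, gold label, and predicted label; the record format and names are our own.

from collections import defaultdict

def macro_precision_recall(rows):
    """rows: one record per test instance with keys 'entity', 'gold', and
    'pred', where gold/pred are 'citation' or 'non-citation'."""
    counts = defaultdict(lambda: {"tp": 0, "pred": 0, "true": 0})
    for row in rows:
        c = counts[row["entity"]]
        if row["pred"] == "citation":
            c["pred"] += 1
        if row["gold"] == "citation":
            c["true"] += 1
        if row["pred"] == row["gold"] == "citation":
            c["tp"] += 1
    s_pred = [e for e, c in counts.items() if c["pred"] > 0]
    s_true = [e for e, c in counts.items() if c["true"] > 0]
    p_macro = sum(counts[e]["tp"] / counts[e]["pred"] for e in s_pred) / len(s_pred)
    r_macro = sum(counts[e]["tp"] / counts[e]["true"] for e in s_true) / len(s_true)
    return p_macro, r_macro

rows = [
    {"entity": "Al Gore", "gold": "citation", "pred": "citation"},
    {"entity": "Al Gore", "gold": "citation", "pred": "citation"},
    {"entity": "IPCC", "gold": "citation", "pred": "non-citation"},
    {"entity": "EPA", "gold": "non-citation", "pred": "citation"},
]
print(macro_precision_recall(rows))   # (0.5, 0.5)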

Evaluation was performed via tenfold cross-validation. The data was divided into ten folds. During each cross-validation step, each classifier was trained on nine folds (90% of the data) and evaluated on the held-out fold (10% of the data). This process was repeated ten times. The P, R, and F1 values presented are the average of the ten cross-validation runs. Where specified, standard deviation numbers are also derived from these ten cross-validation results. The same folds were used in all experiments, allowing a per-fold comparison across all classifiers. For this reason, significance testing was done using a two-tailed paired t-test.
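A minimal sketch of that protocol, with fixed folds shared across classifiers and a two-tailed paired t-test over the per-fold scores, is given below; the scoring callbacks are placeholders for real classifiers, and scikit-learn and SciPy stand in for whatever tooling the authors actually used.

import numpy as np
from scipy.stats import ttest_rel
from sklearn.model_selection import KFold

def paired_fold_comparison(n_instances, score_classifier_a, score_classifier_b, seed=0):
    """Score two classifiers on the same ten folds and compare them with a
    two-tailed paired t-test. The score_* arguments are callbacks of the
    form f(train_idx, test_idx) -> F1, standing in for real classifiers."""
    folds = KFold(n_splits=10, shuffle=True, random_state=seed)
    scores_a, scores_b = [], []
    for train_idx, test_idx in folds.split(np.arange(n_instances)):
        scores_a.append(score_classifier_a(train_idx, test_idx))
        scores_b.append(score_classifier_b(train_idx, test_idx))
    return ttest_rel(scores_a, scores_b)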

Eleven classifiers were evaluated:

1. ALL
2. ALL_BUT_NE
3. ONLY_NE
4. ALL_BUT_NE + ONLY_NE
5. FTR
6. FTR + ALL
7. FTR + ALL_BUT_NE
8. FTR + ONLY_NE
9. FTR + ALL_BUT_NE + ONLY_NE



10. META (ALL_BUT_NE + ONLY_NE)
11. META (FTR + ALL_BUT_NE + ONLY_NE)

The first four classifiers use only bag-of-words features. The fifth classifier, FTR, uses only our frame-target-role (FTR) triples as features. The next four classifiers (6, 7, 8, and 9) individually combine a bag-of-words feature set with our frame-target-role triples. The last two classifiers (10 and 11) are meta-classifiers. META (ALL_BUT_NE + ONLY_NE) combines the output of ALL_BUT_NE and ONLY_NE, while META (FTR + ALL_BUT_NE + ONLY_NE) combines the output of FTR, ALL_BUT_NE, and ONLY_NE. The two meta-classifiers were trained using two tiers of cross-validation. The second tier does ninefold cross-validation on the first tier's training folds. The main point is that these two tiers of cross-validation ensure that the meta-classifier is not trained indirectly with evidence derived from the test set.

Evaluation results in terms of micro- and macro-averaged P, R, and F1 are given in Table 5 and Table 6, respectively. Micro- and macro-averaged F1 is shown graphically in Figure 5.

The best f-measure (F1), both micro- and macro-averaged, was obtained by META (FTR + ALL_BUT_NE + ONLY_NE), which outperformed all other classifiers, including META (ALL_BUT_NE + ONLY_NE), by a statistically significant margin (p < .01). META (FTR + ALL_BUT_NE + ONLY_NE) found 66% of all true citations with an 18% prediction error rate (i.e., 1 − Pmicro = .18). It should be noted that our estimate of recall is optimistic, as we only consider sentences in which at least one entity was tagged as a person or organization. Citations missed due to a miss by the named entity tagger are not factored into our recall estimate.

Several conclusions can be drawn from these results. First, in terms of macro- and micro-averaged F1, all three bag-of-words models that borrow evidence from the named entity potentially being cited (i.e., ALL, ONLY_NE, and ALL_BUT_NE + ONLY_NE) outperformed ALL_BUT_NE, which ignores the entity being cited (p < .05). In fact, ONLY_NE, which considers only features occurring within cited entities in the training set, was statistically indistinguishable from ALL and ALL_BUT_NE + ONLY_NE, which consider the entire sentence.

TABLE 5. Micro-Averaged Evaluation Results (± Standard Deviation)

                                                  Pmicro            Rmicro            F1micro
BOW        ALL                                    0.869 ± 0.096     0.517 ± 0.082     0.642 ± 0.069
           ALL_BUT_NE                             0.818 ± 0.136     0.352 ± 0.084     0.487 ± 0.099
           ONLY_NE                                0.806 ± 0.045     0.549 ± 0.087     0.648 ± 0.056
           ALL_BUT_NE + ONLY_NE                   0.917 ± 0.090     0.492 ± 0.101     0.633 ± 0.089
FTR        FTR                                    0.643 ± 0.189     0.233 ± 0.090     0.339 ± 0.122
FTR+BOW    FTR + ALL                              0.869 ± 0.100     0.530 ± 0.088     0.654 ± 0.080
           FTR + ALL_BUT_NE                       0.829 ± 0.132     0.388 ± 0.095     0.523 ± 0.106
           FTR + ONLY_NE                          0.828 ± 0.047     0.578 ± 0.109     0.675 ± 0.074
           FTR + ALL_BUT_NE + ONLY_NE             0.921 ± 0.084     0.506 ± 0.094     0.648 ± 0.086
Meta       META (ALL_BUT_NE + ONLY_NE)            0.813 ± 0.052     0.613 ± 0.109     0.694 ± 0.073
           META (FTR + ALL_BUT_NE + ONLY_NE)      0.813 ± 0.053     0.664 ± 0.107     0.725 ± 0.062

TABLE 6. Macro-Averaged Evaluation Results (± Standard Deviation)

                                                  Pmacro            Rmacro            F1macro
BOW        ALL                                    0.825 ± 0.124     0.427 ± 0.069     0.557 ± 0.063
           ALL_BUT_NE                             0.788 ± 0.153     0.288 ± 0.093     0.416 ± 0.114
           ONLY_NE                                0.806 ± 0.076     0.445 ± 0.093     0.564 ± 0.057
FTR        FTR                                    0.642 ± 0.190     0.256 ± 0.098     0.362 ± 0.129
FTR+BOW    FTR + ALL                              0.826 ± 0.131     0.445 ± 0.080     0.573 ± 0.081
           FTR + ALL_BUT_NE                       0.807 ± 0.143     0.332 ± 0.099     0.464 ± 0.115
           FTR + ONLY_NE                          0.844 ± 0.074     0.482 ± 0.120     0.603 ± 0.079
           FTR + ALL_BUT_NE + ONLY_NE             0.900 ± 0.114     0.409 ± 0.079     0.557 ± 0.082
Meta       META (ALL_BUT_NE + ONLY_NE)            0.822 ± 0.081     0.528 ± 0.117     0.632 ± 0.073
           META (FTR + ALL_BUT_NE + ONLY_NE)      0.820 ± 0.086     0.587 ± 0.112     0.674 ± 0.057

FIGURE 5. Micro- and macro-averaged F1 results. [Bar chart of micro- and macro-averaged F1 (0.00–1.00) for each of the eleven classifiers.]

In this dataset, 62% of citations referenced a named entity that was cited more than once. This helps explain ONLY_NE's performance in terms of micro-averaged F1. It is surprising, however, that ONLY_NE performed as well as it did in terms of macro-averaged F1. Again, macro-averaged F1 is dominated by precision and recall with respect to entities cited rarely (i.e., entities with few training examples). Of the 140 unique named entities mentioned in the 279 sentences labeled citation by the human coders, only 20 (14%) were referenced in more than one citation and not mentioned in a single non-citation. These can be considered easy cases for ONLY_NE. So, how could ONLY_NE achieve a macro-averaged recall of .445? The answer is that there was significant vocabulary overlap between the unique entities referenced in citations, for two reasons. First, synonymous or coreferential unique entities (e.g., Discovery Channel, Discovery TV Channel, and The Discovery Channel, or Al Gore and Gore) were not manually mapped to a common canonical form. In other words, it was possible for two unique entities to refer to the same actual person or organization. Second, even across conceptually distinct entities, there was some vocabulary overlap from words such as center, association, institute, magazine, academy, report, service, and federation. Of the 106 unique entities that were referenced in only one citation (potentially difficult cases for ONLY_NE), 68 (64%) contained at least one non-stopword term stem that appeared in another named entity that was also cited.
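This kind of overlap check can be approximated with a sketch along the following lines, which stems the non-stopword terms of each entity name (here with NLTK's Porter stemmer, an assumption; any stemmer would do) and counts singleton entities that share a stem with some other cited entity.

from nltk.stem import PorterStemmer
from nltk.corpus import stopwords  # requires nltk.download("stopwords")

def entities_with_stem_overlap(singleton_entities, all_cited_entities):
    """Count singleton entities sharing a non-stopword stem with another cited entity.

    Both arguments are hypothetical lists of entity-name strings drawn from
    labeled citations; `singleton_entities` are those cited exactly once.
    """
    stemmer = PorterStemmer()
    stop = set(stopwords.words("english"))

    def stems(name):
        return {stemmer.stem(w) for w in name.lower().split() if w not in stop}

    overlapping = 0
    for entity in singleton_entities:
        own = stems(entity)
        others = set().union(*(stems(e) for e in all_cited_entities if e != entity))
        if own & others:
            overlapping += 1
    return overlapping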

Second, as is shown in Figure 5, FTR was the only classifier for which macro-averaged F1 was higher than micro-averaged F1. Therefore, of the classifiers evaluated, it was the least influenced by the entities cited in the training data. Not being influenced by the entities cited in the training data is a desired property for model transfer: training a model on one corpus and applying it to another corpus. For instance, a classifier such as ONLY_NE might not transfer well across corpora in which different entities are referenced in citations. FTR's performance was low relative to the other classifiers. However, this result suggests that syntactic features, which ignore content and focus on structure, have potential value.

Third, both meta-classifiers outperformed their multiple-feature-set, single-classifier counterparts. In terms of micro- and macro-averaged F1, META (ALL_BUT_NE + ONLY_NE) outperformed ALL_BUT_NE + ONLY_NE, and META (FTR + ALL_BUT_NE + ONLY_NE) outperformed FTR + ALL_BUT_NE + ONLY_NE (p < .01). Combining these feature sets in a meta-classification framework worked better than merging the feature sets at the input of a single classifier.

Finally, although FTR performed worse than all other classifiers, frame-target-role triples added value in a meta-classification framework. META (FTR + ALL_BUT_NE + ONLY_NE) outperformed META (ALL_BUT_NE + ONLY_NE) in terms of micro- and macro-averaged F1 (p < .01). The only difference between these two meta-classifiers is that the better performing one uses FTR's predictions as an additional input.

Learning Curves

Figures 6 and 7 show micro- and macro-averaged f-measure (F1) as a function of the amount of training data used to train each model, from 10% to 90% of the data in 10% increments. The total amount of training data was 6,000 sentences. We focus on FTR, ALL, ALL_BUT_NE, ONLY_NE, FTR + ALL_BUT_NE + ONLY_NE, META (ALL_BUT_NE + ONLY_NE), and META (FTR + ALL_BUT_NE + ONLY_NE).

FIGURE 6. Learning curves: micro-averaged F1. [F1 (0.00–1.00) versus percentage of data used for training (10%–90%) for FTR, ALL, ALL_BUT_NE, ONLY_NE, FTR + ALL_BUT_NE + ONLY_NE, META (ALL_BUT_NE + ONLY_NE), and META (FTR + ALL_BUT_NE + ONLY_NE).]

FIGURE 7. Learning curves: macro-averaged F1. [Same classifiers and axes as Figure 6, with macro-averaged F1.]
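A learning curve of this kind can be produced with a sketch like the one below, which trains on growing random fractions of a training split and scores F1 on a fixed held-out split; the splitting utilities and the linear SVM are illustrative stand-ins rather than the original experimental setup.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

def learning_curve(X, y, fractions=np.arange(0.1, 1.0, 0.1), seed=0):
    """Train on increasing fractions of a training split; report F1 on a held-out split.

    X and y are hypothetical feature matrices and binary citation labels.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.1, random_state=seed, stratify=y)
    rng = np.random.RandomState(seed)
    scores = []
    for frac in fractions:
        n = max(2, int(frac * len(y_train)))
        idx = rng.choice(len(y_train), size=n, replace=False)
        model = LinearSVC().fit(X_train[idx], y_train[idx])
        scores.append((frac, f1_score(y_test, model.predict(X_test))))
    return scores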

These learning curves reveal several trends. First, in terms of micro- and macro-averaged F1, ONLY_NE and both meta-classifiers perform at comparable levels with little training data, but the meta-classifiers improve over ONLY_NE with more training data. The difference in performance is greater in terms of macro-averaged F1, reinforcing the claim that, by combining feature sets, both meta-classifiers gravitate less toward citations referencing entities cited in the training data. Second, at low levels of training data (10%–30%), FTR outperformed ALL_BUT_NE. However, with more training data, ALL_BUT_NE outperformed FTR in terms of micro-averaged F1. This suggests that although ALL_BUT_NE ignores features from within the tagged named entity, it is still biased toward citations referencing entities in the training data.

ERROR ANALYSIS

Frame-Target-Role Triples

There are at least two reasons why the FTR classifier alone did not perform better: (a) low coverage and (b) ambiguous frame-target-role triples. In terms of coverage, of the 279 sentences labeled citation by the coders, 123 (44%) had no matching syntactic pattern and therefore no frame-target-role triple. For those sentences, the classifier had to make a blind prediction. There were two reasons for a sentence having no matching syntactic pattern. The first is that our inventory of frame-target-role triples is limited by what is covered in FrameNet. Not every word maps to a lexical unit in FrameNet. For example, the verb publish is not associated with a frame in FrameNet. There were eight citations in which the referenced person or organization appeared as the subject of the verb publish. Limited coverage is a known problem with FrameNet, and prior work has focused on extending it (Green, Dorr, & Resnik, 2004).


The second reason is that the annotated corpora distributed with FrameNet do not represent all the syntactic possibilities for a frame-target-role triple. This problem potentially could be alleviated by requiring only a soft matching between the syntactic context of a cited entity and a pattern in our inventory.
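A minimal sketch of such soft matching, assuming dependency paths are represented as token lists and using plain token-level edit distance with an illustrative threshold, might look as follows.

def edit_distance(a, b):
    """Token-level Levenshtein distance between two dependency-path patterns."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[m][n]

def soft_match(context_path, pattern_inventory, max_distance=1):
    """Return the triple of the closest known pattern within `max_distance` edits, or None.

    `context_path` is the dependency path observed around the cited entity,
    e.g., ["said", "nsubj", "SPEAKER"]; `pattern_inventory` is a hypothetical
    dict mapping such paths (as tuples) to frame-target-role triples.
    """
    best, best_dist = None, max_distance + 1
    for pattern, triple in pattern_inventory.items():
        dist = edit_distance(context_path, list(pattern))
        if dist < best_dist:
            best, best_dist = triple, dist
    return best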

As expected, some frame-target-role triplesprovide more evidence of a citation than others.

Consider the two frame-target-role triples statement-report-speaker and statement-said-speaker. The first triple provides stronger evidence of a citation because the lexical unit report inherently suggests that the claim being made is concrete and not obvious. Indeed, the first triple was associated with 15 citations and only two non-citations. However, speakers often make use of less reliable frame-target-role triples in citations. Consider the following two sentences. The first one is a citation, but not the second.

1. . . . an official with the [Canadian Department of Fisheries and Oceans]org said: we've noticed that the ice over the past four or five years has been deteriorating . . .PB

2. President [Bush]per in his presidential campaign said that he was campaigning on an environmental platform.PB

The only way to tell that the first statement is a citation is to consider the content of the information being attributed to the speaker. However, syntactically motivated features such as those used by FTR focus on structure rather than content.

Improving Overall Recall

Our best performing classifier, META (FTR + ALL_BUT_NE + ONLY_NE), found approximately 66% of all citations (an optimistic estimate, as previously mentioned) with a micro-averaged prediction accuracy of 81%. Ninety-four citations were missed. Seventy-seven of these 94 misses (87%) were predicted non-citation by all three base classifiers that produced the three input features to this meta-classifier (FTR, ALL_BUT_NE, and ONLY_NE). Fifty-eight of these 94 misses (62%) were references to entities that were referenced only once. These were difficult cases for ONLY_NE, which focuses only on terms from the marked named entity. Forty-one of these 94 misses (44%) were sentences for which no syntactic pattern matched. These were featureless data points for FTR.

An improvement in recall would likely come from an improvement in named entity tagging accuracy. Remember that annotators were instructed to label named entity tagging errors as non-citations. Mistakes by the named entity tagger pose a challenge because they make non-citations look like citations. For example, in our evaluation, all base classifiers (and hence our best meta-classifier) missed the citation [Nick Lunn]per shows the polar bears of Hudson Bay coming on land earlier as the sea ice melts earlier.PB The reason was that four non-citations shared a similar local context:

1. [Science]org shows that polar bear populations are declining as a result of global warming.PB

2. [Science]org shows us a shrinking world for the polar bear.PB

3. [Research]org shows that the sea ice in the Arctic has retreated farther . . .PB

4. [Science]org shows that polar bears are threatened with extinction from . . .PB

These were labeled non-citation by the human coders because the first word in each was mistakenly labeled as an organization. Because the verb shows co-occurred more often with non-citations than citations, this feature and the syntactic pattern shows –(nsubj)→ speaker were associated with the non-citation class, decreasing recall.

Difficult Cases

In some cases, the difference between a citation and a non-citation can be subtle. There is no reason to believe that such cases are uncommon across different corpora. Consider the following sentences:

1. . . . an official with the [Canadian Department of Fisheries and Oceans]org said: we've noticed that the ice over the past four or five years has been deteriorating . . .PB

2. [Carl Sagan]per said: there are no useless threads in the fabric of life.PB

3. Like [Ben Franklin]per said: we all hang together or we hang separately.PB

The first reference is a citation. The second and third are not. However, the difference between the first quote and the last two is not directly accessible to an automated system that lacks domain (or even common) knowledge and uses evidence only from the sentence in question. A human might draw a distinction between the first statement and the last two based on three criteria:



1. Authority: It is easy for a human to guess that the Canadian Department of Fisheries and Oceans is more authoritative than Carl Sagan or Ben Franklin when it comes to the polar bear and its habitat.

2. Relevance: The information attributed to the Canadian Department of Fisheries and Oceans is more relevant than the claims attributed to Carl Sagan and Ben Franklin. It uses language that more closely matches the language central to the debate (e.g., words such as ice and deteriorating rather than fabric and hang).

3. Vagueness: The last two statements are vague, meaning that their interpretation is more open-ended than the claim made in the first sentence. The information attributed to the Canadian Department of Fisheries and Oceans is more concrete.

Relevance is something that has been widely studied in text classification and information retrieval. A possible improvement could come from adding a feature that quantifies the similarity between the language of the sentence in question and the language characterizing the proposal, which is exemplified in the text from the Federal Register, the publicly accessible document that describes the proposed regulation.
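One simple realization of this idea, sketched below under the assumption that the proposal text is available as a single string, scores each sentence by TF-IDF cosine similarity to the Federal Register text; the vectorizer choice is illustrative.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def relevance_feature(sentences, proposal_text):
    """Cosine similarity between each comment sentence and the proposal text.

    `proposal_text` is the (hypothetical) text of the Federal Register notice;
    the returned scores could be appended to each sentence's feature vector.
    """
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform([proposal_text] + list(sentences))
    proposal_vec, sentence_vecs = matrix[0], matrix[1:]
    return cosine_similarity(sentence_vecs, proposal_vec).ravel()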

Measuring authority and vagueness are more difficult problems. Quantifying authority might require an external resource such as the Web or Wikipedia.7 Some named entities referenced in citations in the Polar Bears corpus (e.g., the IPCC, the United States Geological Survey, the Canadian Wildlife Service, and the Natural Resources Defense Council) have Wikipedia entries that are linked (directly or indirectly) to the Wikipedia article on polar bears. A direct or indirect link between two Wikipedia articles marks a relationship between the source and target entity. Entities more tightly linked to the page on polar bears may be more authoritative than those less tightly linked. However, this approach is at the mercy of what is covered in Wikipedia and may not lead to a corpus-general solution.
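As a rough sketch of how link structure could be turned into an authority signal, the following breadth-first search measures the distance between two articles in a link graph; the graph here is a hypothetical adjacency map built offline, not a call to any real Wikipedia API.

from collections import deque

def link_distance(link_graph, source, target):
    """Breadth-first search distance between two articles in a link graph.

    `link_graph` is a hypothetical dict mapping article titles to the set of
    titles they link to (e.g., built from a Wikipedia dump). Returns None if
    no path exists.
    """
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in link_graph.get(node, ()):
            if neighbor == target:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None

# Entities closer to the topic article would be treated as more authoritative,
# e.g., authority_score = 1 / (1 + link_distance(graph, "IPCC", "Polar bear")).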

Of the three dimensions listed above, vagueness seems to be the most difficult to quantify by automated means. In fact, we are unaware of any work that has attempted to automatically quantify degree of vagueness in text. Some prior work has focused on classifying statements into subjective and objective (Wiebe, Wilson, Bruce, Bell, & Martin, 2004). However, subjectivity and vagueness are not exactly the same phenomenon. A citation may contain a specific subjective component (i.e., the author cites hard evidence and then expresses his or her opinion). Another possibly relevant vein of research is that of automatically recognizing metaphors in literary text. Pasanek and Scully (in press) show that an SVM classifier using a bag-of-words representation can tell metaphors and non-metaphors apart.

RELATED WORK

To our knowledge, no prior work has focused specifically on detecting sentences that make reference to an external source of information in informal text such as public comments. Most prior work on automatic citation analysis focuses on formal text, such as academic literature. A citation in academic text is defined as a meaningful link between the work being presented and work done in the past and is usually presented in a stylized manner. In the next subsection, we review prior work on citation analysis in academic text. Then, we survey relevant syntax-based extraction technology that has been applied to similar problems.

Citation Analysis in Academic Text

Citation analysis in academic text can be divided into four problems: (a) recognizing links between documents, (b) locating citations in the body of the text, (c) disambiguating which paper (from the references section) a citation in the body refers to, and (d) determining the purpose or function of a citation. See Teufel, Siddharthan, and Tidhar (2006b) for an example of work on predicting citation function. With respect to the first task, academic papers contain a references section that lists, usually in a consistent format, the papers cited in the body of the document. Thus, establishing a link between a paper and prior work comes down to correctly parsing the references section. Task (b), locating citations in the body of the text, sounds like it would be related to our work. However, the vast majority of citations in the body of academic text are formal citations, using the terminology from Powley and Dale (2007). Formal citations mostly follow a consistent, recognizable format (e.g., the Author-Year pair). Powley and Dale (2007) report high accuracy (i.e., greater than .99 precision and greater than .96 recall) in detecting formal citations.
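For illustration only, a simple regular expression such as the one below catches many Author-Year forms; it is not the method of Powley and Dale (2007), which handles far more variation.

import re

# Matches simple Author-Year forms such as "Callan (2008)", "(Shulman, 2007)",
# or "(Arguello & Callan, 2007)"; real systems handle many more variants.
AUTHOR_YEAR = re.compile(
    r"\(?[A-Z][A-Za-z'-]+(?:\s+(?:&|and)\s+[A-Z][A-Za-z'-]+)?"
    r"(?:\s+et al\.)?,?\s+\(?(19|20)\d{2}\)?"
)

def find_formal_citations(sentence):
    """Return spans of candidate formal citations in a sentence."""
    return [m.group(0) for m in AUTHOR_YEAR.finditer(sentence)]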

Some citations in academic literature are known as informal citations, such as This approach has many strong points, but does not provide a very satisfactory account of the adherence to discourse conventions in dialogue., borrowed from Siddharthan and Teufel (2007). Informal citations look more like citations in public comments. Prior work on detecting informal citations treats it as a two-step process. First, high-recall/low-precision heuristics are used to collect candidate informal citations. For example, Kim and Webber (2006) examine sentences with the pronoun they. Siddharthan and Teufel (2007) focus on a predefined set of referring expressions, including pronouns and work nouns such as approach, study, investigation, and result. In the second step, the problem is formulated as multiclass classification, where there is one class for every document in the references, one class allotted to the current document, and one class for "no-document." Indirectly, this two-step process is doing informal citation detection. If the referring expression refers to a paper in the references, then it is a citation. Detecting citations in public comments cannot be framed the same way as detecting informal citations in academic text. Public comments do not contain a references section, and, more generally, a finite set of citable sources does not exist for any corpus.

Syntax and Extraction

Motivated by the argument that syntax (the grammatical relation between words) is important for recognizing events in text, recent work has focused on dependency-tree paths for extraction and classification. The common approach, as in our FTR approach, is to use syntactic patterns as features in a machine learning setting. Approaches differ in how they decide which types of syntactic patterns are worth adding as features and which ones are not. For example, Yangarber (2003) and Stevenson and Greenwood (2005) focus on subject-verb-object relations, where each pattern is a verb and its subject and/or object. The motivation behind their approach is that the entities of interest favor certain predicates more than others (e.g., organizations tend to acquire, merge, and sell but not eat, sleep, and marry). This intuition also motivates Riloff's system, AutoSlog (Riloff, 1993), which focuses on a fixed set of grammatical relations, including prepositional phrases and noun phrases.

One important hurdle in using syntactic information for extraction is that not every part of the syntactic pattern is important. Also, a syntactic pattern may include two or more words. Longer syntactic patterns are less frequent and have potentially less predictive power. Recent efforts have focused on soft matching between syntactic patterns to avoid this sparsity problem. Erkan et al. (2007) use dependency parsing to determine if two proteins mentioned in the same sentence are said to interact in some metabolic process. The path in a dependency parse from one protein to the other is used as evidence of their relation. Their soft-matching approach uses edit distance, which quantifies the similarity between two ordered sequences. Bunescu and Mooney (2005) use the dependency tree path between two entities in a sentence to classify their relation with respect to a fixed set of relations of interest (e.g., person is affiliated with organization or person has a family connection with person). Their dependency path is augmented with semantic information to make a match more plausible. In our approach, frame-target-role triples are resolved by applying our inventory of patterns in an exact-match strategy. A possible improvement could come from implementing a soft-matching heuristic. However, combining FTR with simpler representations alleviates some of the shortcomings of hard matching.


CONCLUSION

We investigated the problem of manually and automatically recognizing citations in public comment corpora, addressed here for the first time. A significant portion of this work focused on refining our citation annotation manual up to the point where it produced acceptable inter-annotator agreement. We used existing machine learning techniques to learn to automatically detect citations from training data. Several feature sets were evaluated individually and in combination. Each bag-of-words feature set focused on a different part of the sentence in question (e.g., the entire sentence, only the potentially cited named entity, and the entire sentence except the named entity). Our FrameNet-motivated features focused on how potentially cited named entities fit syntactically into their local context. We obtained our best results by combining feature sets in a meta-classification framework. This classifier found 66% of all citations (an optimistic estimate of recall) with prediction accuracy of 81%. Although this is an encouraging result, more work is needed to produce a solution that does not require extensive human annotation to produce training data.

One worrisome result was that macro-averaged recall was substantially lower than micro-averaged recall for most algorithms. A classifier that finds more references to unknown external sources (i.e., named entities not seen in training data citations) might be more valuable to a policy maker. Also, a classifier that favors references to entities already seen in the training data is less likely to transfer well across corpora. Ideally, we want a model that can be trained on one corpus and used to detect citations in a different corpus. Otherwise, human effort must be expended to produce training data for each new corpus. In our evaluation corpus, 62% of citations referenced an entity that was cited more than once. The Polar Bears corpus may not be an outlier in this respect. The challenge is one of representation. Under a bag-of-words representation, if authors consistently cite the same entities but describe those citations in inconsistent ways, a model will favor features (i.e., terms) related to the entities cited in the training data. However, while these features are effective for the same corpus as the training data, they might not transfer well to a different corpus in which different entities are cited. Syntactic features, which ignore all content and focus on structure, might be more transferable across corpora (macro-averaged F1 was higher than micro-averaged F1 using our syntactic features). However, more work is needed to increase the coverage and accuracy of our syntax-based model.

An attractive research direction to explore in future work is that of co-training (Blum & Mitchell, 1998). In co-training, different classifiers leverage distinct feature sets and are bootstrapped to make use of unlabeled data to improve performance. We obtained our best result by combining classifiers in a meta-classification framework, where each base classifier focused on a different feature set. This positive result indicates that our base classifiers made different types of mistakes, which is a desired property for co-training to work. A confident prediction made by one classifier could be used as training data for a different classifier. A bootstrapping framework could also be useful in model transfer. For instance, a model based on syntactic features might transfer better across corpora than a model based on content features (different corpora cover different topics), though it may suffer from low recall. If a good content-based model can be learned with little training data, then a few highly reliable predictions from a syntax-based model could provide enough traction for a higher recall, content-based model to be learned.
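A minimal co-training loop along these lines is sketched below: two classifiers, each restricted to one feature view (for example, syntactic versus content features), repeatedly label an unlabeled pool and add their most confident predictions to the shared training set. The classifiers, confidence measure, and growth schedule are illustrative assumptions, not a specification of how co-training would be applied here.

import numpy as np
from sklearn.linear_model import LogisticRegression

def co_train(X1_lab, X2_lab, y_lab, X1_unlab, X2_unlab, rounds=10, per_round=5):
    """Bootstrap two view-specific classifiers from unlabeled data (Blum & Mitchell, 1998).

    X1_*/X2_* are hypothetical dense feature matrices for two views of the same
    sentences (e.g., syntactic and content features); y_lab holds labels for the
    labeled portion. Each round, each classifier labels the unlabeled pool and
    its most confident predictions are added to the shared training set.
    """
    labeled1, labeled2, labels = X1_lab, X2_lab, y_lab
    pool = np.arange(X1_unlab.shape[0])
    clf1 = LogisticRegression(max_iter=1000)
    clf2 = LogisticRegression(max_iter=1000)
    for _ in range(rounds):
        if len(pool) == 0:
            break
        clf1.fit(labeled1, labels)
        clf2.fit(labeled2, labels)
        new_idx, new_labels = [], []
        # Each classifier nominates its most confident unlabeled examples
        # and supplies the labels for them.
        for clf, X_unlab in ((clf1, X1_unlab), (clf2, X2_unlab)):
            probs = clf.predict_proba(X_unlab[pool])
            top = pool[np.argsort(-probs.max(axis=1))[:per_round]]
            new_idx.extend(top)
            new_labels.extend(clf.predict(X_unlab[top]))
        new_idx = np.array(new_idx)
        labeled1 = np.vstack([labeled1, X1_unlab[new_idx]])
        labeled2 = np.vstack([labeled2, X2_unlab[new_idx]])
        labels = np.concatenate([labels, np.array(new_labels)])
        pool = np.setdiff1d(pool, new_idx)
    return clf1, clf2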

NOTES

1. http://erulemaking.cs.cmu.edu/data.php
2. http://www.qdap.pitt.edu
3. http://opennlp.sourceforge.net/
4. http://framenet.icsi.berkeley.edu
5. http://nlp.stanford.edu/downloads/lex-parser.shtml
6. For consistency, each dependency-path pattern arbitrarily starts with the lexical unit and concludes with the frame element.
7. http://en.wikipedia.org/wiki/polar_bear


REFERENCES

Bikel, D. M., Schwartz, R. L., & Weischedel, R. M. (1999). An algorithm that learns what's in a name. Machine Learning, 34(1–3), 211–231.

Blum, A., & Mitchell, T. (1998). Combining labeled and unlabeled data with co-training. In Proceedings of COLT (pp. 92–100). New York: Morgan Kaufmann Publishers.

Bunescu, R. C., & Mooney, R. J. (2005). A shortest path dependency kernel for relation extraction. In Proceedings of HLT (pp. 724–731). Association for Computational Linguistics.

Carletta, J. (1996). Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics, 22(2), 249–254.

Erkan, G., Ozgur, A., & Radev, D. R. (2007). Semi-supervised classification for extracting protein interaction sentences using dependency parsing. In Proceedings of EMNLP (pp. 228–237). Association for Computational Linguistics.

Green, R., Dorr, B. J., & Resnik, P. (2004). Inducing frame semantic verb classes from WordNet and LDOCE. In Proceedings of ACL (pp. 375–382). Association for Computational Linguistics.

Joachims, T. (1998). Making large-scale support vector machine learning practical. In B. Scholkopf, C. Burges, & A. Smola (Eds.), Advances in kernel methods: Support vector machines. MIT Press.

Kim, Y., & Webber, B. (2006). Automatic reference resolution in astronomy articles. In Proceedings of CODATA. The Committee on Data for Science and Technology.

Pasanek, B., & Scully, D. (in press). Mining millions of metaphors. Literary and Linguistic Computing. Oxford Journals.

Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14(3), 130–137.

Powley, B., & Dale, R. (2007). Evidence-based information extraction for high accuracy citation and author name identification. In Proceedings of RIAO. C.I.D.

Riloff, E. (1993). Automatically constructing a dictionary for information extraction tasks. In Proceedings of AAAI (pp. 811–816). AAAI Press/The MIT Press.

Ruppenhofer, J., Ellsworth, M., Petruck, M. P. L., Johnson, C. R., & Scheffczyk, J. (2006). FrameNet II: Extended theory and practice, main project document.

Siddharthan, A., & Teufel, S. (2007). Whose idea was this, and why does it matter? Attributing scientific work to citations. In Proceedings of NAACL/HLT (pp. 316–323). Association for Computational Linguistics.

Stevenson, M., & Greenwood, M. A. (2005). A semantic approach to IE pattern induction. In Proceedings of ACL (pp. 379–386). Association for Computational Linguistics.

Teufel, S., Siddharthan, A., & Tidhar, D. (2006a). An annotation scheme for citation function. In Proceedings of SIGDIAL (pp. 80–87). Association for Computational Linguistics.

Teufel, S., Siddharthan, A., & Tidhar, D. (2006b, July). Automatic classification of citation function. In Proceedings of EMNLP (pp. 103–110). Association for Computational Linguistics.

Vapnik, V. (1995). The nature of statistical learning theory. New York, NY: Springer-Verlag.

Wiebe, J., Wilson, T., Bruce, R., Bell, M., & Martin, M. (2004). Learning subjective language. Computational Linguistics, 30(3), 277–308.

Wolpert, D. H. (1990). Stacked generalization (Tech. Rep. No. LA-UR-90-3460). Complex Systems Group, Los Alamos, NM.

Yang, H., & Callan, J. (2006). Near-duplicate detection by instance-level constrained clustering. In Proceedings of SIGIR (pp. 421–428). Association for Computing Machinery.

Yangarber, R. (2003). Counter-training in discovery of semantic patterns. In Proceedings of ACL (pp. 343–350). Association for Computational Linguistics.

Ziman, J. M. (1968). Public knowledge: An essay concerning the social dimensions of science. Cambridge, England: Cambridge University Press.