Generation and Evaluation of Indexes for Chemistry Articles


Journal of Intelligent Information Systems 7, 57–76 (1997). © 1997 Kluwer Academic Publishers. Manufactured in The Netherlands.


JULIA HODGES [email protected]
SHIYUN YIE [email protected]
Department of Computer Science, Mississippi State University, Box 9637, Mississippi State, MS 39762-9637

SONAL KULKARNI [email protected]
MicroStrategy, Inc., 8000 Towers Crescent Drive, #1400, Vienna, VA 22182

RAY REIGHART [email protected]
Dept. 7, Chemical Abstracts Service, 2540 Olentangy River Road, Columbus, OH 43202-1505

Abstract. This paper describes AIMS (Assisted Indexing at Mississippi State), a system that aids human document analysts in the assignment of indexes to physical chemistry journal articles. There are two major components of AIMS—a natural language processing (NLP) component and an index generation (IG) component. The focus of this article is the IG. We describe the techniques and structures used by the IG in the selection of appropriate indexes for a given article. We also describe the results of evaluations of the system in terms of recall, precision, and overgeneration. We provide a description of a graphical user interface that we have developed for AIMS. Finally, we discuss future work.

Keywords: automatic indexing, document analysis, user interface

1. Introduction

An ongoing research effort called the KUDZU (Knowledge Under Development from Zero Understanding) project in the Intelligent Systems Research Laboratory of the Department of Computer Science at Mississippi State University addresses the extraction of knowledge from natural language text using methods that are as fully automated and domain-independent as possible (Boggess et al., 1991; Hodges and Cordova, 1993; Boggess et al., 1995). Currently, these techniques are being adapted for the development of a system called AIMS (Assisted Indexing at Mississippi State) that aids human document analysts in the assignment of indexes to journal articles that have been published in the area of physical chemistry (Boggess and Hodges, 1994). This work is being done in collaboration with the Chemical Abstracts Service (CAS) of the American Chemical Society in Columbus, Ohio. CAS employs about 400 document analysts; about 60 of them read and index physical chemistry journal articles. Typically, these analysts hold a Ph.D. in chemistry.

AIMS consists of a natural language processing (NLP) component and an index generation (IG) component. The primary focus of this paper is the IG. We provide a detailed description of the techniques and structures used by the IG in the generation of appropriate indexes for an article. We then report the results of evaluations of the success of the system in terms of recall, precision, and overgeneration. We also describe the development of a graphical user interface for AIMS. Finally, we describe future work. We shall begin our discussion with a description of similar research efforts.

2. Other index generation efforts

Other efforts at CAS have addressed ways of improving the indexing process by providing some sort of automated tool for the document analysts. Zamora and Blower (1984) developed programs that use computational linguistic techniques to extract facts about synthetic reactions reported in the Journal of Organic Chemistry. They used a simple model for the text describing a chemical reaction. For example, they assumed that a paragraph describing a chemical reaction would describe the formation of only one product. Furthermore, they assumed that the structure of the text would consist of “a heading, synthesis description, workup, and characterization of the product”. This allowed the procedures that they developed for the purpose of identifying the components of a reaction to depend only on the syntactic structure, with no need for additional semantic information. Zamora and Blower recognized that the semantic problems would be much more complex if their procedures did not assume this simple text model.

In another project at CAS, Ledwith (1988) developed a concept-oriented database that linked the index terms together to form an information retrieval aid. He defined a concept-oriented database in this context as “a collection of documents (or document surrogates) and auxiliary data”. The auxiliary data which is associated with each document is “conceptual information describing the semantic content of the document”. The conceptual information serves as a key for retrieving the document. The concept-oriented database included a representation of all the concepts known to the database, synonyms for each concept, links between related concepts (e.g., genus-species), and a description of each index entry (defined in terms of the concepts and their relationships with other concepts). The index entries had already been produced by document analysts. Ledwith built a hypertext browsing system that allowed a user to browse through the various concepts represented in the database, then retrieve documents related to those concepts. Unlike the work reported here, Ledwith made no attempt to provide an automated system to process the documents themselves and generate the indexes. His work was limited to linking index terms that had already been generated by document analysts in order to form a search aid.

Ai et al. (1990) extended the work begun by Zamora and Blower on the development of a series of programs that produce a summary of a chemical reaction reported in a paper in The Journal of Organic Chemistry. The summary includes an identification of the substances involved in the chemical reaction—their roles (e.g., reactives, reagents, or solvents) and quantities—and a description of the procedure that was followed—the order in which substances were mixed, the duration of each individual reaction step, and the temperature used during each step. The researchers chose to limit their study to the experimental sections of articles because “the subject matter is restricted and the method of presentation is quite stylized and predictable”.

A number of researchers have defined probabilistic models for the purpose of assigning indexes to documents. Some researchers have determined the relevance of an index by estimating the probabilistic weights for various indexes based on the number of queries that relate to a specific document or document component (Maron and Kuhns, 1960; Kwok, 1986). Fuhr and Buckley (1991) have described a probabilistic learning approach to the selection of indexes for documents. They defined relevance description as an abstraction from term-document pairs, thus providing parameters for their probabilistic model that are not restricted to relevance information extracted from a specific document. A relevance description “comprises a set of features that are considered to be important for the task of assigning weights to terms with relation to documents. So a relevance description x(ti, dm) contains values of attributes of the term ti, the document dm and their relationship”. The approach used by Fuhr and Buckley is similar to approaches used in pattern recognition. One may even describe their model as a learning method because relevance feedback data is used to derive the probabilities.

Fagin (1987) has described some experiments in automatic phrase indexing in which the document analysis algorithm considered the syntactic structure of both the document and the query. In one set of experiments, he used a statistical method which made use of simple text characteristics such as word frequency. In another set of experiments, he used a syntactic method which made use of “augmented phrase structure rules” to identify key phrases from parse trees generated by the syntactic analyzer. Fagin’s results indicated that a syntax-based indexing method has certain benefits not available with a statistical approach.

Based on Fagin’s work, Salton and Buckley (1990) introduced an automatic text matching system that is based on phrase identification, phrase refinement, phrase weighting, and a global comparison of sets of weighted phrases. They conducted experiments to show the relationships between global text similarities, local text similarities, and the related subject matter. Salton and Singhal (1995) developed methods for “identifying important text passages in large texts”. One of their objectives is to provide useful strategies for providing selective text access based on the needs of the user. Their text traversal strategy is based in part on the generation of a “text theme”, which they define as “an area of concentration in a running text consisting of sets of related text excerpts covering a common related subject area”.

FERRET (Mauldin, 1991) is a full text information retrieval system that is a combination of text parser, case frame matcher, and query parser. It uses a “partial understanding of its texts to provide greater precision and recall performance than simple keyword matching techniques”. Precision and recall are performance evaluation metrics used in evaluating information retrieval systems. Recall is the proportion of the correct indexes which have been generated. Precision is the proportion of the generated indexes which are correct. We discuss these metrics as they apply to our system in Section 5 of this paper. In FERRET, Mauldin addressed problems such as polysemy and synonymy. Polysemy refers to words having multiple meanings, which is something that can reduce the precision of an information retrieval system. This is due to the fact that the system may retrieve a number of irrelevant documents that use different definitions of the search word from what was intended. Synonymy refers to having multiple words or phrases to describe the same concept, which is something that can reduce the recall of an information retrieval system. This may happen if the system bases its search on some synonym of the original search word with the result that there are no documents with that word found.
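The recall and precision definitions above translate directly into code. The following is an illustrative sketch of the two metrics over sets of index terms (not an implementation from the paper); the index names in the example are invented.

```python
def recall_precision(generated, correct):
    """Recall and precision as defined in the text:
    recall    -- proportion of the correct indexes that were generated
    precision -- proportion of the generated indexes that are correct
    """
    generated, correct = set(generated), set(correct)
    hits = generated & correct
    recall = len(hits) / len(correct) if correct else 0.0
    precision = len(hits) / len(generated) if generated else 0.0
    return recall, precision

# Example: 3 of the 4 analyst-assigned indexes were generated, plus 1 spurious one.
r, p = recall_precision(
    generated=["kinetics", "fluorescence", "photolysis", "viscosity"],
    correct=["kinetics", "fluorescence", "photolysis", "electric potential"],
)
# r == 0.75 and p == 0.75
```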


WorldViews (Ginsberg, 1993) is an experimental system which provides automatic indexing and information retrieval by using “a structured set of subject headings or descriptors”, or what Ginsberg calls a “thesaurus”. The automatic indexing process consists of assigning thesaurus entries to documents in such a way that these entries describe the contents of the documents.

Soderland and Lehnert (1994) described information extraction systems as systems that analyze “real world text such as newswire stories”. Such a system “returns a structured representation of just the information from the text that is relevant to a user’s needs, ignoring irrelevant information”. An information extraction system must first determine the objects being discussed in a particular text, then identify relationships among the objects that may be explicitly stated in or inferred from the text. This second step, referred to as discourse analysis, is very similar to the identification of objects and their relationships that is done by an earlier system developed in the KUDZU project (Boggess et al., 1995; Cordova and Hodges, 1992a, 1992b). This system was capable of building a knowledge base by processing a natural language technical text (in machine-readable form).

Soderland and Lehnert (1994) have reported the initial development of Wrap-Up, a trainable discourse component for an information extraction system. Wrap-Up utilizes machine learning techniques to discover a set of classifiers for a training set of texts. It also suggests a feature set for each classifier it discovers. The intent is to reduce the amount of hand crafting required in information extraction systems in order to provide domain-specific information and to make the information extraction systems more portable from domain to domain.

3. Overview of AIMS

Originally, the KUDZU research effort involved the development of a computer system capable of extracting knowledge by processing natural language text that is in machine-readable form. The primary testbed was The Merck Veterinary Manual (Fraser and Mays, 1986), although some testing was also done with technical material from other fields in order to demonstrate the domain independence of the system. An important feature of this early KUDZU system was its ability to bootstrap itself. The system is initialized with only a description of the types of objects and relationships to be stored in the knowledge base and an empty knowledge base. It then builds the knowledge base “from scratch” by processing the text (Agarwal, 1995; Agarwal and Boggess, 1992; Boggess et al., 1991, 1995; Cordova and Hodges, 1992a, 1992b; Hodges and Boggess, 1992; Hodges and Cordova, 1993). The techniques that were used are similar to those used in the automatic generation of sublanguage semantic patterns (e.g., Hirschman, 1986; Sager, 1987). We have adapted the techniques and structures used in the early KUDZU system so that we may apply them to the current problem of identifying indexes for scientific articles.

CAS document analysts index about 600,000 articles from journals, books, and conference proceedings, published in 50 languages, each year. These articles cover all areas of chemistry. The indexes are published and added to STN International, a network of about 175 scientific and technical databases.

Our testbed is a large body of physical chemistry journal articles previously indexed by CAS human document analysts to relate to about 200 concepts taken from a hierarchy of more than 2,000 concepts. We have computerized this hierarchy as well as a set of words or phrases known by the document analysts to map onto the index concepts. We refer to these two structures as the concept hierarchy and the phrase mappings, respectively. During a preprocessing stage, we have augmented the phrase mappings file with a set of abbreviations and acronyms provided by CAS.

The NLP component processes each article by using Brill’s tagger to tag each word with syntactic and semantic tags (Brill, 1992). For example, “calibration” and “spectroscopy” are labeled “noun/method” to identify them as activities associated with experiments. “Kinetic energy” and “viscosity” are labeled “noun/property” to identify them as properties of substances. The NLP component then uses a simple parser that we developed to produce very flat parse structures. Various specialized programs are included in the NLP component to handle very specific tasks such as the attachment of prepositional phrases to the appropriate sentential elements and the identification of those sentential elements being joined by a conjunction. More detail about the NLP component has been reported elsewhere (Agarwal and Boggess, 1992; Boggess et al., 1991, 1995). An overview of the entire system architecture is provided in (Hodges et al., 1996).

The IG component combines domain-specific information such as the concept hierarchy and the phrase mappings provided by CAS with the parsed text produced by the NLP component in order to recommend subject matter (or concept) indexes for an article. An overview of the architecture of the AIMS system is shown in figure 1. Some of the techniques used by the IG are adaptations of techniques that were developed in the early KUDZU project (Hodges and Cordova, 1993). In the next section, we shall describe the IG component in some detail.

4. The index generator

The indexes produced by the human document analysts at CAS must be uniform. For this reason, CAS has defined a number of files that contain information to ensure that the various analysts adhere to the accepted terminology in the physical chemistry domain. These files were not necessarily organized in a manner that made them usable by an automated procedure because they had been used only as resource material by the CAS document analysts. CAS made these files available to those members of the research group located at Mississippi State University, who then restructured them so that they could be efficiently used by AIMS.

One of the files consists of a concept hierarchy. CAS provided a list of the 208 most frequently-occurring concept headings in physical chemistry. Some examples are: electric potential, energy level transition, fluorescence, and kinetics. Associated with each concept is information that must be present before a document analyst would create an index for that concept. For example, in another of the CAS-provided files there is a collection of phrase mappings—phrases whose presence suggests the possibility of a particular concept being referenced. Because of the number of variations on the phrases, we map the different variations of a phrase to what we refer to as the root phrase, then we map the root phrase to the set of concepts that may be indicated by the presence of that phrase in an article.
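The two-stage mapping just described can be sketched as a pair of lookup tables: surface variants normalize to a root phrase, and the root phrase maps to candidate concepts. This is our illustrative reconstruction, not the actual CAS file format; the phrase variants are invented, while the concept list for “decomposition” is taken from Section 4.2.

```python
# Stage 1: normalize phrase variants to a root phrase (variants are hypothetical).
VARIANT_TO_ROOT = {
    "decomposition": "decomposition",
    "decompositions": "decomposition",
    "decomposition reaction": "decomposition",
}

# Stage 2: map the root phrase to the concepts it may indicate
# (this concept list appears in Section 4.2 of the text).
ROOT_TO_CONCEPTS = {
    "decomposition": ["decomposition", "dissociation", "photolysis",
                      "proteins", "radiolysis"],
}

def candidate_concepts(phrase):
    """Return the concepts that a surface phrase may indicate, via its root."""
    root = VARIANT_TO_ROOT.get(phrase.lower())
    return ROOT_TO_CONCEPTS.get(root, []) if root else []
```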


Figure 1. Architecture of AIMS.

The physical chemistry journal articles are arranged in various categories or sections according to the subarea of physical chemistry being discussed in the article. (This is done to aid chemists in finding the specific information they are looking for. Section assignment is based on the major thrust of an article as suggested by the author’s emphasis.) The different concepts in the concept hierarchy are also associated with the different sections. The IG should not generate an index for a concept based only upon the fact that a phrase which can map to that concept appears in the article. An additional constraint is that the section to which that article has been assigned and the section associated with that concept must be the same.
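The section constraint amounts to a filter over the candidate concepts. The sketch below is a minimal illustration under our own assumptions; the section names and concept-to-section assignments are invented.

```python
def section_filter(candidates, article_section, concept_sections):
    """Keep only those candidate concepts whose associated sections
    include the section to which the article itself was assigned."""
    return [c for c in candidates
            if article_section in concept_sections.get(c, set())]

# Hypothetical data: section names and assignments are invented for illustration.
concept_sections = {
    "fluorescence": {"spectroscopy"},
    "kinetics": {"kinetics and mechanisms"},
}
section_filter(["fluorescence", "kinetics"], "spectroscopy", concept_sections)
# keeps only the concepts associated with the "spectroscopy" section
```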

A phrase may map to more than one concept, and a concept may have more than one associated phrase. Thus other information must be used to disambiguate the concept references.


Figure 2. Architecture of the index generator.

CAS has provided a variety of disambiguating information used by document analysts such as synonyms for the concepts that indicate a more specific concept in the hierarchy. The basic architecture of the index generator component of AIMS is shown in figure 2. In a later section, we provide a description of the algorithm used by the IG in generating the indexes for a particular article.

4.1. Complete structure of an index entry

We frequently use the terms “concepts” and “indexes” as though the two terms are interchangeable. But in fact, the concept is only one part of what is considered a concept index by CAS. When the CAS document analysts produce a concept index, it consists not only of the concept name (which is divided into six parts), but also includes what they call the context. The complete structure of a CAS concept index entry is:

CTH: the concept heading (exactly 1)
HOM: homograph (0 or 1)
MDF: modifier (0 or 1)
QLF: qualifier (0 or 1)
CAT: category (0 or 1)
LTR: limiter (0 or 1; usually 1)
free text: uncontrolled vocabulary (1)

For example:

CTH: Abstraction reaction
HOM: nil
MDF: nil
QLF: nil
CAT: nil
LTR: intramolecular, photochemical
free text: of hydrogen, in polymethylene linked xanthone and xanthene, mechanism of

A concept index entry is a hierarchy of 2–7 levels. Depending on the concept heading (CTH), some levels are required, some are optional, and some are disallowed. This arrangement maps well to a frame-like structure that we used in the early KUDZU system, so we have adapted that structure for AIMS. A frame system is a well-known knowledge representation technique in artificial intelligence that “consists of a collection of objects (bird, ostrich, and so on), each of which consists of slots (covering, instance-of, and so on) and values for these slots (feathers, other objects, and so on)” (Ginsberg, 1993). In our system, we refer to the various components of an index entry as slots. Each of the different slots plays a particular role in the definition of an index entry.
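The slot structure above maps naturally onto a small record type. The sketch below is our own illustration of the frame idea (the text does not say AIMS was written this way); the choice to default optional slots to None, mirroring “nil”, and to allow several limiters in LTR, is an assumption drawn from the example entry.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class IndexEntry:
    """Frame-like concept index entry; slot names follow the CAS structure."""
    cth: str                        # concept heading (exactly 1)
    hom: Optional[str] = None       # homograph (0 or 1)
    mdf: Optional[str] = None       # modifier (0 or 1)
    qlf: Optional[str] = None       # qualifier (0 or 1)
    cat: Optional[str] = None       # category (0 or 1)
    ltr: list = field(default_factory=list)  # limiter (0 or 1; usually 1)
    free_text: str = ""             # uncontrolled vocabulary

# The "Abstraction reaction" example from the text:
entry = IndexEntry(
    cth="Abstraction reaction",
    ltr=["intramolecular", "photochemical"],
    free_text="of hydrogen, in polymethylene linked xanthone and xanthene, "
              "mechanism of",
)
```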

A homograph distinguishes between concept headings which are the same but mean different things in different contexts. For example, the concept molds has more than one definition, depending on context:

CTH: molds        CTH: molds
HOM: fungus       HOM: forms

Homographs rarely appear and so are likely to continue to be ignored by the index generator.

Included in the synonyms file is a list of the allowed modifiers and limiters for each concept. Modifiers are intended to help to define the context in which a concept is being referenced. Limiters are intended to enhance the uniformity of the indexes produced by different document analysts by using terminology that is consistent even though different authors may express the same ideas in different ways. Document analysts use limiters (along with the concept headings) to retrieve indexable ideas that are more specific than the concept headings alone. Some examples of modifiers and limiters are:

CTH: medical goods
LTR: sutures

CTH: globulins
LTR: calcium-binding

CTH: glycoproteins
MDF: specific or class
LTR: thrombospondins


CTH: oils
MDF: glyceridic

CTH: oils
MDF: essential

Categories indicate subdivided general subject headings. That is, categories refer to types of substance classes. Some examples are: Acids, Alcohols, Alkanes.

Qualifiers refer to an article’s description of what happened to a substance class. They must be inferred from what the author of the article did to the substance class. For example:

analysis (The substance class was detected, determined, or analyzed.)
biological studies (The substance class was studied in connection with a living thing, such as a study of nutrition.)
preparation (There was some synthesis, distillation, purification, or separation of a chemical in one of the substance classes.)
properties (The article reported on the determination of chemical or physical properties such as the determination of an atomic structure or crystal structure.)

The free text is uncontrolled vocabulary used to define the context of the index entry. We hope that we will eventually be able to use the free text to limit the context of an article, thus reducing the number of incorrect indexes being generated. We anticipate, however, that the benefit of using the free text will be limited because of its vague, uncontrolled nature.

The current version of AIMS does not fill the category, qualifier, or free text slots. We plan to investigate ways to infer this information from the presence of certain verbs or nouns that indicate specific processes.

4.2. Selection of candidate concepts for indexes

The index generator processes the syntactic structures produced by the NLP component. It looks for phrases that are included in the CAS phrase mappings and that have been tagged by the NLP component with a semantic tag of interest. Currently, the IG considers the tags “substance”, “property”, “instrument”, “process”, and “method” to be interesting because these labels are consistent with the guidelines in the CAS Editorial Analysis Manual (Chemical Abstracts Service, 1994). If the IG encounters a phrase with one of these tags, then it chooses the most likely of the different concepts to which this phrase can map, with the choice being based upon a weighting scheme described later in this section. Only those concepts that are associated with the same section (i.e., subarea of physical chemistry) to which the article has been assigned are considered.

The IG generates a complete set of recommended indexes for an article by processing the article one noun phrase at a time. Each noun phrase is checked to determine if it is one of the phrases in the phrase mappings file. If it is, then the IG considers all of the concepts to which that phrase maps as potential indexes, eventually choosing the one considered to be the best candidate. Then it proceeds to the next noun phrase in the article, continuing in this fashion until it has generated all of the recommended indexes for that article.


Figure 3. Determination of verified indexes.

As shown in figure 2, the IG first generates a list of what we refer to as verified indexes. This process is shown in greater detail in figure 3. The IG compares a noun phrase from the parsed article with the phrase mappings, producing a list of potential indexes (i.e., concept headings on which to index). This initial set of indexes is then verified by comparing it with the synonyms file. The synonyms file provides more specific information that aids in the avoidance of the generation of irrelevant concepts. It contains the allowed modifiers (MDF) and limiters (LTR) for each concept heading (CTH) as well as some phrases that indicate a particular combination of CTH, MDF, and LTR. For example, the phrase “decomposition” maps to the concept headings: decomposition, dissociation, photolysis, proteins, radiolysis. The synonyms file indicates that “photolysis” should not be generated by the presence of the phrase “decomposition” unless at least one of these modifiers is present: laser-induced, light-induced, photochemistry, photo-, laser.
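The verification step can be sketched as a lookup of modifier conditions per (phrase, concept) pair. The rule data below encodes only the “decomposition”/“photolysis” example quoted above; the function and its simple context-word matching are our own simplified assumptions, not the synonyms file format.

```python
# Required-modifier rules: a candidate concept suggested by a phrase is kept
# only if no rule applies, or one of the required modifiers appears in the
# surrounding context. Only the rule from the text is encoded here.
REQUIRED_MODIFIERS = {
    ("decomposition", "photolysis"):
        {"laser-induced", "light-induced", "photochemistry", "photo-", "laser"},
}

def verify(phrase, candidates, context_words):
    """Filter candidate concepts for a phrase against the modifier rules."""
    verified = []
    for concept in candidates:
        required = REQUIRED_MODIFIERS.get((phrase, concept))
        if required is None or required & set(context_words):
            verified.append(concept)
    return verified
```

For example, with the context word “thermal”, “photolysis” is rejected; with “laser-induced” present, it survives verification.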

The next step is to determine the final recommended index entry including its modifier and limiter. This process is shown in greater detail in figure 4. The verified indexes are arranged into an ordered list by weight, with the weights based upon statistics produced by a modified version of a statistical program supplied by CAS. The statistics include the number of “good hits” and “false hits” for each of the phrases computed over a collection of 2,000 physical chemistry articles. A “good hit” indicates that a reference to a concept was correctly indicated by a phrase. A “false hit” indicates that a phrase incorrectly indicated the presence of a reference to a concept.

Figure 4. Determination of recommended index with modifier and limiter.

The modified statistical program computes a weighted sum of the good hits and false hits for each concept in each of the following sections of an article: the title, the abstract, the body (including the introduction), and the figures. A high weight is given to a concept whose mapping phrase appears in the title of the article. A smaller weight is assigned if the mapping phrase appears in the abstract or conclusion of the article. An even smaller weight is assigned if the phrase appears in the body of the article or in a figure. The program assigns as the final weight for a concept the ratio of the number of good hits to the total number of hits (good and false). The formulas used in the calculation of the final weight are:

GoodHit = GT*WT + GA*WA + GB*WB + GF*WF

FalseHit = FT*WT + FA*WA + FB*WB + FF*WF

Weight = GoodHit / (GoodHit + FalseHit)


Gx represents the number of good hits found by the statistical program in portion x of the article, where x may be T for title, A for abstract, B for body (including the introduction), or F for figure. Similarly, Fx represents the number of false hits in portion x of the article, and Wx represents the weight assigned to having a mapping phrase appear in portion x of the article. Currently, the weights are quite simple: WT is 4, WA is 3, WB is 2, and WF is 1.
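Using the weights just given, the weight formulas can be computed as follows. This is an illustrative sketch, not CAS's statistical program; the hit counts in the example are invented.

```python
# Portion weights from the text: title 4, abstract 3, body 2, figure 1.
WEIGHTS = {"T": 4, "A": 3, "B": 2, "F": 1}

def concept_weight(good, false):
    """Compute Weight = GoodHit / (GoodHit + FalseHit), where good/false
    map a portion code ('T', 'A', 'B', 'F') to its hit count."""
    good_hit = sum(WEIGHTS[x] * good.get(x, 0) for x in WEIGHTS)
    false_hit = sum(WEIGHTS[x] * false.get(x, 0) for x in WEIGHTS)
    total = good_hit + false_hit
    return good_hit / total if total else 0.0

# Invented counts: 1 good hit in the title, 2 in the abstract, 5 in the body;
# 3 false hits in the body and 2 in figures.
w = concept_weight(good={"T": 1, "A": 2, "B": 5}, false={"B": 3, "F": 2})
# GoodHit = 4 + 6 + 10 = 20; FalseHit = 6 + 2 = 8; Weight = 20/28
```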

Once the IG has generated an ordered list of all the concepts that have been indicated by at least one of the phrases in the phrase mappings file, it determines the final recommended index (including its limiter and modifier, if any). First, the IG chooses some particular number of the indexes with the highest weights to continue to consider as candidate indexes.

Currently, this particular number is 1. The IG then checks the concept hierarchy to determine if the parent concept should replace this concept as the one on which to index. If three or more sibling concepts in the hierarchy were among the initial candidates, then the IG recommends that the parent concept be chosen as an index. This reflects the CAS philosophy that the indexes should be at a more specific level in the concept hierarchy unless there is a strong indication that the more general concept is more appropriate as an index (Chemical Abstracts Service, 1994).
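The parent-replacement heuristic can be sketched as follows. The function implements the three-or-more-siblings rule described above; the hierarchy fragment used in the example is hypothetical (the text does not give the actual parents of these concepts).

```python
from collections import Counter

def generalize(candidates, parent):
    """If three or more sibling concepts (sharing a parent) appear among the
    candidates, replace them with their parent concept; otherwise keep the
    more specific concepts, per the CAS philosophy described in the text."""
    siblings = Counter(parent[c] for c in candidates if c in parent)
    out, emitted = [], set()
    for c in candidates:
        p = parent.get(c)
        if p is not None and siblings[p] >= 3:
            if p not in emitted:        # emit each parent only once
                emitted.add(p)
                out.append(p)
        else:
            out.append(c)
    return out

# Hypothetical hierarchy fragment for illustration:
parent = {"photolysis": "decomposition", "radiolysis": "decomposition",
          "thermolysis": "decomposition"}
generalize(["photolysis", "radiolysis", "thermolysis", "kinetics"], parent)
# three siblings collapse to their parent; "kinetics" is untouched
```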

After the recommended concept heading on which to index has finally been selected, the IG checks the synonyms file again to determine the modifier and limiter for this concept heading. Thus the final index entry that is generated currently consists of the concept heading along with the limiter and modifier slots.

The statistics being used by the IG in choosing the most likely concept on which to generate an index, given a particular mapping phrase, are the results of running the previously-described statistical program on 2,010 articles. The program must process the articles repeatedly, once for each concept. In the Perl version of the program, it took 2.5 hours on a Sun Sparcserver 4/690 (with 128 M memory and 8.0 G total disk space, running under SunOS 4.1.3) to generate the statistics for the concept heading “abstraction” using only 700 articles. Our modified version of the program, which is written in C++, requires almost 5.5 hours to generate the statistics for the same concept heading by processing all of the 2,010 articles. The amount of time required to generate the statistical information is quite large because each of the 208 concepts may have as many as 200 phrases that map to that concept, and each phrase may, in turn, have a number of variations. In order to generate the statistics for all of the 208 concept headings of interest to us, we have initiated multiple runs (with each run generating the statistics for a different concept) each day. Yet it still takes quite a while to complete all of the concept headings. The time required to generate an entire set of statistics has made the prospect of generating new statistics an unattractive one. This is a problem that we will be addressing in the future.

4.3. Focusing on key sections of an article

Initially, we designed the AIMS IG to consider only the title and the abstract of an article in generating indexes. Our idea was to begin with a minimum number of sections of an article since there is no need to process an entire article if our success rates are just as good for


key sections of an article. The scientists at CAS encouraged us to include the introduction, too, since they thought it quite likely that the introduction would be more thorough than the abstract in providing an overview of the content of the article. Our early test runs indicated that they were correct. Including both the abstract and the introduction produced more of the indexes that had been produced by a human document analyst (and which we considered to be “correct”) than considering only the abstract.

The next section that we considered as a possible key section of an article was the conclusions. This allowed us to recognize some correct indexes that had previously been omitted. We are now investigating the use of the entire body of an article (text only), with some mechanism for recognizing pieces of the article (whether entire sections or only parts of sections) that are most likely to be useful. We have conducted some experiments with different combinations of portions of an article to determine the impact of including or excluding various portions of articles. These are described and discussed in the next section of this paper.

In addition to the “correct” indexes, AIMS also generates additional indexes that are not in the set generated by a human document analyst. (Although there are cases where the extra indexes are not incorrect, but are actually quite reasonable indexes for the given article, we still count them as incorrect.) We would like to reduce the number of extra indexes being generated without also losing some of the correct indexes that we have identified. We noticed that some of the extra indexes were being generated by phrases that served only as incidental or historical references. For example, an article may mention a specific chemical substance simply as an illustration of a particular procedure even though that substance is not important to the article. Or there may be a number of references to work done by other researchers, or to prior work done by the authors of the article, that should not result in the generation of indexes. We are now investigating how we might recognize which portions of an article to ignore because they contain only incidental references. It may be that we can ignore an entire section of an article (e.g., Literature Review), or it may be that we will have to consider such references on a case-by-case basis.

5. Results of evaluation

We evaluate how successful AIMS is by generating indexes for articles previously indexed by the human document analysts. We consider the set of indexes generated by a human document analyst for a particular article to be the “correct” set of indexes even though it is very possible that there are other indexes that would be reasonable. For example, the CAS document analysts are instructed to keep the number of indexes that they generate to a minimum for a very pragmatic reason: to reduce the size of the printed indexes. A five-year collection of indexes takes up more than 30 feet of shelf space. As the emphasis continues to shift from hard-copy indexes to on-line indexes, this restriction will become less important, and we expect that AIMS would be graded as more successful as a result. But currently we have chosen to adhere to this criterion for correctness, although it is quite stringent.

In evaluating AIMS, we use the performance evaluation metrics commonly used in the evaluation of text understanding and message understanding systems (Lehnert and


Sundheim, 1991). Specifically, these metrics are:

recall = number of correct indexes generated / total number of correct indexes

precision = number of correct indexes generated / total number of indexes generated

overgeneration = number of incorrect indexes generated / total number of indexes generated = 1 − precision

That is, recall is the proportion of the correct indexes that are generated, precision is the proportion of the generated indexes that are correct, and overgeneration (the complement of precision) is the proportion of the generated indexes that are incorrect.
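These three metrics can be computed directly from the generated and analyst-assigned index sets. The sketch below uses invented index names for illustration:

```python
# Minimal sketch of the three evaluation metrics, treating the
# analyst-assigned indexes as the "correct" set. Index names are invented.

def evaluate(generated, correct):
    """Return (recall, precision, overgeneration) for two index sets."""
    generated, correct = set(generated), set(correct)
    hits = len(generated & correct)
    recall = hits / len(correct)
    precision = hits / len(generated)
    overgeneration = 1 - precision
    return recall, precision, overgeneration

r, p, o = evaluate(generated={"adsorption", "kinetics", "catalysis"},
                   correct={"adsorption", "kinetics"})
print(round(r, 3), round(p, 3), round(o, 3))  # 1.0 0.667 0.333
```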

We defined a set of five experiments on a total of 116 parsed articles. The articles ranged in length from 19 words (an article that consisted only of a correction to a previous article) to 9,112 words, with an average length of 3,401 words. Each of the five experiments represented the use of different sections of the article:

• Experiment 1: Abstract only
• Experiment 2: Abstract and introduction
• Experiment 3: Abstract and part of introduction
• Experiment 4: Abstract and conclusion
• Experiment 5: Abstract, part of introduction, and conclusion

All of the experiments included the title in the generation of indexes.

For Experiments 3 and 5, we manually selected those portions of the introduction that were obvious references to the contents of the article rather than incidental references. We wanted to determine the effect this would have on the recall, precision, and overgeneration rates before we automated the selection.

For this set of experiments, we accepted a concept as a candidate index if it was generated at least once, no matter in which section it was generated. As described earlier, each noun phrase that matches a phrase (or a variation of a phrase) in the phrase mappings file can generate one or more concepts as candidate indexes. We chose as the recommended index the candidate that had the highest weight and could be verified in the synonyms file.

The results of the five experiments are shown in Tables 1 and 2. The average rates, which are shown in Table 1, refer to the mean values of the rates of the individual articles included in that experiment. For example, the average precision rate for Experiment 1 is the mean of the precision rates for the 116 articles included in Experiment 1. The overall rates, which are shown in Table 2, refer to the computation of the recall, precision, and overgeneration rates for all of the articles included in that experiment. For example, the overall recall rate for Experiment 1 is the number of correct indexes generated for all 116 articles divided by the total number of correct indexes for all 116 articles. If the same index is generated for more than one article, then it is counted each time rather than being counted only once.
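The distinction between the average (per-article mean) and overall (pooled) rates can be made concrete with a small sketch; the per-article counts below are invented, not taken from our data:

```python
# Sketch of the average (per-article mean) vs. overall (pooled) recall rates.
# Each tuple is (correct indexes generated, total correct indexes) for one
# invented article.

articles = [(1, 2), (3, 3), (0, 1)]

# Average rate: the mean of the individual articles' recall rates.
average_recall = sum(g / t for g, t in articles) / len(articles)

# Overall rate: counts pooled across all articles before dividing.
overall_recall = sum(g for g, _ in articles) / sum(t for _, t in articles)

print(round(average_recall, 3), round(overall_recall, 3))  # 0.5 0.667
```

Note that the two rates generally differ because the overall rate gives more influence to articles with many indexes, while the average rate weights every article equally.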

In terms of both average and overall precision rates, the experiment that involved the use of only the title of an article and the abstract (Experiment 1) was more successful than the others. The lowest precision rates resulted from using the title, the abstract, and the introduction (Experiment 2). This may be a reflection of the fact that many authors include references to work by other researchers in the introduction of a paper, so the IG component was generating indexes from incidental references. The fact that manually


Table 1. Average rates for experiments.

Experiment    Precision    Recall    Overgeneration
1             66.45%       71.79%    33.55%
2             61.71%       85.50%    38.29%
3             64.42%       84.76%    35.58%
4             63.37%       87.36%    36.63%
5             62.17%       90.08%    37.83%

Table 2. Overall rates for experiments.

Experiment    Precision    Recall    Overgeneration
1             64.83%       63.31%    35.17%
2             57.11%       78.43%    42.89%
3             59.28%       78.29%    40.72%
4             61.95%       84.03%    38.05%
5             60.34%       88.24%    39.66%

selecting only a portion of the introduction (Experiments 3 and 5) resulted in higher precision rates than those found in Experiment 2 lends some support to this interpretation.

The low precision rates reflect a decision that we made early in the AIMS project. We preferred to have the system overgenerate rather than fail to identify all of the “correct” indexes. It seems more desirable to provide a CAS document analyst with an overly generous list of indexes from which he/she can prune the less useful indexes than to fail to suggest all of the useful indexes. We think that we have been successful in this since our recall rates are generally quite good (especially the average recall rate of about 90% and the overall recall rate of about 88% for Experiment 5). Also, we know that our precision rate suffers somewhat from the very narrow definition that we have used for a “correct” index. We are now working on ways to increase the precision rate without reducing the recall rate.

For example, we continue to refine the auxiliary information provided by CAS (such as the phrase mappings) to include information that was accidentally omitted or that represents information generally known by the document analysts and thus not represented explicitly in the auxiliary information. Also, we are working to determine methods for recognizing incidental references based upon context. We are also trying to improve our tagging techniques since a number of the errors made by the IG component are the direct result of tagging errors in the parsed structures produced by the NLP component. One possibility with which we are currently working is the incorporation of neural network algorithms with more traditional tagging algorithms.

6. AIMS graphical user interface

We decided that the interface for the AIMS system needed to be a simple windows-based graphical user interface (GUI) to make it as easy (and therefore fast) as possible for a user


to modify the list of indexes generated for an article (Kulkarni, 1995). At CAS’s request, we implemented the GUI using Tcl/Tk (Ousterhout, 1994). We designed the GUI using the Fusion object-oriented design methodology (Coleman et al., 1994). For the layout of the various windows, we followed the guidelines available from the University of Colorado in an electronic document called the Astrophysics Data System Service Development/Operations Guide. The particular section that we used is Section 2.3.2.5, GUI Development. The table of contents for the entire document may be found on the World Wide Web at:

http://adswww.colorado.edu/devguide/devguide.html

Section 2.3.2.5 is located at:

http://adswww.colorado.edu/devguide/part2/2.3/2.3.2/2.3.2.5/html

Although these guidelines were defined for interfaces developed using Motif, they were applicable to our interface because our widgets look quite similar to Motif widgets. Using these guidelines was also a request made by CAS.

We watched document analysts as they indexed articles and consulted with CAS staff at various phases of the development of the GUI to make sure that it would meet their needs. The GUI that we have developed for AIMS allows the user to display the indexes generated for a particular article, highlight the phrases that indicated the presence of a particular concept, and modify any of the index information (such as adding an additional limiter in the LTR slot). The interface also allows the user to display all of the candidate entries that were considered for the various slots of an index. A user may “paste” selected text from an article into a slot, thus minimizing the amount of typing that the user must do.

When a user selects a particular article to index, a frame such as the one shown in figure 5 appears on the screen. The user may select from the list of concept headings (CTH) on which indexes have been generated in order to see the entire index entry (homograph, modifier, etc.). Another frame on the screen shows the contents of the article. The user may select an index that is on the list and delete it. The user may also choose to add a new index to the list. He/she may either type the entries for the various slots of the index or select appropriate text from the article and use the “paste” button to paste it into an index slot. If the user clicks on a particular slot, then the phrase in the article that caused that particular slot entry to be generated is highlighted.

The “next” and “previous” buttons allow a user to browse through the list of indexes by going to the next or previous index in the list, respectively. Pressing the “list” button causes all of the possible values for the selected slot to be displayed in a listbox window. For example, selecting the CTH slot and pressing the “list” button causes all of the concepts in the set of candidate concepts from which that concept was chosen to be displayed. The user may select entries from this window and paste them into appropriate slots in the index.

Pressing the “done” button causes the changes to the index entries to be saved. Pressing the “cancel” button prevents the changes from being saved and gives the user the option of choosing another article. Pressing the “help” button lets the user choose a particular type of widget in the frame and displays the help information for that widget.


Figure 5. Frame for index entries.

The AIMS GUI was tested by its developer, Sonal Kulkarni, for both completeness and consistency. More detail on the design, implementation, and testing of the GUI may be found in (Kulkarni, 1995).

7. Future work

We hope to be able to reduce the overgeneration by making greater use of the context slots (such as limiters and modifiers) in an index entry. We defined some focus mechanisms in the early KUDZU system (Cordova, 1992) that we will try to adapt for use in the index generator.

Currently we choose to index on the parent concept in the concept hierarchy if three or more of its children have been indicated as potential index concepts. This simple heuristic is applied no matter what portion of the hierarchy the concept belongs to, which is not an accurate reflection of how a document analyst would handle such a situation. We want to incorporate a more sophisticated heuristic into the determination of when to index on a parent concept rather than its children, although it is not clear at this time just how much improvement this might provide.

We plan to experiment with various ways in which the set of candidate indexes is generated. In the experiments reported here, we accepted a concept as a candidate index if it was generated at least once by some phrase in the article. We plan to try various approaches


such as accepting a concept generated in one particular section of an article only if it is also generated in some other section. This would be rather like the parliamentary procedure requiring a second to a motion before allowing the motion to be considered.
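A sketch of this “seconding” rule, assuming the IG records which sections each concept was generated in; the section names and hit data are illustrative:

```python
# Sketch of the "second to a motion" rule: accept a concept only if phrases
# map to it in at least two distinct sections of the article. The concept
# names and section data below are invented.

def seconded_concepts(hits_by_section):
    """hits_by_section: dict concept -> set of sections where it was generated."""
    return sorted(c for c, sections in hits_by_section.items()
                  if len(sections) >= 2)

print(seconded_concepts({"adsorption": {"title", "abstract"},
                         "catalysis": {"body"}}))
# prints ['adsorption']
```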

We also plan to experiment with more sophisticated ways of choosing the recommended index from the set of candidate indexes. For example, we may define some threshold value for the weight, not accepting any concept with a weight below that threshold even if it has the highest weight in its set of candidate indexes. We also plan to experiment with the possibility of recommending more than one index from a given set of candidate indexes if several concepts in the set have very high weights, with little distinction among the top few.
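One way the proposed threshold and near-tie rules might look in combination; the threshold and margin values here are invented for illustration, not taken from AIMS:

```python
# Hedged sketch of the proposed selection rules: discard candidates below a
# weight threshold, and recommend several indexes when the top weights are
# nearly tied. Threshold and margin values are invented.

def recommend(candidates, threshold=0.5, margin=0.05):
    """candidates: dict concept -> weight. Returns recommended concepts."""
    viable = {c: w for c, w in candidates.items() if w >= threshold}
    if not viable:
        return []
    best = max(viable.values())
    # Recommend every viable concept within `margin` of the top weight.
    return sorted(c for c, w in viable.items() if best - w <= margin)

print(recommend({"adsorption": 0.82, "kinetics": 0.80, "diffusion": 0.55}))
# prints ['adsorption', 'kinetics']
```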

As previously discussed, we are aware that some of our overgeneration is due to the presence of mapping phrases in incidental references or historical references in an article. We intend to investigate ways in which we can recognize and ignore such incidental references.

As we process more articles, we need to be able to update the statistics because they are crucial in the calculation of the weights, and thus the selection of concepts on which to index. We would like to be able to modify the weights dynamically by using some machine learning techniques. For example, as document analysts use AIMS as an aid to the assignment of indexes to an article, they may choose to modify the indexes recommended by AIMS by adding additional indexes or removing some of the indexes. If AIMS could appropriately modify the statistical information (including the various weights that are used), then it could improve its accuracy through experience.

We also want to investigate the possibility of using machine learning techniques to update the phrase mappings file, and perhaps other files used as resources, based upon modifications that a document analyst makes to the list of recommended indexes. If AIMS could get feedback from the analysts as to why they have chosen to make changes in the list of recommended indexes (such as asking them to highlight the phrase that led them to suggest an index not recommended by AIMS), then it could update the phrase mappings file, the synonyms file, and/or the file of acronyms and abbreviations.

8. Summary

In this paper, we have described AIMS (Assisted Indexing at Mississippi State), an automated system intended to aid human document analysts in the generation of a set of indexes for articles that have been published in a physical chemistry journal. We have provided a description of the techniques used by the component of AIMS known as the index generator (IG). We have also presented the results of a set of five experiments in which we evaluated AIMS in terms of precision, recall, and overgeneration. AIMS has been successful in generating more than 80% of the concepts that were generated by a human document analyst. We are currently attempting to increase the precision (and thus reduce the overgeneration) without reducing the recall rate. We have described some of the techniques that we are currently investigating or plan to investigate in the near future in order to accomplish this. We have also provided a general description of the graphical user interface that we have developed for AIMS.

Some of the techniques used by the natural language processing (NLP) component and the IG are adaptations of techniques developed for an earlier research prototype. Our approach


has much in common with the work of researchers in the areas of document understanding and message understanding.

Acknowledgments

We are indebted to those members of this research group who have developed the natural language processing component of the system. Lois Boggess, who is co-principal investigator on this project, has primary responsibility for the NLP component of AIMS. Two of her graduate students, Rajeev Agarwal and Lynellen Smith, have made major contributions to this portion of the work. We are also grateful to the American Chemical Society, and Chemical Abstracts Service in particular. They have provided us with over 2,000 indexed articles from the Journal of Physical Chemistry, their concept hierarchy, sets of useful phrases that map to the concepts, and other information used by their document analysts to select which of several candidate indexes to choose. They have also been very generous in sharing their expertise with us throughout the development of the AIMS system.

This work is supported in part by grant number IRI-9314963 from the Interactive Systems Program of the National Science Foundation and the Advanced Research Projects Agency (NSF/ARPA Joint Initiative on Human Language Technology).

References

Agarwal, Rajeev (1995). Semantic feature extraction from technical texts with limited human intervention. Ph.D. Dissertation, Mississippi State University.

Agarwal, Rajeev and Boggess, Lois (1992). A simple but useful approach to conjunct identification. In Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics, Newark, DE.

Ai, C.S., Blower, P.E., Jr., and Ledwith, R.H. (1990). Extraction of chemical information from primary journal text. Journal of Chemical Information and Computer Sciences, 30, 163–169.

Boggess, Lois, Agarwal, Rajeev, and Davis, Ronald (1991). Disambiguation of prepositional phrases in automatically labelled technical text. In Proceedings of the Ninth National Conference on Artificial Intelligence, Anaheim, CA.

Boggess, Lois and Hodges, Julia (1994). A Knowledge-Based Approach to Indexing Scientific Text. In Workshop Notes: ARPA Workshop on Human Language Technology, Princeton, NJ: Merrill Lynch Conference Center.

Boggess, Lois, Hodges, Julia, and Cordova, Jose (1995). Automated Knowledge Derivation: Domain-Independent Techniques for Domain-Restricted Text Sources. International Journal of Intelligent Systems, 10(10), 871–893.

Brill, Eric (1992). A simple rule-based part of speech tagger. In Proceedings of the Speech and Natural Language Workshop (pp. 112–116).

Chemical Abstracts Service (1994). Selection of Index Entries. Editorial Analysis Manual, Vol. II, Ch. 2, Columbus, Ohio: Chemical Abstracts Service.

Coleman, Derek, Arnold, Patrick, Bodoff, Stephanie, Dollin, Chris, Gilchrist, Helena, Hayes, Fiona, and Jeremaes, Paula (1994). Object-Oriented Development: The Fusion Method, Englewood Cliffs, NJ: Prentice Hall.

Cordova, Jose L. (1992). A Domain-Independent Approach to Knowledge Acquisition From Natural Language Text. Ph.D. Dissertation, Mississippi State University.

Cordova, Jose L. and Hodges, Julia E. (1992a). The automated building of a knowledge base through natural language text analysis. Paper presented at Southeast Cognitive Science Conference, Atlanta, GA.

Cordova, Jose L. and Hodges, Julia E. (1992b). The automatic initialization of an object-oriented knowledge base. In Proceedings of the 30th Annual Southeast Conference of the ACM (pp. 401–404).

Fagin, Joel L. (1987). Experiments in Automatic Phrase Indexing for Document Retrieval: A Comparison of Syntactic and Non-Syntactic Methods. Ph.D. Dissertation, Cornell University.


Fuhr, Norbert and Buckley, Chris (1991). A probabilistic learning approach for document indexing. ACM Transactions on Information Systems, 9(3), 223–248.

Ginsberg, Allen (1993). A Unified Approach to Automatic Indexing and Information Retrieval. IEEE Expert, 8(5), 46–56.

Ginsberg, Matt (1993). Essentials of Artificial Intelligence, San Mateo, CA: Morgan Kaufmann Publishers, Inc.

Hirschman, Lynette (1986). Discovering Sublanguage Structures. In R. Grishman and R. Kittredge (Eds.), Analyzing Language in Restricted Domains: Sublanguage Description and Processing. Hillsdale, NJ: Lawrence Erlbaum Associates.

Hodges, Julia and Boggess, Lois (1992). Extracting information from technical text. Paper presented at Southeast Cognitive Science Conference, Atlanta, GA.

Hodges, Julia E., Shiyun Yie, Ray Reighart, and Boggess, Lois (1996). An Automated System That Assists in the Generation of Document Indexes. Submitted for publication.

Hodges, Julia E. and Cordova, Jose L. (1993). Automatically Building a Knowledge Base Through Natural Language Text Analysis. International Journal of Intelligent Systems, 8(9), 921–938.

Kulkarni, Sonal (1995). Indexer: A Tool to Access Index Information from an Object-Oriented Knowledge Base. Master's project, Mississippi State University.

Kwok, K.L. (1986). An Interpretation of Index Term Weighting Schemes Based on Document Components. In Proceedings of the 1986 ACM Conference on Research and Development in Information Retrieval (pp. 275–283).

Kwok, K.L. and Grunfeld, L. (1994). Learning from Relevant Documents in Large Scale Routing Retrieval. In Workshop Notes: ARPA Workshop on Human Language Technology, Princeton, NJ: Merrill Lynch Conference Center.

Ledwith, Robert H. (1988). Development of a Large, Concept-Oriented Database for Information Retrieval. In Proceedings of the ACM SIGIR 11th International Conference on Research & Development in Information Retrieval (pp. 651–661).

Lehnert, W. and Sundheim, B. (1991). A Performance Evaluation of Text-Analysis Technologies. AI Magazine, 12(3), 81–94.

Maron, M.E. and Kuhns, J.L. (1960). On Relevance, Probabilistic Indexing, and Information Retrieval. Journal of the ACM, 7, 216–244.

Mauldin, Michael L. (1991). Retrieval Performance in FERRET: A Conceptual Information Retrieval System. Paper presented at the 14th International Conference on Research and Development in Information Retrieval. Chicago: ACM SIGIR.

Ousterhout, John K. (1994). Tcl and the Tk Toolkit, Reading, MA: Addison-Wesley Publishing Company.

Sager, Naomi (1987). Computer Processing of Narrative Information. In Medical Language Processing: Computer Management of Narrative Data, Reading, MA: Addison-Wesley Publishing Co.

Salton, Gerard and Buckley, Chris (1990). A Note on Term Weighting and Text Matching. Technical Report TR90-1166, Cornell University.

Salton, Gerard and Singhal, Amit (1995). Selective Text Retrieval. Technical Report TR95-1549, Cornell University.

Soderland, Stephen and Lehnert, Wendy (1994). Wrap-Up: A Trainable Discourse Module for Information Extraction. Journal of Artificial Intelligence Research, 2, 131–158.

Zamora, Elena M. and Blower, Paul E., Jr. (1984). Extraction of Chemical Reaction Information from Primary Journal Text Using Computational Linguistic Techniques. J. Chem. Inf. Comput. Sci., 24, 176–181.