
User-Sensitive Summarization

Thesis Proposal

Noemie Elhadad

[email protected]

Department of Computer Science
Columbia University

New York, NY 10027


Contents

1 The Problem
  1.1 Problem Description
  1.2 Research Challenges and Contributions
  1.3 Structure of the Proposal

2 Related Work

3 The Framework
  3.1 PERSIVAL
  3.2 TAS: A User-sensitive Summarizer
    3.2.1 Input/Output Characteristics
    3.2.2 Architecture

4 Tailoring the Content for Individual Users
  4.1 Content Identification
  4.2 Content Filtering

5 Organizing the Summary Content
  5.1 Merging
  5.2 Ordering

6 Tailoring Wording for Classes of Users
  6.1 Sentence Generation
    6.1.1 Sentence Alignment
    6.1.2 Learning of Rewriting Rules
  6.2 Lexical Choice for Medical Concepts
    6.2.1 Choosing a Verbalization
    6.2.2 Acquiring the Term Verbalizations
    6.2.3 Augmenting the Dictionary

7 Evaluation
  7.1 Intrinsic Evaluations
    7.1.1 Content Selection
    7.1.2 Content Organization
    7.1.3 Realization
  7.2 Extrinsic Evaluation
    7.2.1 Plan for Physicians
    7.2.2 Plan for Lay People

8 Status and Thesis Timeline

9 Conclusion
  9.1 Contributions
  9.2 Limitations

A Input Preprocessing
B TAS Input/Output Example
C Definitions Example
D Concept Matching
E Technical/Lay Text Example
F Technical/Lay Sentences Examples


Chapter 1

The Problem

1.1 Problem Description

Text summarization has emerged as an increasingly established field over the course of the past ten years. We may soon reach a stage where researchers will be able to design, and provide everyday users with, robust text summarization systems. Users of text summarization are many and range from Internet surfers lacking the time to locate and digest all the latest news available on the web1 to scientists unable to keep pace with the burgeoning number of technical publications who must, nonetheless, be familiar with the latest findings in their fields.

1 Examples of applications targeted at such users include NewsBlaster (http://newsblaster.cs.columbia.edu) or, more recently, the similar NewsInEssence (http://www.newsinessence.com).

Given texts to summarize, there are no a priori criteria for determining relevance for the summary. When humans summarize texts, they identify relevant information that they think will be of interest to the readers of the summary. Summarization is not only a function of the input documents but also of the reader's mental state: who the reader is, what his knowledge before reading the summary consists of, and why he wants to know about the input texts. This fact has long been acknowledged by both the psycho-linguistic and the computational-linguistic communities (Kintsch, 1988; Edwards and Smith, 1996; Gerrig, Kuczmarski, and Brennan, 1999; Sparck-Jones, 1999). However, both communities agree that trying to model the reader's mental state is far too complicated, if not entirely impossible. Given this dilemma, most of the computational linguistic research in summarization has assumed that the "reader variable" is a constant and has focused on defining a general notion of salience, valid for all readers.

In this thesis, I investigate strategies to take user characteristics into account in the summarization process. Acquiring a user model is by itself a subject of research. I do not focus on ways to acquire a user model; instead, I assume that an existing user model is available in my framework. Rather, my focus is on the challenges entailed in incorporating knowledge about the user into summarization strategies and providing the user with a text relevant to his needs.

Two types of user tailoring are examined in this thesis: individualized, i.e., the specific facts in which the reader is interested, and class-based, i.e., the degree of expertise of the reader. My research framework consists of a digital library that provides tailored access to medical literature for both physicians and patients. When treating a specific patient, physicians will want to keep abreast of the latest findings that pertain to their patient. Likewise, patients may want to access the latest findings that are relevant to their medical situation but may be hindered by the jargon commonly used in technical medical texts. My summarizer attempts to provide both types of users with tailored syntheses of the latest findings in clinical studies. Such tailoring is accomplished at the individual level by taking advantage of the existing patient record in the digital library. The summarizer also adapts the language in which the summary is generated by using class-based information, i.e., whether the user is a physician or a patient.

1.2 Research Challenges and Contributions

Individual-based User Modeling. There are several challenges in incorporating individual-based modeling. The first one consists of deciding which stages of the summarization process should be affected by the individual characteristics of the user. The second challenge is to determine the degree of abstraction and inference needed to achieve satisfactory modeling. Given input texts to summarize and a patient record in textual format, one must choose a representation of the information that allows for efficient and accurate matching of information.

One contribution of my thesis work is to show how individual characteristics can determine what content should be selected for summarization. While this has been investigated in information retrieval and traditional semantic-to-text generation, incorporating the user in the selection of content from textual input is a novel approach in summarization. Besides this improvement at the functional level, my work contributes to the field of summarization by providing an adequate representation of the information conveyed in the input texts and the patient record. Using information extraction techniques and relying on an existing ontology, the summarizer operates over a data structure that lies between full semantic analysis and agnostic extracted text. This representation enables the use of simple matching strategies between the user model and the input texts, as well as content organization strategies to produce coherent summaries.

Class-based User Modeling. Class-based modeling can be viewed as a text simplification process. Given a summary originally produced for expert users, the goal is to generate a simplified version readable by lay users. While most of the content is already selected in the original summary, not all the pieces of information should be included in the simplified version. Some details might be too technical, for instance, and including them in a simplified version would only confuse the reader. On the other hand, it might be necessary to introduce additional content, such as background knowledge, to enhance the reader's comprehension. Other challenges inherent to text simplification impact the content organization and its verbalization. Given a text written for one audience, it might be more efficient to reorder the information it contains when adapting it to the needs of a different audience. Moreover, wording is drastically different between a complex and a simplified text, whether one looks at the syntactic or lexical level. All of these decisions (which information to drop, which to add, and how to adapt the discourse structure and the wording) are functions of the user's expertise level.

I propose to approach the simplification process as applying a set of rewriting rules which affect both the technical lexical items and the technical sentences as wholes. I plan to learn the rewriting rules by relying on a comparable corpus of technical texts and their simplified versions targeted at lay users. Lexical rules will include term simplification and definition insertion, while sentence rules will consist of paraphrasing the input sentence into a lay equivalent. As a direct result of the text simplification, one contribution of my research will be to collect instances of technical/lay sentence pairs automatically from the comparable corpus. More generally, my summarizer will contribute to the field by being able to generate summaries at the appropriate level of expertise of the reader.

1.3 Structure of the Proposal

After reviewing related work, I introduce the framework for my research, the digital library PERSIVAL and the text summarizer TAS (Technical Article Summarizer). I then describe the main steps of TAS: content selection (chapter 4), content organization (chapter 5), and realization (chapter 6). In chapter 7, I focus on evaluation plans. Finally, in chapter 8, I report on my progress so far and suggest a timeline for the proposed work.


Chapter 2

Related Work

Related work falls into two broad categories: (1) user modeling with respect to text generation, and (2) applications such as text summarization and text simplification with respect to the recent advances in paraphrasing.

One of the first generation systems to take user characteristics into account was Pauline (Hovy, 1988). Given the same set of facts, different descriptions of one event could be generated depending on the user's viewpoint. Both the content and the wording were affected by this implicit user model. Using the tutorial system TAILOR and an explicit user model as a framework, Paris (1993) argued that, when generating explanations to users with different levels of expertise, most of the stages in the generation pipeline are affected, especially the content selection and the content organization phases. Elhadad (1993) focused on how to make lexical choice for evaluative expressions flexible enough to satisfy a speaker's argumentative intent, showing that pragmatic situations have a direct impact on the choice of words and their syntactic realizations.

One domain that benefits clearly from tailoring the output to the user's characteristics is the medical domain. Many generation systems have been developed, typically relying on existing patient information to encode both class-based and individual characteristics of the users (whether patients or physicians) (Carenini, Mittal, and Moore, 1994; Osman et al., 1994; Binsted, Cawsey, and Jones, 1995; Cawsey, Jones, and Pearson, 2000; Lennox et al., 2001). In work to date, the user model affects only the content planning stage. Several domain-dependent content plans are pre-written; depending on the user characteristics, a specific plan gets instantiated.

Most text summarizers to date do not contain user characteristics as parameters of their algorithms. Multi-document summarizers present their output either by extracting original sentences (Goldstein et al., 1999; Lin and Hovy, 2002) or by regenerating the important sentences into a more condensed and fluent text (Radev and McKeown, 1998; Barzilay, McKeown, and Elhadad, 1999). My summarizer follows the latter strategy by reusing phrases from the input texts.


I am aware of only one summarization system, aside from that described herein, that tailors its output to the user. SumItBMT (Becher, Endres-Niggemeyer, and Fichtner, 2002) is a multi-document summarizer of medical articles for physicians. In contrast to my framework, which employs a pre-existing user model that persists across sessions, the SumItBMT user model is acquired at the beginning of each session by asking the physician to fill out a form. The form is used to select relevant pieces of information in the input articles. The user is then presented with a succession of extracted text fragments from the input articles. SumItBMT does not attempt to assemble the pieces of information into a single summary. In addition, the summaries are targeted to physicians only, and are not adaptable to the comprehension level of non-physicians.

One characteristic of the proposed work is its ability to generate the same content for different classes of users (physicians or patients). Learning rules to rephrase technical language into a lay version relates to text-to-text generation, especially text simplification, and paraphrasing. While many agree that automatic text simplification is a valuable application, few researchers have actually proposed solutions to this problem. In their pioneering work, Chandrasekar and Bangalore (1997) propose a method to induce simplification rules automatically at the syntactic level. They rely on a manually built corpus of sentences paired with corresponding manually simplified sentences. Carroll et al. (1999) simplify newspaper texts for aphasic readers using a set of manually defined syntactic and lexical rules. In contrast, I plan to learn rewriting rules from pre-existing text examples in an unsupervised manner. In addition, I plan to investigate rules at different levels of granularity, from sentences to lexemes.

Text simplification can benefit from research in paraphrasing, since one way to simplify a text is to use paraphrases to transform a technical text into a simpler version. However, while there has recently been more work on paraphrase acquisition (Inui and Hermjakob, 2003), the generation of paraphrases is as yet scarcely investigated. Barzilay and Lee (2003) proposed an unsupervised method for learning sentence-level paraphrases. In their framework, paraphrases are not classifiable into a syntactic/lexical paradigm; sets of sentences that convey similar information are stored together in a lattice.


Chapter 3

The Framework

The application I present in this thesis is TAS (Technical Article Summarizer), a user-sensitive multi-document summarization system. It is part of a larger project called PERSIVAL.

3.1 PERSIVAL

PERSIVAL is designed to provide tailored access to a distributed digital library of multimedia medical literature. It is an interdisciplinary project that involves researchers in computer science, electrical engineering, medical informatics, and library and information science.

A key feature of PERSIVAL is the ability to present information relevant to the user's query given the context of patient information. PERSIVAL links to the large online patient record database available at the New York Presbyterian Hospital, which serves as part of the user model (Hripcsak, Cimino, and Sengupta, 1999). The interaction with PERSIVAL begins with access to a specific patient record. After viewing the patient record, the user may decide to access the online medical literature and pose a question in natural language. The Query Formulation module helps the user to formulate a good question related to the patient information (Mendonca et al., 2001) and translates the natural language question into a query. The query is then sent to a search engine, which allows access to distributed online textual resources (Green, Ipeirotis, and Gravano, 2001), as well as a library of digital echocardiograms. The results of the text search are re-ranked by matching the returned articles against the patient record, scoring those articles which discuss results related to the patient's case as more relevant (Teufel et al., 2001). A text summarizer (Kan, 2003; Elhadad and McKeown, 2001) and a video summarizer (Ebadollahi et al., 2001) each generate a summary of the relevant results. The resulting multimedia summary and search results are presented using a sophisticated layout component (Lok and Feiner, 2002).

Depending on both the user's expertise and the type of question asked by the user, PERSIVAL invokes different summarizers. For lay people who want to get overviews about specific diseases, Centrifuser (Kan, 2003) provides indicative summaries of consumer health articles. For users who want to know about the findings in the technical literature, TAS produces briefings of clinical studies that are relevant to the patient in question. In addition, TAS adapts its language to the level of expertise of the user. When the user is a physician, the language used in the clinical studies is appropriate; when the user is a lay person (e.g., the patient himself), TAS uses simpler language to convey the information in the summary. In the next section, I describe TAS, its input/output and its architecture, in more detail.

3.2 TAS: A User-sensitive Summarizer

3.2.1 Input/Output Characteristics

The query. The types of questions for which TAS is called are open-ended (for instance, "What is the best treatment for atrial fibrillation given this patient?"). Input articles do not explicitly answer such questions; rather, a set of findings is presented to the reader. Similarly, TAS does not aim to provide an explicit answer to the user's question. It behaves as an access point to the results of the study or studies that are relevant to the patient, and lets the user infer his own answer. When organizing the summary content, the terms used in the query are taken into account to increase or decrease the importance of individual pieces of information.

The articles. Input articles come in HTML format. In a preprocessing stage, each input article is transformed into an XML file, in which the title, authors and sections are identified. In addition, the words are tagged with part-of-speech information. Using the comprehensive medical ontology UMLS (National Library of Medicine, 1995; McCray and Bodenreider, 2002), medical terms are identified and tagged with their unique UMLS concept identifier, or CUI. For instance, the phrases "coronary artery disease," "coronary heart disease," and the acronym "CAD" are all encoded in UMLS under the same CUI, C0010068, along with 36 other spellings/terminologies for this specific concept. In addition, some terms have associated values; for instance, in the phrase "diastolic blood pressure of 90 mm Hg," "90 mm Hg" is a value for the term "diastolic blood pressure." Both quantitative and qualitative values are identified.
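As a rough illustration of what this preprocessing yields, the sketch below shows one possible in-memory representation of a tagged term. The record layout is my own assumption; C0010068 is the identifier cited above, while the blood-pressure CUI is a placeholder.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TaggedTerm:
    """One medical term identified during preprocessing (hypothetical record layout)."""
    surface: str           # the phrase as it appears in the article
    cui: str               # UMLS concept unique identifier
    value: Optional[str]   # associated quantitative or qualitative value, if any

terms = [
    TaggedTerm("coronary artery disease", "C0010068", None),
    TaggedTerm("CAD", "C0010068", None),                              # same concept, different spelling
    TaggedTerm("diastolic blood pressure", "Cxxxxxxx", "90 mm Hg"),   # placeholder CUI
]
```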


The set of documents to be summarized is the result of a search, given the user query, on the digital library restricted to medical journals. However, there are many different types of publications (letters to the editor, case reports, reviews, and clinical studies). Based on the results of initial user studies conducted to gather TAS specifications, TAS is restricted to summarizing clinical studies. To ensure that the documents fed to TAS will only be clinical studies, articles that do not fall into this category are filtered out automatically using a classifier. Once the clinical studies are identified, their main clinical task (i.e., prognosis, treatment, or diagnosis) is identified (see Appendix A for details on the two classifications).

The user model. The user model contains two types of data: who the user is (physician vs. lay user) and patient information. When the user is a physician, this information refers to a patient under care, while for a lay user, it will most likely point to the user's own patient information. This information is extracted from the pre-existing patient record. The patient record, in its raw form, contains a large amount of data collected over time in many reports. Some are in tabular form (e.g., laboratory tests), while others (e.g., discharge summaries) make use of non-structured text. The user model contains a list of medical terms present in the most recent patient reports, along with their associated values, if any.
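A minimal sketch of such a user model is given below; the class layout and field names are assumptions of mine, and the CUIs other than C0010068 are illustrative placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class UserModel:
    """Hypothetical sketch of the two kinds of data the user model holds."""
    user_class: str                                        # "physician" or "lay"
    patient_terms: Dict[str, Optional[str]] = field(default_factory=dict)
    # maps a UMLS CUI found in the most recent patient reports to its value, if any

model = UserModel(
    user_class="physician",
    patient_terms={
        "C0010068": None,            # coronary artery disease
        "Cxxxxxxx": "160/95 mm Hg",  # blood pressure reading (placeholder CUI)
    },
)
```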

Output Characteristics. The summaries produced by TAS are briefings presenting results reported in clinical studies (as opposed to the patient group descriptions, methods, or discussion of the studies). Figure 3.1 shows an extract of a summary output currently generated by TAS for a physician treating a specific patient (the full summary is available in Appendix B). The results included in the summaries are those that directly pertain to the patient information stored in the user model. In this example, the patient has coronary artery disease, hypertension, diabetes, and a history of smoking. The summary does not contain repetitive information (repetitions are identified and fused) and signals to the reader any possible contradictory results found in the input articles.

In addition to pertaining to patient information, the wording of the TAS summaries will be a function of the type of reader. Figure 3.2 shows a target simplification of the sentences presented in Figure 3.1. The first sentence is simpler than in the equivalent summary generated for physicians. One may also notice that medical terms are either simplified (diabetes mellitus is now referred to as diabetes) or supplemented with definitions (see atrial fibrillation or left ventricular ejection fraction).


Multivariate analysis identified coronary artery disease to predict atrial fibrillation [6,7]. Left ventricular ejection fraction, hypertension, diabetes mellitus, smoking were not found to predict atrial fibrillation [1].

Figure 3.1: Extract from a summary generated for a physician treating a patient with atrial fibrillation.

Researchers report that patients with coronary artery disease are more likely to have atrial fibrillation – an irregular heart rhythm that increases the risk of stroke [6,7]. Left ventricular ejection fraction, which is a measure of the efficiency with which blood is pumped out of the heart, high blood pressure, diabetes, and smoking did not predict atrial fibrillation [1].

Figure 3.2: Extract from a target summary generated for a lay user.

3.2.2 Architecture

Merging& Ordering

summ

ary

ContentIdentification

ContentFiltering

articles

patient

record

templates

GenerationSurface

Content

ContentOrganization

Selection

Figure 3.3: TAS architecture

TAS follows a pipeline architecture, shown in Figure 3.3. Starting at the content selection stage, a set of templates is instantiated for each input article (Content Identification). The templates that do not match the patient information stored in the user model are filtered out (Content Filtering). During the content organization stage, the relevant templates are clustered into semantically related units and then ordered (Merging and Ordering). Finally, the internal structure is realized into an English text using phrasal generation (Surface Generation).


Chapter 4

Tailoring the Content for Individual Users

In this chapter I describe the content selection stage of TAS. It is fully implemented. The main goal of TAS is to present relevant information to the user. User studies conducted with physicians in the initial phase of the project suggested that a piece of information in a particular clinical study is relevant if (1) it conveys a result and (2) the result pertains to the patient under care. The first condition describes the type of information the summaries should contain, while the second condition indicates the criteria for relevance. The two conditions are orthogonal, and a natural strategy here is to divide the task of selecting summary content into two successive steps: first identify and extract content units, or Results, then filter them to keep only the pertinent units of information as part of the summary. Only the input articles need to be accounted for in the first step; user modeling is handled exclusively in step two.

4.1 Content Identification

The content identification step is best described as an information extraction problem. Figure 4.1 shows examples of sentences that convey results. A sentence is considered a content unit, or Result, if it reports parameters (such as a disease, therapy, or patient characteristic) related to a finding, or outcome. Formally, a Result is represented as the triplet (Parameter(s), Relation, Finding(s)). A manual analysis of a corpus of clinical studies identified six types of relations between parameters and findings: association, prediction, risk, absence of association, absence of prediction, and absence of risk. The relations were reviewed and approved by the physicians on the PERSIVAL team. The relations are not independent of one another; for instance, the prediction relation assumes an association. However, they reflect the language used by physicians when reporting results. Preserving this piece of information all the way to the generation step allows for the production of language that best matches the input text.
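To make the triplet representation concrete, here is a minimal sketch of a Result content unit. The field names and the spelling of the relation labels are my own assumptions; the six relation types and the example CUIs come from this chapter.

```python
from dataclasses import dataclass
from typing import List

# Labels for the six relation types described above (spellings are assumptions).
RELATIONS = {"association", "prediction", "risk",
             "no_association", "no_prediction", "no_risk"}

@dataclass
class Result:
    """A content unit: (Parameter(s), Relation, Finding(s))."""
    parameters: List[str]   # UMLS CUIs of the parameters
    relation: str           # one of RELATIONS
    findings: List[str]     # UMLS CUIs of the findings

r = Result(parameters=["C0520887"], relation="prediction", findings=["C0011065"])
assert r.relation in RELATIONS
```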


Multivariable analysis identified diabetes, estrogen therapy (adjusted risk ratio 0.38, 95% confidence interval 0.19 to 0.79) and left ventricular ejection fraction < 40% as independent correlates of cardiovascular death or myocardial infarction during follow-up.

Both smoking status and insulin sensitivity were independently related to both insulin release and lipid levels.

New acute myocardial infarction or death was predicted by ST-segment depression (OR 2.00, 95% CI 1.20 to 3.40; P = .008), prior angina (OR 2.70, 95% CI 1.34 to 5.57; P = .001), and age > 65 years (OR 1.64, 95% CI 1.00 to 2.70; P = .01).

Significant multivariable predictors of later atrial fibrillation included advanced age, higher peak creatine kinase levels, worse Killip class and increased heart rate.

Figure 4.1: Result sentence examples. The parameters are in italics, and the findings are in bold.

Since clinical studies follow strict stylistic conventions, it is known in advance that the Result sentences will be reported in the "Abstract" and the "Results" sections. However, not all the sentences in these sections report Results; out of an average of 20 sentences, approximately a third are actual Results. To gather these, the sentences are first parsed using a shallow syntactic parser to identify noun and verb phrases. Because existing state-of-the-art parsers are trained on newspaper texts, their accuracy drops significantly when used on technical medical texts. To overcome this problem, the shallow parser CASS (Abney, 1996) was customized to the style of clinical studies. Templates are then instantiated using a set of extraction patterns. Forty-two patterns were manually written by analyzing a small corpus of articles.1 Examples of patterns are shown in Figure 4.2. They rely on shallow syntactic information. Because both the parameters and the findings can be verbalized by noun phrases with the same semantic types (any medical term), the patterns are global: they fill all the slots in the template at once. Figure 4.3 shows an example of an instantiated template.

1 While there has been work done on learning extraction patterns (Riloff, 1993; Riloff, 1996), it is not clear how to adapt these techniques easily to the present task. Such techniques commonly assume that there is a one-to-one mapping between the possible semantic tags of the lexicon and the slots of the templates to instantiate. In our case, this assumption does not hold. Given this difficulty, the learning of patterns in the medical domain is a challenging task which I do not address in this thesis.


analysis <VX HEAD=’identified’>.*</VX> (<NG>.*</NG>) as (<NP HEAD=’.*’>.* </NP>) of (<NG>.*</NG>)

(<NG>.*</NG>) <VX HEAD=’related’>.*</VX> to (<NG>.*</NG>)

(<NG>.*</NG>) <VX HEAD=’predicted’>.*</VX> by (<NG>.*</NG>)

predictors of (<NP>.*</NP>) <VX HEAD=’included’>.*</VX> (<NG>.*</NG>)

Figure 4.2: Extraction pattern examples. They are encoded as regular expressions. The elements are words (W), noun phrases (NP) and noun groups (NG). Findings are in bold; parameters are in italics.
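As a rough illustration of how such patterns operate, the sketch below applies a simplified version of the third pattern in Figure 4.2 to a chunk-annotated sentence using Python's re module. The markup and the pattern are simplified, and the chunking shown is assumed rather than actual CASS output.

```python
import re

# Chunk-annotated sentence (simplified): noun groups in <NG>, verb group in <VX>.
sentence = ("<NG>New acute myocardial infarction or death</NG> "
            "<VX HEAD='predicted'>was predicted</VX> by "
            "<NG>ST-segment depression , prior angina , and age > 65 years</NG>")

# Simplified counterpart of: (<NG>.*</NG>) <VX HEAD='predicted'>.*</VX> by (<NG>.*</NG>)
pattern = re.compile(
    r"<NG>(?P<findings>.*?)</NG>\s*"
    r"<VX HEAD='predicted'>.*?</VX>\s*by\s*"
    r"<NG>(?P<parameters>.*?)</NG>")

m = pattern.search(sentence)
if m:
    print("FINDINGS:  ", m.group("findings"))
    print("PARAMETERS:", m.group("parameters"))
```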

4.2 Content Filtering

The individual-based user modeling takes place at the content filtering stage. On the one hand, the content identification stage provides a shallow semantic representation of the results: the concepts are identified and indexed by their CUIs, their roles are tagged as parameters or findings, and the relations that link concepts are classified into the six types mentioned previously. On the other hand, the user model also provides some shallow semantics, but about the patient record, in the form of a list of concepts indexed by their CUIs. Given a set of Results, only those that pertain to the patient information stored in the user model are kept. The filtering takes advantage of the semantics provided by both the instantiated Result templates and the user model. I now describe the filtering strategy which, given a template and the user model, returns a filtered template. The strategy relies on a function that matches the concepts present in the template with the user model.

A template is filtered by making recourse to the parameter(s) only, not the finding(s). Consider the template in Figure 4.3, which includes death as one of its findings. From a medical standpoint, this finding, which represents an outcome, will only become relevant to the patient if (s)he fits any of the parameters that predict the given outcome. Furthermore, some parameters, standing alone, relate to given outcomes; other parameters only do so in combination. For example, smoking can predict lung cancer, while smoking combined with pregnancy is associated with retardation of fetal growth. The dependence or independence of parameters is identified at the content identification stage and is stored in the template under the ANALYSIS TYPE slot.


Before filtering:

TEMPLATE Id: 12
FILE No: ahj 137 02 0424
SENTENCE No: S-98
RELATION: prediction
ANALYSIS TYPE: independent
FINDING(S):
  ITEM: CUI: C0155626  LEX: new acute myocardial infarction
  ITEM: CUI: C0011065  LEX: death
POSITIVE PARAMETER(S):
  ITEM: CUI: C0520887  LEX: ST-segment depression (OR 2.00, 95% CI 1.20 to 3.40; P = .008)
  ITEM: CUI: C0002962  LEX: prior angina (OR 2.70, 95% CI 1.34 to 5.57; P = .001)
  ITEM: CUI: C0001779  LEX: age > 65 years (OR 1.64, 95% CI 1.00 to 2.70; P = .01)

After filtering:

TEMPLATE Id: 12
FILE No: ahj 137 02 0424
SENTENCE No: S-98
RELATION: prediction
ANALYSIS TYPE: independent
FINDING(S):
  ITEM: CUI: C0155626  LEX: new acute myocardial infarction
  ITEM: CUI: C0011065  LEX: death
POSITIVE PARAMETER(S):
  ITEM: CUI: C0520887  LEX: ST-segment depression (OR 2.00, 95% CI 1.20 to 3.40; P = .008)
  ITEM: CUI: C0002962  LEX: prior angina (OR 2.70, 95% CI 1.34 to 5.57; P = .001)

Figure 4.3: Template before and after content filtering. The template is instantiated from the sentence "New acute myocardial infarction or death was predicted by ST-segment depression (OR 2.00, 95% CI 1.20 to 3.40; P = .008), prior angina (OR 2.70, 95% CI 1.34 to 5.57; P = .001), and age > 65 years (OR 1.64, 95% CI 1.00 to 2.70; P = .01)."

In the case of independent parameters, as in Figure 4.3, filtering simply keeps the parameters that are relevant to the patient and throws away all others. In the case of dependent parameters, the whole template is discarded unless all the parameters are relevant to the patient.

The relevance of a parameter to the user model is determined via a decision tree (see Appendix D). The tree encodes questions such as "is the concept mentioned in the user model?", "what is the semantic type of the concept?", "is there a value associated with the concept?", and "does the value of the concept in the user model match its value in the template?".
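The sketch below puts this filtering logic together. The relevance test is a crude stand-in for the decision tree of Appendix D, and all names and the toy patient record are assumptions; the CUIs are those of Figure 4.3.

```python
def values_match(template_value, patient_value):
    return template_value == patient_value   # placeholder; the real test handles ranges and qualifiers

def parameter_is_relevant(cui, value, patient_terms):
    """Crude stand-in for the decision tree of Appendix D."""
    if cui not in patient_terms:
        return False
    patient_value = patient_terms[cui]
    if value is None or patient_value is None:
        return True
    return values_match(value, patient_value)

def filter_template(analysis_type, parameters, patient_terms):
    """Keep the relevant parameters (independent analysis); require all of them (dependent)."""
    kept = [(cui, val) for cui, val in parameters
            if parameter_is_relevant(cui, val, patient_terms)]
    if analysis_type == "dependent" and len(kept) != len(parameters):
        return None          # discard the whole template
    return kept or None      # discard the template if no parameter is relevant

# Toy patient record: ST-segment depression and angina are mentioned, age is not.
patient_terms = {"C0520887": None, "C0002962": None}
params = [("C0520887", None), ("C0002962", None), ("C0001779", "> 65 years")]
print(filter_template("independent", params, patient_terms))
# -> [('C0520887', None), ('C0002962', None)]
```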


Chapter 5

Organizing the Summary Content

Once the different pieces of information have been selected from the input articles and filtered by the user model, the next task is to organize these individual content units for presentation to the user. Previous work on user modeling and generation showed that the content organization phase can be affected by user type. Paris (1993) showed that instructional texts must be organized differently if generated for experts or novices. In this part of the work, I do not attempt to incorporate any user modeling. The summary content is organized identically, whether generated for physicians or lay people.1 The content organization component is fully implemented.

1 Nor does any individual-based tailoring occur at the content organization stage.

Traditionally, content ordering rules are identified by manually analyzing a corpus of target texts. More recently, ordering constraints have been learned from examples of target texts (Duboue and McKeown, 2001; Kan and McKeown, 2002; Lapata, 2003). These approaches, however, rely on the presence of target texts which contain the same type of information as the one to be generated. One of the difficulties in multi-document summarization is that this assumption does not always hold. To my knowledge, there is no available corpus of human-written summaries that synthesize results from different clinical studies.

In previous work (Barzilay, Elhadad, and McKeown, 2002), we investigated strategies for ordering common information across input news articles and identified three strategies: chronological, majority-based, and cohesion-based. While in the news domain it makes sense to attempt to order events chronologically, it does not in the TAS framework, because the content units are time-independent scientific findings. The majority-based strategy tries to find a global ordering of the content units that departs in a minimal fashion from the orderings of the same content units in the input articles. The strategy relies on the fact that all the input articles contain the content units. While this assumption is correct for a summarizer that summarizes common information, it does not hold for TAS. The input articles contain few repetitions or contradictions. As a matter of fact, most of the content units selected by TAS are conveyed in only one input article. Therefore, the majority ordering strategy has no majority to work from. The last strategy, cohesion-based, augments the chronological one by improving the coherence of the summaries: blocks of content units that must be conveyed together are first identified and are then ordered chronologically. The ordering strategy in TAS is similar to the cohesion-based strategy: first, blocks of content units that must be conveyed together are identified (Merging). Each block is then assigned a priority score. The blocks are presented in order of priority (Ordering).

The effect of the content organization stage in TAS is two-fold: (1) the content units that repeat or contradict one another are identified, and (2) an order in which the content units should be presented to the user is established.

5.1 Merging

The goal of the merging step is to improve the coherence of the resulting summary. While the different pieces of information come from different input articles, they all pertain to the user model. In addition, the articles were selected (at the search stage of the PERSIVAL architecture) as relevant to the input question. Therefore, it is safe to assume that the content units are not a random set, and it is possible to obtain a somewhat coherent structure out of them.

The templates instantiated at the content selection stage are first broken down into individual content items. For instance, the filtered template of Figure 4.3 can be broken down into four separate content items: (C0520887 [ST-segment depression], prediction, C0155626 [acute myocardial infarction]), (C0520887 [ST-segment depression], prediction, C0011065 [death]), (C0002962 [angina], prediction, C0155626 [acute myocardial infarction]), and (C0002962 [angina], prediction, C0011065 [death]).

To identify the blocks of related content items that should be conveyed together, the individual items are clustered using hierarchical complete-link clustering. The similarity function between two content items is computed as the sum of the values of several features. Features are based on whether the parameters are the same, whether the two findings are the same, and how similar the relations are.2 In the case of content items with an "association" relation, since it is bidirectional, the tags "finding" and "parameters" are interchangeable; the similarity function takes this fact into account.

2 Each possible relation pair was manually assigned a weight. For instance, (prediction, prediction) is weighted higher than (prediction, association).

Because the similarity function assigns higher weights to two content items with the same parameters/findings and relations, repetitions are more likely to be grouped together. Similarly, contradictions are more likely to be clustered. A contradiction is defined here as two content items that share the same parameters and findings but have contradictory relations (e.g., "association" and "not association"). This definition is arguably conservative. It would be possible to consider a less conservative definition of a contradiction given the available data structure. For instance, one could define a contradiction as any two content items that have a contradictory relation, share the same parameters, and contain findings that are related semantically: (chest pain, predict, atrial fibrillation) and (chest pain, not predict, arrhythmia). However, TAS makes the choice to be overly conservative in its inference. The rationale is to present only obvious contradictions as such and to let the physician decide on the less obvious ones, instead of possibly presenting misleading information on the basis of erroneous medical inferences.

The merging step achieves two purposes: (1) it allows for the easy identification of strictly identical content items (that is, repetitions either across or inside articles) and of contradictory content items, and (2) it dynamically groups together the content items that are semantically related to each other. This step is equivalent to dynamic paragraph planning, where each cluster represents a paragraph.
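To make the clustering input concrete, here is a minimal sketch of a pairwise similarity function of the kind described above. The feature weights and the relation-pair weights are assumptions (the text only says such weights were set manually), and the special handling of bidirectional "association" items is omitted.

```python
# Relation-pair weights are illustrative; footnote 2 only states they were assigned manually.
RELATION_PAIR_WEIGHT = {
    ("prediction", "prediction"): 2.0,
    ("association", "prediction"): 1.0,
}

def similarity(item_a, item_b):
    """Sum of feature values: shared parameter, shared finding, and relation closeness."""
    score = 0.0
    if item_a["parameter"] == item_b["parameter"]:
        score += 2.0
    if item_a["finding"] == item_b["finding"]:
        score += 2.0
    pair = tuple(sorted((item_a["relation"], item_b["relation"])))
    score += RELATION_PAIR_WEIGHT.get(pair, 0.0)
    return score

a = {"parameter": "C0520887", "relation": "prediction", "finding": "C0011065"}
b = {"parameter": "C0520887", "relation": "prediction", "finding": "C0155626"}
print(similarity(a, b))   # same parameter and relation, different findings -> 4.0
```

A hierarchical complete-link clustering over a distance derived from these pairwise scores would then yield the blocks of related content items.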

5.2 Ordering

The ordering step is concerned with assigning a priority to each cluster of related content units. The cluster with the highest priority will be presented first to the user, the second highest second, and so on. As acknowledged in previous work, there is no such thing as an ideal ordering. Rather, TAS aims to find an ordering that allows the reader to understand the summary, by improving its cohesion.

During the initial user studies, physicians mentioned the need to see any repetitions and contradictions appear early on in the summary. In addition, a set of results is likely to be more important when it reflects input from multiple articles, or at least from multiple analyses within the same article. In accordance with these considerations, each cluster of content items gets assigned a priority based on the following counts: the number of content items that contain the terms asked about in the input user query; the number of repetitions present in the cluster; the number of contradictions present in the cluster; the number of different articles represented by the content items in the cluster; and the number of distinct sentences represented by the content items in the cluster.
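A minimal sketch of such a priority score follows. The proposal does not say how the five counts are combined, so the equal, additive weighting and all field names here are assumptions.

```python
def cluster_priority(items, query_cuis):
    """Combine the five counts listed above (equal additive weights assumed)."""
    query_hits = sum(1 for it in items
                     if it["parameter"] in query_cuis or it["finding"] in query_cuis)
    repetitions = len(items) - len({(it["parameter"], it["relation"], it["finding"]) for it in items})
    contradictions = sum(1 for a in items for b in items
                         if a["parameter"] == b["parameter"] and a["finding"] == b["finding"]
                         and a["relation"] == "no_" + b["relation"])
    articles = len({it["article_id"] for it in items})
    sentences = len({(it["article_id"], it["sentence_id"]) for it in items})
    return query_hits + repetitions + contradictions + articles + sentences

# Clusters would then be presented in decreasing order of priority, e.g.:
# clusters.sort(key=lambda items: cluster_priority(items, query_cuis), reverse=True)
```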


Chapter 6

Tailoring Wording for Classes of Users

In this chapter, I describe the strategies used for realizing the summary content into an English text. The realization of text targeted to physicians is already implemented in the current version of TAS. The realization for lay users is part of my proposed work.

Traditionally, realization from semantic concepts into English is achieved by first analyzing a corpus written in the domain and genre specific to the target texts. Based on the analysis, a mapping dictionary is built where each semantic unit corresponds to one or many verbalizations. This process is manual and known to be tedious. In the framework of TAS, because the semantic units are extracted from the different input texts at the content selection stage, a set of candidate verbalizations is already available for each semantic unit. However, the task is not entirely completed; one must still choose among the available verbalizations, as one might be more appropriate than another in the context of the text summary or given the user type. An additional challenge in the framework of TAS is that, because the candidate verbalizations come from the input articles, they are likely to be understood by medically skilled readers, but not by lay users.

The main contribution of the generation stage of TAS is to take into account the level of expertise of the user. Using a comparable corpus of technical and lay texts, I plan to learn rules for text-to-text generation. The translation of technical into lay language can be seen, at a more abstract level, as a genre-to-genre transformation.

In the TAS framework, there are two types of semantic units: concepts, such as parameters or findings, and relations, such as association and risk. Relations typically map to verb phrases, while concepts are best verbalized as noun phrases. The verb phrases constitute sentences with open slots for their arguments, which are filled by the noun phrases verbalizing concepts. The following sections focus on the generation of sentences and concepts. For each one, I describe, in turn, strategies for acquiring the candidate verbalizations for different user types and strategies for lexical choice.


6.1 Sentence Generation

Relations in the TAS framework are verbalized as sentences. For instance, a prediction relation between two parameters and one finding can be verbalized as "Predictors for <finding> included <parameter1> and <parameter2>," where <finding>, <parameter1>, and <parameter2> are concepts to be verbalized. There are multiple ways to verbalize any given relation. While it is a considerable subject of research, choosing the most appropriate sentence among the verbalization candidates is not a major focus of this work. Nevertheless, in an attempt to produce varied language, TAS follows a simple strategy: select the least recently used verbalizations among the candidate verbalizations, and choose one of them randomly.
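A minimal sketch of this selection strategy, assuming a per-relation pool of candidate sentence patterns; the first pattern below is quoted above, while the second is invented for illustration.

```python
import random

class RelationVerbalizer:
    """Pick among candidate verbalizations: least recently used first, ties broken at random."""
    def __init__(self, candidates):
        self.last_used = {v: -1 for v in candidates}
        self.clock = 0

    def choose(self):
        oldest = min(self.last_used.values())
        least_recent = [v for v, t in self.last_used.items() if t == oldest]
        choice = random.choice(least_recent)
        self.clock += 1
        self.last_used[choice] = self.clock
        return choice

prediction = RelationVerbalizer([
    "Predictors for <finding> included <parameter1> and <parameter2>.",
    "<parameter1> and <parameter2> were found to predict <finding>.",   # invented example
])
print(prediction.choose())
print(prediction.choose())   # the other, not-yet-used candidate
```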

The question I focus on is how to acquire such candidate verbalizations, both for physicians and for lay users. TAS takes advantage of the library of patterns used at the content selection stage. The same patterns that were employed at the extraction step to instantiate templates now constitute entries in the verbalization dictionary. The acquisition of verbalizations has been fully implemented for generating text for physicians.

Because these verbalizations reflect the style of the input clinical studies, there is no guarantee that they are understandable by lay users. The main challenge, then, becomes how to acquire additional verbalizations that fit the language which lay users expect and understand. I propose to learn the lay verbalizations automatically, relying on the technical verbalizations. One can view two versions of the same piece of information verbalized in technical or lay language as paraphrases, or even translations, of one another. Given enough instances of such sentence pairs, I plan to learn rules to map technical verbalizations to lay ones.

6.1.1 Sentence Alignment

It is possible to collect instances of sentence pairs (technical version, lay version) manually, where both versions convey the same information; however, it would be beneficial to do so automatically. In previous work (Barzilay and Elhadad, 2003), we looked at the task of aligning sentences in encyclopedia articles about different cities written for adults and children. Given two texts which convey essentially the same information (e.g., the adult version and the children's version), we identified sentence pairs which convey the same information. The search for sentence pairs is based on a weak lexical similarity measure (cosine metric) and the topics of the paragraphs in which the sentences appear (examples of topics in our encyclopedia domain are the history of the city, demographics, and places of interest). Topics are identified in an unsupervised manner. The rationale for relying on the topical structure of the texts is that two sentences are more likely to convey the same information if their contexts (approximated by the paragraphs they belong to) relate the same topics. The algorithm has three stages: (1) identify the topical structures of each corpus (children and adult), (2) map the paragraphs that have high lexical similarity and related topics, and (3) for each mapped paragraph pair, align sentences using dynamic programming and lexical similarity.
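For concreteness, a minimal sketch of the weak lexical similarity measure (bag-of-words cosine) used in the alignment; the tokenization and weighting here are simplified assumptions.

```python
import math
from collections import Counter

def cosine(sent_a, sent_b):
    """Bag-of-words cosine similarity between two sentences."""
    va, vb = Counter(sent_a.lower().split()), Counter(sent_b.lower().split())
    shared = set(va) & set(vb)
    dot = sum(va[w] * vb[w] for w in shared)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

print(cosine("patients with coronary artery disease were studied",
             "the study looked at coronary artery disease patients"))
```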

I plan to adapt this algorithm to the medical domain. I will rely on a comparable corpus of clinical study abstracts and summaries written for a lay audience (see Appendix E for an example of a text pair and aligned sentences). Preliminary work has indicated that simply applying the algorithm to the new domain does not yield satisfying results. Indeed, there are a number of departures from the encyclopedia domain. Paragraphs in the lay texts are very short, often only one sentence long. More importantly, the technical and lay texts have very little lexical similarity, rendering simple similarity measures such as cosine unusable. I have tried to inject some knowledge into the similarity measure by substituting medical terms with their CUIs, but this decreased the accuracy of the similarity. This is due mainly to the fact that when the original study uses a specific term (e.g., acute myocardial infarction), the corresponding lay sentence will refer to a hypernym (myocardial infarct), which is encoded as a different CUI in UMLS. My plan is to continue looking for ways to inject knowledge, in a reasonable way, into the similarity measure. I also plan to investigate whether it is possible to iterate the algorithm by making the learning of topical structure and sentence alignment a co-training process.

6.1.2 Learning of Rewriting Rules

Given pairs of (lay, technical) sentences, I plan to learn rewriting rules that transform a technical sentence into a simpler version. I plan to investigate two different approaches. A first approach consists of learning paraphrases at the sentence level, so that a given technical sentence is transformed, as a whole, into a lay sentence. However, there might not be enough training data available to discriminate between idiosyncratic parts of sentences and the ones contributing to the paraphrase. A second possible approach consists of learning rewriting rules of smaller scope. In this framework, a technical sentence will be rewritten by applying several rules successively until it reaches a "simplicity" threshold. One advantage of the second framework is that there are some rules that, as humans, we know would be helpful, and that can easily be included in the set of automatically learned ones, rendering the system as a whole more accurate. For instance, it seems obvious that stripping a technical sentence of its parenthetical phrases brings it closer to its lay version.
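As an illustration of a small-scope, hand-written rule of the kind just mentioned, the sketch below strips parenthetical phrases; the example sentence is adapted from Figure 4.1.

```python
import re

def strip_parentheticals(sentence):
    """One hand-written rewriting rule: drop parenthetical material,
    which is typically too technical for lay readers."""
    return re.sub(r"\s*\([^)]*\)", "", sentence)

technical = ("New acute myocardial infarction or death was predicted by "
             "ST-segment depression (OR 2.00, 95% CI 1.20 to 3.40; P = .008) and "
             "prior angina (OR 2.70, 95% CI 1.34 to 5.57; P = .001).")
print(strip_parentheticals(technical))
```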

6.2 Lexical Choice for Medical Concepts

I first describe the algorithm for verbalizing a concept, assuming that a mapping dictionary is available. I then report how the dictionary is acquired. Finally, I propose ways to augment the dictionary with more verbalizations appropriate for lay users and additional characteristics useful for lexical choice.

6.2.1 Choosing a Verbalization

A concept is here defined as a medical term along with any associated attribute (a value or a definition). For instance, in the verbalization "age > 65 years," the concept "age" has an associated numerical value ("> 65 years"). The concept "angina" in "prior angina" has a value modifier ("prior"). The verbalization "cardioversion (an electrical shock delivered to the heart to restore a normal rhythm)" consists of a concept followed by a definition. A verbalization will be chosen by making the following decisions:

• What is the best term verbalization? If the user is a physician, TAS considers conciseness the most important quality, and so the shortest verbalization is chosen. For lay users, the most familiar verbalization will be considered the best. For instance, if the candidate verbalizations for the concept C0004238 are "afib," "AF," "auricular fibrillation," and "atrial fibrillation," then the verbalization selected for physicians will be "AF," while "atrial fibrillation" would be picked out for lay users (a sketch of this choice appears after this list).

• Is a value needed? If so, what is the best way to verbalize the value? At the content selection stage, both the concept and its associated value, if any, are extracted. The value can simply be verbalized the way it appeared in the original text. However, if the value is numerical, it might be more accessible for a lay user to verbalize it as an adjective. For instance, "blood pressure > 90 mm Hg" could be more readable for a lay reader if verbalized as "elevated blood pressure."

• Is a definition needed? If so, what type of definition is best to insert, and what is the best way to insert it? If the user is not a physician, a concept might be unfamiliar, no matter how it is verbalized. Adding a definition might help the reader understand the text. For instance, the unfamiliar term "cardioversion" can be supplemented by a definition.
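A minimal sketch of the term-level choice from the first decision above; the familiarity counts are invented, and the lay criterion simply assumes the frequency-based familiarity measure proposed in Section 6.2.3.

```python
def choose_term_verbalization(candidates, user_class, familiarity):
    """Physicians: shortest (most concise) form; lay users: most familiar form."""
    if user_class == "physician":
        return min(candidates, key=len)
    return max(candidates, key=lambda v: familiarity.get(v, 0))

candidates = ["afib", "AF", "auricular fibrillation", "atrial fibrillation"]
familiarity = {"atrial fibrillation": 120, "afib": 40, "auricular fibrillation": 2, "AF": 15}  # invented counts
print(choose_term_verbalization(candidates, "physician", familiarity))  # 'AF'
print(choose_term_verbalization(candidates, "lay", familiarity))        # 'atrial fibrillation'
```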

This third set of decisions (definition insertion) is complex. The initial challenge is how to approach this problem: treat each question separately (first decide whether a definition is needed, then which definition is more appropriate, and finally how to insert it) or approach it as one global decision. The next challenge concerns the features which help make these decisions. I plan to investigate (1) features of the concept itself, such as whether the concept is mentioned for the first time in the summary or whether it is mentioned at all in the patient record, (2) features of the context of the concept, such as the topic of the sentence the concept belongs to, and (3) the type of definition to insert. For instance, given the term "Addison's disease," there are several definitions available. The definition "a degenerative disease that is characterized by weight loss, low blood pressure, extreme weakness, and dark brown pigmentation of the skin" focuses on the symptoms of the disease and will be referred to as a symptom definition, while others can focus on different aspects (see Appendix C for examples of types). It could be more appropriate to insert a symptom definition when the surrounding sentence conveys information about diagnosis of the disease, for instance, as opposed to a sentence about treatment options. A last challenge concerns the insertion of the definition. Inserting a complex linguistic structure, such as a definition, in isolation from the rest of the sentence the concept belongs to could prove to be a naive strategy. Clearly, the surrounding sentence constrains the possible syntactic realizations of the definition. What should be done if the sentence and the definition at hand are not compatible is an open question. One would need to modify either the sentence or the definition to reach a compatible pair.

All of the above decisions about terms, values, and definitions rely on a priori knowledge about a concept, such as its possible verbalizations, the degrees of familiarity of its candidate verbalizations, the possible mappings between modifying numerical measurements and descriptive adjectives, and its available associated definitions. I explain next how these pieces of information are collected.


6.2.2 Acquiring the Term Verbalizations

One possible approach to verbalizing terms is to take advantage of the existing UMLS dictionary. My approach, however, consists of acquiring the verbalization candidates on the fly from the input articles. At the end of the content selection stage, the instantiated templates contain the extracted phrases for the different types of concepts (parameters and findings). The phrase corresponding to a certain CUI becomes one of its candidate verbalizations. These verbalizations comprise a subset of the UMLS verbalizations. This way, the chosen verbalization is guaranteed to be closer to the ones used in the input articles, rendering the overall summary less artificial.
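As an illustration, a minimal sketch of this on-the-fly collection is given below, under the assumption that each instantiated template exposes its concepts as (CUI, extracted phrase) pairs; the data structure shown is hypothetical, not the actual TAS representation.

from collections import defaultdict

def collect_candidate_verbalizations(instantiated_templates):
    """Map each CUI to the phrases that verbalized it in the input articles.

    Assumed template format: each template holds a list 'concepts' of
    dicts with the keys 'cui' and 'phrase'.
    """
    candidates = defaultdict(set)
    for template in instantiated_templates:
        for concept in template["concepts"]:
            candidates[concept["cui"]].add(concept["phrase"])
    return candidates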

However, the available candidate verbalizations may not be understandable by lay users. I propose to collect off-line a set of possible term verbalizations valid for lay users to add as candidate verbalizations. Using a large corpus of medical texts targeted at lay users, the medical terms are identified along with their CUIs (in the same way the technical input articles are preprocessed for TAS). For each CUI, only one term, the one with the highest degree of familiarity, is kept as a “lay alternative” in addition to the other candidate verbalizations.
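A sketch of how such lay alternatives could be selected off-line follows; frequency in the lay corpus stands in for the degree of familiarity discussed in Section 6.2.3, and the input format is an assumption made for illustration.

from collections import Counter, defaultdict

def lay_alternatives(lay_corpus_terms):
    """Pick one lay alternative per CUI.

    lay_corpus_terms: iterable of (cui, phrase) pairs obtained by running
    the same term identification used on the technical articles over
    lay-oriented texts (assumed format). The most frequent phrase for a
    CUI is taken to be the most familiar one.
    """
    counts = defaultdict(Counter)
    for cui, phrase in lay_corpus_terms:
        counts[cui][phrase] += 1
    return {cui: c.most_common(1)[0][0] for cui, c in counts.items()}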

6.2.3 Augmenting the Dictionary

Familiarity Property. Deciding whether a given verbalization of a medical term is familiar to a reader can be challenging. I propose to approximate the degree of familiarity of a term by recourse to its frequency, as recorded in a large corpus of medical texts written for lay people. Given two candidate term verbalizations, the one with the higher frequency will be considered the most familiar one. Additional features, especially user-based as opposed to class-based features, can be considered to predict the familiarity of a term, and if time permits I will investigate which ones are most predictive.¹

¹Patients, for example, often become quasi-experts with respect to their medical conditions and will be more familiar with technical terms commonly employed in conjunction with their ailments.

Value Verbalizations. The goal of adding value verbalizations is to replace the possible numerical measurements extracted from the input text (e.g., “heart rate of 150 bpm”) with gradable adjectives — attributive adjectives that denote measurements (e.g., “elevated heart rate”). A simple way to do so would be to refer to a list of normal/abnormal ranges for different clinical variables. However, it is surprisingly difficult to find such a list. As an alternative, I plan to investigate whether it is possible to learn the mapping between measurements and adjectives automatically. In preliminary work, all the occurrences of numerical measurements associated with all the medical terms present in a large corpus of clinical studies were collected and normalized. In parallel, all the adjectives associated with the terms were also collected. One can hypothesize that while writing, authors of clinical studies can at any time either choose to give a numerical measurement or employ an adjective. If this assumption is correct, then it should be possible to map the distribution of numerical measurements in a large corpus to the distribution of the ordered gradable adjectives for any given medical concept.
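Under this hypothesis, one simple way to realize the mapping is to align quantiles of the observed measurement distribution with an ordered list of gradable adjectives. The sketch below uses equal-probability bins for readability; a fuller version would size the bins according to the observed adjective frequencies. The adjective ordering and the example values are assumptions.

import bisect

def build_value_mapper(measurements, ordered_adjectives):
    """measurements: numeric values observed for one concept in the corpus;
    ordered_adjectives: adjectives from lowest to highest degree.
    Cut points are placed at equal-probability quantiles of the values."""
    values = sorted(measurements)
    n_bins = len(ordered_adjectives)
    cuts = [values[int(len(values) * (i + 1) / n_bins) - 1]
            for i in range(n_bins - 1)]

    def to_adjective(value):
        return ordered_adjectives[bisect.bisect_right(cuts, value)]
    return to_adjective

# Hypothetical usage for heart rate (values in bpm):
mapper = build_value_mapper([55, 60, 72, 80, 95, 150], ["low", "normal", "elevated"])
print(mapper(150))  # prints "elevated"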

Definitions. I plan to take advantage of the glossary provided by the DEFINDER system (Muresan and Klavans, 2002), which gathers terms and definitions from consumer-oriented medical articles. Each term can have many definitions associated with it, but not all convey the same type of information (see Appendix C for an example). I plan to train a classifier to determine a definition type in a supervised fashion. I will look at lexical features in the definition and the semantic type of the term defined. This information will in turn be used to investigate which type of definition is most appropriate in a given context.
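A minimal supervised sketch of such a definition-type classifier is shown below. The training triples, the choice of scikit-learn, and the trick of prefixing the definition text with the UMLS semantic type of the defined term are all illustrative assumptions.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labeled data: definition text prefixed with the semantic
# type of the defined term, so both contribute lexical features.
train = [
    ("DiseaseOrSyndrome characterized by weight loss and extreme weakness", "symptom"),
    ("DiseaseOrSyndrome caused by failure of adrenocortical function", "cause"),
    ("DiseaseOrSyndrome affects about 1 in 100,000 people", "demographics"),
]
texts, labels = zip(*train)

definition_type_classifier = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
definition_type_classifier.fit(texts, labels)
print(definition_type_classifier.predict(
    ["DiseaseOrSyndrome caused by a deficiency in adrenocortical hormones"]))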


Chapter 7

Evaluation

I plan to conduct both intrinsic and extrinsic evaluations of TAS. Traditionally, the evaluation of a summarizer is known to be a hard task (Sparck-Jones, 1999; Teufel, 2001). The main reason is that salience criteria are not well specified for most summarizers, and it is not known in advance for which task users will need the summaries. In the TAS framework, however, we have a more precise idea of user needs. The salience criteria are known (any result that pertains to the patient record stored in the user model is considered salient and should be selected). The intrinsic evaluation of the content selection is, therefore, more straightforward and can be, to some extent, automated. For the evaluation of the other components of TAS (as well as the overall evaluation), I do not propose any novel automatic strategy. Rather, I plan to rely on human judgments. The challenge will consist of creating evaluation scenarios and questions that allow the judges to evaluate the summaries reliably.

7.1 Intrinsic Evaluations

In this section, I focus on the evaluation of the different components of TAS.

Input Errors. Both the information in the user model and the input articles go through several preprocessing steps before being used by TAS. An essential step is the identification of medical terms. A term is considered accurately identified if the lexemes referring to it are marked and the correct corresponding UMLS semantic tag is found. In a corpus of 661 terms, the automatic term identification achieved 83.6% precision and 95.5% recall. Identification of values corresponding to terms achieved 85.4% accuracy.


7.1.1 Content Selection

Shallow Parsing. The selection of results relies on patterns that use shallow syntactic information, viz., noun phrases and verb phrases. While the verb phrases are similar in the medical domain and in more typical ones, noun phrases are harder to identify accurately. The customized version of CASS yields better results in identifying noun phrases than off-the-shelf parsers. When evaluated on 70 sentences, which contained 501 noun phrases, the shallow parser identified noun phrases with 82.4% recall and 73.6% precision. The identification of the noun phrase heads achieved 97.1%.

Content Identification. Content identification is an information extraction task. It relies on extraction patterns. Manual extraction was performed on a test set and is used to compute precision and recall of the patterns employed by TAS. The test set consisted of 40 clinical studies from seven different journals. The extraction achieved 89% precision and 65% recall. An instantiated template was considered accurate only when all its slots were instantiated correctly.
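For concreteness, a sketch of how such template-level precision and recall could be computed follows; representing each template as a frozenset of (slot, value) pairs, so that equality holds only when every slot is correct, is an assumption made for illustration.

def template_precision_recall(system_templates, gold_templates):
    """Each template is a frozenset of (slot, value) pairs (assumed
    representation); a system template counts as correct only if an
    identical template, i.e. with all slots correct, is in the gold set."""
    system, gold = set(system_templates), set(gold_templates)
    correct = len(system & gold)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall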

Content Filtering. To determine whether the filtering strategies allow the identification of pertinent information, I will investigate evaluation scenarios in which medical experts get to simulate the content filtering stage. It is a hard problem, and it is not certain that I will be able to obtain such results from physicians. If I succeed, I will use this gold standard to determine precision and recall counts. It will also be possible to compare our strategy to different baselines or alternative strategies. For instance, since the input articles are supposed to be globally relevant to the patient record, one possible baseline is to never filter any extracted result.

7.1.2 Content Organization

As mentioned before, there are no existing instances of summaries that synthesize results of different clinical studies. In addition, in previous work we have confirmed the intuition that there is no single good ordering of information. As a consequence, building an adequate gold-standard corpus of valid orderings would be too expensive, in both time and effort. I plan to ask judges to grade a set of automatically produced summaries. Grading will be based only on the flow of information. The judges will also be asked a small set of questions, such as “did the summary contain any repetition of information?” and “was there any contradictory information that was not explicitly signaled to the reader?”. While it would be ideal to have this component judged by physicians, it is not necessary as it was for the content filtering. If I cannot obtain judgments from physicians, I will rely on lay subjects.

7.1.3 Realization

Definition Insertion. For the concepts whose realizations contain a definition, I plan to evaluate two questions: (1) was a definition needed? and (2) was an adequate definition selected given the context? These questions are equally valid for TAS summaries and for any other text targeted at lay readers. Since the evaluation relies on human judgments, it might be better to test these questions on such naturally occurring texts instead of the artificially generated TAS summaries. This way, the judges will not be distracted by the possible mistakes present in the summaries. I propose the following methodology. Taking texts written by humans for lay users, such as the ones in my lay corpus, I will first strip any existing definitions out of the text. I will run my definition insertion tool on the texts to decide which terms need to be defined and which definition is most appropriate. I will then manually insert the chosen definition in the original lay text. The subjects will be asked to judge the coherence of the resulting text.

Sentence Alignment. I will follow the same evaluation methodology described in (Barzilay and Elhadad, 2003). A test set of article pairs has been manually aligned. I will compare the precision/recall counts of the automatic alignment with two state-of-the-art alignment tools for monolingual comparable texts, SimFinder (Hatzivassiloglou et al., 2001) and Decomposition (Jing, 2002), and a simple baseline based on cosine lexical similarity.
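The cosine baseline is simple enough to be sketched directly; the tokenization and the similarity threshold below are assumptions for illustration.

import math
import re
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def baseline_alignment(lay_sentences, technical_sentences, threshold=0.3):
    """Align every (lay, technical) sentence pair whose lexical cosine
    similarity exceeds the (assumed) threshold."""
    bow = lambda s: Counter(re.findall(r"\w+", s.lower()))
    lay_vectors = [bow(s) for s in lay_sentences]
    technical_vectors = [bow(s) for s in technical_sentences]
    return [(i, j)
            for i, lv in enumerate(lay_vectors)
            for j, tv in enumerate(technical_vectors)
            if cosine(lv, tv) > threshold]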

7.2 Extrinsic Evaluation

7.2.1 Plan for Physicians

The goal of the extrinsic evaluation is to determine whether TAS can help physicians at the point of patient care. The main hypothesis I plan to test is whether TAS helps physicians access relevant information more efficiently.

Working together with the cognitive scientists in PERSIVAL, we will recruit subjects who are medically trained (either residents or physicians) and will run a set of evaluation scenarios. A scenario will consist of a patient record, a medical question that the subject would want to address for the given patient, and a list of articles, typically returned from a search engine, which contain some answer to the question. For each scenario, each subject will be presented with either the raw list of articles, displayed as returned from a search engine, or the summary produced by TAS, along with the links it provides to the articles. We plan to have a cognitive walk-through with the subjects, measuring and comparing how many actions are needed in each case to determine relevant information for the given scenarios. We also plan to have the subjects answer a small questionnaire to get qualitative feedback on TAS.

7.2.2 Plan for Lay People

There are two claims I wish to evaluate: adapting the summary text for lay people (1) preserves the information conveyed in the original text and (2) helps the reader to understand the information better.

To verify that there is no significant change in information when simplifying the summary, we will ask for the help of medical experts. Since the simplification will by its nature modify the given information, I do not plan to ask physicians for a precise quantification of the information altered or lost. As a matter of fact, in the comparable corpus of lay/technical texts, the authors of the lay version often change the original information drastically. In my evaluation scenario, the physicians will be provided with a set of sentences and their transformations and will be asked to assess whether the simplified version conveys essentially the same information as the original text.

The claim about text readability is more complex to evaluate.¹ I plan to approximate the readability of a text by measuring the difference between its similarity with a corpus of lay texts and its similarity with a corpus of technical texts. One way to do so is to train a genre classifier on a mixed corpus of technical and lay texts. Given a generated text, if it is classified as lay, it can be assumed to have the qualities of a lay text. If time permits, I will also conduct a user study with lay readers. The subjects, presented with either original summaries or simplified versions, will be asked comprehension questions.

¹I do not plan to use readability metrics. See (Klare, 1963; Redish, 2000) for detailed critiques of readability measures.
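A sketch of the genre-based readability proxy is given below, assuming the lay and technical corpora are available as lists of strings; the particular vectorizer and classifier are illustrative choices, not commitments of the proposal.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def train_genre_classifier(lay_docs, technical_docs):
    """Train a lay-vs-technical genre classifier on the mixed corpus.
    A generated summary later classified as 'lay' is taken as evidence
    that it reads like a lay text."""
    lay_docs, technical_docs = list(lay_docs), list(technical_docs)
    texts = lay_docs + technical_docs
    labels = ["lay"] * len(lay_docs) + ["technical"] * len(technical_docs)
    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    model.fit(texts, labels)
    return model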


Chapter 8

Status and Thesis Timeline

So far, I have worked on the following tasks:

• Corpus collection of clinical studies. The corpus contains 30,000 technical articles from 22 medical journals specializing in cardiology.

• Corpus collection of lay/technical articles. The comparable corpus currently contains 372 article pairs. The lay articles are from ReutersHealth and are summaries of the corresponding technical articles. The original technical articles are from medical journals. An additional 840 lay articles are available.

• Basic processing tools adapted to medical texts. Part-of-speech tagging, medical term identification, and identification of modifiers and numerical values associated with terms have been implemented. The shallow syntactic parser CASS was also adapted to the style of clinical studies. These tools have been evaluated.

• Content selection. The module is fully implemented and running. The content extraction step was evaluated. The content filtering has not been evaluated yet.

• Content organization. The module is fully implemented and running.

• Surface realization. The module is implemented and running for physicians only.

The timeline for my thesis is as follows:

• Mar. 2004 : Integration of TAS with PERSIVAL. I have already started to integrate TAS with the overall PERSIVAL system. Currently, we have identified an encoding of the input parameters for TAS. I now need to coordinate with the layout team of PERSIVAL and come up with a valid encoding of TAS output.

• Apr. 2004 - Jun. 2004 : Evaluation for physicians.

• Jul. 2004 - Sep. 2004 : Definition insertion and concept verbalization.

– Train a classifier to determine the type of a definition,

– Investigate features for determining which type of definition is more appropriate to insert in a given context,


– Investigate features for deciding how to best insert a definition given its syntactic structure, the term to define, and the syntactic structure of the sentence surrounding the term.

• Oct. 2004 - Feb. 2005 : Sentence alignment and rewriting.

– Adapt the sentence alignment algorithm developed in previous work to the medical domain,

– Based on the training instances of aligned sentence pairs, I will investigate the learning of rewriting rules to transform technical to lay sentences.

• Mar. 2005 - May. 2005 : Evaluation for lay users.

• May. 2005 - Aug. 2005 : Thesis write-up.


Chapter 9

Conclusion

9.1 Contributions

At the functional level, the two main contributions of my thesis work are:

• a multi-document summarizer of clinical studies which synthesizes the results reported into a single, fluent text summary.

• a user-sensitive summarizer which adapts its content to the interests of the user and its language to the level of expertise of the user.

The technical contributions of my thesis work are the following:

• Content unit representation — using information extraction techniques and relying on the UMLS ontology, the summarizer operates over a data structure that is between full semantic analysis and agnostic extracted text. The representation allows for easy implementation of filtering strategies and content organization strategies.

• Dynamic content organization — coherence is approximated by clustering the selected content units based on their semantic similarity. The ordering relies on an automatically assigned priority weight for each cluster.

• Text summary re-generation — the summary sentences are produced by reusing extracted phrases from the input articles and mixing them with generation templates.

• Text-to-text generation — I plan to learn rewriting rules to transform a technical text into a lay version. This mapping from one genre to the other will take place at the lexical level and at the sentence level.

• Sentence alignment for comparable texts — to collect instances of (technical, lay) sentence pairs for the learning of rewriting rules, I propose to align a comparable corpus of paired technical and lay texts automatically. The alignment relies on the topical structures of technical texts and lay texts.


9.2 Limitations

TAS has the following limitations in its current design:

• Content extraction patterns — because the extraction patterns were manually written, their coverage is not optimal. Learning more patterns automatically is not a focus of this thesis.

• User model representation — the patient information stored in the user model is an approximation of the actual patient record. By using a list of terms and their associated values extracted from the record, the local context around the term is preserved, but not the global context. For instance, looking at a concept in the list such as “heart rate = 70 bpm,” one cannot determine in which medical context the measurement for the heart rate was performed: at rest, after exercising, pre-surgery, etc.

• Individual-based user modeling — the user’s interests are taken into account at the content selection stage, but not at successive stages. While it seems obvious that individual-based user modeling should take place, at a minimum, at the content selection stage, such modeling may theoretically also be performed at other stages of the generation process.

• Class-based user modeling — the level of expertise of the user is taken into account at the realization stage, but not at previous stages. As in the case of the previous limitation, such modeling could also be performed at other stages of the generation process.

TAS is a domain-dependent summarizer. For TAS to be ported to a different domain, the following is required:

• User model — TAS expects an existing user model which contains information that is directly comparable to the information conveyed in the input texts.

• Ontology — TAS relies on the UMLS as the ontology for the domain. The CUIs provide an abstraction over the content units and the user model and allow for concept matching (both between content units and between the content units and the user model).

• Degree of abstraction — the type of information to be conveyed in the summary can be represented by a data structure, such that at least shallow semantics can be automatically obtained to allow for further processing (content filtering and content organization).


Appendix A

Input Preprocessing

Article Genre Classification

Genre classification takes an article and classifies it into one of the four following categories: clinical study, review, letter to the editor, and case report. The classifier was trained on 2,700 articles and tested on 1,000 articles. To get the labels automatically, we took advantage of the meta-data field for publication type present for some articles indexed in the medical search engine PubMed¹.

Features included the presence of an abstract, the words in the abstract, and the number of words in the article. The genre classification achieves a general accuracy of 96% on the testing set. The specific classification for the “clinical study” category achieved 92.2% precision and 97.7% recall.

Article Clinical Task Classification

The clinical task classification takes as input a clinical study and determines its main clinical task: prognosis, treatment, or diagnosis. The training set (500 clinical studies) and the testing set (200 clinical studies) were manually annotated. The guidelines and the interface for the annotation are available at http://www.cs.columbia.edu/~noemie/tas/clinical_task/.

Features included the words present in the abstract and the ratio of the different semantic types of the medical terms in the abstract. The clinical task classification achieved a general accuracy of 84.1%.

¹http://www.ncbi.nlm.nih.gov/PubMed/


Appendix B

TAS Input/Output Example

TAS Inputs

• Patient record¹ –

44 year old female with unstable angina, past medical history of coronary artery disease, status post myocardial infarction in 2000, status post CABG in 2000, status post angioplasty with stent placement in 2001, diabetes for 11 years, hypertension, peripheral vascular disease. Her hospital course was complicated by atrial fibrillation requiring cardioversion. The patient smoked tobacco and quit in 1997 and also quit drinking alcohol. She came to the hospital because of shortness of breath, increasing dyspnea and chest pain. Left ventricular ejection fraction is 35%. There is a chance of recurrent atrial fibrillation.

¹The actual patient record is very long; this is a synopsis of the main characteristics of the patient.

• User Query – tell me more about atrial fibrillation.

• Input Articles – ten clinical studies provided by PERSIVAL.

1. Prophylactic Oral Amiodarone Compared With Placebo for Prevention of Atrial Fibrillation After Coronary Artery Bypass Surgery

2. Intravenous amiodarone for prevention of atrial fibrillation after coronary artery bypass grafting

3. Intravenous sotalol decreases transthoracic cardioversion energy requirement for chronic atrial fibrillation in humans: assessment of the electrophysiological effects by biatrial basket electrodes

4. Intraoperative amiodarone as prophylaxis against atrial fibrillation after coronary operations

5. Oral d,l sotalol reduces the incidence of postoperative atrial fibrillation in coronary artery bypass surgery patients: a randomized, double-blind, placebo-controlled study

6. Patient Characteristics and Underlying Heart Disease as Predictors of Recurrent Atrial Fibrillation After Internal and External Cardioversion in Patients Treated with Oral Sotalol

7. Spontaneous Conversion and Maintenance of Sinus Rhythm by Amiodarone in Patients With Heart Failure and Atrial Fibrillation: Observations from the Veterans Affairs Congestive Heart Failure Survival Trial of Antiarrhythmic Therapy (CHF-STAT)

8. Efficacy and safety of sotalol versus quinidine for the maintenance of sinus rhythm after conversion of atrial fibrillation

9. Efficacy of amiodarone for the termination of persistent atrial fibrillation

10. Prospective Comparison of Flecainide Versus Sotalol for Immediate Cardioversion of Atrial Fibrillation

TAS Output

Following is the summary currently generated by TAS for physicians. Notice that not all the input articles contributed to the output summary.

Atrial fibrillation is associated with patient age, hospital stay, increased cost, and mortality rate [1,2,6,7,9].
Multivariate analysis identified coronary artery disease to predict atrial fibrillation [6,7]. Left ventricular ejection fraction, hypertension, diabetes mellitus, smoking were not found to predict atrial fibrillation [1].
Left atrial diameter < 4.0 cm is a predictor for conversion [8,9]. Left atrial size > 60 mm predicts atrial fibrillation [6,7].
Amiodarone and conversion to sinus rhythm are associated [7]. Sex, age, and baseline heart rate are not associated with conversion [9]. Heart failure does not predict conversion to sinus rhythm [7].
Sotalol was associated with decreasing the incidence of atrial fibrillation, and tolerated recurrences [5,8].
In a univariate analysis, coronary artery disease and age predict recurrence [6,7]. Age and atrial fibrillation predict sinus rhythm maintenance [8].


Appendix C

Definitions Example

Shown here are multiple definitions for the term Addison’s disease provided by the system DEFINDER. One can notice the different types of semantic information present in the definitions. Definition 1 describes the symptoms of the disease, while Definitions 2, 3, and 5 mention primarily the cause of the disease. Definition 4 focuses on the demographics of the disease.

1. a degenerative disease that is characterized by weight loss, low blood pressure, extreme weakness, and dark brown pigmentation of the skin.

2. a disease caused by partial or total failure of adrenocortical function, which is characterized by a bronze-like pigmentation of the skin and mucous membranes, anemia, weakness, and low blood pressure.

3. a rare disease that results from a deficiency in adrenocortical hormones.

4. an endocrine disorder that affects about 1 in 100,000 people.

5. a glandular disorder caused by failure of function of the cortex of the adrenal gland and marked by anemia and prostration with brownish skin.


Appendix D

Concept Matching

Following is the algorithm used for matching a given concept against the user model. A concept contains a medical term, identified by its CUI and its semantic type, and a possible associated value. Depending on the concept, several configurations can arise. If the concept is mentioned in the patient record but does not have any associated value, then it is considered matching. Examples of such concepts are diseases (e.g., diabetes). If the concept has a value and is mentioned in the patient record, then it matches only if the value of the concept and the one mentioned in the record are similar. Examples of such concepts are laboratory results (e.g., left ventricular ejection fraction). Finally, some concepts which are not mentioned in the patient record are nevertheless relevant to the patient. For instance, any concept with a semantic type of Body Part should be considered matching. I refer to such semantic types as global.

inputs : ParConcept: a concept from a template parameter
         UserModel: the user model
output: match / noMatch

UMConcept ← lookup(ParConcept.CUI, UserModel);
if UMConcept then
    if (! ParConcept.value) then
        return match;      // the CUIs match and there is no value to compare
    if valuesMatch(ParConcept.value, UMConcept.value) then
        return match;      // the CUIs match as well as their associated values
else
    if isGlobalSemanticType(ParConcept.semantic_type) then
        return match;
end
return noMatch;
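For readers who prefer running code, the following is a hedged Python rendering of the algorithm above; the dictionary-based user model, the values_match placeholder, and the set of global semantic types are assumptions for illustration.

GLOBAL_SEMANTIC_TYPES = {"Body Part"}   # assumed set of global semantic types

def values_match(a, b):
    # Placeholder for the similarity test between two values (assumption).
    return a == b

def concept_matches(par_concept, user_model):
    """par_concept: dict with 'cui', 'semantic_type', and an optional 'value';
    user_model: dict mapping CUIs from the patient record to their recorded
    values (None when no value was recorded). Mirrors the pseudocode above."""
    if par_concept["cui"] in user_model:
        if par_concept.get("value") is None:
            return True   # the CUIs match and there is no value to compare
        return values_match(par_concept["value"], user_model[par_concept["cui"]])
    # CUI absent from the record: match only globally relevant semantic types.
    return par_concept["semantic_type"] in GLOBAL_SEMANTIC_TYPES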


Appendix E

Technical/Lay Text Example

Here is the abstract of a clinical study published in the journal Lancet, as an example of a technical text. The second text is a summary of the same study published by ReutersHealth; it is targeted to a lay audience. The sentence numbers are given in brackets. A set of manually identified sentence pairs that convey similar information is then provided as an example of sentence alignment.

Technical Version from the Lancet

[1] The combination of fibrinolytic therapy and heparin for acute myocardial infarction fails to achieve reperfusion in 40-70% of patients, and early reocclusion occurs in a substantial number. [2] We did a randomised, open-label trial to compare the thrombin-specific anticoagulant, bivalirudin, with heparin in patients undergoing fibrinolysis with streptokinase for acute myocardial infarction.

[3] 17073 patients with acute ST-elevation myocardial infarction were randomly assigned an intravenous bolus and 48-h infusion of either bivalirudin (n=8516) or heparin (n=8557), together with a standard 1.5 million unit dose of streptokinase given directly after the antithrombotic bolus. [4] The primary endpoint was 30-day mortality. [5] Secondary endpoints included reinfarction within 96 h and bleeding. [6] Strokes and reinfarctions were adjudicated by independent committees who were unaware of treatment allocation. [7] Analysis was by intention to treat.

[8] By 30 days, 919 patients (10.8%) in the bivalirudin group and 931 (10.9%) in the heparin group had died (odds ratio 0.99 [95% CI 0.90-1.09], P=0.85). [9] The mortality rates adjusted for baseline risk factors were 10.5% for bivalirudin and 10.9% for heparin (0.96 [0.86-1.07], P=0.46). [10] There were significantly fewer reinfarctions within 96 h in the bivalirudin group than in the heparin group (0.70 [0.56-0.87], P=0.001). [11] Severe bleeding occurred in 58 patients (0.7%) in the bivalirudin group versus 40 patients (0.5%) in the heparin group (p=0.07), and intracerebral bleeding occurred in 47 (0.6%) versus 32 (0.4%), respectively (p=0.09). [12] The rates of moderate and mild bleeding were significantly higher in the bivalirudin group than the heparin group (1.32 [1.00-1.74], P=0.05; and 1.47 [1.34-1.62], p < 0.0001; respectively). [13] Transfusions were given to 118 patients (1.4%) in the bivalirudin group versus 95 patients (1.1%) in the heparin group (1.25 [0.95-1.64], P=0.11).

[14] Bivalirudin did not reduce mortality compared with unfractionated heparin, but did reduce the rate of adjudicated reinfarction within 96 h by 30%. [15] Small absolute increases were seen in mild and moderate bleeding in patients given bivalirudin. [16] Bivalirudin is a new anticoagulant treatment option in patients with acute myocardial infarction treated with streptokinase.

Lay Version from ReutersHealth

[1] Heart attack patients treated with the blood-thinning drug bivalirudin were 30% less likely to have a second heart attack than those treated with the more traditional anticoagulant heparin, according to the results of a new study.

[2] Lead investigator Dr. Harvey D. White of Green Lane Hospital in Auckland, New Zealand, and colleagues report that bivalirudin (Angiomax) should be considered as a new treatment option for heart attack patients treated with streptokinase, another drug used to dissolve blood clots.

[3] The large international study, funded by bivalirudin’s manufacturer, The Medicines Company, enlisted more than 17,000 patients in 539 centers across 46 countries. [4] Patients received either bivalirudin or heparin. [5] Both sets of patients also received aspirin and streptokinase.

[6] Bivalirudin was 30% more effective at reducing recurrent heart attack than heparin, translating to eight fewer heart attacks within 30 days for every 1,000 treated patients, the researchers report in the December 1st issue of The Lancet.

[7] The rate of death at 30 days after the initial heart attack was the same for both groups of patients and “small absolute increases were seen in mild and moderate bleeding in patients given bivalirudin,” White and colleagues note.

[8] “Although there was no difference in the mortality rate, the 30% early reduction (within 96 hours) of recurrent heart attacks is impressive,” Dr. Sidney Smith, chief science officer of the American Heart Association and professor of medicine at the University of North Carolina, Chapel Hill, said in an interview with Reuters Health.

[9] “This early reduction of recurrent heart attacks without an increased risk of major bleeding may contribute to the use of (additional treatments) in the further management of these patients,” Smith added.


Manually Aligned Sentence Pairs

· Heart attack patients treated with the blood-thinning drug bivalirudin were 30% less likely to have a second heart attack than those treated with the more traditional anticoagulant heparin, according to the results of a new study.
· Bivalirudin did not reduce mortality compared with unfractionated heparin, but did reduce the rate of adjudicated reinfarction within 96 h by 30%.

· Lead investigator Dr. Harvey D. White of Green Lane Hospital in Auckland, New Zealand, and colleagues report that bivalirudin (Angiomax) should be considered as a new treatment option for heart attack patients treated with streptokinase, another drug used to dissolve blood clots.
· Bivalirudin is a new anticoagulant treatment option in patients with acute myocardial infarction treated with streptokinase.

· The large international study, funded by bivalirudin’s manufacturer, The Medicines Company, enlisted more than 17,000 patients in 539 centers across 46 countries.
· 17073 patients with acute ST-elevation myocardial infarction were randomly assigned an intravenous bolus and 48-h infusion of either bivalirudin (n=8516) or heparin (n=8557), together with a standard 1.5 million unit dose of streptokinase given directly after the antithrombotic bolus.

· Patients received either bivalirudin or heparin.
· 17073 patients with acute ST-elevation myocardial infarction were randomly assigned an intravenous bolus and 48-h infusion of either bivalirudin (n=8516) or heparin (n=8557), together with a standard 1.5 million unit dose of streptokinase given directly after the antithrombotic bolus.

· Both sets of patients also received aspirin and streptokinase.
· 17073 patients with acute ST-elevation myocardial infarction were randomly assigned an intravenous bolus and 48-h infusion of either bivalirudin (n=8516) or heparin (n=8557), together with a standard 1.5 million unit dose of streptokinase given directly after the antithrombotic bolus.

· Bivalirudin was 30% more effective at reducing recurrent heart attack than heparin, translating to eight fewer heart attacks within 30 days for every 1,000 treated patients, the researchers report in the December 1st issue of The Lancet.
· Bivalirudin did not reduce mortality compared with unfractionated heparin, but did reduce the rate of adjudicated reinfarction within 96 h by 30%.

· The rate of death at 30 days after the initial heart attack was the same for both groups of patients and “small absolute increases were seen in mild and moderate bleeding in patients given bivalirudin,” White and colleagues note.
· Small absolute increases were seen in mild and moderate bleeding in patients given bivalirudin.


Appendix F

Technical/Lay Sentences Examples

· Multivariate analysis revealed that history of hypercholesterolemia, history of smoking and diabetes were independently associated with premature CHD.
· High cholesterol was a risk factor for acute coronary syndrome.

· Elevated PP is a powerful independent predictor of cardiovascular end points in the elderly.
· Now investigators report that pulse pressure, measured as the difference between systolic and diastolic pressure, independently predicts whether a patient will develop heart disease.

· Alcohol use was inversely associated with risk of CHD in men with type 2 diabetes.
· Heart disease risk proved to be inversely associated with alcohol use.

· In univariate analyses adjusted for age and in multivariate analyses adjusted for age, total cholesterol, and triglycerides, the values for apoB and apoB/apoA-I ratio were strongly and positively related to increased risk of fatal myocardial infarction in men and in women.
· The investigators found that levels of apoB and the ratio of apoB to apoA-1 were strongly related to the risk of fatal heart attack.

· The age-adjusted RRs corresponding to intakes of < 0.5 drinks/day, 0.5 to 2 drinks/day and > 2 drinks/day were 0.76 (95% confidence interval [CI]: 0.52 to 1.12), 0.64 (95% CI: 0.40 to 1.02) and 0.59 (95% CI: 0.32 to 1.09), respectively, as compared with nondrinkers (p for TREND=0.06).
· Compared with nondrinkers, men who consumed half a drink or less per day cut their heart disease risk by 24%, while those who drank one-half to two drinks daily cut their risk by 36%.

· With adjustment for age, treatment assignment, smoking, alcohol intake, history of angina, and parental history of myocardial infarction, the relative risks of total stroke associated with vigorous exercise less than 1 time, 1 time, 2 to 4 times, and 5 times per week at baseline were 1.00 (referent), 0.79 (95% confidence interval [CI], 0.61 to 1.03), 0.80 (95% CI, 0.65 to 0.99), and 0.79 (95% CI, 0.61 to 1.03), respectively; P for trend = 0.04.
· They report that, compared with non-exercisers, relative stroke risk declined 21% among men who exercised “vigorously” – defined as workouts in which the physician broke out in a sweat – at least once per week.


References

Abney, Steven. 1996. Partial parsing via finite-state cascades. Journal of Natural Language Engineering, 2(4):337–344.

Barzilay, Regina and Noemie Elhadad. 2003. Sentence alignment for monolingual comparable corpora. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP'03), pages 25–32.

Barzilay, Regina, Noemie Elhadad, and Kathleen McKeown. 2002. Inferring strategies for sentence ordering in multidocument news summarization. Journal of Artificial Intelligence Research, 17:35–55.

Barzilay, Regina and Lillian Lee. 2003. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL'03), pages 16–23.

Barzilay, Regina, Kathleen McKeown, and Michael Elhadad. 1999. Information fusion in the context of multi-document summarization. In Proceedings of the Meeting of the Association for Computational Linguistics (ACL'99), pages 550–557.

Becher, Margit, Brigitte Endres-Niggemeyer, and Gerrit Fichtner. 2002. Scenario forms for web information seeking and summarizing in bone marrow transplantation. In Proceedings of the COLING Workshop on Multilingual Summarization and Question Answering.

Binsted, Kim, Alison Cawsey, and Ray Jones. 1995. Generating personalised patient information using the medical record. In Proceedings of Artificial Intelligence in Medicine Europe.

Carenini, Giuseppe, Vibhu Mittal, and Johanna Moore. 1994. Generating patient specific interactive explanations. In Symposium on Computer Applications in Medical Care.

Carroll, John, Guido Minnen, Darren Pearce, Yvonne Canning, Siobhan Devlin, and John Tait. 1999. Simplifying text for language-impaired readers. In Proceedings of the European Chapter of the Association for Computational Linguistics (EACL'99).

Cawsey, Alison, Ray Jones, and Janne Pearson. 2000. The evaluation of a personalised information system for patients with cancer. User Modeling and User-Adapted Interaction, 10(1):47–72.

Chandrasekar, Raman and Srinivas Bangalore. 1997. Automatic induction of rules for text simplification. Knowledge-Based Systems, 10(3):183–190.


Duboue, Pablo and Kathleen McKeown. 2001. Empirically estimating order constraints for content planning in generation. In Proceedings of the European Chapter of the Association for Computational Linguistics (EACL'01), pages 172–179.

Ebadollahi, Shahram, Shih-Fu Chang, Henry Wu, and Shin Takoma. 2001. Indexing and summarization of echocardiogram videos. In Scientific Session of the American College of Cardiology.

Edwards, Kari and Edward Smith. 1996. A disconfirmation bias in the evaluation of arguments. Journal of Personality and Social Psychology, 71:5–24.

Elhadad, Michael. 1993. Using argumentation to control lexical choice: a unification-based implementation. Ph.D. thesis, Columbia University, Dept. of Computer Science.

Elhadad, Noemie and Kathleen McKeown. 2001. Towards generating patient specific summaries of medical articles. In Proceedings of the NAACL Workshop on Automatic Summarization.

Gerrig, Richard, Jennifer Kuczmarski, and Susan Brennan. 1999. Perspective effects on readers' text representations. In NSF HCI Program Grantees' Workshop.

Goldstein, Jade, Vibhu Mittal, Mark Kantrowitz, and Jaime Carbonell. 1999. Summarizing text documents: Sentence selection and evaluation metrics. In Proceedings of the Conference on Research and Development in Information Retrieval (SIGIR'99).

Green, Noah, Panagiotis Ipeirotis, and Luis Gravano. 2001. SDLIP + STARTS = SDARTS: A protocol and toolkit for metasearching. In Proceedings of the Joint Conference on Digital Libraries (JCDL'01), pages 207–214.

Hatzivassiloglou, Vasileios, Judith Klavans, Melissa Holcombe, Regina Barzilay, Min-Yen Kan, and Kathleen McKeown. 2001. SimFinder: A flexible clustering tool for summarization. In Proceedings of the NAACL Workshop on Automatic Summarization.

Hovy, Eduard. 1988. Two types of planning in language generation. In Proceedings of the Meeting of the Association for Computational Linguistics (ACL'88), pages 170–186.

Hripcsak, George, James Cimino, and Soumitra Sengupta. 1999. WebCIS: Large scale deployment of a web-based clinical information system. In Proceedings of the AMIA Symposium.

Inui, Kentaro and Ulf Hermjakob, editors. 2003. The Second International Workshop on Paraphrasing. In conjunction with ACL'03.

Jing, Hongyan. 2002. Using Hidden Markov Modeling to decompose human-written summaries. Computational Linguistics, 28(4):527–543.


Kan, Min-Yen. 2003. Automatic text summarization as applied to information retrieval: Using indicative and informative summaries. Ph.D. thesis, Columbia University. Chapters 3-4.

Kan, Min-Yen and Kathleen McKeown. 2002. Corpus-trained text generation for summarization. In Proceedings of the International Natural Language Generation Conference (INLG'02), pages 1–8.

Kintsch, Walter. 1988. The role of knowledge in discourse comprehension: A construction-integration model. Psychological Review, 95(2):163–182.

Klare, George. 1963. The Measurement of Readability. Iowa State University Press.

Lapata, Mirella. 2003. Probabilistic text structuring: Experiments with sentence ordering. In Proceedings of the Meeting of the Association for Computational Linguistics (ACL'03), pages 545–552.

Lennox, Scott, Liesl Osman, Ehud Reiter, Roma Robertson, James Friend, Ian MacCann, Diane Skatun, and Peter Donnan. 2001. Cost effectiveness of computer tailored and non-tailored smoking cessation letters in general practice: randomised controlled trial. British Medical Journal, 322(7299):1396–1400.

Lin, Chin-Yew and Eduard Hovy. 2002. From single to multi-document summarization: A prototype system and its evaluation. In Proceedings of the Meeting of the Association for Computational Linguistics (ACL'02), pages 457–464.

Lok, Simon and Steven Feiner. 2002. The AIL automated interface layout system. In Proceedings of the International Conference on Intelligent User Interfaces (IUI'02).

McCray, Alexa and Olivier Bodenreider. 2002. A conceptual framework for the biomedical domain. In Rebecca Green, Carol Bean, and Sung Myaeng, editors, The Semantics of Relationships: An interdisciplinary perspective. Kluwer Academic Publishers, pages 181–198.

Mendonca, Eneida, James Cimino, Stephen Johnson, and Yoon-Ho Seol. 2001. Accessing heterogeneous sources of evidence to answer clinical questions. Journal of Biomedical Informatics, 34(2):85–98.

Muresan, Smaranda and Judith Klavans. 2002. A method for automatically building and evaluating dictionary resources. In Proceedings of the Language Resources and Evaluation Conference (LREC'02).

National Library of Medicine, 1995. Unified Medical Language System (UMLS) Knowledge Sources. Bethesda, Maryland. http://www.nlm.nih.gov/research/umls/.


Osman, Liesl, M. Adballa, J. Beattie, S. Ross, I. Russell, J. Friend, J. Legge, and J. Graham Douglas. 1994. Reducing hospital admission through computer supported education for asthma patients. British Medical Journal, 308(6928):568–571.

Paris, Cecile. 1993. User Modelling in Text Generation. Frances Pinter.

Radev, Dragomir and Kathleen McKeown. 1998. Generating natural language summaries from multiple on-line sources. Computational Linguistics, 24(3):469–500.

Redish, Janice. 2000. Readability formulas have even more limitations than Klare discusses. Journal of Computer Documentation, 24(3):132–137.

Riloff, Ellen. 1993. Automatically constructing a dictionary for information extraction tasks. In Proceedings of the Conference on Artificial Intelligence (AAAI'93), pages 811–816.

Riloff, Ellen. 1996. Automatically generating extraction patterns from untagged text. In Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI'96), pages 1044–1049.

Sparck-Jones, Karen. 1999. Automatic summarizing: Factors and directions. In Inderjeet Mani and Mark Maybury, editors, Advances in Automatic Text Summarization. The MIT Press, pages 1–12.

Teufel, Simone. 2001. Task-based evaluation of summary quality: Describing relationships between scientific papers. In Proceedings of the NAACL Workshop on Automatic Summarization.

Teufel, Simone, Vasileios Hatzivassiloglou, Kathleen McKeown, Kathy Dunn, Desmon Jordan, Sergey Sigelman, and Andre Kushniruk. 2001. Personalized medical article selection using patient record information. In Proceedings of the AMIA Symposium, pages 696–700.
