©2003 Paula Matuszek CSC 9010: Text Mining Applications Document Summarization Dr. Paula Matuszek Paula_A_Matuszek@glaxosmithkline.com (610) 270-6851

©2003 Paula Matuszek

CSC 9010: Text Mining Applications

Document Summarization

Dr. Paula Matuszek

Paula_A_Matuszek@glaxosmithkline.com

(610) 270-6851

Document Summarization

Document summarization
– Provide a meaningful summary for each document.
Examples:
– Search tool returns “context”
– Monthly progress reports from multiple projects
– Summaries of news articles on the human genome
Often part of a document retrieval system, to enable the user to judge documents better.
Surprisingly hard to make sophisticated.
Surprisingly easy to make effective.

Document Summarization -- How

Three general approaches:
Extract predefined summary.
– Useful in highly structured environments where you can specify format. Typically very good summaries.
Capture in abstract representation, generate summary.
– Useful in well-defined domains with clear-cut information needs.
Extract representative sentences/clauses.
– Useful in arbitrarily complex and unstructured domains; broadly applicable, and gets "general feel".

Extract Predefined Summary

Documents have a well-defined format.
Format includes a summary or abstract explicitly written by document author.
Text mining may reorganize, regroup, restructure summaries.
Example:
– People working on multiple projects write monthly reports based on what they have done, one sentence/project.
– Reporting system collects person-level reports and reorganizes into project-level reports.
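
The reorganizing step in the example above can be sketched in a few lines. This is only an illustration, assuming the person-level reports are already available as a mapping from person to one sentence per project; the names `person_reports` and `regroup_by_project` and all of the sample data are hypothetical.

```python
from collections import defaultdict

# Hypothetical person-level monthly reports: one sentence per project.
person_reports = {
    "Alice": {
        "Genome DB": "Loaded the March sequence batch.",
        "Search UI": "Fixed result paging.",
    },
    "Bob": {
        "Genome DB": "Tuned the nightly index job.",
    },
}


def regroup_by_project(reports):
    """Turn {person: {project: sentence}} into {project: [(person, sentence), ...]}."""
    by_project = defaultdict(list)
    for person, entries in reports.items():
        for project, sentence in entries.items():
            by_project[project].append((person, sentence))
    return dict(by_project)


project_reports = regroup_by_project(person_reports)

# Print one project-level report per project.
for project, lines in sorted(project_reports.items()):
    print(project)
    for person, sentence in lines:
        print(f"  {person}: {sentence}")
```

No text mining is needed here because the author supplies the project field explicitly; the work is ordinary data regrouping.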

Extract Predefined Summary: Methods

Extraction using some or all of
– NLP for document parsing/chunking (finding abstract)
– standard computer science: database retrieval, string processing, etc.
Reorganizing may be done using
– explicit fields specified by author
– keywords searched for in documents
– business rules which capture knowledge about who is working on what tasks and projects
Grouping can shade into document classification for long summaries, ill-defined match to categories
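
As a rough illustration of the string-processing and keyword options listed above, the sketch below pulls out an author-written summary and assigns it to projects by keyword. It assumes, purely for the example, that the summary follows a line reading "Abstract" or "Summary" and ends at the first blank line; `extract_abstract`, `assign_projects` and the `PROJECT_KEYWORDS` table are hypothetical names, not part of any particular system.

```python
import re

# Assumed layout: a heading line "Abstract" or "Summary" (optional colon),
# followed by the summary text, terminated by a blank line or end of text.
ABSTRACT_RE = re.compile(
    r"^[ \t]*(?:abstract|summary)[ \t]*:?[ \t]*\n(.+?)(?:\n[ \t]*\n|\Z)",
    re.IGNORECASE | re.MULTILINE | re.DOTALL,
)

# Hypothetical keyword lists standing in for "keywords searched for in documents".
PROJECT_KEYWORDS = {
    "Parser": ["parser", "grammar"],
    "Testing": ["test", "regression"],
}


def extract_abstract(text):
    """Return the author-written summary as one line, or None if none is found."""
    match = ABSTRACT_RE.search(text)
    if match is None:
        return None
    return " ".join(match.group(1).split())


def assign_projects(summary):
    """Return every project whose keywords occur in the summary (substring match)."""
    lowered = summary.lower()
    return [
        project
        for project, keywords in PROJECT_KEYWORDS.items()
        if any(keyword in lowered for keyword in keywords)
    ]


report = """Monthly Report

Summary
Finished the parser.
Started testing.

Details
Longer narrative text goes here.
"""

summary = extract_abstract(report)
print(summary)
print(assign_projects(summary))
```

Substring keyword matching is deliberately crude; once summaries get long or the categories are fuzzy, this step becomes a document classification problem, as the slide notes.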

Extracting Predefined Summaries: Advantages and Disadvantages

Advantages
– Summaries reflect intent of author.
– If part of an overall reporting system can actually make it simpler for author.
– Incremental effort for author not large.
Disadvantages
– Incremental effort for author not zero either.
– Only feasible in structured situation where requirement can be defined ahead of time.
– Can't be used to summarize a group of documents.
– Not all authors write good summaries.

Capture and Generate

Documents can have arbitrary format.
Knowledge needed is well-defined.
Often information need is for summarizations across multiple documents.
Example:
– Summarizing restaurant reviews. Take newspaper articles and produce price range, kind of food, atmosphere, quality, service.

Capture and Generate: Methods

State of the art:
– Create "template" or "frame"
  – Represent the knowledge you want to capture
– Extract information to fill in frame
  – Standard information extraction problem
  – Typically relatively large frames with relatively few relations; mostly facts.
– Generate based on template
  – Relatively simple "fill-in-the-blank"
  – More complex based on parse tree.
Still basically research: parse entire document into parse tree tied to rich semantic net; apply rules to trim tree; generate continuous narrative.
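
A toy version of the frame, fill and generate pipeline, using the restaurant-review example, might look like the sketch below. Everything in it is an assumption made for illustration: the `RestaurantFrame` fields, the tiny `CUISINES` list and the regular expressions stand in for a real information extraction component, which would be far larger.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class RestaurantFrame:
    """Hypothetical frame: the facts we want to capture from a review."""
    name: Optional[str] = None
    cuisine: Optional[str] = None
    price_range: Optional[str] = None


# Illustrative vocabulary only; a real system would use a much richer lexicon.
CUISINES = ("italian", "thai", "french", "mexican")


def fill_frame(review: str) -> RestaurantFrame:
    """Fill the frame with simple pattern matching (a stand-in for real IE)."""
    frame = RestaurantFrame()
    lowered = review.lower()
    frame.cuisine = next((c for c in CUISINES if c in lowered), None)
    price = re.search(r"\$(\d+)\s*(?:-|to)\s*\$(\d+)", review)
    if price:
        frame.price_range = f"${price.group(1)}-${price.group(2)}"
    # Assumes the name appears as capitalized words right after "at".
    name = re.search(r"\bat ([A-Z][\w']*(?: [A-Z][\w']*)*)", review)
    if name:
        frame.name = name.group(1)
    return frame


def generate(frame: RestaurantFrame) -> str:
    """Fill-in-the-blank generation from whichever slots were filled."""
    parts = [frame.name or "This restaurant"]
    if frame.cuisine:
        parts.append(f"serves {frame.cuisine.title()} food")
    else:
        parts.append("serves food of an unstated kind")
    if frame.price_range:
        parts.append(f"in the {frame.price_range} price range")
    return " ".join(parts) + "."


review = (
    "Dinner at Luigi's Kitchen was a treat. "
    "The Italian menu is short, and entrees run $12 to $20."
)
frame = fill_frame(review)
print(frame)
print(generate(frame))
```

Because the frame is independent of any one document, slots filled from several reviews of the same restaurant could be merged before generation, which is how this approach extends to multiple-document summaries.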

Capture and Generate: Advantages and Disadvantages

Advantages:
– Produces very focused summaries.
– Can readily incorporate multiple documents.
– Not dependent on authors.
Disadvantages:
– Assumes information need is clearly defined.
– Information extraction component development time is significant.
– Document parsing slow; probably not real-time.
Comment:
– Makes no attempt to capture author's intent.

Extract Representative Sentence

Document format can be arbitrary.
Document content can also be arbitrary; the information need is not clear-cut.
Summarization consists of text extracted directly from document.
Examples:
– Context returned by Google for each hit
– Google News summaries.

Find Representative Sentences: Method

Typically, choose representative individual terms, then broaden to capture sentence containing terms. The more terms contained, the more important the sentence.
– If in response to a search or other information request, the search terms are representative.
– If no prior query, TF*IDF and other BOW approaches. May use pairs or n-ary groups of words.
May add a layer of rules using position, some specific phrases such as "In summary,".
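
The method above can be sketched as follows. This is a minimal illustration under several assumptions, not a reference implementation: sentences are split naively on end punctuation, the TF*IDF variant uses a smoothed IDF over a small background corpus, the position rule is simply a bonus for the first sentence, and the cue-phrase list and bonus weights are invented for the example. All function names are hypothetical.

```python
import math
import re
from collections import Counter

CUE_PHRASES = ("in summary", "in conclusion")  # illustrative cue phrases only


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def split_sentences(text):
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tfidf_terms(document, corpus, k=5):
    """With no prior query, pick the k terms with the highest TF*IDF."""
    tf = Counter(tokenize(document))
    doc_sets = [set(tokenize(d)) for d in corpus]
    n = len(doc_sets)

    def idf(term):
        df = sum(term in s for s in doc_sets)
        return math.log((1 + n) / (1 + df)) + 1  # smoothed IDF (an assumption)

    # Highest weight first; ties broken alphabetically for determinism.
    return sorted(tf, key=lambda t: (-tf[t] * idf(t), t))[:k]


def rank_sentences(text, terms, cue_bonus=1.0, lead_bonus=0.5):
    """Score = number of distinct terms contained, plus rule-based bonuses."""
    terms = {t.lower() for t in terms}
    scored = []
    for i, sentence in enumerate(split_sentences(text)):
        score = float(len(terms & set(tokenize(sentence))))
        if sentence.lower().startswith(CUE_PHRASES):
            score += cue_bonus   # cue-phrase rule
        if i == 0:
            score += lead_bonus  # position rule: favor the opening sentence
        scored.append((score, i, sentence))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored


def summarize(text, terms, n=2):
    """Return the n best sentences, restored to document order."""
    top = rank_sentences(text, terms)[:n]
    return " ".join(s for _, _, s in sorted(top, key=lambda item: item[1]))


text = (
    "Gene mapping is slow. The weather was nice. "
    "In summary, gene mapping needs faster tools."
)
corpus = [text, "The weather was nice today.", "The tools were slow."]

print(summarize(text, ["gene", "mapping"], n=1))       # query-driven
print(summarize(text, tfidf_terms(text, corpus, k=2)))  # no prior query
```

When a search query is available its terms are passed in directly; otherwise `tfidf_terms` supplies them. Word pairs or longer n-grams could replace single tokens in `tokenize` without changing the rest of the scoring.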

Find Representative Sentences: Advantages and Disadvantages

Advantages
– Can be applied anywhere.
– Relatively fast (compared to full parse).
– Provides a good general idea or feel for content.
– Can do multiple-document summaries.
Disadvantages
– Often choppy or hard to read.
– Does poorly when document doesn't contain good summary sentences.
– Can miss major information.

Summary

Appropriate approach depends on what is known about the documents, the domain, and the information need.

All of the major approaches in use provide useful information in a reasonable time frame.

None of the automated methods is yet close to a good human summarizer. Research in this area is advancing fast, though.

Some Useful References

This has been a seriously simplified presentation; I am focusing mostly on applications. Here are some references for more detail:

http://www.cs.unm.edu/~storm/TSPresent.html. Detailed overview of text summarization history, methods and current state.

http://www.summarization.com/. Bibliography, tools, conferences, research. Some good resources.

http://clg.wlv.ac.uk/help/summarisation.php. Relatively simple overview with some good links.

http://citeseer.nj.nec.com/525002.html. Paper on summarization using GATE.