Hercules Dalianis page 1 Automatic text summarization Hercules Dalianis NADA-KTH Royal Institute of...
-
date post
21-Dec-2015 -
Category
Documents
-
view
215 -
download
1
Transcript of Hercules Dalianis page 1 Automatic text summarization Hercules Dalianis NADA-KTH Royal Institute of...
Hercules Dalianis page 1
Automatic text summarizationAutomatic text summarization
Hercules DalianisNADA-KTH
Royal Institute of Technology100 44 Stockholm
ph: +46-8-790 91 05mobile: +46 70 568 13 59email: [email protected]
Hercules Dalianis page 2
Overview of talkOverview of talk
• Background
• Other summarizers
• Technique
• Future improvements
• Applications
• Evaluation
Hercules Dalianis page 3
Automatic text summarizationAutomatic text summarization• Automatic text summarization is the method
where a computer summarizes a text. • A text is given to the computer and it returns
an non-redundant shorter text- An extract from a longer original text.
• The technique has it’s roots in the 60’s.• With the Internet and the WWW it has been
an awakening interest in summarization techniques.
Hercules Dalianis page 4
Summarization toolsSummarization toolshttp://www.nada.kth.se/~hercules/HDbookmarks.htm
http://www.ics.mq.edu.au/~swan/summarization/projects_full.htm
• Microsoft Word 97, 98 and Word 2000 have a summarizer for documents.
• Intelligent Miner for Text -Summarization tool IBM
• Inxight (XEROX) • Datahammer (Glucose Development
Corporation)
Hercules Dalianis page 5
• Corporum Summarizer- Cognit AS (Norway)• Pertinence (France)• Copernic Summarizer• MuST Prototype• Automated Text Summarization
(SUMMARIST) • Columbia Newsblaster
http://www1.cs.columbia.edu/nlp/newsblaster
• OracleContext • Autonomy
Hercules Dalianis page 6
What is Automatic summarization What is Automatic summarization good for?good for?
• News paper setting and printing– Sydsvenska Dagbladet, Bergens Tidene
• Summarize Scientific texts – Danmarks Elektroniske Forskningsbibliotek
• Telephone systems– Read summarized news synthetically
Hercules Dalianis page 7
• Search engines – to summarize documents for hitlist c.f. Google,
SiteSeeker.
• NewsAgent - Business Intelligence– TDT Topic Detection Tracking and
Columbia Newsblaster
Hercules Dalianis page 11
Summarization approachesSummarization approaches• Extraction vs. Abstraction• Generic vs. Query based• Indicative vs. Informative• Restricted vs. Unrestricted domain• Background information vs. New information (TDT)• Single-document vs. Multiple-document• Monolingual vs. Multilingual• Textual vs. Multimedia
Hercules Dalianis page 12
Text summarizationText summarization
• Extraction is much easier than abstraction• Abstraction needs understanding and rewriting
Hercules Dalianis page 13
TechiquesTechiques
• Find what the text is about
• Then decide what so say
• Then decide how to say it• Text summarization (extraction) uses statistic,
linguistic and heuristic methods
Hercules Dalianis page 14
TechiquesTechiques
• A text is divided into sentences• Sentence positions (News/Reports) • Title words• Bold text, Numerical values, Citations• Named Entities (Frequence based)• Keyword frequency and extraction (nouns,
adverbs, adjectives) • Use morphological information-lemma
Hercules Dalianis page 15
Key word lexiconKey word lexicon
• Key words in news domain• Also called "open class word lexicon”• Key words can be noun, adjectives or adverbs
Hercules Dalianis page 16
• Word which are present in all other sentences.
• User adaptation– Use user keywords - Obtain slanted summaries
• Combination function of all rankings with different weights gives the rank of each sentence.
• Generate all high ranking sentences
• Voilá the summary !
Hercules Dalianis page 17
SweSumSweSum• The first text summarizer for Swedish• Summarizes Swedish news paper text in
HTML/text format on the WWW.• Uses a Swedish key word lexicon that
contains 40 000 words and their possible 700 000 inflections.
• During the text summarization are 5-10 key words produced which describes or categorizes the text -
• Key words - A miniature summary.
Hercules Dalianis page 18
The Swedish keyword lexiconThe Swedish keyword lexicon700 000 words 40 000 words
Inflected version Lemma
statsminister statsminister
statsministern statsminister
statsministerns statsminister
statsministrarna statsminister
statsministrarnas statsminister
.. ...
regeringen regeringen
regeringens regeringen
regeringarna regeringen
regeringarnas regeringen
... ....
Hercules Dalianis page 20
SweSumSweSum
• SweSum is available to summarize news texts on Swedish, Danish, Norwegian, English, Spanish, French, German and in Farsi (Iranian).
Hercules Dalianis page 24
ProblemsProblems
• Pronoun and other anafora referenser• Kalle sprang. Han sprang fort.• Pronoun resolution
• Clauses can be too long or too short• Clause reductions- and clause combination rules• Aggregation
Hercules Dalianis page 25
SweSumSweSum without without PRMPRM
Analysera mera!
Regi: Harold Ramis
Medv: Robert De Niro, Billy Crystal, Lisa Kudrow
Längd: 1 tim, 45 min
…
Ett av många skäl att glädjas åt Analysera mera är att Robert De Niro här verkligen utövar skådespelarkonst igen. Han accelererar emotionellt från 0 till 100 på ingen tid alls, för att sedan kattmjukt bromsa in och parkera, lugnt och behärskat. Och han är tämligen oemotståndlig. Här har han åstadkommit ännu en intelligent komedi för alla oss vänner av intelligens och komedi, gärna i kombination.
SvD 99-10-08
Hercules Dalianis page 26
SweSum with PRMSweSum with PRM
Analysera mera!
Regi: Harold Ramis
Medv: Robert De Niro, Billy Crystal, Lisa Kudrow
Längd: 1 tim, 45 min
…
Ett av många skäl att glädjas åt Analysera mera är att Robert De Niro här verkligen utövar skådespelarkonst igen. Robert accelererar emotionellt från 0 till 100 på ingen tid alls, för att sedan kattmjukt bromsa in och parkera, lugnt och behärskat. Och Robert är tämligen oemotståndlig. Här har Harold åstadkommit ännu en intelligent komedi för alla oss vänner av intelligens och komedi, gärna i kombination.
SvD 99-10-08
Hercules Dalianis page 27
EvaluationEvaluation
• We found that if one summarizes the text to 30 percent of original length one will obtain around 70-80 percent accuracy on 3-4 pages news articles.
• .. but query based evaluations are based on subjective opinions
• These evaluation need large human effort• Small overlap of opinions• We need man-made extracts to compare the machine
made extracts automatically
Hercules Dalianis page 28
• There are some man-made extracts for English news texts.
• We had to create such extract for Swedish news text.
• We created KTH Extract Corpus- Corpus created manually once by voting
• Then one can compare the texts from SweSum and KTH Extract Corpus manually or soon automatically
Hercules Dalianis page 29
KTH extract corpusKTH extract corpus
• http://www.nada.kth.se/iplab/hlt/kthxc/showsumstats.php
and
• http://www.nada.kth.se/iplab/hlt/kthxc/
• Visa celltexten
Hercules Dalianis page 30
• http://www.nada.kth.se/iplab/hlt/kthxc/showsumstats.php?cutoff=30&fileid=svenska-%3Etest-%3Etext001.htm
Hercules Dalianis page 31
Future improvements of SweSumFuture improvements of SweSum
• Tagging instead of static lexicons
• Clause level summarization
• Improved Named Entity recognition
• Improved Pronominal Resolution
• Lexical chains using SIMPLE and/or EuroWordNet
• Automatic evaluation method
Hercules Dalianis page 32
DemonstratorsDemonstrators• SweSum – Standard version
http://swesum.nada.kth.se/index-eng.html
• SweSum – Experimental NE versionhttp://www.nada.kth.se/~xmartin/swesum_lab/index-eng.html
(SweSum uses a Perl-CGI script, there is also a standalone version for plain text/html)