User oriented text summarization
Transcript of User oriented text summarization
BY
Asef Poormasoomi
Motivation
Summaries which are generic in nature do not cater to the user's background and interests.
Results show that each person has a different perspective on the same text.
So a good summary should change in accordance with the preferences of its reader.
Motivation
Marcu (1997): found that the percent agreement of 13 judges over 5 texts from Scientific American was 71 percent.
Rath (1961): found that extracts selected by four different human judges had only 25 percent overlap.
Salton (1997): found that the 20 most important paragraphs extracted by 2 subjects had only 46 percent overlap.
Users' Feedback
Query history: the most widely used implicit user feedback at present (e.g. http://www.google.com/psearch).
Click data: when a user clicks on a document, the document is considered to be of more interest to the user than other unclicked ones.
Attention time: often referred to as display time or reading time.
Other types of implicit user feedback include display time, scrolling, annotation, bookmarking and printing behaviors.
ARTICLE1:
Generating Personalized Summaries Using Publicly Available Web Documents
2008 IEEE, Chandan Kumar, Prasad Pingali, Vasudeva Varma
The personal information of the user is extracted using information available on the web.
Generic Sentence Scoring
In general: compute the probability distribution p(w|D) over the words w appearing in the input D.
For each sentence S in the input, assign a weight equal to the average probability of the words in the sentence.
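Under the standard frequency-based assumption p(w|D) = count(w) / total words, the scoring above can be sketched as follows (function and variable names are illustrative):

```python
from collections import Counter

def sentence_scores(sentences):
    """Score each sentence by the average unigram probability p(w|D) of its
    words, with p(w|D) = count(w) / total words in the input."""
    words = [w.lower() for s in sentences for w in s.split()]
    counts = Counter(words)
    total = sum(counts.values())
    p = {w: c / total for w, c in counts.items()}
    return [sum(p[w.lower()] for w in s.split()) / len(s.split())
            for s in sentences]
```

Sentences whose words dominate the input thus receive higher weights.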
Estimating the User Background Model
A search engine is used to extract the personal information of the user from information available on the web: put the person's full name into a search engine (the name quoted with double quotation marks, such as "Albert Einstein"). The n top documents are retrieved. After removal of stop words and stemming, a unigram language model is learned on the extracted text content. This model can be interpreted as the probability p(w|U) of a word w being related to the person's profile U.
User Specific Sentence Scoring
The term probability of the document set D, p(w|D), and of the user profile U, p(w|U), are merged using a linear weighted combination; the score of a sentence S for user u is computed from this combination.
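A minimal sketch of such a linear combination, assuming an interpolation weight alpha (the paper's exact weighting is not reproduced here):

```python
def personalized_score(sentence, p_doc, p_user, alpha=0.5):
    """Average over the sentence's words of a linear interpolation between
    the document model p(w|D) and the user profile model p(w|U); alpha is
    an assumed interpolation weight."""
    toks = [w.lower() for w in sentence.split()]
    return sum(alpha * p_doc.get(w, 0.0) + (1 - alpha) * p_user.get(w, 0.0)
               for w in toks) / len(toks)
```

Words that are probable under the user's profile pull the sentence score up even when they are rare in the document set.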
After sentence scoring, redundancy is eliminated: for redundancy identification, the number of terms overlapping between the already generated summary and the new sentence being considered is used as the measure.
Sentences are arranged based on chronological ordering (between documents, i.e. based on the time stamp) and order of occurrence (within a document).
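The term-overlap check can be sketched as below; the overlap ratio and the threshold value are assumptions, since the paper only states that term overlap is the measure:

```python
def is_redundant(candidate, summary, threshold=0.5):
    """Flag a candidate sentence whose terms overlap too heavily with the
    summary built so far. The overlap ratio and threshold are assumed."""
    cand = set(candidate.lower().split())
    seen = {w for s in summary for w in s.lower().split()}
    if not cand:
        return True
    return len(cand & seen) / len(cand) > threshold
```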
Example: The topic of summary generation is "Microsoft to open research lab in India". 8 articles published in different news sources form the news cluster. In this example we show the condensed summary (100 words) for two users: User A is from the NLP domain and User B from the network security domain.
The italic text in the user-specific summaries shows the difference compared to the generic summary.
Generic summary:
The New Lab, Called Microsoft Research India, Goes Online In January, And Will Be Part Of A Network Of Five Research Labs That Microsoft Runs Worldwide, Said Padmanabhan Anandan, Managing Director Of Microsoft Research India. Microsoft’s Mission India, Formally Inaugurated Jan. 12, 2005, Is Microsoft’s Third Basic Research Facility Established Outside The United States . In Line With Microsoft’s Research Strategy Worldwide , The Bangalore Lab Will Collaborate With And Fund Research At Key Educational Institutions In India, Such As The Indian Institutes Of Technology, Anandan Said . Although Microsoft Research Doesn’t Engage In Product Development Itself, Technologies Researchers Create Can Make Their Way Into The Products The Company
User A Specific summary :
The New Lab, Called Microsoft Research India, Goes Online In January, And Will Be Part Of A Network Of Five Research Labs That Microsoft Runs Worldwide, Said Padmanabhan Anandan, Managing Director Of Microsoft Research India. Microsoft's Mission India, Formally Inaugurated Jan. 12, 2005, Is Microsoft's Third Basic Research Facility Established Outside The United States. Microsoft Will Collaborate With The Government Of India And The Indian Scientific Community To Conduct Research In Indic Language Computing Technologies, This Will Include Areas Such As Machine Translation Between Indian Languages And English, Search And Browsing And Character Recognition. In Line With Microsoft's Research Strategy Worldwide, The Bangalore Lab
User B Specific summary :
The New Lab, Called Microsoft Research India, Goes Online In January, And Will Be Part Of A Network Of Five Research Labs That Microsoft Runs Worldwide, Said Padmanabhan Anandan , Managing Director Of Microsoft Research India. The Newly Announced India Research Group Focuses On Cryptography, Security, Algorithms And Multimedia Security, Ramarathnam Venkatesan, A Leading Cryptographer At Microsoft Research In Redmond, Washington, In The US, Will Head The New Group. Microsoft Research India will conduct a four-week summer school featuring lectures by leading experts in the fields of cryptography, algorithms and security. The program is aimed at senior undergraduate students, graduate students and faculty
Evaluation
The evaluation of this technique was carried out with five research scholars working in different fields of computer science.
News articles from the science and technology domain were considered for summarization. 25 different topics were chosen, with each topic having 5-10 articles.
Each researcher was asked to judge the relevance of both versions of the summaries for all 25 topics (1-5 score).
Results show that users prefer profile-based personalized summaries to the generic summary produced by a general automatic summarization system.
Evaluation
This figure shows the scores given by a particular user across different topics. For most of the topics the user found the personalized summaries relevant. Personalized summaries for topics strongly related to the user's domain are more relevant to him. For topics which are not closely related to the user's field, the personalized and generic summaries are quite similar. For a few rare topics the user did not find the personalized summary better.
Evaluation
ARTICLE2:
User-oriented Document Summarization Through Vision-based Eye-tracking
2009 ACM, Songhua Xu, Hao Jiang, Francis C.M. Lau
MAIN IDEA
The key idea is to rely on the attention (reading) time individual users spend on single words in a document. The prediction of user attention over every word in a document is based on the user's attention during his previous reads. The algorithm tracks a user's attention times over individual words using a vision-based commodity eye-tracking mechanism. The user's attention time over any arbitrary word is predicted by a data mining process.
A simple web camera and an existing eye-tracking algorithm (the Opengazer project) are used.
The error of the detected gaze location on the screen is between 1-2 cm, depending on which area of the screen the user is looking at (on a 19" monitor).
Anchoring Gaze Samples onto Individual Words
The detected gaze central point is positioned at (x, y) in screen space. For each word i, compute the central display point of the word, denoted (x_i, y_i). The average width and height of a word's display bounding box in the document normalize the horizontal and vertical distances.
For each gaze sample detected by the eye-tracking module, assign the gaze sample fractionally to the words in the document in this manner.
The overall attention that a word in the document receives is the sum of all the fractional gaze samples it is assigned in the above process.
During processing, stop words are removed.
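The fractional assignment can be sketched as below. The Gaussian-style distance kernel is an assumption, not the paper's exact formula; what matters is that closer words receive a larger fraction of the sample:

```python
import math

def assign_gaze(gaze_xy, word_centers, avg_w, avg_h):
    """Split one gaze sample into fractions over the words on screen.
    Distances are normalized by the average word-box width and height;
    the exp(-d^2) kernel is an illustrative assumption."""
    gx, gy = gaze_xy
    weights = [math.exp(-(((gx - wx) / avg_w) ** 2 + ((gy - wy) / avg_h) ** 2))
               for wx, wy in word_centers]
    total = sum(weights)
    return [w / total for w in weights]
```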
PREDICTION OF USER ATTENTION OVER A SENTENCE
Attention time prediction for a word is based on the semantic similarity between words. Sim(w_i, w_j) denotes the semantic similarity between word w_i and word w_j, where Sim(w_i, w_j) ∈ [0, 1].
The similarity measure follows the algorithm proposed in: Y. Li, Z. A. Bandar, and D. McLean, "An approach for measuring semantic similarity between words using multiple information sources," IEEE Transactions on Knowledge and Data Engineering.
For an arbitrary word w which is not among the words with observed attention times w_1, …, w_n, calculate the similarity between w and every w_i (i = 1, …, n) and then select the k words which share the highest semantic similarity with w (k is set to min(10, n)).
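A sketch of this prediction step, using a similarity-weighted average over the k most similar sampled words (`sim` stands in for the Li et al. measure; the weighted-average form is an assumption about the paper's data-mining step):

```python
def predict_attention(word, sampled, sim, k=10):
    """Predict a user's attention time for an unseen word from the k most
    similar words with observed attention times, weighting each observed
    time by its similarity to the target word."""
    k = min(k, len(sampled))
    ranked = sorted(sampled.items(), key=lambda kv: sim(word, kv[0]),
                    reverse=True)[:k]
    num = sum(sim(word, w) * t for w, t in ranked)
    den = sum(sim(word, w) for w, t in ranked)
    return num / den if den else 0.0
```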
Predicting User Attention for Sentences
Estimate the total attention of a certain user on a sentence as the weighted sum of the user's attention over all the words in the sentence:
AT(w_i, U_j) is user U_j's attention over the word w_i, which is either sampled from the user's previous reading activities via (1) or predicted via (2).
The per-word weight is 0 if the word w_i is a stop word, 0.6 if there is no attention sample for user U_j over the word w_i, and 1 otherwise.
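The weighted sum above can be sketched as (the toy stop-word list and function names are illustrative):

```python
STOPWORDS = {"the", "a", "an", "of", "in"}  # toy stop-word list

def sentence_attention(sentence, sampled, predict, stopwords=STOPWORDS):
    """Weighted sum of per-word attention: weight 0 for stop words, 1 for
    directly sampled words, 0.6 for words whose attention is only predicted."""
    total = 0.0
    for w in sentence.lower().split():
        if w in stopwords:
            continue                      # weight 0
        if w in sampled:
            total += 1.0 * sampled[w]     # weight 1: sampled via (1)
        else:
            total += 0.6 * predict(w)     # weight 0.6: predicted via (2)
    return total
```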
A Hybrid Summarization Approach
In early experiments, it was noticed that the performance of the user-oriented document summarization algorithm heavily depends on the amount of available user attention time samples.
To address the issue, the new method is integrated with a conventional automatic document summarization algorithm (MEAD).
The MEAD indicator is 1 if sentence s_i is selected by MEAD in its document summarization result, and 0 otherwise.
k is a free, user-tunable parameter.
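One plausible sketch of the hybrid, assuming a simple additive combination of the attention-based score and k times the MEAD indicator (the paper's exact formula is not reproduced here):

```python
def hybrid_rank(sentences, att_scores, mead_picks, k=0.5):
    """Rank sentences by attention-based score plus k times a MEAD-selection
    indicator; the additive combination is an assumption for illustration."""
    scored = [(att_scores[i] + (k if s in mead_picks else 0.0), s)
              for i, s in enumerate(sentences)]
    return [s for _, s in sorted(scored, reverse=True)]
```

Raising k shifts the ranking toward MEAD's generic choices, which compensates when few attention samples exist.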
EXPERIMENT RESULTS
The document summarization results are compared with those generated by two popular text summarization algorithms.
Two sets of articles are used: articles in the first set are all about science (60 articles from "Science" magazine) and articles in the second set are all about entertainment and leisure (sixty articles randomly selected from the travel and sports sections of the "New York Times").
12 people with different knowledge backgrounds read selected articles from the two article sets; they were asked to provide a summary for each article they had just read.
EXPERIMENT RESULTS
To measure the performance, three measurements are introduced: Recall (R), Precision (P) and F-rate (F), computed against the human summary result.
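These follow the standard overlap-based definitions; a sketch over sentence sets (names illustrative, since the paper's notation is condensed here):

```python
def precision_recall_f(auto_summary, human_summary):
    """Standard Precision, Recall and F-rate on the overlap between an
    automatic summary and a human summary, treated as sentence sets."""
    overlap = len(set(auto_summary) & set(human_summary))
    p = overlap / len(auto_summary) if auto_summary else 0.0
    r = overlap / len(human_summary) if human_summary else 0.0
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f
```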
EXPERIMENT RESULTS
EXPERIMENT RESULTS
An experiment to evaluate the performance of the hybrid approach under different settings of the parameter K.
ARTICLE3:
Web-page Summarization Using Clickthrough Data
2005 ACM, Jian-Tao Sun, Dou Shen, Hua-Jun Zeng, Qiang Yang, Yuchang Lu, Zheng Chen
Main Idea
Use the extra knowledge in clickthrough data to improve Web-page summarization. A collection of clickthrough data can be represented by a set of triples <u, q, p>. Typically, a user's query words reflect the true meaning of the target Web-page content.
In the new algorithm, two text-summarization methods are adapted to summarize Web pages. The first approach is based on significant-word selection, adapted from Luhn's method. The second method is based on Latent Semantic Analysis (LSA).
Problems
Web pages may have no associated query words; the clickthrough data are often very noisy.
Solution
A thematic lexicon, built using the annotated hierarchical taxonomy of Web pages such as the one provided by the ODP web site (http://dmoz.org/).
Adapted Significant Word (ASW) Method
Each sentence is assigned a significance factor (based on word frequency) and the sentences with high significance factors are selected to form the summary.
Customized factor:
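A sketch of the base Luhn significance factor the method adapts (the clickthrough-based customization of term weights is omitted; the cluster gap of 4 follows Luhn's original heuristic):

```python
def luhn_significance(words, significant, gap=4):
    """Luhn's significance factor for one sentence: for each cluster of
    significant words separated by at most `gap` positions, score is
    (number of significant words)^2 / cluster length; return the best."""
    idx = [i for i, w in enumerate(words) if w in significant]
    if not idx:
        return 0.0
    best, start = 0.0, 0
    for i in range(1, len(idx) + 1):
        if i == len(idx) or idx[i] - idx[i - 1] > gap:
            cluster = idx[start:i]
            length = cluster[-1] - cluster[0] + 1
            best = max(best, len(cluster) ** 2 / length)
            start = i
    return best
```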
Adapted Latent Semantic Analysis (ALSA) Method
The corpus can be represented by a term-document matrix.
Summarize Web Pages Not Covered by Clickthrough Data
Build a thematic lexicon. TS(c) represents the set of terms associated with category c; the thematic lexicon is the set of all TS, corresponding to the categories in ODP. The lexicon is built as follows:
First, the TS corresponding to each category is set empty.
For each page covered by the clickthrough data, its query words are added into TS. If a page belongs to more than one category, its query terms are added into the TS of all its categories.
At last, each term weight in each TS is multiplied by its Inverse Category Frequency (ICF).
For each Web page that is not covered by the clickthrough data, first look up the lexicon for the TS according to the page's category; then the summarization methods are applied.
When a TS does not have sufficient terms, the TS corresponding to its parent category is used.
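The lexicon-building steps above can be sketched as follows. The exact ICF formula is an assumption, taken by analogy with inverse document frequency:

```python
from collections import defaultdict
from math import log

def build_lexicon(click_data, page_categories, n_categories):
    """Build the thematic lexicon: per-category term weights from query
    frequency, then scaled by an ICF-style factor (assumed IDF analogue)."""
    ts = defaultdict(lambda: defaultdict(float))  # category -> term -> weight
    cat_freq = defaultdict(int)                   # term -> number of categories
    for page, query_terms in click_data.items():
        for cat in page_categories.get(page, []):
            for t in query_terms:
                if t not in ts[cat]:
                    cat_freq[t] += 1
                ts[cat][t] += 1.0
    for cat in ts:                                # apply ICF weighting
        for t in ts[cat]:
            ts[cat][t] *= log(n_categories / cat_freq[t])
    return ts
```

Terms that appear in many categories are down-weighted, so each TS keeps the terms most distinctive for its category.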
EXPERIMENTS
The data set contains about 44.7 million records covering 29 days, from Dec 6, 2003 to Jan 3, 2004 (MSN search engine). 3,074,678 Web pages of the ODP directory were crawled; at last, 1,125,207 Web pages were obtained, 260,763 of which were clicked by Web users using 1,586,472 different queries.
DAT1 consists of 90 pages selected from the browsed pages. Three human evaluators were employed to summarize these pages.
A relatively large-scale data set, denoted DAT2 (10,000 pages), is also used to evaluate the summarization methods.
Summarization Results on DAT1 (ASW)
ROUGE is a software package adopted by DUC for automatic summarization evaluation (http://www.isi.edu/~cyl/ROUGE/).
Summarization Results on DAT1 (ALSA)
Evaluation of the summarization method using the thematic lexicon
The clickthrough data covers only 260,763 pages, and the lexicon contains 141,869 categories, a subset of the ODP category structure.
If the terms under a page's category have more than P% overlap with the distinct terms in the Web page, they are used for summarization; otherwise, the lexicon terms of its parent category are used.
This process continues until a category covering enough query terms is found or the root of the thematic lexicon is reached.
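The parent-category fallback can be sketched as below; the helper names and the default value of P are illustrative, since the paper leaves P as a parameter:

```python
def lexicon_terms(category, page_terms, lexicon, parent, p=20.0):
    """Walk up the category tree until a category's lexicon terms have more
    than p% overlap with the page's distinct terms. `parent` maps each
    category to its parent (None at the root)."""
    page = set(page_terms)
    while category is not None:
        terms = lexicon.get(category, set())
        if terms and 100.0 * len(terms & page) / len(terms) > p:
            return terms
        category = parent.get(category)   # fall back to the parent category
    return set()
```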
Evaluation of the summarization method using the thematic lexicon (ASW)
Evaluation of the summarization method using the thematic lexicon (ALSA)
Thanks