Efficient summarization framework for multi-attribute uncertain data
Efficient summarization framework for multi-attribute
uncertain data
Jie Xu, Dmitri V. Kalashnikov, Sharad Mehrotra
Uncertain Data Set
The Summarization Problem
location (e.g. LA)
face (e.g. Jeff, Kate)
visual concepts (e.g. water, plant, sky)
Extractive
Abstractive
O1
O8
O11
O25
Kate Jeff wedding at LA
O1
O2
On
…
Modeling Information
Summarization Process
What information does this image contain?
Extract the best subset of objects from the dataset as its summary.

Metrics?
- Coverage: Agrawal, WSDM'09; Li, WWW'09; Liu, SDM'09; Sinha, WWW'11
- Diversity: Vee, ICDE'08; Ziegler, WWW'05
- Quality: Sinha, WWW'11
Existing Techniques
Kennedy et al. WWW’08
Simon et al. ICCV’07
Sinha et al. WWW’11
Hu et al. KDD’04
Ly et al. CoRR’11
Inouye et al. SocialCom ’11
Li et al. WWW’09
Liu et al. SDM’09
• Do not consider information in multiple attributes
• Do not deal with uncertain data
image, customer review, doc/micro-blog
Challenges
Design a summarization framework for
Multi-attribute Data
Uncertain/Probabilistic Data.
visual concept
face tags
location, time, event
visual concepts: P(sky) = 0.7, P(people) = 0.9
data processing (e.g. vision analysis)
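Such probabilistic attributes can be represented, for example, as per-attribute maps from values to detection probabilities. A minimal sketch (the attribute names and probabilities here are illustrative, not from the paper):

```python
# Hypothetical uncertain object: each attribute maps candidate values to the
# probability assigned by automated data processing (e.g. vision analysis).
photo = {
    "visual_concepts": {"sky": 0.7, "people": 0.9},
    "face": {"Jeff": 0.8, "Kate": 0.6},
    "location": {"LA": 1.0},  # a certain (deterministic) attribute
}

# Probability that this photo actually contains both 'sky' and 'people',
# assuming the detections are independent:
p_both = photo["visual_concepts"]["sky"] * photo["visual_concepts"]["people"]
```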
Limitations of existing techniques - 1
Existing techniques typically model & summarize a single information dimension
Summarize only information about visual content (Kennedy et al. WWW’08,Simon et al. ICCV’07)
Summarize only information about review content (Hu et al. KDD’04,Ly et al. CoRR’11)
What information is in the image?
{sky}, {plant}, …
{Kate}, {Jeff}
{wedding}
{12/01/2012}
{Los Angeles}
Elemental IU
Is that all?
{Kate, Jeff}, {sky, plant}, …
Intra-attribute IU
Even more information from attributes?
{Kate, LA}
Inter-attribute IU
{Kate, Jeff, wedding}
…
Are all information units interesting?
Is {Sharad, Mike} an interesting intra-attribute IU?
Yes, they often have coffee together and appear frequently in other photos
Are all of the 2^n combinations of people interesting? Shall we select a summary that covers all this information?
Well, probably not! I don’t care about person X and person Y who happen to be together in the photo of this large group.
Is {Liyan, Ling} interesting?
Yes from my perspective, because they are both my close friends
Mine for interesting information units
O1 face: {Jeff, Kate} — T1
O2 face: {Tom} — T2
O3 face: {Jeff, Kate, Tom} — T3
O4 face: {Kate, Tom} — T4
O5 face: {Jeff, Kate} — T5
…
On face: {Jeff, Kate} — Tn

Modified item-set mining algorithm → frequent & correlated IU: {Jeff, Kate}
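The frequency part of the mining step could be sketched as a simple frequent-pair miner over the face-tag sets (hypothetical data and support threshold; the paper's modified algorithm also checks correlation, which is not shown here):

```python
from itertools import combinations
from collections import Counter

# Face-tag sets per object (toy data mirroring the slide).
tagsets = [
    {"Jeff", "Kate"}, {"Tom"}, {"Jeff", "Kate", "Tom"},
    {"Kate", "Tom"}, {"Jeff", "Kate"}, {"Jeff", "Kate"},
]

def frequent_pairs(tagsets, min_support=3):
    """Count every 2-person combination and keep those seen often enough."""
    counts = Counter()
    for tags in tagsets:
        for pair in combinations(sorted(tags), 2):
            counts[pair] += 1
    return {pair for pair, c in counts.items() if c >= min_support}

print(frequent_pairs(tagsets))  # {('Jeff', 'Kate')}
```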
Mine for interesting information units
O1 face: {Jeff, Kate}
O2 face: {Jeff}
O3 face: {Jeff, Kate, Tom}
O4 face: {Kate, Tom}
O5 face: {Jeff, Kate}
…
On face: {Jeff, Kate}

Mine from social context (e.g. Jeff is a friend of Kate, Tom is a close friend of the user) → {Jeff, Kate}, {Tom}
Limitation of existing techniques – 2
Cannot handle probabilistic attributes
Example: P(Jeff) = 0.8 in one object, P(Jeff) = 0.6 in another.
We are not sure whether an object covers an IU in another object.
[Figure: objects 1..n mapped to IUs, with uncertain coverage edges]
Deterministic Coverage Model - Example
Coverage = 8 / 14
[Figure: summary objects covering 8 of the 14 information units in the dataset]
Probabilistic Coverage Model
Cov(S, O) = (expected amount of information covered by S) / (expected amount of total information)
Simplified so it can be computed efficiently: it is computable in polynomial time, and the function is submodular.
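One common way to realize such an expected coverage is to assume each object covers an IU independently with its given probability; a sketch under that assumption (the paper's exact formula may differ):

```python
def p_covered(iu, objects):
    """Probability that at least one object in `objects` contains the IU."""
    p_miss = 1.0
    for obj in objects:  # each object is a map IU -> probability
        p_miss *= 1.0 - obj.get(iu, 0.0)
    return 1.0 - p_miss

def expected_coverage(summary, dataset, ius):
    """Expected IUs covered by the summary over expected IUs in the dataset."""
    covered = sum(p_covered(iu, summary) for iu in ius)
    total = sum(p_covered(iu, dataset) for iu in ius)
    return covered / total if total else 0.0

# Toy example with two uncertain objects.
o1 = {"Jeff": 0.8, "sky": 0.7}
o2 = {"Jeff": 0.6, "plant": 0.5}
dataset = [o1, o2]
ius = ["Jeff", "sky", "plant"]
print(expected_coverage([o1], dataset, ius))
```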
Optimization Problem for summarization
Parameters: dataset O = {o1, o2, …, on}; positive integer K (summary size)
Finding summary with Maximum Expected Coverage is NP-hard.
We developed an efficient greedy algorithm to solve it.
Basic Greedy Algorithm
1. Initialize S = empty set
2. For each object o in O \ S, compute Cov(S ∪ {o}, O)
3. Select o* with the maximum coverage and add it to S
4. If |S| < K, repeat from step 2; otherwise done

Two bottlenecks:
- Computing Cov for a single object is expensive (addressed by object-level optimization)
- Too many Cov computations per iteration (addressed by iteration-level optimization)
Efficiency optimization – Object-level
Reduce the time required to compute the coverage for one object
Instead of directly computing and maximizing coverage in each iteration, compute the gain of adding one object o to summary S:
gain(S, o) = Cov(S ∪ {o}, O) - Cov(S, O)
Updating gain(S, o) is much more efficient.
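The gain-based greedy selection could be sketched as follows, using a simple independent-coverage stand-in for Cov (toy data; not the paper's exact model):

```python
def cov(summary, ius_probs):
    """Expected number of IUs covered; ius_probs[iu] maps object id -> prob."""
    total = 0.0
    for probs in ius_probs.values():
        p_miss = 1.0
        for o in summary:
            p_miss *= 1.0 - probs.get(o, 0.0)
        total += 1.0 - p_miss
    return total

def greedy_summary(objects, ius_probs, k):
    """Pick k objects, each time adding the object with the largest gain."""
    summary = []
    for _ in range(k):
        base = cov(summary, ius_probs)
        best = max((o for o in objects if o not in summary),
                   key=lambda o: cov(summary + [o], ius_probs) - base)
        summary.append(best)
    return summary

ius_probs = {
    "Jeff": {"o1": 0.8, "o2": 0.6},
    "sky": {"o1": 0.7},
    "plant": {"o2": 0.5, "o3": 0.9},
}
print(greedy_summary(["o1", "o2", "o3"], ius_probs, 2))  # ['o1', 'o3']
```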
Submodularity of Coverage
Expected Coverage Cov(S,O) is submodular:
For S ⊆ T ⊆ O and o ∉ T:
Cov(S ∪ {o}, O) - Cov(S, O) ≥ Cov(T ∪ {o}, O) - Cov(T, O)
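The diminishing-returns inequality can be checked numerically on a toy coverage function (a sketch assuming an independent-coverage Cov, not necessarily the paper's exact definition):

```python
def cov(summary, ius_probs):
    """Expected number of IUs covered; ius_probs[iu] maps object id -> prob."""
    total = 0.0
    for probs in ius_probs.values():
        p_miss = 1.0
        for o in summary:
            p_miss *= 1.0 - probs.get(o, 0.0)
        total += 1.0 - p_miss
    return total

ius_probs = {"Jeff": {"o1": 0.8, "o2": 0.6, "o3": 0.3},
             "sky": {"o1": 0.7, "o3": 0.4}}

S = ["o1"]            # S is a subset of T
T = ["o1", "o2"]
o = "o3"
gain_S = cov(S + [o], ius_probs) - cov(S, ius_probs)
gain_T = cov(T + [o], ius_probs) - cov(T, ius_probs)
assert gain_S >= gain_T  # adding o earlier gains at least as much
```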
Efficiency optimization – Iteration-level
Reduce the number of object-level computations (i.e. of gain(S, o)) in each iteration of the greedy process.
While traversing objects in O \ S, we maintain the maximum gain seen so far, gain*, and an upper bound Upper(S, o) on gain(S, o). We can prune an object o if Upper(S, o) < gain*.
By definition and by submodularity, the gain computed for o in an earlier iteration (against a smaller summary) upper-bounds gain(S, o), so Upper(S, o) can be updated in constant time.
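This pruning is close in spirit to lazy greedy evaluation: a stale gain from an earlier iteration serves as the upper bound, and candidates whose bound falls below the best gain seen so far are skipped. A sketch with a toy Cov (the bookkeeping here is an assumption, not the paper's exact algorithm):

```python
def cov(summary, ius_probs):
    """Expected number of IUs covered; ius_probs[iu] maps object id -> prob."""
    total = 0.0
    for probs in ius_probs.values():
        p_miss = 1.0
        for o in summary:
            p_miss *= 1.0 - probs.get(o, 0.0)
        total += 1.0 - p_miss
    return total

def lazy_greedy(objects, ius_probs, k):
    upper = {o: float("inf") for o in objects}  # stale gains = upper bounds
    summary = []
    for _ in range(k):
        best, best_gain = None, -1.0
        base = cov(summary, ius_probs)
        # Visit candidates with the largest upper bound first.
        for o in sorted(set(objects) - set(summary), key=upper.get, reverse=True):
            if upper[o] < best_gain:
                break  # prune: no remaining candidate can beat the best so far
            upper[o] = cov(summary + [o], ius_probs) - base  # refresh the gain
            if upper[o] > best_gain:
                best, best_gain = o, upper[o]
        summary.append(best)
    return summary

ius_probs = {"Jeff": {"o1": 0.8, "o2": 0.6},
             "sky": {"o1": 0.7},
             "plant": {"o2": 0.5, "o3": 0.9}}
print(lazy_greedy(["o1", "o2", "o3"], ius_probs, 2))  # ['o1', 'o3']
```

It returns the same summary as the basic greedy loop while recomputing fewer gains.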
Experiment -- Datasets
Facebook Photo Set: 200 photos uploaded by 10 Facebook users. Attributes: face, event, time, visual concept.
Review Dataset: reviews about 10 hotels from TripAdvisor; each hotel has about 250 reviews on average. Attributes: facets, rating.
Flickr Photo Set: 20,000 photos from Flickr. Attributes: visual concept, event, time.
Experiment – Quality
Experiment – Efficiency
The basic greedy algorithm without optimizations runs for more than 1 minute.
Summary
Developed a new extractive summarization framework for:
- Multi-attribute data
- Uncertain/probabilistic data
Generates high-quality summaries.
Highly efficient.