Efficient summarization framework for multi-attribute uncertain data
Efficient summarization framework for multi-attribute
uncertain data
Jie Xu, Dmitri V. Kalashnikov, Sharad Mehrotra
Uncertain Data Set
The Summarization Problem
location (e.g. LA)
face (e.g. Jeff, Kate)
visual concepts (e.g. water, plant, sky)
Extractive
Abstractive
O1
O8
O11
O25
Kate Jeff wedding at LA
O1
O2
On
…
Modeling Information
Summarization Process
What information does this image contain?
Extract the best subset of objects from the dataset as its summary.

Metrics?
- Coverage: Agrawal, WSDM'09; Li, WWW'09; Liu, SDM'09; Sinha, WWW'11
- Diversity: Vee, ICDE'08; Ziegler, WWW'05
- Quality: Sinha, WWW'11
Existing Techniques
Kennedy et al. WWW’08
Simon et al. ICCV’07
Sinha et al. WWW’11
Hu et al. KDD’04
Ly et al. CoRR’11
Inouye et al. SocialCom ’11
Li et al. WWW’09
Liu et al. SDM’09
• Do not consider information in multiple attributes
• Do not deal with uncertain data
image, customer review, doc/micro-blog
Challenges
Design a summarization framework for
Multi-attribute Data
Uncertain/Probabilistic Data.
visual concept
face tags
location, time, event
visual concepts: P(sky) = 0.7, P(people) = 0.9
data processing (e.g. vision analysis)
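Such probabilistic attributes can be represented, for example, as per-attribute maps from values to detection probabilities. A minimal sketch (the attribute names and probabilities here are illustrative, not from the paper):

```python
# Hypothetical uncertain object: each attribute maps candidate values to the
# probability assigned by automated data processing (e.g. vision analysis).
photo = {
    "visual_concepts": {"sky": 0.7, "people": 0.9},
    "face": {"Jeff": 0.8, "Kate": 0.6},
    "location": {"LA": 1.0},  # a certain (deterministic) attribute
}

# Probability that this photo actually contains both 'sky' and 'people',
# assuming the detections are independent:
p_both = photo["visual_concepts"]["sky"] * photo["visual_concepts"]["people"]
```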
Limitations of existing techniques - 1
Existing techniques typically model & summarize a single information dimension
Summarize only information about visual content (Kennedy et al. WWW’08,Simon et al. ICCV’07)
Summarize only information about review content (Hu et al. KDD’04,Ly et al. CoRR’11)
What information is in the image?
{sky}, {plant}, …
{Kate}, {Jeff}
{wedding}
{12/01/2012}
{Los Angeles}
Elemental IU
Is that all?
{Kate, Jeff}, {sky, plant}, …
Intra-attribute IU
Even more information from attributes?
{Kate, LA}
Inter-attribute IU
{Kate, Jeff, wedding}
…
Are all information units interesting?
Is {Sharad, Mike} an interesting intra-attribute IU?
Yes, they often have coffee together and appear frequently in other photos
Are all of the 2^n combinations of people interesting? Shall we select a summary that covers all this information?
Well, probably not! I don’t care about person X and person Y who happen to be together in the photo of this large group.
Is {Liyan, Ling} interesting?
Yes from my perspective, because they are both my close friends
Mine for interesting information units
O1 face: {Jeff, Kate} — T1
O2 face: {Tom} — T2
O3 face: {Jeff, Kate, Tom} — T3
O4 face: {Kate, Tom} — T4
O5 face: {Jeff, Kate} — T5
…
On face: {Jeff, Kate} — Tn

Modified item-set mining algorithm → frequent & correlated IU: {Jeff, Kate}
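The frequency part of the mining step could be sketched as a simple frequent-pair miner over the face-tag sets (hypothetical data and support threshold; the paper's modified algorithm also checks correlation, which is not shown here):

```python
from itertools import combinations
from collections import Counter

# Face-tag sets per object (toy data mirroring the slide).
tagsets = [
    {"Jeff", "Kate"}, {"Tom"}, {"Jeff", "Kate", "Tom"},
    {"Kate", "Tom"}, {"Jeff", "Kate"}, {"Jeff", "Kate"},
]

def frequent_pairs(tagsets, min_support=3):
    """Count every 2-person combination and keep those seen often enough."""
    counts = Counter()
    for tags in tagsets:
        for pair in combinations(sorted(tags), 2):
            counts[pair] += 1
    return {pair for pair, c in counts.items() if c >= min_support}

print(frequent_pairs(tagsets))  # {('Jeff', 'Kate')}
```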
Mine for interesting information units
O1 face: {Jeff, Kate}
O2 face: {Jeff}
O3 face: {Jeff, Kate, Tom}
O4 face: {Kate, Tom}
O5 face: {Jeff, Kate}
…
On face: {Jeff, Kate}

Mine from social context (e.g. Jeff is a friend of Kate, Tom is a close friend of the user) → {Jeff, Kate}, {Tom}
Limitation of existing techniques – 2
Cannot handle probabilistic attributes
Example: P(Jeff) = 0.8 in one object, P(Jeff) = 0.6 in another.
We are not sure whether an object covers an IU in another object.
[Figure: objects 1..n mapped to IUs, with uncertain coverage edges]
Deterministic Coverage Model - Example
Coverage = 8 / 14
[Figure: summary objects covering 8 of the 14 information units in the dataset]
Probabilistic Coverage Model
Cov(S, O) = (expected amount of information covered by S) / (expected amount of total information)
Simplified so it can be computed efficiently: it is computable in polynomial time, and the function is submodular.
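One common way to realize such an expected coverage is to assume each object covers an IU independently with its given probability; a sketch under that assumption (the paper's exact formula may differ):

```python
def p_covered(iu, objects):
    """Probability that at least one object in `objects` contains the IU."""
    p_miss = 1.0
    for obj in objects:  # each object is a map IU -> probability
        p_miss *= 1.0 - obj.get(iu, 0.0)
    return 1.0 - p_miss

def expected_coverage(summary, dataset, ius):
    """Expected IUs covered by the summary over expected IUs in the dataset."""
    covered = sum(p_covered(iu, summary) for iu in ius)
    total = sum(p_covered(iu, dataset) for iu in ius)
    return covered / total if total else 0.0

# Toy example with two uncertain objects.
o1 = {"Jeff": 0.8, "sky": 0.7}
o2 = {"Jeff": 0.6, "plant": 0.5}
dataset = [o1, o2]
ius = ["Jeff", "sky", "plant"]
print(expected_coverage([o1], dataset, ius))
```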
Optimization Problem for summarization
Parameters: dataset O = {o1, o2, …, on}; positive integer K (summary size)
Finding summary with Maximum Expected Coverage is NP-hard.
We developed an efficient greedy algorithm to solve it.
Basic Greedy Algorithm
1. Initialize S = empty set
2. For each object o in O \ S, compute Cov(S ∪ {o}, O)
3. Select o* with the maximum coverage and add it to S
4. If |S| < K, repeat from step 2; otherwise done

Two bottlenecks:
- Computing Cov for a single object is expensive (addressed by object-level optimization)
- Too many Cov computations per iteration (addressed by iteration-level optimization)
Efficiency optimization – Object-level
Reduce the time required to compute the coverage for one object
Instead of directly computing and maximizing coverage in each iteration, compute the gain of adding one object o to summary S:
gain(S, o) = Cov(S ∪ {o}, O) - Cov(S, O)
Updating gain(S, o) is much more efficient.
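The gain-based greedy selection could be sketched as follows, using a simple independent-coverage stand-in for Cov (toy data; not the paper's exact model):

```python
def cov(summary, ius_probs):
    """Expected number of IUs covered; ius_probs[iu] maps object id -> prob."""
    total = 0.0
    for probs in ius_probs.values():
        p_miss = 1.0
        for o in summary:
            p_miss *= 1.0 - probs.get(o, 0.0)
        total += 1.0 - p_miss
    return total

def greedy_summary(objects, ius_probs, k):
    """Pick k objects, each time adding the object with the largest gain."""
    summary = []
    for _ in range(k):
        base = cov(summary, ius_probs)
        best = max((o for o in objects if o not in summary),
                   key=lambda o: cov(summary + [o], ius_probs) - base)
        summary.append(best)
    return summary

ius_probs = {
    "Jeff": {"o1": 0.8, "o2": 0.6},
    "sky": {"o1": 0.7},
    "plant": {"o2": 0.5, "o3": 0.9},
}
print(greedy_summary(["o1", "o2", "o3"], ius_probs, 2))  # ['o1', 'o3']
```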
Submodularity of Coverage
Expected Coverage Cov(S,O) is submodular:
For S ⊆ T ⊆ O and o ∉ T:
Cov(S ∪ {o}, O) - Cov(S, O) ≥ Cov(T ∪ {o}, O) - Cov(T, O)
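The diminishing-returns inequality can be checked numerically on a toy coverage function (a sketch assuming an independent-coverage Cov, not necessarily the paper's exact definition):

```python
def cov(summary, ius_probs):
    """Expected number of IUs covered; ius_probs[iu] maps object id -> prob."""
    total = 0.0
    for probs in ius_probs.values():
        p_miss = 1.0
        for o in summary:
            p_miss *= 1.0 - probs.get(o, 0.0)
        total += 1.0 - p_miss
    return total

ius_probs = {"Jeff": {"o1": 0.8, "o2": 0.6, "o3": 0.3},
             "sky": {"o1": 0.7, "o3": 0.4}}

S = ["o1"]            # S is a subset of T
T = ["o1", "o2"]
o = "o3"
gain_S = cov(S + [o], ius_probs) - cov(S, ius_probs)
gain_T = cov(T + [o], ius_probs) - cov(T, ius_probs)
assert gain_S >= gain_T  # adding o earlier gains at least as much
```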
Efficiency optimization – Iteration-level
Reduce the number of object-level computations (i.e. of gain(S, o)) in each iteration of the greedy process.
While traversing objects in O \ S, we maintain the maximum gain seen so far, gain*, and an upper bound Upper(S, o) on gain(S, o). We can prune an object o if Upper(S, o) < gain*.
By definition and by submodularity, the gain computed for o in an earlier iteration (against a smaller summary) upper-bounds gain(S, o), so Upper(S, o) can be updated in constant time.
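This pruning is close in spirit to lazy greedy evaluation: a stale gain from an earlier iteration serves as the upper bound, and candidates whose bound falls below the best gain seen so far are skipped. A sketch with a toy Cov (the bookkeeping here is an assumption, not the paper's exact algorithm):

```python
def cov(summary, ius_probs):
    """Expected number of IUs covered; ius_probs[iu] maps object id -> prob."""
    total = 0.0
    for probs in ius_probs.values():
        p_miss = 1.0
        for o in summary:
            p_miss *= 1.0 - probs.get(o, 0.0)
        total += 1.0 - p_miss
    return total

def lazy_greedy(objects, ius_probs, k):
    upper = {o: float("inf") for o in objects}  # stale gains = upper bounds
    summary = []
    for _ in range(k):
        best, best_gain = None, -1.0
        base = cov(summary, ius_probs)
        # Visit candidates with the largest upper bound first.
        for o in sorted(set(objects) - set(summary), key=upper.get, reverse=True):
            if upper[o] < best_gain:
                break  # prune: no remaining candidate can beat the best so far
            upper[o] = cov(summary + [o], ius_probs) - base  # refresh the gain
            if upper[o] > best_gain:
                best, best_gain = o, upper[o]
        summary.append(best)
    return summary

ius_probs = {"Jeff": {"o1": 0.8, "o2": 0.6},
             "sky": {"o1": 0.7},
             "plant": {"o2": 0.5, "o3": 0.9}}
print(lazy_greedy(["o1", "o2", "o3"], ius_probs, 2))  # ['o1', 'o3']
```

It returns the same summary as the basic greedy loop while recomputing fewer gains.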
Experiment -- Datasets
Facebook Photo Set: 200 photos uploaded by 10 Facebook users. Attributes: face, event, time, visual concept.
Review Dataset: reviews about 10 hotels from TripAdvisor; each hotel has about 250 reviews on average. Attributes: facets, rating.
Flickr Photo Set: 20,000 photos from Flickr. Attributes: visual concept, event, time.
Experiment – Quality
Experiment – Efficiency
The basic greedy algorithm without optimizations runs for more than 1 minute.
Summary
Developed a new extractive summarization framework for:
- Multi-attribute data
- Uncertain/probabilistic data
Generates high-quality summaries.
Highly efficient.