Truth Discovery with Multiple Conflicting Information Providers on the Web
Xiaoxin Yin, Jiawei Han, Philip S. Yu
Industrial and Government Track short paper

Advisor: Dr. Koh Jia-Ling
Speaker: Che-Wei Liang
Date: 2007.11.20


Outline

• Introduction
• Problem Definitions
• Computational Model
  – Web Site Trustworthiness and Fact Confidence
  – Iterative Computation
• Empirical Study
• Conclusions

2

Introduction

• World-wide web
  – A necessary part of our lives.
  – Ex: Amazon.com, ShopZilla.com.
• Is the world-wide web always trustworthy?
  – There is no guarantee of the correctness of information on the web.

3

Introduction

• Example 1: Authors of books
  [Table not preserved; annotations on its entries: "incomplete!", "incorrect!"]

4

Introduction

• Ranking web pages
  – According to authority based on hyperlinks.
  – Ex: Authority-Hub analysis, PageRank, more general link-based analysis.

• Does authority or popularity of web sites lead to accuracy of information?

5

Introduction

• Veracity problem
  – Discover the true fact about each object.

6

Problem Definitions

• Definition 1: Confidence of facts.
  – The probability of a fact f being correct, denoted by s(f).
• Definition 2: Trustworthiness of web sites.
  – The expected confidence of the facts provided by a web site w, denoted by t(w).

7

Problem Definitions

• Facts about the same object may conflict with or support each other.
  – Ex: “Jennifer Widom” and “J. Widom” support each other.
• Concept of implication
  – imp(f1 → f2): f1’s influence on f2’s confidence.
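As a rough illustration of implication between author-name facts, the sketch below scores how strongly one name supports another. The function `implication`, the use of `difflib.SequenceMatcher`, and the `base_sim` threshold are assumptions made for this example and are not the definition used in the paper. The only property taken from the slide is that similar facts support each other (positive value) and dissimilar facts conflict (negative value). Unlike a general imp(f1 → f2), this simplified version is symmetric.

```python
from difflib import SequenceMatcher


def implication(f1: str, f2: str, base_sim: float = 0.5) -> float:
    """Hypothetical imp(f1 -> f2): > 0 if f1 supports f2, < 0 if they conflict."""
    # String similarity in [0, 1]; base_sim is an assumed neutral point.
    sim = SequenceMatcher(None, f1.lower(), f2.lower()).ratio()
    return sim - base_sim


# Similar names support each other; unrelated names conflict.
print(implication("Jennifer Widom", "J. Widom"))        # positive
print(implication("Jennifer Widom", "Jeffrey Ullman"))  # negative
```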

8

Basic heuristic

• Basic heuristic
  1. Usually there is only one true fact for a property of an object.
  2. This true fact appears to be the same or similar on different web sites.

9

Basic heuristic (cont.)

• Basic heuristic
  3. The false facts on different web sites are less likely to be the same or similar.
  4. In a certain domain, a web site that provides mostly true facts for many objects will likely provide true facts for other objects.

10

Web Site Trustworthiness and Fact Confidence

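• Trustworthiness t(w): the average confidence of the facts provided by w,

  $$t(w) = \frac{\sum_{f \in F(w)} s(f)}{|F(w)|}$$

  where F(w) is the set of facts provided by w.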
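Reading t(w) as the expected (average) confidence over F(w), as in Definition 2, the computation can be sketched as follows. The function name `trustworthiness` and the sample confidences are illustrative assumptions.

```python
def trustworthiness(fact_confidences: list[float]) -> float:
    """t(w): mean of s(f) over the facts F(w) provided by one web site."""
    if not fact_confidences:
        raise ValueError("a web site must provide at least one fact")
    return sum(fact_confidences) / len(fact_confidences)


# A site providing three facts with (made-up) confidences 0.9, 0.8 and 0.4.
print(trustworthiness([0.9, 0.8, 0.4]))  # approximately 0.7
```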

11

Web Site Trustworthiness and Fact Confidence

• Estimating the confidence of a fact is more difficult.

12

Web Site Trustworthiness and Fact Confidence

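• Simple case
  – f1 is the only fact about object o1.
  – Assume w1 and w2 are independent.
• Confidence s(f): f is wrong only if every web site providing it is wrong, so

  $$s(f) = 1 - \prod_{w \in W(f)} \bigl(1 - t(w)\bigr)$$

  where W(f) is the set of web sites providing f.

13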
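Under the independence assumption of the simple case, a fact is wrong only if every site in W(f) is wrong, which gives s(f) = 1 − ∏(1 − t(w)). A minimal sketch of that calculation, with an assumed function name and made-up trust values:

```python
from math import prod


def simple_confidence(site_trust: list[float]) -> float:
    """s(f) = 1 - prod(1 - t(w)) over the web sites W(f) that provide f."""
    return 1.0 - prod(1.0 - t for t in site_trust)


# Two independent sites, each with t(w) = 0.9, both provide the same fact.
print(simple_confidence([0.9, 0.9]))  # approximately 0.99
```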

Web Site Trustworthiness and Fact Confidence

• Trustworthiness score of a web site

  $$\tau(w) = -\ln\bigl(1 - t(w)\bigr)$$

• τ(w) ranges from 0 to +∞ and better characterizes how accurate w is.
  – Ex: t(w1) = 0.9, t(w2) = 0.99
    t(w2) = 1.1 × t(w1), but τ(w2) = 2 × τ(w1)
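The factor of 2 in the example can be checked numerically, assuming the logarithmic definition τ(w) = −ln(1 − t(w)), which is consistent with the stated range of 0 to +∞. The helper name `trust_score` is an assumption.

```python
import math


def trust_score(t: float) -> float:
    """tau(w) = -ln(1 - t(w)); maps t in [0, 1) to [0, +inf)."""
    return -math.log(1.0 - t)


tau1, tau2 = trust_score(0.9), trust_score(0.99)
# t(w2) is only 1.1 times t(w1), but tau(w2) is about twice tau(w1):
# w2 makes a tenth as many errors as w1.
print(0.99 / 0.9, tau2 / tau1)
```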

14

Web Site Trustworthiness and Fact Confidence

• Confidence score of a fact

  $$\sigma(f) = -\ln\bigl(1 - s(f)\bigr)$$

  – Property: $$\sigma(f) = \sum_{w \in W(f)} \tau(w)$$

15

Web Site Trustworthiness and Fact Confidence

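• Adjusted confidence score of a fact f, which adds the influence of the other facts f′ about the same object:

  $$\sigma^*(f) = \sigma(f) + \rho \cdot \sum_{f' \neq f,\; o(f') = o(f)} \sigma(f') \cdot imp(f' \to f)$$

  where o(f) is the object that f describes and ρ, between 0 and 1, weights the influence of related facts.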
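A small sketch of the adjustment, assuming the adjusted score is σ(f) plus ρ times the implication-weighted scores of the other facts about the same object. The dictionary layout, the function name `adjusted_score`, and all numeric values are made up for illustration; ρ = 0.5 is the value reported in the empirical study.

```python
def adjusted_score(f, sigma, related, imp, rho=0.5):
    """sigma*(f) = sigma(f) + rho * sum(sigma(f') * imp(f' -> f)) over related f'."""
    return sigma[f] + rho * sum(sigma[g] * imp[(g, f)] for g in related[f])


# Hypothetical confidence scores for three claimed authors of one book.
sigma = {"Jennifer Widom": 3.0, "J. Widom": 1.0, "Author X": 2.0}
# Other facts about the same object as "J. Widom".
related = {"J. Widom": ["Jennifer Widom", "Author X"]}
# Assumed implication values: a supporting fact (+) and a conflicting fact (-).
imp = {("Jennifer Widom", "J. Widom"): 0.4, ("Author X", "J. Widom"): -0.5}

# 1.0 + 0.5 * (3.0 * 0.4 + 2.0 * -0.5), approximately 1.1
print(adjusted_score("J. Widom", sigma, related, imp))
```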

16

Web Site Trustworthiness and Fact Confidence

• Compute the confidence of f from σ*(f) in the same way as from σ(f):

  $$s^*(f) = 1 - e^{-\sigma^*(f)}$$

• The assumption that different web sites are independent is incorrect, so add a dampening factor γ, 0 < γ < 1:

  $$s^*(f) = 1 - e^{-\gamma \cdot \sigma^*(f)}$$

17

Web Site Trustworthiness and Fact Confidence

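• Negative-confidence problem
  – A fact f that conflicts with facts provided by trustworthy web sites can have σ*(f) < 0 and hence s*(f) < 0, which is unreasonable for a probability.
• Use the logistic function instead:

  $$s(f) = \frac{1}{1 + e^{-\gamma \cdot \sigma^*(f)}}$$

  – If γ · σ*(f) is well above 0, s(f) is very close to s*(f).
  – If γ · σ*(f) is well below 0, s(f) is close to zero but still positive.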
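The two cases above can be compared numerically. The sketch assumes the dampened form s*(f) = 1 − e^(−γ·σ*(f)) and the logistic form s(f) = 1 / (1 + e^(−γ·σ*(f))). The function names and the sample σ* values are illustrative, and γ = 0.3 is the value reported in the empirical study.

```python
import math


def damped_confidence(sigma_star: float, gamma: float = 0.3) -> float:
    """s*(f) = 1 - exp(-gamma * sigma*(f)); negative when sigma*(f) < 0."""
    return 1.0 - math.exp(-gamma * sigma_star)


def logistic_confidence(sigma_star: float, gamma: float = 0.3) -> float:
    """s(f) = 1 / (1 + exp(-gamma * sigma*(f))); always strictly in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-gamma * sigma_star))


# A strongly contradicted fact and a strongly supported fact.
for sigma_star in (-10.0, 20.0):
    print(sigma_star, damped_confidence(sigma_star), logistic_confidence(sigma_star))
```

For the contradicted fact the dampened form is negative while the logistic form stays small and positive; for the well-supported fact the two nearly coincide.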

18

Iterative Computation

• TRUTHFINDER: an iterative method
  – Initially, TruthFinder has little information about the web sites and the facts.
  – In each iteration, it improves its knowledge about trustworthiness and confidence.
  – It stops when the computation reaches a stable state.
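The iteration described above can be sketched as below. This is a simplified illustration under stated assumptions, not the authors' implementation. The uniform starting trust `t0`, the stopping test on the largest change in trustworthiness, the data layout (`claims`, `imp`), and the toy data are all assumptions. The update formulas assumed are τ(w) = −ln(1 − t(w)), σ(f) as the sum of τ over providing sites, the implication-adjusted σ*(f), and a logistic confidence with dampening factor γ.

```python
import math


def truthfinder(claims, imp=None, rho=0.5, gamma=0.3, t0=0.9, tol=1e-6, max_iter=100):
    """claims: {site: {object: fact}}, each site providing at least one fact.
    imp: {(fact_src, fact_dst): value in [-1, 1]}; missing pairs count as 0.
    Returns (trust per site, confidence per (object, fact))."""
    imp = imp or {}
    providers = {}  # (object, fact) -> set of sites providing it, i.e. W(f)
    for site, facts in claims.items():
        for obj, fact in facts.items():
            providers.setdefault((obj, fact), set()).add(site)

    trust = {site: t0 for site in claims}  # assumed uniform initial trust
    conf = {}
    for _ in range(max_iter):
        # tau(w) = -ln(1 - t(w)); the clamp avoids log(0) if t(w) reaches 1.
        tau = {w: -math.log(max(1.0 - t, 1e-12)) for w, t in trust.items()}
        # sigma(f) = sum of tau(w) over the sites providing f.
        sigma = {f: sum(tau[w] for w in ws) for f, ws in providers.items()}
        conf = {}
        for (obj, fact), score in sigma.items():
            # sigma*(f): add the influence of other facts about the same object.
            adjusted = score + rho * sum(
                other_score * imp.get((other_fact, fact), 0.0)
                for (other_obj, other_fact), other_score in sigma.items()
                if other_obj == obj and other_fact != fact
            )
            # Logistic confidence with dampening factor gamma.
            conf[(obj, fact)] = 1.0 / (1.0 + math.exp(-gamma * adjusted))
        # t(w) = average confidence of the facts provided by w.
        new_trust = {
            w: sum(conf[(o, f)] for o, f in facts.items()) / len(facts)
            for w, facts in claims.items()
        }
        change = max(abs(new_trust[w] - trust[w]) for w in trust)
        trust = new_trust
        if change < tol:  # assumed notion of "stable state"
            break
    return trust, conf


# Toy data: two sites agree on book1's author, one disagrees.
claims = {
    "site_a": {"book1": "Author A", "book2": "Author H"},
    "site_b": {"book1": "Author A", "book2": "Author H"},
    "site_c": {"book1": "Author B", "book2": "Author H"},
}
imp = {("Author A", "Author B"): -1.0, ("Author B", "Author A"): -1.0}
trust, conf = truthfinder(claims, imp)
print(trust)
print(conf)
```

In this toy run the fact backed by two sites ends with higher confidence than the conflicting one, and the dissenting site ends with lower trustworthiness than the other two.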

19

Empirical Study

• Compare with VOTING
  – VOTING chooses the fact that is provided by the most web sites.
• Environment: Intel PC with a 1.66GHz dual-core processor, 1GB memory, Windows XP Professional.
• Parameters: ρ = 0.5 and γ = 0.3.
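The VOTING baseline mentioned above picks, for each object, the fact provided by the most web sites. A minimal sketch follows; the function name `voting`, the `claims` layout, and the toy data are assumptions, and ties are broken arbitrarily by first occurrence.

```python
from collections import Counter


def voting(claims):
    """claims: {site: {object: fact}} -> {object: most frequently provided fact}."""
    votes = {}
    for facts in claims.values():
        for obj, fact in facts.items():
            votes.setdefault(obj, Counter())[fact] += 1
    return {obj: counter.most_common(1)[0][0] for obj, counter in votes.items()}


claims = {
    "site_a": {"book1": "Author A", "book2": "Author H"},
    "site_b": {"book1": "Author A", "book2": "Author H"},
    "site_c": {"book1": "Author B", "book2": "Author H"},
}
print(voting(claims))  # {'book1': 'Author A', 'book2': 'Author H'}
```

Unlike an approach based on trustworthiness, this baseline weights every web site equally.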

20


Conclusions

• Introduce and formulate the Veracity problem
  – Resolving conflicting facts from multiple web sites.
  – Finding the true facts among them.
• Propose TRUTHFINDER
  – Utilizes web site trustworthiness and fact confidence to find trustworthy web sites and true facts.
• Experiments show that TRUTHFINDER achieves high accuracy.

25