CSE509 Lecture 3

Muhammad Atif Qureshi and Arjumand Younus Web Science Research Group Institute of Business Administration (IBA) CSE509: Introduction to Web Science and Technology Lecture 3: The Structure of the Web, Link Analysis and Web Search


Transcript of CSE509 Lecture 3

Page 1: CSE509 Lecture 3

Muhammad Atif Qureshi and Arjumand Younus
Web Science Research Group
Institute of Business Administration (IBA)

CSE509: Introduction to Web Science and Technology

Lecture 3: The Structure of the Web, Link Analysis and Web Search

Page 2: CSE509 Lecture 3


Last Time…

Basic Information Retrieval Approaches

Bag of Words Assumption

Information Retrieval Models
- Boolean model
- Vector-space model
- Topic/Language models

July 23, 2011

Page 3: CSE509 Lecture 3


Today

Search Engine Architecture

Overview of Web Crawling

Web Link Structure Ranking Problem

SEO and Web Spam

Web Spam Research

Page 4: CSE509 Lecture 3

Introduction
- World Wide Web has evolved from a handful of pages to billions of pages
  - In January 2008, Google reported indexing 30 billion pages and Yahoo 37 billion.
- In this huge amount of data, search engines play a significant role in finding the needed information
- Search engines consist of the following basic operations
  - Web crawling
  - Ranking
  - Keyword extraction
  - Query processing

Page 5: CSE509 Lecture 3

General Architecture of a Web Search Engine

[Figure: block diagram of the search engine. Components shown: Web, Crawler, Indexing, Index, Ranking, Query Operations, Query, Visual Interface, User.]

Page 6: CSE509 Lecture 3

CRAWLING MODULE

Page 7: CSE509 Lecture 3

Web Crawler
Definition
- Program that collects Web pages by recursively fetching links (i.e., URLs) starting from a set of seed pages [HN99]
Objective
- Acquisition of large collections of Web pages to be indexed by the search engine for efficient execution of user queries

Page 8: CSE509 Lecture 3

Basic Crawler Operation

1. Place known seed URLs in the URL queue
2. Repeat the following steps until a threshold number of pages is downloaded
   1. Fetch a URL on the URL queue and download the corresponding Web page
   2. For each downloaded Web page
      a) Extract URLs from the Web page
      b) For each extracted URL, check validity and availability of the URL using checking modules
      c) Place the URLs that pass the checks on the URL queue

[Figure: crawler data flow. Components shown: Seed URLs, URL queue, Web page downloader, Web, Web pages, Link extractor, and a Checking module made up of a DNS resolver, a Robots check, and a URL duplication check. Data flows are labeled "URLs to crawl", "Crawled Web pages", "Extracted URLs", and "New URLs". Notation: queue, module, data flow.]
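The crawl loop above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the crawler from the lecture: the `crawl` function, its `fetch` and `is_allowed` parameters, and the in-memory `TOY_WEB` are hypothetical names introduced here so the example runs without network access. The duplicate check and the `is_allowed` hook stand in for the checking module (a real crawler would also resolve DNS and honor robots.txt).

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the absolute targets of <a href="..."> tags on one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def crawl(seed_urls, fetch, is_allowed=lambda url: True, max_pages=100):
    """Breadth-first crawl. `fetch(url)` returns HTML text or None (hypothetical hook)."""
    queue = deque(seed_urls)          # step 1: seed URLs go on the URL queue
    seen = set(seed_urls)             # URL duplication check
    pages = {}
    while queue and len(pages) < max_pages:   # step 2: stop at the threshold
        url = queue.popleft()
        html = fetch(url)             # step 2.1: download the page
        if html is None:
            continue
        pages[url] = html
        extractor = LinkExtractor(url)
        extractor.feed(html)          # step 2.2a: extract URLs
        for link in extractor.links:  # step 2.2b: run the checks
            if urlparse(link).scheme not in ("http", "https"):
                continue
            if link in seen or not is_allowed(link):
                continue
            seen.add(link)
            queue.append(link)        # step 2.2c: enqueue URLs that pass
    return pages


# Toy in-memory "Web" so the sketch runs offline (assumed data).
TOY_WEB = {
    "http://a.example/": '<a href="/b">b</a> <a href="http://c.example/">c</a>',
    "http://a.example/b": '<a href="/">home</a>',
    "http://c.example/": '<a href="mailto:x@c.example">mail</a>',
}
crawled = crawl(["http://a.example/"], TOY_WEB.get)
```

Passing the downloader in as a function keeps the loop itself independent of any HTTP library, which also makes load limits and politeness delays easy to add around `fetch`.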

Page 9: CSE509 Lecture 3


Crawling Issues
- Load at visited Web sites
- Load at crawler
- Scope of crawl
- Incremental crawling

Page 10: CSE509 Lecture 3

RANKING MODULE

Page 11: CSE509 Lecture 3

Problems of TFIDF Vector
- Works well on a small controlled corpus, but not on the Web
  - Top result for “American Airlines” query: accident report of American Airlines flights
  - Do users really care how many times American Airlines is mentioned?
- Easy to spam
  - Ranking purely based on page content
  - Authors can manipulate page content to get a high ranking

Any idea?
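A small Python sketch makes the spamming problem concrete. It is an illustration only: the `tfidf_scores` function, the smoothed weight `log((N + 1) / df)`, and the three toy documents are assumptions introduced here, not part of the lecture. A page that simply repeats the query terms outranks both the official page and the accident report.

```python
import math
from collections import Counter


def tfidf_scores(query, docs):
    """Score each document by summed tf * idf over the query terms.

    The idf variant log((N + 1) / df) is an assumed smoothing so that a
    term present in every document still contributes a positive weight.
    """
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n_docs = len(docs)
    scores = {}
    for name, tokens in tokenized.items():
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for other in tokenized.values() if term in other)
            if df:
                score += counts[term] * math.log((n_docs + 1) / df)
        scores[name] = score
    return scores


# Hypothetical documents for the "American Airlines" query.
docs = {
    "official": "american airlines official site book flights",
    "report": "american airlines accident report american airlines flight american airlines crash",
    "stuffed": "american airlines " * 10 + "cheap pills",
}
scores = tfidf_scores("American Airlines", docs)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Because the score depends only on what the page author wrote, repeating the query terms is enough to move a page to the top.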

Page 12: CSE509 Lecture 3

Web Page Ranking
Motivation
- User queries return a huge number of relevant web pages, but users want to browse the most important ones
- Note: Relevance means that a web page matches the user’s query
Concept
- Ordering the relevant web pages according to their importance
- Note: Importance represents the interest of a user in the relevant web pages
Methods
- Link-based method: exploiting the link structure of the web for ordering the search results
- Content-based method: exploiting the contents of web pages for ordering the search results

Page 13: CSE509 Lecture 3

Link Structure of Web
Concept
- The Web can be modeled as a graph G(V, E), where V is a set of vertices representing web nodes and E is a set of edges representing directed links between the nodes.
- Note: A web node represents either a web page or a web domain.
- Links are classified into two classes as follows:
  - Inlink: the incoming link to a web node.
  - Outlink: the outgoing link from a web node.
- The link structure is called the web graph.
Example
- V = {A, B, C}
- E = {AB, BC}
- AB is an outlink of the web node A.
- BC is an outlink of the web node B.
- AB is an inlink of the web node B.
- BC is an inlink of the web node C.

[Figure: A → B → C]
Fig. 1: An example of a web graph.
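The example graph of Fig. 1 can be written down directly. The sketch below is a minimal illustration, assuming an edge-list representation; the names `outlinks` and `inlinks` are introduced here for the two link classes defined above.

```python
from collections import defaultdict

# The web graph of Fig. 1: V = {A, B, C}, E = {AB, BC}.
V = {"A", "B", "C"}
E = [("A", "B"), ("B", "C")]

outlinks = defaultdict(list)  # node -> links leaving the node
inlinks = defaultdict(list)   # node -> links arriving at the node
for src, dst in E:
    outlinks[src].append((src, dst))
    inlinks[dst].append((src, dst))

for node in sorted(V):
    print(node, "outlinks:", outlinks[node], "inlinks:", inlinks[node])
```

Every edge appears once as an outlink of its source and once as an inlink of its target, which is why AB shows up under both A and B.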

Page 14: CSE509 Lecture 3

PageRank: Basic Idea

Think of …
- People as pages
- Recommendations as links

Therefore, “Pages are popular if popular pages link to them”

“PageRank is a global ranking of all Web pages regardless of their content, based solely on their location in the Web’s graph structure” [Page et al 1998]

Page 15: CSE509 Lecture 3

PageRank Overview

A web page is more important if it is pointed to by many other important web pages

The importance of a web page (called its PageRank value) represents the probability that a user visits the web page

Function

\[
PR[p] = d \sum_{q:\,(q,p) \in E} \frac{PR[q]}{N_{outlink}(q)} + (1 - d)\, v[p]
\]

PR[p]: PageRank value of web page p
N_outlink(q): number of outlinks of web page q
d: damping factor (probability of following a link)
v[p]: probability that a user randomly jumps to web page p (random jump value over web page p)

[Figure: user’s behavior on the web graph. A user moves over web pages A to F either by following a link or by jumping to a random page (e.g., a random jump from F to B). Legend: web page, important web page, link, user.]
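The PageRank function can be evaluated by repeated substitution (power iteration). The following Python sketch is one way to do that, under assumptions that the slide does not state: the jump vector v is taken to be uniform, the rank held by pages without outlinks is redistributed through v, and the function name `pagerank` and the three-page toy graph are hypothetical.

```python
def pagerank(out_links, d=0.85, iterations=50):
    """Iterate PR[p] = d * sum(PR[q] / N_outlink(q)) + (1 - d) * v[p].

    `out_links` maps every page to the list of pages it links to; every
    page must appear as a key. Uniform v[p] = 1/n is an assumption.
    """
    nodes = list(out_links)
    n = len(nodes)
    v = {p: 1.0 / n for p in nodes}
    pr = dict(v)  # start from the jump distribution
    for _ in range(iterations):
        # Assumed handling of pages with no outlinks: their rank is
        # spread over all pages according to v instead of being lost.
        dangling = sum(pr[q] for q in nodes if not out_links[q])
        new_pr = {p: (1 - d) * v[p] + d * dangling * v[p] for p in nodes}
        for q in nodes:
            for p in out_links[q]:
                new_pr[p] += d * pr[q] / len(out_links[q])
        pr = new_pr
    return pr


# Toy graph (assumed): A and B both point to C, and C points back to A.
toy_graph = {"A": ["C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_graph)
```

In this toy graph B has no inlinks, so its value stays at the random-jump share (1 - d) / 3, while C collects rank from both A and B and ends up highest.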

Page 16: CSE509 Lecture 3


PageRank Example

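Page   Iteration 1    Iteration 2     Iteration 3
1      r0(1) = 1/4    r1(1) = 1/6     r2(1) = 7/72
2      r0(2) = 1/4    r1(2) = ?       r2(2) = ?
3      r0(3) = 1/4    r1(3) = 1/12    r2(3) = ?
4      r0(4) = 1/4    r1(4) = 7/24    r2(4) = ?

[Figure: example web graph with four pages (1 to 4), annotated with the worked computation of r2(1) = 7/72.]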

Page 17: CSE509 Lecture 3


PageRank: Problems on the Real Web

Dangling nodes
- A page with no links to send importance
- All importance “leaks out of” the Web
- Solution: Random surfer model

Crawler trap
- A group of one or more pages that have no links out of the group
- Accumulates all the importance of the Web
- Solution: Damping factor

Page 18: CSE509 Lecture 3

Link Analysis in Modern Web Search
- PageRank-like ideas play a basic role in the ranking functions of Google, Yahoo!, and Bing
- Current ranking functions are far from pure PageRank
  - Far more complex
  - Evolve all the time
  - Kept secret!

Page 19: CSE509 Lecture 3

Search Engine Optimization
- Important game-theoretic principle: the world reacts and adapts to the rules
- Web page authors create their Web pages with the search engine’s ranking formula in mind

Page 20: CSE509 Lecture 3

A Huge Challenge for Today’s Search Engines

SEO gives birth to the nuisance of Web spam

Page 21: CSE509 Lecture 3

Web Spam
Concept
- Any deliberate action in order to boost a web node’s rank, without improving its real merit.
- Link spam: web spam against link-based methods
  - An action that changes the link structure of the web in order to boost a web node's ranking.
Example
- The web nodes N1 and N2 are not involved in link spam, so they are called non-spam nodes.
- Web nodes N3-Nx are involved in link spam, so they are called spam nodes.
- The actor creates the web nodes N3 to Nx: “I want to boost the rank of the web node N3.”

[Figure: web nodes N1, N2, N3, N4, N5, …, Nx, the links among them, and the actor. Legend: node, link, actor.]
Fig. 2: An example of link spam.

Page 22: CSE509 Lecture 3

TrustRank
Overview [GGP04]
- Trusted domains (e.g., well-known non-spam domains such as .gov and .edu) usually point to non-spam domains by using outlinks.
- Trust scores are propagated through the outlinks of trusted domains.
- Domains having high trust scores (≥ threshold) at the end of propagation are declared as non-spam domains.
Example
- t(i): The trust score of domain i
- Domain 3 gets trust scores from domains 1 and 2 (t(3) = 1/2 + 1/3 = 5/6).
Observation
- Trust scores can propagate to spam domains if a trusted domain outlinks to the spam domains.

[Figure: domains 1 and 2 are seed non-spam domains with t(1) = 1 and t(2) = 1. Edge labels 1/2, 1/2, 1/3, 1/3, 1/3 show the trust sent along the outlinks of the seeds; domain 3 has t(3) = 5/6 and sends 5/12 along each of two outlinks; domain 4 has t(4) = 1/3. Legend: a seed non-spam domain, a domain being considered.]
Fig. 3: An example for explaining TrustRank.
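The fractions in Fig. 3 follow from splitting each domain's trust evenly across its outlinks. The Python sketch below reproduces them under explicit assumptions: it is a simplified trust-splitting pass without damping, not the full TrustRank algorithm of [GGP04], it is meant for small acyclic examples, and the edge list (including the placeholder domains "a" and "b") is a guess at the figure's graph.

```python
from fractions import Fraction


def propagate_trust(out_links, seeds, hops=2):
    """Push trust from seed domains along outlinks, split evenly per hop."""
    trust = {s: Fraction(1) for s in seeds}   # seeds start fully trusted
    frontier = dict(trust)
    for _ in range(hops):
        arriving = {}
        for src, score in frontier.items():
            targets = out_links.get(src, [])
            for dst in targets:
                share = score / len(targets)
                arriving[dst] = arriving.get(dst, Fraction(0)) + share
        # Seeds keep their initial score; other domains accumulate trust.
        frontier = {d: s for d, s in arriving.items() if d not in seeds}
        for dst, score in frontier.items():
            trust[dst] = trust.get(dst, Fraction(0)) + score
    return trust


# Assumed edges: domain 1 has two outlinks, domain 2 has three, and
# domain 3 has two. "a" and "b" are hypothetical filler domains.
fig3_links = {1: [3, "a"], 2: [3, 4, "b"], 3: [5, 6]}
trust = propagate_trust(fig3_links, seeds={1, 2})
```

With these edges, domain 3 receives 1/2 from domain 1 and 1/3 from domain 2, and then passes 5/12 to each of its two targets, whether or not those targets are spam. That is the weakness noted in the observation above.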

Page 23: CSE509 Lecture 3

Anti-TrustRank
Overview [KR06]
- Anti-trusted domains (e.g., well-known spam domains) are usually pointed to by spam domains by using inlinks.
- Anti-trust scores are propagated backward along the inlinks of anti-trusted domains.
- Domains having high anti-trust scores (≥ threshold) at the end of propagation are declared as spam domains.
Example
- at(i): The anti-trust score of domain i
- Domain 3 gets anti-trust scores from domains 1 and 2.
Observation
- Anti-trust scores can propagate to non-spam domains if a non-spam domain outlinks to a spam domain.

[Figure: domains 1 and 2 are seed spam domains with at(1) = 1 and at(2) = 1. Edge labels: 1/2, 1/2, 1/3, 1/3, 1/3, 5/12, 5/12. Domain 3 has at(3) = 5/6 and domain 4 has at(4) = 1/3. Legend: a seed spam domain, a domain being considered.]
Fig. 4: An example for explaining Anti-TrustRank.

Page 24: CSE509 Lecture 3

Spam Mass
Overview [GBG06]
- A domain is spam if it has an excessively high spam score.
- The spam score is estimated by subtracting the non-spam score from the PageRank score.
- The non-spam score is estimated as the trust score computed by TrustRank.
Example
- Domain 5 receives many inlinks but only one indirect inlink from a non-spam domain.
Observation
- Since Spam Mass uses TrustRank, it inherently has the same problem as TrustRank.

[Figure: domains 1 to 7. Legend: a seed non-spam domain, a domain being considered.]
Fig. 5: An example for explaining Spam Mass.
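The subtraction described in the overview can be sketched as follows. This is an illustration under assumptions: the relative form (spam score divided by PageRank score) follows the usual definition of relative spam mass and is assumed to be what the threshold is applied to, the two score dictionaries are taken to be on a comparable scale, and the function names and toy scores are hypothetical.

```python
def relative_spam_mass(pagerank, trust):
    """(PageRank score - non-spam score) / PageRank score, per domain."""
    return {
        domain: (pr - trust.get(domain, 0.0)) / pr
        for domain, pr in pagerank.items()
        if pr > 0
    }


def spam_candidates(pagerank, trust, threshold=0.95):
    """Domains whose relative spam mass reaches the (assumed) threshold."""
    masses = relative_spam_mass(pagerank, trust)
    return sorted(d for d, m in masses.items() if m >= threshold)


# Toy scores (assumed): "farm" has a high PageRank score but almost no
# trust flowing into it from the non-spam seeds.
toy_pagerank = {"good": 0.40, "farm": 0.50, "small": 0.10}
toy_trust = {"good": 0.35, "farm": 0.01, "small": 0.05}
masses = relative_spam_mass(toy_pagerank, toy_trust)
candidates = spam_candidates(toy_pagerank, toy_trust)
```

A domain like "farm" stands out because nearly all of its PageRank score is unexplained by trusted sources, which matches the situation of domain 5 in the example.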

Page 25: CSE509 Lecture 3

Link Farm Spam
Overview [WD05]
- A domain is spam if it has many bidirectional links with other domains.
- A domain is spam if it has many outlinks pointing to spam domains.
Example
- The domains 1, 3, and 4 have bidirectional links.
Observation
- Link Farm Spam does not take any input seed set.
- A domain can have many bidirectional links with trusted domains as well.

[Figure: domains 1 to 5. Legend: a domain being considered.]
Fig. 6: An example for explaining Link Farm Spam.
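The two rules in the overview translate into a short two-phase procedure. The Python sketch below is an illustration, not the algorithm of [WD05] as published: the function name, the parameter names `limit_bl` and `limit_ol`, the repeat-until-stable expansion, and the toy graph are assumptions introduced here.

```python
def link_farm_spam(out_links, limit_bl=2, limit_ol=2):
    """Flag domains by bidirectional links, then expand via outlinks to spam."""
    spam = set()
    # Rule 1: many bidirectional links with other domains.
    for domain, targets in out_links.items():
        bidirectional = sum(
            1 for t in set(targets)
            if t != domain and domain in out_links.get(t, ())
        )
        if bidirectional >= limit_bl:
            spam.add(domain)
    # Rule 2: many outlinks pointing to spam domains (repeat until stable).
    changed = True
    while changed:
        changed = False
        for domain, targets in out_links.items():
            if domain in spam:
                continue
            if sum(1 for t in set(targets) if t in spam) >= limit_ol:
                spam.add(domain)
                changed = True
    return spam


# Toy graph (assumed): f1, f2, f3 link to each other; p links into the
# farm twice without being linked back; q links into it only once.
toy_links = {
    "f1": ["f2", "f3"],
    "f2": ["f1", "f3"],
    "f3": ["f1", "f2"],
    "p": ["f1", "f2"],
    "q": ["f1"],
}
detected = link_farm_spam(toy_links)
```

Nothing in this procedure looks at who the link partners are, so a domain that exchanges links with many trusted domains would be flagged as well.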

Page 26: CSE509 Lecture 3

RESEARCH SECTION

Page 27: CSE509 Lecture 3

Web Spam Filtering Algorithm
Overview
- The web spam filtering algorithms output spam nodes to be filtered out [GBG06].
- In order to identify spam nodes, a web spam filtering algorithm needs spam or non-spam nodes (called input seed sets) as an input [GGP04, KR06, GBG06, WD05].
  - Spam input seed set: the input seed set containing spam nodes.
  - Non-spam input seed set: the input seed set containing non-spam nodes.
- The input seed set can be used as the basis for grading the degree to which web nodes are spam or non-spam nodes [GGP04, KR06, GBG06].
Observation
- The output quality of web spam filtering algorithms depends on that of the input seed sets.
- The output of one web spam filtering algorithm can be used as the input of another web spam filtering algorithm.
- The algorithms may support one another if placed in appropriate succession.

Page 28: CSE509 Lecture 3

Motivation and Goal
Motivation
- There is no well-known study which addresses the refinement of the input seed sets for web spam filtering algorithms.
- There is no well-known study on successions among web spam filtering algorithms.
Goal
- Improving the quality of web spam filtering by using seed refinement.
- Improving the quality of web spam filtering by finding the appropriate succession among web spam filtering algorithms.

Page 29: CSE509 Lecture 3

Contributions
- We propose modified algorithms that apply seed refinement techniques using both spam and non-spam input seed sets to well-known web spam filtering algorithms.
- We propose a strategy that makes the best succession of the modified algorithms.
- We conduct extensive experiments in order to show quality improvement for our work.
  - We compare the original (i.e., well-known) algorithms with the respective modified algorithms.
  - We evaluate the best succession among our modified algorithms.

Page 30: CSE509 Lecture 3

Web Spam Filtering Using Seed Refinement
Objectives
- Decrease the number of domains incorrectly detected as belonging to the class of non-spam domains (called False Positives).
- Increase the number of domains correctly detected as belonging to the class of spam domains (called True Positives).
Our approaches
- We modify the spam filtering algorithms by using both spam and non-spam domains in order to decrease False Positives.
  - We use non-spam domains so that their goodness does not propagate to spam domains.
  - We use spam domains so that their badness does not propagate to non-spam domains.
- We make the succession of these algorithms in order to increase True Positives.
  - We make the succession of the seed refinement algorithm followed by the spam detection algorithm so that the spam detection algorithm uses the refined input seed sets, which are produced by the seed refinement algorithm.

Page 31: CSE509 Lecture 3

Modified TrustRank
Modification
- Trust score should not propagate to spam domains.
Example
- t(i): The trust score of domain i
- The domains 5 and 6 are involved in Web spam.

[Figure: the graph of Fig. 3 extended with domains 5 and 6. t(1) = 1, t(2) = 1, t(3) = 5/6, t(4) = 1/3, t(5) = 5/12 + …, t(6) = 5/12 + …. Edge labels: 1/2, 1/2, 1/3, 1/3, 1/3, 5/12, 5/12, 5/12, 5/12. Legend: a seed non-spam domain, a seed spam domain, a domain being considered.]
Fig. 7: An example explaining Modified TrustRank.
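One way to picture the modification is a trust-splitting pass that refuses to deliver trust to known spam seeds. The Python sketch below is a guess at the idea rather than the authors' algorithm: it omits damping, it assumes that a blocked share is simply dropped (the slides do not say what happens to it), and the function name, the edge list, and the choice of domain 5 as the only spam seed are hypothetical.

```python
from fractions import Fraction


def propagate_trust_blocked(out_links, good_seeds, spam_seeds, hops=2):
    """Split trust along outlinks, but never into a seed spam domain."""
    trust = {s: Fraction(1) for s in good_seeds}
    frontier = dict(trust)
    for _ in range(hops):
        arriving = {}
        for src, score in frontier.items():
            targets = out_links.get(src, [])
            for dst in targets:
                if dst in spam_seeds:
                    continue  # assumption: the blocked share is dropped
                share = score / len(targets)
                arriving[dst] = arriving.get(dst, Fraction(0)) + share
        frontier = {d: s for d, s in arriving.items() if d not in good_seeds}
        for dst, score in frontier.items():
            trust[dst] = trust.get(dst, Fraction(0)) + score
    return trust


# Assumed edges; "a" and "b" are hypothetical filler domains.
links = {1: [3, "a"], 2: [3, 4, "b"], 3: [5, 6]}
blocked = propagate_trust_blocked(links, good_seeds={1, 2}, spam_seeds={5})
```

Here domain 5 ends up with no trust at all, while domain 6, which is not in the spam seed set, still receives 5/12. The effect of the modification therefore depends on how complete the spam seed set is.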

Page 32: CSE509 Lecture 3

Modified Anti-TrustRank
Modification
- Anti-trust score should not propagate to non-spam domains.
Example
- at(i): The anti-trust score of domain i
- The domains 5, 6, and 7 are non-spam domains.

[Figure: the graph of Fig. 4 extended with domains 5, 6, and 7. at(1) = 1, at(2) = 1, at(3) = 5/6, at(4) = 1/3, at(5) = 5/12, at(6) = 5/12 + …, at(7) = 5/12 + …. Edge labels: 1/2, 1/2, 1/3, 1/3, 1/3, and 5/12 on the links added for domains 5, 6, and 7. Legend: a seed spam domain, a seed non-spam domain, a domain being considered.]
Fig. 8: An example explaining Modified Anti-TrustRank.

Page 33: CSE509 Lecture 3

Modified Spam Mass
Modification
- Use Modified TrustRank in place of TrustRank.
Example
- Domain 5 receives many inlinks but only one indirect inlink from a non-spam domain.

[Figure: domains 1 to 7. Legend: a seed non-spam domain, a seed spam domain, a domain being considered.]
Fig. 9: An example explaining Modified Spam Mass.

Page 34: CSE509 Lecture 3

Modified Link Farm Spam
Modification
- Use two types (i.e., spam and non-spam domains) of input seed sets.
- A domain having many bidirectional links with only trusted domains is not detected as a spam domain.
Example
- The domains 1, 3, and 4 have bidirectional links.

[Figure: domains 1 to 8. Legend: a seed non-spam domain, a domain being considered.]
Fig. 10: An example explaining Modified Link Farm Spam.

Page 35: CSE509 Lecture 3

Modified Link Farm Spam

[Figure: manually labeled spam and non-spam domains flow into the Seed Refiner; the refined spam and non-spam domains flow into the Spam Detector, which outputs the detected spam domains. Legend: class, data flow.]
Fig. 11: The strategy of succession.

Overview
- We make the succession of the seed refinement algorithms (simply, Seed Refiner) followed by the spam detection algorithms (simply, Spam Detector).
- We also consider the execution order of algorithms belonging to Seed Refiner and Spam Detector, respectively.
Strategy
- Consideration of the execution order in Seed Refiner.
  - Modified TrustRank followed by Modified Anti-TrustRank.
  - Modified Anti-TrustRank followed by Modified TrustRank.
- Consideration of the execution order in Spam Detector.
  - Modified Spam Mass followed by Modified Link Farm Spam.
  - Modified Link Farm Spam followed by Modified Spam Mass.

Page 36: CSE509 Lecture 3

Performance Evaluation
Purpose
- Show the effect of seed refinement on the quality of web spam filtering.
- Show the effect of succession on the quality of web spam filtering.
Experiments
- We conduct two sets of experiments according to the two purposes mentioned above.

Table 1: Summary of the experiments.

Set 1: Comparisons for showing the effect of refining seed
  Exp. 1: Comparison between TR (TrustRank) and MTR (Modified TrustRank)
    Parameters: cutoffTr = 0% to 300%; ratioTop = 10%, 50%, 100%; damp = 0.85
  Exp. 2: Comparison between ATR (Anti-TrustRank) and MATR (Modified Anti-TrustRank)
    Parameters: cutoffATr = 0% to 300%; ratioTop = 10%, 50%, 100%; damp = 0.85
  Exp. 3: Comparison between SM (Spam Mass) and MSM (Modified Spam Mass)
    Parameters: relativeMass = 0.7 to 1.0; topPR = 10%, 50%, 100%; damp = 0.85
  Exp. 4: Comparison between LFS (Link Farm Spam) and MLFS (Modified Link Farm Spam)
    Parameters: limitBL = 2 to 7; limitOL = 2 to 7

Set 2: Comparisons for showing the effect of ordering executions
  Exp. 5: Finding the best succession for the seed refiner
    Parameters: cutoffTr = 50%, 75%, 100%; cutoffATr = 100%; damp = 0.85
  Exp. 6: Finding the best succession for the spam detector
    Parameters: relativeMass = 0.8 to 0.99; topPR = 100%; limitBL = 7; limitOL = 7; damp = 0.85
  Exp. 7: Comparison among the best succession, the best known algorithm, and the best modified algorithm
    Parameters: relativeMass = 0.8 to 0.99; topPR = 100%; limitBL = 7; limitOL = 7; damp = 0.85

Page 37: CSE509 Lecture 3

Experimental Parameters

Table 2: Parameters used in experiments.

damp
  It is a parameter used in TR, MTR, ATR, and MATR. It is the probability of following an outlink.
ratioTop
  It is the ratio for determining the input seed sets in TR, MTR, ATR, and MATR. Specifically, from the Spam (or Non-Spam) Seed Set, we retrieve the domains whose PageRank scores are larger than or equal to the PageRank score of the top-ratioTop% domain among all domains, and then use those domains as the input seed set.
cutoffTr
  It is the cutoff threshold in TR and MTR for declaring the number of non-spam domains. In this thesis, we decide the value of cutoffTr proportional to the size of the input seed set of non-spam domains.
cutoffATr
  It is the cutoff threshold in ATR and MATR for declaring the number of spam domains. In this thesis, we decide the value of cutoffATr proportional to the size of the input seed set of spam domains.
relativeMass
  It is a threshold used in SM and MSM for deciding whether a domain is spam: if the domain receives an excessively high spam score compared to its non-spam score, the domain is one of the candidates for Web spam.
topPR
  It is a threshold used in SM and MSM for deciding whether a domain is a spam candidate, by checking that the PageRank score of the domain is within the top percentage (i.e., topPR %) of the PageRank scores.
limitBL
  It is a threshold used in LFS and MLFS for declaring a domain as spam if the number of bidirectional links of the domain is equal to or greater than this threshold.
limitOL
  It is a threshold used in LFS and MLFS for declaring a domain as spam if the number of outlinks of the domain pointing to spam domains is equal to or greater than this threshold.

Page 38: CSE509 Lecture 3

Experimental Data [BCD08] [CDB06] [CDG07]

Table 3: Characteristics of the data set in terms of domains and web pages.

Domains
  Labeled     Spam        1,924
  Labeled     Non-Spam    5,549
  Unlabeled   Unknown     3,929
  Total                   11,402
Web Pages
  Total                   77.9 Million

Table 4: Classification of the data set as Seed Set and Test Set.

                            Seed Set    Test Set
Labeled Spam Domains        674         1,250
Labeled Non-Spam Domains    4,948       601

Page 39: CSE509 Lecture 3

Experimental Measure

Table 5: Description of the measures.

True positives
  The number of domains correctly labeled as belonging to the class (i.e., spam or non-spam). [BCD08]
False positives
  The number of domains incorrectly labeled as belonging to the class (i.e., spam or non-spam). [BCD08]
F-measure
  The combined representation of precision and recall. Precision, recall [SM86], and F-measure are expressed as follows.

1 False negatives are the number of domains incorrectly labeled as not belonging to the class (i.e., spam or non-spam).
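The expressions referred to in Table 5 are not present in the transcript. The standard definitions are given below on the assumption that these are the ones intended; TP, FP, and FN denote true positives, false positives, and false negatives, and the balanced form of the F-measure (equal weight on precision and recall) is assumed.

```latex
\[
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
\]
```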

Page 40: CSE509 Lecture 3

Comparison between Original and Modified Algorithms (1/3)

Experiment 1: Comparison Between TR and MTR
- MTR performs either comparable to or slightly better than TR in terms of both true positives and false positives.
- We find cutoffTr effective up to the 100% mark; beyond 100%, detection becomes unstable in terms of false positives.
- For later experiments, we fix the cutoffTr range up to 100%.

Experiment 2: Comparison Between ATR and MATR
- MATR generally performs better than ATR in terms of true positives.
- We find cutoffATr effective up to the 180% mark, indicating that after 100% detection becomes unstable in terms of false positives.
- For later experiments, we fix cutoffATr at 100% to ensure high precision.

Page 41: CSE509 Lecture 3

Comparison between Original and Modified Algorithms (2/3)

Experiment 3: Comparison Between SM and MSM
- MSM performs slightly better than SM in terms of true positives and comparably in terms of false positives.
- We find relativeMass effective in the range of 0.95 to 0.99 in terms of maximizing true positives and minimizing false positives.
- For later experiments, we keep the range from 0.8 to 0.99 of relativeMass as the effective range.

Experiment 4: Comparison Between LFS and MLFS
- MLFS performs better than LFS in terms of false positives, at some expense of true positives.
- We find limitBL and limitOL highly effective at 7 and 7, respectively, in terms of minimizing false positives.
- For later experiments, we keep limitBL = 7 and limitOL = 7.

Page 42: CSE509 Lecture 3

Comparison between Original and Modified Algorithms (3/3)

Summary
- We have found that all modified algorithms provide better quality than the respective original algorithms.
- We found SM to be the best original web spam detection algorithm among ATR, SM, and LFS, due to high true positives and relatively few false positives.
- We also found MSM to be the best modified web spam detection algorithm among MATR, MSM, and MLFS, due to high true positives and relatively few false positives.

Page 43: CSE509 Lecture 3

The Best Succession for the Seed Refiner

Table 6: Comparison for the seed refiner.

For finding refined non-spam domains
  True positives: identical performance for both successions
  False positives: identical performance for both successions
For finding refined spam domains
  True positives: identical performance for both successions
  False positives: better performance for MATR-MTR compared to MTR-MATR

Therefore, MATR-MTR is found to be the winner, and hence we select it as the seed refiner.

Page 44: CSE509 Lecture 3

The Best Succession for the Spam Detector

Comparison
- We pick 0.99 for relativeMass since false positives are at a minimum at this value compared to other values of relativeMass, while true positives are almost comparable for all values of relativeMass.
- We observe that MLFS fails to detect a considerable number of spam domains.
- We obtain the precisions 0.86, 0.86, 0.93, and 0.87 for MLFS-MSM, MSM-MLFS, MLFS, and MSM, respectively.
- We obtain the recalls 0.80, 0.80, 0.33, and 0.76 for MLFS-MSM, MSM-MLFS, MLFS, and MSM, respectively.
- MLFS-MSM and MSM-MLFS are best and identical in performance; we choose MLFS-MSM as the best spam detector without loss of generality.

Fig. 12: Comparison for the spam detector.

Page 45: CSE509 Lecture 3

Comparison among the Best Succession, the Best Known Algorithm and the Best Modified Algorithm

Comparison
- We pick 0.99 for relativeMass since false positives are at a minimum at this value compared to other values of relativeMass, while true positives are almost comparable for all values of relativeMass.
- We observe that MATR-MTR-MLFS-MSM finds more true positives and some more false positives.
- We obtain the precisions 0.85, 0.86, and 0.86 for SM, MSM, and MATR-MTR-MLFS-MSM, respectively.
- We obtain the recalls 0.64, 0.70, and 0.80 for SM, MSM, and MATR-MTR-MLFS-MSM, respectively.

Fig. 13: Comparison among MATR-MTR-MLFS-MSM, SM, and MSM.

Therefore, MATR-MTR-MLFS-MSM is more effective.
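The reported precision and recall values can be combined into a single number per algorithm. The Python sketch below does this with the balanced F-measure (harmonic mean of precision and recall), which is assumed here to be the form used; the resulting F values are derived from the rounded figures quoted above and are not reported on the slides.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs as quoted in the comparison above.
reported = {
    "SM": (0.85, 0.64),
    "MSM": (0.86, 0.70),
    "MATR-MTR-MLFS-MSM": (0.86, 0.80),
}

f_scores = {name: f_measure(p, r) for name, (p, r) in reported.items()}

# Ratio of the succession's recall to the recall of SM.
recall_gain = reported["MATR-MTR-MLFS-MSM"][1] / reported["SM"][1]

for name, score in f_scores.items():
    print(f"{name}: F = {score:.3f}")
print(f"recall gain over SM: {recall_gain:.2f}x")
```

Since precision is nearly the same for all three, the difference in F-measure comes almost entirely from recall.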

Page 46: CSE509 Lecture 3

Conclusions

- We have improved the quality of web spam filtering by using seed refinement
  - We have proposed modifications to four well-known web spam filtering algorithms.
- We have proposed a strategy of succession of modified algorithms
  - Seed Refiner contains the order of execution for seed refinement algorithms.
  - Spam Detector contains the order of execution for spam detection algorithms.
- We have conducted extensive experiments in order to show the effect of seed refinement on the quality of web spam filtering
  - We find that every modified algorithm performs better than the respective original algorithm.
  - We find the best performance among the successions with MATR followed by MTR, MLFS, and MSM (i.e., MATR-MTR-MLFS-MSM). This succession outperforms the best original algorithm, i.e., SM, by up to 1.25 times in recall and is comparable in terms of precision.

Page 47: CSE509 Lecture 3

QUESTIONS?
