Relating Web Characteristics with Link-Based Ranking

Post on 01-Nov-2014


Relating Web Characteristics

Ricardo Baeza-Yates

Carlos Castillo

Universidad de Chile

Agenda

• Introduction

• Link-based ranking

• Web structure

• Web characteristics

• Web usage

• Web dynamics

• Conclusions

Introduction: Sample

• Web sample: .CL domain in the year 2000

• 670,000 pages in 7,500 domains

• 15 KB average page size

• Collection from the TodoCL web search engine

Introduction: Emphasis

• Broder et al.: Graph Structure in the Web (2000)

– Page-based structure based on strongly connected components

– The Web graph is not a random graph

– Process: cut & paste model

• Ours is mostly a site-based analysis

– Trying to make Web structure meaningful

Introduction: The Empire

Introduction: One Map

Link ranking: Pagerank

$$\mathrm{Pagerank}(p) = \frac{q}{N} + (1 - q) \sum_{i=1}^{k} \mathrm{Pagerank}(r_i)$$

• r_1, ..., r_k: pages that point to page p

• q / N: probability of a random jump (q) over the number of pages (N)

• Currently used by Google (Brin & Page, 1998)
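
A minimal sketch of how the Pagerank definition above could be computed by repeated substitution on a toy graph. The graph, the value q = 0.15 and the iteration count are illustrative assumptions, not values from the study. The sketch also divides each linking page's score by its number of out-links, as in the usual Brin & Page formulation, and spreads the score of pages without out-links uniformly; neither detail is spelled out on the slide.

```python
def pagerank(out_links, q=0.15, iterations=50):
    """Estimate Pagerank by fixed-point iteration on a small directed graph.

    out_links maps each page to the list of pages it links to.
    q is the random-jump probability (0.15 is an assumed, commonly used value).
    """
    pages = sorted(set(out_links) | {t for ts in out_links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Assumption: score held by pages without out-links is spread uniformly.
        dangling = sum(rank[p] for p in pages if not out_links.get(p))
        new_rank = {p: q / n + (1 - q) * dangling / n for p in pages}
        for src in pages:
            targets = out_links.get(src, [])
            for dst in targets:
                # Assumption: each page splits its score evenly over its out-links.
                new_rank[dst] += (1 - q) * rank[src] / len(targets)
        rank = new_rank
    return rank


# Hypothetical four-page graph; "d" has no in-links, so it keeps only q / N.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(toy_graph)
```

In this toy graph page "c" collects links from three pages and ends up with the highest score, while "d" stays at the random-jump floor q / N.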

Link ranking: Hubs & Authorities

• HITS algorithm (Kleinberg, 1998)

• A good authority is a page pointed to by good hubs, so we assume that it has good content

• A good hub is a page that points to good authorities, so we assume it is a good set of links

• Linear system calculated by numerical iteration
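
The mutual definition of hubs and authorities above can be sketched as a simple numerical iteration. This is an illustrative version on a whole toy graph, with hypothetical page names and an assumed iteration count and L2 normalisation; it is not the exact procedure used for the .CL collection.

```python
import math


def hits(out_links, iterations=50):
    """Return (hub, authority) score dicts for a small directed graph."""
    pages = sorted(set(out_links) | {t for ts in out_links.values() for t in ts})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages pointing to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in out_links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub score: sum of the authority scores of the pages it points to.
        hub = {p: sum(auth[dst] for dst in out_links.get(p, [])) for p in pages}
        # Normalise both vectors to unit length so the values stay bounded.
        auth_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        hub_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / auth_norm for p, v in auth.items()}
        hub = {p: v / hub_norm for p, v in hub.items()}
    return hub, auth


# Hypothetical graph: three link pages (h1..h3) pointing to two content pages.
toy_graph = {"h1": ["x", "y"], "h2": ["x", "y"], "h3": ["x"]}
hub_scores, auth_scores = hits(toy_graph)
```

Here "x" receives links from all three hubs and gets the top authority score, and "h1" and "h2", which point to both content pages, score higher as hubs than "h3".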

Link ranking: Distribution

• 9% with relevant hub score

• 2-3% with relevant authority score

• <2% with relevant Pagerank

Link ranking: Correlation

Hub score, authority score and Pagerank do not seem to be correlated

Link ranking: Sites

• Which measure to use for sites?

• Average score

– But good sites can have lots of bad pages

• Maximum score

– But one good page cannot be all that is needed to be a good site

• Sum of the scores of all pages

– Natural for Pagerank
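
The three candidate site measures above can be compared side by side with a short sketch. The URLs and page scores below are made up for illustration, and grouping pages by host name is an assumption about what counts as a site.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def site_scores(page_scores):
    """Aggregate per-page scores into average, maximum and sum per site."""
    by_site = defaultdict(list)
    for url, score in page_scores.items():
        # Assumption: a "site" is identified by the host part of the URL.
        by_site[urlsplit(url).netloc].append(score)
    return {
        site: {
            "average": sum(values) / len(values),
            "maximum": max(values),
            "sum": sum(values),
        }
        for site, values in by_site.items()
    }


# Hypothetical scores: site-a has one strong page and two weak ones,
# site-b has two mid-range pages.
page_scores = {
    "http://site-a.example/": 0.40,
    "http://site-a.example/old": 0.01,
    "http://site-a.example/tmp": 0.01,
    "http://site-b.example/": 0.15,
    "http://site-b.example/docs": 0.15,
}
result = site_scores(page_scores)
```

With these made-up numbers the average favours site-b, because the weak pages drag site-a down, while the maximum and the sum both favour site-a. The choice of measure changes the ranking, which is the trade-off the bullets describe.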

Link ranking: Sites Graph

90% relevant site-Pagerank

It’s harder to have a good hub than a good authority (site)

Web Structure: Basis

• The Web graph has structure:

(Diagram of components: IN, MAIN, OUT, ISLANDS)
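
One way such a decomposition could be computed is sketched below, following the strongly-connected-component view of Broder et al.: MAIN is taken as the largest strongly connected component, IN as the nodes that can reach it, and OUT as the nodes reachable from it. Lumping every remaining node into ISLANDS is a simplifying assumption, the node names are hypothetical, and the quadratic search is only suitable for toy graphs.

```python
def reachable(start_nodes, adjacency):
    """Return every node reachable from start_nodes (inclusive) via adjacency."""
    seen = set(start_nodes)
    stack = list(start_nodes)
    while stack:
        node = stack.pop()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def bow_tie(out_links):
    """Split a directed graph into MAIN, IN, OUT and ISLANDS node sets."""
    nodes = set(out_links) | {t for ts in out_links.values() for t in ts}
    in_links = {}
    for src, targets in out_links.items():
        for dst in targets:
            in_links.setdefault(dst, []).append(src)
    # MAIN: largest strongly connected component, found by intersecting
    # forward and backward reachability from one seed per component.
    main = set()
    unassigned = set(nodes)
    while unassigned:
        seed = unassigned.pop()
        scc = reachable([seed], out_links) & reachable([seed], in_links)
        unassigned -= scc
        if len(scc) > len(main):
            main = scc
    out_part = reachable(main, out_links) - main
    in_part = reachable(main, in_links) - main
    # Assumption: everything not connected to MAIN is reported as ISLANDS.
    islands = nodes - main - in_part - out_part
    return {"MAIN": main, "IN": in_part, "OUT": out_part, "ISLANDS": islands}


# Hypothetical graph: a 3-cycle core, one feeder, one sink, one detached pair.
toy_graph = {
    "i1": ["m1"],
    "m1": ["m2"],
    "m2": ["m3"],
    "m3": ["m1", "o1"],
    "o1": [],
    "x1": ["x2"],
}
parts = bow_tie(toy_graph)
```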

Web Structure: Basis (cont.)

• The MAIN component has structure:

(Diagram: IN and OUT around MAIN, which is subdivided into MAIN-NORM, MAIN-IN, MAIN-MAIN and MAIN-OUT)

Web Structure: Sketch

Web Structure: Degree

Web Structure: Sizes

Web Structure: Preferences

Web Structure: Preferences

(Figure: three panels labelled Real, ODP and TodoCL, with components OUT, MAIN-MAIN and MAIN-OUT marked)

Web Structure: Various

Web Structure: Link Scores

Web Dynamics: Ages

• The kernel of the Web comes from the past

Web Dynamics: By Component

Web Dynamics: Pagerank

Pagerank is biased against newer pages

Web Dynamics: Hubs & Authorities

(Plots: authority score and hub score versus age in months)

Conclusions

• Pagerank/HITS do not seem to be correlated

– And Pagerank is biased towards older pages

• Site ranking can help to make good human-selected directories

• Finding good pages is not so simple

• Characterizing Web structure gives valuable insight

– Web Graph Mining is just starting