Post on 23-Dec-2015

Undue Influence: Eliminating the Impact of Link Plagiarism on Web Search Rankings

Baoning Wu and Brian D. Davison

Lehigh University

Symposium on Applied Computing 2006

Motivation

Link-based ranking algorithms are important to current popular search engines (e.g., HITS for Teoma).

Link farms degrade the performance of link-based ranking algorithms.

HITS algorithm

Each page has two measures: an authority score a, which shows how good the page is for a query, and a hub score h, which shows how likely the page is to point to good authority pages. E is the adjacency matrix.

a = E^T h

h = E a
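The two update rules can be sketched as a simple power iteration. This is a minimal illustration, not the authors' implementation: the function names, the edge-list representation, the L2 normalization after each step, and the toy graph are all assumptions made for the example.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


def hits(edges, n, iterations=50):
    """Iterate a = E^T h and h = E a over a list of (source, target) edges."""
    a = [1.0] * n
    h = [1.0] * n
    for _ in range(iterations):
        # a = E^T h: a page's authority is the sum of the hub scores pointing to it.
        new_a = [0.0] * n
        for src, dst in edges:
            new_a[dst] += h[src]
        a = normalize(new_a)
        # h = E a: a page's hub score is the sum of the authority scores it points to.
        new_h = [0.0] * n
        for src, dst in edges:
            new_h[src] += a[dst]
        h = normalize(new_h)
    return a, h


# Hypothetical toy graph: pages 0 and 1 link to page 2, and page 0 also links to page 3.
edges = [(0, 2), (1, 2), (0, 3)]
authority, hub = hits(edges, n=4)
print(authority)
print(hub)
```

In this toy graph page 2 ends up with the highest authority score and page 0 with the highest hub score, which is why a group of pages that all link to each other can inflate both scores.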

Example: for query “weather”
http://www.tripadvisor.com/
http://www.virtualtourist.com/
http://www.abed.com/memoryfoam.html
http://www.abed.com/furniture.html
http://www.rental-car.us/
http://www.accommodation-specials.com/
http://www.lasikeyesurgery.com/
http://www.lasikeyesurgery.com/lasik-surgery.asp
http://mortgage-rate-refinancing.com/
http://mortgage-rate-refinancing.com/mortgage-calculator.html

Factors that degrade HITS

Mutually reinforcing relationships

Duplicate pages

Link farms

Complete hyperlink

Definition: a link together with its anchor text, treated as a unit.

Duplication of a complete link is a much stronger sign of copying behavior on the Web than a duplicate link target.
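As a rough illustration of the definition, the sketch below extracts (target URL, anchor text) pairs from a page with Python's standard `html.parser`. The class and function names, the sample pages, and the choice to lowercase and collapse whitespace in the anchor text are assumptions for the example; the slides do not say how anchor text is normalized.

```python
from html.parser import HTMLParser


class CompleteLinkParser(HTMLParser):
    """Collect (href, anchor text) pairs, one per <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> element currently open, if any
        self._text = []     # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Assumed normalization: collapse whitespace and lowercase the anchor text.
            anchor = " ".join("".join(self._text).split()).lower()
            self.links.append((self._href, anchor))
            self._href = None


def complete_links(html):
    """Return the set of complete links (target, anchor text) found in an HTML string."""
    parser = CompleteLinkParser()
    parser.feed(html)
    parser.close()
    return set(parser.links)


# Hypothetical pages: both link to the same target, but only one anchor text matches.
page_a = '<a href="http://example.com/">Cheap Cars</a>'
page_b = '<a href="http://example.com/">cheap cars</a> <a href="http://example.com/">Home</a>'
shared = complete_links(page_a) & complete_links(page_b)
print(shared)
```

Here the two pages share one complete link, while the "Home" link duplicates only the target and is not counted as a shared complete link.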

Document - Complete link Matrix
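One way to picture this matrix: rows are documents, columns are distinct complete links, and an entry is 1 when the document contains that complete link. The sketch below builds such a binary matrix from a mapping of documents to their complete links. The function name, the document identifiers, and the sample links are hypothetical.

```python
def build_document_link_matrix(doc_links):
    """Build a binary document x complete-link matrix.

    doc_links maps a document id to a set of (target URL, anchor text) pairs.
    Returns (row labels, column labels, matrix as a list of rows).
    """
    docs = sorted(doc_links)
    columns = sorted({link for links in doc_links.values() for link in links})
    col_index = {link: j for j, link in enumerate(columns)}
    matrix = [[0] * len(columns) for _ in docs]
    for i, doc in enumerate(docs):
        for link in doc_links[doc]:
            matrix[i][col_index[link]] = 1
    return docs, columns, matrix


# Hypothetical data: p1 and p2 share two complete links, p3 has one of its own.
doc_links = {
    "p1": {("http://x.example/", "cheap cars"), ("http://y.example/", "hotels")},
    "p2": {("http://x.example/", "cheap cars"), ("http://y.example/", "hotels")},
    "p3": {("http://x.example/", "car rental")},
}
docs, columns, matrix = build_document_link_matrix(doc_links)
for doc, row in zip(docs, matrix):
    print(doc, row)
```

Note that p3 links to the same target as p1 and p2 but with different anchor text, so it occupies a separate column.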

Bipartite Graph

Two disjoint sets X and Y; each edge starts from an element in X and ends at an element in Y.

Link farms

Link farms are usually densely connected via multiple overlapping small bipartite cores.

Task: detect densely connected bipartite components in the “document - complete link” matrix.

Algorithm for finding bipartite components

Result: k=2 and l=2
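The transcript does not preserve the steps of the algorithm, so the following is only one plausible sketch of such a pruning, under stated assumptions: k is taken to be the minimum number of documents that must share a complete link, l the minimum number of shared complete links a document must contain, and columns and rows below those thresholds are zeroed repeatedly until nothing changes. The function name and the sample matrix are hypothetical.

```python
def prune_to_bipartite_components(matrix, k=2, l=2):
    """Zero out sparse columns and rows until the matrix stops changing.

    Assumed thresholds: a complete link (column) must appear in at least k
    documents, and a document (row) must keep at least l such links.
    """
    m = [row[:] for row in matrix]  # work on a copy
    n_cols = len(m[0]) if m else 0
    changed = True
    while changed:
        changed = False
        # Drop complete links shared by fewer than k documents.
        for j in range(n_cols):
            col_sum = sum(row[j] for row in m)
            if 0 < col_sum < k:
                for row in m:
                    row[j] = 0
                changed = True
        # Drop documents left with fewer than l surviving complete links.
        for row in m:
            if 0 < sum(row) < l:
                for j in range(n_cols):
                    row[j] = 0
                changed = True
    return m


# Hypothetical matrix: rows 0 and 1 share columns 1 and 2; row 2 has a unique link.
matrix = [
    [0, 1, 1],
    [0, 1, 1],
    [1, 0, 0],
]
pruned = prune_to_bipartite_components(matrix, k=2, l=2)
for row in pruned:
    print(row)
```

With k=2 and l=2, only the block where two documents share two complete links survives; the unique link in the last row is removed.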

Adjustment: document-document matrix

Final matrix

Weighted adjacency matrix

Experiment: HITS result of “rental car”
http://www.discountcars.net/
http://www.motel-discounts.com/
http://www.stlouishoteldeals.com/
http://www.richmondhoteldeals.com/
http://www.jacksonvillehoteldeals.com/
http://www.jacksonhoteldeals.com/
http://www.keywesthoteldeals.com/
http://www.austinhoteldeals.com/
http://www.gatlinburghoteldeals.com/
http://www.ashevillehoteldeals.com/

Experiment: B&H HITS result of “rental car”
http://www.rentadeal.com/
http://www.allaboutstlouis.com/
http://www.allaboutboston.com/
https://travel2.securesites.com/about_travelguides/addlisting.html
http://www.allaboutsanfranciscoca.com/
http://www.allaboutwashingtondc.com/
http://www.allaboutalbuquerque.com/
http://www.allabout-losangeles.com/
http://www.allabout-denver.com/
http://www.allabout-chicago.com/

Experiment: CL-HITS result of “rental car”
http://www.hertz.com/
http://www.avis.com/
http://www.nationalcar.com/
http://www.thrifty.com/
http://www.dollar.com/
http://www.alamo.com/
http://www.budget.com/
http://www.enterprise.com/
http://www.budgetrentacar.com/
http://www.europcar.com/

Experiment: B&H HITS result of “translation online”
http://www.no-gambling.com/
http://www.teleorg.org/
http://ong.altervista.org/
http://bx.b0x.com/
http://video-poker.batcave.net/
http://www.websamba.com/marketing-campaigns
http://online-casino.o-f.com/
http://caribbean-poker.webxis.com/
http://roulette.zomi.net/
http://teleservices.netfirms.com/

Experiment: CL-HITS result of “translation online”
http://www.freetranslation.com/
http://www.systransoft.com/
http://babelfish.altavista.com/
http://www.yourdictionary.com/
http://dictionaries.travlang.com/
http://www.google.com/
http://www.foreignword.com/
http://www.babylon.com/
http://www.worldlingo.com/products_services/worldlingo_translator.html
http://www.allwords.com/

Duplicate example: BH-HITS result of “maps”
http://www.maps.com/
http://www.mapsworldwide.com/
http://www.cartographic.com/
http://www.amaps.com/
http://www.cdmaps.com/
http://www.ewpnet.com/maps.htm
http://mapsguidesandmore.com/
http://www.njdiningguide.com/maps.html
http://www.stanfords.co.uk/
http://www.delorme.com/

Duplicate example: CL-HITS result of “maps”
http://www.maps.com/
http://maps.yahoo.com/
http://www.delorme.com/
http://tiger.census.gov/
http://www.davidrumsey.com/
http://memory.loc.gov/ammem/gmdhtml/gmdhome.html
http://www.esri.com/
http://www.maptech.com/
http://www.streetmap.co.uk/
http://www.libs.uga.edu/darchive/hargrett/maps/maps.html

User evaluation

Category             HITS     BHITS    CL-HITS  CL-POP
Quite relevant       12.9%    24.5%    48.4%    46.3%
Relevant             10.7%    18.3%    28.8%    26.2%
Not sure              6.6%    10.5%     6.7%     6.4%
Irrelevant           26.8%    14.8%    11.3%    12.7%
Totally irrelevant   42.8%    31.9%     4.6%     8.1%
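To compare the algorithms at a glance, the user-evaluation percentages above can be collapsed into a single share of results judged relevant. The sketch below does that arithmetic. Treating “Quite relevant” plus “Relevant” as the relevant share is an assumption made here for illustration, and the variable names are hypothetical; the slides do not state how the categories are combined.

```python
# Percentages copied from the user-evaluation table.
ratings = {
    "HITS":    {"Quite relevant": 12.9, "Relevant": 10.7, "Not sure": 6.6,
                "Irrelevant": 26.8, "Totally irrelevant": 42.8},
    "BHITS":   {"Quite relevant": 24.5, "Relevant": 18.3, "Not sure": 10.5,
                "Irrelevant": 14.8, "Totally irrelevant": 31.9},
    "CL-HITS": {"Quite relevant": 48.4, "Relevant": 28.8, "Not sure": 6.7,
                "Irrelevant": 11.3, "Totally irrelevant": 4.6},
    "CL-POP":  {"Quite relevant": 46.3, "Relevant": 26.2, "Not sure": 6.4,
                "Irrelevant": 12.7, "Totally irrelevant": 8.1},
}

# Assumption: a result counts as relevant if rated "Quite relevant" or "Relevant".
relevant_share = {
    name: round(r["Quite relevant"] + r["Relevant"], 1)
    for name, r in ratings.items()
}

for name, share in relevant_share.items():
    print(f"{name}: {share}% rated relevant or better")
```

Under this assumption, the two complete-link variants have the largest relevant share and plain HITS the smallest.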

Discussion

Using the link alone, the precision at 10 is 66.4%, much lower than when using the “complete link”.

Random anchor texts.

Questions?

baw4@cse.lehigh.edu davison@cse.lehigh.edu