Recognizing Communities on the Web CS349 Presentation by Audrey Kao Recognizing Nepotistic Links on the Web, Brian D. Davison. Self-Organization of the Web and Identification of Communities, Gary William Flake, Steve Lawrence, C. Lee Giles, Frans M. Coetzee.

Page 1:

Recognizing Communities on the Web

CS349 Presentation by

Audrey Kao

Recognizing Nepotistic Links on the Web, Brian D. Davison.

Self-Organization of the Web and Identification of Communities, Gary William Flake, Steve Lawrence, C. Lee Giles, Frans M. Coetzee.

Page 2:

Introduction

How do links determine web communities?

Natural community formation vs. web authors manipulating nepotistic links

Graph-theoretic approach vs. machine-learning approach

Both papers are fairly dated: Davison's is from 2000, Flake et al.'s from 2002

Page 3:

What is a Web Community?

Goal: To identify web communities. Why? For practical applications and web analysis

A collection of web pages where each member page has more links within the community than outside the community.
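The definition above can be checked mechanically. The following is a minimal sketch, assuming links are given as (source, target) pairs and that link direction is ignored when counting; the function name `is_community` and the toy link list are illustrative and do not come from either paper.

```python
from collections import defaultdict

def is_community(members, links):
    """Return True if every member has more links inside `members` than outside.

    `links` is an iterable of (source, target) pairs; direction is ignored
    (an assumption made for this sketch).
    """
    members = set(members)
    inside = defaultdict(int)
    outside = defaultdict(int)
    for u, v in links:
        # Look at the link from both endpoints' point of view.
        for page, other in ((u, v), (v, u)):
            if page in members:
                if other in members:
                    inside[page] += 1
                else:
                    outside[page] += 1
    return all(inside[p] > outside[p] for p in members)

# Toy link graph (invented for illustration).
links = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "x"), ("x", "y")]
print(is_community({"a", "b", "c"}, links))  # True
print(is_community({"c", "x"}, links))       # False
```

In the second call, page c has one link inside the candidate set and two outside, so the set fails the definition.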

Page 4:

Maximum Flow Communities

• Given a directed graph G = (V, E) with edge capacities c(u, v) ∈ ℤ⁺ and two vertices s, t ∈ V, find the maximum flow that can be routed from the source, s, to the sink, t, while obeying all capacity constraints

• The max-flow min-cut theorem states that the value of the maximum flow equals the capacity of the minimum cut that separates s and t
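To make the max-flow/min-cut relationship concrete, here is a small sketch using the Edmonds-Karp algorithm (breadth-first augmenting paths). This is one standard way to compute a maximum flow and is an assumption of this example, not necessarily the procedure used in the paper; the graph and its capacities are invented. The vertices still reachable from s in the final residual graph form the source side of a minimum cut, which is the candidate "community" in a flow-based formulation.

```python
from collections import defaultdict, deque

def max_flow(capacity, s, t):
    """Edmonds-Karp: return (flow value, source side of a minimum cut)."""
    residual = defaultdict(lambda: defaultdict(int))
    for u, nbrs in capacity.items():
        for v, c in nbrs.items():
            residual[u][v] += c
    flow = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in list(residual[u].items()):
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            break  # no augmenting path left: the flow is maximum
        # Find the bottleneck capacity along the path.
        bottleneck = float("inf")
        v = t
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        # Push flow along the path and update reverse edges.
        v = t
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck
    # After the last (failed) search, `parent` holds exactly the vertices
    # reachable from s in the residual graph.
    return flow, set(parent)

capacity = {
    "s": {"a": 2, "b": 2},
    "a": {"b": 1, "c": 1},
    "b": {"c": 1},
    "c": {"t": 5},
}
flow, side = max_flow(capacity, "s", "t")
print(flow, sorted(side))  # 2 ['a', 'b', 's']
```

The cut separating {s, a, b} from {c, t} crosses the edges a→c and b→c, whose capacities sum to 2, matching the flow value.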

Page 5:

Exact vs. Approximate Flow Communities

Exact:

• The “sink” is artificial and generic, i.e. it receives an edge from every other vertex

• Accepts any bi-directional link

• The community is highly connected internally, but isolated from the rest of the graph

Approximate:

• Determined by a fixed-depth crawl

• Uses the exact-flow-community algorithm, then chooses the highest-ranked sites and repeats the algorithm

• Rank is determined by the number of edges a site has within the community

• This model was used for the study, as it better represents the actual web

Score determined by the total number of inbound and outbound links a page has to other pages in its community…
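The scoring rule described above can be sketched as follows, assuming directed links given as (source, target) pairs and ignoring self-links. The function name `community_scores` and the page names are hypothetical.

```python
def community_scores(members, links):
    """Score each member by its inbound plus outbound links to other members."""
    members = set(members)
    scores = {page: 0 for page in members}
    for source, target in links:
        if source in members and target in members and source != target:
            scores[source] += 1  # outbound link within the community
            scores[target] += 1  # inbound link within the community
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Toy data (invented): "elsewhere" is outside the community.
toy_links = [
    ("home", "bio"),
    ("bio", "home"),
    ("paper", "home"),
    ("home", "elsewhere"),
]
print(community_scores({"home", "bio", "paper"}, toy_links))
# [('home', 3), ('bio', 2), ('paper', 1)]
```

The link to "elsewhere" does not count toward the score of "home" because its target lies outside the community.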

Page 6:

Sample Results

Francis Crick Community
80 Biography of Francis Harry Compton Crick (Nobel Foundation)
79 Biography of James Dewey Watson (Nobel Foundation)
51 The Nobel Prize in Physiology or Medicine 1962 (Nobel Foundation)
50 Biographical Sketch of James Dewey Watson (Cold Spring Harbor Lab.)
41 A structure for Deoxyribose Nucleic Acid (Nature, April 2, 1953)
...
1 Felix D’Herelle and the Origins of Molecular Biology (Amazon.com)
1 Biography of Gregor Mendel
1 Magazine: HMS Beagle Home
1 The Alfred Russel Wallace Page
1 U.S. Human Genome Project 5 Year Plan

Stephen Hawking Community
85 Professor Stephen W. Hawking’s web pages
46 Stephen Hawking’s Universe at PBS
17 The Stephen Hawking Pages
15 Stephen Hawking Builds Robotic Exoskeleton (parody at the Onion)
14 Stephen Hawking and Intel
...
1 Did the cosmos arise from nothing? MSNBC story
1 Spanish page for Stephen Hawking’s Universe
1 Relativity Group at DAMTP, Cambridge
1 Millennium Mathematics Project
1 Particle Physics Education and Information Sites

Ronald Rivest Community
86 Ronald L. Rivest : Home Page
29 Chaffing and Winnowing: Confidentiality without Encryption
20 Thomas H. Cormen’s home page at Dartmouth
9 The Mathematical Guts of RSA Encryption
8 German news story on Cryptography
...
1 Phil Zimmermann’s PGP web page
1 A Very Brief History of Computer Science
1 Cormen / Leiserson / Rivest: Introduction to Algorithms
1 Security and Encryption Links
1 HotBot Directory: Computers & Internet, Computer Science, People: R

Community: Most Significant Text Features

Francis Crick: crick, nobel, dna, “francis crick”, “the nobel”, “of dna”, watson, “james watson”, francis, molecular, biology, genetics, “watson and”, “structure of”, “crick and”

Stephen Hawking: hawking, “stephen hawking”, stephen, “hawking s”, “s universe”, physics, “black holes”, “the universe”, cambridge, cosmology, einstein, relativity, damtp, “universe the”

Ronald Rivest: rivest, “l rivest”, “ronald l”, ronald, cryptography, rsa, “ron rivest”, lcs, “theory lcs”, encryption, “lcs mit”, theory, chaffing, winnowing, crypto

Page 7:

Results, cont'd

Community members are strongly related by topic, as measured with simple binary text classifiers

The study used three-term binary classifiers to characterize communities: crick or nobel or darwin matched 54% of the Francis Crick community but only 0.5% of random web pages; hawking or relativity or “for mathematical” matched 84% of the Stephen Hawking community but only 0.2% of random pages

Breadth-first crawling strategies alone do not yield topically relevant pages (only 10% of pages at a depth of two matched the classification rules)
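A three-term "or" classifier of this kind is simple to express. The sketch below assumes case-insensitive substring matching, which is a simplification (it would also match "crick" inside "cricket"), and uses a handful of invented page texts rather than the study's data.

```python
def matches(text, terms):
    """True if any of the terms occurs in the text (case-insensitive substring)."""
    lowered = text.lower()
    return any(term in lowered for term in terms)

def match_rate(pages, terms):
    """Fraction of pages matched by the disjunction of `terms`."""
    if not pages:
        return 0.0
    return sum(matches(page, terms) for page in pages) / len(pages)

crick_rule = ("crick", "nobel", "darwin")

# Toy page texts (invented for illustration only).
community_pages = [
    "Francis Crick biography",
    "The Nobel Prize 1962",
    "DNA structure",
    "Darwin and evolution",
]
random_pages = ["Weather forecast", "Used cars for sale"]

print(match_rate(community_pages, crick_rule))  # 0.75
print(match_rate(random_pages, crick_rule))     # 0.0
```

Comparing the match rate on community pages with the rate on random pages is what makes percentages such as 54% vs. 0.5% meaningful.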

Page 8:

What are Nepotistic Links?

Nepotistic Links: Links between pages that are present for reasons other than merit

Sites that are under the same administrative control, like About.com

Advertising/paid links

Note: different from duplicate pages or mirrored sites

Eba6.com

Mapquesy.com

Page 9:

Preliminary Experiments

Two data sets were used:

1. 1536 arbitrarily selected, manually labeled links

2. 750 random links from the DiscoWeb search engine’s 7 million pages, also manually labeled as either nepotistic or not

75 binary features were used, for example:

• Identical page titles or descriptions?

• Page descriptions overlap by at least some percentage of the text?

• Identical complete host names?

• Some number of initial IP address octets identical?

• Pages share at least some percentage of outgoing links?

• Domains have the same contact email address?
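A few of these binary features could be computed along the following lines. This is only a sketch: the record fields (`title`, `host`, `ip`, `outlinks`), the three-octet IP prefix, and the 50% overlap threshold are all assumptions chosen for illustration, and the hosts and addresses are reserved example values.

```python
def link_features(src, dst, ip_octets=3, overlap_threshold=0.5):
    """Compute a few yes/no features for a link from page `src` to page `dst`."""
    src_out, dst_out = set(src["outlinks"]), set(dst["outlinks"])
    union = src_out | dst_out
    # Share of outgoing links the two pages have in common (Jaccard overlap).
    shared = len(src_out & dst_out) / len(union) if union else 0.0
    return {
        "same_title": src["title"] == dst["title"],
        "same_host": src["host"] == dst["host"],
        "same_ip_prefix": (
            src["ip"].split(".")[:ip_octets] == dst["ip"].split(".")[:ip_octets]
        ),
        "shared_outlinks": shared >= overlap_threshold,
    }

# Hypothetical page records.
src_page = {
    "title": "Widgets Home",
    "host": "a.example.com",
    "ip": "192.0.2.10",
    "outlinks": ["x", "y", "z"],
}
dst_page = {
    "title": "Gadgets Home",
    "host": "b.example.com",
    "ip": "192.0.2.77",
    "outlinks": ["y", "z", "w"],
}
features = link_features(src_page, dst_page)
print(features)
```

Here the two pages share an IP prefix and half of their combined outgoing links, but differ in title and host name, so two of the four features are true.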

Page 10:

Machine Learning

The C4.5 decision tree package was used to learn classification rules over the binary features
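C4.5 chooses which feature to split on by gain ratio (information gain divided by the entropy of the split itself). The sketch below shows only that root-split selection on four invented, hand-labeled links; it is not the C4.5 package and omits pruning, missing values, and continuous attributes.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())

def gain_ratio(rows, labels, feature):
    """Information gain of splitting on `feature`, normalized by split entropy."""
    total = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy([row[feature] for row in rows])
    return gain / split_info if split_info > 0 else 0.0

# Toy training data (invented): one row of binary features per labeled link.
rows = [
    {"same_host": True, "same_title": False},
    {"same_host": True, "same_title": True},
    {"same_host": False, "same_title": False},
    {"same_host": False, "same_title": True},
]
labels = ["nepotistic", "nepotistic", "good", "good"]

best = max(rows[0], key=lambda f: gain_ratio(rows, labels, f))
print(best)  # same_host
```

In this toy set `same_host` separates the two labels perfectly (gain ratio 1.0), while `same_title` carries no information (gain ratio 0.0), so it would be chosen as the root test.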

Page 11:

Results

Page 12:

Results, cont'd

Links can be classified more accurately if already-categorized search engine results are used as “training data”

The second data set was too small and does not represent the variety of sites on the web

Nepotistic links largely do not affect popular pages

Page 13:

Conclusions

Both experiments focused on binary classifiers

Naïve researchers: the scale of the web is too large to run either of these algorithms on it, and both studies used small sample sizes to begin with