Recognizing Communities on the Web CS349 Presentation by Audrey Kao Recognizing Nepotistic Links on...
Recognizing Communities on the Web
CS349 Presentation by
Audrey Kao
Recognizing Nepotistic Links on the Web, Brian D. Davison.
Self-Organization of the Web and Identification of Communities, Gary William Flake, Steve Lawrence, C. Lee Giles, Frans M. Coetzee.
Introduction
How do links determine web communities?
Natural community formation vs. web authors manipulating nepotistic links
Theoretical graph theory vs. artificial learning program
Both papers are fairly dated, from 2002
What is a Web Community?
Goal: To identify web communities. Why? For practical applications and web analysis
A collection of web pages where each member page has more links within the community than outside the community.
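The definition above can be checked directly on a small link graph. The sketch below is illustrative only: the function name `is_community` and the toy edge list are assumptions, and links are counted without regard to direction, which the slide does not specify.

```python
def is_community(members, links):
    """Return True if every member has more links inside the set than outside it.

    Assumption: a link counts for a page whether the page is its source or its target.
    """
    members = set(members)
    for page in members:
        inside = outside = 0
        for u, v in links:
            if u == v or page not in (u, v):
                continue  # skip self-links and links that do not touch this page
            other = v if u == page else u
            if other in members:
                inside += 1
            else:
                outside += 1
        if inside <= outside:
            return False
    return True


# Hypothetical toy graph: a, b, c link to each other; a also links out to x.
links = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "x"), ("x", "y")]
print(is_community({"a", "b", "c"}, links))
print(is_community({"a", "x"}, links))
```

The first set satisfies the definition; the second does not, because page a has more links leaving the set than staying inside it.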
Maximum Flow Communities
• Given a directed graph G = (V, E), with edge capacities c(u, v) ∈ Z+, and two vertices s, t ∈ V, find the maximum flow that can be routed from the source, s, to the sink, t, while obeying all capacity constraints
• The max-flow min-cut theorem states that the value of the maximum flow from s to t equals the capacity of the minimum cut that separates s and t
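To make the max-flow/min-cut relationship concrete, here is a minimal sketch using the Edmonds-Karp method (shortest augmenting paths). This is a standard textbook algorithm chosen for illustration, not necessarily the one used in the paper; the function name `max_flow` and the toy capacities are assumptions.

```python
from collections import defaultdict, deque


def max_flow(capacity, s, t):
    """Edmonds-Karp: augment along shortest residual paths until none remain.

    Returns (flow value, set of vertices on the source side of a minimum cut).
    """
    residual = defaultdict(lambda: defaultdict(int))
    for u, nbrs in capacity.items():
        for v, c in nbrs.items():
            residual[u][v] += c
            residual[v][u] += 0  # make sure the reverse edge exists
    flow = 0
    while True:
        # Breadth-first search for an augmenting path in the residual graph.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            break  # no path left: parent holds the source side of a min cut
        # Find the bottleneck capacity along the path, then push that much flow.
        bottleneck = float("inf")
        v = t
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck
    return flow, set(parent)


# Hypothetical toy network with integer capacities.
capacity = {"s": {"a": 3, "b": 2}, "a": {"b": 1, "t": 2}, "b": {"t": 3}}
flow, source_side = max_flow(capacity, "s", "t")
cut = sum(c for u, nbrs in capacity.items() for v, c in nbrs.items()
          if u in source_side and v not in source_side)
print(flow, cut)
```

The two printed numbers agree, which is the theorem in miniature. The vertices still reachable from s once no augmenting path remains form the source side of the cut, and in the flow-community setting that side is the candidate community.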
Exact vs. Approximate Flow Communities
Exact:
The “sink” is artificial and generic, i.e. it receives an edge from every other vertex
Accepts any bi-directional link
The community is very connected internally, but isolated from the rest of the graph
Approximate:
Determined by a fixed-depth crawl
Uses the exact-flow-community algorithm, then chooses the highest-ranked sites and repeats the algorithm
Rank determined by the number of edges a site has within the community
This model was used for the study as it better represents the actual web
Score determined by total # of inbound and outbound links a page has to other pages in its community…
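The score described above is easy to sketch: count, for each member page, the links it sends to and receives from other members. The function name `community_scores` and the sample pages are hypothetical.

```python
from collections import Counter


def community_scores(members, links):
    """Score each member by its inbound plus outbound links to other members."""
    members = set(members)
    scores = Counter({page: 0 for page in members})
    for src, dst in links:
        if src != dst and src in members and dst in members:
            scores[src] += 1  # outbound link within the community
            scores[dst] += 1  # inbound link within the community
    return scores.most_common()  # highest-scoring pages first


# Hypothetical directed links; "ext" lies outside the community and is ignored.
links = [("home", "bio"), ("bio", "home"), ("news", "home"), ("home", "ext")]
ranking = community_scores({"home", "bio", "news"}, links)
print(ranking)
```

Links to pages outside the community do not contribute, so a page that is popular only externally ranks low.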
Sample Results

Francis Crick Community
80 Biography of Francis Harry Compton Crick (Nobel Foundation)
79 Biography of James Dewey Watson (Nobel Foundation)
51 The Nobel Prize in Physiology or Medicine 1962 (Nobel Foundation)
50 Biographical Sketch of James Dewey Watson (Cold Spring Harbor Lab.)
41 A structure for Deoxyribose Nucleic Acid (Nature, April 2, 1953)
...
1 Felix D’Herelle and the Origins of Molecular Biology (Amazon.com)
1 Biography of Gregor Mendel
1 Magazine: HMS Beagle Home
1 The Alfred Russel Wallace Page
1 U.S. Human Genome Project 5 Year Plan

Stephen Hawking Community
85 Professor Stephen W. Hawking’s web pages
46 Stephen Hawking’s Universe at PBS
17 The Stephen Hawking Pages
15 Stephen Hawking Builds Robotic Exoskeleton (parody at the Onion)
14 Stephen Hawking and Intel
...
1 Did the cosmos arise from nothing? MSNBC story
1 Spanish page for Stephen Hawking’s Universe
1 Relativity Group at DAMTP, Cambridge
1 Millennium Mathematics Project
1 Particle Physics Education and Information Sites

Ronald Rivest Community
86 Ronald L. Rivest : Home Page
29 Chaffing and Winnowing: Confidentiality without Encryption
20 Thomas H. Cormen’s home page at Dartmouth
9 The Mathematical Guts of RSA Encryption
8 German news story on Cryptography
...
1 Phil Zimmermann’s PGP web page
1 A Very Brief History of Computer Science
1 Cormen / Leiserson / Rivest: Introduction to Algorithms
1 Security and Encryption Links
1 HotBot Directory: Computers & Internet, Computer Science, People: R
Community Most Significant Text Features
Francis Crick: crick, nobel, dna, “francis crick”, “the nobel”, “of dna”, watson, “james watson”, francis, molecular, biology, genetics, “watson and”, “structure of”, “crick and”
Stephen Hawking: hawking, “stephen hawking”, stephen, “hawking s”, “s universe”, physics, “black holes”, “the universe”, cambridge, cosmology, einstein, relativity, damtp, “universe the”
Ronald Rivest: rivest, “l rivest”, “ronald l”, ronald, cryptography, rsa, “ron rivest”, lcs, “theory lcs”, encryption, “lcs mit”, theory, chaffing, winnowing, crypto
Results, cont’d
Communities are strongly topically related, as shown by simple binary text classifiers
Study used three-term binary classifiers to measure how topically focused each community is: crick or nobel or darwin (matches 54% of the Francis Crick community, but only 0.5% of random web pages); hawking or relativity or “for mathematical” (84% of the Stephen Hawking community, 0.2% of random pages)
Breadth-first crawling strategies do not yield topically relevant pages (only 10% of pages at a depth of two matched the classification rules)
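A three-term rule of this kind is a disjunction: a page matches if it contains any one of the terms. The sketch below shows how such a rule and its match rate could be computed. The helper names and the tiny page lists are made up for illustration and do not reproduce the study's data or percentages.

```python
import re


def make_classifier(terms):
    """Build a disjunctive rule: a page matches if it contains any of the terms."""
    patterns = [re.compile(r"\b" + re.escape(t.lower()) + r"\b") for t in terms]

    def matches(text):
        text = text.lower()
        return any(p.search(text) for p in patterns)

    return matches


def match_rate(classifier, pages):
    """Fraction of pages that satisfy the rule."""
    return sum(1 for page in pages if classifier(page)) / len(pages)


crick_rule = make_classifier(["crick", "nobel", "darwin"])

# Toy stand-ins for page text (assumed, not taken from the study).
community_pages = [
    "Biography of Francis Crick",
    "The Nobel Prize in Physiology or Medicine 1962",
    "U.S. Human Genome Project 5 Year Plan",
    "The Alfred Russel Wallace Page",
]
random_pages = ["Weather forecast", "Used car listings"]

print(match_rate(crick_rule, community_pages))
print(match_rate(crick_rule, random_pages))
```

A large gap between the two rates is what indicates that a community is topically focused.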
What are Nepotistic Links?
Nepotistic Links: Links between pages that are present for reasons other than merit
Sites that are run by the same administrative control, like About.com
Advertising/paid links
Note: different from duplicate pages or mirrored sites
Eba6.com
Mapquesy.com
Preliminary Experiments
Two data sets were used:
1. 1536 arbitrarily selected, manually labeled links
2. 750 random links from the DiscoWeb search engine’s 7 million pages, also manually labeled as either nepotistic or not
75 binary features were used, for example:
Identical page titles or descriptions?
Page descriptions overlap by at least some percentage of the text?
Identical complete host names?
Some number of initial IP address octets identical?
Pages share at least some percentage of outgoing links?
Domains have the same contact email address?
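A few of the features listed above can be sketched as yes/no tests on a pair of page records. Everything here is an assumption made for illustration: the record fields (`url`, `title`, `description`, `ip`, `outlinks`), the 50% threshold, the choice of three IP octets, and the use of Jaccard overlap as the measure of "percentage shared".

```python
from urllib.parse import urlsplit


def overlap(a, b):
    """Jaccard overlap between two collections of items."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def link_features(src, dst, threshold=0.5, octets=3):
    """Compute a handful of binary features for the link src -> dst."""
    return {
        "same_title": src["title"] == dst["title"],
        "same_host": urlsplit(src["url"]).hostname == urlsplit(dst["url"]).hostname,
        "ip_prefix_match": src["ip"].split(".")[:octets] == dst["ip"].split(".")[:octets],
        "description_overlap": overlap(src["description"].lower().split(),
                                       dst["description"].lower().split()) >= threshold,
        "shared_outlinks": overlap(src["outlinks"], dst["outlinks"]) >= threshold,
    }


# Hypothetical page records using reserved example domains and addresses.
page_a = {"url": "http://www.example.com/a", "title": "Trail guide",
          "description": "guide to hiking trails", "ip": "192.0.2.10",
          "outlinks": ["http://example.org/x"]}
page_b = {"url": "http://www.example.com/b", "title": "Boot guide",
          "description": "guide to hiking boots", "ip": "192.0.2.77",
          "outlinks": ["http://example.net/y"]}

features = link_features(page_a, page_b)
print(features)
```

Each link becomes a fixed-length vector of booleans, which is the form of input a decision tree learner expects.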
Machine Learning
C4.5 decision tree package used to learn classification rules from the binary features
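C4.5 picks the feature to split on by gain ratio: information gain divided by the entropy of the split itself. The sketch below computes that criterion for binary features on a toy labeled set. It is not the C4.5 package, and the feature names and labels are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def gain_ratio(rows, labels, feature):
    """Information gain of splitting on `feature`, normalized by split entropy."""
    total = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy([row[feature] for row in rows])
    return gain / split_info if split_info > 0 else 0.0


# Toy training set: label True means the link was judged nepotistic.
rows = [
    {"same_host": True, "same_title": False},
    {"same_host": True, "same_title": True},
    {"same_host": False, "same_title": False},
    {"same_host": False, "same_title": True},
]
labels = [True, True, False, False]

best = max(rows[0], key=lambda f: gain_ratio(rows, labels, f))
print(best, gain_ratio(rows, labels, best))
```

In this toy set `same_host` separates the labels perfectly while `same_title` carries no information, so the first split would be on `same_host`. A full tree learner repeats the choice recursively on each branch.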
Results
Results, cont’d
Can classify links with more accuracy if one uses already categorized search engine results as “training data”
Second set of data too small – does not represent the variety of sites on the web
Nepotistic links largely do not affect popular pages
Conclusions
Both experiments focused on binary classifiers
Naïve researchers: the scale of the web is too large to run any of these algorithms on it, and both used small sample sizes to begin with