Web Page Categorization without the Web Page
Author: Min-Yen Kan
WWW-2004
Basic Idea
Web Page Categorization ~ Text Categorization
Some approaches retrieve the whole document
This yields URLs of additional documents
Could result in cyclic or non-terminating crawling
Glean information from intuitive URLs instead
Avoid the retrieval bottleneck
An Example
http://cs.cornell.edu/Info/Courses/Current/CS415/CS415.html
Classify the above web page into one of the following categories: Course, Faculty, Project, Student
Approach
Two-phase URL segmentation
First phase
Baseline: scheme://host/path-elements/document.extension
More segmentation at punctuation, e.g. faculty-info → faculty info
Refined: break the URL wherever a transition between uppercase, lowercase and digits is observed
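The first phase can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function names are hypothetical, and keeping a leading capital attached to the lowercase run that follows it (so that "Info" stays whole) is an assumption of this sketch, since the slide only says to break at case and digit transitions.

```python
import re
from urllib.parse import urlsplit

def baseline_segment(url):
    """Baseline: split scheme, host, path elements and extension at punctuation."""
    parts = urlsplit(url)
    raw = " ".join([parts.scheme, parts.netloc, parts.path])
    return [tok for tok in re.split(r"[^A-Za-z0-9]+", raw) if tok]

def refined_segment(url):
    """Refined: further break baseline tokens at case and digit transitions."""
    # Assumption: a capitalized word such as "Info" is kept as one token,
    # while an all-caps run followed by digits ("CS415") is split in two.
    pattern = r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+"
    out = []
    for tok in baseline_segment(url):
        out.extend(re.findall(pattern, tok))
    return out

url = "http://cs.cornell.edu/Info/Courses/Current/CS415/CS415.html"
print(baseline_segment(url))
print(refined_segment(url))
```

On the example URL, the baseline keeps CS415 as one token, while the refined pass separates it into CS and 415.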
Approach
Second phase
Information content reduction
Examines all possible partitions of the segment
Sums the information content (IC) of the parts in each partition
Picks the partition with the lowest IC
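Information content reduction might look roughly like the following sketch. The token counts are made up for illustration, the penalty for unseen parts is an assumption, and memoized recursion stands in for literally enumerating every partition; none of these details come from the slides.

```python
import math
from functools import lru_cache

# Hypothetical counts standing in for statistics gathered from a URL corpus.
COUNTS = {"faculty": 50, "info": 40, "course": 60, "home": 80, "page": 70, "in": 30}
TOTAL = sum(COUNTS.values())
UNSEEN_BITS = 20.0  # assumed penalty unit for parts never seen in the corpus

def ic(part):
    """Information content of one part: -log2 of its relative frequency."""
    if part in COUNTS:
        return -math.log2(COUNTS[part] / TOTAL)
    # Assumption: unseen parts cost more the longer they are, plus a fixed
    # charge per part so that unknown strings are not shredded into letters.
    return UNSEEN_BITS * (len(part) + 1)

def best_partition(segment):
    """Return (total IC, parts) for the lowest-IC partition of the segment."""
    @lru_cache(maxsize=None)
    def solve(start):
        if start == len(segment):
            return 0.0, ()
        options = []
        for end in range(start + 1, len(segment) + 1):
            rest_ic, rest_parts = solve(end)
            part = segment[start:end]
            options.append((ic(part) + rest_ic, (part,) + rest_parts))
        return min(options)
    total, parts = solve(0)
    return total, list(parts)

print(best_partition("facultyinfo")[1])
```

With these toy counts, a segment made of two frequent words is split, while an unseen string such as "xyz" is left whole.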
Title token based finite-state transducer (FST)
What about acronyms?
Non-deterministic weighted finite-state transducer
Splits and expands segments based on previously seen web page titles
An Example
FST rules
Rule                                                       Score  Output
1. Match the initial letter in the subsequent token          2    |l
2. Match the initial letter in the non-subsequent token      1    |l
3. Match a subsequent letter in the current token            1    l
4. Match the final letter in the current token               3    l
5. Skip a character in the candidate expansion               0    ε
Here l is the matched letter, | marks a token boundary, and ε is empty output.
Segment: nytimes
Candidate expansion: New York Times, read as ФNewYorkTimes
Rules applied: R1 R5 R5 R1 R5 R5 R5 R1 R3 R3 R3 R4
Result: score of 12, output |n|y|times
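The arithmetic behind this example can be checked by replaying the rule sequence over the characters of NewYorkTimes. The sketch below only replays a given sequence and sums the scores from the rule table; it is not the non-deterministic transducer itself, and the function name replay is hypothetical.

```python
# Scores taken from the rule table: rule number -> score.
RULE_SCORES = {1: 2, 2: 1, 3: 1, 4: 3, 5: 0}
# Rules 1 and 2 match a token-initial letter, so they emit a boundary marker.
STARTS_TOKEN = {1, 2}

def replay(expansion, rules):
    """Apply one rule per character of the expansion; return (score, output)."""
    if len(expansion) != len(rules):
        raise ValueError("need exactly one rule per character")
    score, output = 0, []
    for ch, rule in zip(expansion, rules):
        score += RULE_SCORES[rule]
        if rule == 5:
            continue  # skipped character: emits nothing (epsilon)
        prefix = "|" if rule in STARTS_TOKEN else ""
        output.append(prefix + ch.lower())
    return score, "".join(output)

score, output = replay("NewYorkTimes", [1, 5, 5, 1, 5, 5, 5, 1, 3, 3, 3, 4])
print(score, output)  # 12 |n|y|times
```

The three token-initial matches contribute 2 each, the three R3 matches 1 each, and the final-letter match 3, which gives the total of 12.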
Experiments
Dataset used: WebKB (4167 pages)
Classified under student, faculty, course and project
Classifier used: SVM
Compared with: FOIL-PILFS (based on inductive logic programming)
Evaluation based on (U)RL {Ub, Ur, Ui, Uf}, (A)nchor text, (T)itle text and page te(X)t
Conclusion
URLs contain tokens that are effective for classification
It is faster, since the page itself is never fetched
Careful URL segmentation boosts classification
URL segmentation is more powerful than expansion
Can assist source-based classification to a limited extent
The FST cannot expand what it hasn't seen
Cryptic URLs are hard to tackle