Web Page Categorization without the Web Page

Author: Min-Yen Kan

WWW-2004

Basic Idea

Web Page Categorization ~ Text Categorization

Some approaches retrieve the whole document
  This yields URLs of additional documents
  Could result in cyclic or non-terminating crawling
Glean information from intuitive URLs instead
  Avoid the bottleneck of retrieving pages

An Example

http://cs.cornell.edu/Info/Courses/Current/CS415/CS415.html

Classify the above web page into one of the following categories: Course, Faculty, Project, Student

Approach

Two-phase URL segmentation
First phase
  Baseline: split the URL into its components, scheme://host/path-elements/document.extension
    Further segmentation of each component, e.g. faculty-info → faculty info
  Refined: break a segment wherever a transition between uppercase, lowercase and digits is observed
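The first-phase segmentation can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the author's implementation: the function names `baseline_segments` and `refined_segments` are hypothetical, the baseline is assumed to split at every non-alphanumeric character, and a single capital followed by lowercase letters (as in `Info`) is assumed to stay together as one token.

```python
import re
from urllib.parse import urlparse


def baseline_segments(url):
    """Split a URL into tokens at every non-alphanumeric character (assumed baseline)."""
    parsed = urlparse(url)
    text = " ".join([parsed.scheme, parsed.netloc, parsed.path])
    return [tok for tok in re.split(r"[^A-Za-z0-9]+", text) if tok]


def refined_segments(url):
    """Further break each baseline token at uppercase/lowercase/digit transitions.

    Assumption: a capitalized word such as "Info" is kept whole, while an
    uppercase run followed by digits ("CS415") is split into "CS" and "415".
    """
    pattern = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")
    tokens = []
    for segment in baseline_segments(url):
        tokens.extend(pattern.findall(segment))
    return tokens


if __name__ == "__main__":
    url = "http://cs.cornell.edu/Info/Courses/Current/CS415/CS415.html"
    print(baseline_segments(url))
    print(refined_segments(url))
```

On the example URL from the earlier slide, the refined pass separates the course code `CS415` into `CS` and `415`, which the baseline leaves as a single token.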

Approach

Second phase
Information content reduction
  Examines all possible partitions of a segment
  Sums the information content (IC) of the parts in each partition
  Picks the partition with the lowest IC
Title token based finite state transducer
  What about acronyms?
  A non-deterministic weighted finite-state transducer splits and expands segments based on previously seen web page titles
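The information content reduction step can be illustrated with a small brute-force sketch. Everything concrete here is an assumption made for illustration: the token counts in `COUNTS`, the corpus size `TOTAL`, the pseudo-count for unseen tokens, and the function names are all hypothetical, and IC is taken to be the negative log probability of a token under a unigram model.

```python
import math
from itertools import combinations

# Hypothetical token counts; in practice these would be estimated from a corpus.
COUNTS = {"home": 2000, "page": 3000, "faculty": 500, "info": 800}
TOTAL = 1_000_000        # assumed corpus size
UNSEEN_COUNT = 0.01      # assumed pseudo-count for tokens never observed


def information_content(token):
    """IC(token) = -log2 P(token) under the toy unigram model above."""
    return -math.log2(COUNTS.get(token, UNSEEN_COUNT) / TOTAL)


def partitions(segment):
    """Yield every way of cutting a non-empty segment into contiguous parts."""
    n = len(segment)
    for r in range(n):
        for cuts in combinations(range(1, n), r):
            bounds = (0, *cuts, n)
            yield [segment[a:b] for a, b in zip(bounds, bounds[1:])]


def best_partition(segment):
    """Return the partition whose parts have the lowest summed IC."""
    return min(partitions(segment),
               key=lambda parts: sum(information_content(p) for p in parts))


if __name__ == "__main__":
    for segment in ["homepage", "facultyinfo", "page"]:
        print(segment, "->", best_partition(segment))
```

Enumerating all partitions is exponential in the segment length, which is tolerable for short URL segments but would need pruning or dynamic programming for longer strings.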

An Example

FST Rule                                                   Score  Output
1. Match the initial letter in the subsequent token        2      |l
2. Match the initial letter in the non-subsequent token    1      |l
3. Match a subsequent letter in the current token          1      l
4. Match the final letter in the current token             3      l
5. Skip a character in the candidate expansion             0      ε

Segment: nytimes
Title: New York Times
Candidate expansion: NewYorkTimes

N  e  w  Y  o  r  k  T  i  m  e  s
R1 R5 R5 R1 R5 R5 R5 R1 R3 R3 R3 R4

Score of 12 and outputs |n|y|times
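The arithmetic of this example can be checked by replaying the rule sequence against the candidate expansion. The sketch below only replays one given path; it does not search over alternative paths as a non-deterministic transducer would. The names `RULES` and `replay` are hypothetical, and the output column is read as "l is the matched letter, | marks a token boundary", which is an interpretation of the table rather than something the slides state.

```python
# Rule id -> (score, output template); "{}" stands for the matched letter.
RULES = {
    "R1": (2, "|{}"),  # initial letter in the subsequent token
    "R2": (1, "|{}"),  # initial letter in the non-subsequent token
    "R3": (1, "{}"),   # subsequent letter in the current token
    "R4": (3, "{}"),   # final letter in the current token
    "R5": (0, ""),     # skip a character in the candidate expansion
}


def replay(expansion, rule_sequence):
    """Apply one rule per character of the expansion; return (score, output)."""
    if len(expansion) != len(rule_sequence):
        raise ValueError("need exactly one rule per character of the expansion")
    score = 0
    output = []
    for char, rule in zip(expansion, rule_sequence):
        points, template = RULES[rule]
        score += points
        output.append(template.format(char.lower()))
    return score, "".join(output)


if __name__ == "__main__":
    sequence = "R1 R5 R5 R1 R5 R5 R5 R1 R3 R3 R3 R4".split()
    print(replay("NewYorkTimes", sequence))
```

Three applications of rule 1 contribute 6, three of rule 3 contribute 3, and rule 4 contributes 3, which gives the score of 12 reported on the slide.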

Experiments

Dataset used: WebKB (4167 pages)
  Classified under student, faculty, course and project
Classifier used: SVM
Compared with: FOIL-PILFS (based on inductive logic programming)
Evaluation based on (U)RL features {Ub, Ur, Ui, Uf} (baseline, refined, IC reduction, FST), (A)nchor text, (T)itle text and page te(X)t

Conclusion

URLs contain tokens that are effective for classification
It is faster, since the page itself is not retrieved
Careful URL segmentation boosts classification
URL segmentation is more powerful than expansion
Can assist source-based classification to a limited extent
The FST cannot expand what it hasn't seen
Cryptic URLs are hard to tackle
