Post on 06-Feb-2016
Link Distribution on Wikipedia
[0422]KwangHee Park
Table of contents
Introduction
Similarity between documents
Error cases
Modifying the word bag
Conclusion
Introduction: why focus on links
When someone creates a new article in Wikipedia, they mostly just link it to a source in another language or to similar and related articles. After that, the article is written out by others.
Assumption
Link terms in Wikipedia articles are the key terms that can represent the specific characteristics of each article.
Introduction: the problem we want to solve
To analyze the latent distribution of a set of target documents by topic modeling.
Topic modeling: our approach
Target: document = Wikipedia article; terms = linked terms in the document
Modeling method: LDA
Modeling tool: LingPipe API
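To make the setup concrete, here is a minimal sketch of LDA fitted by collapsed Gibbs sampling, where each document is simply its list of link terms. It is written in plain Python for illustration and is not the LingPipe API used in this work. The function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the iteration count are all assumptions.

```python
import random

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over bags of link terms.

    docs: list of documents, each a list of link-term strings.
    Returns one topic distribution (list of floats) per document.
    """
    rng = random.Random(seed)
    vocab = sorted({term for doc in docs for term in doc})
    term_id = {term: i for i, term in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]               # count of topic k in document d
    topic_term = [[0] * vocab_size for _ in range(num_topics)]  # count of term w in topic k
    topic_total = [0] * num_topics                              # total terms assigned to topic k
    assignments = []

    # Give every term occurrence a random initial topic.
    for d, doc in enumerate(docs):
        z_doc = []
        for term in doc:
            k = rng.randrange(num_topics)
            w = term_id[term]
            doc_topic[d][k] += 1
            topic_term[k][w] += 1
            topic_total[k] += 1
            z_doc.append(k)
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, term in enumerate(doc):
                w = term_id[term]
                k = assignments[d][i]
                # Remove the current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_term[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_term[t][w] + beta) / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_term[k][w] += 1
                topic_total[k] += 1

    # Smoothed per-document topic distribution.
    return [
        [(doc_topic[d][t] + alpha) / (len(doc) + num_topics * alpha)
         for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
```

Calling `lda_gibbs(docs, num_topics=2)` on a handful of link-term lists returns one probability vector per article. These per-document vectors play the role of the topic distributions compared later in the similarity evaluation.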
Advantages of linked terms
No extra preprocessing needed: boundary detection, stopword removal, word stemming
Includes more semantics: correlation between term and document
Ex) "cancer" as a term, "Cancer" as a document
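The "no extra preprocessing" point can be illustrated by how directly link terms fall out of the article markup. The sketch below assumes the input is raw MediaWiki text, where links are written `[[Target]]` or `[[Target|label]]`. The slides do not say how the links were actually extracted, so the regular expression and the helper name `link_terms` are illustrative assumptions.

```python
import re
from collections import Counter

# Matches [[Target]], [[Target|label]] and [[Target#Section|label]];
# group 1 is the linked article title.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")

def link_terms(wikitext):
    """Return the bag of link targets found in one article's wiki markup."""
    targets = (m.group(1).strip() for m in WIKILINK.finditer(wikitext))
    # Skip namespaced links such as File: or Category:.
    return Counter(t for t in targets if t and ":" not in t)

text = (
    "'''Cancer''' is a group of [[disease]]s involving "
    "[[Cell growth|abnormal cell growth]]. See also [[Tumor]]. "
    "[[Category:Oncology]]"
)
bag = link_terms(text)  # keys: "disease", "Cell growth", "Tumor"
```

Each key is already a term boundary and a canonical article title, so no tokenization, stopword list, or stemmer is involved. The same title also identifies a document, which is the term-document correlation noted above.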
Preliminary problem
How well do the link terms in a document represent the specific characteristics of that document?
Link evaluation
Calculate the similarity between documents.
Link evaluation: similarity-based evaluation
Calculate the similarity between documents: Sim_d{doc1, doc2}
Calculate the similarity between terms: Sim_t{term1, term2}
Compare the two similarities.
Similarity between documents: Sim_d
The similarity between documents is significantly affected by the input term set.
Data set: 1,536 documents
Disease domain: 208; settlement domain: 1,328
p, q = topic distributions of the two documents, compared with Kullback-Leibler divergence
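For two topic distributions p and q over the same K topics, the Kullback-Leibler divergence is KL(p || q) = sum over k of p_k * log(p_k / q_k). A minimal sketch follows. The small floor `eps` on q and the symmetric variant are assumptions added for illustration, since the slides name only KL divergence itself.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions of equal length."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of topics")
    # Terms with p_k == 0 contribute nothing; q_k is floored to avoid division by zero.
    return sum(pk * math.log(pk / max(qk, eps)) for pk, qk in zip(p, q) if pk > 0)

def symmetric_kl(p, q):
    """Average of both directions, so the order of the two documents does not matter."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
```

A value of 0 means the two documents have identical topic distributions, and larger values mean less similar documents. Plain KL divergence is not symmetric, which is why a symmetrized form may be more convenient when a single Sim_d score per document pair is wanted.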
Example – reasonable
Example – not good
Error analysis
Length problem: topic proportions are overestimated
If a document contains only a few link terms, the proportion of a topic in that document tends to be overestimated.
Ex) 1950년 (the year 1950), 1960년 (the year 1960), 파푸아 뉴기니 (Papua New Guinea), 식인풍습 (cannibalism)
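One way to see the length problem is to compute smoothed topic proportions from per-topic term counts. The estimate (n_k + alpha) / (N + K * alpha), the value of `alpha`, and the counts below are illustrative assumptions, not figures from the data set.

```python
def topic_proportions(topic_counts, alpha=0.1):
    """Smoothed topic proportions from the number of link terms assigned to each topic."""
    total = sum(topic_counts)
    num_topics = len(topic_counts)
    return [(n + alpha) / (total + num_topics * alpha) for n in topic_counts]

# A short document: only 4 link terms, all assigned to the same topic.
short_doc = topic_proportions([4, 0, 0, 0, 0, 0, 0, 0, 0, 0])

# A long document: 40 link terms spread over four topics.
long_doc = topic_proportions([16, 8, 8, 8, 0, 0, 0, 0, 0, 0])
```

With these counts the short document puts about 0.82 of its mass on one topic from just four terms, while the long document's largest topic stays below 0.4. A few link terms are enough to make a short article look strongly committed to a topic.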
Error analysis
Some documents' link terms do not describe the document itself.
Ex) dates, countries, etc.
Demo website
For the disease domain: http://semanticweb.kaist.ac.kr/research/tmodel/
For the settlement domain: http://semanticweb.kaist.ac.kr/research/tmodel/sindex.php
For the disease + settlement domains: http://semanticweb.kaist.ac.kr/research/tmodel/dsindex.php
Modifying the word bag
Include non-link terms
Exclude noise terms
Weighted score for duplicated terms
Include incoming links
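The four modifications above could be combined roughly as follows. This is a hypothetical sketch: the helper name `build_word_bag`, its parameters, and the choice of 1 + log(count) as the weight for duplicated terms are all assumptions, since the slides do not specify a weighting scheme.

```python
import math
from collections import Counter

def build_word_bag(link_terms, non_link_terms=(), incoming_links=(), noise_terms=frozenset()):
    """Build a weighted word bag for one article.

    link_terms:     outgoing link terms of the article
    non_link_terms: additional plain-text terms to include
    incoming_links: titles of articles that link to this one
    noise_terms:    terms to exclude, e.g. dates or country names
    """
    counts = Counter(link_terms)
    counts.update(non_link_terms)
    counts.update(incoming_links)
    # Duplicated terms score higher, but sub-linearly (assumed weighting).
    return {
        term: 1.0 + math.log(n)
        for term, n in counts.items()
        if term not in noise_terms
    }

bag = build_word_bag(
    ["cancer", "cancer", "1950"],
    non_link_terms=["tumor"],
    incoming_links=["oncology"],
    noise_terms={"1950"},
)
```

In this example the date term is dropped as noise, "cancer" gets a higher score because it appears twice, and the non-link and incoming-link terms enter the bag with the base score of 1.0.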
Conclusion
Topic modeling with the link distribution in Wikipedia
Need to measure how well the link distribution can represent each article's characteristics
After that, analyze the topic distribution in a variety of ways
Expect that the topic distribution can be applied to many applications
Thank you