BINGO!: Bookmark-Induced Gathering of Information
Sergej Sizov, Martin Theobald,
Stefan Siersdorfer, Gerhard Weikum
University of the Saarland
Germany
BINGO!: Bookmark-Induced Gathering of Information, Sergej Sizov
Part I
System Overview
Motivation
Web search engines
The vector space model
Link analysis & authority ranking
Information demands
Mass queries (“madonna tour”)
Needle-in-a-haystack queries (“solidarity eisler”)
Overview (II)
[Figure: documents from the WWW are sorted into a topic taxonomy. Nodes: ROOT, Semistructured Data, DB Core Technology, Networking, Workflow and E-Services, Web Retrieval, Data Mining, XML]
Focused Crawling
[Figure: focused crawling loop with the components Crawler, Queue, Classifier and Results]
Focused Crawling (2)
Key aspects:
the mathematical model and algorithm that are used for the classifier (e.g., Naive Bayes vs. SVM)
the feature set upon which the classifier makes its decision (e.g., all terms vs. a careful selection of the "most discriminative" terms)
the quality of the training data
Focused Crawling (3)
[Figure: focused crawling with re-training. Components: Crawler, Queue, SVM Classifier, HITS, Re-Training; the SVM contributes Archetypes, HITS contributes Hubs and Authorities]
System Overview
[Figure: system architecture. Components: Crawler, Document Analyzer, Feature Selection, Classifier, Adaptive Re-Training, Link Analyzer. Data: URL Queue, Docs, Feature Vectors, Ontology Index, Training Docs, Bookmarks, Hubs & Authorities. Source: WWW]
Part II
System Components
Focus Manager
Focusing strategies
Depth-first (df): $P_{df}(j) = -\left(\mathrm{depth}(j) + \frac{\mathrm{pos}(j)}{\mathrm{links}(j)}\right) \times (\mathrm{confidence}(j)+1)^2$
Breadth-first (bf): $P_{bf}(j) = \frac{\mathrm{depth}(j) + \mathrm{pos}(j)/\mathrm{links}(j)}{(\mathrm{confidence}(j)+1)^2}$
Strong focus (learning phase)
Soft focus (harvesting phase)
Tunneling
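The two priority formulas can be tried out with a small sketch. It assumes that the URL with the smallest priority value is crawled first, that pos(j) is the position of link j within its source page, that links(j) is the number of links on that page, and that confidence(j) is the classifier confidence of the source page. The slide states none of these readings, and the candidate URLs are made up.

```python
import heapq

def priority_bf(depth, pos, links, confidence):
    """Breadth-first: shallow, confidently classified links get small values."""
    return (depth + pos / links) / (confidence + 1) ** 2

def priority_df(depth, pos, links, confidence):
    """Depth-first: deep, confidently classified links get very negative values."""
    return -(depth + pos / links) * (confidence + 1) ** 2

def crawl_order(candidates, priority):
    """Return URLs in the order a min-heap on the priority value releases them."""
    heap = [(priority(d, p, n, c), url) for url, d, p, n, c in candidates]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]

# Hypothetical candidates: (url, depth, pos, links, confidence)
candidates = [
    ("a", 1, 1, 2, 0.3),
    ("b", 1, 2, 2, 0.85),
    ("c", 2, 1, 1, 0.4),
]
bf_order = crawl_order(candidates, priority_bf)
df_order = crawl_order(candidates, priority_df)
```

Under these assumptions a higher confidence moves a URL forward in both strategies, while depth pushes it back in breadth-first order and forward in depth-first order.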
Focus Manager (2)
Sample URL Prioritization
[Figure: link graph over documents 1 to 10; classified documents are annotated with confidence = 0.3 (topic A), confidence = 0.4 (topic A), confidence = 0.85 (topic A) and confidence = 0.6 (topic B)]
DF strong order: 1–2–5–3–6–4–9–10 ..
BF strong order: 1–2–5–3–4–6–9–10 ..
DF soft order: 1–2–5–6–3–7–8–4–9–10 ..
BF soft order: 1–2–5–3–6–4–7–8–9–10 ..
Feature Selection
Mutual Information (MI) criterion:
$MI(X_i, V_j) = P[X_i \wedge V_j] \cdot \log \frac{P[X_i \wedge V_j]}{P[X_i] \cdot P[V_j]}$
$MI(X_i, V_j) = \frac{A}{N} \cdot \log \frac{A \cdot N}{(A+B)(A+C)}$
A is the number of documents in $V_j$ containing $X_i$,
B is the number of documents with $X_i$ in "competitive" topics,
C is the number of documents in $V_j$ without $X_i$,
N is the overall number of documents in $V_j$ and its competitive topics.
Time complexity: O(n)+O(mk) for n documents, m terms and k competitive topics.
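A minimal sketch of the count-based MI weight and the resulting feature ranking. The natural logarithm, the zero weight for terms that never occur in the topic, the helper names and the sample counts are assumptions made for illustration.

```python
import math

def mi_weight(a, b, c, n):
    """MI(X_i, V_j) = (A/N) * log(A*N / ((A+B)*(A+C))); 0 if the term is absent from the topic."""
    if a == 0:
        return 0.0
    return (a / n) * math.log((a * n) / ((a + b) * (a + c)))

def select_features(counts, topic_size, n, top_k):
    """counts maps term -> (A, B); C follows as topic_size - A."""
    scored = {t: mi_weight(a, b, topic_size - a, n) for t, (a, b) in counts.items()}
    return sorted(scored, key=scored.get, reverse=True)[:top_k]

# Hypothetical counts: 10 documents in the topic, 20 documents overall
counts = {"sql": (8, 1), "the": (10, 10), "disk": (5, 0)}
top = select_features(counts, topic_size=10, n=20, top_k=2)
```

A term such as "the" that appears in every document of both the topic and its competitors gets weight 0, so it is never preferred over terms concentrated in the topic.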
Feature Selection (2)
Top features for the topic "DB Core Technology" with regard to tf*idf (left) and MI (right)
Feature     tf*idf score   Feature    MI weight
below       1.4927         storag     0.1428
et          1.2778         modifi     0.1258
graph       1.2446         sql        0.1209
involv      1.0406         disk       0.1179
accomplish  0.9491         pointer    0.1150
backup      0.8613         deadlock   0.1001
command     0.8567         redo       0.1001
exactli     0.8112         implement  0.0963
feder       0.7764         correctli  0.0911
histor      0.6822         size       0.0911
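Classifier
[Figure: separating hyperplane $\vec{w} \cdot \vec{x} + b = 0$ with margin $\delta$ between the classes $V$ and $\neg V$ in the $(x_1, x_2)$ plane; an unlabeled point is marked "?"]
Input:
n training vectors with components (x1, ..., xm, C) and C = +1 or C = -1
Training: Compute $\vec{w} \cdot \vec{x} + b = 0$
Classification: Check $\vec{w} \cdot \vec{y} + b > 0$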
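The classification check against a trained hyperplane can be sketched in a few lines. The weight vector and bias below are invented values, and using the magnitude of the score as a stand-in for classification confidence is an assumption, not something this slide states.

```python
def decision_value(w, b, y):
    """Signed score w.y + b; one multiply-add per feature."""
    return sum(wi * yi for wi, yi in zip(w, y)) + b

def classify(w, b, y):
    """Return (label, confidence): label is +1 if y lies on the positive side."""
    score = decision_value(w, b, y)
    return (1 if score > 0 else -1), abs(score)

w, b = [2.0, -1.0], -0.5          # hypothetical trained hyperplane
label, confidence = classify(w, b, [1.0, 0.5])
```

The cost of one decision grows linearly with the number of features, which is why a compact feature set keeps classification cheap during the crawl.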
Hierarchical Classification
Recursive classification by the taxonomy tree. Decisions based on topic-specific feature spaces
[Figure: taxonomy tree (ROOT; Semistructured Data, DB Core Technology, Networking, Workflow and E-Services, Web Retrieval, Data Mining, XML) annotated with classifier scores 0.8, 0.1, -0.5, 0.2, -0.7, 0.2 and 0.4]
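Recursive classification along the taxonomy can be sketched as a top-down descent. The rule used here (follow the child with the highest positive score, stop when every child rejects the document) is an assumption, the keyword scorers are toy stand-ins for the per-topic SVMs, and the tree shape is invented for the example.

```python
def keyword_scorer(weights, bias):
    """Toy linear scorer over a topic-specific feature space (the keys of weights)."""
    def score(doc):
        return sum(w * doc.count(term) for term, w in weights.items()) + bias
    return score

def classify_hierarchically(doc, node):
    """Descend the taxonomy, following the child with the highest positive score."""
    path = [node["name"]]
    while node.get("children"):
        scored = [(child["score"](doc), child) for child in node["children"]]
        best_score, best = max(scored, key=lambda pair: pair[0])
        if best_score <= 0:        # every child rejects the document: stop here
            break
        path.append(best["name"])
        node = best
    return path

# Hypothetical two-level tree; each node scores with its own feature set
root = {"name": "ROOT", "children": [
    {"name": "Data Mining",
     "score": keyword_scorer({"mine": 1.0, "cluster": 0.5}, -0.5)},
    {"name": "Semistructured Data",
     "score": keyword_scorer({"xml": 0.6, "schema": 0.4}, -0.5),
     "children": [
         {"name": "XML", "score": keyword_scorer({"xml": 1.0, "xpath": 1.0}, -1.0)},
     ]},
]}
path = classify_hierarchically(["xml", "schema", "xml", "xpath"], root)
```

Because each node only looks at its own features, a term that separates topics on one level does not have to be discriminative on any other level.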
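Link Analysis
The HITS Algorithm
Web graph G = (S, E)
Authority Score: $x_q = \sum_{(p,q) \in E} y_p$
Hub Score: $y_p = \sum_{(p,q) \in E} x_q$
With A the adjacency matrix of G: $\vec{x} = A^T \vec{y}$, $\vec{y} = A \vec{x}$
Iterative approximation of the dominant eigenvectors of $A^T A$ and $A A^T$:
$\vec{x} := A^T \vec{y} := A^T A \vec{x}$
$\vec{y} := A \vec{x} := A A^T \vec{y}$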
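The iterative approximation can be sketched as a plain power iteration over an edge list. The fixed iteration count, the Euclidean normalization after each round and the three-edge sample graph are choices made for this example.

```python
import math

def hits(edges, iterations=50):
    """Power iteration for HITS on a directed edge list of (p, q) pairs."""
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: x_q = sum of hub scores of pages linking to q
        auth = {n: 0.0 for n in nodes}
        for p, q in edges:
            auth[q] += hub[p]
        # Hub update: y_p = sum of authority scores of pages that p links to
        new_hub = {n: 0.0 for n in nodes}
        for p, q in edges:
            new_hub[p] += auth[q]
        hub = new_hub
        # Normalize both vectors to unit Euclidean length
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return auth, hub

# Hypothetical graph: h1 links to a and b, h2 links to a
auth, hub = hits([("h1", "a"), ("h2", "a"), ("h1", "b")])
best_authority = max(auth, key=auth.get)
best_hub = max(hub, key=hub.get)
```

In the sample graph the page with the most in-links from hubs becomes the top authority, and the page pointing to both authorities becomes the top hub.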
Retraining based on Archetypes
Two sources of potential archetypes:
Link analysis → Nauth good authorities
SVM classifier → Nconf best-rated docs
To avoid the "topic drift" phenomenon, the classification confidence of an archetype must be higher than the mean confidence of the previous iteration's training documents.
Retraining (2)
if {at least one topic has more than Nmax positive documents
    or all topics have more than Nmin positive documents} {
  for each topic Vi {
    link analysis using all documents of Vi as base set;
    hubs(Vi) = top Nhub documents;
    authorities(Vi) = top Nauth documents;
    sort docs of Vi in descending order of confidence;
    archetypes(Vi) = top Nconf from confidence ranking ∪ authorities(Vi);
    remove from archetypes(Vi) all docs with confidence < mean of the previous iteration;
    archetypes(Vi) = archetypes(Vi) ∪ bookmarks(Vi) };
  for each topic Vi {
    perform feature selection based on archetypes(Vi);
    re-compute SVM decision model for Vi };
  re-initialize URL queue using hubs(Vi) }
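The archetype selection step for a single topic can be sketched as follows. It assumes the candidate set is the union of the most confident documents and the top authorities, and that bookmarks are always kept; the function name, parameter names and sample values are hypothetical.

```python
def select_archetypes(docs, authorities, bookmarks, prev_mean_conf, n_conf):
    """docs maps URL -> SVM confidence for one topic; returns the new training set."""
    by_confidence = sorted(docs, key=docs.get, reverse=True)
    candidates = set(by_confidence[:n_conf]) | set(authorities)
    # Topic-drift guard: drop candidates below the previous training set's mean confidence
    kept = {url for url in candidates if docs.get(url, 0.0) >= prev_mean_conf}
    # Bookmarks always stay in the training set
    return kept | set(bookmarks)

docs = {"u1": 1.3, "u2": 0.9, "u3": 0.4, "u4": 0.2}   # hypothetical confidences
training_set = select_archetypes(
    docs, authorities=["u3", "u2"], bookmarks=["b1"], prev_mean_conf=0.5, n_conf=1
)
```

Here "u3" is a good authority but falls below the confidence threshold, so it is not promoted to an archetype.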
Part III
Evaluation
Testbed
Bookmarks: homepages of researchers in the various areas
Leaf nodes were filled with 9-15 bookmarks
The total training data comprised 81 documents
Focused crawl:
Crawling time: 6h
Visited: 11000 pages (1800 hosts), link distances 1 – 7
4230 positively classified (675 different hosts)
Entire crawl: 7 iterations with re-training.
Parameters:
Nmin = 50, Nmax = 200, Nhub = 50, Nauth = 20, Nconf = 20.
Feature selection: MI criterion, best 300 for each topic;
Authority ranking: HITS algorithm
Crawling Precision
Iteration  Data Mining  XML   Entire ontology
1          0.98         0.94  0.98
2          0.98         0.93  0.98
3          0.99         0.97  0.96
4          0.87         0.99  0.97
5          0.90         0.95  0.96
6          0.98         0.98  0.95
7          0.94         0.97  0.96
Crawling Precision (2)
Iteration  BINGO!  with focusing, no MI  no focusing, no MI
1          0.98    0.89                  0.84
2          0.98    0.86                  0.86
3          0.96    0.75                  0.79
4          0.97    0.78                  0.73
5          0.96    0.55                  0.63
6          0.95    0.54                  0.52
7          0.96    0.63                  0.50
Crawling Recall
Iteration  Data Mining  XML  Entire ontology
1          307          117  807
2          552          343  1615
3          1092         396  2436
4          1553         442  3245
5          2071         562  4072
6          2678         627  4898
7          3027         701  5715
Archetype Selection
Topic "Data Mining":
URL                                                     SVM confidence
http://www.it.iitb.ernet.in/~sunita/it642/              1.35
http://www.research.microsoft.com/research/datamine/    1.31
http://www.acm.org/sigs/sigkdd/explorations/            1.28
http://robotics.stanford.edu/users/ronnyk/              1.24
http://www.kdnuggets.com/index.html                     1.18
http://www.wizsoft.com/                                 1.16
http://www.almaden.ibm.com/cs/people/ragrawal/          1.14
http://www.cs.sfu.ca/~han/DM_Book.html                  1.14
http://db.cs.sfu.ca/sections/publication/kdd/kdd.html   1.14
http://www.cs.cornell.edu/johannes/publications.html    0.78
Archetype Selection (2)
Iteration  Data Mining  XML     Entire ontology
1          10 (1)       5 (0)   24 (4)
2          10 (2)       11 (0)  27 (5)
3          9 (1)        17 (1)  32 (4)
4          8 (0)        7 (0)   29 (3)
5          22 (2)       26 (2)  62 (8)
6          43 (4)       12 (2)  77 (10)
7          38 (0)       13 (1)  75 (8)
Feature Selection
Topic "Data Mining":
Feature MI weight
mine 0.178
knowledg 0.122
olap 0.106
frame 0.086
pattern 0.066
genet 0.061
discov 0.053
miner 0.053
cluster 0.049
dataset 0.044
Future Work
Large-scale experiments (portal generator)
Annotation and semantic classification of HTML sources (e.g. transformation of HTML to XML for improved data management, detection of “information units”)
Advanced feature construction and feature selection algorithms
Fault tolerance on document collections with wrong samples, adaptive re-training
... ?
Crawler
Key features:
asynchronous DNS lookups with caching
multiple download attempts
advanced duplicate recognition
following multiple redirects
advanced topic-balanced URL-queue
document filters for common datatypes
focusing strategies
Classifier (II)
Training: Find the hyperplane $\vec{w} \cdot \vec{x} + b = 0$ that separates the samples with maximum margin (quadratic optimization task):
minimize: $V(\vec{w}, b, \vec{\xi}) = \frac{1}{2}\, \vec{w} \cdot \vec{w} + C \sum_{i=1}^{n} \xi_i$
subject to: $\forall_{i=1}^{n}: y_i\, [\vec{w} \cdot \vec{x}_i + b] \ge 1 - \xi_i$ and $\forall_{i=1}^{n}: \xi_i \ge 0$
Classification: Test unlabeled vector $\vec{y}$ for $\vec{w} \cdot \vec{y} + b > 0$
Very efficient runtime in O(m)
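The soft-margin objective from the Classifier (II) slide can be explored with a small sketch. It does not use the quadratic-programming solver an SVM package would use. Instead it runs plain batch subgradient descent on the equivalent hinge-loss form, where each slack variable equals max(0, 1 - y_i (w.x_i + b)). The learning rate, epoch count and toy data are assumptions.

```python
def train_linear_svm(samples, labels, c=1.0, epochs=200, lr=0.01):
    """Subgradient descent on 1/2 * w.w + C * sum of hinge losses."""
    m = len(samples[0])
    w, b = [0.0] * m, 0.0
    for _ in range(epochs):
        grad_w, grad_b = list(w), 0.0           # gradient of 1/2 * w.w is w
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:                       # slack is positive for this sample
                for k in range(m):
                    grad_w[k] -= c * y * x[k]
                grad_b -= c * y
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

# Hypothetical linearly separable training data
samples = [[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]]
labels = [1, 1, -1, -1]
w, b = train_linear_svm(samples, labels)
predictions = [
    1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1 for x in samples
]
```

Once w and b are fixed, classifying a new vector needs only the single dot product from the slide, independent of the number of training documents.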
Related Work
General-purpose crawling
Focused crawling
Authority ranking
Classification of Web documents
Web ontologies