LINGO
Sandra Gama
Search Results Clustering
Internet endless document collection
Search Engines
NO question answering
FAST access to Web content
SENSITIVE to query quality
we NEED meaningful RESULTS
CLUSTERING!
GROUPING by Similarity
Semantic structure
Groups
Description
Luxury Car
Feline, panther family
Description QUALITY
How to cluster?
LINGO: a new approach
Pre-processing
Phrase extraction
Cluster-Label Induction
Cluster-content allocation
Filtered docs
Frequent phrases
Cluster labels
user query
clustered documents
STAGE 1/4: PREPROCESSING
1. Text segmentation
2. Stemming
3. Ignore stop words
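The three preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the LINGO implementation: the stop-word list, the `naive_stem` suffix stripper (a stand-in for a real stemmer such as Porter's) and all function names are assumptions made for this example.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}

def segment(text):
    """Step 1: split text into sentences, then into lowercase word tokens."""
    sentences = re.split(r"[.!?;:]+", text)
    token_lists = (re.findall(r"[a-z0-9]+", s.lower()) for s in sentences)
    return [tokens for tokens in token_lists if tokens]

def naive_stem(word):
    """Step 2: crude suffix stripping, a stand-in for a real stemmer."""
    for suffix in ("ations", "ation", "ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Step 3: return, per sentence, (stem, is_stop_word) pairs."""
    return [[(naive_stem(w), w in STOP_WORDS) for w in sentence]
            for sentence in segment(text)]

for sentence in preprocess("Software for the sparse singular value decomposition. "
                           "Matrix computations!"):
    print(sentence)
```

In this sketch stop words are flagged rather than deleted, so that a later stage can still decide how to treat them inside phrases; that is a design choice of the example.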
STAGE 2/4: PHRASE EXTRACTION
Goal
1/4 More than N occurrences
2/4 No more than 1 sentence
3/4 Complete phrase
4/4 Stop words
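The four conditions above can be illustrated with a naive n-gram counter. This is only a sketch under assumptions: it counts n-grams directly instead of using the suffix-array technique the slides describe, it reads "complete phrase" as "no one-token extension occurs equally often", and the stop-word list, `max_len` limit and function names are hypothetical.

```python
from collections import Counter

STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}  # illustrative only

def frequent_phrases(sentences, n=1, max_len=4):
    """sentences: list of token lists. Returns {phrase_tuple: count}."""
    counts = Counter()
    # Condition 2/4: n-grams are taken inside one sentence, never across two.
    for tokens in sentences:
        for size in range(1, max_len + 2):  # one extra length for the completeness test
            for start in range(len(tokens) - size + 1):
                counts[tuple(tokens[start:start + size])] += 1

    def is_complete(phrase):
        # Condition 3/4: reject a phrase if some one-token extension is as frequent.
        for longer, c in counts.items():
            if len(longer) == len(phrase) + 1 and c == counts[phrase]:
                if longer[1:] == phrase or longer[:-1] == phrase:
                    return False
        return True

    result = {}
    for phrase, c in counts.items():
        if len(phrase) > max_len or c <= n:  # condition 1/4: more than n occurrences
            continue
        if phrase[0] in STOP_WORDS or phrase[-1] in STOP_WORDS:  # condition 4/4
            continue
        if is_complete(phrase):
            result[phrase] = c
    return result

titles = [
    "large scale singular value computations",
    "software for the sparse singular value decomposition",
    "introduction to modern information retrieval",
    "linear algebra for intelligent information retrieval",
    "matrix computations",
    "singular value cryptogram analysis",
    "automatic information organization",
]
print(frequent_phrases([t.split() for t in titles]))
```

With these seven titles, "singular" alone is rejected because it is always followed by "value", while "singular value" is kept.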
How it works
Position: 1 2 3 4 5 6 7 8 9 10 11
Text:     a b r a c a d a b r a
How many non-empty suffixes?
abracadabra
bracadabra
racadabra
acadabra
cadabra
adabra
dabra
abra
bra
ra
a
11 suffixes
Sorted Suffix Index
a 11
abra 8
abracadabra 1
acadabra 4
adabra 6
bra 9
bracadabra 2
cadabra 5
dabra 7
ra 10
racadabra 3
1 2 3 4 5 6 7 8 9 10 11 12
a b r a c a d a b r a $
Suffix array: 11 8 1 4 6 9 2 5 7 10 3
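The suffix array for "abracadabra" can be reproduced with a few lines of Python. This is a deliberately naive construction for illustration (sorting whole suffixes), and it works on characters as in the example above, whereas phrase extraction would work on words; the function names are assumptions.

```python
from bisect import bisect_left, bisect_right

def suffix_array(text):
    """Return 1-based start positions of all suffixes in lexicographic order."""
    # Naive construction by sorting whole suffixes: fine for a demo, slow for large inputs.
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])

def count_occurrences(text, sa, pattern):
    """Count occurrences of pattern by binary search over the sorted suffixes."""
    m = len(pattern)
    # Truncating sorted suffixes to m characters keeps them in sorted order.
    prefixes = [text[i - 1:i - 1 + m] for i in sa]
    return bisect_right(prefixes, pattern) - bisect_left(prefixes, pattern)

sa = suffix_array("abracadabra")
print(sa)                                             # [11, 8, 1, 4, 6, 9, 2, 5, 7, 10, 3]
print(count_occurrences("abracadabra", sa, "abra"))   # 2
```

Because all suffixes sharing a prefix sit next to each other in the sorted order, counting how often a substring occurs reduces to finding the boundaries of one contiguous block.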
STAGE 3/4: CLUSTER-LABEL INDUCTION
Singular Value Decomposition
A: term × document matrix
Find matrices U, Σ, V such that A = U Σ V^T
D1: Large-scale singular value computations
D2: Software for the sparse singular value decomposition
D3: Introduction to modern information retrieval
D4: Linear algebra for intelligent information retrieval
D5: Matrix computations
D6: Singular value cryptogram analysis
D7: Automatic information organization

T1: Information
T2: Singular
T3: Value
T4: Computations
T5: Retrieval

P1: Singular value
P2: Information retrieval
Term-document matrix A (rows: terms T1 to T5; columns: documents D1 to D7)
                   D1    D2    D3    D4    D5    D6    D7
T1: Information    0.00  0.00  0.56  0.56  0.00  0.00  1.00
T2: Singular       0.49  0.71  0.00  0.00  0.00  0.71  0.00
T3: Value          0.49  0.71  0.00  0.00  0.00  0.71  0.00
T4: Computations   0.72  0.00  0.00  0.00  1.00  0.00  0.00
T5: Retrieval      0.00  0.00  0.83  0.83  0.00  0.00  0.00
Abstract concept matrix (SVD)
U = (rows: terms T1 to T5; columns: abstract concepts)
0.00   0.75   0.00  -0.66   0.00
0.65   0.00  -0.28   0.00  -0.71
0.65   0.00  -0.28   0.00   0.71
0.39   0.00   0.92   0.00   0.00
0.00   0.66   0.00   0.75   0.00
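As a rough check of the first column of U, the dominant left singular vector of A can be approximated with power iteration on A A^T. This swaps in a simpler technique than a full SVD (which a linear-algebra library would normally provide) and recovers only the first abstract concept; the function name and iteration count are assumptions.

```python
import math

# Term-document matrix A from the example (rows T1..T5, columns D1..D7).
A = [
    [0.00, 0.00, 0.56, 0.56, 0.00, 0.00, 1.00],
    [0.49, 0.71, 0.00, 0.00, 0.00, 0.71, 0.00],
    [0.49, 0.71, 0.00, 0.00, 0.00, 0.71, 0.00],
    [0.72, 0.00, 0.00, 0.00, 1.00, 0.00, 0.00],
    [0.00, 0.00, 0.83, 0.83, 0.00, 0.00, 0.00],
]

def dominant_left_singular_vector(a, iterations=300):
    """Power iteration on A A^T; returns a unit vector over the terms."""
    rows, cols = len(a), len(a[0])
    u = [1.0] * rows
    for _ in range(iterations):
        # v = A^T u, then u = A v: one multiplication by A A^T.
        v = [sum(a[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        u = [sum(a[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        norm = math.sqrt(sum(x * x for x in u))
        u = [x / norm for x in u]
    return u

print([round(x, 2) for x in dominant_left_singular_vector(A)])
```

The result should agree with the first column of U up to rounding: the strongest abstract concept loads on "Singular", "Value" and, more weakly, "Computations".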
P = (rows: terms T1 to T5; columns: phrases P1, P2 and single words T1 to T5)
                   P1    P2    T1    T2    T3    T4    T5
T1: Information    0.00  0.56  1.00  0.00  0.00  0.00  0.00
T2: Singular       0.71  0.00  0.00  1.00  0.00  0.00  0.00
T3: Value          0.71  0.00  0.00  0.00  1.00  0.00  0.00
T4: Computations   0.00  0.00  0.00  0.00  0.00  1.00  0.00
T5: Retrieval      0.00  0.83  0.00  0.00  0.00  0.00  1.00
Columns: P1: Singular value, P2: Information retrieval, T1: Information, T2: Singular, T3: Value, T4: Computations, T5: Retrieval
M = U_k^T P
(rows: abstract concepts; columns: phrases/single words)
                     P1    P2    T1    T2    T3    T4    T5
Abstract concept 1   0.92  0.00  0.00  0.65  0.65  0.39  0.00
Abstract concept 2   0.00  0.97  0.75  0.00  0.00  0.00  0.66
Columns: P1: Singular value, P2: Information retrieval, T1: Information, T2: Singular, T3: Value, T4: Computations, T5: Retrieval
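The product M = U_k^T P can be recomputed from the rounded values on the slides. The sketch below assumes k = 2 (the M shown has two rows) and picks, for each abstract concept, the phrase or word with the largest entry as its label; the helper names are hypothetical.

```python
LABELS = ["Singular value", "Information retrieval", "Information",
          "Singular", "Value", "Computations", "Retrieval"]

# First k = 2 columns of U, written as rows (that is, U_k transposed).
U_K_T = [
    [0.00, 0.65, 0.65, 0.39, 0.00],
    [0.75, 0.00, 0.00, 0.00, 0.66],
]

# P: terms T1..T5 (rows) by phrases P1, P2 and single words T1..T5 (columns).
P = [
    [0.00, 0.56, 1.00, 0.00, 0.00, 0.00, 0.00],
    [0.71, 0.00, 0.00, 1.00, 0.00, 0.00, 0.00],
    [0.71, 0.00, 0.00, 0.00, 1.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00, 0.00, 1.00, 0.00],
    [0.00, 0.83, 0.00, 0.00, 0.00, 0.00, 1.00],
]

def matmul(x, y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

M = matmul(U_K_T, P)
for row in M:
    best = max(range(len(row)), key=row.__getitem__)
    print(LABELS[best], round(row[best], 2))
```

Each row of M compares one abstract concept with every candidate label, so taking the row maximum selects "Singular value" and "Information retrieval" here.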
Last step
Prune overlapping label descriptions
Z^T Z
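One way to picture the pruning step: treat each candidate label as a term vector, compare labels pairwise (the cosine values below correspond to entries of Z^T Z when the columns of Z are length-normalized label vectors), and keep only the best-scoring label among those that overlap. The greedy strategy, the 0.3 threshold and the function names are assumptions of this sketch, not values taken from the slides.

```python
import math

def cosine(x, y):
    """Cosine similarity of two term vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

def prune_labels(candidates, threshold=0.3):
    """candidates: list of (label, score, term_vector). Returns the kept labels."""
    kept = []
    # Visit candidates from highest to lowest score; drop any that overlap a kept one.
    for label, score, vector in sorted(candidates, key=lambda c: c[1], reverse=True):
        if all(cosine(vector, v) <= threshold for _, _, v in kept):
            kept.append((label, score, vector))
    return [label for label, _, _ in kept]

candidates = [
    ("Singular value",        0.92, [0.00, 0.71, 0.71, 0.00, 0.00]),
    ("Singular",              0.65, [0.00, 1.00, 0.00, 0.00, 0.00]),
    ("Information retrieval", 0.97, [0.56, 0.00, 0.00, 0.00, 0.83]),
]
print(prune_labels(candidates))
```

Here "Singular" overlaps heavily with the higher-scoring "Singular value" and is discarded, while "Information retrieval" shares no terms with it and survives.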
STAGE 4/4: CLUSTER-CONTENT ALLOCATION
Similarity
Cluster Score
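A minimal sketch of content allocation on the running example: each document is compared with each cluster label, and documents whose similarity exceeds a threshold join that cluster. The 0.6 threshold, the use of a plain dot product on the (approximately unit-length) vectors, and the score formula "label score × number of documents" are assumptions made for this illustration.

```python
# Term-document matrix A from the example (rows T1..T5, columns D1..D7).
A = [
    [0.00, 0.00, 0.56, 0.56, 0.00, 0.00, 1.00],
    [0.49, 0.71, 0.00, 0.00, 0.00, 0.71, 0.00],
    [0.49, 0.71, 0.00, 0.00, 0.00, 0.71, 0.00],
    [0.72, 0.00, 0.00, 0.00, 1.00, 0.00, 0.00],
    [0.00, 0.00, 0.83, 0.83, 0.00, 0.00, 0.00],
]
DOCS = ["D1", "D2", "D3", "D4", "D5", "D6", "D7"]

# (label, label score, term vector), with scores and vectors taken from M and P.
LABELS = [
    ("Singular value",        0.92, [0.00, 0.71, 0.71, 0.00, 0.00]),
    ("Information retrieval", 0.97, [0.56, 0.00, 0.00, 0.00, 0.83]),
]

def allocate(a, doc_names, labels, threshold=0.6):
    """Assign documents to labels; returns (clusters, cluster_scores)."""
    clusters, assigned = {}, set()
    for label, _, vector in labels:
        members = []
        for j, name in enumerate(doc_names):
            similarity = sum(vector[i] * a[i][j] for i in range(len(vector)))
            if similarity > threshold:
                members.append(name)
                assigned.add(name)
        clusters[label] = members
    # Documents matching no label fall into a catch-all group.
    clusters["Other"] = [n for n in doc_names if n not in assigned]
    scores = {label: s * len(clusters[label]) for label, s, _ in labels}
    return clusters, scores

clusters, scores = allocate(A, DOCS, LABELS)
print(clusters)
print(scores)
```

A document may exceed the threshold for several labels, so clusters produced this way can overlap; documents that match none land in "Other".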
Evaluation and Results
Test Data
10 categories
4 subjects
Subject            # docs   Contents
Movies             77       Information about the Blade Runner movie
Movies             92       Information about the Lord of the Rings movie
Health Care        77       Orthopedic equipment and manufacturers
Photography        15       Infrared-photography references
Computer Science   27       Articles about data warehouses (integrator DBs)
Computer Science   42       MySQL database
Computer Science   15       Native XML databases
Computer Science   38       PostgreSQL database
Computer Science   39       Java programming language tutorials and guides
Computer Science   37       VI text editor
Identifier Merged Categories
G1 LRings, MySQL
G3 LRings, MySQL, Ortho, Infra
G5 MySQL, XMLDB, Dware, Postgr, JavaTut, Vi
G6 MySQL, XMLDB, Dware, Postgr, Ortho
Identifier Merged Categories
G1 Fan fiction/fan art, image galleries, MySQL, wallpapers, LOTR humour, links
G3 MySQL, news, information on infrared, image galleries, foot orthotics, Lord of the Rings, movie
G5 Java tutorial, Vim page, federated data warehouse, native XML database, Web, Postgresql database
G6 MySQL database, federated data warehouse, foot orthotics, orthopedic products, access Postgresql, Web
Cluster Contamination
Analytical evaluation:
LINGO vs. Suffix Tree Clustering
CONCLUSIONS
Future work
Pointer
Communication!
LINGO
Thank you.