A Powerful Principle for Automatically Finding Concepts in Unstructured Data

Holger Bast
Max-Planck-Institut für Informatik (MPII), Saarbrücken, Germany
Self-Star Workshop, Bertinoro, 2nd June 2004
Dimension Reduction: A Powerful Principle for Intelligent Search in Very Large Text Collections
One Type of Query: Searching for the Exact Terms

Query: self star

[Screenshot of search results for "self star":
– SELF-STAR: International Workshop on Self-* Properties in Complex Information Systems, 31 May - 2 June 2004, University of Bologna (www.cs.unibo.it/self-star/)
– Self-Star Registration (www.cs.unibo.it/self-star/register.html)
– CEO Forum Home page: the Teacher Preparation STaR Chart, a self-assessment tool for colleges of education (www.ceoforum.org/)]

This is easily automated (full-text index)
Another Type of Query: Searching for What Is Behind the Words

Query: join two text boxes in power point

[Screenshot of search results, none of them relevant:
– Church Media Community forum thread on PowerPoint 2003 (www.churchmedia.net/community/forum/showthread.php)
– Flash tutorial on text boxes (www.tutorialfind.com/tutorials/internet/flash/)
– [DOC] Microsoft PowerPoint 2000 classroom tutorial (www.microsoft.com/education/DOWNLOADS/tutorials/classroom/office2k/ppt2000.doc)]

How to improve on this in a self-star fashion?
[Diagram: a query and a document, both expressed in terms, are mapped to the same query and document expressed in concepts; each concept is in turn expressed in terms]
The Idea of Dimension Reduction

Example document: "Hawaii, 2nd June 2004. Dear Pen Pal, I am writing to you from Hawaii. They have got internet access right on the beach here, isn't that great? I'll go surfing now! Your friend, CB"

Term-document matrix A:

  internet  0 2 0 1 0 0
  web       2 1 0 0 0 0
  surfing   1 1 0 1 1 1
  beach     0 0 1 1 1 1
  hawaii    0 0 2 2 2 1

By matrix multiplication, A is approximated as a term-concept matrix (concepts "WWW" and "Hawaii") times a concept-document matrix:

            WWW Hawaii
  internet   2    0        WWW     1 1 0 .5 0 0
  web        2    0    ·   Hawaii  0 0 1 .5 1 1
  surfing    1    1
  beach      0    1
  hawaii     0    2

The product, a rank-2 approximation of A:

  internet  2 2 0 1  0 0
  web       2 2 0 1  0 0
  surfing   1 1 1 1  1 1
  beach     0 0 1 .5 1 1
  hawaii    0 0 2 1  2 2

The approximation actually adds to the precision. (Annotation on the original matrix: equally dissimilar to query!)
Finding concepts = approximate low-rank matrix decomposition
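The low-rank decomposition above can be computed with a truncated singular value decomposition, the machinery behind LSI. A minimal sketch in Python/NumPy, reusing the term-document matrix from the Hawaii example (the function name `low_rank_approximation` is ours):

```python
import numpy as np

# Term-document matrix from the Hawaii example (terms: internet,
# web, surfing, beach, hawaii; columns: six documents).
A = np.array([
    [0, 2, 0, 1, 0, 0],
    [2, 1, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
    [0, 0, 2, 2, 2, 1],
], dtype=float)

def low_rank_approximation(M, k):
    """Best rank-k approximation of M in the Frobenius norm
    (Eckart-Young), obtained by truncating the SVD."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

A2 = low_rank_approximation(A, 2)   # two "concepts"

# The error of the best rank-2 approximation is the square root of
# the sum of the squared discarded singular values.
err = np.linalg.norm(A - A2, "fro")
print(f"rank-2 approximation error: {err:.3f}")
```

The SVD factors are not the integer concept matrices shown on the slide, but they span the same kind of low-dimensional concept space.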
A Concrete Example

676 abstracts from the Max-Planck-Institut für Informatik
– for example: "We present two theoretically interesting and empirically successful techniques for improving the linear programming approaches, namely graph transformation and local cuts, in the context of the Steiner problem. We show the impact of these techniques on the solution of the largest benchmark instances ever solved."
– 3283 words (words like and, or, this, … removed)
– abstracts come from 5 working groups: Algorithms, Logic, Graphics, CompBio, Databases
– reduce to 10 concepts
No dictionary, no training, only the plain text itself !
Once More: the Self-Star Issues

Is there a valid scientific basis for self-star computing?
– Apparently yes!
What are some real problems that have been solved?
– Large-scale concept-based search!
Are there any negative results? What are the limits?
– A little human guidance is necessary, but feasible!
What is there left to do?
– The amount of hand-tuning required is still significant.
– A better understanding of why it works will help!
Thank you!
Why is Large-Scale Search Self-Star?

I was asked to submit a paper
The first search "engine", Yahoo, was a hand-made directory
The index-building of state-of-the-art engines, like Google, is a complicated yet highly automated and self-organising process
– which pages to crawl?
– which terms to index?
– …
Conclusions

Extracting sensible concepts by dimension reduction works surprisingly well in practice
– But it is not really understood why
– Lots of theoretical open problems!
For real applications, some amount of external knowledge has to be put in
– But how to integrate that?
– A very practical open problem!
These two are connected!
Overview

A major problem in text search
A way to deal with it
A demonstration that this works well
Relations to self-star
Open questions
For this talk I will focus on text search
This talk: plain text
DELIS WP 3.1 Relevance

The DELIS example (105 documents, 1182 words) took 4 minutes to compute
– on my notebook (Intel PM, 1.6 GHz)
– with a careful implementation
For 10,000,000 documents and 1,000,000 words, this extrapolates to 300 years
A very large and nonlinear optimization problem, but there is no need to solve it exactly
Find a simple approximation algorithm that provably performs well and scales
Comparing Methods

Fundamental question: which method is how good under which circumstances?
Few theoretically founded answers to this question
– seminal paper: A Probabilistic Analysis of Latent Semantic Indexing, Papadimitriou, Raghavan, Tamaki, Vempala, PODS'98 (ten years after LSI was born!)
– follow-up paper: Spectral Analysis of Data, Azar, Fiat, Karlin, McSherry, Saia, STOC'01
– main statement: LSI is robust against addition of (how much?) noise
Why does LSI work so well?

A good method should produce
– small angles between documents on similar topics
– large angles between documents on different topics
A formula for angles in the reduced space:
– Let D = C·G, and let c1', …, ck' be the images of the concepts under LSI
– Then the k×k dot products ci'·cj' are given by the matrix (G·G^T)^-1
– That is, pairwise angles are ≥ 90 degrees if and only if (G·G^T)^-1 has nonpositive off-diagonal entries (an M-matrix)
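As a quick numerical check of this criterion, one can take a concrete G and inspect the signs of (G·G^T)^-1. A small sketch, reusing the concept-document matrix of the Hawaii toy example as G (this choice of G is ours, not from the slide):

```python
import numpy as np

# Concept-document matrix G: two concepts (WWW, Hawaii) over six
# documents, taken from the Hawaii toy example earlier in the talk.
G = np.array([
    [1, 1, 0, 0.5, 0, 0],
    [0, 0, 1, 0.5, 1, 1],
])

# By the formula on the slide, the pairwise dot products of the
# concept images under LSI are the entries of (G G^T)^-1.
gram = np.linalg.inv(G @ G.T)
print(gram)

# The off-diagonal entries are nonpositive, so (G G^T)^-1 is an
# M-matrix here: the two concept images form an angle >= 90 degrees.
```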
Polysemy and Simonymy

Let Tij be the dot product of the i-th with the j-th row of a term-document matrix (~ co-occurrence of terms i and j)
– Call term k a polysem if there exist terms i and j such that for some t, Tik, Tjk ≥ t but Tij < t
– Two terms i and j are simonyms if Tij ≥ Tii or Tij ≥ Tjj
Without polysems and simonyms we have
1. Tij ≥ min(Tik, Tjk) for all i, j, k
2. Tii > Tij for all j ≠ i
A symmetric matrix (Tij) with properties 1. and 2. is called strictly ultrametric
Help from Linear Algebra

Theorem [Martínez, Michon, San Martín 1994]: The inverse of a strictly ultrametric matrix is an M-matrix, i.e. its diagonal entries are positive and its off-diagonal entries are nonpositive

[Slide shows a 3×3 numerical example: a strictly ultrametric matrix and its inverse, an M-matrix]
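A small self-contained illustration of the theorem (the 3×3 matrix below is our own example, chosen to satisfy conditions 1. and 2. from the previous slide; it is not the matrix shown on the slide):

```python
import numpy as np

# A symmetric matrix satisfying the strict ultrametric conditions:
# T[i,j] >= min(T[i,k], T[j,k]) for all i, j, k, and T[i,i] > T[i,j]
# for j != i. (Example matrix chosen by us, not from the slide.)
T = np.array([
    [3.0, 2.0, 1.0],
    [2.0, 3.0, 1.0],
    [1.0, 1.0, 2.0],
])

# Verify the two conditions explicitly.
n = len(T)
for i in range(n):
    for j in range(n):
        for k in range(n):
            assert T[i, j] >= min(T[i, k], T[j, k])
        if i != j:
            assert T[i, i] > T[i, j]

# By the Martinez-Michon-San Martin theorem, the inverse must be an
# M-matrix: positive diagonal, nonpositive off-diagonal entries.
Tinv = np.linalg.inv(T)
print(np.round(Tinv, 4))
```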
A New LSI Theorem

Theorem: If D can be well approximated by a set of concepts free from polysemy and simonymy, then in the reduced LSI space these concepts form large pairwise angles.
Beware: This only holds for the original LSI, not for its widely used variant!
Question: How can we check whether such a set exists? This would yield a method for selecting the optimal (reduced) dimension!
Exploiting Link Structure

Achlioptas, Fiat, Karlin, McSherry (FOCS'01):
– documents have a topic (implicit in the distribution of terms)
– and a quality (implicit in the link structure)
– represent each document by a vector
  direction corresponds to the topic
  length corresponds to the quality
– Goal: for a given query, rank documents by their dot product with the topic of the query
Model Details

Underlying parameters
– A = [A1 … An] authority topics, one per document
– H = [H1 … Hn] hub topics, one per document
– C = [C1 … Ck] translates topics to terms
– q = [q1 … qk] query topic
The input we see
– D ~ A·C + H·C   (term-document matrix)
– L ~ H^T·A   (link matrix)
– Q ~ q·C   (query terms)
Goal: recover the ordering of A1·q, …, An·q
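A toy generative sketch of this model (the dimensions, the uniform sampling, and any names beyond A, H, C, q are our assumptions; the model itself only fixes the shapes and the products):

```python
import numpy as np

rng = np.random.default_rng(0)

n, k, m = 8, 3, 12   # documents, topics, terms (toy sizes, our choice)

# Underlying parameters: per-document authority/hub topic vectors,
# a topic-to-term translation matrix, and a query topic.
A = rng.random((k, n))   # authority topics, one column per document
H = rng.random((k, n))   # hub topics, one column per document
C = rng.random((k, m))   # translates topics to terms
q = rng.random(k)        # query topic

# The observed input consists of (noisy versions of) these products:
D = A.T @ C + H.T @ C    # term-document matrix (documents x terms)
L = H.T @ A              # link matrix: L[i, j] ~ hub(i) . authority(j)
Q = q @ C                # query term vector

# The goal is to recover the ordering of the documents by A_i . q.
target_order = np.argsort(-(A.T @ q))
print(target_order)
```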
Model – Problems

Link matrix generation L ~ H^T·A
– is OK, because the presence of a link is related to the hub/authority value
Term-document matrix generation D ~ A·C + H·C
– very unrealistic: the term distribution gives information on the topic, but not on the quality!
– more realistic: D ~ A0·C + H0·C, where A0 and H0 contain the normalized columns of A and H
So far, we could solve the special case where A differs from H by only a diagonal matrix (i.e. hub topic = authority topic)
Perspective

Strong theoretical foundations
– unifying framework + comparative analysis for a large variety of dimension reduction methods
– realistic models + performance guarantees
Make proper use of human intelligence
– integrate explicit knowledge
– but only as much as required (automatic detection)
– combine dimension reduction methods with interactive schemes (e.g. phrase browsing)
The End!
Specific Methods

Latent Semantic Indexing (LSI) [Dumais et al. '89]
– orthogonal concepts c1, …, ck
– the span of c1, …, ck is the k-dimensional subspace that minimizes the squared distances
Probabilistic Latent Semantic Indexing (PLSI) [Hofmann '99]
– find the stochastic matrix of rank k that maximizes the probability that the given matrix is an instance
Concept Indexing (CI) [Karypis & Han '00]
– c1, …, ck = centroid vectors of a k-clustering
– documents = projections onto these centroids
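A minimal sketch of the CI representation (the 2-clustering of the documents is a hand-picked toy assignment; a real implementation would compute the k-clustering, e.g. with a k-means variant):

```python
import numpy as np

# Term-document matrix from the Hawaii example (5 terms x 6 documents).
A = np.array([
    [0, 2, 0, 1, 0, 0],
    [2, 1, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
    [0, 0, 2, 2, 2, 1],
], dtype=float)

# Toy 2-clustering of the documents (our assumption, not computed):
# documents 0-1 are "WWW" documents, 2-5 are "Hawaii" documents.
clusters = [[0, 1], [2, 3, 4, 5]]

# Concepts c1, ..., ck = centroid vectors of the k-clustering.
centroids = np.stack([A[:, c].mean(axis=1) for c in clusters])

# Documents = projections onto the (normalized) centroids, i.e. each
# document is represented by k coordinates instead of m term counts.
unit = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
reduced = unit @ A   # k x n matrix of document coordinates

print(reduced.shape)
```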
Dimension Reduction Methods

Main idea: the high-dimensional space of objects is a variant of an underlying low-dimensional space
Formally: given an m×n matrix, possibly of full rank, find the best low-rank approximation

[Slide shows a term-document matrix over the terms car, automobile, search, engine, web, overlaid with its low-rank approximation]
I will talk about …

Dimension reduction techniques
– some methods
– a new theorem
Exploiting link structure
– state of the art
– some new ideas
Perspective
Overview

Exploiting the link structure
– Google, HITS, SmartyPants
– Trawling
Semantic Web
– XML, XML Schema
– RDF, DAML+OIL
Interactive browsing
– Scatter/Gather
– Phrase Browsing
Scatter/Gather

Cutting, Karger, Pedersen, Tukey, SIGIR'92
Motivation: zooming into a large document collection
Realisation: geometric clustering
Challenge: extremely fast algorithms required, in particular
– linear-time preprocessing
– constant-time query processing
Example: New York Times News Service, articles from August 1990 (~5000 articles, 30 MB of text)
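One scatter/gather round can be sketched with plain k-means as the geometric clustering step (the random points, the value of k, and Lloyd's algorithm are our stand-ins; the actual system clusters text with much faster linear-time methods):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy document vectors (e.g. rows of a reduced term-document matrix).
# Real Scatter/Gather works on text; random points are our stand-in.
docs = rng.random((200, 5))

def kmeans(X, k, iters=20):
    """Plain k-means (Lloyd's algorithm) as the geometric
    clustering step of one scatter round."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each center to the mean of its cluster
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# Scatter: cluster the whole collection and show cluster summaries.
labels, centers = kmeans(docs, k=4)

# Gather: the user selects some clusters; their union is re-scattered,
# zooming further into that part of the collection.
selected = np.isin(labels, [0, 2])
sub_labels, _ = kmeans(docs[selected], k=4)
print(len(docs[selected]), "documents re-scattered")
```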
Scatter/Gather – Example

taken from Cutting, Karger, Pedersen, Tukey, Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections, © 1992 ACM SIGIR
Phrase Browsing

Nevill-Manning, Witten, Moffat, 1997
Formulating a good query requires more or less knowledge of the document collection
– if less, fine
– if more, interaction is a must
Build a hierarchy of phrases
Example: http://www.nzdl.org/cgi-bin/library
Challenge: fast algorithms for finding a minimal grammar, e.g. for S → babaabaabaa
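The minimal-grammar challenge can be illustrated with a greedy Re-Pair-style scheme that repeatedly replaces the most frequent adjacent pair by a new nonterminal (a simplification of the actual Sequitur/minimal-grammar algorithms; the rule names R0, R1, … are ours):

```python
from collections import Counter

def repair(text):
    """Greedy Re-Pair-style compression: repeatedly replace the most
    frequent adjacent pair with a fresh nonterminal. Returns the final
    sequence and the grammar rules."""
    seq = list(text)
    rules = {}
    next_id = 0
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break
        nt = f"R{next_id}"
        next_id += 1
        rules[nt] = pair
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(nt)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules

def expand(symbol, rules):
    """Expand a symbol back into the original characters."""
    if symbol in rules:
        a, b = rules[symbol]
        return expand(a, rules) + expand(b, rules)
    return symbol

seq, rules = repair("babaabaabaa")
# The grammar reproduces the original string exactly.
assert "".join(expand(s, rules) for s in seq) == "babaabaabaa"
print(seq, rules)
```

On babaabaabaa this yields R0 → b a and R1 → R0 a, i.e. the phrase "baa", mirroring the phrase-hierarchy idea.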
Teoma

More refined concept of authoritativeness, depending on the specific query ("subject-specific popularity")
More sophisticated query refinement
But: coverage is only 10% of that of Google
Example: http://www.teoma.com