A Powerful Principle for Automatically Finding Concepts in Unstructured Data Holger Bast...

Dimension Reduction: A Powerful Principle for Automatically Finding Concepts in Unstructured Data. Holger Bast, Max-Planck-Institut für Informatik (MPII), Saarbrücken, Germany. Self-Star Workshop, Bertinoro, 2nd June 2004.



Page 1: A Powerful Principle for Automatically Finding Concepts in Unstructured Data Holger Bast Max-Planck-Institut…

Dimension Reduction
A Powerful Principle for Automatically Finding Concepts in Unstructured Data

Holger Bast
Max-Planck-Institut für Informatik (MPII)
Saarbrücken, Germany

Self-Star Workshop, Bertinoro, 2nd June 2004

Page 2: A Powerful Principle for Automatically Finding Concepts in Unstructured Data Holger Bast Max-Planck-Institut…

Dimension Reduction
A Powerful Principle for Intelligent Search in Very Large Text Collections

Holger Bast
Max-Planck-Institut für Informatik (MPII)
Saarbrücken, Germany

Self-Star Workshop, Bertinoro, 2nd June 2004

Page 3: A Powerful Principle for Automatically Finding Concepts in Unstructured Data Holger Bast Max-Planck-Institut…

One Type of Query: Searching for the exact terms

Search: self star

SELF-STAR: Self-* Properties in Complex Information Systems
SELF-STAR: International Workshop on Self-* Properties in Complex Information Systems 31 May - 2 June 2004 University of Bologna ...
www.cs.unibo.it/self-star/ - 14k - 26 May 2004 - Cached - Similar pages

Self-Star Registration
SELF-STAR: International Workshop on Self-* Properties in Complex Information Systems 31 ... registration. Back to the Self-Star Home page.
www.cs.unibo.it/self-star/register.html - 7k - Cached - Similar pages

CEO Forum Home page
The Teacher Preparation STaR Chart, a self-assessment tool for colleges of education is available! Or, use the interactive Teacher ...
www.ceoforum.org/ - 6k - Cached - Similar pages

This is easily automated (full text index)
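The remark that exact-term search is easily automated with a full text index can be illustrated with a minimal inverted index. This is only a sketch: the function names, the tokenizer, and the three toy documents (shortened from the result snippets on this slide) are assumptions made for illustration, not part of the talk.

```python
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    terms, current = [], []
    for ch in text.lower():
        if ch.isalnum():
            current.append(ch)
        elif current:
            terms.append("".join(current))
            current = []
    if current:
        terms.append("".join(current))
    return terms

def build_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return the sorted ids of the documents containing every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    hits = set(index.get(terms[0], set()))
    for term in terms[1:]:
        hits &= index.get(term, set())
    return sorted(hits)

# Hypothetical toy collection, shortened from the snippets on the slide.
docs = {
    "self-star-home": "SELF-STAR: Self-* Properties in Complex Information Systems",
    "self-star-register": "Self-Star Registration. Back to the Self-Star Home page.",
    "ceo-forum": "The Teacher Preparation STaR Chart, a self-assessment tool",
}
index = build_index(docs)
print(search(index, "self star"))
```

All three toy documents match, including the CEO Forum page, which contains both terms but has nothing to do with the workshop. Exact-term matching finds words, not what is behind them.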

Page 4: A Powerful Principle for Automatically Finding Concepts in Unstructured Data Holger Bast Max-Planck-Institut…

Another Type of Query: Searching for what is behind words

Search: join two text boxes in power point

Church Media Community - Power point 2003? - Any point in jumping ...
... Two interesting facts about the new PowerPoint Viewer is that it ... New Church Media Member. Nominate Now. Join Date: Oct ... so you can't directly place text over the ...
www.churchmedia.net/community/forum/showthread.php - 77k - Cached - Similar Pages

Flash < Internet < tutorialfind
... Here is the actionscript in two easy steps ... Then join Mike as he walks you through tweening, morphing, and more ... Simple Use of Text Boxes. Use Text boxes to gather ...
www.tutorialfind.com/tutorials/internet/flash/ - 92k - Cached - Similar Pages

[DOC] Introduction
... set up reminders and click a button to join the broadcast ... There are two ways to change colors: use a preset color ... and then make any changes to the text like you ...
www.microsoft.com/education/DOWNLOADS/tutorials/classroom/office2k/ppt2000.doc - Similar Pages

How to improve on this in a self-star fashion?

Page 5: A Powerful Principle for Automatically Finding Concepts in Unstructured Data Holger Bast Max-Planck-Institut…

The Idea of Dimension Reduction

A document:
Hawaii, 2nd June 2004
Dear Pen Pal,
I am writing to you from Hawaii. They have got internet access right on the beach here, isn’t that great? I’ll go surfing now!
your friend, CB

Documents expressed in terms (one column per document):
internet   0  2  0  1   0  0
web        2  1  0  0   0  0
surfing    1  1  0  1   1  1
beach      0  0  1  1   1  1
hawaii     0  0  2  2   2  1

A query expressed in terms: (1, 0, 0, 0, 0)
Equally dissimilar to query!

Concepts expressed in terms (one column per concept):
           WWW  Hawaii
internet    2     0
web         2     0
surfing     1     1
beach       0     1
hawaii      0     2

Documents expressed in concepts:
WWW        1  1  0  .5  0  0
Hawaii     0  0  1  .5  1  1

Query expressed in concepts: (1, 0)

Matrix multiplication of the two gives the approximation:
internet   2  2  0  1   0  0
web        2  2  0  1   0  0
surfing    1  1  1  1   1  1
beach      0  0  1  .5  1  1
hawaii     0  0  2  1   2  2

The approximation actually adds to the precision

Finding concepts = approximate low-rank matrix decomposition
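The numbers on this slide can be checked with a few lines of code. The sketch below multiplies the concept matrix by the documents-in-concepts matrix and scores the query in both spaces. The variable names, and the reading of the vector (1, 0, 0, 0, 0) as the one-word query "internet", are assumptions made for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def column(m, j):
    return [row[j] for row in m]

terms = ["internet", "web", "surfing", "beach", "hawaii"]

# Original term-document matrix (rows = terms, columns = documents).
D = [[0, 2, 0, 1, 0, 0],
     [2, 1, 0, 0, 0, 0],
     [1, 1, 0, 1, 1, 1],
     [0, 0, 1, 1, 1, 1],
     [0, 0, 2, 2, 2, 1]]

# Concepts expressed in terms (columns: WWW, Hawaii).
C = [[2, 0], [2, 0], [1, 1], [0, 1], [0, 2]]

# Documents expressed in concepts (rows: WWW, Hawaii).
G = [[1, 1, 0, 0.5, 0, 0],
     [0, 0, 1, 0.5, 1, 1]]

approx = matmul(C, G)  # the rank-2 approximation shown on the slide

query_terms = [1, 0, 0, 0, 0]  # assumed reading: the query "internet"
query_concepts = [1, 0]        # the same query expressed in concepts

term_scores = [dot(query_terms, column(D, j)) for j in range(6)]
concept_scores = [dot(query_concepts, column(G, j)) for j in range(6)]
print(term_scores)     # [0, 2, 0, 1, 0, 0]
print(concept_scores)  # [1, 1, 0, 0.5, 0, 0]
```

In term space the first document, which only mentions "web", scores 0 against the query, exactly like the pure Hawaii documents. In concept space it scores as high as the second document, which is one way to read the remark that the approximation adds to the precision.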

Page 6: A Powerful Principle for Automatically Finding Concepts in Unstructured Data Holger Bast Max-Planck-Institut…

A Concrete Example

676 abstracts from the Max-Planck-Institute
– for example: We present two theoretically interesting and empirically successful techniques for improving the linear programming approaches, namely graph transformation and local cuts, in the context of the Steiner problem. We show the impact of these techniques on the solution of the largest benchmark instances ever solved.
– 3283 words (words like and, or, this, … removed)
– abstracts come from 5 working groups: Algorithms, Logic, Graphics, CompBio, Databases
– reduce to 10 concepts

No dictionary, no training, only the plain text itself!

Page 7: A Powerful Principle for Automatically Finding Concepts in Unstructured Data Holger Bast Max-Planck-Institut…

The Idea of Dimension Reduction

           WWW  Hawaii
internet    2     0
web         2     0
surfing     1     1
beach       0     1
hawaii      0     2

●

WWW        1  1  0  .5  0  0
Hawaii     0  0  1  .5  1  1

=

internet   2  2  0  1   0  0
web        2  2  0  1   0  0
surfing    1  1  1  1   1  1
beach      0  0  1  .5  1  1
hawaii     0  0  2  1   2  2

Page 8: A Powerful Principle for Automatically Finding Concepts in Unstructured Data Holger Bast Max-Planck-Institut…

Once More: the Self-Star Issues

Is there a valid scientific basis for self-star computing?
– Apparently yes!
What are some real problems that have been solved?
– Large-scale concept-based search!
Are there any negative results? What are the limits?
– A little human guidance is necessary, but feasible!
What is there left to do?
– The amount of hand-tuning required is still significant.
– Better understanding why it works will help!

Page 9: A Powerful Principle for Automatically Finding Concepts in Unstructured Data Holger Bast Max-Planck-Institut…

Thank you!

Page 10: A Powerful Principle for Automatically Finding Concepts in Unstructured Data Holger Bast Max-Planck-Institut…

Why is Large-Scale Search Self-Star?

I was asked to submit a paper
The first search “engine”, Yahoo, was a hand-made directory
The index-building of state-of-the-art engines, like Google, is a complicated yet highly automated and self-organising process
– which pages to crawl?
– which terms to index?
– …

Page 11: A Powerful Principle for Automatically Finding Concepts in Unstructured Data Holger Bast Max-Planck-Institut…

Conclusions

Extracting sensible concepts by dimension reduction works surprisingly well in practice
– But it is not really understood why
– Lots of theoretical open problems!

For real applications, some amount of external knowledge has to be input
– But how to integrate that?
– Very practical open problem!

These two are connected!!

Page 12: A Powerful Principle for Automatically Finding Concepts in Unstructured Data Holger Bast Max-Planck-Institut…

Overview

A major problem in text search
A way to deal with it
A demonstration that this works well
Relations to self-star
Open questions

For this talk I will focus on text search

Page 13: A Powerful Principle for Automatically Finding Concepts in Unstructured Data Holger Bast Max-Planck-Institut…

Search

self star

join two text boxes in power point

This talk: plain text

Page 14: A Powerful Principle for Automatically Finding Concepts in Unstructured Data Holger Bast Max-Planck-Institut…

DELIS WP 3.1 Relevance

The DELIS example (105 documents, 1182 words) took 4 minutes to compute
– on my notebook (Intel PM, 1.6GHz)
– with a careful implementation
For 10,000,000 documents and 1,000,000 words, this extrapolates to 300 years
Very large & nonlinear optimization problem, but no need to solve exactly
Find simple approximation algorithm that provably performs well and scales

Page 15: A Powerful Principle for Automatically Finding Concepts in Unstructured Data Holger Bast Max-Planck-Institut…

Comparing Methods

Fundamental question: which method is how good under which circumstances?
Few theoretically founded answers to this question
– seminal paper: A Probabilistic Analysis of Latent Semantic Indexing, Papadimitriou, Raghavan, Tamaki, Vempala, PODS’98 (ten years after LSI was born!)
– follow-up paper: Spectral Analysis of Data, Azar, Fiat, Karlin, McSherry, Saia, STOC’01
– main statement: LSI is robust against addition of (how much?) noise

Page 16: A Powerful Principle for Automatically Finding Concepts in Unstructured Data Holger Bast Max-Planck-Institut…

Why does LSI work so well?

A good method should produce
– small angles between documents on similar topics
– large angles between documents on different topics
A formula for angles in the reduced space:
– Let D = C·G, and let c1’,…,ck’ be the images of the concepts under LSI
– Then the k×k dot products ci’·cj’ are given by the matrix (G·Gᵀ)⁻¹
– That is, pairwise angles are ≥ 90 degrees if and only if (G·Gᵀ)⁻¹ has nonpositive off-diagonal entries (M-matrix)
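One way to see where the formula above comes from is sketched below. It rests on assumptions that the slide does not state: D has rank exactly k with singular value decomposition D = UΣVᵀ, the matrix C has full column rank, G has full row rank, and the original LSI maps a term vector x to Σ⁻¹Uᵀx. The superscript + denotes the Moore-Penrose pseudoinverse.

```latex
\begin{align*}
D &= C\,G = U \Sigma V^{T}
  && \text{(rank-$k$ SVD of $D$)} \\
c_i' &= \Sigma^{-1} U^{T} c_i
  && \text{(image of concept $c_i$ under LSI, assumed)} \\
\bigl(c_i' \cdot c_j'\bigr)_{i,j}
  &= C^{T} U \Sigma^{-2} U^{T} C
   = C^{T} \bigl(D D^{T}\bigr)^{+} C \\
  &= C^{T} \bigl(C\,G G^{T} C^{T}\bigr)^{+} C
   = \bigl(G\,G^{T}\bigr)^{-1}
\end{align*}
```

The last step uses that, for C of full column rank and an invertible matrix M, the pseudoinverse of C·M·Cᵀ is (C⁺)ᵀ·M⁻¹·C⁺, together with C⁺·C = I.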

Page 17: A Powerful Principle for Automatically Finding Concepts in Unstructured Data Holger Bast Max-Planck-Institut…

Polysemy and Synonymy

Let Tij be the dot product of the i-th with the j-th row of a term-document matrix (~ co-occurrence of terms i and j)
– Call term k a polysem if there exist terms i and j such that for some t, Tik, Tjk ≥ t but Tij < t
– Two terms i and j are synonyms if Tij ≥ Tii or Tij ≥ Tjj
Without polysems and synonyms we have
1. Tij ≥ min(Tik,Tjk) for all i,j,k
2. Tii > Tij for all j≠i
A symmetric matrix (Tij) with 1. and 2. is called strictly ultrametric
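The two definitions above translate directly into code. The following sketch computes the co-occurrence matrix T from a small term-document matrix and lists polysems and synonym pairs. The toy matrix, the function names, and the restriction to pairwise distinct i, j, k in the polysem test are assumptions made for illustration.

```python
from itertools import combinations

def cooccurrence(rows):
    """T[i][j] = dot product of term rows i and j of the term-document matrix."""
    n = len(rows)
    return [[sum(a * b for a, b in zip(rows[i], rows[j])) for j in range(n)]
            for i in range(n)]

def polysems(T):
    """Indices k with T[i][k], T[j][k] >= t but T[i][j] < t for some i, j, t."""
    n = len(T)
    found = []
    for k in range(n):
        others = [i for i in range(n) if i != k]
        # Such a threshold t exists exactly when T[i][j] < min(T[i][k], T[j][k]).
        if any(T[i][j] < min(T[i][k], T[j][k])
               for i, j in combinations(others, 2)):
            found.append(k)
    return found

def synonyms(T):
    """Pairs (i, j) with i < j and T[i][j] >= T[i][i] or T[i][j] >= T[j][j]."""
    n = len(T)
    return [(i, j) for i, j in combinations(range(n), 2)
            if T[i][j] >= T[i][i] or T[i][j] >= T[j][j]]

# Hypothetical toy data: four terms over six documents.
terms = ["internet", "web", "surfing", "beach"]
rows = [[1, 1, 1, 0, 0, 0],   # internet
        [1, 1, 1, 0, 0, 0],   # web: occurs in the same documents as "internet"
        [1, 0, 0, 1, 0, 0],   # surfing: occurs with both topics
        [0, 0, 0, 1, 1, 1]]   # beach
T = cooccurrence(rows)
print([terms[k] for k in polysems(T)])                   # ['surfing']
print([(terms[i], terms[j]) for i, j in synonyms(T)])    # [('internet', 'web')]
```

Here "surfing" co-occurs with both "internet" and "beach", which never co-occur with each other, so it is flagged as a polysem; "internet" and "web" always occur together and are flagged as synonyms.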

Page 18: A Powerful Principle for Automatically Finding Concepts in Unstructured Data Holger Bast Max-Planck-Institut…

Help from Linear Algebra

Theorem [Martinez, Michon, San Martin 1994]: The inverse of a strictly ultrametric matrix is an M-matrix, i.e. its diagonal entries are positive and its off-diagonal entries are nonpositive

$$
\begin{pmatrix} 3.2 & 2.1 & 1.5 \\ 2.1 & 4.6 & 1.7 \\ 1.5 & 1.7 & 5.3 \end{pmatrix}^{-1}
\approx
\begin{pmatrix} 0.47 & -0.19 & -0.07 \\ -0.19 & 0.32 & -0.05 \\ -0.07 & -0.05 & 0.23 \end{pmatrix}
$$
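The statement of the theorem can be tried out on a small example. The sketch below checks conditions 1 and 2 from the previous slide, inverts the matrix exactly with rational arithmetic, and tests the sign pattern that the slide uses to describe an M-matrix. The 3×3 matrix and all function names are assumptions chosen for illustration; it is a spot check, not a proof.

```python
from fractions import Fraction

def is_strictly_ultrametric(T):
    """Symmetric, T[i][j] >= min(T[i][k], T[j][k]), and T[i][i] > T[i][j]."""
    n = len(T)
    symmetric = all(T[i][j] == T[j][i] for i in range(n) for j in range(n))
    cond1 = all(T[i][j] >= min(T[i][k], T[j][k])
                for i in range(n) for j in range(n) for k in range(n))
    cond2 = all(T[i][i] > T[i][j]
                for i in range(n) for j in range(n) if i != j)
    return symmetric and cond1 and cond2

def inverse(M):
    """Invert a nonsingular square matrix exactly by Gauss-Jordan elimination."""
    n = len(M)
    aug = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(M)]
    for col in range(n):
        # Pick a row with a nonzero pivot (raises StopIteration if singular).
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def has_m_matrix_signs(M):
    """Positive diagonal entries and nonpositive off-diagonal entries."""
    n = len(M)
    return all(M[i][j] > 0 if i == j else M[i][j] <= 0
               for i in range(n) for j in range(n))

# Hypothetical strictly ultrametric example.
T = [[3, 1, 1],
     [1, 3, 2],
     [1, 2, 3]]
inv = inverse(T)
print(is_strictly_ultrametric(T), has_m_matrix_signs(inv))  # True True
```

For this matrix the exact inverse is 1/13 times the matrix with rows (5, -1, -1), (-1, 8, -5), (-1, -5, 8), which has the sign pattern claimed by the theorem.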

Page 19: A Powerful Principle for Automatically Finding Concepts in Unstructured Data Holger Bast Max-Planck-Institut…

A new LSI theorem

Theorem: If D can be well approximated by a set of concepts free from polysemy and synonymy, then in the reduced LSI-space these concepts form large pairwise angles.
Beware: This only holds for the original LSI, not for its widely used variant!
Question: How can we check whether such a set exists? This would yield a method for selecting the optimal (reduced) dimension!

Page 20: A Powerful Principle for Automatically Finding Concepts in Unstructured Data Holger Bast Max-Planck-Institut…

Exploiting Link Structure

Achlioptas, Fiat, Karlin, McSherry (FOCS’01):
– documents have a topic (implicit in the distribution of terms)
– and a quality (implicit in the link structure)
– represent each document by a vector: direction corresponds to the topic, length corresponds to the quality
– Goal: for a given query, rank documents by their dot product with the topic of the query

Page 21: A Powerful Principle for Automatically Finding Concepts in Unstructured Data Holger Bast Max-Planck-Institut…

Model details

Underlying parameters
– A = [A1 … An]   authority topics, one per doc.
– H = [H1 … Hn]   hub topics, one per doc.
– C = [C1 … Ck]   translates topics to terms
– q = [q1 … qk]   query topic
The input we see
– D   A·C + H·C   term document matrix
– L   Hᵀ·A        link matrix
– Q   q·C         query terms
Goal: recover ordering of A1·q,…,An·q

Page 22: A Powerful Principle for Automatically Finding Concepts in Unstructured Data Holger Bast Max-Planck-Institut…

Model - Problems

Link matrix generation L Hᵀ·A
– is ok, because the presence of a link is related to the hub/authority value
Term document matrix generation D A·C + H·C
– very unrealistic: the term distribution gives information on the topic, but not on the quality!
– more realistic: D A0·C + H0·C, where A0 and H0 contain the normed columns of A and H
So far, we could solve the special case where A differs from H by only a diagonal matrix (i.e. hub topic = authority topic)

Page 23: A Powerful Principle for Automatically Finding Concepts in Unstructured Data Holger Bast Max-Planck-Institut…

Perspective

Strong theoretical foundations
– unifying framework + comparative analysis for large variety of dimension reduction methods
– realistic models + performance guarantees
Make proper use of human intelligence
– integrate explicit knowledge
– but only as much as required (automatic detection)
– combine dimension reduction methods with interactive schemes (e.g. phrase browsing)

Page 24: A Powerful Principle for Automatically Finding Concepts in Unstructured Data Holger Bast Max-Planck-Institut…

The End!

Page 25: A Powerful Principle for Automatically Finding Concepts in Unstructured Data Holger Bast Max-Planck-Institut…

Specific Methods

Latent semantic indexing (LSI) [Dumais et al. ’89]
– orthogonal concepts c1,…,ck
– span of c1,…,ck is that k-dimensional subspace which minimizes the squared distances
Probabilistic Lat. Sem. Ind. (PLSI) [Hofmann ’99]
– find stochastic matrix of rank k that maximizes the probability that given matrix is an instance
Concept Indexing (CI) [Karypis & Han ’00]
– c1,…,ck = centroid vectors of a k-clustering
– documents = projections onto these centroids
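For the LSI item above, the k-dimensional subspace that minimizes the squared distances is the span of the top k left singular vectors of the term-document matrix. The sketch below approximates that subspace with plain orthogonal (subspace) iteration so that it needs only the standard library. This is a swapped-in technique for illustration, not the method used in the talk; in practice an SVD routine from a numerical library would be used. All names, the iteration count, and the fixed seed are assumptions.

```python
import math
import random

def transpose(m):
    return [list(col) for col in zip(*m)]

def matvec(m, v):
    return [sum(a * x for a, x in zip(row, v)) for row in m]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormalize(vectors):
    """Modified Gram-Schmidt; assumes the vectors are linearly independent."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            coeff = dot(w, q)
            w = [a - coeff * b for a, b in zip(w, q)]
        norm = math.sqrt(dot(w, w))
        basis.append([a / norm for a in w])
    return basis

def lsi_concepts(matrix, k, iterations=200, seed=0):
    """Approximate k orthonormal concepts (top left singular vectors).

    Assumes k does not exceed the rank of the matrix.
    """
    rng = random.Random(seed)
    mt = transpose(matrix)
    basis = orthonormalize([[rng.gauss(0, 1) for _ in matrix] for _ in range(k)])
    for _ in range(iterations):
        # One step: multiply each vector by M·M^T, then re-orthonormalize.
        basis = orthonormalize([matvec(matrix, matvec(mt, q)) for q in basis])
    return basis

def project(matrix, basis):
    """Project every document (column) onto the span of the concepts."""
    approx_cols = []
    for col in transpose(matrix):
        coeffs = [dot(col, q) for q in basis]
        approx_cols.append([sum(c * q[i] for c, q in zip(coeffs, basis))
                            for i in range(len(col))])
    return transpose(approx_cols)

# Term-document matrix from the "Idea of Dimension Reduction" slide.
D = [[0, 2, 0, 1, 0, 0],
     [2, 1, 0, 0, 0, 0],
     [1, 1, 0, 1, 1, 1],
     [0, 0, 1, 1, 1, 1],
     [0, 0, 2, 2, 2, 1]]
concepts = lsi_concepts(D, k=2)
for row in project(D, concepts):
    print([round(x, 2) for x in row])
```

The printed rows are the rank-2 approximation of D. They will not coincide with the hand-made WWW/Hawaii decomposition from the earlier slide, because LSI concepts are forced to be orthogonal.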

Page 26: A Powerful Principle for Automatically Finding Concepts in Unstructured Data Holger Bast Max-Planck-Institut…

Dimension Reduction Methods

Main idea: the high-dimensional space of objects is a variant of an underlying low dimensional space
Formally: given an m×n matrix, possibly full rank, find best low-rank approximation

[Figure: term-document matrix over the terms car, automobile, search, engine, web and seven documents; two versions of the entries are overlaid in the extraction and cannot be separated]

Page 27: A Powerful Principle for Automatically Finding Concepts in Unstructured Data Holger Bast Max-Planck-Institut…

I will talk about …

Dimension reduction techniques
– some methods
– a new theorem
Exploiting link structure
– state of the art
– some new ideas
Perspective

Page 28: A Powerful Principle for Automatically Finding Concepts in Unstructured Data Holger Bast Max-Planck-Institut…

Overview

Exploiting the link structure
– Google, HITS, SmartyPants
– Trawling
Semantic Web
– XML, XML-Schema
– RDF, DAML+OIL
Interactive browsing
– Scatter/Gather
– Phrase Browsing

Page 29: A Powerful Principle for Automatically Finding Concepts in Unstructured Data Holger Bast Max-Planck-Institut…

Scatter/Gather

Cutting, Karger, Pedersen, Tukey, SIGIR’92
Motivation: Zooming into a large document collection
Realisation: geometric clustering
Challenge: extremely fast algorithms required, in particular
– linear-time preprocessing
– constant-time query processing
Example: New York Times News Service, articles from August 1990 (~5000 articles, 30MB text)

Page 30: A Powerful Principle for Automatically Finding Concepts in Unstructured Data Holger Bast Max-Planck-Institut…

Scatter/Gather – Example

taken from Cutting, Karger, Pedersen, Tukey, Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections, © 1992 ACM SIGIR

Page 31: A Powerful Principle for Automatically Finding Concepts in Unstructured Data Holger Bast Max-Planck-Institut…

Scatter/Gather – Example

taken from Cutting, Karger, Pedersen, Tukey, Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections, © 1992 ACM SIGIR

Page 32: A Powerful Principle for Automatically Finding Concepts in Unstructured Data Holger Bast Max-Planck-Institut…

Phrase Browsing

Nevill-Manning, Witten, Moffat, 1997
Formulating a good query requires more or less knowledge of the document collection
– if less, fine
– if more, interaction is a must
Build hierarchy of phrases
Example: http://www.nzdl.org/cgi-bin/library
Challenge: fast algorithms for finding minimal grammar, e.g. for S babaabaabaa

Page 33: A Powerful Principle for Automatically Finding Concepts in Unstructured Data Holger Bast Max-Planck-Institut…

Teoma

More refined concept of authoritativeness, depending on the specific query (“subject-specific popularity”)
More sophisticated query refinement
But: Coverage is only 10% of that of Google
Example: http://www.teoma.com