A Powerful Principle for Automatically Finding Concepts in Unstructured Data – Holger Bast



A Powerful Principle for Automatically Finding Concepts in Unstructured Data

Holger Bast
Max-Planck-Institut für Informatik (MPII)
Saarbrücken, Germany

Self-Star Workshop, Bertinoro, 2nd June 2004

Dimension Reduction: A Powerful Principle for Intelligent Search in Very Large Text Collections

One Type of Query: Searching for the exact terms

Query: "self star"

[Screenshot of Google results, 26 May 2004:]
– SELF-STAR: Self-* Properties in Complex Information Systems – SELF-STAR: International Workshop on Self-* Properties in Complex Information Systems, 31 May - 2 June 2004, University of Bologna ... www.cs.unibo.it/self-star/
– Self-Star Registration – SELF-STAR: International Workshop on Self-* Properties in Complex Information Systems ... registration. Back to the Self-Star Home page. www.cs.unibo.it/self-star/register.html
– CEO Forum Home page – The Teacher Preparation STaR Chart, a self-assessment tool for colleges of education is available! Or, use the interactive Teacher ... www.ceoforum.org/

This is easily automated (full text index)
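Why is this easy to automate? A minimal sketch of a full-text (inverted) index, my own illustration and not any particular engine's code:

```python
from collections import defaultdict

# Toy documents, loosely based on the search results above.
docs = {
    0: "SELF-STAR workshop on self-* properties in complex information systems",
    1: "Self-Star registration page",
    2: "The Teacher Preparation STaR Chart, a self-assessment tool",
}

# Full-text index: term -> set of documents containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def search(query: str) -> list:
    """Return the ids of all documents containing every query term."""
    result = set(docs)
    for term in query.lower().split():
        result &= index.get(term, set())
    return sorted(result)

print(search("self-star"))  # -> [0, 1]: exact-term matches only
```

The point of the slide: this only finds documents containing the exact query terms, which is precisely what the next slide shows failing.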

Another Type of Query: Searching for what is behind words

Query: "join two text boxes in power point"

[Screenshot of search results:]
– Church Media Community - Power point 2003? - Any point in jumping ... Two interesting facts about the new PowerPoint Viewer is that it ... so you can't directly place text over the ... www.churchmedia.net/community/forum/showthread.php
– Flash < Internet < tutorialfind ... Here is the actionscript in two easy steps ... Then join Mike as he walks you through tweening, morphing, and more ... Simple Use of Text Boxes. Use Text boxes to gather ... www.tutorialfind.com/tutorials/internet/flash/
– [DOC] Introduction ... set up reminders and click a button to join the broadcast ... There are two ways to change colors: use a preset color ... and then make any changes to the text like you ... www.microsoft.com/education/DOWNLOADS/tutorials/classroom/office2k/ppt2000.doc

How to improve on this in a self-star fashion?

[Diagram: a query expressed in terms and a document expressed in terms are both mapped to concepts (query expressed in concepts, document expressed in concepts); a concept is itself expressed in terms]

The Idea of Dimension Reduction

internet   0 2 0 1 0 0
web        2 1 0 0 0 0
surfing    1 1 0 1 1 1
beach      0 0 1 1 1 1
hawaii     0 0 2 2 2 1

The approximation actually adds to the precision

Example document (one column of the matrix above):

"Hawaii, 2nd June 2004. Dear Pen Pal, I am writing to you from Hawaii. They have got internet access right on the beach here, isn't that great? I'll go surfing now! Your friend, CB"

The approximation is obtained by matrix multiplication of a term-concept matrix with a concept-document matrix (two concepts: WWW and Hawaii):

term-concept matrix:

internet   2 0
web        2 0
surfing    1 1
beach      0 1
hawaii     0 2

concept-document matrix:

WWW      1 1 0 .5 0 0
Hawaii   0 0 1 .5 1 1

resulting rank-2 approximation:

internet   2 2 0 1  0 0
web        2 2 0 1  0 0
surfing    1 1 1 1  1 1
beach      0 0 1 .5 1 1
hawaii     0 0 2 1  2 2

Query "internet", as a term vector (1 0 0 0 0): in the original matrix, a web document that happens not to contain the word "internet" and a Hawaii document are equally dissimilar to the query; in concept space the query becomes (1 0) and matches the WWW documents.

Finding concepts = approximate low-rank matrix decomposition
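The factorization above was constructed by hand; a standard way to compute a best low-rank approximation is the truncated singular value decomposition, which is exactly the LSI decomposition. A minimal sketch of my own, using numpy:

```python
import numpy as np

# The term-document matrix from the example above
# (rows: internet, web, surfing, beach, hawaii).
D = np.array([
    [0, 2, 0, 1, 0, 0],
    [2, 1, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
    [0, 0, 2, 2, 2, 1],
], dtype=float)

# Truncated SVD: keep only the k = 2 largest singular values.
U, s, Vt = np.linalg.svd(D, full_matrices=False)
k = 2
D_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(D_k, 1))  # best rank-2 approximation (Frobenius norm)
```

Here U[:, :k] plays the role of the term-concept matrix and np.diag(s[:k]) @ Vt[:k, :] that of the concept-document matrix; by the Eckart-Young theorem no rank-2 matrix is closer to D in the Frobenius norm.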

A Concrete Example: 676 abstracts from the Max-Planck-Institute

– for example: "We present two theoretically interesting and empirically successful techniques for improving the linear programming approaches, namely graph transformation and local cuts, in the context of the Steiner problem. We show the impact of these techniques on the solution of the largest benchmark instances ever solved."
– 3283 words (words like and, or, this, … removed)
– abstracts come from 5 working groups: Algorithms, Logic, Graphics, CompBio, Databases
– reduce to 10 concepts

No dictionary, no training, only the plain text itself!
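A compact sketch of this kind of experiment (my own illustration; the corpus list is hypothetical and the scikit-learn tooling is my choice, not the talk's implementation):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

# Hypothetical corpus: one abstract per entry.
abstracts = [
    "We present two theoretically interesting techniques ...",
    # ... 675 more abstracts
]

# Build the term-document matrix from the plain text alone,
# dropping words like "and", "or", "this".
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(abstracts)      # documents x terms

# Reduce to 10 concepts via truncated SVD (the LSI step).
svd = TruncatedSVD(n_components=10)
doc_concepts = svd.fit_transform(X)          # documents x 10

# Inspect each concept by its highest-weight terms.
terms = np.array(vectorizer.get_feature_names_out())
for i, axis in enumerate(svd.components_):
    print(f"concept {i}:", terms[np.argsort(axis)[::-1][:5]])
```

With a corpus split across working groups as above, one would hope the top terms per concept line up with the groups' topics, and no dictionary or training data is needed.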


Once More: the Self-Star Issues

Is there a valid scientific basis for self-star computing?
– Apparently yes!

What are some real problems that have been solved?
– Large-scale concept-based search!

Are there any negative results? What are the limits?
– A little human guidance is necessary, but feasible!

What is there left to do?
– The amount of hand-tuning required is still significant.
– A better understanding of why it works will help!

Thank you!

Why is Large-Scale Search Self-Star?

I was asked to submit a paper.

The first search "engine", Yahoo, was a hand-made directory.

The index-building of state-of-the-art engines, like Google, is a complicated yet highly automated and self-organising process:
– which pages to crawl?
– which terms to index?
– …

Conclusions

Extracting sensible concepts by dimension reduction works surprisingly well in practice
– But it is not really understood why
– Lots of theoretical open problems!

For real applications, some amount of external knowledge has to be input
– But how to integrate that?
– Very practical open problem!

These two are connected!!

Overview

– A major problem in text search
– A way to deal with it
– A demonstration that this works well
– Relations to self-star
– Open questions

For this talk I will focus on text search, on plain text, with the two example queries "self star" and "join two text boxes in power point".

DELIS WP 3.1 Relevance

The DELIS example (105 documents, 1182 words) took 4 minutes to compute
– on my notebook (Intel PM, 1.6 GHz)
– with a careful implementation

For 10,000,000 documents and 1,000,000 words, this extrapolates to 300 years.

Very large & nonlinear optimization problem, but no need to solve exactly.

Find a simple approximation algorithm that provably performs well and scales.
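A back-of-the-envelope check of that extrapolation (my own arithmetic, under the crude assumption that running time grows with the number of matrix entries, documents × words):

```python
# Rough extrapolation: assume running time scales with the
# number of term-document matrix entries (documents * words).
base_minutes = 4.0
base_entries = 105 * 1182                   # DELIS example
target_entries = 10_000_000 * 1_000_000     # web-scale collection

minutes = base_minutes * target_entries / base_entries
years = minutes / (60 * 24 * 365)
print(round(years))  # ~600 years under this cost model
```

This lands at roughly 600 years, the same order of magnitude as the 300 years quoted on the slide; the exact figure depends on the cost model assumed.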

Comparing Methods

Fundamental question: which method is how good under which circumstances?

Few theoretically founded answers to this question
– seminal paper: A Probabilistic Analysis of Latent Semantic Indexing, Papadimitriou, Raghavan, Tamaki, Vempala, PODS'98 (ten years after LSI was born!)
– follow-up paper: Spectral Analysis of Data, Azar, Fiat, Karlin, McSherry, Saia, STOC'01
– main statement: LSI is robust against addition of (how much?) noise

Why does LSI work so well?

A good method should produce
– small angles between documents on similar topics
– large angles between documents on different topics

A formula for angles in the reduced space:
– Let D = C·G, and let c1', …, ck' be the images of the concepts under LSI
– Then the k×k dot products ci'·cj' are given by the matrix (G·G^T)^-1
– That is, pairwise angles are ≥ 90 degrees if and only if (G·G^T)^-1 has nonpositive off-diagonal entries (an M-matrix)
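A small numerical illustration of that formula (my own sketch; as G I reuse the concept-document matrix of the Hawaii example):

```python
import numpy as np

# G = concept-document matrix from the Hawaii example above.
G = np.array([
    [1, 1, 0, 0.5, 0, 0],   # concept "WWW"
    [0, 0, 1, 0.5, 1, 1],   # concept "Hawaii"
], dtype=float)

# Dot products of the concept images under LSI: (G G^T)^-1.
M = np.linalg.inv(G @ G.T)
print(np.round(M, 3))

# Pairwise angles >= 90 degrees <=> off-diagonal entries nonpositive.
off_diagonal = M[~np.eye(len(M), dtype=bool)]
print("all angles >= 90 degrees:", bool(np.all(off_diagonal <= 0)))
```

For this G the off-diagonal entries come out negative, so the two concept images form an angle of at least 90 degrees in the reduced space.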

Polysemy and Simonymy

Let Tij be the dot product of the i-th with the j-th row of a term-document matrix (~ co-occurrence of terms i and j)
– Call term k a polysem if there exist terms i and j such that for some t, Tik, Tjk ≥ t but Tij < t
– Two terms i and j are simonyms if Tij ≥ Tii or Tij ≥ Tjj

Without polysems and simonyms we have
1. Tij ≥ min(Tik, Tjk) for all i, j, k
2. Tii > Tij for all j ≠ i

A symmetric matrix (Tij) with properties 1. and 2. is called strictly ultrametric

Help from Linear Algebra

Theorem [Martínez, Michon, San Martín 1994]: The inverse of a strictly ultrametric matrix is an M-matrix, i.e. its diagonal entries are positive and its off-diagonal entries are nonpositive.

Example: the matrix

  3.2  2.1  1.5
  2.1  4.6  1.7
  1.5  1.7  5.3

has the inverse (an M-matrix)

   0.47  -0.19  -0.07
  -0.19   0.32  -0.05
  -0.07  -0.05   0.23
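This example can be checked numerically; a minimal sketch of my own, using numpy:

```python
import numpy as np

# The example matrix from the slide.
T = np.array([
    [3.2, 2.1, 1.5],
    [2.1, 4.6, 1.7],
    [1.5, 1.7, 5.3],
])

# Property 2 from the previous slide: T_ii > T_ij for all j != i.
n = len(T)
assert all(T[i, i] > T[i, j] for i in range(n) for j in range(n) if i != j)

# The theorem's conclusion: the inverse is an M-matrix.
T_inv = np.linalg.inv(T)
print(np.round(T_inv, 2))                          # matches the slide
assert np.all(np.diag(T_inv) > 0)                  # positive diagonal
assert np.all(T_inv[~np.eye(n, dtype=bool)] <= 0)  # nonpositive off-diagonal
```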

A new LSI theorem

Theorem: If D can be well approximated by a set of concepts free from polysemy and simonymy, then in the reduced LSI-space these concepts form large pairwise angles.

Beware: This only holds for the original LSI, not for its widely used variant!

Question: How can we check whether such a set exists? This would yield a method for selecting the optimal (reduced) dimension!

Exploiting Link Structure

Achlioptas, Fiat, Karlin, McSherry (FOCS'01):
– documents have a topic (implicit in the distribution of terms)
– and a quality (implicit in the link structure)
– represent each document by a vector:
  direction corresponds to the topic
  length corresponds to the quality
– Goal: for a given query, rank documents by their dot product with the topic of the query

Model details

Underlying parameters
– A = [A1 … An] authority topics, one per doc.
– H = [H1 … Hn] hub topics, one per doc.
– C = [C1 … Ck] translates topics to terms
– q = [q1 … qk] query topic

The input we see
– D ~ A·C + H·C   (term-document matrix)
– L ~ H^T·A   (link matrix)
– Q ~ q·C   (query terms)

Goal: recover the ordering of A1·q, …, An·q
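A toy simulation of this generative model (entirely my own construction; dimensions, distributions, and noise level are made up) to make the inputs and the goal concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
k, n, m = 3, 50, 200   # topics, documents, terms (made-up sizes)

# Underlying parameters of the model.
A = rng.random((k, n))   # authority topics, one column per document
H = rng.random((k, n))   # hub topics, one column per document
C = rng.random((k, m))   # translates topics to terms
q = rng.random(k)        # query topic

# The input we see (here generated exactly, plus a little noise).
D = A.T @ C + H.T @ C + 0.01 * rng.standard_normal((n, m))  # term-document
L = H.T @ A                                                 # link matrix
Q = q @ C                                                   # query terms

# The goal is to recover, from D, L and Q alone, the ordering of
# the hidden relevance scores A_1·q, ..., A_n·q:
true_scores = A.T @ q
print("top documents:", np.argsort(true_scores)[::-1][:5])
```

The recovery algorithm itself is the contribution of the paper; the snippet only shows what is observed and what is hidden.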

Model - Problems

Link matrix generation L ~ H^T·A
– is OK, because the presence of a link is related to the hub/authority value

Term-document matrix generation D ~ A·C + H·C
– very unrealistic: the term distribution gives information on the topic, but not on the quality!
– more realistic: D ~ A0·C + H0·C, where A0 and H0 contain the normed columns of A and H

So far, we could solve the special case where A differs from H by only a diagonal matrix (i.e. hub topic = authority topic)

Perspective

Strong theoretical foundations
– unifying framework + comparative analysis for a large variety of dimension reduction methods
– realistic models + performance guarantees

Make proper use of human intelligence
– integrate explicit knowledge
– but only as much as required (automatic detection)
– combine dimension reduction methods with interactive schemes (e.g. phrase browsing)

The End!

Specific Methods

Latent semantic indexing (LSI) [Dumais et al. '89]
– orthogonal concepts c1, …, ck
– the span of c1, …, ck is the k-dimensional subspace which minimizes the squared distances

Probabilistic Latent Semantic Indexing (PLSI) [Hofmann '99]
– find the stochastic matrix of rank k that maximizes the probability that the given matrix is an instance

Concept Indexing (CI) [Karypis & Han '00]
– c1, …, ck = centroid vectors of a k-clustering
– documents = projections onto these centroids
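A minimal sketch of the CI recipe above (my own illustration with scikit-learn's k-means and a random toy matrix; not the authors' implementation):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((100, 50))   # 100 documents x 50 terms (toy data)

# Concepts c_1, ..., c_k = centroid vectors of a k-clustering.
k = 10
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
centroids = kmeans.cluster_centers_          # k x 50

# Documents = projections onto these centroids: least-squares
# coordinates of every document in the centroid basis.
coords, *_ = np.linalg.lstsq(centroids.T, X.T, rcond=None)
X_reduced = coords.T                         # 100 documents x k concepts
print(X_reduced.shape)
```

Unlike LSI's orthogonal concepts, the centroids are generally not orthogonal, which is why the projection is computed by least squares.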

Dimension Reduction Methods

Main idea: the high-dimensional space of objects is a variant of an underlying low-dimensional space

Formally: given an m×n matrix, possibly full rank, find the best low-rank approximation

[Figure: a term-document matrix for the terms car, automobile, search, engine, web over seven documents, shown overlaid with its low-rank approximation]

I will talk about …

Dimension reduction techniques
– some methods
– a new theorem

Exploiting link structure
– state of the art
– some new ideas

Perspective

Overview

Exploiting the link structure
– Google, HITS, SmartyPants
– Trawling

Semantic Web
– XML, XML-Schema
– RDF, DAML+OIL

Interactive browsing
– Scatter/Gather
– Phrase Browsing

Scatter/Gather

Cutting, Karger, Pedersen, Tukey, SIGIR'92

Motivation: zooming into a large document collection
Realisation: geometric clustering
Challenge: extremely fast algorithms required, in particular
– linear-time preprocessing
– constant-time query processing

Example: New York Times News Service, articles from August 1990 (~5000 articles, 30 MB text)

Scatter/Gather – Example

[Figures taken from Cutting, Karger, Pedersen, Tukey, Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections, © 1992 ACM SIGIR]

Phrase Browsing

Nevill-Manning, Witten, Moffat, 1997

Formulating a good query requires more or less knowledge of the document collection
– if less, fine
– if more, interaction is a must

Build a hierarchy of phrases
Example: http://www.nzdl.org/cgi-bin/library
Challenge: fast algorithms for finding a minimal grammar, e.g. for S → babaabaabaa
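To make the minimal-grammar challenge concrete, here is a toy Re-Pair-style digram-replacement compressor (my own sketch; it is not the authors' algorithm, and greedy replacement does not guarantee a minimal grammar):

```python
from collections import Counter

def greedy_grammar(s: str):
    """Repeatedly replace the most frequent pair of adjacent symbols
    by a fresh nonterminal (Re-Pair-style digram replacement)."""
    seq, rules = list(s), {}
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break
        nt = f"R{len(rules)}"
        rules[nt] = list(pair)
        out, i = [], 0
        while i < len(seq):          # replace non-overlapping occurrences
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(nt)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules

start, rules = greedy_grammar("babaabaabaa")
print("S ->", " ".join(start))
for nt, rhs in rules.items():
    print(nt, "->", " ".join(rhs))
```

On "babaabaabaa" this yields S -> R0 R2 R1 with R0 -> b a, R1 -> R0 a, R2 -> R1 R1; finding a provably minimal grammar fast is the hard part.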

Teoma

More refined concept of authoritativeness, depending on the specific query ("subject-specific popularity")

More sophisticated query refinement

But: coverage is only 10% of that of Google

Example: http://www.teoma.com