(C) 2000, The University of Michigan

Language and Information

Handout #5

November 30, 2000


Course Information

• Instructor: Dragomir R. Radev ([email protected])

• Office: 305A, West Hall

• Phone: (734) 615-5225

• Office hours: TTh 3-4

• Course page: http://www.si.umich.edu/~radev/760

• Class meets on Thursdays, 5-8 PM in 311 West Hall


Clustering (Cont’d)


Using similarity in visualization

• Dendrograms (see Figure 14.1 of M&S, page 496)


Types of clustering

• Hierarchical: agglomerative, divisive
• Soft & hard
• Similarity functions (see the sketch below):
  – Single link: most similar members
  – Complete link: least similar members
  – Group-average: average similarity
• Applications:
  – improving language models
  – etc.
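The three similarity (linkage) functions can be tried directly with scipy's agglomerative clusterer. A minimal sketch, with invented toy document vectors; this is an illustration of the linkage criteria, not code from the course:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# Hypothetical term-frequency vectors for five tiny "documents".
docs = np.array([
    [2, 0, 1, 0],
    [1, 0, 2, 0],
    [0, 3, 0, 1],
    [0, 2, 0, 2],
    [1, 1, 1, 1],
], dtype=float)

for method in ("single", "complete", "average"):
    # single: merge clusters by their most similar members
    # complete: by their least similar members
    # average: by the average pairwise similarity (group-average)
    Z = linkage(docs, method=method, metric="cosine")
    print(method, fcluster(Z, t=2, criterion="maxclust"))

# dendrogram(Z) draws the merge tree -- the visualization from the
# previous slide (cf. Figure 14.1 of M&S).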


HITS-type algorithms


Hyperlinks and resource communities

• Jon Kleinberg (Almaden, Cornell)

• authoritative sources

• www.harvard.edu --> Harvard

• conferring authority via links

• global properties of authority pages
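The "conferring authority via links" idea is the core of Kleinberg's HITS iteration: hub and authority scores reinforce each other. A short sketch on an invented toy link graph (not Kleinberg's code, just the standard update):

import numpy as np

# A[i, j] = 1 iff page i links to page j (hypothetical graph).
A = np.array([
    [0, 1, 1, 0],   # page 0 links to 1 and 2 (a hub)
    [0, 0, 1, 0],
    [0, 0, 0, 0],   # page 2 has no outlinks (a pure authority)
    [0, 1, 1, 0],   # page 3 links to 1 and 2 (another hub)
], dtype=float)

h = np.ones(A.shape[0])   # hub scores
a = np.ones(A.shape[0])   # authority scores
for _ in range(50):
    a = A.T @ h           # good authorities are pointed to by good hubs
    h = A @ a             # good hubs point to good authorities
    a /= np.linalg.norm(a)
    h /= np.linalg.norm(h)

print("authorities:", a.round(3))
print("hubs:       ", h.round(3))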


Hubs and authorities

[Figure: a link graph with hubs pointing to authorities; unrelated pages shown separately]


Authorities

• Java: www.gamelan.com, java.sun.com, sunsite.unc.edu/javafaq/javafaq.html

• Censorship: www.eff.org, www.eff.org/blueribbon.html, www.aclu.org

• Search engines: www.yahoo.com, www.excite.com, www.mckinley.com, www.lycos.com


Related pages

• www.honda.com: www.ford.com, www.toyota.com, www.yahoo.com

• www.nyse.com: www.amex.com, www.liffe.com, update.wsj.com


Collocations


Collocations

• Idioms

• Free word combinations

• “You shall know a word by the company it keeps” (Firth)

• Common use

• No general syntactic or semantic rules

• Important for non-native speakers


Examples

Idioms: to kick the bucket, dead end, to catch up

Collocations: to trade actively, table of contents, orthogonal projection

Free word combinations: to take the bus, the end of the road, to buy a house


Uses

• Disambiguation (e.g., “bank” with “loan” vs. “river”)

• Translation

• Generation


Properties

• Arbitrariness

• Language- and dialect-specific

• Common in technical language

• Recurrent in context

• (see Smadja 93)


Arbitrariness

• Make an effort vs. *make an exertion

• Running commentary vs. *running discussion

• Commit treason vs. *commit treachery


Cross-lingual properties

• Régler la circulation = direct traffic

• Russian, German, Serbo-Croatian: direct translation is used

• AE (American English): set the table, make a decision

• BE (British English): lay the table, take a decision

• “semer le désarroi” (literally “to sow disarray”) = “to wreak havoc”


Types of collocations

• Grammatical: come to, put on; afraid that, fond of, by accident, witness to

• Semantic (only certain synonyms)

• Flexible: find/discover/notice by chance


Base/collocator pairs

Base        Collocator    Example
Noun        verb          set the table
Noun        adjective     warm greetings
Verb        adverb        struggle desperately
Adjective   adverb        sound asleep
Verb        preposition   put on


Extracting collocations

• Mutual information (worked example below):

  I(x;y) = log2 [ P(x,y) / (P(x) P(y)) ]

• What if I(x;y) = 0?
• What if I(x;y) < 0?
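A worked sketch of the formula on a made-up toy corpus, scoring adjacent word pairs; the corpus and the bigram-only window are assumptions for illustration:

import math
from collections import Counter

tokens = "to trade actively on the market is to trade often".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
N = len(tokens)

def mi(x, y):
    # I(x;y) = log2 P(x,y) / (P(x) P(y))
    p_xy = bigrams[(x, y)] / (N - 1)
    p_x, p_y = unigrams[x] / N, unigrams[y] / N
    return math.log2(p_xy / (p_x * p_y))

print(mi("to", "trade"))   # > 0: the pair co-occurs more than chance predicts
# I(x;y) = 0 would mean x and y are independent; I(x;y) < 0 would mean
# they co-occur less often than chance (they repel each other).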


Yule’s coefficient

A - frequency of lemma pairs involving both Li and Lj

B - frequency of pairs involving Li only

C - frequency of pairs involving Lj only

D - frequency of pairs involving neither

YUL = (AD - BC) / (AD + BC),   so   -1 <= YUL <= 1
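A direct transcription of the coefficient; the two sets of counts are invented to show the extremes of the range:

def yule(a, b, c, d):
    # a: pairs with both Li and Lj; b: Li only; c: Lj only; d: neither
    return (a * d - b * c) / (a * d + b * c)

print(yule(30, 10, 15, 200))   # ~0.95: a strongly associated lemma pair
print(yule(1, 50, 60, 2))      # ~-1: the two lemmas avoid each other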


Specific mutual information

• Used in extracting bilingual collocations

  I(e,f) = p(e,f) / (p(e) p(f))

• p(e,f) - probability of finding both e and f in aligned sentences

• p(e), p(f) - probabilities of finding the word in one of the languages
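A sketch of the score computed over aligned sentence pairs; the three “aligned” English–French pairs are invented for illustration:

# I(e,f) = p(e,f) / (p(e) p(f)), probabilities over aligned sentence pairs
aligned = [
    ("the prime minister spoke", "le premier ministre a parlé"),
    ("the minister resigned", "le ministre a démissionné"),
    ("spring came early", "le printemps est arrivé tôt"),
]
N = len(aligned)

def I(e, f):
    p_e = sum(e in en.split() for en, _ in aligned) / N
    p_f = sum(f in fr.split() for _, fr in aligned) / N
    p_ef = sum(e in en.split() and f in fr.split() for en, fr in aligned) / N
    return p_ef / (p_e * p_f)

print(I("minister", "ministre"))   # 1.5: strong cross-lingual association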


Example from the Hansard corpus (Brown, Lai, and Mercer)

French word Mutual information

sein 5.63

bureau 5.63

trudeau 5.34

premier 5.25

résidence 5.12

intention 4.57

no 4.53

session 4.34


Flexible and rigid collocations

• Example (from Smadja): “free” and “trade”

        Total   p-5  p-4  p-3  p-2  p-1    p+1  p+2  p+3  p+4  p+5
free    8031      7    6   13    5  7918     0   12   20   26   24

The distribution is sharply peaked at p-1 (7918 of 8031 occurrences): “free” almost always appears immediately before “trade”, marking “free trade” as a rigid collocation.
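A rough sketch of how such positional counts are gathered: tally how often one word occurs at each offset p-5..p+5 around the other. The toy sentence is invented; this illustrates the counting scheme, not Xtract itself:

from collections import Counter

tokens = ("the free trade agreement promotes free trade "
          "across free markets and free trade zones").split()
offsets = Counter()
for i, t in enumerate(tokens):
    if t == "trade":
        for p in range(-5, 6):
            j = i + p
            if p != 0 and 0 <= j < len(tokens) and tokens[j] == "free":
                offsets[p] += 1

print(sorted(offsets.items()))
# A peak at p = -1 signals a rigid collocation ("free trade");
# a flat distribution across offsets signals a flexible one.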


Xtract (Smadja)

• The Dow Jones Industrial Average

• The NYSE’s composite index of all its listed common stocks fell *NUMBER* to *NUMBER*


Translating Collocations

• brush up a lesson = repasser une leçon

• bring about = осуществлять (Russian, “to carry out”)

• Hansards: late spring = fin du printemps; Atlantic Canada Opportunities Agency = Agence de promotion économique du Canada atlantique


The eSseNSe system


[Architecture diagram: offline processing groups cached documents into cluster 1 … cluster n and condenses each into summary 1 … summary n; online processing takes a user query, produces a hitlist, and returns a summary.]


Sample summary

Score    Sentence

500.80   Data mining on the other hand through the use of specific algorithms or search engines attempts to source out discernable patterns and trends in the data and infers rules from these patterns

520.53   Data mining versus traditional database queries Traditional database queries contrasts with data mining since these are typified by the simple question such as what were the sales of orange juice in January 1995 for the Boston area

486.92   Intense competition in an increasing saturated marketplace the ability to custom manufacture market and advertise to small market segments and individuals 4 and the market for data mining products is estimated at about 500 million in early 1994 12 Data mining technologies are characterized by intensive computations on large volumes of data

576.60   These are : -the untapped value in large databases consolidation of database records tending towards a single customer view concept of an information or data warehouse from the consolidation of databases dramatic drop in the cost/performance ratio of hardware systems - for data storage and processing

487.92   This term data mining has been used by statisticians data analyst and the MIS management information systems community whereas KDD has been mostly used by artificial intelligence and machine learning researchers

509.11   The term data mining is then this high-level application techniques / tools used to present and analyze data for decision makers

494.92   The idea behind data mining then is the non-trivial process of identifying valid novel potentially useful and ultimately understandable patterns in data 18 2 The term knowledge discovery in databases KDD was formalized in 1989 in reference to the general concept of being broad and 'high level' in the pursuit of seeking knowledge from data


ClusterID   Number of URLs   Centroid words

00085        63   information retrieval systems university workshop papers research library science submission chair text edu conference computer applications language processing data libraries

00044       167   web internet site design hosting search online sites commerce page business meta your information content server marketing you electronic pages

00086       115   retrieval information university ir text research systems science semantic evaluation document pp library

00657       135   vol pp no retrieval proceedings conference user ir information query

00766         6   user interface users ariadne hypertext search computer interfaces we designer system laurel information nelson knowledge collaboration representation interaction process systems

00127         4   neural systems computational data evolutionary intelligent networks learning artificial knowledge intelligence
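A generic sketch of how per-cluster “centroid words” like those above can be derived: average the TF*IDF weights over the cluster’s documents and keep the top terms. This is a reconstruction under standard assumptions, not eSseNSe’s actual code, and the documents are invented:

import math
from collections import Counter

clusters = {
    "ir":  ["information retrieval systems research".split(),
            "retrieval systems for library science".split()],
    "web": ["web site design hosting".split(),
            "internet site search marketing".split()],
}
all_docs = [d for docs in clusters.values() for d in docs]
N = len(all_docs)
df = Counter(w for d in all_docs for w in set(d))   # document frequencies

def centroid_words(docs, k=5):
    totals = Counter()
    for d in docs:
        for w, c in Counter(d).items():
            totals[w] += c * math.log((N + 1) / df[w])   # TF * IDF
    return [w for w, _ in totals.most_common(k)]

for cid, docs in clusters.items():
    print(cid, centroid_words(docs))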



Keywords: mining (84.54), data (64.13), knowledge (14.25), discovery (11.98), advertised (11.20), databases (9.69), information (6.98), research (6.96), analysis (6.95), text (6.05), patterns (5.30), algorithms (4.39)

Score    Sentence

183.73   Text Mining is a new and exciting research area that tries to solve the information overload problem by using techniques from data mining machine learning information retrieval natural-language understanding case-based reasoning statistics and knowledge management to help people gain insight into large quantities of semi-structured or unstructured text

156.56   Current Projects - Aims The Data Mining Program has two projects: DMITL - Data Mining in the Large Conducting practical case studies for clients involving the analysis of large and complex data sets ParAlg - Parallel Algorithms The main computing facility for these projects consists of a secure multiprocessor Sun E4000

163.78   The DMITL project aims to develop knowledge and techniques relevant to the data mining of large and complex datasets using high performance computers

188.96   Megaputer develops software and solutions for data mining text analysis and knowledge discovery in databases


Cross-language information access


[Diagram sequence: the cross-language information access pipeline, built up across several slides. Monolingual English side: query QE against collection CE retrieves documents DE, which are summarized as SE. The Chinese side is then added: the English query is translated into a Chinese query QC and run against the Chinese collection CC, retrieving documents DC with summaries SC; finally, the Chinese summaries are translated back into English (SC->E).]


Objectives

• Produce summaries using multiple algorithms
• Evaluate summarization and translation separately
• Develop intrinsic and extrinsic language-independent evaluation metrics
• Establish correlation between evaluation metrics
• Build a parallel Chinese-English document+summary corpus: 9K docs (Hong Kong News)


Participants

• Full time
  – K.-L. Kwok, Queens College
  – Dragomir Radev, U. Michigan
  – Wai Lam, Chinese University of HK
  – Simone Teufel, Columbia

• “Consultants”
  – Chin-Yew Lin, ISI
  – Tomek Strzalkowski, Albany
  – Jade Goldstein, CMU
  – Jian-Yun Nie, U. Montréal

• Supporters
  – TIDES roadmap group: Ed Hovy, Daniel Marcu, Kathy McKeown


Techniques and parameters

• Summarization: position, TF*IDF, centroids, largest common subsequence, keywords (see the sketch below)

• Evaluation:
  – intrinsic: percent agreement, relative utility, precision/recall
  – extrinsic: document rank, question answering

• Length of documents/summaries
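A toy sketch of two of the listed sentence features, position and TF*IDF. The weighting scheme and the lead-sentence boost are assumptions for illustration, not the workshop’s actual scoring formula:

import math
from collections import Counter

document = [
    "data mining finds patterns in large databases".split(),
    "it is related to knowledge discovery in databases".split(),
    "this handout also discusses collocations".split(),
]
df = Counter(w for s in document for w in set(s))
N = len(document)

def score(sentence, position):
    # TF*IDF feature: sum of word weights in the sentence
    tfidf = sum(c * math.log((N + 1) / df[w])
                for w, c in Counter(sentence).items())
    # position feature: hypothetical boost for the leading sentence
    return tfidf * (1.5 if position == 0 else 1.0)

ranked = sorted(enumerate(document), key=lambda p: -score(p[1], p[0]))
for i, s in ranked:
    print(round(score(s, i), 2), " ".join(s))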


The parallel corpus

• English and Chinese (Hong Kong News)
• Already there:
  – 9000 documents and their translations
  – list of 300 queries in English and their translations
• We will create before the workshop:
  – document relevance judgements
    • 50 queries, 5 hrs/query, $10/hr -> $2,500
  – sentence relevance judgements
    • 4 doc/hr, need 4000 rel. judgements -> $10,000
  – optional: manual abstracts


Creating the judgements

• For each query:
  – submit to IR engine
  – discard unless it has 5-20 hits
  – get exhaustive document relevance judgements
  – consider top 100 documents
• Get sentence relevance judgements for:
  – all relevant documents
  – top 50 documents (including irrelevant ones!)


Experiments

• Experiment 1 (validation): compare preservation of ranking with other measures: judgement overlap, relative utility

• Experiments 2 & 3:
  – use with preservation of ranking
  – only possible due to new, parallel experimental design
  – factor out the effects of:
    • query translation
    • summarization
    • monolingual IR

• Baseline:
  – leading-sentence summaries vs. documents
  – other summarization methods vs. documents
  – (ideal: manual summaries vs. documents)


Experiments

• Monolingual experiments
  – Effect of summarization:
    • English Query -> English Doc (ranks)
    • English Query -> English Summary (ranks)
    • Chinese Query -> Chinese Doc (ranks)
    • Chinese Query -> Chinese Summary (ranks)
  – Baseline:
    • leading-sentence summary vs. document
    • ideal: manual summary vs. document
  – Effect of language on IR:
    • English Query -> English Doc
    • Chinese Query -> Chinese Doc

• Experiment 2: crosslingual
  – Effect of query translation:
    • English Query -> English Doc
    • English Query -> Chinese Query -> Chinese Doc


Timeline

• Pre-workshop: build corpus; sentence segmenter, Chinese tokenizer, machine translation, IR system, eSseNSe summarizer

• Workshop: system integration, build toolkit, summarization, evaluation, correlation, system refinement, final evaluation


Workshop schedule (weeks W1-W6)

• Set up experimental testbed
• Evaluation plan laid out
• Selection of training/test sub-corpora
• Alpha version of CLIA system tested on a small number of queries
• Baseline experiment
• Run experiment one
• Run experiment two
• Compute query translation quality
• Run experiment three
• Feedback from first three experiments
• System improvements
• Improved CLIA system ready
• Evaluation using unseen test data
• Draft of final report
• Additional experiments
• Wrap-up
• Final version of CLIA system released


Merit criteria

• Novelty: never done before; integration of CLIR and summarization

• Collaboration: participants wouldn’t work together otherwise

• Scientific merit: much-needed evaluation metrics; techniques for multi-document, multilingual summarization; incorporates utility, redundancy, subsumption

• Feasibility: uses existing work, specific plan for new work

• Community building: corpora, evaluation techniques, and software (CLIR, evaluation, and summarization); builds on prior evaluations (TDT, TREC, SUMMAC, DUC)

• Funder interest: multilingual systems, large amounts of data


What was dropped

• Interactive clustering of documents

• Evaluation of the quality of translated summaries

• Document translation

• Effects of document genre, length

• Evolving summaries


More…


What we didn’t talk about

• Hidden Markov models

• Part of speech tagging

• Probabilistic parsing

• Information retrieval

• Text classification

• etc.


THE END ?

http://perun.si.umich.edu/~radev/760/job/