1 Alexander Gelbukh Moscow, Russia. 2 Mexico 3 Computing Research Center (CIC), Mexico.

Alexander Gelbukh, Moscow, Russia

Mexico

Computing Research Center (CIC), Mexico

Chung-Ang University, Korea
Electronic Commerce and Internet Application Lab

Special Topics in Computer Science

The Art of Information Retrieval

Alexander Gelbukh

www.Gelbukh.com

Information Retrieval

In a huge amount of poorly structured information, find the information that you need when you don’t know exactly what you need or can’t explain it

The Web
User information need
Ranking

Importance

Knowledge: the main treasure of man
Web: Repository? Cemetery of information!
Natural language and multimedia information
  o Poorly structured, badly written
Corporate and organizational document bases
  o Senate speeches: Mexico
  o Medical data collections
  o Corporate memory. Microsoft knowledge base
Future: data explosion, increasing importance

Perspectives

Corporations: corporate databases
Organizations: document bases
Government
  o European Union multilingual problem
  o The same in Asia
Academia
  o Lots of open research topics
  o Web topics
  o Computational Linguistics topics
  o Intelligent technologies, AI

Textbook

http://sunsite.dcc.uchile.cl/irbook/

Contents

1. Introduction
2. Modeling
3. Retrieval Evaluation
4. Query Languages
5. Query Operations
6. Text and Multimedia Languages and Properties
7. Text Operations
8. Indexing and Searching
9. Parallel and Distributed IR
10. User Interfaces and Visualization
11. Multimedia IR: Models and Languages
12. Multimedia IR: Indexing and Searching
13. Searching the Web
14. Libraries and Bibliographical Systems
15. Digital Libraries

Calendar

1. September 18 Chapter 1 Introduction

2. September 25 Chapter 2 Modeling

3. October 2 Chapter 3 Retrieval Evaluation

4. October 9 Chapter 4 Query Languages

5. October 16 Chapter 5 Query Operations

October 23 – midterm exam

6. October 30 Chapter 6 Text and Multimedia Languages...

7. November 6 Chapter 7 Text Operations

8. November 13 Chapter 8 Indexing and Searching

9. November 20 Chapter 10 User Interfaces and Visualization

10. November 27 Chapter 13 Searching the Web

11. December 4 Chapter 14 Libraries and Bibliographical Systems

12. December 11 Chapter 15 Digital Libraries

December – final exam

Class structure

Main course: Information Retrieval
Discussion of previous chapter. Questions
I briefly present a new chapter

Research seminar: Natural Language Processing
Discussion of previous paper. Questions.
  o Identification of possible research topics
Presentation of a new paper or current work
Discussion and questions
Goal: publications!

Natural Language Processing Research Seminar

What CL is about

Computers to process natural language text
“Understand”
Generate
Search
Organize
Translate
…

Useful in IR

Methods

No: text as a stream of letters
  o Brute force statistics
  o Simplified heuristics (ex.: Porter)
Yes: attention to language rules
  o Linguistically motivated approaches
  o Knowledge-based approaches
  o Corpus-based approaches

What IR is about

Classical IR: find words? Concepts!
Question answering
Summarization
Clustering
…

Take language seriously

Text representations for IR

Represent the retrieval features
  o Strings → stems (lexemes), synsets, phrases.
  o Women → woman, lady, female
  o Old men and women → old woman
Structured representation of text
  o Network of related events and entities
  o Enables logical inference
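
A minimal sketch of the first idea on this slide, assuming toy lookup tables. The names LEMMAS, SYNSETS, features and matches are hypothetical, and the entries are illustrative rather than taken from a real lexicon. It shows how mapping strings to lemmas and synsets lets the query "female" match a text that contains only "women". The phrase-level example (old men and women → old woman) would need parsing and is not covered.

```python
# Toy normalization tables; illustrative assumptions, not a real lexicon.
LEMMAS = {"women": "woman", "men": "man"}
SYNSETS = {"woman": {"woman", "lady", "female"}}


def features(text):
    """Map a text to retrieval features: lemmas, expanded to synset members."""
    feats = set()
    for token in text.lower().split():
        lemma = LEMMAS.get(token, token)
        # Use the whole synset when one is known, otherwise the lemma itself.
        feats |= SYNSETS.get(lemma, {lemma})
    return feats


def matches(query, document):
    """A document matches if it shares at least one feature with the query."""
    return bool(features(query) & features(document))


print(matches("female", "Old men and women"))
```

With raw string matching the same query and text share no word, so the document would be missed.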

CL tasks useful in IR

Morphology (stemming)
POS / Word sense disambiguation
Word relatedness
Anaphora resolution
Parsing and semantics (phrase search)
Synonymic rephrasing
Translation
etc…

Each one a whole science in itself

Morphology

Q: pig   T: piggish
Simple: stemming
  o piggish → pig-
Lexeme: set of word forms
  o same stem can give different words
  o pigment → not pig; piny → pine, not pin
Dictionary/corpus-based methods
  o Learning; dictionary management
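
A small sketch contrasting the two approaches on this slide. The suffix list and the lexicon entries are invented for illustration; this is not the Porter algorithm, only a toy suffix stripper in the same spirit, followed by a dictionary lookup that overrides it.

```python
# Hypothetical toy suffix list; a real stemmer has many more rules.
SUFFIXES = ["ish", "ment", "y"]

# Hypothetical dictionary of known word forms and their lexemes.
LEXICON = {"piggish": "pig", "pigment": "pigment", "piny": "pine"}


def naive_stem(word):
    """Strip the first matching suffix, then undo a doubled final consonant."""
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= 3:
            # "pigg" -> "pig": drop one letter of a doubled consonant.
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word


def lemma(word):
    """Prefer the dictionary; fall back to stemming for unknown words."""
    return LEXICON.get(word, naive_stem(word))


for w in ("piggish", "pigment", "piny"):
    print(w, naive_stem(w), lemma(w))
```

The stemmer conflates pigment with pig and piny with pin, which are exactly the errors listed above; the dictionary lookup avoids them at the cost of having to build and maintain the dictionary.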

Part of Speech Disambiguation

Q: oil well   T: He did very well
Q: what is an are?   T: They are nice
Important for English, Chinese. Less important for other types of languages
Perhaps not so helpful directly, but is necessary for most other tasks
Usually statistical / heuristic methods

Word Sense Disambiguation

Q: bank account   T: on the beautiful banks of Han river ...
bill: document, banknote, law, ax, peak, Gates...
Very frequent, almost any word in text
Statistical & dictionary methods
International competitions
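
One dictionary method can be sketched as a simplified Lesk-style gloss overlap: pick the sense whose dictionary definition shares the most words with the context. The sense labels, glosses and stopword list below are invented for illustration and do not come from a real dictionary.

```python
# Hypothetical sense inventory with made-up glosses.
SENSES = {
    "bank": {
        "financial institution":
            "an institution that keeps money and offers an account or loan",
        "river side":
            "the sloping land along the side of a river or lake",
    }
}
STOPWORDS = {"a", "an", "the", "of", "on", "or", "and", "that", "to"}


def content_words(text):
    """Lowercased words of a text without punctuation and stopwords."""
    return {w.strip(".,").lower() for w in text.split()} - STOPWORDS


def simplified_lesk(word, context):
    """Return the sense whose gloss overlaps most with the context words."""
    ctx = content_words(context)
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        overlap = len(ctx & content_words(gloss))
        if overlap > best_overlap:  # ties keep the earlier sense
            best_sense, best_overlap = sense, overlap
    return best_sense


print(simplified_lesk("bank", "bank account"))
print(simplified_lesk("bank", "on the beautiful banks of Han river"))
```

With these glosses the query and the text receive different senses, so a sense-aware system could decide not to return the river text for the query about accounts.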

Word relatedness

Q: female   T: woman (women)
  o Synonyms. Subtypes/super-types
  o Dictionaries. WordNet. Similarity. Lesk.
Q: Korea   T: Seoul
  o Other linguistic relationships (e.g., part)
  o Real-world relationships (facts)
Q: Clinton   T: Lewinsky
  o Statistical co-occurrence (MI)
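
A minimal sketch of statistical co-occurrence, assuming MI here means pointwise mutual information estimated at document level: PMI(x, y) = log2(P(x, y) / (P(x) P(y))). The helper name pmi_table and the four tiny documents are invented for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(documents):
    """Build a pmi(x, y) function from document-level co-occurrence counts."""
    n_docs = len(documents)
    word_counts = Counter()
    pair_counts = Counter()
    for doc in documents:
        words = set(doc.lower().split())
        word_counts.update(words)
        pair_counts.update(frozenset(p) for p in combinations(sorted(words), 2))

    def pmi(x, y):
        joint = pair_counts[frozenset((x, y))]
        if joint == 0:
            return float("-inf")  # the two words never occur together
        # (joint/N) / ((cx/N) * (cy/N)) simplifies to joint * N / (cx * cy)
        return math.log2(joint * n_docs / (word_counts[x] * word_counts[y]))

    return pmi


# Invented toy collection.
docs = [
    "korea seoul travel",
    "korea seoul economy",
    "mexico economy",
    "mexico travel",
]
pmi = pmi_table(docs)
print(pmi("korea", "seoul"), pmi("korea", "economy"), pmi("korea", "mexico"))
```

A positive score means the words occur together more often than chance would predict, zero means independence, and pairs that never co-occur get negative infinity in this sketch.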

Anaphora resolution

Q: Awards of Prof. Han   T: Prof. Han said... He did... IBM awarded him...
  o Frequency
  o Phrases, co-occurrence, summarization, inference, translation
Heuristic (Mitkov) and knowledge-based methods
Other types of co-reference

Parsing, semantics

Q: Awards of Prof. Han
T1: Prof. Han among many other prizes has several IBM awards
T2: Mr. Kang has an award Prof. Han does not know of
Understanding of text
  o Rich structured representation
Better phrase search; question answering, summarization, ...

Synonymic rephrasing, reasoning

Q: experienced computer scientists
T: Prof. Han has been programming for many years and was awarded an IBM award
Requires good syntactic and semantic analysis
Knowledge-based methods

Multilingual access

Q: 요구르트   T: We sell excellent yoghurt. Продаем йогурт. Se vende rico yogur.
  o Search multilingual collections
Europe: dozens of official languages of EU
  o If you don’t know how to say it in English
Dictionaries, bilingual corpora, ...

Tasks are entangled

Many CL tasks require other tasks
  o Morphology → syntax → semantics
Many CL tasks form circles
  o parsing ← WSD ← parsing
  o I see a wild cat with a telescope (tripod?)
Can be done quick-and-dirty (?)
  o Fighting for last %s
  o Zipf law: 20% of men drink 80% of beer
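
The 20/80 remark can be checked on any token list by asking what share of all word occurrences is covered by the most frequent 20% of distinct words. A minimal sketch, assuming a hypothetical helper coverage_of_top_types; the token list is invented, and real corpora will give different numbers.

```python
from collections import Counter


def coverage_of_top_types(tokens, fraction=0.2):
    """Share of all tokens covered by the most frequent `fraction` of types."""
    if not tokens:
        return 0.0
    ranked = sorted(Counter(tokens).values(), reverse=True)
    top_n = max(1, int(len(ranked) * fraction))  # always keep at least one type
    return sum(ranked[:top_n]) / len(tokens)


# Invented toy input: 5 distinct words, 10 tokens in total.
tokens = ["the"] * 6 + ["cat", "dog", "sees", "runs"]
print(coverage_of_top_types(tokens))  # 0.6
```

A quick-and-dirty method that handles only the frequent cases already covers a large share of the text; the long tail of rare cases is where the fight for the last percent takes place.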

Tools and infrastructure

Analysis tools
  o Tasks, methods
Dictionaries and grammars
  o Types, structure
  o Automatic acquisition
Corpora
  o Corpora analysis tools and methods

Possible tasks

WSD to help IR
Clustering + summarization in IR results
Anaphora and coreference resolution to help IR
Multilingual IR
Applications to Korean
... a lot of others

Reading

Textbooks
  o Manning & Schütze, Allen, Jurafsky, Hausser, ...
CICLing proceedings
Computational Linguistics
Google, ResearchIndex

Questions

Who expects to publish?
Who will make a presentation at the next seminar?

Thank you!

Till September 18