Web Search: Techniques, algorithms and Aplications Basic...

37
1 Web Search: Techniques, algorithms and Aplications Basic Techniques for Web Search German Rigau <[email protected]> [Based on slides by Eneko Agirre … and Christopher Manning and Prabhakar Raghavan]

Transcript of Web Search: Techniques, algorithms and Aplications Basic...

Page 1: Web Search: Techniques, algorithms and Aplications Basic ...adimen.si.ehu.es/~rigau/research/Doctorat/LSKBs/10-WS... · information, and identify and retrieve from the files certain

1

Web Search:

Techniques, algorithms and Aplications

Basic Techniques for Web Search

German Rigau <[email protected]>

[Based on slides by Eneko Agirre … and Christopher Manning and Prabhakar Raghavan]

Page 2: Web Search: Techniques, algorithms and Aplications Basic ...adimen.si.ehu.es/~rigau/research/Doctorat/LSKBs/10-WS... · information, and identify and retrieve from the files certain

Resources

Information Retrieval book Possible coursework

Page 3: Web Search: Techniques, algorithms and Aplications Basic ...adimen.si.ehu.es/~rigau/research/Doctorat/LSKBs/10-WS... · information, and identify and retrieve from the files certain

3

Basic Techniques for Web Search

Review of applications Basic Techniques in detail

Boolean search Vocabularies, dictionaries, index Scoring, evaluation, complete system Web search

Semantic search

Page 4: Web Search: Techniques, algorithms and Aplications Basic ...adimen.si.ehu.es/~rigau/research/Doctorat/LSKBs/10-WS... · information, and identify and retrieve from the files certain

Information Retrieval

Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

4

Page 5: Web Search: Techniques, algorithms and Aplications Basic ...adimen.si.ehu.es/~rigau/research/Doctorat/LSKBs/10-WS... · information, and identify and retrieve from the files certain

Unstructured (text) vs. structured (database) data in the 90s

5

Page 6: Web Search: Techniques, algorithms and Aplications Basic ...adimen.si.ehu.es/~rigau/research/Doctorat/LSKBs/10-WS... · information, and identify and retrieve from the files certain

Unstructured (text) vs. structured (database) data now

6

Page 7: Web Search: Techniques, algorithms and Aplications Basic ...adimen.si.ehu.es/~rigau/research/Doctorat/LSKBs/10-WS... · information, and identify and retrieve from the files certain

IR vs. databases:Structured vs unstructured data

Structured data tends to refer to information in “tables”

7

Employee Manager Salary

Smith Jones 50000

Chang Smith 60000

50000Ivy Smith

Typically allows numerical range and exact match(for text) queries, e.g.,Salary < 60000 AND Manager = Smith.

Page 8: Web Search: Techniques, algorithms and Aplications Basic ...adimen.si.ehu.es/~rigau/research/Doctorat/LSKBs/10-WS... · information, and identify and retrieve from the files certain

Unstructured data

Typically refers to free text Allows

Keyword queries including operators More sophisticated “concept” queries e.g.,

find all web pages dealing with drug abuse Classic model for searching text documents

8

Page 9: Web Search: Techniques, algorithms and Aplications Basic ...adimen.si.ehu.es/~rigau/research/Doctorat/LSKBs/10-WS... · information, and identify and retrieve from the files certain

Semi-structured data

In fact almost no data is “unstructured” E.g., this slide has distinctly identified zones such as the

Title and Bullets Facilitates “semi-structured” search such as

Title contains data AND Bullets contain search XML search

… to say nothing of linguistic structure (NLP)

9

Page 10: Web Search: Techniques, algorithms and Aplications Basic ...adimen.si.ehu.es/~rigau/research/Doctorat/LSKBs/10-WS... · information, and identify and retrieve from the files certain

Semi-structured data

Page 11: Web Search: Techniques, algorithms and Aplications Basic ...adimen.si.ehu.es/~rigau/research/Doctorat/LSKBs/10-WS... · information, and identify and retrieve from the files certain

Review of applications Information Retrieval (IR)

Cross-lingual Information Retrieval (CLIR) Question Answering (Q/A)

Related tasks Classification Information Extraction (IE) Summarization Machine Translation

Page 12: Web Search: Techniques, algorithms and Aplications Basic ...adimen.si.ehu.es/~rigau/research/Doctorat/LSKBs/10-WS... · information, and identify and retrieve from the files certain

Need for IR With the advance of WWW -

http://www.worldwidewebsize.com/ Corporate Intranets Personal PC’s Various needs for information:

Search for (new) documents that fall in a given topic Search for a specific (new) information Search a (new) answer to a question Search for information in a different language … Search for images! Search for music! Search for a (candidate) friend … similar hobbies, etc.

Page 13: Web Search: Techniques, algorithms and Aplications Basic ...adimen.si.ehu.es/~rigau/research/Doctorat/LSKBs/10-WS... · information, and identify and retrieve from the files certain

A definition of IR

Salton (1989): “Information-retrieval systems process files of records and requests for information, and identify and retrieve from the files certain records in response to the information requests. The retrieval of particular records depends on the similarity between the records and the queries, which in turn is measured by comparing the values of certain attributes to records and information requests.”

Page 14: Web Search: Techniques, algorithms and Aplications Basic ...adimen.si.ehu.es/~rigau/research/Doctorat/LSKBs/10-WS... · information, and identify and retrieve from the files certain

Examples of IR systems Conventional (library catalog)

Search by keyword, title, author, etc. Text-based (Google, Yahoo, Bing …).

Search by keywords. Limited search using queries in natural language. Categorized search (dmoz, google directories) Intranets (ehu.es - reglamento doctorado)

Multimedia Google images (it’s mostly text) By content (this is more semantic-web)

Images Music (Shazam)

Other: Cross language information retrieval: Elhuyar Question answering systems:

How many people live in New York? ... state? What is the population of New York?

Page 15: Web Search: Techniques, algorithms and Aplications Basic ...adimen.si.ehu.es/~rigau/research/Doctorat/LSKBs/10-WS... · information, and identify and retrieve from the files certain

Review of applications

Information Retrieval (IR) Cross-lingual Information Retrieval (CLIR) Question Answering (Q/A)

Related tasks Classification Information Extraction (IE) Summarization Machine Translation

Page 16: Web Search: Techniques, algorithms and Aplications Basic ...adimen.si.ehu.es/~rigau/research/Doctorat/LSKBs/10-WS... · information, and identify and retrieve from the files certain

Document ClassificationAssign labels to each document or web-page:• Labels are most often topics such as Dmoz or Amazon

e.g., "finance," "sports," "news>world>asia>business"• Labels may be genres

e.g., "editorials" "movie-reviews" "news“• Labels may be opinion

e.g., “like”, “hate”, “neutral”• Labels may be domain-specific binary

e.g., "interesting-to-me" : "not-interesting-to-me”e.g., “spam” : “not-spam”e.g., “is a toner cartridge ad” :“isn’t”

Page 17: Web Search: Techniques, algorithms and Aplications Basic ...adimen.si.ehu.es/~rigau/research/Doctorat/LSKBs/10-WS... · information, and identify and retrieve from the files certain

Document Classification• Given:

– A description of an instance, xX, where X is a document.

• Issue: how to represent text documents.

– A fixed set of categories:

C = {c1, c2,…, cn}

• Determine:– The category of x: c(x)C, where c(x) is a categorization

function whose domain is X and whose range is C.• We want to know how to build categorization functions

(“classifiers”).

Page 18: Web Search: Techniques, algorithms and Aplications Basic ...adimen.si.ehu.es/~rigau/research/Doctorat/LSKBs/10-WS... · information, and identify and retrieve from the files certain

Multimedia GUIGarb.Coll.SemanticsML Planning

planningtemporalreasoningplanlanguage...

programmingsemanticslanguageproof...

learningintelligencealgorithmreinforcementnetwork...

garbagecollectionmemoryoptimizationregion...

“planning  language  proof  intelligence”

TrainingData:

TestingData:

Classes:(AI)

Document Classification

(Programming) (HCI)

... ...

(Note: in real life there is often a hierarchy, not present in the above problem statement; and you get papers on ML

approaches to Garb. Coll.)

Page 19: Web Search: Techniques, algorithms and Aplications Basic ...adimen.si.ehu.es/~rigau/research/Doctorat/LSKBs/10-WS... · information, and identify and retrieve from the files certain

Information Extraction (IE)

• Identify specific pieces of information (data) in a unstructured or semi-structured textual document.

• Transform unstructured information in a corpus of documents or web pages into a structured database.

• Applied to different types of text:– Newspaper articles

– Web pages

– Scientific articles

– Newsgroup messages

– Classified ads

– Medical notes

– Attributes of Named Entities (products, people, ...)

Page 20: Web Search: Techniques, algorithms and Aplications Basic ...adimen.si.ehu.es/~rigau/research/Doctorat/LSKBs/10-WS... · information, and identify and retrieve from the files certain

Subject: US-TN-SOFTWARE PROGRAMMERDate: 17 Nov 1996 17:37:29 GMTOrganization: Reference.Com Posting ServiceMessage-ID: <[email protected]>

SOFTWARE PROGRAMMER

Position available for Software Programmer experienced in generating software for PC-Based Voice Mail systems. Experienced in C Programming. Must be familiar with communicating with and controlling voice cards; preferable Dialogic, however, experience with others such as Rhetorix and Natural Microsystems is okay. Prefer 5 years or more experience with PC Based Voice Mail, but will consider as little as 2 years. Need to find a Senior level person who can come on board and pick up code with very little training. Present Operating System is DOS. May go to OS-2 or UNIX in future.

Please reply to:Kim AndersonAdNET(901) 458-2888 [email protected]

Subject: US-TN-SOFTWARE PROGRAMMERDate: 17 Nov 1996 17:37:29 GMTOrganization: Reference.Com Posting ServiceMessage-ID: <[email protected]>

SOFTWARE PROGRAMMER

Position available for Software Programmer experienced in generating software for PC-Based Voice Mail systems. Experienced in C Programming. Must be familiar with communicating with and controlling voice cards; preferable Dialogic, however, experience with others such as Rhetorix and Natural Microsystems is okay. Prefer 5 years or more experience with PC Based Voice Mail, but will consider as little as 2 years. Need to find a Senior level person who can come on board and pick up code with very little training. Present Operating System is DOS. May go to OS-2 or UNIX in future.

Please reply to:Kim AndersonAdNET(901) 458-2888 [email protected]

IE: Sample Job Posting

Page 21: Web Search: Techniques, algorithms and Aplications Basic ...adimen.si.ehu.es/~rigau/research/Doctorat/LSKBs/10-WS... · information, and identify and retrieve from the files certain

IE: Extracted Job Templatecomputer_science_jobid: [email protected]: SOFTWARE PROGRAMMERsalary:company:recruiter:state: TNcity:country: USlanguage: Cplatform: PC \ DOS \ OS-2 \ UNIXapplication:area: Voice Mailreq_years_experience: 2desired_years_experience: 5req_degree:desired_degree:post_date: 17 Nov 1996

Page 22: Web Search: Techniques, algorithms and Aplications Basic ...adimen.si.ehu.es/~rigau/research/Doctorat/LSKBs/10-WS... · information, and identify and retrieve from the files certain

IE: Amazon Book Description….</td></tr></table><b class="sans">The Age of Spiritual Machines : When Computers Exceed Human Intelligence</b><br><font face=verdana,arial,helvetica size=-1>by <a href="/exec/obidos/search-handle-url/index=books&field-author= Kurzweil%2C%20Ray/002-6235079-4593641">Ray Kurzweil</a><br></font><br><a href="http://images.amazon.com/images/P/0140282025.01.LZZZZZZZ.jpg"><img src="http://images.amazon.com/images/P/0140282025.01.MZZZZZZZ.gif" width=90 height=140 align=left border=0></a><font face=verdana,arial,helvetica size=-1><span class="small"><span class="small"><b>List Price:</b> <span class=listprice>$14.95</span><br><b>Our Price: <font color=#990000>$11.96</font></b><br><b>You Save:</b> <font color=#990000><b>$2.99 </b>(20%)</font><br></span><p> <br>…

Page 23: Web Search: Techniques, algorithms and Aplications Basic ...adimen.si.ehu.es/~rigau/research/Doctorat/LSKBs/10-WS... · information, and identify and retrieve from the files certain

IE: Extracted Book Template

Title: The Age of Spiritual Machines : When Computers Exceed Human IntelligenceAuthor: Ray KurzweilList-Price: $14.95Price: $11.96

Page 24: Web Search: Techniques, algorithms and Aplications Basic ...adimen.si.ehu.es/~rigau/research/Doctorat/LSKBs/10-WS... · information, and identify and retrieve from the files certain

IE: real state

• http://www.nuroa.es/

Page 25: Web Search: Techniques, algorithms and Aplications Basic ...adimen.si.ehu.es/~rigau/research/Doctorat/LSKBs/10-WS... · information, and identify and retrieve from the files certain

IE: Scientific bibliography• We can improve the library catalog with a

citation index• Match

– Citation (Agirre & Rigau, 1996)– Reference: Agirre, E. and G. Rigau. 1996. Word

sense disambiguation using conceptual density. Proceedings of COLING, 1996

– Paper

• http://scholar.google.com• http://academic.research.microsoft.com

Page 26: Web Search: Techniques, algorithms and Aplications Basic ...adimen.si.ehu.es/~rigau/research/Doctorat/LSKBs/10-WS... · information, and identify and retrieve from the files certain
Page 27: Web Search: Techniques, algorithms and Aplications Basic ...adimen.si.ehu.es/~rigau/research/Doctorat/LSKBs/10-WS... · information, and identify and retrieve from the files certain
Page 28: Web Search: Techniques, algorithms and Aplications Basic ...adimen.si.ehu.es/~rigau/research/Doctorat/LSKBs/10-WS... · information, and identify and retrieve from the files certain

TAC workshop - Nov. 2009

28

Named Entity Disambiguation (aka Entity linking)

Given Knowledge Base

Given target string and surrounding text:I watched “Slapshot”, the 1977 hockey classic starring

Paul Newman for the first time.

Return entity in KB (E0181364) or NIL

KB subset/superset of WikipediaPaul_Newman E0181364Paul_Newman_(politician) NILPaul_Newman_(cricketer) NILPaul_Newman_(linguist) NILPaul_Newman_(band) NIL

string entityPaul Newman E0181364

Page 29: Web Search: Techniques, algorithms and Aplications Basic ...adimen.si.ehu.es/~rigau/research/Doctorat/LSKBs/10-WS... · information, and identify and retrieve from the files certain

Slot filling (Infobox generation)

Distant supervision (Mintz et al. 09):– Use facts in Knowledge Base

=> gold-standard entity–slot–filler – Search for spans containing

entity–filler pair in document base => positive examples to train

– Train classifier per slot– Search for mentions of target

entity in document collection– Run each of the classifiers

Manual work kept to a minimum.

Page 30: Web Search: Techniques, algorithms and Aplications Basic ...adimen.si.ehu.es/~rigau/research/Doctorat/LSKBs/10-WS... · information, and identify and retrieve from the files certain

TAC workshop - Nov. 2009

30

Google Knowledge Graph

Page 31: Web Search: Techniques, algorithms and Aplications Basic ...adimen.si.ehu.es/~rigau/research/Doctorat/LSKBs/10-WS... · information, and identify and retrieve from the files certain

TAC workshop - Nov. 2009

31

Google Knowledge Graph

Page 32: Web Search: Techniques, algorithms and Aplications Basic ...adimen.si.ehu.es/~rigau/research/Doctorat/LSKBs/10-WS... · information, and identify and retrieve from the files certain

TAC workshop - Nov. 2009

32

Google Knowledge Graph

Page 33: Web Search: Techniques, algorithms and Aplications Basic ...adimen.si.ehu.es/~rigau/research/Doctorat/LSKBs/10-WS... · information, and identify and retrieve from the files certain

Summarization

Page 34: Web Search: Techniques, algorithms and Aplications Basic ...adimen.si.ehu.es/~rigau/research/Doctorat/LSKBs/10-WS... · information, and identify and retrieve from the files certain

Machine Translation

Page 35: Web Search: Techniques, algorithms and Aplications Basic ...adimen.si.ehu.es/~rigau/research/Doctorat/LSKBs/10-WS... · information, and identify and retrieve from the files certain

Applications and tools: what is the relation?

– Information Retrieval (IR)• Conventional (library catalog) • Text-based• Multimedia by content

• Cross-lingual Information Retrieval (CLIR)

• Question Answering (Q/A)

– Related tasks• Information Extraction (IE)

• Classification

• Summarization

• Machine Translation

– Tools• Tokenization

• Sentence Splitting

• Language Identifiers

• Lemmatization, POS tagging

• Named Entity Recognizers and Categorizers (NERC)

• Parsing

• Named Entity Dis. (NEDWord Sense Dis. (WSD)

• Semantic Role Labelling (SRL)

Page 36: Web Search: Techniques, algorithms and Aplications Basic ...adimen.si.ehu.es/~rigau/research/Doctorat/LSKBs/10-WS... · information, and identify and retrieve from the files certain

Applications and resources: what is the relation?

– Information Retrieval (IR)• Conventional (library catalog) • Text-based• Multimedia by content

• Cross-lingual Information Retrieval (CLIR)

• Question Answering (Q/A)

– Related tasks• Information Extraction (IE)

• Classification

• Summarization

• Machine Translation

– Resources• Wordnets, EuroWordnets, MCR

• Topic signatures

• FrameNet

• SUMO

• CYC

• ...

• Domain ontologies: UMLS

• Wikipedia / DBpedia / Freebase /Linked-open-data

Page 37: Web Search: Techniques, algorithms and Aplications Basic ...adimen.si.ehu.es/~rigau/research/Doctorat/LSKBs/10-WS... · information, and identify and retrieve from the files certain

37

Web Search:

Techniques, algorithms and Aplications

Basic Techniques for Web Search

German Rigau <[email protected]>

[Based on slides by Eneko Agirre … and Christopher Manning and Prabhakar Raghavan]