
Information Retrieval Techniques

MS(CS) Lecture 2, AIR UNIVERSITY MULTAN CAMPUS

Issues and Challenges in IR

RELEVANCE?

Issues and Challenges in IR

• Query Formulation
  – Describing the information need
• Relevance
  – Relevance to the query (system relevance)
  – Relevance to the information need (user relevance)
• Evaluation
  – System-oriented (bypasses the user)
  – User-oriented (relevance feedback)

What makes IR “experimental”?

• Evaluation
  – How do we design experiments that answer our questions?
  – How do we assess the quality of the documents that come out of the IR black box?
  – Can we do this automatically?

Simplification?

[Diagram: the information-seeking process as a loop. Source selection (choose a resource) → query formulation (produce a query) → search (produce a ranked list) → selection (choose documents) → examination (inspect documents) → delivery. Feedback arrows cover query reformulation, vocabulary learning, relevance feedback, and source reselection.]

Is this itself a vast simplification?

The Central Problem in IR

[Diagram: the information seeker's concepts are expressed as query terms; the authors' concepts are expressed as document terms. Do these represent the same concepts?]

Problems in Query Formulation: Stefano Mizzaro's Model of Relevance in IR

• RIN: Real Information Need (the target)
• PIN: Perceived Information Need (in the user's mind)
• EIN: Expressed Information Need (in natural language)
• FIN: Formalized Information Need (the query)

Paper reference: the four dimensions of relevance, by Stephen W. Draper.

Taylor’s Model

• The visceral need (Q1): the actual, but unexpressed, need for information
• The conscious need (Q2): the conscious, within-brain description of the need
• The formalized need (Q3): the formal statement of the question
• The compromised need (Q4): the question as presented to the information system

Robert S. Taylor (1962). The Process of Asking Questions. American Documentation, 13(4), 391–396.

Taylor's Model and IR Systems

[Diagram: the visceral need (Q1) becomes a conscious need (Q2), which is formalized (Q3) and finally compromised (Q4) into the question actually put to the IR system, which returns results. Naïve users present their questions at the compromised level; question negotiation works back toward the earlier stages.]

The classic search model

[Diagram: a user task ("get rid of mice in a politically correct way") gives rise to an info need ("info about removing mice without killing them"), which is verbalized as the query "how trap mice alive" and sent to the search engine over the collection; results come back, and query refinement loops around the search step. Misconception can creep in between the task and the info need, and misformulation between the info need and the query.]

Building Blocks of IRS-I

• Different models of information retrieval
  – Boolean model
  – Vector space model
  – Language models
  (a small vector space sketch follows this list)
• Representing the meaning of documents
  – How do we capture the meaning of documents?
  – Is meaning just the sum of all terms?
• Indexing
  – How do we actually store all those words?
  – How do we access indexed terms quickly?
• Relevance feedback
  – How do humans (and machines) modify queries based on retrieved results?
• User interaction
  – Information retrieval meets computer-human interaction
  – How do we present search results to users in an effective manner?
  – What tools can systems provide to aid the user in information seeking?
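To make the vector space bullet concrete, here is a minimal sketch: documents and the query become term-frequency vectors and are compared by cosine similarity. The toy two-document collection and the helper names are illustrative assumptions, not part of the lecture.

# Minimal vector-space-model sketch: term-frequency vectors + cosine similarity.
# The toy documents and function names are illustrative assumptions.
import math
from collections import Counter

docs = {
    "d1": "information retrieval by computer",
    "d2": "database systems store structured information",
}

def tf_vector(text):
    # Term-frequency vector as a Counter over whitespace tokens.
    return Counter(text.lower().split())

def cosine(v1, v2):
    dot = sum(v1[t] * v2[t] for t in v1)
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

query = tf_vector("information retrieval")
ranked = sorted(docs, key=lambda d: cosine(query, tf_vector(docs[d])), reverse=True)
print(ranked)  # d1 ranks above d2 for this query

The Boolean model would instead test set membership of query terms, and language models would score documents by the probability of generating the query; this sketch only covers the vector space case.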

Building Blocks of IRS-II

IR Extensions
• Filtering and Categorization
  – Traditional information retrieval: static collection, dynamic queries
  – What about static queries against dynamic collections?
• Multimedia Retrieval
  – Thus far, we've been focused on text…
  – What about images, sounds, video, etc.?
• Question Answering
  – We want answers, not just documents!

CAN YOU GUESS WHAT KIND OF DATA IR MAINLY FOCUSES ON?

• Structured
• Unstructured
• Semi-structured

What about databases?

• What are examples of databases?
  – Banks storing account information
  – Retailers storing inventories
  – Universities storing student grades
• What exactly is a (relational) database?
  – Think of them as a collection of tables
  – They model some aspect of "the world"

A (Simple) Database Example

Department Table
  Department ID | Department
  EE            | Electrical Engineering
  HIST          | History
  CLIS          | Information Studies

Course Table
  Course ID | Course Name
  lbsc690   | Information Technology
  ee750     | Communication
  hist405   | American History

Enrollment Table
  Student ID | Course ID | Grade
  1          | lbsc690   | 90
  1          | ee750     | 95
  2          | lbsc690   | 95
  2          | hist405   | 80
  3          | hist405   | 90
  4          | lbsc690   | 98

Student Table
  Student ID | Last Name | First Name | Department ID | email
  1          | Arrows    | John       | EE            | jarrows@wam
  2          | Peters    | Kathy      | HIST          | kpeters2@wam
  3          | Smith     | Chris      | HIST          | smith2002@glue
  4          | Smith     | John       | CLIS          | js03@wam

IR vs. databases: structured vs. unstructured data

• Structured data tends to refer to information in "tables"

  Employee | Manager | Salary
  Smith    | Jones   | 50000
  Chang    | Smith   | 60000
  Ivy      | Smith   | 50000

Typically allows numerical range and exact match (for text) queries, e.g., Salary < 60000 AND Manager = Smith.

Database Queries

• What would you want to know from a database?
  – What classes is John Arrows enrolled in?
  – Who has the highest grade in LBSC 690?
  – Who's in the history department?
  – Of all the non-CLIS students taking LBSC 690 who have a last name shorter than six characters and were born on a Monday, who has the longest email address?
  (a small SQL sketch of the first question follows)
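For instance, the first question becomes a join over the example tables. Below is a minimal sqlite3 sketch; it recreates only a subset of the slide's tables, and the table and column names are assumptions derived from those tables, not a prescribed schema.

# Answering "What classes is John Arrows enrolled in?" as a structured query.
# Table/column names are assumptions based on the slide's example tables.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
CREATE TABLE student (student_id INTEGER, last_name TEXT, first_name TEXT,
                      department_id TEXT, email TEXT);
CREATE TABLE course (course_id TEXT, course_name TEXT);
CREATE TABLE enrollment (student_id INTEGER, course_id TEXT, grade INTEGER);
""")

cur.executemany("INSERT INTO student VALUES (?, ?, ?, ?, ?)", [
    (1, "Arrows", "John", "EE", "jarrows@wam"),
    (2, "Peters", "Kathy", "HIST", "kpeters2@wam"),
])
cur.executemany("INSERT INTO course VALUES (?, ?)", [
    ("lbsc690", "Information Technology"), ("ee750", "Communication"),
])
cur.executemany("INSERT INTO enrollment VALUES (?, ?, ?)", [
    (1, "lbsc690", 90), (1, "ee750", 95), (2, "lbsc690", 95),
])

cur.execute("""
SELECT c.course_name
FROM student s
JOIN enrollment e ON s.student_id = e.student_id
JOIN course c     ON e.course_id  = c.course_id
WHERE s.first_name = 'John' AND s.last_name = 'Arrows'
ORDER BY c.course_name
""")
print([row[0] for row in cur.fetchall()])  # ['Communication', 'Information Technology']

Note that the answer is exact and formally correct, which is precisely the contrast with IR drawn in the comparison slides below.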

Unstructured data

• Typically refers to free text
• Allows
  – Keyword queries including operators
  – More sophisticated "concept" queries, e.g.,
    • find all web pages dealing with drug abuse
• Classic model for searching text documents


Semi-structured data

• In fact almost no data is "unstructured"
• E.g., this slide has distinctly identified zones such as the Title and Bullets
• … to say nothing of linguistic structure
• Facilitates "semi-structured" search such as
  – Title contains data AND Bullets contain search
• Or even
  – Title is about Object Oriented Programming AND Author something like stro*rup
  – where * is the wild-card operator
  (a toy zone-search sketch follows)
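A minimal sketch of zone-based (semi-structured) search, assuming documents are stored as dictionaries of named zones; the zone names, toy documents, and fnmatch-style wildcard handling are illustrative, not any particular system's API.

# Toy semi-structured search over documents with named zones (illustrative assumptions).
from fnmatch import fnmatch

docs = [
    {"title": "Object Oriented Programming", "author": "Bjarne Stroustrup",
     "bullets": ["classes and templates", "virtual functions"]},
    {"title": "Data Mining Basics", "author": "A. N. Other",
     "bullets": ["search strategies", "clustering"]},
]

def title_contains(doc, word):
    return word.lower() in doc["title"].lower()

def bullets_contain(doc, word):
    return any(word.lower() in b.lower() for b in doc["bullets"])

def author_like(doc, pattern):
    # fnmatch treats * as a wildcard, mirroring the stro*rup example.
    return fnmatch(doc["author"].lower(), pattern.lower())

# "Title contains data AND Bullets contain search"
print([d["title"] for d in docs if title_contains(d, "data") and bullets_contain(d, "search")])
# "Author something like stro*rup"
print([d["title"] for d in docs if author_like(d, "*stro*rup*")])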


Comparing IR to databases

• Data. Databases: structured. IR: unstructured.
• Fields. Databases: clear semantics (SSN, age). IR: no fields (other than text).
• Queries. Databases: defined (relational algebra, SQL). IR: free text ("natural language"), Boolean.
• Recoverability. Databases: critical (concurrency control, recovery, atomic operations). IR: downplayed, though still an issue.
• Matching. Databases: exact (results are always "correct"). IR: imprecise (need to measure effectiveness).

Databases vs. IR

• What we're retrieving. IR: mostly unstructured, free text with some metadata. Databases: structured data with clear semantics based on a formal model.
• Queries we're posing. IR: vague, imprecise information needs (often expressed in natural language). Databases: formally (mathematically) defined queries, unambiguous.
• Results we get. IR: sometimes relevant, often not. Databases: exact, always correct in a formal sense.
• Interaction with system. IR: interaction is important. Databases: one-shot queries.
• Other issues. IR: issues downplayed. Databases: concurrency, recovery, atomicity are all critical.

IRS IN ACTION (TASKS)

Information Retrieval and Web Search, Pandu Nayak and Prabhakar Raghavan

Outline

• What is the IR problem?
• How to organize an IR system? (Or: the main processes in IR)
• Indexing
• Retrieval

The problem of IR

• Goal = find documents relevant to an information need from a large document set

[Diagram: an info need is expressed as a query; the IR system retrieves an answer list from the document collection.]

IR problem

• First applications: in libraries (1950s)

  ISBN: 0-201-12227-8
  Author: Salton, Gerard
  Title: Automatic text processing: the transformation, analysis, and retrieval of information by computer
  Editor: Addison-Wesley
  Date: 1989
  Content: <Text>

• External attributes and internal attribute (content)
• Search by external attributes = search in a DB
• IR: search by content

Possible approaches

1. String matching (linear search in documents)
   – Slow
   – Difficult to improve
2. Indexing (*)
   – Fast
   – Flexible for further improvement
   (a toy contrast of the two approaches follows)
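A toy contrast between the two approaches; the three-document collection is an illustrative assumption, and the indexing route is developed properly in the pipeline slides below.

# 1. String matching: scan every document at query time (slow for large collections).
docs = {1: "automatic text processing",
        2: "information retrieval in libraries",
        3: "retrieval of information by computer"}
scan_hits = [d for d, text in docs.items() if "retrieval" in text]

# 2. Indexing: precompute a term -> documents mapping once, then answer by lookup.
index = {}
for d, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(d)
index_hits = sorted(index.get("retrieval", set()))

print(scan_hits, index_hits)  # [2, 3] [2, 3]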

Indexing-based IR

[Diagram: documents go through indexing to produce a keyword representation; the query goes through query analysis (query indexing) to produce its own keyword representation; the two representations are compared during query evaluation.]

Main problems in IR

• Document and query indexing
  – How to best represent their contents?
• Query evaluation (or retrieval process)
  – To what extent does a document correspond to a query?
• System evaluation
  – How good is a system?
  – Are the retrieved documents relevant? (precision)
  – Are all the relevant documents retrieved? (recall)
  (a small worked example follows)
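As a small worked example of the two evaluation questions, suppose the system retrieves 10 documents, 6 of which are relevant, out of 20 relevant documents in the whole collection (the numbers are illustrative assumptions); the sketch just applies the two definitions.

# Precision = relevant retrieved / retrieved; Recall = relevant retrieved / all relevant.
retrieved = 10          # documents returned by the system (illustrative)
relevant_retrieved = 6  # of those, how many are actually relevant
relevant_total = 20     # relevant documents in the whole collection

precision = relevant_retrieved / retrieved    # 0.6
recall = relevant_retrieved / relevant_total  # 0.3
print(precision, recall)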

The basic indexing pipeline

[Diagram: documents to be indexed ("Friends, Romans, countrymen.") → Tokenizer → token stream (Friends Romans Countrymen) → Linguistic modules → modified tokens (friend roman countryman) → Indexer → inverted index with postings lists: friend → 2, 4; roman → 1, 2; countryman → 13, 16.]
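A minimal sketch of the same pipeline in code: tokenize, normalize (lowercasing plus a crude suffix chop standing in for the linguistic modules), and build postings lists. The two documents and the toy normalizer are illustrative assumptions, not the pipeline used in any particular system.

# Toy indexing pipeline: tokenizer -> linguistic modules -> indexer (postings lists).
import re
from collections import defaultdict

docs = {
    1: "Friends, Romans, countrymen.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.",
}

def tokenize(text):
    # Tokenizer: split on non-letters and drop empty strings.
    return [t for t in re.split(r"[^A-Za-z]+", text) if t]

def normalize(token):
    # Crude stand-in for the linguistic modules: lowercase and strip a trailing 's'.
    token = token.lower()
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

index = defaultdict(set)
for doc_id, text in docs.items():
    for token in tokenize(text):
        index[normalize(token)].add(doc_id)   # indexer: term -> postings list

for term in ("friend", "caesar"):
    print(term, sorted(index[term]))  # friend [1]   caesar [2]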


Document indexing

• Goal = find the important meanings and create an internal representation
• Factors to consider:
  – Accuracy in representing meanings (semantics)
  – Exhaustiveness (cover all the contents)
  – Facility for the computer to manipulate
• What is the best representation of contents?
  – Character string (char trigrams): not precise enough
  – Word: good coverage, not precise
  – Phrase: poor coverage, more precise
  – Concept: poor coverage, precise
  (a small sketch of these representations follows)

[Chart: moving along String → Word → Phrase → Concept, coverage (recall) decreases while accuracy (precision) increases.]
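A small sketch of what three of those representations look like for one string; the example text is an illustrative assumption.

# Toy content representations: words, character trigrams, two-word phrases.
text = "automatic text processing"

words = text.split()                                              # word representation
trigrams = [text[i:i + 3] for i in range(len(text) - 2)]          # character trigrams
phrases = [" ".join(words[i:i + 2]) for i in range(len(words) - 1)]  # 2-word phrases

print(words)          # ['automatic', 'text', 'processing']
print(trigrams[:5])   # ['aut', 'uto', 'tom', 'oma', 'mat']
print(phrases)        # ['automatic text', 'text processing']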

Parsing a document

• What format is it in? (pdf / word / excel / html?)
• What language is it in?
• What character set is in use?

Each of these is a classification problem, which we will study later in the course.
But these tasks are often done heuristically … (a crude sketch follows)
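A minimal sketch of the kind of heuristics involved, assuming we only look at the file extension and the leading bytes; real systems use trained classifiers or dedicated detection libraries.

# Crude heuristics for format and character set (illustrative assumptions only).
def guess_format(filename):
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    return {"pdf": "pdf", "doc": "word", "docx": "word",
            "xls": "excel", "xlsx": "excel",
            "htm": "html", "html": "html"}.get(ext, "unknown")

def guess_charset(raw_bytes):
    if raw_bytes.startswith(b"\xef\xbb\xbf"):
        return "utf-8 (BOM)"
    if raw_bytes.startswith((b"\xff\xfe", b"\xfe\xff")):
        return "utf-16"
    try:
        raw_bytes.decode("utf-8")
        return "utf-8 (probably)"
    except UnicodeDecodeError:
        return "unknown 8-bit encoding"

print(guess_format("report.PDF"))             # pdf
print(guess_charset("café".encode("utf-8")))  # utf-8 (probably)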


Complications: format/language

• Documents being indexed can include docs from many different languages
  – A single index may have to contain terms of several languages.
• Sometimes a document or its components can contain multiple languages/formats
  – French email with a German pdf attachment.
• What is a unit document?
  – A file?
  – An email? (Perhaps one of many in an mbox.)
  – An email with 5 attachments?
  – A group of files (PPT or LaTeX as HTML pages)


HOW TO CONSTRUCT AN INDEX OF TERMS?

Stopwords / Stoplist

• Function words do not bear useful information for IR: of, in, about, with, I, although, …
• Stoplist: contains stopwords, not to be used as index terms
  – Prepositions
  – Articles
  – Pronouns
  – Some adverbs and adjectives
  – Some frequent words (e.g. "document")
• The removal of stopwords usually improves IR effectiveness
• A few "standard" stoplists are commonly used.

Stop words

• With a stop list, you exclude the commonest words from the dictionary entirely. Intuition:
  – They have little semantic content: the, a, and, to, be
  – There are a lot of them: ~30% of postings for the top 30 words
• But the trend is away from doing this:
  – Good compression techniques mean the space for including stop words in a system is very small
  – Good query optimization techniques mean you pay little at query time for including stop words.
  – You need them for:
    • Phrase queries: "King of Denmark"
    • Various song titles, etc.: "Let it be", "To be or not to be"
    • "Relational" queries: "flights to London"
  (a small filtering sketch follows)
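A minimal sketch of stoplist filtering; the tiny stoplist is an illustrative assumption, since real systems use standard lists or, as the slide notes, skip removal entirely.

# Drop stopwords from a token stream using a small stoplist (illustrative).
stoplist = {"the", "a", "and", "to", "be", "of", "in", "about", "with"}

tokens = "to be or not to be that is the question".split()
content_tokens = [t for t in tokens if t not in stoplist]
print(content_tokens)  # ['or', 'not', 'that', 'is', 'question']

Note how "to be or not to be" nearly disappears, which is exactly the phrase-query problem listed above.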


Stemming

• Reason:
  – Different word forms may bear similar meaning (e.g. search, searching): create a "standard" representation for them
• Stemming:
  – Removing some endings of words, e.g. computer, compute, computes, computing, computed, computation → comput

Stemming

• Reduce terms to their "roots" before indexing
• "Stemming" suggests crude affix chopping
  – language dependent
  – e.g., automate(s), automatic, automation all reduced to automat.

For example, compressed and compression are both accepted as equivalent to compress:
  "for example compressed and compression are both accepted as equivalent to compress"
stems to
  "for exampl compress and compress ar both accept as equival to compress"

(a Porter stemmer sketch follows)
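The slide's examples can be reproduced with an off-the-shelf Porter stemmer; the sketch below assumes the NLTK package is installed (other Porter implementations behave similarly).

# Porter stemming of the slide's examples (assumes NLTK is installed).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["compute", "computes", "computing", "computed", "computation",
             "automate", "automatic", "automation", "compressed", "compression"]:
    print(word, "->", stemmer.stem(word))
# The compute family maps to comput, the automate family to automat,
# and compressed/compression to compress, as on the slides.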
