Information Retrieval Lebanese University Faculty of Economics and Business Administration – 1 st...

Information Retrieval. Lebanese University, Faculty of Economics and Business Administration – 1st Branch. Class: M1. Instructor: Dr. Lina A. Nimri


Page 1:

Information Retrieval

Lebanese University
Faculty of Economics and Business Administration – 1st Branch

Class: M1
Instructor: Dr. Lina A. Nimri

Page 2:

Course Text Book

Modern Information Retrieval,
R. Baeza-Yates and B. Ribeiro-Neto,
Addison-Wesley and ACM Press, 1999,
ISBN: 0-201-39829-X

Page 3:

Introduction

Modern Information Retrieval, Chapter 1
Ricardo Baeza-Yates, Berthier Ribeiro-Neto

Page 4:

Introduction

Examples of information needs in the context of the World Wide Web:

“Find all documents containing information on computer courses which:
(1) are offered by universities in South England, and
(2) are accredited by the BCS/IEE bodies.
To be relevant, the document must include information on admission requirements, and an e-mail address and phone number for contact purposes.”

“Find all docs containing information on college tennis teams which:
(1) are maintained by a USA university and
(2) participate in the NCAA tournament.”

Page 5:

Information Retrieval

[Figure labels: User Information Need; Query; Retrieval System; Search Engine; Documents; Set of retrieved documents; Useful or relevant information to the user]

Primary goal of an IR system: “Retrieve all the documents which are relevant to a user query, while retrieving as few non-relevant documents as possible.”

Representation, storage, organisation, and access to information items

(Usually) keyword-based representation

Page 6:

Data Retrieval

Determining which documents contain the keywords in the user query is not always enough to satisfy the user's information need.

Data retrieval retrieves objects which satisfy clearly defined conditions, such as regular expressions or relational algebra expressions.

A data retrieval system deals with data that has a well-defined structure and semantics.
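
The idea of retrieving objects that satisfy a clearly defined condition can be sketched in Python. This is only an illustration: the records and the regular expression are invented for the example and do not come from the course material. Each record either satisfies the condition or it does not, and there is no notion of ranking.

```python
import re

# Hypothetical records, invented for this sketch.
records = [
    "course: databases; university: Kent",
    "course: compilers; university: Leeds",
    "tennis team news from Kent",
]

def data_retrieval(pattern, items):
    """Return exactly the items that satisfy a clearly defined condition (a regex)."""
    compiled = re.compile(pattern)
    return [item for item in items if compiled.search(item)]

# Only well-structured records that match the whole pattern are returned.
matches = data_retrieval(r"^course: \w+; university: Kent$", records)
print(matches)
```

The third record mentions Kent but is not returned, because it does not have the structure the condition requires.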

Page 7:

Information Retrieval System

Retrieving information about a subject

Deals with natural language text which is not well structured and could be semantically ambiguous

It must interpret the contents of documents and rank them according to the degree of relevance to the user need.

Page 8:

Area of interest

Digital Libraries
Information experts
World Wide Web - Very difficult task
– The hyperspace is vast
– The absence of a well-defined data model (format or representation form)

Page 9:

Effective retrieval

The effective retrieval of relevant information is directly affected by:
– The user task
– The logical view of the document (document's representation) adopted by the retrieval system.

Page 10:

User tasks

Pull technology
User requests information in an interactive manner
3 retrieval tasks
– Browsing (hypertext)
– Retrieval (classical IR systems)
– Browsing and retrieval (modern digital libraries and web systems)

Push technology
– automatic and permanent pushing of information to user
– software agents
– example: news service
– filtering (retrieval task) relevant information for later inspection by user
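
The filtering task under push technology might be sketched as follows. This is a rough illustration under assumptions: the standing user profile, the news items and the rule "keep an item if it shares at least one term with the profile" are all invented here; the slides do not specify how filtering is done.

```python
def filter_stream(profile, incoming):
    """Keep items that share at least one term with the standing user profile."""
    kept = []
    for item in incoming:
        terms = set(item.lower().split())
        if terms & profile:
            kept.append(item)
    return kept

# Hypothetical profile and news items, invented for this sketch.
profile = {"tennis", "ncaa"}
news = [
    "NCAA tournament draw announced",
    "New database course opens",
    "College tennis results",
]

# Items kept for later inspection by the user.
inbox = filter_stream(profile, news)
print(inbox)
```

Unlike pulling, the profile stays fixed while new items keep arriving, so the same function would be applied to every new batch.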

Page 11:

Pulling

The user can browse the documents when the main objectives are not clear at the beginning and the purpose might change during the interaction with the system.

Combination of retrieval and browsing is not yet a well-established approach.

[Figure labels: Retrieval; Browsing; Database]

Page 12:

Documents

Unit of retrieval
A passage of free text
– composed of text, strings of characters from an alphabet
– composed of natural language: newspaper article, a journal paper, a dictionary definition, email messages
– size of documents arbitrary: newspaper article vs. journal paper vs. email

Page 13:

What is a document?

Page 14:

Representation of documents

Documents are represented through a set of index terms or keywords or term descriptors
– extracted directly from text
– specified by human subjects (information science)

metadata
Most concise representation
Poor quality of retrieval

Full text representation
– Most complete representation
– High computational cost

Large collections
– Reduce set of representative keywords
Elimination of stop words
Stemming
Identification of noun phrases
Further compression

Document term descriptors to access texts

Generation of descriptors for text
• By hand
• By analysing the text
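
Generating descriptors by analysing the text (stop-word elimination followed by stemming) could look roughly like this in Python. It is a minimal sketch under assumptions: the stop-word list is a tiny invented sample, and `naive_stem` is a crude suffix stripper standing in for a real stemming algorithm.

```python
import re

# Tiny illustrative stop-word list; real systems use much longer lists.
STOP_WORDS = {"a", "an", "and", "are", "by", "in", "is", "of", "the", "to"}

def naive_stem(word):
    """Strip a few common English suffixes (a crude stand-in for a real stemmer)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    """Lowercase, tokenise, drop stop words, and stem what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

terms = index_terms("The indexing of documents is represented by keywords")
print(terms)
```

The result is a reduced set of representative keywords; the crude stemmer leaves words such as "represented" unchanged, which a real stemmer would typically shorten further.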

Page 15:

Logical View of the documents

[Figure labels: Docs; structure; accents, spacing; stopwords; noun groups; stemming; manual indexing; structure; full text; index terms]

Page 16:

The retrieval functions

[Figure labels: Information need; Formulation; Query; Documents; Indexing; Document representation; Retrieval functions; Retrieved documents; Relevance feedback]

Page 17:

Queries

Information need: user term descriptors characterising the user need

Simple queries
– composed of two or three, perhaps even dozens, of keywords
– e.g., as in web retrieval

Boolean queries
– “neural networks AND speech recognition”

Context queries
– Proximity search, phrase queries
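
Boolean and context queries can be illustrated with a small positional index. This is a sketch under assumptions: the three documents are invented, `boolean_and` treats the query as a plain conjunction of single keywords (a simplification of the two-phrase example above), and `within` is a hypothetical helper that handles both phrase queries (gap of 1) and proximity search (larger gap).

```python
# Hypothetical documents, invented for this sketch.
docs = {
    1: "neural networks for speech recognition",
    2: "speech synthesis with neural models",
    3: "recognition of handwriting with networks",
}

def positions(text):
    """Map each term to the list of positions at which it occurs."""
    index = {}
    for pos, term in enumerate(text.lower().split()):
        index.setdefault(term, []).append(pos)
    return index

positional = {doc_id: positions(text) for doc_id, text in docs.items()}

def boolean_and(*terms):
    """Ids of documents containing every one of the given terms."""
    return sorted(d for d, idx in positional.items() if all(t in idx for t in terms))

def within(term_a, term_b, max_gap):
    """Ids of documents where term_b occurs at most max_gap positions after term_a."""
    hits = []
    for d, idx in positional.items():
        if any(0 < pb - pa <= max_gap
               for pa in idx.get(term_a, [])
               for pb in idx.get(term_b, [])):
            hits.append(d)
    return sorted(hits)

and_hits = boolean_and("neural", "networks", "speech", "recognition")
phrase_hits = within("neural", "networks", 1)  # phrase query: adjacent terms
near_hits = within("speech", "neural", 3)      # proximity search: within 3 positions
print(and_hits, phrase_hits, near_hits)
```

Only document 1 satisfies the conjunction and the phrase query, while the proximity search finds document 2, where "neural" follows "speech" three positions later.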

Page 18:

Best-Match retrieval

Compare the terms in a document and query
– Document term descriptors to access texts
– User term descriptors characterising the user need

Compute the similarity between each document in the collection and the query, based on the terms that they have in common

Sort the documents in order of decreasing similarity with the query

The output is a ranked list displayed to the user; the top documents are the most relevant as judged by the system
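
The best-match steps above can be sketched as follows. The similarity measure used here, the number of distinct terms shared by query and document, is an assumption chosen for simplicity; the slides do not fix a particular measure, and the collection is invented for the example.

```python
def similarity(query_terms, doc_terms):
    """Number of distinct terms the query and the document have in common."""
    return len(set(query_terms) & set(doc_terms))

def best_match(query, collection):
    """Rank document ids by decreasing similarity; ties are broken by id."""
    q = query.lower().split()
    scored = [(similarity(q, text.lower().split()), doc_id)
              for doc_id, text in collection.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    # Documents sharing no term with the query are left out of the ranked list.
    return [(doc_id, score) for score, doc_id in scored if score > 0]

# Hypothetical collection, invented for this sketch.
collection = {
    "d1": "college tennis teams in the ncaa tournament",
    "d2": "computer courses offered by universities",
    "d3": "tennis tournament results",
}

ranking = best_match("ncaa tennis tournament", collection)
print(ranking)
```

Here d1 shares three terms with the query and d3 shares two, so d1 is ranked first; d2 shares none and is not retrieved.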

Page 19:

Conceptual view of text retrieval system

[Figure labels: Queries; Documents; Similarity Computation; Retrieved Documents]

Page 20:

Expanded view of text retrieval system

[Figure labels: Queries; Documents; Indexing; Indexed Documents; Similarity Computation; Retrieved Documents; Ranked Documents]

Page 21:

Process of retrieving info

[Figure labels: User Interface; Text Operations; Query Operations; Indexing; Similarity Computation (Searching); Ranking; Document Repository Manager; Index; Text repository. Flow labels: User need; Text; Logical view; Query; Inverted file; Retrieved docs; Ranked docs; User feedback]
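
The inverted file produced by the indexing step might be built along these lines. This is a minimal sketch, assuming whitespace tokenisation and an invented three-document collection; a real index would apply the text operations first and usually store more than document ids.

```python
from collections import defaultdict

def build_inverted_file(collection):
    """Map each term to the sorted list of ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in collection.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

# Hypothetical collection, invented for this sketch.
collection = {
    1: "information retrieval systems",
    2: "data retrieval",
    3: "information extraction",
}

inverted = build_inverted_file(collection)
print(inverted["retrieval"])
print(inverted["information"])
```

Searching then only needs to look up the query terms in the index instead of scanning every text in the repository.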

Page 22:

Key Topics

Indexing text documents
Retrieving text documents
Evaluation
Query reformulations

Search Engines = IR + Link Structure + Name Interpretation

Page 23:

Information Retrieval vs Information Extraction

Information Retrieval
– Given a set of query terms and a set of document terms, select only the most relevant documents [precision], and preferably all the relevant [recall].

Information Extraction
– Extract from the text what the document means.

IR systems can FIND documents but need not “understand” them
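
Precision and recall can be computed from the set of retrieved documents and the set of relevant documents. The sketch below uses the standard definitions (precision is the fraction of retrieved documents that are relevant, recall is the fraction of relevant documents that are retrieved); the document ids are invented for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical ids: four documents retrieved, three actually relevant.
p, r = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d5"],
)
print(p, r)
```

Two of the four retrieved documents are relevant, so precision is 0.5; two of the three relevant documents were found, so recall is about 0.67.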