Introduction to Information Retrieval Aj. Khuanlux MitsophonsiriCS.426 INFORMATION RETRIEVAL.
-
Upload
sharleen-chambers -
Category
Documents
-
view
227 -
download
3
Transcript of Introduction to Information Retrieval Aj. Khuanlux MitsophonsiriCS.426 INFORMATION RETRIEVAL.
2
What is Information Retrieval?
Google ?Yahoo ?MSN search?What goes on behind search engine???How do they work??????
3
..Many Question
What makes a system like Google search trick?How does it gather information?What trick does it use?Extending beyond the Web
How can those approaches be made better?Natural language understanding?User interactions?
4
..Many Question
How can we do to make things work quickly?Faster computer?Caching?Compression?
How do we decide it works well?For all queries?For special types of queries?On every collection of information?
5
Motivation
IR: representation, storage, organization of, and access to information items
Focus is on the user information needUser information need:
Find all docs containing information on college tennis teams which: (1) are maintained by a USA university and (2) participate in the NCAA tournament.
Emphasis is on the retrieval of information (not data)
6
Motivation
Data retrieval which documents contain a set of keywords? Well defined semantics a single erroneous object implies failure!
Information retrieval information about a subject or topic semantics is frequently loose small errors are tolerated
IR system: interpret contents of information items generate a ranking which reflects relevance notion of relevance is most important
7
Motivation
IR at the center of the stage IR in the last 20 years:
classification and categorization systems and languages user interfaces and visualization
Advent of the Web changed this perception once and for all
universal repository of knowledge free (low cost) universal access no central editorial board many problems though: IR seen as key to finding the
solutions!
8
Information Retrieval
The indexing and retrieval of textual documents.
Concerned firstly with retrieving relevant documents to a query.
Concerned secondly with retrieving from large sets of documents efficiently.
9
Typical IR Task
Given:A corpus of textual natural-language
documents.A user query in the form of a textual string.
Find:A ranked set of documents that are relevant
to the query.
10
Relevance
Relevance is a subjective judgment and may include:Being on the proper subject.Being timely (recent information).Being authoritative (from a trusted source).Satisfying the goals of the user and his/her
intended use of the information (information need).
11
Relevance
Much of IR depends upon idea that Similar vocabulary -> relevant to same queries
Usually look for documents matching query words “Similar” can be measured in many ways
String matching/comparison Same vocabulary used Probability that documents arise from same model Same meaning of text
12
Keyword Search
Simplest notion of relevance is that the query string appears verbatim in the document.
Slightly less strict notion is that the words in the query appear frequently in the document, in any order (bag of words).
13
Problems with Keywords
May not retrieve relevant documents that include synonymous terms. “restaurant” vs. “café”
May retrieve irrelevant documents that include ambiguous terms. “bat” (baseball vs. mammal) “Apple” (company vs. fruit) “bit” (unit of data vs. act of eating)
14
Intelligent IR
Taking into account the meaning of the words used.
Taking into account the order of words in the query.
Adapting to the user based on direct or indirect feedback.
Taking into account the authority of the source.
15
IR Basic Concepts
The User TaskRetrieval
information or datapurposeful
Browsingglancing aroundF1; cars, Le Mans, France, tourism
Retrieval
Browsing
Database
16
IR Basic Concepts
Logical view of documents
Document representation viewed as a continuum: logical view of documents might shift
Docs
structure
Accentsspacing stopwords
Noungroups stemming
Manual indexing
structure Full text Index terms
17
IR System Architecture
UserInterface
Text Operations
Query Operations Indexing
Searching
Ranking
Index
Text
query
user need
user feedback
ranked docs
retrieved docs
logical viewlogical view
inverted file
DB Manager Module
Text Database
Text
18
IR System Components
Text: Operations forms index words (tokens). Stopword removal Stemming
Indexing: constructs an inverted index of word to document pointers.
Searching: retrieves documents that contain a given query token from the inverted index.
Ranking: scores all retrieved documents according to a relevance metric.
19
IR System Components
User Interface: manages interaction with the user: Query input and document output. Relevance feedback. Visualization of results.
Query Operations: transform the query to improve retrieval: Query expansion using a thesaurus.
Query transformation using relevance feedback.
20
Web Search
Application of IR to HTML documents on the World Wide Web.
Differences: Must assemble document corpus by spidering the
web. Can exploit the structural layout information in
HTML (XML). Documents change uncontrollably. Can exploit the link structure of the web.
21
Web Search System
Query String
IRSystem
RankedDocuments
1. Page12. Page23. Page3 . .
Documentcorpus
Web Spider
22
History of IR
1960-70’s: Initial exploration of text retrieval systems
for “small” corpora of scientific abstracts, and law and business documents.
Development of the basic Boolean and vector-space models of retrieval.
Prof. Salton and his students at Cornell University are the leading researchers in the area.
23
History of IR
1980’s:Large document database systems, many
run by companies:Lexis-NexisDialogMEDLINE
24
History of IR
1990’s:Searching FTP able documents on the
InternetArchieWAIS
Searching the World Wide WebYahooAltavista