Search Information retrieval. Information and scientific progress Let’s go back into time with...

Search Information retrieval



Search

Information retrieval


Information and scientific progress

Let’s go back in time with this video:

The American Engineer

Information retrieval—and information science—emerge in a post-WW2 atmosphere where scientific and technical advance is key to social progress. Information serves this goal.


The basic IR task model

The goal of information retrieval is often described something like this:

Given a user information need (typically expressed as a search phrase or query), provide relevant documents from the collection.

Note that key term: relevance. We’ll come back to it.


A lovely diagram of the IR process

Diagram from Jaime Arguello, UNC-Chapel Hill


Background: indexing

Indexing is the process of representing the subject of documents. An index can comprise terms extracted from the document itself or terms from another source.

An index language specifies the concepts used for indexing. An index language includes terms to represent concepts and a syntax for assigning (and perhaps combining, etc) terms.

A thesaurus is a type of index language in which terms are controlled and relationships between terms established.


Background: indexing

Precoordinate terms combine concepts at the point of indexing: foraging for wild mushrooms is a precoordinate term.

Postcoordinate terms are combined only at the time of search. Foraging and wild mushrooms are postcoordinate terms. (The searcher would input “foraging for wild mushrooms.”)


Background: indexing

A classification (in this context) is a subject scheme that expresses complex concepts by means of a notation; a document is placed in the single class that best expresses its subject. Most classifications are enumerative; individual classes are defined in advance. Some classifications have synthetic elements (ways to customize classes for particular geographic locations, for example).

A faceted classification is a synthetic scheme in which various elements of the subject are combined at the time of indexing (via a syntax) to create classes.


Background: indexing

Exhaustivity is an indexing principle that describes how much of a document’s content is indexed. (The “main” subject? The first three “important” subjects? Everything mentioned in the document?)

Specificity is an indexing principle that describes the level of abstraction at which a document’s content is indexed. (Is the subject information retrieval or information retrieval evaluation methods or precision and recall measures?)


Background: indexing

The Cranfield experiments test the relative effectiveness of three types of indexing languages:

• Single words extracted from the text itself.
• Concepts (phrases) extracted from the text itself.
• Terms drawn from a controlled vocabulary.

Multiple variations of these three were also created (single words plus synonyms, controlled terms plus broader and narrower concepts).


Cranfield experiments

Although the Cranfield tests did not involve computers (!), they illustrate the basis from which information retrieval evaluations are still conducted, both in the conceptualization of the retrieval task and in the methods employed to measure it.

Also, the results were SHOCKING.


Again: social context of IR beginnings

Cleverdon: librarian at a college of aeronautics. Interested in new indexing technologies. Wants to systematically test them.

The unspoken retrieval context: Facilitate the efficient progress of technical specialists working on specific problems.


Cranfield protocol

1. Assemble a test collection of 1400 research papers, primarily in aerodynamics.

2. Index each document with test languages.

3. Develop test queries.

4. Determine relevance of each document to each query, using a 1-4 scale.

5. Search collection using a specified query, index language, relevance level, and search rule (number of matching terms, for example).

6. Determine which of the retrieved documents are relevant to the query.


Cranfield protocol

Where do the test queries come from?

Cleverdon and colleagues asked the authors of scientific papers to state in the form of a question “the problem to which their paper was addressed.”

(The example given is “small deflection theory of simple supported cylinders.”)

In using this to approximate an information seeker, Cranfield focuses on specialists who have an extremely good idea of what they are looking for.


Cranfield protocol

Where do the relevance judgments come from?

Students do the first pass, and the original document authors confirm.

Again, this may be an excellent approximation of “the real thing,” but it is a certain sort of approximation.


Cranfield protocol

Search results were then described according to the precision and recall measures.

Precision: Percentage of documents retrieved that are relevant.

Recall: Percentage of relevant documents retrieved out of all possible relevant documents.


Precision and recall

Say I have a collection of 10 documents. For a query X, documents 2, 7, and 8 are relevant.

My search for query X retrieves documents 7 and 8 but not 2; it also retrieves documents 4 and 5.

Precision for this search is 2 relevant documents out of 4 retrieved, or 50 percent.

Recall for this search is 2 relevant documents out of 3 possible relevant, or 67 percent.
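The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name `precision_recall` is made up for illustration, and document numbers stand in for the documents themselves.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one search, as fractions between 0 and 1."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# The example from the slide: documents 2, 7, and 8 are relevant to query X;
# the search retrieves documents 4, 5, 7, and 8.
relevant = {2, 7, 8}
retrieved = {4, 5, 7, 8}

precision, recall = precision_recall(retrieved, relevant)
print(f"precision = {precision:.0%}, recall = {recall:.0%}")
# prints: precision = 50%, recall = 67%
```

Note that recall can only be computed because the full set of relevant documents is known in advance, which is exactly what the Cranfield relevance judgments provide.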


Cranfield results

They do some math to normalize the results to a single number. And, the best performing index language is...

Single terms extracted from the document itself with some word forms (e.g., alloy + alloys) added.

OMG!


Cranfield legacy

The Cranfield methodology is still used today in evaluating IR systems. TREC (the Text REtrieval Conference), sponsored by NIST, in which researchers use a shared collection and set of queries to compare their systems’ effectiveness, uses the same protocol, including human assessors.

The Cranfield measures and means of conceptualizing retrieval also persist.


2009 TREC Web track query

Query: appraisals

Description: How are home values appraised? I want to know how home appraisals are done.


Cranfield legacy

TREC assessors

Photo from NIST


Wait, what were those people assessing?

Relevance. Er, what is that, exactly?

Exactly.

Saracevic points out that IR—and, to some extent, information science as a whole—rests upon an uncertain concept. Is this problematic?


Relevance in information science

Saracevic noted a pattern in operational definitions of relevance as used by information scientists testing retrieval systems:

Relevance is the A of a B existing between a C and a D as determined by an E.


Relevance

Saracevic describes seven different views of relevance, which emphasize different elements of the concept:

1. Subject knowledge view. What is the relationship between a question and a subject?

2. Subject literature view. What is the relationship between a question and the academic literature on a subject?


Relevance

Saracevic describes seven different views of relevance, which emphasize different elements of the concept:

3. Logical view. What is the nature of the inference between a question and conclusions from a subject literature?

4. System view. What is the relationship between a file’s contents and a question, a user, and a subject?


Relevance

Saracevic describes seven different views of relevance, which emphasize different elements of the concept:

5. Destination’s view. What is the human judgment of the relation between a document and a question?

6. Pertinence view. What is the relationship between a seeker’s knowledge and a subject?


Relevance

Saracevic describes seven different views of relevance, which emphasize different elements of the concept:

7. Pragmatic view. What is the relation between a user’s problem and the information provided?


Back to representation and comparison

Diagram from Jaime Arguello, UNC-Chapel Hill


Document representations

In the Cranfield tests, documents were represented with a small set of index terms.

If terms from the document itself are the best index terms anyway (maybe!), then if we can use ALL the text of the document, why not? This is full-text indexing.

To speed processing, the text of documents is extracted, split into individual terms (tokenized), and rearranged to form an inverted file.


Inverted index file

All the terms from all the documents are put in a table (T= term, and D= document). For a Boolean search (in a moment!) these tables just indicate presence and absence (1 or 0). Other models weight terms differently to rank retrieved documents.

Table from Joseph Tennis, UW
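A minimal sketch of how such a table could be built, assuming a toy three-document collection and whitespace tokenization. The document texts and the helper names (`tokenize`, `build_inverted_index`, `incidence_row`) are invented for illustration; real systems tokenize far more carefully.

```python
from collections import defaultdict

# Hypothetical toy collection: document ID -> text.
docs = {
    "D1": "wild mushrooms grow in damp forests",
    "D2": "foraging for wild mushrooms",
    "D3": "forests need rain",
}


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


index = build_inverted_index(docs)


def incidence_row(term):
    """One row of the term/document table: 1 if the term is present, else 0."""
    return [1 if doc_id in index.get(term, set()) else 0 for doc_id in sorted(docs)]


print(sorted(index["mushrooms"]))  # ['D1', 'D2']
print(incidence_row("forests"))    # [1, 0, 1]
```

The file is "inverted" because lookup goes from term to documents, rather than from document to terms.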


Other text processing

Retrieval systems may remove common words (stopwords) from the inverted file, although this is not as common as it once was.

Retrieval systems may also stem words to store just their roots.
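Both steps can be sketched roughly as follows. The stopword list and the suffix rules here are deliberately tiny and made up for illustration; production systems use much longer lists and a proper stemming algorithm.

```python
# Assumed, illustrative stopword list and suffix rules (not from any real system).
STOPWORDS = {"the", "a", "an", "of", "for", "in", "and", "to"}
SUFFIXES = ("ing", "es", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, drop stopwords, and stem what remains."""
    tokens = text.lower().split()
    return [crude_stem(token) for token in tokens if token not in STOPWORDS]


print(preprocess("Foraging for the wild mushrooms"))
# ['forag', 'wild', 'mushroom']
print(crude_stem("alloys") == crude_stem("alloy"))
# True
```

After stemming, alloy and alloys share one entry in the inverted file, which is the same effect as the added word forms in the winning Cranfield index language.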


Retrieval models

Most retrieval systems are variations of the following models:

• Boolean.
• Vector space.
• Probabilistic.
• Latent semantic indexing.


Boolean model

This is the oldest and simplest model; it puts most of the work on the searcher. Instead of searching for mere presence of index terms in a document (like the Cranfield tests), use Boolean operators (AND, OR, NOT) to describe the query more precisely.

( (“traumatic brain injury” OR TBI OR tbi OR “traumatic brain injuries”) AND (headache OR headaches) ) NOT concussion
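Under the hood, a query like this one reduces to set operations on the inverted file: OR is union, AND is intersection, and NOT is set difference. A rough sketch, assuming a made-up postings table in which each phrase is stored as a single index term and terms are lowercased (so TBI and tbi collapse into one entry):

```python
# Hypothetical postings: index term -> set of document numbers containing it.
postings = {
    "traumatic brain injury": {1, 2},
    "tbi": {2, 3},
    "traumatic brain injuries": {4},
    "headache": {1, 3, 5},
    "headaches": {2, 4},
    "concussion": {3, 4},
}


def docs_with(term):
    """Return the documents indexed under a term (empty set if none)."""
    return postings.get(term.lower(), set())


# ("traumatic brain injury" OR TBI OR "traumatic brain injuries")
injury = (
    docs_with("traumatic brain injury")
    | docs_with("TBI")
    | docs_with("traumatic brain injuries")
)

# (headache OR headaches)
pain = docs_with("headache") | docs_with("headaches")

# (injury AND pain) NOT concussion
results = (injury & pain) - docs_with("concussion")

print(sorted(results))  # [1, 2]
```

The result is an unranked set: a document either matches the expression or it does not.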


Ranked results

Other retrieval models focus on using various statistical properties of texts (primarily) to produce ranked lists of results.

A key element in calculating these rankings is the frequency of significant terms. This measure relies on a property of language known as Zipf’s law.


Zipf’s law

Zipf’s law describes a distribution in which the ith most frequent object appears 1/i^θ times as often as the most frequent object (θ is a constant, close to 1 for natural-language text).

Er, what?

Zipf’s law applies to language, for any corpus, including a single document. So the most frequent word (say) takes up N percent of the document; the next most frequent word takes up N/2 percent of the document; the next most frequent word takes up N/3 percent, etc.

Zipf distributions appear for other phenomena as well.
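To make the N, N/2, N/3 pattern concrete, here is a small sketch that computes the share of a text each rank would take under an idealized Zipf distribution. The function name `zipf_shares` and the choice of five ranks are illustrative assumptions; real word counts only approximate this curve.

```python
def zipf_shares(n_ranks, theta=1.0):
    """Share of all occurrences taken by each rank under an ideal Zipf law."""
    weights = [1 / (rank ** theta) for rank in range(1, n_ranks + 1)]
    total = sum(weights)
    return [weight / total for weight in weights]


shares = zipf_shares(5)

for rank, share in enumerate(shares, start=1):
    print(f"rank {rank}: {share:.1%} of occurrences")

# With theta = 1, the top word is twice as frequent as the second
# and three times as frequent as the third.
print(round(shares[0] / shares[1], 6), round(shares[0] / shares[2], 6))
```

A handful of top ranks account for most occurrences, while the long tail of rare words accounts for most of the vocabulary.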


Zipf’s law

A Zipf distribution at 300 data points.

Graph from Jakob Nielsen, useit.com


Implications of Zipf’s law

Zipf’s law holds true across languages, across types of text (written and spoken, etc.), across complexity of topics, document genres, etc.

This statistical property implies that the most important words (for retrieval purposes) in a document are those that are most frequent in the document and least frequent in the collection.


Tf/idf

This relation is exploited by many retrieval models:

term frequency × inverse document frequency (tf-idf)

You can refine your calculation of tf-idf (e.g., by taking account of document length), but this is the basic idea.
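One common way to compute the basic version, sketched on an invented three-document collection. The specific choices here (raw counts for term frequency, log(N/df) for inverse document frequency) are assumptions for illustration; many other weightings are in use.

```python
import math
from collections import Counter

# Hypothetical toy collection, already tokenized.
tokenized = {
    "D1": ["mushrooms", "mushrooms", "forest"],
    "D2": ["forest", "rain"],
    "D3": ["forest", "trail"],
}


def tf_idf(term, doc_id):
    """Raw term count in the document times log(N / document frequency)."""
    tf = Counter(tokenized[doc_id])[term]
    df = sum(1 for tokens in tokenized.values() if term in tokens)
    if df == 0:
        return 0.0  # the term appears nowhere in the collection
    idf = math.log(len(tokenized) / df)
    return tf * idf


# "forest" appears in every document, so it carries no weight.
print(tf_idf("forest", "D1"))  # 0.0

# "mushrooms" is frequent in D1 and absent elsewhere, so it scores highest.
print(tf_idf("mushrooms", "D1") > tf_idf("rain", "D2") > 0)  # True
```

This matches the intuition from the previous slide: frequent in the document, rare in the collection.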


Vector space model

The vector space model (originated by Salton) compares a query and a document as the correlation between two n-dimensional vectors. The cosine of the angle between the vectors is used to quantify the correlation.

Index terms in queries and documents are weighted based on tf-idf (and other properties).
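A small sketch of the cosine comparison, with each vector stored as a dictionary from term to weight. The weights below are invented numbers standing in for tf-idf values, and the names `cosine`, `query`, and `documents` are illustrative only.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(u.get(term, 0.0) * v.get(term, 0.0) for term in set(u) | set(v))
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_u * norm_v)


# Made-up weights; in practice these would come from tf-idf or similar.
query = {"wild": 1.0, "mushrooms": 1.0}
documents = {
    "doc_a": {"wild": 2.0, "mushrooms": 2.0},
    "doc_b": {"forest": 1.0, "rain": 1.0},
    "doc_c": {"wild": 1.0, "forest": 1.0},
}

# Rank documents from most to least similar to the query.
ranking = sorted(documents, key=lambda name: cosine(query, documents[name]), reverse=True)
print(ranking)  # ['doc_a', 'doc_c', 'doc_b']
```

Because the cosine depends only on the angle, doc_a scores as a perfect match even though its weights are twice the query's: vector length (roughly, document length) does not affect the score.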


Probabilistic model

The probabilistic model (introduced by Robertson and Sparck Jones) recursively refines an answer set based on guessing at the “ideal” answer.

Initial probabilistic models did not use weights, but later versions do.


Latent semantic indexing

Latent semantic indexing is a retrieval model that attempts to align documents with concepts, on the observation that terms that co-occur probably indicate something about a shared conceptual space. Documents and queries are mapped to a “concept space” and then compared.

LSI aims to return documents based on a conceptual match to a query, as opposed to a term match.


Lots of other stuff...

Individual systems may incorporate many different sorts of information to adjust the rankings produced by one of these basic models: using text structure to refine weights (titles, etc), using your location or previous Web history to adjust rankings, promoting recent content, etc.


What’s Google?

The original Google innovation was a ranking enhancement outside of the primary retrieval model.

They looked at the links to a page as a measure of information quality. This “PageRank” was used to adjust initial results of a retrieval system.
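The link-counting idea can be sketched as a simple iterative calculation. This is a simplified illustration and not Google's actual implementation: the four-page link graph is invented, and the damping factor of 0.85 and the fixed 50 iterations are assumed values.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively score pages by the scores of the pages that link to them."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1 / n for page in pages}  # start with equal scores
    for _ in range(iterations):
        # Every page gets a small baseline score regardless of links.
        new_rank = {page: (1 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its score evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: page -> pages it links to.
links = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}

rank = pagerank(links)
print(max(rank, key=rank.get))  # C
```

Page C comes out on top because three pages link to it, and A benefits in turn from being the one page C links to. In a full system a score like this would be combined with the text-based ranking rather than replace it.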

Google is not magic.