Special Topics on Information Retrieval

Special Topics on Information Retrieval Manuel Montes y Gómez http://ccc.inaoep.mx/~mmontesg/ [email protected]


Page 1: Special Topics on Information Retrieval

Special Topics on Information Retrieval

Manuel Montes y Gómez
http://ccc.inaoep.mx/~mmontesg/

[email protected]

Page 2: Special Topics on Information Retrieval

Spoken Document Retrieval

Page 3: Special Topics on Information Retrieval

Content of the section

• Definition of the task
• Traditional architecture for SDR
• Automatic Speech Recognition
  – Basic ideas and common errors
• Approaches for SDR
  – Manipulating the ASR system
  – Recovering from recognition errors

Page 4: Special Topics on Information Retrieval

Motivation

• Speech is the primary and most convenient means of communication between humans
• Multimedia documents are becoming increasingly popular, and finding relevant information in them is a challenging task

“The problem of audio/speech retrieval is familiar to anyone who has returned from vacation to find an answering machine full of messages. If you are not fortunate, you may have to listen to the entire tape to find the urgent one from your boss.”

Page 5: Special Topics on Information Retrieval

Spoken document retrieval (SDR)

• SDR refers to the task of finding segments of recorded speech that are relevant to a user’s information need.
  – A large amount of the information produced today is in spoken form: TV and radio broadcasts, recordings of meetings, lectures and telephone conversations.
  – There are currently no universal tools for speech retrieval.

Ideas for carrying out this task?

Page 6: Special Topics on Information Retrieval

Ideal architecture

• An ideal system would simply concatenate an automatic speech recognition (ASR) system with a standard text indexing and retrieval system.

Problems with this architecture?

[Diagram: Audio Collection → Speech Recognizer → recognized text → Text-based IR system (with Query) → Results]

Page 7: Special Topics on Information Retrieval

Missing slide…about ASR

• General architecture
• Dimensions of the problem

Page 8: Special Topics on Information Retrieval

Basic components of an ASR system

• Feature extraction
  – Transforms the input waveform into a sequence of acoustic feature vectors, each vector representing the information in a small time window of the signal
• Decoder
  – Combines information from the acoustic and language models and finds the word sequence with the highest probability given the observed speech features.
  – Acoustic models: describe how phonemes in speech are realized as feature vectors
  – Lexicon: a list of words with a pronunciation for each word, expressed as a phone sequence
  – Language model: estimates the probabilities of word sequences

More about this topic in: Spring 2011; CS 462/662/762 Natural Language Processing; Dr. Thamar Solorio.

Page 9: Special Topics on Information Retrieval

Common errors in the recognition stage

• Recognition errors are deletions, insertions and substitutions of legitimate words.
  – Errors at the word level
    • “governors” → “governess”
    • “kilos” → “killers”
    • “Sinn Fein” → “shame fame”
  – Errors at word boundaries
    • “Frattini” → “Freeh teeny”
    • “leap shark” → “Liebschard”

Page 10: Special Topics on Information Retrieval

Transcription errors and IR

• The main problem with the traditional SDR architecture is the accuracy of the recognition output
  – Around 50% word accuracy on real-world tasks
• Example:
  – What will happen with a query about “Saddam Hussein”?
  – How to handle the errors or incomplete output provided by ASR systems?

Audio input: “the efforts by certain states to circumvent UN sanctions against the Saddam Hussein regime in Iraq”

Text transcription: “the efforts by certain States to circumvent U. N. sanctions against the sit down the same regime in Iraq”

Page 11: Special Topics on Information Retrieval

SDR in non-spontaneous speech

Conclusions from the TREC 2000 edition, using a corpus of 550 hours of Broadcast News:

• Spoken news retrieval systems achieved almost the same performance as traditional IR systems.
  – Even with error rates of around 40%, the effectiveness of an IR system falls less than 10%.
  – Long queries are better than short queries.
  – Deletions and substitutions are more important than insertions, especially for long queries.

Page 12: Special Topics on Information Retrieval

New challenges

• Spoken questions of short duration
• Message-length documents
  – For example, voice-mail messages, announcements, and so on.
• Other types of spoken documents such as dialogues, meetings, lectures, classes, etc.
  – Contexts where the word error rate is well over 50%.
• Applications such as:
  – Question answering
  – Summarizing speech

Page 13: Special Topics on Information Retrieval

Main ideas for SDR

• Manipulating the ASR system (white box)
  – Retrieval at the phonetic level
  – Adding alternative recognition results
    • N most likely paths in the lattice
    • Using the complete word lattice
• Recovering from recognition errors (black box)
  – Query and/or document expansion
  – Using multiple recognizers
  – Applying phonetic codification

Page 14: Special Topics on Information Retrieval

Alternative recognition results

• Speech recognizers aim to produce a transcription with as few errors as possible.
  – It is possible that the correct word appears among the candidates the recognizer considers; it just gets mistakenly pruned away.
• Retrieval performance can be improved by adding several candidates to the transcription.

Page 15: Special Topics on Information Retrieval

Query/document expansion (1)

• The effect of out-of-vocabulary query words and other recognition errors can be reduced by adding to the query extra terms that have a similar meaning or that are otherwise likely to appear in the same documents as the query terms.
  – From the top-ranked documents for the given query
  – Using associations extracted from the whole collection
• It is common to use a parallel written document set.
  – This collection must be thematically related.

Page 16: Special Topics on Information Retrieval

Query/document expansion (2)

• Another approach to the OOV problem is to expand query words into in-vocabulary phrases according to intrinsic acoustic confusability and language model scores.
  – For example, “taliban” may be expanded to “tell a band”.
• The aim is to mimic the mistakes the speech recognizer makes when transcribing the audio.
• It is dependent on the ASR system.

Page 17: Special Topics on Information Retrieval

Using multiple recognizers

• Different independently developed recognizers tend to make different kinds of errors, and combining their outputs might allow some errors to be recovered.
• The combination of scores can be done with any traditional information fusion method.
  – Good results have been obtained with simple linear combinations.
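As an illustrative sketch of such a linear combination (the function name and the dict-based score representation are assumptions, not from the slides), late fusion of per-recognizer retrieval scores could look like:

```python
def fuse_scores(score_lists, weights=None):
    """Late fusion of retrieval scores from several recognizers.

    score_lists: one {doc_id: score} dict per recognizer.
    weights: optional per-recognizer weights (defaults to uniform).
    """
    n = len(score_lists)
    weights = weights or [1.0 / n] * n
    docs = set().union(*score_lists)
    # A document missing from one recognizer's output simply contributes 0.
    return {d: sum(w * s.get(d, 0.0) for w, s in zip(weights, score_lists))
            for d in docs}
```

In practice the weights would be tuned on held-out queries, e.g. giving more weight to the recognizer with the lower word error rate.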


Page 18: Special Topics on Information Retrieval

Using a phonetic codification

• Phonetic codifications make it possible to characterize words with similar pronunciations through the same code.
• Example of a Soundex codification:
  – “Unix Sun Workstation” → (U52000 S30000 W62300)
  – “Unique some workstation” → (U52000 S30000 W62300)
• The idea of this approach is to build an enriched representation of transcriptions by combining words and phonetic codes.
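A minimal sketch of the standard Soundex algorithm (first letter plus digit codes, zero-padded); the slide’s six-character codes appear to be zero-padded variants of the usual four-character code, so the output length is a parameter here:

```python
def soundex(word, length=4):
    """Standard Soundex: keep the first letter, map the remaining
    consonants to digits, collapse adjacent duplicates, pad with zeros."""
    codes = {}
    for chars, d in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                     ("l", "4"), ("mn", "5"), ("r", "6")]:
        for c in chars:
            codes[c] = d
    word = word.lower()
    first = word[0].upper()
    out = []
    prev = codes.get(word[0], "")
    for c in word[1:]:
        d = codes.get(c, "")
        if d and d != prev:
            out.append(d)
        if c not in "hw":  # h and w do not separate equal codes
            prev = d
    return (first + "".join(out) + "0" * length)[:length]
```

With `length=6`, `soundex("Unix")` yields `U52000`, matching the slide’s example for that word.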


Page 19: Special Topics on Information Retrieval

The algorithm at a glance

1. Compute the phonetic codification for each transcription using a given algorithm.
2. Combine transcriptions and their phonetic codifications to form an enriched document representation.
3. Remove unimportant tokens from the new document representation.
   – Stop words and the most frequent codes.
4. Create a combined index using words and codes.
   – Incoming queries need to be represented in the same way.

Page 20: Special Topics on Information Retrieval

Example of the representation


Automatic transcription

…just your early discussions was roll wallenberg uh any recollection of of uh where he came from…

Phonetic codification

... J23000 Y60000 E64000 D22520 W20000 R40000 W45162 U00000 A50000 R24235 O10000 O10000 U00000 W60000 H00000 C50000 F65000 ...

Enriched representation

{just, early, discussions, roll, wallenberg, recollection, came, E64000, D22520, R40000, W45162, R24235}

• Query: “Actions of Raoul Wallenberg” → {actions, raoul, wallenberg, A23520, R40000, W45162}

Page 21: Special Topics on Information Retrieval

Geographic Information Retrieval

Page 22: Special Topics on Information Retrieval

Content of the section

• Definition of the task
  – The need for GIR
  – Kinds of geographical queries
• Main challenges of GIR
  – Toponym identification and disambiguation
  – Indexing for GIR
  – Measuring document similarities
• Re-ranking of retrieval results

Page 23: Special Topics on Information Retrieval

The need for GIR

Geographical information is recorded in a wide variety of media and document types.

• Information technology for accessing geographical information has focused on the combination of digital maps and databases → GIS
• Systems to retrieve geographically specific information from the relatively unstructured documents that compose the Web → GIR

Page 24: Special Topics on Information Retrieval

The size of the need

• It is estimated that one fifth of the queries submitted to search engines have geographic meaning.
  – Among them, eighty percent can be associated with a geographic place.

[Chart: proportion of geographic vs. other queries]

Page 25: Special Topics on Information Retrieval

Definition of the task

• Geographical Information Retrieval (GIR) considers the search for documents based not only on conceptual keywords, but also on spatial information.
• A geographic query is defined by a tuple <what, relation, where>
  – Example: “Whisky making in the Scottish Islands”
  – <what> represents the thematic part
  – <where> specifies the geographical areas of interest
  – <relation> specifies the “spatial relation” that connects the what and the where
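As a toy illustration of the tuple (the relation word list and function name are assumptions; real systems rely on parsers and gazetteers rather than string splitting), a naive splitter might be:

```python
# Hypothetical, simplified list of spatial relation words.
RELATIONS = ("in", "near", "around", "at")

def parse_geo_query(query):
    """Split a query into a <what, relation, where> tuple at the last
    spatial relation word; return (query, None, None) if none is found."""
    tokens = query.split()
    for i in range(len(tokens) - 1, -1, -1):
        if tokens[i].lower() in RELATIONS:
            return (" ".join(tokens[:i]), tokens[i].lower(),
                    " ".join(tokens[i + 1:]))
    return (query, None, None)
```

For the slide’s example, `parse_geo_query("Whisky making in the Scottish Islands")` yields `("Whisky making", "in", "the Scottish Islands")`.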


Page 26: Special Topics on Information Retrieval

Different kinds of queries

• With concrete locations:
  – “ETA in France” (GC049)
• With locations and simple rules of relevant locations:
  – “Car bombings near Madrid” (GC030)
• With locations and complex rules of relevant locations:
  – “Automotive industry around the Sea of Japan” (GC036)
• With very general locations that are not necessarily in a gazetteer:
  – “Snowstorms in North America” (GC028)
• With quasi-locations (e.g. political) that are not found in a gazetteer:
  – “Malaria in the tropics” (GC034)
• Describing characteristics of the geographical location:
  – “Cities near active volcanoes” (GC040)

Page 27: Special Topics on Information Retrieval

The problem

• In classic IR, retrieved documents are ranked by their similarity to the text of the query.
• In a search engine with geographic capabilities, the semantics of geographic terms should be considered as one of the ranking criteria.

The problem of weighting the geographic importance of a document can be reduced to computing the similarity between two geographic locations, one associated with the query and the other with the document.

Page 28: Special Topics on Information Retrieval

Challenges of GIR

• Detecting geographical references in the form of place names within text documents and in users’ queries
• Disambiguating place names to determine which particular instance of a name is intended
• Geometric interpretation of the meaning of vague place names (‘Midlands’) and spatial relations (‘near’)
• Indexing documents with respect to their geographic context as well as their non-spatial thematic content
• Ranking the relevance of documents with respect to geography as well as theme
• Developing effective user interfaces that help users to find what they want

Page 29: Special Topics on Information Retrieval

Detecting geographic references

• The process of geo-parsing is concerned with analyzing text to identify the presence of place names → an extension of Named Entity Recognition.
• The problem is that place names (or toponyms) can be used to refer to places on Earth, but they also occur frequently within the names of organizations and as part of people’s names.
  – Washington: president or place? (PER vs. LOC)
  – Mexico: country or football team? (LOC vs. ORG)

Page 30: Special Topics on Information Retrieval

Two main approaches

• Knowledge-based
  – Using an existing gazetteer
    • A list containing information on geographical references (e.g. name, name variations, coordinates, class, size, additional information).
• Data-driven or supervised
  – Using statistical or machine learning methods
    • Typical features are: capitalisation, numeric symbols, punctuation marks, position in the sentence, and the words themselves.

Advantages and disadvantages?

Page 31: Special Topics on Information Retrieval

Disambiguating place names

• Once it has been established that a place name is being used in a geographic sense, the problem remains of determining uniquely the place to which the name refers → toponym resolution.
  – Paris is a place name, but it may refer to the capital of France, or to one of the more than a dozen places named Paris in the US, Canada and Gambia.

The ambiguity of a toponym depends on the world knowledge that a system has.

Page 32: Special Topics on Information Retrieval

Human Errors in TR (taken from a presentation by Davide Buscaldi – UPV, Spain)

Page 33: Special Topics on Information Retrieval

Selected Toponym Resources (taken from a presentation by Davide Buscaldi – UPV, Spain)

• Gazetteers
  – Geonames: http://www.geonames.org
  – Wikipedia-World: http://de.wikipedia.org/wiki/Wikipedia:WikiProjekt_Georeferenzierung/Wikipedia-World/en
• Structured resources
  – Yahoo! GeoPlanet: http://developer.yahoo.com/geo/geoplanet/
  – Getty Thesaurus of Geographical Names: http://www.getty.edu/research/conducting_research/vocabularies/tgn/
  – (Geo)WordNet: http://users.dsic.upv.es/grupos/nle/resources/geo-wn/download.html

Page 34: Special Topics on Information Retrieval

Methods for Toponym Resolution

• Three broad categories:
  – Map-based
    • Need geographical coordinates
  – Knowledge-based
    • Need hierarchical resources
  – Data-driven or supervised
    • Need a large enough set of labeled data
    • Many names occur only once (it is impossible to estimate their probabilities)

Page 35: Special Topics on Information Retrieval

• The right referent is the one with minimum average distance from the context locations.
• Reported precision: 74% to 93%, depending on the test collection.

“One hundred years ago there existed in England the Association for the Promotion of the Unity of Christendom. ... A Birmingham newspaper printed in a column for children an article entitled “The True Story of Guy Fawkes,” ... An Anglican clergyman in Oxford sadly but frankly acknowledged to me that this is true. ... A notable example of this was the discussion of Christian unity by the Catholic Archbishop of Liverpool, …”
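The minimum-average-distance rule can be sketched as follows (the coordinates are approximate and the candidate/context data structures are assumptions for illustration):

```python
import math

def haversine(p, q):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(a))

def resolve(candidates, context):
    """Pick the candidate referent whose average distance to the
    unambiguous context locations is minimal."""
    return min(candidates, key=lambda name: sum(
        haversine(candidates[name], c) for c in context) / len(context))

# Birmingham as in the example: UK vs. Alabama, with UK context locations.
candidates = {"Birmingham, UK": (52.48, -1.90),
              "Birmingham, AL": (33.52, -86.80)}
context = [(51.75, -1.26), (53.41, -2.98)]  # Oxford and Liverpool (approx.)
```

Here the English context locations pull the resolution toward Birmingham, UK.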

Page 36: Special Topics on Information Retrieval

Conceptual Density TR (Buscaldi, 2008)

• Adaptation of a WSD method based on Conceptual Density, computed over hierarchies of hypernyms, to hierarchies of holonyms.
• Given an ambiguous place name, different subhierarchies are obtained from WordNet: the sense related to the most densely marked subhierarchy is selected.

[Holonym hierarchies: World → UK → England → {Liverpool, Oxford, Birmingham(1)}; World → USA → {Alabama → Birmingham(2), Mississippi → Oxford}]

“One hundred years ago there existed in England the Association for the Promotion of the Unity of Christendom. ... A Birmingham newspaper printed in a column for children an article entitled “The True Story of Guy Fawkes,” ... An Anglican clergyman in Oxford sadly but frankly acknowledged to me that this is true. ... A notable example of this was the discussion of Christian unity by the Catholic Archbishop of Liverpool, …”

Page 37: Special Topics on Information Retrieval

Spatial and textual indexing

• Once the toponyms are identified, it is necessary to take advantage of them for indexing.
• The main approach consists in a combination of:
  – A textual index, using all words except toponyms
  – A geographic index, considering only the toponyms
• Toponym ambiguity may or may not be resolved → implications of this?
• The geo-index may be enriched using synonyms and holonyms → how to do it? Other related words?

Other indexing alternatives?

Page 38: Special Topics on Information Retrieval

Geographical relevance ranking

Retrieval of relevant documents requires matching the query specification to the characteristics of the indexed documents.

• In geo-IR there is a need to match the geographical component of the query with the geographical context of documents.
• Traditionally, two scores are used, one for thematic and one for geographic relevance, which are combined to obtain an overall relevance.

How to evaluate these similarities? How to consider the <relation> information?
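The combination of the two scores is typically a linear interpolation; a minimal sketch (the weight `lam` and the function name are assumptions, and would be tuned on held-out queries in practice):

```python
def overall_relevance(thematic, geographic, lam=0.7):
    """Combine thematic and geographic scores into one relevance value.
    lam controls how much weight the thematic part receives."""
    return lam * thematic + (1 - lam) * geographic
```

A document strong on theme but weak on geography can still outrank one that only matches the place name, depending on `lam`.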


Page 39: Special Topics on Information Retrieval

About the ranking function

• Consider the following two queries:
  – “Car bombings near Madrid”
  – “Automotive industry around the Sea of Japan”
• What happens with a document mentioning:
  – Barajas, or Leganés, or Toledo? Or talking about dynamite?
  – Toyota but not “automotive industry”, or “Mikura Island”?
• How to differentiate between different relations (“in”, “near”, “at the north”, etc.)?

Page 40: Special Topics on Information Retrieval

Measuring geographic similarities

• Main approach: query expansion using an external resource and traditional word-based comparison of documents.
  – Add to the query some related place names
    • Whose geographic or topological distance is less than a threshold
    • That satisfy the query relation
  – Documents may also be expanded at indexing time (synonyms and holonyms)
• Alternative approach: evaluate a geographic distance between the locations from the query and the document.
  – Using geographic distances
    • Distance between points (latitude and longitude)
    • Intersection between minimal bounding rectangles
  – Topological distance
    • Computed from a given geographic resource
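Two of the geometric checks above can be sketched directly (the rectangle encoding and function names are assumptions for illustration):

```python
def mbr_intersects(a, b):
    """a, b: (min_lat, min_lon, max_lat, max_lon) minimal bounding rectangles.
    They overlap unless one lies entirely on one side of the other."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def euclidean_deg(p, q):
    """Crude point distance in degrees between (lat, lon) pairs; adequate
    for thresholding nearby places, unlike proper great-circle distance."""
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
```

A query MBR intersecting a document MBR, or a point distance under a threshold, would then count as geographic evidence of relevance.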


Page 41: Special Topics on Information Retrieval

Relevance feedback in geo-IR

• Traditional IR systems are able to retrieve the majority of the relevant documents for most queries, but have severe difficulty generating a pertinent ranking of them.
• Idea: use relevance feedback to select some relevant documents and then re-rank the retrieval list using this information.

Page 42: Special Topics on Information Retrieval

Our proposed solution

• Based on a Markov random field (MRF) that aims at classifying the ranked documents as relevant or irrelevant.
• The MRF takes into account:
  – Information provided by the base retrieval system
  – Similarities among documents in the list
  – Relevance feedback information
• We reduced the problem of document re-ranking to that of minimizing an energy function that represents a trade-off between document relevance and inter-document similarity.

Page 43: Special Topics on Information Retrieval

Proposed architecture


Page 44: Special Topics on Information Retrieval

Definition of the MRF

• Each node represents a document from the original retrieved list.
• Each fi is a binary random variable:
  – fi = 1 indicates that the i-th document is relevant
  – fi = 0 indicates that it is irrelevant
• The task of the MRF is to find the most probable configuration F = {f1, …, fN}.
  – The configuration that minimizes a given energy function
  – It is necessary to use an optimization technique; we used ICM (Iterated Conditional Modes).
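A minimal sketch of ICM over a fully connected MRF of documents; the potentials below are simplified stand-ins for the actual energy function (the sign convention, `beta` weight and clamping of feedback seeds are assumptions):

```python
def icm_rerank(sim, obs, seeds, beta=1.0, iters=20):
    """Iterated Conditional Modes over binary relevance labels.

    sim[i][j]: inter-document similarity (interaction potential input)
    obs[i]:    observation score; positive values favor relevance
    seeds:     indices marked relevant by user feedback (kept fixed)
    """
    n = len(obs)
    f = [1 if i in seeds else 0 for i in range(n)]
    for _ in range(iters):
        changed = False
        for i in range(n):
            if i in seeds:          # feedback documents stay relevant
                continue
            # Similar relevant docs push toward 1, similar irrelevant
            # docs push toward 0.
            support = sum(sim[i][j] * (2 * f[j] - 1)
                          for j in range(n) if j != i)
            new = 1 if obs[i] + beta * support > 0 else 0
            if new != f[i]:
                f[i], changed = new, True
        if not changed:             # local minimum of the energy reached
            break
    return f
```

Each pass greedily flips the label that lowers the local energy, so the feedback labels propagate through similarity links until a stable configuration is found.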


Page 45: Special Topics on Information Retrieval

Energy function

• Combines the following information:
  – Inter-document similarity (interaction potential)
  – Query–document similarity and rank information (observation potential)
• These two similarities are computed in a traditional way, without special treatment for the geographic information.

Page 46: Special Topics on Information Retrieval

Observation potential

Assumption: relevant documents are very similar to the query and, at the same time, are very likely to appear in the top positions.

• Captures the affinity between the document associated with node fi and the query q.
• Incorporates information from the initial retrieval system
  – Uses the position of documents in the original list

Page 47: Special Topics on Information Retrieval

Interaction potential

Assumption: relevant documents are very similar to each other, and less similar to irrelevant documents.

• Assesses how much support same-valued documents give to keeping the current value, and how much support opposite-valued documents give to changing to the contrary value.

Page 48: Special Topics on Information Retrieval

Relevance feedback

• Used as the seed for building the initial configuration of the MRF
  – We set fi = 1 for relevance feedback documents and fj = 0 for the rest.
  – The MRF starts the energy minimization process knowing which documents are potentially relevant to the query.
• The inference process consists of identifying further relevant documents in the list by propagating the user relevance feedback information through the MRF.

Page 49: Special Topics on Information Retrieval

Evaluation

• We employed the GeoCLEF document collection, composed of news articles from 1994 and 1995.
• 100 topics from GeoCLEF 2005 to 2008.
• We evaluated results using Mean Average Precision (MAP) and precision at N (P@N).
• Initial results were produced by the vector space model as configured in Lemur, using a TF-IDF weighting scheme.
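The two evaluation measures can be computed as follows (a standard sketch; the document identifiers are illustrative):

```python
def average_precision(ranked, relevant):
    """AP for one query: mean of the precision at each relevant hit,
    normalized by the total number of relevant documents."""
    hits, precisions = 0, []
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_list, relevant_set) pairs, one per topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

def precision_at(ranked, relevant, n):
    """P@N: fraction of the top-n results that are relevant."""
    return sum(1 for d in ranked[:n] if d in relevant) / n
```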


Page 50: Special Topics on Information Retrieval

Results


Page 51: Special Topics on Information Retrieval

Final comments

• Recent results have shown that traditional IR systems are able to retrieve the majority of the relevant documents for most queries, but have severe difficulty generating a pertinent ranking of them.
  – This indicates that the thematic part is the most important from a recall perspective.
  – It suggests using geo-processing as a post-retrieval stage (re-ranking).
  – It suggests using user interfaces for carrying out the interpretation of the query <relation>.

Page 52: Special Topics on Information Retrieval

Question Answering

Page 53: Special Topics on Information Retrieval

Content of the section

• Definition of the task
• General architecture of a QA system
  – Question classification
  – Passage retrieval
  – Answer extraction and ranking
• Answer validation
• Multilingual question answering

Page 54: Special Topics on Information Retrieval

Question answering

• Due to the great number of documents available online, better retrieval methods are required.
• Question Answering (QA) systems are applications whose aim is to provide inexperienced users with flexible access to information.
• These systems allow users to write a query in natural language and to obtain not a set of documents that contain the answer, but the concise answer itself.

Page 55: Special Topics on Information Retrieval

Input and output

• In summary, the goal of a QA system is to find the answer to an open-domain question in a large document collection.
  – Input: questions (instead of keyword-based queries)
  – Output: answers (instead of documents)

Given a question like “Where is the Popocatepetl located?”, a QA system must respond “Mexico”, instead of just returning a list of documents related to the volcano.

Page 56: Special Topics on Information Retrieval

Related technologies

• Information Retrieval
  – Retrieves documents relevant to a user query from a document collection.
    • Queries expressed as a set of keywords
• Information Extraction
  – Template filling from text (i.e., filling slots in a database from sub-segments of text)
    • Slots in templates are previously defined
• Relational QA
  – Translates questions into relational DB queries.
    • Answers are extracted from a given database.

Page 57: Special Topics on Information Retrieval

Complexity of the task


Page 58: Special Topics on Information Retrieval

Current work

• Questions about simple facts
  – Names, dates, quantities, lists of instances, definitions of terms and persons.
• Answers extracted from one single document
  – Direct answers; no (or little) inference required
• Mainly using one single document collection
  – Multilingual QA is the exception, but in the end most methods use only one collection.

Page 59: Special Topics on Information Retrieval

Examples of factoid questions

Question → Answer
When did the reunification of East and West Germany take place? → 1989
Who is the Prime Minister of Italy? → Silvio Berlusconi
What is the capital of the Republic of South Africa? → Pretoria
How many inhabitants does Sweden have? → about 8,600,000
Who were the members of The Beatles? → John, Paul, Ringo and George

How to automatically extract the answers to these questions?

Page 60: Special Topics on Information Retrieval

Answering simple questions is not so easy

• Sir Winston Leonard Spencer-Churchill (30 November 1874 – 24 January 1965) was a British politician and statesman known for his leadership of the United Kingdom during the Second World War… son of Lord Randolph Churchill, a politician, and Lady Jennie Jerome, daughter of American millionaire Leonard Jerome.
  – When was Winston Churchill born?
  – Where was Winston Churchill born?
  – What is the name of his mother/father?
  – What is the name of his grandparent?

What is the problem with these questions?

Page 61: Special Topics on Information Retrieval

Typical QA pipeline

[Pipeline diagram: Question → Question Analysis (produces query and answer type) → Search Engine over the Document Collection → Passage Extractor → Answer Selector → Answers, with NLP/knowledge resources supporting the stages]

• Several systems retrieve passages directly from the collection.
• Most systems return a list of candidate answers together with a support passage.

How to carry out these subtasks?

Page 62: Special Topics on Information Retrieval

Question analysis

• This module processes the question, analyzes the question type, and produces a set of keywords for retrieval.
  – Depending on the answer extraction strategy, it is sometimes necessary to perform syntactic and semantic analysis of the questions.
• One problem: how to define a taxonomy of questions?
  – Most systems use a simple classification including main types and some subtypes.

Page 63: Special Topics on Information Retrieval

Question classification

How to achieve this classification? Types and subtypes in one single step?

Question → Answer type
When did the reunification of East and West Germany take place? → DATE
Who is the Prime Minister of Italy? → PERSON
What is the capital of the Republic of South Africa? → PLACE (subtype: city)
Where is the Popocatepetl located? → PLACE (subtype: country)
How many inhabitants does Sweden have? → QUANTITY
Who was Gennady Lyachin? → DEFINITION

Page 64: Special Topics on Information Retrieval

Main approaches

• Using handcrafted rules
  – Several rules for each kind of query
• By means of a supervised approach
  – The main approach considers only surface text features (bags of words and bags of n-grams)
    • The most informative feature is the wh-word
  – Questions are represented as binary feature vectors
  – Best results using SVMs

Current classification accuracy is up to 90%.
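The handcrafted-rule approach can be sketched for the answer types used in these slides (the patterns and the `classify` name are illustrative assumptions; supervised classifiers over bag-of-words features replace such rule lists in practice):

```python
import re

# Ordered rules: the first matching pattern decides the answer type.
# The last "who" rule, matching exactly "who is/was <First> <Last>?",
# approximates definition questions such as "Who was Gennady Lyachin?".
RULES = [
    (r"^when\b", "DATE"),
    (r"^how (many|much)\b", "QUANTITY"),
    (r"^where\b", "PLACE"),
    (r"^what is the capital\b", "PLACE"),
    (r"^who (is|was) \w+ \w+\?$", "DEFINITION"),
    (r"^who\b", "PERSON"),
]

def classify(question):
    q = question.lower().strip()
    for pattern, answer_type in RULES:
        if re.search(pattern, q):
            return answer_type
    return "OTHER"
```

The wh-word does most of the work, which matches the slide’s observation that it is the most informative feature.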

Page 65: Special Topics on Information Retrieval

Passage retrieval

• An important component common to many question answering systems.
• It processes sets of documents (maybe the entire collection) and returns ranked lists of passages scored with respect to the query terms.
• Two main issues:
  – Defining the passages
  – Measuring their relevance to a given question

How to achieve these two tasks?

Page 66: Special Topics on Information Retrieval

Dividing documents into passages

• Discourse models
  – Use the structural properties of the documents, such as sentences or paragraphs.
    • Paragraphs of very different lengths.
• Semantic models
  – Divide each document into semantic pieces according to the different topics in the document.
    • Requires automatic topic segmentation; also different lengths.
• Window models
  – Use windows of a fixed size (usually a number of terms) to determine passage boundaries.
    • May divide relevant sections into two passages.
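The window model is easy to sketch; half-overlapping windows mitigate the boundary problem above, since a span split by one window boundary stays whole in the next window (the sizes are illustrative defaults):

```python
def window_passages(tokens, size=50, stride=25):
    """Fixed-size, overlapping token windows as candidate passages.
    A stride of size/2 gives half-overlapping windows."""
    last_start = max(1, len(tokens) - size + 1)
    return [tokens[i:i + size] for i in range(0, last_start, stride)]
```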


Page 67: Special Topics on Information Retrieval

Approaches for passage retrieval

• Overlap-based passage retrieval
  – Counts common terms between the query and the passage.
• Density-based passage retrieval
  – Favors query terms that appear close together.
• Some works have also explored:
  – Using the order of the words
  – Combining term overlap and question reformulations
  – Calculating similarity at the syntactic level by using an edit distance between trees.

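A density-based score can be sketched as follows; the scoring formula (squared match count over the matched span) is an illustrative choice, not a specific published algorithm:

```python
# Sketch of a density-based passage score: reward passages where the
# matched query terms occur close to one another.
def density_score(query_terms, passage_tokens):
    positions = [i for i, tok in enumerate(passage_tokens)
                 if tok in query_terms]
    if len(positions) < 2:
        return float(len(positions))  # 0 or 1 matched occurrence
    span = positions[-1] - positions[0] + 1
    # More matches and a tighter span both raise the score.
    return len(positions) ** 2 / span

q = {"nixon", "visit", "china"}
tight = "nixon arrived to visit china in 1972".split()
loose = "nixon said much later that a visit of any kind to china".split()
print(density_score(q, tight) > density_score(q, loose))  # True
```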

Page 68: Special Topics on Information Retrieval

Some final comments

• Boolean querying schemes perform well in the question answering task.

• The performance differences between the various passage retrieval algorithms vary with the choice of document retriever.
– This suggests significant interactions between document retrieval and passage retrieval.

• The best algorithms in past evaluations employ density-based measures for scoring query terms.


S. Tellex, B. Katz, J. Lin, A. Fernandes and G. Marton. Quantitative Evaluation of Passage Retrieval Algorithms for Question Answering. In Proc. of SIGIR '03, Toronto, Canada, 2003, pp. 41-47.

Page 69: Special Topics on Information Retrieval

Answer extraction

• This module performs a detailed analysis of the passages to locate the question's answer.
– It usually produces a list of candidate answers and ranks them according to some scoring function.

• Most methods consider:
– The expected answer type
– The level of overlap between question and passage
– The redundancy of the answer across the passages
– Restrictions on the semantic and syntactic roles of the answer.

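Three of the signals listed above can be combined in a simple ranking sketch; the weights, the candidate format, and the `rank_candidates` helper are illustrative assumptions, not a specific system's method:

```python
# Rank candidate answers by expected-type match, question/passage term
# overlap, and redundancy across passages. Weights are illustrative.
from collections import Counter

def rank_candidates(candidates, question_terms, expected_type):
    # candidates: list of (answer_text, answer_type, passage_terms)
    redundancy = Counter(text for text, _, _ in candidates)
    scored = {}
    for text, ctype, passage_terms in candidates:
        type_match = 1.0 if ctype == expected_type else 0.0
        overlap = len(question_terms & passage_terms) / max(len(question_terms), 1)
        score = 2.0 * type_match + overlap + 0.5 * redundancy[text]
        scored[text] = max(scored.get(text, 0.0), score)
    return sorted(scored, key=scored.get, reverse=True)

q = {"when", "nixon", "visit", "china"}
cands = [
    ("February 1972", "DATE", {"nixon", "china", "visit", "1972"}),
    ("Beijing", "PLACE", {"nixon", "china", "1972"}),
    ("February 1972", "DATE", {"nixon", "arrived", "china"}),
]
print(rank_candidates(cands, q, "DATE")[0])  # February 1972
```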

Page 70: Special Topics on Information Retrieval

Difficulties

• When did Nixon visit China?
– Richard Nixon's trip to China in February 1972 was a critically important…
– On Feb. 21, 1972, U.S. President Richard Nixon arrived in Beijing…
– In 1972, a quiet, unassuming China was visited by an American president, Richard Nixon.

• It is important to perform morphological and syntactic analysis, to have world knowledge, and to apply some kind of normalization to the answers.

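Answer normalization for the example above can be sketched with a small pattern matcher; the regular expression and the month table are illustrative, not a complete date grammar:

```python
# Map surface date variants to one canonical form so that
# "February 1972" and "Feb. 21, 1972" count as the same
# (month-level) answer.
import re

MONTHS = {"feb": "02", "february": "02"}  # extend as needed

def normalize_date(text):
    m = re.search(r"(?i)\b(feb(?:ruary)?)\.?\s+(?:\d{1,2},\s*)?(\d{4})", text)
    if not m:
        return text  # leave unrecognized strings unchanged
    return "%s-%s" % (m.group(2), MONTHS[m.group(1).lower()])

print(normalize_date("February 1972"))  # 1972-02
print(normalize_date("Feb. 21, 1972"))  # 1972-02
```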

Page 71: Special Topics on Information Retrieval

Evaluation of QA

• Accuracy: indicates the percentage of correctly answered questions.
– It is calculated as the number of correct answers, plus correct NIL responses, divided by the total number of questions.

• MRR (mean reciprocal rank): evaluates the ranked list of candidate answers. It is the average, over a sample of questions, of the reciprocal rank of the first correct answer.

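Both measures are easy to compute from per-question outcomes; in the sketch below each entry in `ranks` is the 1-based rank of the first correct answer for a question, or `None` if no correct answer was returned (the encoding is an illustrative convention):

```python
# Accuracy: fraction of questions answered correctly at rank 1.
def accuracy(results):
    return sum(1 for r in results if r == 1) / len(results)

# MRR: mean of 1/rank of the first correct answer (0 when none found).
def mrr(results):
    return sum(1.0 / r for r in results if r is not None) / len(results)

ranks = [1, 3, None, 2]   # illustrative outcomes for 4 questions
print(accuracy(ranks))    # 0.25: only one question answered at rank 1
print(mrr(ranks))         # (1 + 1/3 + 0 + 1/2) / 4 ≈ 0.458
```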

Page 72: Special Topics on Information Retrieval

Multi-stream QA

• There are several QA approaches, and most of them are complementary. For instance, in the Portuguese QA track at CLEF 2008:
– The system "diue081" correctly answered 89% of the definition questions, but could only answer 35% of the factual questions.
– The combination of the correct answers from all nine participating systems outperformed the best individual result for factual questions by 49%.

• This fact indicates that a pertinent combination of the various systems should improve on the individual results.


Page 73: Special Topics on Information Retrieval

Traditional approaches

• The challenge is to select the correct answer for a given question by combining the evidence from the different input systems:
– Dark Horse approach: considers the confidence of the systems; each system has a different confidence for factual and for definition questions.
– Answer Chorus approach: relies on answer redundancy; it selects as the final response the answer with the highest frequency across streams.
– Web Chorus approach: uses information from the Web to evaluate the relevance of the candidate answers; it selects the answer with the greatest number of Web pages containing the answer terms together with the question terms.

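The Answer Chorus idea reduces to majority voting over the streams' answers; the sketch below uses illustrative stream outputs (ties are broken by first occurrence, a simplification):

```python
# Answer Chorus sketch: pick the answer proposed by the most streams.
from collections import Counter

def answer_chorus(stream_answers):
    counts = Counter(stream_answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes

streams = ["Rome", "Rome", "Milan", "Rome", "Turin"]
print(answer_chorus(streams))  # ('Rome', 3)
```

A real system would first normalize the answers (see the answer-extraction slides) so that surface variants vote together.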

Page 74: Special Topics on Information Retrieval

Hybrid supervised approach

• Combines features from the traditional and textual-entailment approaches that describe:
– The redundancy of answers across streams
– The compatibility between the question and answer types
– The overlap and non-overlap information between the question-answer pair and the support text.

• Like any other supervised approach, it needs a tagged training data set.
• Best results obtained using SVMs


Page 75: Special Topics on Information Retrieval

Multilingual QA

• In a multilingual scenario, it is expected that QA systems will be able to:
– Answer questions formulated in several languages
– Look for answers in a number of collections in different languages

• Additional issues arise due to the language barrier:
– Translation of the incoming questions into all target languages
– Combination of the relevant information extracted from the different languages


Page 76: Special Topics on Information Retrieval

Translation of questions

• Most current systems translate the questions into the documents' language.

• This solution is very intuitive and seems effective, but it is too sensitive to translation errors.
– Translation errors cause a drop in answer accuracy.

• Current work focuses on:
– Performing triangulated translation using English as a pivot language
– Combining the capacities of several machine translation systems
– Selecting the best translation for the current document collection


Page 77: Special Topics on Information Retrieval

Combining multilingual information

• The goal is to integrate the information obtained from the different languages into one single ranked list.

• Most systems rely on the translation of the passages or answers into a common language.

• Two main approaches:
– Combining passages
  • Implement data fusion techniques from IR
– Combining answers
  • Use architectures from multi-stream QA

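One classic data-fusion technique from IR, CombSUM, fits the passage-combination case: sum the (normalized) scores each language's retriever gives to a passage. The per-language scores below are illustrative; normalizing by each list's top score is one common choice among several:

```python
# CombSUM sketch: merge per-language ranked lists by summing
# max-normalized scores for each passage id.
def comb_sum(ranked_lists):
    fused = {}
    for scores in ranked_lists:             # one {passage_id: score} per language
        top = max(scores.values())
        for pid, s in scores.items():
            fused[pid] = fused.get(pid, 0.0) + s / top  # normalize to [0, 1]
    return sorted(fused, key=fused.get, reverse=True)

english = {"p1": 9.0, "p2": 4.5}
spanish = {"p2": 3.0, "p3": 1.5}
print(comb_sum([english, spanish]))  # ['p2', 'p1', 'p3']: p2 is supported by both lists
```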

Page 78: Special Topics on Information Retrieval

Summary of results from QA@CLEF


Page 79: Special Topics on Information Retrieval

Summary of results from TREC 2007
