Information Retrival Rankingnin/Courses/Workshop13a/tf-idf.pdf · 2012. 10. 30. · Eddie Aronovich...

Information Retrieval Ranking

tf-idf

Eddie Aronovich

October 30, 2012

Table of contents

1 Information Retrieval
    Definition
    Information Retrieval - Characteristics
    Vector Space Model
    Probabilistic Models

2 Ranking
    Term Frequency
    tf∗idf

Information Retrieval

the techniques of storing and recovering and often disseminating recorded data especially through the use of a computerized system

http://www.merriam-webster.com/dictionary/information%20retrieval

Characteristics

Word frequency - The Automatic Creation of Literature Abstracts (Luhn 1958)

Cue words, title/heading, structural - New Methods in Automatic Extracting (Edmundson 1969)

Cohesion - Adaptive method of automatic abstracting and indexing (E. F. Skorochodko 1971/2)

Structure-based classification - Automatic condensation of electronic publications by sentence selection (Brandow 1995)

Similarity

Let $\vec{D}$ be some document vector

Let $\vec{Q}$ be some query vector

The cosine of the angle between them represents their similarity

After normalization, we can use the dot product instead:

$$\mathrm{Sim}(\vec{D}, \vec{Q}) = \sum_{t_i \in Q, D} w_{t_i,Q} \, w_{t_i,D}$$
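A minimal sketch of this similarity, assuming each vector is stored as a sparse dictionary from term to weight. The function names `normalize`, `sim`, and `cosine` and the example weights are illustrative and not taken from the slides.

```python
import math

def normalize(weights):
    """Scale a sparse term-weight vector (dict: term -> weight) to unit length."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return dict(weights)  # an all-zero vector cannot be normalized
    return {t: w / norm for t, w in weights.items()}

def sim(d, q):
    """Dot product over the terms that appear in both d and q."""
    return sum(w * d[t] for t, w in q.items() if t in d)

def cosine(d, q):
    """Cosine of the angle between d and q: dot product of the unit vectors."""
    return sim(normalize(d), normalize(q))

# Hypothetical weights, chosen only to make the arithmetic easy to follow.
doc = {"ranking": 3.0, "retrieval": 4.0}
query = {"retrieval": 1.0}
print(cosine(doc, query))  # 0.8
```

Only terms shared by the query and the document contribute to the sum, which is why the loop skips query terms missing from the document.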

Relevance

Ranking documents by their relevance to a query

$P(R|D)$ - probability a document D is relevant
$P(\bar{R}|D)$ - probability a document D is non-relevant

Precision: $P(R|D) = \frac{P(R \cap D)}{P(D)}$

Sensitivity: $P(D|R) = \frac{P(R \cap D)}{P(R)}$
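One way to estimate these two quantities empirically is to read R as the set of relevant documents and D as the set of retrieved documents, so that precision is |R ∩ D| / |D| and sensitivity is |R ∩ D| / |R|. This set-based reading is an assumption, as the slide does not spell it out, and the document ids below are made up.

```python
def precision(relevant, retrieved):
    """Fraction of the retrieved documents that are relevant."""
    if not retrieved:
        return 0.0  # convention chosen here for an empty result list
    return len(relevant & retrieved) / len(retrieved)

def sensitivity(relevant, retrieved):
    """Fraction of the relevant documents that were retrieved."""
    if not relevant:
        return 0.0  # convention chosen here when nothing is relevant
    return len(relevant & retrieved) / len(relevant)

# Hypothetical document ids.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d2", "d3", "d5"}

print(precision(relevant, retrieved))    # 2 of the 3 retrieved are relevant
print(sensitivity(relevant, retrieved))  # 0.5
```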

Relevance

How to classify a document?

There is no right and wrong, but a list of methods, including:

Bayesian approach:

$$\log \frac{P(R|D)}{P(\bar{R}|D)} = \log \frac{P(D|R)\,P(R)}{P(D|\bar{R})\,P(\bar{R})}$$

If $P(R)$ and $P(\bar{R})$ are independent of $D$, their ratio is a constant that does not affect the ranking, and we get:

$$\log \frac{P(D|R)}{P(D|\bar{R})}$$

Additional methods exist, based on precision and sensitivity.
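The following sketch illustrates why the prior can be dropped: with a prior P(R) that is the same for every document, the full log-odds and the log-likelihood ratio differ by a constant, so both produce the same ranking. The likelihood values, the prior, and the function names are hypothetical.

```python
import math

def log_odds(p_d_given_r, p_d_given_not_r, p_r):
    """log P(R|D)/P(R̄|D) via Bayes' rule; P(D) cancels in the ratio."""
    return math.log(p_d_given_r * p_r) - math.log(p_d_given_not_r * (1.0 - p_r))

def log_likelihood_ratio(p_d_given_r, p_d_given_not_r):
    """log P(D|R)/P(D|R̄): the log-odds with the constant prior term removed."""
    return math.log(p_d_given_r) - math.log(p_d_given_not_r)

# Hypothetical (P(D|R), P(D|R̄)) pairs for three documents.
likelihoods = {"d1": (0.02, 0.01), "d2": (0.005, 0.02), "d3": (0.03, 0.03)}
p_r = 0.1  # hypothetical prior, identical for every document

by_odds = sorted(likelihoods,
                 key=lambda d: log_odds(*likelihoods[d], p_r), reverse=True)
by_ratio = sorted(likelihoods,
                  key=lambda d: log_likelihood_ratio(*likelihoods[d]), reverse=True)
print(by_odds == by_ratio)  # True
```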

Term Frequency (tf)

Term Frequency (tf):

$$\mathrm{tf}(t, D) = \frac{F(t, D)}{\max\{F(w, D) : w \in D\}}$$

F() - some frequency function
t - a specific term
w - ranges over all the terms in the document

A word that appears many times in the document gets the maximum value.
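A minimal sketch of this definition, assuming F is the raw count of a term and a document is a list of tokens. The slides leave F open ("some frequency function"), so the raw count is only one possible choice.

```python
from collections import Counter

def tf(term, document):
    """tf(t, D) = F(t, D) / max{F(w, D) : w in D}, with F taken as the raw count."""
    counts = Counter(document)
    if not counts:
        return 0.0  # empty document: no term has any frequency
    return counts[term] / max(counts.values())

doc = "the cat sat on the mat".split()
print(tf("the", doc))  # 1.0
print(tf("cat", doc))  # 0.5
print(tf("dog", doc))  # 0.0
```

Dividing by the largest count keeps every value between 0 and 1, so long documents do not get higher scores simply because they contain more words.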

Inverse Document Frequency (idf)

Inverse Document Frequency (idf):

$$\mathrm{idf}(t, \mathcal{D}) = \log \frac{|\mathcal{D}|}{|\{D \in \mathcal{D} : t \in D\}|}$$

$\mathcal{D}$ - all the documents we have
D, t - as before

If term t appears in many documents, this value is low.
We look for terms that are rare across documents.
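A minimal sketch of this definition, assuming the collection is a list of tokenized documents and the logarithm is natural. The formula is undefined for a term that appears in no document; raising an error in that case is a choice made here, not something the slides specify.

```python
import math

def idf(term, corpus):
    """idf(t, corpus) = log(|corpus| / number of documents containing t)."""
    containing = sum(1 for document in corpus if term in document)
    if containing == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return math.log(len(corpus) / containing)

# Hypothetical three-document collection.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
print(idf("the", corpus))   # 0.0, because "the" occurs in every document
print(idf("bird", corpus))  # log(3), because "bird" occurs in one of three
```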

tf∗idf

$$\mathrm{tfidf}(t, D, \mathcal{D}) = \mathrm{tf}(t, D) \cdot \mathrm{idf}(t, \mathcal{D})$$
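To show the product in use, here is a sketch that ranks a small collection against a query. Taking F as the raw count, scoring a document by summing tf∗idf over the query terms, and giving a weight of 0 to terms absent from the collection are all assumptions made for illustration; the slides only define the product itself.

```python
import math
from collections import Counter

def tfidf(term, document, corpus):
    """tf(t, D) * idf(t, corpus), with raw counts as the frequency function."""
    counts = Counter(document)
    containing = sum(1 for d in corpus if term in d)
    if not counts or containing == 0:
        return 0.0  # assumption: unseen terms and empty documents weigh nothing
    tf = counts[term] / max(counts.values())
    idf = math.log(len(corpus) / containing)
    return tf * idf

def score(query, document, corpus):
    """Assumed scoring rule: sum of tf-idf weights of the query terms."""
    return sum(tfidf(t, document, corpus) for t in query)

# Hypothetical three-document collection and query.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
query = ["cat", "mat"]

ranked = sorted(range(len(corpus)),
                key=lambda i: score(query, corpus[i], corpus), reverse=True)
print(ranked)  # [0, 1, 2]
```

The first document ranks highest because it contains both query terms, and "mat" is rarer in the collection than "cat", so it carries the larger weight.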

Thank You !
