Information Retrieval - Stanford University · 5/21/17 1 Introduction to Information Retrieval...

Post on 05-Jul-2020

4 views 0 download

Transcript of Information Retrieval - Stanford University · 5/21/17 1 Introduction to Information Retrieval...

5/21/17

1

IntroductiontoInformationRetrieval

Introductionto

InformationRetrieval

BM25,BM25F,andUserBehaviorChrisManningandPanduNayak

IntroductiontoInformationRetrieval

IntroductiontoInformationRetrieval

Summary– BIM[Robertson&Spärck-Jones1976]§ Boilsdownto

where

§ Withconstantpi =0.5,simplifiestoIDFweighting:

RSV = log Nnixi=qi=1

RSV BIM = ciBIM ;

xi=qi=1∑ ci

BIM = log pi (1− ri )(1− pi )ri

document relevant(R=1) notrelevant(R=0)

termpresent xi =1 pi ritermabsent xi =0 (1– pi) (1 – ri)

Logoddsratio

IntroductiontoInformationRetrieval

GraphicalmodelforBIM– BernoulliNB

i ∈ q

Binaryvariablesxi = (tfi ≠ 0)

IntroductiontoInformationRetrieval

AkeylimitationoftheBIM§ BIM– likemuchoforiginalIR– wasdesignedfortitlesorabstracts,andnotformodernfulltextsearch

§ Wewanttopayattentiontotermfrequencyanddocumentlengths,justlikeinothermodelswediscuss

§ Want

§ Wantsomemodelofhowoftentermsoccurindocs

ci = logptf r0p0rtf

IntroductiontoInformationRetrieval

1.OkapiBM25 [Robertsonetal.1994,TRECCityU.]

§ BM25“BestMatch25”(theyhadabunchoftries!)§ DevelopedinthecontextoftheOkapisystem§ StartedtobeincreasinglyadoptedbyotherteamsduringtheTRECcompetitions

§ Itworkswell

§ Goal:besensitivetotermfrequencyanddocumentlengthwhilenotaddingtoomanyparameters§ (RobertsonandZaragoza2009;Spärck Jonesetal.2000)

5/21/17

2

IntroductiontoInformationRetrieval

§ Wordsaredrawnindependentlyfromthevocabularyusingamultinomialdistribution

Generativemodelfordocuments

... the draft is that each team is given a position in the draft …

basic

team each

thatof

is

the draft

designnfl

football

given

annual draftfootball

team

nfl

IntroductiontoInformationRetrieval

§ Distributionoftermfrequencies(tf)followsabinomialdistribution– approximatedbyaPoisson

Generativemodelfordocuments

... the draft is that each team is given a position in the draft …

draft

IntroductiontoInformationRetrieval

Poissondistribution§ ThePoissondistributionmodelstheprobabilityofk,thenumberofeventsoccurringinafixedintervaloftime/space,withknownaveragerateλ (=cf/T),independentofthelastevent

§ Examples§ Numberofcarsarrivingatthetollboothperminute§ Numberoftyposonapage

p(k) = λk

k!e−λ

IntroductiontoInformationRetrieval

Poissondistribution§ IfTislargeandpissmall,wecanapproximateabinomialdistributionwithaPoissonwhereλ =Tp

§ Mean=Variance=λ =Tp.§ Examplep= 0.08,T=20.Chanceof1occurrenceis:

§ Binomial

§ Poisson…alreadyclose

p(k) = λk

k!e−λ

P(1) = [(20)(.08)]1

1!e−(20)(.08) = 1.6

1e−1.6 = 0.3230

P(1) = 201

!

"#

$

%&(.08)1(.92)19 = .3282

IntroductiontoInformationRetrieval

Poissonmodel§ Assumethattermfrequenciesinadocument(tfi)followaPoissondistribution§ “Fixedinterval”impliesfixeddocumentlength…thinkroughlyconstant-sizeddocumentabstracts§…willfixlater

IntroductiontoInformationRetrieval

Poissondistributions

5/21/17

3

IntroductiontoInformationRetrieval

(One)PoissonModel§ Isareasonablefitfor“general”words§ Isapoorfitfortopic-specificwords

§ gethigherp(k)thanpredictedtoooften

Documents containingk occurrencesofword(λ =53/650)Freq Word 0 1 2 3 4 5 6 7 8 9 10 11 1253 expected 599 49 252 based 600 48 2

53 conditions 604 39 7

55 cathexis 619 22 3 2 1 2 0 151 comic 642 3 0 1 0 0 0 0 0 0 1 1 2

Harter, “A Probabilistic Approach to Automatic Keyword Indexing”, JASIST, 1975

IntroductiontoInformationRetrieval

Eliteness(“aboutness”)§ Modeltermfrequenciesusingeliteness§ Whatiseliteness?

§ Hiddenvariableforeachdocument-termpair,denotedasEi fortermi

§ Representsaboutness:atermiseliteinadocumentif,insomesense,thedocumentisabouttheconceptdenotedbytheterm

§ Elitenessisbinary§ Termoccurrencesdependonlyoneliteness…§ …butelitenessdependsonrelevance

IntroductiontoInformationRetrieval

ElitetermsTextfromtheWikipediapageontheNFLdraftshowingeliteterms

The National Football League Draft is an annual event in which the National Football League (NFL) teams select eligible college football players. It serves as the league’s most common source of player recruitment. The basic design of the draft is that each team is given a position in the draft order in reverse order relative to its record …

IntroductiontoInformationRetrieval

Graphicalmodelwitheliteness

i ∈ q

Frequencies(notbinary)

Binaryvariables

IntroductiontoInformationRetrieval

RetrievalStatusValue§ SimilartotheBIMderivation,wehave

where

andusingeliteness,wehave:

RSV elite = cielite

i∈q,tfi>0∑ (tfi );

p(TFi = tfi R) = p(TFi = tfi Ei = elite)p(Ei = elite R)+p(TFi = tfi Ei = elite)(1− p(Ei = elite R))

cielite(tfi ) = log

p(TFi = tfi R =1)p(TFi = 0 R = 0)p(TFi = 0 R =1)p(TFi = tfi R = 0)

IntroductiontoInformationRetrieval

2-Poissonmodel§ Theproblemswiththe1-PoissonmodelsuggestsfittingtwoPoissondistributions

§ Inthe“2-Poissonmodel”,thedistributionisdifferentdependingonwhetherthetermiseliteornot

§ whereπisprobabilitythatdocumentiseliteforterm§ but,unfortunately,wedon’tknowπ,λ,μ

p(TFi = ki R) =

5/21/17

4

IntroductiontoInformationRetrieval

Let’sgetanidea:Graphingfordifferentparametervaluesofthe2-Poisson

cielite(tfi )

IntroductiontoInformationRetrieval

Qualitativeproperties

§

§ increasesmonotonicallywithtfi

§ …butasymptoticallyapproachesamaximumvalueas[nottrueforsimplescalingoftf]

§ … withtheasymptoticlimitbeing

cielite(0) = 0

cielite(tfi )

ciBIM

Weightofelitenessfeature

tfi →∞

IntroductiontoInformationRetrieval

Approximatingthesaturationfunction§ Estimatingparametersforthe2-Poissonmodelisnoteasy

§ …Soapproximateitwithasimpleparametriccurvethathasthesamequalitativeproperties

tfk1 + tf

IntroductiontoInformationRetrieval

Saturationfunction

§ Forhighvaluesofk1,incrementsintfi continuetocontributesignificantlytothescore

§ Contributionstailoffquicklyforlowvaluesofk1

IntroductiontoInformationRetrieval

“Early”versionsofBM25§ Version1:usingthesaturationfunction

§ Version2:BIMsimplificationtoIDF

§ (k1+1) factordoesn’tchangeranking,butmakestermscore1whentfi = 1

§ Similartotf-idf,buttermscoresarebounded

ciBM 25v1(tfi ) = ci

BIM tfik1 + tfi

ciBM 25v2 (tfi ) = log

Ndfi

×(k1 +1)tfik1 + tfi

IntroductiontoInformationRetrieval

Documentlengthnormalization§ Longerdocumentsarelikelytohavelargertfi values

§ Whymightdocumentsbelonger?§ Verbosity:suggestsobservedtfi toohigh§ Largerscope:suggestsobservedtfi mayberight

§ Arealdocumentcollectionprobablyhasbotheffects§ …soshouldapplysomekindofpartialnormalization

5/21/17

5

IntroductiontoInformationRetrieval

Documentlengthnormalization§ Documentlength:

§ avdl:Averagedocumentlengthovercollection§ Lengthnormalizationcomponent

§ b = 1 fulldocumentlengthnormalization§ b = 0 nodocumentlengthnormalization

dl = tfii∈V∑

B = (1− b)+ b dlavdl

"

#$

%

&', 0 ≤ b ≤1

IntroductiontoInformationRetrieval

Documentlengthnormalization

IntroductiontoInformationRetrieval

OkapiBM25§ Normalizetf usingdocumentlength

§ BM25rankingfunction

t !fi =tfiB

ciBM 25(tfi ) = log

Ndfi

×(k1 +1)t "fik1 + t "fi

= log Ndfi

×(k1 +1)tfi

k1((1− b)+ bdlavdl

)+ tfi

RSV BM 25 = ciBM 25

i∈q∑ (tfi );

IntroductiontoInformationRetrieval

OkapiBM25

§ k1 controlstermfrequencyscaling§ k1 = 0 isbinarymodel;k1 largeisrawtermfrequency

§ b controlsdocumentlengthnormalization§ b = 0 isnolengthnormalization;b = 1 isrelativefrequency(fullyscalebydocumentlength)

§ Typically,k1 issetaround1.2–2andb around0.75§ IIRsec.11.4.3discussesincorporatingquerytermweightingand(pseudo)relevancefeedback

RSV BM 25 = log Ndfii∈q

∑ ⋅(k1 +1)tfi

k1((1− b)+ bdlavdl

)+ tfi

IntroductiontoInformationRetrieval

WhyisBM25betterthanVSMtf-idf?§ Supposeyourqueryis[machinelearning]§ Supposeyouhave2documentswithtermcounts:

§ doc1:learning1024;machine1§ doc2:learning16;machine8

§ tf-idf:log2 tf *log2 (N/df)§ doc1:11*7+1*10 = 87§ doc2:5*7+4*10=75

§ BM25:k1 =2§ doc1:7*3+10*1=31§ doc2:7*2.67+10*2.4=42.7

IntroductiontoInformationRetrieval

2.Rankingwithfeatures§ Textualfeatures

§ Zones:Title,author,abstract,body,anchors,…§ Proximity§ …

§ Non-textualfeatures§ Filetype§ Fileage§ Pagerank§ …

5/21/17

6

IntroductiontoInformationRetrieval

Rankingwithzones§ Straightforwardidea:

§ Applyyourfavoriterankingfunction(BM25)toeachzoneseparately

§ Combinezonescoresusingaweightedlinearcombination

§ Butthatseemstoimplythattheelitenesspropertiesofdifferentzonesaredifferentandindependentofeachother§ …whichseemsunreasonable

IntroductiontoInformationRetrieval

Rankingwithzones§ Alternateidea

§ Assumeelitenessisaterm/documentpropertysharedacrosszones

§ …buttherelationshipbetweenelitenessandtermfrequenciesarezone-dependent§ e.g.,denseruseofelitetopicwordsintitle

§ Consequence§ Firstcombineevidenceacrosszonesforeachterm§ Thencombineevidenceacrossterms

IntroductiontoInformationRetrieval

BM25Fwithzones§ Calculateaweightedvariantoftotaltermfrequency§ …andaweightedvariantofdocumentlength

wherevz iszoneweighttfzi istermfrequencyinzonezlenz islengthofzonezZ isthenumberofzones

tfi = vztfziz=1

Z

∑ dl = vzlenzz=1

Z

∑ avdl = Averageacrossalldocuments

dl

IntroductiontoInformationRetrieval

SimpleBM25Fwithzones

§ Simpleinterpretation:zonez is“replicated”vz times

§ Butwemaywantzone-specificparameters(k1, b,IDF)

RSV SimpleBM 25F = log Ndfii∈q

∑ ⋅(k1 +1)tfi

k1((1− b)+ bdlavdl

)+ tfi

IntroductiontoInformationRetrieval

BM25F§ Empirically,zone-specificlengthnormalization(i.e.,zone-specificb)hasbeenfoundtobeuseful

tfi = vztfziBzz=1

Z

Bz = (1− bz )+ bzlenzavlenz

"

#$

%

&', 0 ≤ bz ≤1

RSV BM 25F = log Ndfii∈q

∑ ⋅(k1 +1)tfik1 + tfi

See Robertson and Zaragoza (2009: 364)

IntroductiontoInformationRetrieval

Rankingwithnon-textualfeatures§ Assumptions

§ Usualindependenceassumption§ Independentofeachotherandofthetextualfeatures§ AllowsustofactoroutinBIM-stylederivation

§ Relevanceinformationisqueryindependent§ Usuallytrueforfeatureslikepagerank,age,type,…§ Allowsustokeepallnon-textualfeaturesintheBIM-stylederivationwherewedropnon-queryterms

p(Fj = f j R =1)p(Fj = f j R = 0)

5/21/17

7

IntroductiontoInformationRetrieval

Rankingwithnon-textualfeatures

where

andisanartificiallyaddedfreeparametertoaccountforrescalings intheapproximations§ CaremustbetakeninselectingVj dependingonFj.E.g.

§ Explainswhyworkswell

RSV = cii∈q∑ (tfi )+ λ jVj ( f j )

j=1

F

Vj ( f j ) = logp(Fj = f j R =1)p(Fj = f j R = 0)

λ j

log( !λ j + f j )f j!λ j + f j

1!λ j + exp(− f j !!λ j )

RSV BM 25 + log(pagerank)

IntroductiontoInformationRetrieval

UserBehavior§ Search Results for “CIKM” (in 2010!)

38

# of clicks received

Taken with slight adaptation from Fan Guo and Chao Liu’s 2009/2010 CIKM tutorial: Statistical Models for Web Search: Click Log Analysis

IntroductiontoInformationRetrieval

UserBehavior§ Adapt ranking to user clicks?

39

# of clicks received

IntroductiontoInformationRetrieval

UserBehavior§ Tools needed for non-trivial cases

40

# of clicks received

IntroductiontoInformationRetrieval

Websearchclicklog

An example

41

IntroductiontoInformationRetrieval

WebSearchClickLog§ How large is the click log?

§ search logs: 10+ TB/day

§ In existing publications:§ [Silverstein+99]: 285M sessions§ [Craswell+08]: 108k sessions§ [Dupret+08] : 4.5M sessions (21 subsets * 216k sessions)§ [Guo +09a] : 8.8M sessions from 110k unique queries§ [Guo+09b]: 8.8M sessions from 110k unique queries§ [Chapelle+09]: 58M sessions from 682k unique queries§ [Liu+09a]: 0.26PB data from 103M unique queries

42

5/21/17

8

IntroductiontoInformationRetrieval

InterpretClicks:anExample

§ Clicks are good…§ Are these two clicks

equally “good”?

§ Non-clicks may have excuses:§ Not relevant§ Not examined

43

IntroductiontoInformationRetrieval

Eye-trackingUserStudy

44

IntroductiontoInformationRetrieval

§ Higher positions receive more user attention (eye fixation) and clicks than lower positions.

§ This is true even in the extreme setting where the order of positions is reversed.

§ “Clicks are informative but biased”.

45

[Joachims+07]

ClickPosition-bias

NormalPosition

Percen

tage

ReversedImpression

Percen

tage

IntroductiontoInformationRetrieval

Userbehavior§ Userbehaviorisanintriguingsourceofrelevancedata

§ Usersmake(somewhat)informedchoiceswhentheyinteractwithsearchengines

§ Potentiallyalotofdataavailableinsearchlogs

§ Buttherearesignificantcaveats§ Userbehaviordatacanbeverynoisy§ Interpretinguserbehaviorcanbetricky§ Spamcanbeasignificantproblem§ Notallquerieswillhaveuserbehavior

IntroductiontoInformationRetrieval

FeaturesbasedonuserbehaviorFrom[Agichtein,Brill,Dumais 2006;Joachims 2002]§ Click-throughfeatures

§ Clickfrequency,clickprobability,clickdeviation§ Clickonnextresult?previous result?above?below>?

§ Browsingfeatures§ Cumulativeandaveragetimeonpage,ondomain,onURLprefix;deviationfromaveragetimes

§ Browsepathfeatures§ Query-textfeatures

§ Queryoverlapwithtitle,snippet,URL,domain,nextquery§ Querylength

IntroductiontoInformationRetrieval

Incorporatinguserbehaviorintorankingalgorithm§ IncorporateuserbehaviorfeaturesintoarankingfunctionlikeBM25F§ ButrequiresanunderstandingofuserbehaviorfeaturessothatappropriateVj functionsareused

§ Incorporateuserbehaviorfeaturesintolearnedrankingfunction

§ Eitherofthesewaysofincorporatinguserbehaviorsignalsimproveranking

5/21/17

9

IntroductiontoInformationRetrieval

Resources§ S.E.RobertsonandH.Zaragoza.2009.TheProbabilistic

RelevanceFramework:BM25andBeyond.FoundationsandTrendsinInformationRetrieval 3(4):333-389.

§ K.Spärck Jones,S.Walker,andS.E.Robertson.2000.Aprobabilisticmodelofinformationretrieval:Developmentandcomparativeexperiments.Part1.InformationProcessingandManagement779–808.

§ T.Joachims.OptimizingSearchEnginesusingClickthroughData.2002.SIGKDD.

§ E.Agichtein,E.Brill,S.Dumais.2006.ImprovingWebSearchRankingByIncorporatingUserBehaviorInformation.2006.SIGIR.