Introducon to Informaon Retrievaldisi.unitn.it/moschitti/Teaching-slides/NLP-IR/lecture02... ·...
Transcript of Introducon to Informaon Retrievaldisi.unitn.it/moschitti/Teaching-slides/NLP-IR/lecture02... ·...
Introduc)ontoInforma)onRetrieval
Introduc*onto
Informa(onRetrieval
CS276:Informa*onRetrievalandWebSearchPanduNayakandPrabhakarRaghavan
Lecture2:Thetermvocabularyandpos*ngslists
Introduc)ontoInforma)onRetrieval
Recapofthepreviouslecture Basicinvertedindexes:
Structure:Dic*onaryandPos*ngs
Keystepinconstruc*on:Sor*ng Booleanqueryprocessing
Intersec*onbylinear*me“merging” Simpleop*miza*ons
Overviewofcoursetopics
Ch. 1
2
Introduc)ontoInforma)onRetrieval
PlanforthislectureElaboratebasicindexing Preprocessingtoformthetermvocabulary
Documents Tokeniza*on Whattermsdoweputintheindex?
Pos*ngs Fastermerges:skiplists Posi*onalpos*ngsandphrasequeries
3
Introduc)ontoInforma)onRetrieval
Recallthebasicindexingpipeline
Tokenizer
Token stream. Friends Romans Countrymen Linguistic modules
Modified tokens. friend roman countryman
Indexer
Inverted index.
friend
roman
countryman
2 4
2
13 16
1
Documents to be indexed.
Friends, Romans, countrymen.
4
Introduc)ontoInforma)onRetrieval
Parsingadocument Whatformatisitin?
pdf/word/excel/html? Whatlanguageisitin? Whatcharactersetisinuse?
Each of these is a classification problem, which we will study later in the course.
But these tasks are often done heuristically …
Sec. 2.1
5
Introduc)ontoInforma)onRetrieval
Complica*ons:Format/language Documentsbeingindexedcanincludedocsfrommanydifferentlanguages Asingleindexmayhavetocontaintermsofseverallanguages.
Some*mesadocumentoritscomponentscancontainmul*plelanguages/formats FrenchemailwithaGermanpdfaXachment.
Whatisaunitdocument? Afile? Anemail?(Perhapsoneofmanyinanmbox.) Anemailwith5aXachments? Agroupoffiles(PPTorLaTeXasHTMLpages)
Sec. 2.1
6
Introduc)ontoInforma)onRetrieval
TOKENSANDTERMS
7
Introduc)ontoInforma)onRetrieval
Tokeniza*on Input:“Friends,Romans,Countrymen” Output:Tokens
Friends Romans Countrymen
Atokenisasequenceofcharactersinadocument Eachsuchtokenisnowacandidateforanindexentry,a`erfurtherprocessing Describedbelow
Butwhatarevalidtokenstoemit?
Sec. 2.2.1
8
Introduc)ontoInforma)onRetrieval
Tokeniza*on Issuesintokeniza*on:
Finland’scapital→Finland?Finlands?Finland’s? Hewle9‐Packard→Hewle9andPackardastwotokens? state‐of‐the‐art:breakuphyphenatedsequence. co‐educa>on lowercase,lower‐case,lowercase? Itcanbeeffec*vetogettheusertoputinpossiblehyphens
SanFrancisco:onetokenortwo? Howdoyoudecideitisonetoken?
Sec. 2.2.1
9
Introduc)ontoInforma)onRetrieval
Numbers 3/12/91 Mar.12,1991 12/3/91 55B.C. B‐52 MyPGPkeyis324a3df234cb23e (800)234‐2333
O`enhaveembeddedspaces OlderIRsystemsmaynotindexnumbers
Buto`enveryuseful:thinkaboutthingslikelookinguperrorcodes/stacktracesontheweb
(Oneanswerisusingn‐grams:Lecture3)
Willo`enindex“meta‐data”separately Crea*ondate,format,etc.
Sec. 2.2.1
10
Introduc)ontoInforma)onRetrieval
Tokeniza*on:languageissues French
L'ensemble→onetokenortwo? L?L’ ?Le? Wantl’ensembletomatchwithunensemble
Un*latleast2003,itdidn’tonGoogle Interna*onaliza*on!
Germannouncompoundsarenotsegmented LebensversicherungsgesellschaTsangestellter ‘lifeinsurancecompanyemployee’ Germanretrievalsystemsbenefitgreatlyfromacompoundspli>er
module Cangivea15%performanceboostforGerman
Sec. 2.2.1
11
Introduc)ontoInforma)onRetrieval
Tokeniza*on:languageissues
ChineseandJapanesehavenospacesbetweenwords: 莎拉波娃现在居住在美国东南部的佛罗里达。 Notalwaysguaranteedauniquetokeniza*on
FurthercomplicatedinJapanese,withmul*plealphabetsintermingled Dates/amountsinmul*pleformats
フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)
Katakana Hiragana Kanji Romaji
End-user can express query entirely in hiragana!
Sec. 2.2.1
12
Introduc)ontoInforma)onRetrieval
Tokeniza*on:languageissues Arabic(orHebrew)isbasicallywriXenrighttole`,butwithcertainitemslikenumberswriXenle`toright
Wordsareseparated,butleXerformswithinawordformcomplexligatures
←→←→←start ‘Algeriaachieveditsindependencein1962a`er132yearsofFrenchoccupa*on.’
WithUnicode,thesurfacepresenta*oniscomplex,butthestoredformisstraighlorward
Sec. 2.2.1
13
Introduc)ontoInforma)onRetrieval
Stopwords
Withastoplist,youexcludefromthedic*onaryen*relythecommonestwords.Intui*on: TheyhaveliXleseman*ccontent:the,a,and,to,be Therearealotofthem:~30%ofpos*ngsfortop30words
Butthetrendisawayfromdoingthis: Goodcompressiontechniques(lecture5)meansthespacefor
includingstopwordsinasystemisverysmall Goodqueryop*miza*ontechniques(lecture7)meanyoupayliXle
atquery*meforincludingstopwords. Youneedthemfor:
Phrasequeries:“KingofDenmark” Varioussong*tles,etc.:“Letitbe”,“Tobeornottobe” “Rela*onal”queries:“flightstoLondon”
Sec. 2.2.2
14
Introduc)ontoInforma)onRetrieval
Normaliza*ontoterms Weneedto“normalize”wordsinindexedtextaswellasquerywordsintothesameform WewanttomatchU.S.A.andUSA
Resultisterms:atermisa(normalized)wordtype,whichisanentryinourIRsystemdic*onary
Wemostcommonlyimplicitlydefineequivalenceclassesoftermsby,e.g., dele*ngperiodstoformaterm
U.S.A.,USAUSA
dele*nghyphenstoformaterm an>‐discriminatory,an>discriminatoryan>discriminatory
Sec. 2.2.3
15
Introduc)ontoInforma)onRetrieval
Normaliza*on:otherlanguages Accents:e.g.,Frenchrésumévs.resume. Umlauts:e.g.,German:Tuebingenvs.Tübingen
Shouldbeequivalent Mostimportantcriterion:
Howareyourusersliketowritetheirqueriesforthesewords?
Eveninlanguagesthatstandardlyhaveaccents,userso`enmaynottypethem O`enbesttonormalizetoade‐accentedterm
Tuebingen,Tübingen,TubingenTubingen
Sec. 2.2.3
16
Introduc)ontoInforma)onRetrieval
Normaliza*on:otherlanguages Normaliza*onofthingslikedateforms
7月30日 vs. 7/30
Japanese use of kana vs. Chinese characters
Tokeniza*onandnormaliza*onmaydependonthelanguageandsoisintertwinedwithlanguagedetec*on
Crucial:Needto“normalize”indexedtextaswellasquerytermsintothesameform
Morgen will ich in MIT … Is this
German “mit”?
Sec. 2.2.3
17
Introduc)ontoInforma)onRetrieval
Casefolding ReduceallleXerstolowercase
excep*on:uppercaseinmid‐sentence? e.g.,GeneralMotors Fedvs.fed SAILvs.sail
O`enbesttolowercaseeverything,sinceuserswilluselowercaseregardlessof‘correct’capitaliza*on…
Googleexample: QueryC.A.T. #1resultwasfor“cat”(well,Lolcats)notCaterpillarInc.
Sec. 2.2.3
18
Introduc)ontoInforma)onRetrieval
Normaliza*ontoterms
Analterna*vetoequivalenceclassingistodoasymmetricexpansion
Anexampleofwherethismaybeuseful Enter:window Search:window,windows Enter:windows Search:Windows,windows,window Enter:Windows Search:Windows
Poten*allymorepowerful,butlessefficient
Sec. 2.2.3
19
Introduc)ontoInforma)onRetrieval
Thesauriandsoundex Dowehandlesynonymsandhomonyms?
E.g.,byhand‐constructedequivalenceclasses car=automobile color=colour
Wecanrewritetoformequivalence‐classterms Whenthedocumentcontainsautomobile,indexitundercar‐automobile(andvice‐versa)
Orwecanexpandaquery Whenthequerycontainsautomobile,lookundercaraswell
Whataboutspellingmistakes? Oneapproachissoundex,whichformsequivalenceclassesofwordsbasedonphone*cheuris*cs
Moreinlectures3and920
Introduc)ontoInforma)onRetrieval
Lemma*za*on Reduceinflec*onal/variantformstobaseform E.g.,
am,are,is→be
car,cars,car's,cars'→car
theboy'scarsaredifferentcolors→theboycarbedifferentcolor
Lemma*za*onimpliesdoing“proper”reduc*ontodic*onaryheadwordform
Sec. 2.2.4
21
Introduc)ontoInforma)onRetrieval
Stemming Reducetermstotheir“roots”beforeindexing “Stemming”suggestcrudeaffixchopping
languagedependent e.g.,automate(s),automa>c,automa>onallreducedtoautomat.
for example compressed and compression are both accepted as equivalent to compress.
for exampl compress and compress ar both accept as equival to compress
Sec. 2.2.4
22
Introduc)ontoInforma)onRetrieval
Porter’salgorithm CommonestalgorithmforstemmingEnglish
Resultssuggestit’satleastasgoodasotherstemmingop*ons
Conven*ons+5phasesofreduc*ons phasesappliedsequen*ally eachphaseconsistsofasetofcommands sampleconven*on:Oftherulesinacompoundcommand,selecttheonethatappliestothelongestsuffix.
Sec. 2.2.4
23
Introduc)ontoInforma)onRetrieval
TypicalrulesinPorter sses→ss ies→i a)onal→ate )onal→)on
Rulessensi*vetothemeasureofwords (m>1)EMENT→
replacement→replac cement→cement
Sec. 2.2.4
24
Introduc)ontoInforma)onRetrieval
Otherstemmers Otherstemmersexist,e.g.,Lovinsstemmer
hXp://www.comp.lancs.ac.uk/compu*ng/research/stemming/general/lovins.htm Single‐pass,longestsuffixremoval(about250rules)
Fullmorphologicalanalysis–atmostmodestbenefitsforretrieval
Dostemmingandothernormaliza*onshelp? English:verymixedresults.Helpsrecallbutharmsprecision
opera*ve(den*stry)⇒ oper opera*onal(research)⇒ oper opera*ng(systems)⇒ oper
Definitely useful for Spanish, German, Finnish, … 30% performance gains for Finnish!
Sec. 2.2.4
25
Introduc)ontoInforma)onRetrieval
Language‐specificity Manyoftheabovefeaturesembodytransforma*onsthatare Language‐specificand O`en,applica*on‐specific
Theseare“plug‐in”addendatotheindexingprocess Bothopensourceandcommercialplug‐insareavailableforhandlingthese
Sec. 2.2.4
26
Introduc)ontoInforma)onRetrieval
Dic*onaryentries–firstcutensemble.french
時間.japanese
MIT.english
mit.german
guaranteed.english
entries.english
sometimes.english
tokenization.english
These may be grouped by language (or
not…). More on this in ranking/query
processing.
Sec. 2.2
27
Introduc)ontoInforma)onRetrieval
FASTERPOSTINGSMERGES:SKIPPOINTERS/SKIPLISTS
28
Introduc)ontoInforma)onRetrieval
Recallbasicmerge Walkthroughthetwopos*ngssimultaneously,in*melinearinthetotalnumberofpos*ngsentries
128
31
2 4 8 41 48 64
1 2 3 8 11 17 21
Brutus
Caesar 2 8
If the list lengths are m and n, the merge takes O(m+n) operations.
Can we do better? Yes (if index isn’t changing too fast).
Sec. 2.3
29
Introduc)ontoInforma)onRetrieval
Augmentpos*ngswithskippointers(atindexing*me)
Why? Toskippos*ngsthatwillnotfigureinthesearchresults.
How? Wheredoweplaceskippointers?
128 2 4 8 41 48 64
31 1 2 3 8 11 17 21 31 11
41 128
Sec. 2.3
30
Introduc)ontoInforma)onRetrieval
Queryprocessingwithskippointers
128 2 4 8 41 48 64
31 1 2 3 8 11 17 21 31 11
41 128
Suppose we’ve stepped through the lists until we process 8 on each list. We match it and advance.
We then have 41 and 11 on the lower. 11 is smaller.
But the skip successor of 11 on the lower list is 31, so we can skip ahead past the intervening postings.
Sec. 2.3
31
Introduc)ontoInforma)onRetrieval
Wheredoweplaceskips? Tradeoff:
Moreskips→shorterskipspans⇒morelikelytoskip.Butlotsofcomparisonstoskippointers.
Fewerskips→fewpointercomparison,butthenlongskipspans⇒fewsuccessfulskips.
Sec. 2.3
32
Introduc)ontoInforma)onRetrieval
Placingskips Simpleheuris*c:forpos*ngsoflengthL,use√Levenly‐spacedskippointers.
Thisignoresthedistribu*onofqueryterms. Easyiftheindexisrela*velysta*c;harderifLkeepschangingbecauseofupdates.
Thisdefinitelyusedtohelp;withmodernhardwareitmaynot(Bahleetal.2002)unlessyou’rememory‐based TheI/Ocostofloadingabiggerpos*ngslistcanoutweighthegainsfromquickerinmemorymerging!
Sec. 2.3
33
Introduc)ontoInforma)onRetrieval
PHRASEQUERIESANDPOSITIONALINDEXES
34
Introduc)ontoInforma)onRetrieval
Phrasequeries Wanttobeabletoanswerqueriessuchas“stanforduniversity” –asaphrase
Thusthesentence“IwenttouniversityatStanford”isnotamatch. Theconceptofphrasequerieshasproveneasilyunderstoodbyusers;oneofthefew“advancedsearch”ideasthatworks
Manymorequeriesareimplicitphrasequeries
Forthis,itnolongersufficestostoreonly<term:docs>entries
Sec. 2.4
35
Introduc)ontoInforma)onRetrieval
AfirstaXempt:Biwordindexes Indexeveryconsecu*vepairoftermsinthetextasaphrase
Forexamplethetext“Friends,Romans,Countrymen”wouldgeneratethebiwords friendsromans romanscountrymen
Eachofthesebiwordsisnowadic*onaryterm Two‐wordphrasequery‐processingisnowimmediate.
Sec. 2.4.1
36
Introduc)ontoInforma)onRetrieval
Longerphrasequeries Longerphrasesareprocessedaswedidwithwild‐cards:
stanforduniversitypaloaltocanbebrokenintotheBooleanqueryonbiwords:
stanforduniversityANDuniversitypaloANDpaloaltoWithoutthedocs,wecannotverifythatthedocsmatchingtheaboveBooleanquerydocontainthephrase.
Can have false positives!
Sec. 2.4.1
37
Introduc)ontoInforma)onRetrieval
Extendedbiwords Parsetheindexedtextandperformpart‐of‐speech‐tagging
(POST). Bucketthetermsinto(say)Nouns(N)andar*cles/
preposi*ons(X). CallanystringoftermsoftheformNX*Nanextended
biword. Eachsuchextendedbiwordisnowmadeaterminthedic*onary.
Example:catcherintheryeNXXN
Queryprocessing:parseitintoN’sandX’s Segmentqueryintoenhancedbiwords Lookupinindex:catcherrye
Sec. 2.4.1
38
Introduc)ontoInforma)onRetrieval
Issuesforbiwordindexes Falseposi*ves,asnotedbefore Indexblowupduetobiggerdic*onary
Infeasibleformorethanbiwords,bigevenforthem
Biwordindexesarenotthestandardsolu*on(forallbiwords)butcanbepartofacompoundstrategy
Sec. 2.4.1
39
Introduc)ontoInforma)onRetrieval
Solu*on2:Posi*onalindexes Inthepos*ngs,storeforeachtermtheposi*on(s)inwhichtokensofitappear:
<term,numberofdocscontainingterm;doc1:posi*on1,posi*on2…;doc2:posi*on1,posi*on2…;etc.>
Sec. 2.4.2
40
Introduc)ontoInforma)onRetrieval
Posi*onalindexexample
Forphrasequeries,weuseamergealgorithmrecursivelyatthedocumentlevel
Butwenowneedtodealwithmorethanjustequality
<be: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367, …>
Which of docs 1,2,4,5 could contain “to be
or not to be”?
Sec. 2.4.2
41
Introduc)ontoInforma)onRetrieval
Processingaphrasequery Extractinvertedindexentriesforeachdis*nctterm:to,be,or,not.
Mergetheirdoc:posi)onliststoenumerateallposi*onswith“tobeornottobe”. to:
2:1,17,74,222,551;4:8,16,190,429,433;7:13,23,191;... be:
1:17,19;4:17,191,291,430,434;5:14,19,101;...
Samegeneralmethodforproximitysearches
Sec. 2.4.2
42
Introduc)ontoInforma)onRetrieval
Proximityqueries LIMIT!/3STATUTE/3FEDERAL/2TORT
Again,here,/kmeans“withinkwordsof”.
Clearly,posi*onalindexescanbeusedforsuchqueries;biwordindexescannot.
Exercise:Adaptthelinearmergeofpos*ngstohandleproximityqueries.Canyoumakeitworkforanyvalueofk? ThisisaliXletrickytodocorrectlyandefficiently SeeFigure2.12ofIIR There’slikelytobeaproblemonit!
Sec. 2.4.2
43
Introduc)ontoInforma)onRetrieval
Posi*onalindexsize Youcancompressposi*onvalues/offsets:we’lltalkaboutthatinlecture5
Nevertheless,aposi*onalindexexpandspos*ngsstoragesubstan)ally
Nevertheless,aposi*onalindexisnowstandardlyusedbecauseofthepowerandusefulnessofphraseandproximityqueries…whetherusedexplicitlyorimplicitlyinarankingretrievalsystem.
Sec. 2.4.2
44
Introduc)ontoInforma)onRetrieval
Posi*onalindexsize Needanentryforeachoccurrence,notjustonceperdocument
Indexsizedependsonaveragedocumentsize Averagewebpagehas<1000terms SECfilings,books,evensomeepicpoems…easily100,000terms
Consideratermwithfrequency0.1%
Why?
100 1 100,000
1 1 1000
Positional postings Postings Document size
Sec. 2.4.2
45
Introduc)ontoInforma)onRetrieval
Rulesofthumb Aposi*onalindexis2–4aslargeasanon‐posi*onalindex
Posi*onalindexsize35–50%ofvolumeoforiginaltext
Caveat:allofthisholdsfor“English‐like”languages
Sec. 2.4.2
46
Introduc)ontoInforma)onRetrieval
Combina*onschemes
Thesetwoapproachescanbeprofitablycombined Forpar*cularphrases(“MichaelJackson”,“BritneySpears”)itisinefficienttokeeponmergingposi*onalpos*ngslists Evenmoresoforphraseslike“TheWho”
Williamsetal.(2004)evaluateamoresophis*catedmixedindexingscheme Atypicalwebquerymixturewasexecutedin¼ofthe*meofusingjustaposi*onalindex
Itrequired26%morespacethanhavingaposi*onalindexalone
Sec. 2.4.3
47
Introduc)ontoInforma)onRetrieval
Resourcesfortoday’slecture IIR2 MG3.6,4.3;MIR7.2 Porter’sstemmer:
hXp://www.tartarus.org/~mar*n/PorterStemmer/
SkipListstheory:Pugh(1990) Mul*levelskiplistsgivesameO(logn)efficiencyastrees
H.E. Williams, J. Zobel, and D. Bahle. 2004. “Fast Phrase Querying with Combined Indexes”, ACM Transactions on Information Systems. hXp://www.seg.rmit.edu.au/research/research.php?author=4
D.Bahle,H.Williams,andJ.Zobel.Efficientphrasequeryingwithanauxiliaryindex.SIGIR2002,pp.215‐221.
48