Introducon to Informaon Retrievaldisi.unitn.it/moschitti/Teaching-slides/NLP-IR/lecture02... ·...

Introduc)ontoInforma)onRetrieval

Introduc*onto

Informa(onRetrieval

CS276:Informa*onRetrievalandWebSearchPanduNayakandPrabhakarRaghavan

Lecture2:Thetermvocabularyandpos*ngslists


Recapofthepreviouslecture  Basicinvertedindexes:

  Structure:Dic*onaryandPos*ngs

  Keystepinconstruc*on:Sor*ng  Booleanqueryprocessing

  Intersec*onbylinear*me“merging”  Simpleop*miza*ons

  Overviewofcoursetopics

Ch. 1

2


PlanforthislectureElaboratebasicindexing  Preprocessingtoformthetermvocabulary

  Documents  Tokeniza*on Whattermsdoweputintheindex?

  Pos*ngs  Fastermerges:skiplists  Posi*onalpos*ngsandphrasequeries

3


Recallthebasicindexingpipeline

Tokenizer

Token stream. Friends Romans Countrymen Linguistic modules

Modified tokens. friend roman countryman

Indexer

Inverted index.

friend

roman

countryman

2 4

2

13 16

1

Documents to be indexed.

Friends, Romans, countrymen.

4


Parsingadocument  Whatformatisitin?

  pdf/word/excel/html?  Whatlanguageisitin?  Whatcharactersetisinuse?

Each of these is a classification problem, which we will study later in the course.

But these tasks are often done heuristically …

Sec. 2.1

5


Complica*ons:Format/language  Documentsbeingindexedcanincludedocsfrommanydifferentlanguages  Asingleindexmayhavetocontaintermsofseverallanguages.

  Some*mesadocumentoritscomponentscancontainmul*plelanguages/formats  FrenchemailwithaGermanpdfaXachment.

  Whatisaunitdocument?  Afile?  Anemail?(Perhapsoneofmanyinanmbox.)  Anemailwith5aXachments?  Agroupoffiles(PPTorLaTeXasHTMLpages)

Sec. 2.1

6


TOKENSANDTERMS

7


Tokeniza*on  Input:“Friends,Romans,Countrymen”  Output:Tokens

  Friends  Romans  Countrymen

  Atokenisasequenceofcharactersinadocument  Eachsuchtokenisnowacandidateforanindexentry,a`erfurtherprocessing  Describedbelow

  Butwhatarevalidtokenstoemit?

Sec. 2.2.1

8


Tokeniza*on  Issuesintokeniza*on:

  Finland’scapital→Finland?Finlands?Finland’s?  Hewle9‐Packard→Hewle9andPackardastwotokens?  state‐of‐the‐art:breakuphyphenatedsequence.  co‐educa>on  lowercase,lower‐case,lowercase?  Itcanbeeffec*vetogettheusertoputinpossiblehyphens

  SanFrancisco:onetokenortwo?  Howdoyoudecideitisonetoken?

Sec. 2.2.1

9


Numbers  3/12/91 Mar.12,1991 12/3/91  55B.C.  B‐52  MyPGPkeyis324a3df234cb23e  (800)234‐2333

  Oènhaveembeddedspaces  OlderIRsystemsmaynotindexnumbers

  Butoènveryuseful:thinkaboutthingslikelookinguperrorcodes/stacktracesontheweb

  (Oneanswerisusingn‐grams:Lecture3)

  Willoènindex“meta‐data”separately  Crea*ondate,format,etc.

Sec. 2.2.1

10


Tokeniza*on:languageissues  French

  L'ensemble→onetokenortwo?  L?L’ ?Le?  Wantl’ensembletomatchwithunensemble

  Un*latleast2003,itdidn’tonGoogle  Interna*onaliza*on!

  Germannouncompoundsarenotsegmented  LebensversicherungsgesellschaTsangestellter  ‘lifeinsurancecompanyemployee’  Germanretrievalsystemsbenefitgreatlyfromacompoundspli>er

module  Cangivea15%performanceboostforGerman

Sec. 2.2.1

11


Tokeniza*on:languageissues

  ChineseandJapanesehavenospacesbetweenwords:  莎拉波娃现在居住在美国东南部的佛罗里达。   Notalwaysguaranteedauniquetokeniza*on

  FurthercomplicatedinJapanese,withmul*plealphabetsintermingled  Dates/amountsinmul*pleformats

フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)

Katakana Hiragana Kanji Romaji

End-user can express query entirely in hiragana!

Sec. 2.2.1

12


Tokeniza*on:languageissues  Arabic(orHebrew)isbasicallywriXenrighttole`,butwithcertainitemslikenumberswriXenle`toright

  Wordsareseparated,butleXerformswithinawordformcomplexligatures

  ←→←→←start  ‘Algeriaachieveditsindependencein1962a`er132yearsofFrenchoccupa*on.’

  WithUnicode,thesurfacepresenta*oniscomplex,butthestoredformisstraighlorward

Sec. 2.2.1

13


Stopwords

  Withastoplist,youexcludefromthedic*onaryen*relythecommonestwords.Intui*on:  TheyhaveliXleseman*ccontent:the,a,and,to,be  Therearealotofthem:~30%ofpos*ngsfortop30words

  Butthetrendisawayfromdoingthis:  Goodcompressiontechniques(lecture5)meansthespacefor

includingstopwordsinasystemisverysmall  Goodqueryop*miza*ontechniques(lecture7)meanyoupayliXle

atquery*meforincludingstopwords.  Youneedthemfor:

  Phrasequeries:“KingofDenmark”  Varioussong*tles,etc.:“Letitbe”,“Tobeornottobe”  “Rela*onal”queries:“flightstoLondon”

Sec. 2.2.2

14


Normaliza*ontoterms  Weneedto“normalize”wordsinindexedtextaswellasquerywordsintothesameform  WewanttomatchU.S.A.andUSA

  Resultisterms:atermisa(normalized)wordtype,whichisanentryinourIRsystemdic*onary

  Wemostcommonlyimplicitlydefineequivalenceclassesoftermsby,e.g.,  dele*ngperiodstoformaterm

  U.S.A.,USAUSA

  dele*nghyphenstoformaterm  an>‐discriminatory,an>discriminatoryan>discriminatory

Sec. 2.2.3

15


Normaliza*on:otherlanguages  Accents:e.g.,Frenchrésumévs.resume.  Umlauts:e.g.,German:Tuebingenvs.Tübingen

  Shouldbeequivalent  Mostimportantcriterion:

  Howareyourusersliketowritetheirqueriesforthesewords?

  Eveninlanguagesthatstandardlyhaveaccents,userso`enmaynottypethem  O`enbesttonormalizetoade‐accentedterm

  Tuebingen,Tübingen,TubingenTubingen

Sec. 2.2.3

16


Normaliza*on:otherlanguages  Normaliza*onofthingslikedateforms

  7月30日 vs. 7/30

  Japanese use of kana vs. Chinese characters

  Tokeniza*onandnormaliza*onmaydependonthelanguageandsoisintertwinedwithlanguagedetec*on

  Crucial:Needto“normalize”indexedtextaswellasquerytermsintothesameform

Morgen will ich in MIT … Is this

German “mit”?

Sec. 2.2.3

17


Casefolding  ReduceallleXerstolowercase

  excep*on:uppercaseinmid‐sentence?  e.g.,GeneralMotors  Fedvs.fed  SAILvs.sail

  O`enbesttolowercaseeverything,sinceuserswilluselowercaseregardlessof‘correct’capitaliza*on…

  Googleexample:  QueryC.A.T.  #1resultwasfor“cat”(well,Lolcats)notCaterpillarInc.

Sec. 2.2.3

18


Normaliza*ontoterms

  Analterna*vetoequivalenceclassingistodoasymmetricexpansion

  Anexampleofwherethismaybeuseful  Enter:window Search:window,windows  Enter:windows Search:Windows,windows,window  Enter:Windows Search:Windows

  Poten*allymorepowerful,butlessefficient

Sec. 2.2.3

19


Thesauriandsoundex  Dowehandlesynonymsandhomonyms?

  E.g.,byhand‐constructedequivalenceclasses  car=automobile color=colour

  Wecanrewritetoformequivalence‐classterms  Whenthedocumentcontainsautomobile,indexitundercar‐automobile(andvice‐versa)

  Orwecanexpandaquery  Whenthequerycontainsautomobile,lookundercaraswell

  Whataboutspellingmistakes?  Oneapproachissoundex,whichformsequivalenceclassesofwordsbasedonphone*cheuris*cs

  Moreinlectures3and920


Lemma*za*on  Reduceinflec*onal/variantformstobaseform  E.g.,

  am,are,is→be

  car,cars,car's,cars'→car

  theboy'scarsaredifferentcolors→theboycarbedifferentcolor

  Lemma*za*onimpliesdoing“proper”reduc*ontodic*onaryheadwordform

Sec. 2.2.4

21


Stemming  Reducetermstotheir“roots”beforeindexing  “Stemming”suggestcrudeaffixchopping

  languagedependent  e.g.,automate(s),automa>c,automa>onallreducedtoautomat.

for example compressed and compression are both accepted as equivalent to compress.

for exampl compress and compress ar both accept as equival to compress

Sec. 2.2.4

22


Porter’salgorithm  CommonestalgorithmforstemmingEnglish

  Resultssuggestit’satleastasgoodasotherstemmingop*ons

  Conven*ons+5phasesofreduc*ons  phasesappliedsequen*ally  eachphaseconsistsofasetofcommands  sampleconven*on:Oftherulesinacompoundcommand,selecttheonethatappliestothelongestsuffix.

Sec. 2.2.4

23


TypicalrulesinPorter  sses→ss  ies→i  a)onal→ate  )onal→)on

  Rulessensi*vetothemeasureofwords  (m>1)EMENT→

  replacement→replac  cement→cement

Sec. 2.2.4

24


Otherstemmers  Otherstemmersexist,e.g.,Lovinsstemmer

  hXp://www.comp.lancs.ac.uk/compu*ng/research/stemming/general/lovins.htm  Single‐pass,longestsuffixremoval(about250rules)

  Fullmorphologicalanalysis–atmostmodestbenefitsforretrieval

  Dostemmingandothernormaliza*onshelp?  English:verymixedresults.Helpsrecallbutharmsprecision

  opera*ve(den*stry)⇒ oper  opera*onal(research)⇒ oper  opera*ng(systems)⇒ oper

  Definitely useful for Spanish, German, Finnish, …   30% performance gains for Finnish!

Sec. 2.2.4

25


Language‐specificity  Manyoftheabovefeaturesembodytransforma*onsthatare  Language‐specificand  O`en,applica*on‐specific

  Theseare“plug‐in”addendatotheindexingprocess  Bothopensourceandcommercialplug‐insareavailableforhandlingthese

Sec. 2.2.4

26


Dic*onaryentries–firstcutensemble.french

時間.japanese

MIT.english

mit.german

guaranteed.english

entries.english

sometimes.english

tokenization.english

These may be grouped by language (or

not…). More on this in ranking/query

processing.

Sec. 2.2

27


FASTERPOSTINGSMERGES:SKIPPOINTERS/SKIPLISTS

28


Recallbasicmerge  Walkthroughthetwopos*ngssimultaneously,in*melinearinthetotalnumberofpos*ngsentries

128

31

2 4 8 41 48 64

1 2 3 8 11 17 21

Brutus

Caesar 2 8

If the list lengths are m and n, the merge takes O(m+n) operations.

Can we do better? Yes (if index isn’t changing too fast).

Sec. 2.3

29


Augmentpos*ngswithskippointers(atindexing*me)

  Why?  Toskippos*ngsthatwillnotfigureinthesearchresults.

  How?  Wheredoweplaceskippointers?

128 2 4 8 41 48 64

31 1 2 3 8 11 17 21 31 11

41 128

Sec. 2.3

30


Queryprocessingwithskippointers

128 2 4 8 41 48 64

31 1 2 3 8 11 17 21 31 11

41 128

Suppose we’ve stepped through the lists until we process 8 on each list. We match it and advance.

We then have 41 and 11 on the lower. 11 is smaller.

But the skip successor of 11 on the lower list is 31, so we can skip ahead past the intervening postings.

Sec. 2.3

31


Wheredoweplaceskips?  Tradeoff:

  Moreskips→shorterskipspans⇒morelikelytoskip.Butlotsofcomparisonstoskippointers.

  Fewerskips→fewpointercomparison,butthenlongskipspans⇒fewsuccessfulskips.

Sec. 2.3

32


Placingskips  Simpleheuris*c:forpos*ngsoflengthL,use√Levenly‐spacedskippointers.

  Thisignoresthedistribu*onofqueryterms.  Easyiftheindexisrela*velysta*c;harderifLkeepschangingbecauseofupdates.

  Thisdefinitelyusedtohelp;withmodernhardwareitmaynot(Bahleetal.2002)unlessyou’rememory‐based  TheI/Ocostofloadingabiggerpos*ngslistcanoutweighthegainsfromquickerinmemorymerging!

Sec. 2.3

33


PHRASEQUERIESANDPOSITIONALINDEXES

34


Phrasequeries  Wanttobeabletoanswerqueriessuchas“stanforduniversity” –asaphrase

  Thusthesentence“IwenttouniversityatStanford”isnotamatch.  Theconceptofphrasequerieshasproveneasilyunderstoodbyusers;oneofthefew“advancedsearch”ideasthatworks

  Manymorequeriesareimplicitphrasequeries

  Forthis,itnolongersufficestostoreonly<term:docs>entries

Sec. 2.4

35


AfirstaXempt:Biwordindexes  Indexeveryconsecu*vepairoftermsinthetextasaphrase

  Forexamplethetext“Friends,Romans,Countrymen”wouldgeneratethebiwords  friendsromans  romanscountrymen

  Eachofthesebiwordsisnowadic*onaryterm  Two‐wordphrasequery‐processingisnowimmediate.

Sec. 2.4.1

36


Longerphrasequeries  Longerphrasesareprocessedaswedidwithwild‐cards:

  stanforduniversitypaloaltocanbebrokenintotheBooleanqueryonbiwords:

stanforduniversityANDuniversitypaloANDpaloaltoWithoutthedocs,wecannotverifythatthedocsmatchingtheaboveBooleanquerydocontainthephrase.

Can have false positives!

Sec. 2.4.1

37


Extendedbiwords  Parsetheindexedtextandperformpart‐of‐speech‐tagging

(POST).  Bucketthetermsinto(say)Nouns(N)andar*cles/

preposi*ons(X).  CallanystringoftermsoftheformNX*Nanextended

biword.  Eachsuchextendedbiwordisnowmadeaterminthedic*onary.

  Example:catcherintheryeNXXN

  Queryprocessing:parseitintoN’sandX’s  Segmentqueryintoenhancedbiwords  Lookupinindex:catcherrye

Sec. 2.4.1

38


Issuesforbiwordindexes  Falseposi*ves,asnotedbefore  Indexblowupduetobiggerdic*onary

  Infeasibleformorethanbiwords,bigevenforthem

  Biwordindexesarenotthestandardsolu*on(forallbiwords)butcanbepartofacompoundstrategy

Sec. 2.4.1

39


Solu*on2:Posi*onalindexes  Inthepos*ngs,storeforeachtermtheposi*on(s)inwhichtokensofitappear:

<term,numberofdocscontainingterm;doc1:posi*on1,posi*on2…;doc2:posi*on1,posi*on2…;etc.>

Sec. 2.4.2

40


Posi*onalindexexample

  Forphrasequeries,weuseamergealgorithmrecursivelyatthedocumentlevel

  Butwenowneedtodealwithmorethanjustequality

<be: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367, …>

Which of docs 1,2,4,5 could contain “to be

or not to be”?

Sec. 2.4.2

41


Processingaphrasequery  Extractinvertedindexentriesforeachdis*nctterm:to,be,or,not.

  Mergetheirdoc:posi)onliststoenumerateallposi*onswith“tobeornottobe”.  to:

  2:1,17,74,222,551;4:8,16,190,429,433;7:13,23,191;...  be:

  1:17,19;4:17,191,291,430,434;5:14,19,101;...

  Samegeneralmethodforproximitysearches

Sec. 2.4.2

42


Proximityqueries  LIMIT!/3STATUTE/3FEDERAL/2TORT

  Again,here,/kmeans“withinkwordsof”.

  Clearly,posi*onalindexescanbeusedforsuchqueries;biwordindexescannot.

  Exercise:Adaptthelinearmergeofpos*ngstohandleproximityqueries.Canyoumakeitworkforanyvalueofk?  ThisisaliXletrickytodocorrectlyandefficiently  SeeFigure2.12ofIIR  There’slikelytobeaproblemonit!

Sec. 2.4.2

43


Posi*onalindexsize  Youcancompressposi*onvalues/offsets:we’lltalkaboutthatinlecture5

  Nevertheless,aposi*onalindexexpandspos*ngsstoragesubstan)ally

  Nevertheless,aposi*onalindexisnowstandardlyusedbecauseofthepowerandusefulnessofphraseandproximityqueries…whetherusedexplicitlyorimplicitlyinarankingretrievalsystem.

Sec. 2.4.2

44


Posi*onalindexsize  Needanentryforeachoccurrence,notjustonceperdocument

  Indexsizedependsonaveragedocumentsize  Averagewebpagehas<1000terms  SECfilings,books,evensomeepicpoems…easily100,000terms

  Consideratermwithfrequency0.1%

Why?

100 1 100,000

1 1 1000

Positional postings Postings Document size

Sec. 2.4.2

45


Rulesofthumb  Aposi*onalindexis2–4aslargeasanon‐posi*onalindex

  Posi*onalindexsize35–50%ofvolumeoforiginaltext

  Caveat:allofthisholdsfor“English‐like”languages

Sec. 2.4.2

46


Combina*onschemes

  Thesetwoapproachescanbeprofitablycombined  Forpar*cularphrases(“MichaelJackson”,“BritneySpears”)itisinefficienttokeeponmergingposi*onalpos*ngslists  Evenmoresoforphraseslike“TheWho”

  Williamsetal.(2004)evaluateamoresophis*catedmixedindexingscheme  Atypicalwebquerymixturewasexecutedin¼ofthe*meofusingjustaposi*onalindex

  Itrequired26%morespacethanhavingaposi*onalindexalone

Sec. 2.4.3

47


Resourcesfortoday’slecture  IIR2  MG3.6,4.3;MIR7.2  Porter’sstemmer:

hXp://www.tartarus.org/~mar*n/PorterStemmer/

  SkipListstheory:Pugh(1990)  Mul*levelskiplistsgivesameO(logn)efficiencyastrees

  H.E. Williams, J. Zobel, and D. Bahle. 2004. “Fast Phrase Querying with Combined Indexes”, ACM Transactions on Information Systems. hXp://www.seg.rmit.edu.au/research/research.php?author=4

  D.Bahle,H.Williams,andJ.Zobel.Efficientphrasequeryingwithanauxiliaryindex.SIGIR2002,pp.215‐221.

48

Introducon to Informaon Retrievaldisi.unitn.it/moschitti/Teaching-slides/NLP-IR/lecture02... ·...

Documents

Transcript of Introducon to Informaon Retrievaldisi.unitn.it/moschitti/Teaching-slides/NLP-IR/lecture02... ·...