Introduction to Information Retrieval · disi.unitn.it/moschitti/Teaching-slides/NLP-IR/lecture02...

Introduction to Information Retrieval. CS276: Information Retrieval and Web Search. Pandu Nayak and Prabhakar Raghavan. Lecture 2: The term vocabulary and postings lists

Page 1: Introduction to Information Retrieval · 2012-03-06

CS276: Information Retrieval and Web Search
Pandu Nayak and Prabhakar Raghavan

Lecture 2: The term vocabulary and postings lists

Page 2: Recap of the previous lecture

- Basic inverted indexes:
  - Structure: Dictionary and Postings
  - Key step in construction: Sorting
- Boolean query processing
  - Intersection by linear time “merging”
  - Simple optimizations
- Overview of course topics

Ch. 1

Page 3: Plan for this lecture

Elaborate basic indexing
- Preprocessing to form the term vocabulary
  - Documents
  - Tokenization
  - What terms do we put in the index?
- Postings
  - Faster merges: skip lists
  - Positional postings and phrase queries

Page 4: Recall the basic indexing pipeline

Documents to be indexed: Friends, Romans, countrymen.
  → Tokenizer
Token stream: Friends Romans Countrymen
  → Linguistic modules
Modified tokens: friend roman countryman
  → Indexer
Inverted index:
  friend     → 2, 4
  roman      → 1, 2
  countryman → 13, 16

Page 5: Parsing a document

- What format is it in?
  - pdf/word/excel/html?
- What language is it in?
- What character set is in use?

Each of these is a classification problem, which we will study later in the course.

But these tasks are often done heuristically …

Sec. 2.1

Page 6: Complications: Format/language

- Documents being indexed can include docs from many different languages
  - A single index may have to contain terms of several languages.
- Sometimes a document or its components can contain multiple languages/formats
  - French email with a German pdf attachment.
- What is a unit document?
  - A file?
  - An email? (Perhaps one of many in an mbox.)
  - An email with 5 attachments?
  - A group of files (PPT or LaTeX as HTML pages)

Sec. 2.1

Page 7: TOKENS AND TERMS

Page 8: Tokenization

- Input: “Friends, Romans, Countrymen”
- Output: Tokens
  - Friends
  - Romans
  - Countrymen
- A token is a sequence of characters in a document
- Each such token is now a candidate for an index entry, after further processing
  - Described below
- But what are valid tokens to emit?

Sec. 2.2.1

Page 9: Tokenization

- Issues in tokenization:
  - Finland’s capital → Finland? Finlands? Finland’s?
  - Hewlett-Packard → Hewlett and Packard as two tokens?
    - state-of-the-art: break up hyphenated sequence.
    - co-education
    - lowercase, lower-case, lower case?
    - It can be effective to get the user to put in possible hyphens
  - San Francisco: one token or two?
    - How do you decide it is one token?

Sec. 2.2.1
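
The tokenization choices above can be made concrete with a small tokenizer. This is only a sketch under stated assumptions, not the tokenizer used in the course: the regular expression and the `split_hyphens` flag are hypothetical, chosen to show how one policy decision changes the tokens that are emitted.

```python
import re

# Assumed token pattern: runs of word characters, optionally joined by
# internal apostrophes or hyphens (e.g. Finland's, Hewlett-Packard).
TOKEN_RE = re.compile(r"\w+(?:['-]\w+)*")

def tokenize(text, split_hyphens=False):
    """Return the tokens of `text` under one of two hyphen policies."""
    tokens = TOKEN_RE.findall(text)
    if split_hyphens:
        # Policy: break every hyphenated sequence into separate tokens.
        tokens = [part for tok in tokens for part in tok.split("-")]
    return tokens

print(tokenize("Friends, Romans, Countrymen"))
print(tokenize("Hewlett-Packard is state-of-the-art"))
print(tokenize("Hewlett-Packard is state-of-the-art", split_hyphens=True))
```

Whichever policy is chosen, the same function has to be applied to queries as well as to documents, otherwise the tokens will not match.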

Page 10: Numbers

- 3/12/91, Mar. 12, 1991, 12/3/91
- 55 B.C.
- B-52
- My PGP key is 324a3df234cb23e
- (800) 234-2333
  - Often have embedded spaces
  - Older IR systems may not index numbers
    - But often very useful: think about things like looking up error codes/stack traces on the web
    - (One answer is using n-grams: Lecture 3)
  - Will often index “meta-data” separately
    - Creation date, format, etc.

Sec. 2.2.1

Page 11: Tokenization: language issues

- French
  - L'ensemble → one token or two?
    - L? L’? Le?
    - Want l’ensemble to match with un ensemble
      - Until at least 2003, it didn’t on Google
        - Internationalization!
- German noun compounds are not segmented
  - Lebensversicherungsgesellschaftsangestellter
  - ‘life insurance company employee’
  - German retrieval systems benefit greatly from a compound splitter module
    - Can give a 15% performance boost for German

Sec. 2.2.1

Page 12: Tokenization: language issues

- Chinese and Japanese have no spaces between words:
  - 莎拉波娃现在居住在美国东南部的佛罗里达。
  - Not always guaranteed a unique tokenization
- Further complicated in Japanese, with multiple alphabets intermingled
  - Dates/amounts in multiple formats

フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)

Scripts in this example: Katakana, Hiragana, Kanji, Romaji

End-user can express query entirely in hiragana!

Sec. 2.2.1

Page 13: Tokenization: language issues

- Arabic (or Hebrew) is basically written right to left, but with certain items like numbers written left to right
- Words are separated, but letter forms within a word form complex ligatures
  - [Arabic example sentence; reading direction: ← → ← → ← start]
  - ‘Algeria achieved its independence in 1962 after 132 years of French occupation.’
- With Unicode, the surface presentation is complex, but the stored form is straightforward

Sec. 2.2.1

Page 14: Stop words

- With a stop list, you exclude from the dictionary entirely the commonest words. Intuition:
  - They have little semantic content: the, a, and, to, be
  - There are a lot of them: ~30% of postings for top 30 words
- But the trend is away from doing this:
  - Good compression techniques (lecture 5) mean the space for including stop words in a system is very small
  - Good query optimization techniques (lecture 7) mean you pay little at query time for including stop words.
  - You need them for:
    - Phrase queries: “King of Denmark”
    - Various song titles, etc.: “Let it be”, “To be or not to be”
    - “Relational” queries: “flights to London”

Sec. 2.2.2
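
A short sketch shows why stop lists hurt the query types listed above. The stop list here is an assumption made for illustration (the five words from the slide plus a few other common English words), not a list taken from any particular system.

```python
# Assumed stop list: the slide's examples plus a few common words.
STOP_WORDS = {"the", "a", "and", "to", "be", "or", "not", "of", "it"}

def remove_stop_words(tokens):
    """Drop every token that appears in the stop list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

queries = ["King of Denmark", "Let it be", "To be or not to be", "flights to London"]
for query in queries:
    print(query, "->", remove_stop_words(query.split()))
```

Under this list, “To be or not to be” loses every token, and “flights to London” can no longer be told apart from “flights from London”.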

Page 15: Normalization to terms

- We need to “normalize” words in indexed text as well as query words into the same form
  - We want to match U.S.A. and USA
- Result is terms: a term is a (normalized) word type, which is an entry in our IR system dictionary
- We most commonly implicitly define equivalence classes of terms by, e.g.,
  - deleting periods to form a term
    - U.S.A., USA → USA
  - deleting hyphens to form a term
    - anti-discriminatory, antidiscriminatory → antidiscriminatory

Sec. 2.2.3
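
The two deletion rules above amount to a single normalization function that is applied to indexed tokens and to query tokens alike. A minimal sketch, in which the function name `normalize` is hypothetical:

```python
def normalize(token):
    """Map a token to its equivalence-class term by deleting periods and hyphens."""
    return token.replace(".", "").replace("-", "")

# Index side and query side go through the same function, so variants meet.
indexed = {normalize(t) for t in ["U.S.A.", "anti-discriminatory"]}
print(normalize("USA") in indexed)                 # True
print(normalize("antidiscriminatory") in indexed)  # True
```

The equivalence classes are implicit: nothing lists the members of a class, they are simply all the tokens that normalize to the same string.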

Page 16: Normalization: other languages

- Accents: e.g., French résumé vs. resume.
- Umlauts: e.g., German: Tuebingen vs. Tübingen
  - Should be equivalent
- Most important criterion:
  - How are your users likely to write their queries for these words?
- Even in languages that standardly have accents, users often may not type them
  - Often best to normalize to a de-accented term
    - Tuebingen, Tübingen, Tubingen → Tubingen

Sec. 2.2.3
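
De-accenting can be sketched with Python's standard `unicodedata` module: decompose each character and drop the combining marks. The `normalize_german` helper is a hypothetical addition that also folds the ASCII transliterations ae/oe/ue; it is deliberately crude and is only there to show how Tuebingen could join the same class.

```python
import unicodedata

def strip_accents(term):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def normalize_german(term):
    # Assumption for illustration: fold lowercase ae/oe/ue to a/o/u as well.
    # This over-folds ordinary words that contain these letter pairs
    # (e.g. "neue"), so a real system would need a more careful rule.
    folded = strip_accents(term)
    for digraph, vowel in (("ae", "a"), ("oe", "o"), ("ue", "u")):
        folded = folded.replace(digraph, vowel)
    return folded

print(strip_accents("résumé"))
print([normalize_german(t) for t in ["Tuebingen", "Tübingen", "Tubingen"]])
```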

Page 17: Normalization: other languages

- Normalization of things like date forms
  - 7月30日 vs. 7/30
  - Japanese use of kana vs. Chinese characters
- Tokenization and normalization may depend on the language and so is intertwined with language detection
  - Morgen will ich in MIT … (Is this German “mit”?)
- Crucial: Need to “normalize” indexed text as well as query terms into the same form

Sec. 2.2.3

Page 18: Case folding

- Reduce all letters to lower case
  - exception: upper case in mid-sentence?
    - e.g., General Motors
    - Fed vs. fed
    - SAIL vs. sail
  - Often best to lower case everything, since users will use lowercase regardless of ‘correct’ capitalization…
- Google example:
  - Query C.A.T.
  - #1 result was for “cat” (well, Lolcats) not Caterpillar Inc.

Sec. 2.2.3

Page 19: Normalization to terms

- An alternative to equivalence classing is to do asymmetric expansion
- An example of where this may be useful
  - Enter: window     Search: window, windows
  - Enter: windows    Search: Windows, windows, window
  - Enter: Windows    Search: Windows
- Potentially more powerful, but less efficient

Sec. 2.2.3
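
Asymmetric expansion can be sketched as a query-time lookup table, with the index itself left unnormalized. The table below copies the window/windows/Windows example from the slide; the postings in `index` and the function names are made up for illustration.

```python
# Query-time expansion table from the slide's example. It is asymmetric:
# "window" expands to "windows", but "Windows" does not expand to "window".
EXPANSIONS = {
    "window": ["window", "windows"],
    "windows": ["Windows", "windows", "window"],
    "Windows": ["Windows"],
}

def expand_query_term(term):
    """Return the terms to look up for one query term."""
    return EXPANSIONS.get(term, [term])

def search(index, term):
    """Union of the postings of every expansion of the query term."""
    docs = set()
    for variant in expand_query_term(term):
        docs.update(index.get(variant, []))
    return sorted(docs)

index = {"window": [1, 4], "windows": [2], "Windows": [3]}  # made-up postings
for query in ("window", "windows", "Windows"):
    print(query, "->", search(index, query))
```

The efficiency cost mentioned above is visible here: one query term can turn into several postings lookups and a union.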

Page 20: Thesauri and soundex

- Do we handle synonyms and homonyms?
  - E.g., by hand-constructed equivalence classes
    - car = automobile, color = colour
  - We can rewrite to form equivalence-class terms
    - When the document contains automobile, index it under car-automobile (and vice-versa)
  - Or we can expand a query
    - When the query contains automobile, look under car as well
- What about spelling mistakes?
  - One approach is soundex, which forms equivalence classes of words based on phonetic heuristics
- More in lectures 3 and 9
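
To give a feel for phonetic equivalence classes, here is a sketch of one simple Soundex variant: keep the first letter, map the remaining letters to digit classes, collapse repeated digits, drop the vowel-like class, and pad to four characters. Published Soundex versions differ in details (for example in how the first letter and h/w are treated), so the codes below should be read as those of this sketch only.

```python
# Letter-to-digit classes used by this sketch; class "0" is dropped later.
CODES = {}
for letters, digit in (("aeiouhwy", "0"), ("bfpv", "1"), ("cgjkqsxz", "2"),
                       ("dt", "3"), ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit

def soundex(word):
    """Return a 4-character code: first letter plus three digits."""
    word = "".join(ch for ch in word.lower() if ch in CODES)
    if not word:
        return ""
    digits = [CODES[ch] for ch in word[1:]]
    # Collapse runs of identical digits, then remove the zeros.
    collapsed = [d for i, d in enumerate(digits) if i == 0 or d != digits[i - 1]]
    kept = [d for d in collapsed if d != "0"]
    return (word[0].upper() + "".join(kept) + "000")[:4]

print(soundex("Hermann"), soundex("Herman"))  # H655 H655
```

Words that share a code fall into the same equivalence class, so a misspelled query such as “Herman” can still reach documents indexed under “Hermann”.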

Page 21: Lemmatization

- Reduce inflectional/variant forms to base form
- E.g.,
  - am, are, is → be
  - car, cars, car's, cars' → car
- the boy's cars are different colors → the boy car be different color
- Lemmatization implies doing “proper” reduction to dictionary headword form

Sec. 2.2.4

Page 22: Stemming

- Reduce terms to their “roots” before indexing
- “Stemming” suggests crude affix chopping
  - language dependent
  - e.g., automate(s), automatic, automation all reduced to automat.

Original text: for example compressed and compression are both accepted as equivalent to compress.

Stemmed text: for exampl compress and compress ar both accept as equival to compress

Sec. 2.2.4

Page 23: Porter’s algorithm

- Commonest algorithm for stemming English
  - Results suggest it’s at least as good as other stemming options
- Conventions + 5 phases of reductions
  - phases applied sequentially
  - each phase consists of a set of commands
  - sample convention: Of the rules in a compound command, select the one that applies to the longest suffix.

Sec. 2.2.4

Page 24: Typical rules in Porter

- sses → ss
- ies → i
- ational → ate
- tional → tion
- Rules sensitive to the measure of words
  - (m>1) EMENT →
    - replacement → replac
    - cement → cement

Sec. 2.2.4
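
The sample rules and the measure condition can be tried out in a few lines. This is a sketch of just the five rules listed above, not an implementation of Porter's algorithm: the real stemmer spreads its rules over several phases and attaches conditions to more of them. Treating the five rules as one compound command, and the names `measure` and `apply_rules`, are assumptions made for this example.

```python
VOWELS = "aeiou"

def is_consonant(word, i):
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # y counts as a vowel when it follows a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Count vowel-to-consonant transitions: m in the form [C](VC)^m[V]."""
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m

# (suffix, replacement, value the stem's measure must exceed; -1 = no condition)
RULES = [("sses", "ss", -1), ("ies", "i", -1),
         ("ational", "ate", -1), ("tional", "tion", -1),
         ("ement", "", 1)]

def apply_rules(word):
    """Apply the rule with the longest matching suffix, if its condition holds."""
    matching = [rule for rule in RULES if word.endswith(rule[0])]
    if not matching:
        return word
    suffix, replacement, min_m = max(matching, key=lambda rule: len(rule[0]))
    stem = word[: len(word) - len(suffix)]
    return stem + replacement if measure(stem) > min_m else word

for w in ["caresses", "ponies", "relational", "conditional", "replacement", "cement"]:
    print(w, "->", apply_rules(w))
```

“relational” ends in both “ational” and “tional”; the longest-suffix convention picks “ational”. For “cement” the stem “c” has measure 0, so the (m>1) condition blocks the rule.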

Page 25: Other stemmers

- Other stemmers exist, e.g., Lovins stemmer
  - http://www.comp.lancs.ac.uk/computing/research/stemming/general/lovins.htm
  - Single-pass, longest suffix removal (about 250 rules)
- Full morphological analysis: at most modest benefits for retrieval
- Do stemming and other normalizations help?
  - English: very mixed results. Helps recall but harms precision
    - operative (dentistry) ⇒ oper
    - operational (research) ⇒ oper
    - operating (systems) ⇒ oper
  - Definitely useful for Spanish, German, Finnish, …
    - 30% performance gains for Finnish!

Sec. 2.2.4

Page 26: Language-specificity

- Many of the above features embody transformations that are
  - Language-specific and
  - Often, application-specific
- These are “plug-in” addenda to the indexing process
- Both open source and commercial plug-ins are available for handling these

Sec. 2.2.4

Page 27: Dictionary entries: first cut

ensemble.french
時間.japanese
MIT.english
mit.german
guaranteed.english
entries.english
sometimes.english
tokenization.english

These may be grouped by language (or not…). More on this in ranking/query processing.

Sec. 2.2

Page 28: FASTER POSTINGS MERGES: SKIP POINTERS/SKIP LISTS

Page 29: Recall basic merge

- Walk through the two postings simultaneously, in time linear in the total number of postings entries

Brutus: 2 → 4 → 8 → 41 → 48 → 64 → 128
Caesar: 1 → 2 → 3 → 8 → 11 → 17 → 21 → 31
Result: 2 → 8

If the list lengths are m and n, the merge takes O(m+n) operations.

Can we do better? Yes (if index isn’t changing too fast).

Sec. 2.3
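
The linear merge can be written as a two-pointer walk over the sorted lists. A minimal sketch using Python lists in place of linked postings; the function name `intersect` is an assumption, and the data are the Brutus and Caesar lists from the slide.

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists in O(m + n) comparisons."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # doc ID occurs in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the list with the smaller doc ID
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 41, 48, 64, 128]
caesar = [1, 2, 3, 8, 11, 17, 21, 31]
print(intersect(brutus, caesar))  # [2, 8]
```

Each iteration advances at least one pointer, which is where the O(m+n) bound comes from.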

Page 30: Augment postings with skip pointers (at indexing time)

Upper list: 2 → 4 → 8 → 41 → 48 → 64 → 128 (skip pointers lead to 41 and 128)
Lower list: 1 → 2 → 3 → 8 → 11 → 17 → 21 → 31 (skip pointers lead to 11 and 31)

- Why?
  - To skip postings that will not figure in the search results.
- How?
- Where do we place skip pointers?

Sec. 2.3

Page 31: Query processing with skip pointers

Upper list: 2 → 4 → 8 → 41 → 48 → 64 → 128 (skip pointers lead to 41 and 128)
Lower list: 1 → 2 → 3 → 8 → 11 → 17 → 21 → 31 (skip pointers lead to 11 and 31)

Suppose we’ve stepped through the lists until we process 8 on each list. We match it and advance.

We then have 41 on the upper list and 11 on the lower. 11 is smaller.

But the skip successor of 11 on the lower list is 31, so we can skip ahead past the intervening postings.

Sec. 2.3

Page 32: Where do we place skips?

- Tradeoff:
  - More skips → shorter skip spans ⇒ more likely to skip. But lots of comparisons to skip pointers.
  - Fewer skips → few pointer comparisons, but then long skip spans ⇒ few successful skips.

Sec. 2.3

Page 33: Placing skips

- Simple heuristic: for postings of length L, use √L evenly-spaced skip pointers.
- This ignores the distribution of query terms.
- Easy if the index is relatively static; harder if L keeps changing because of updates.
- This definitely used to help; with modern hardware it may not (Bahle et al. 2002) unless you’re memory-based
  - The I/O cost of loading a bigger postings list can outweigh the gains from quicker in memory merging!

Sec. 2.3
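
The √L heuristic and the skip-aware merge can be sketched together. The layout is an assumption made for this example: postings are stored in a Python list, and every s-th posting (s about √L) is treated as carrying a skip pointer to the posting s places ahead, so the skip positions differ from those drawn in the slide figure.

```python
import math

def advance(postings, pos, step, target):
    """Move forward in `postings`, taking the skip pointer when it is safe."""
    has_skip = pos % step == 0 and pos + step < len(postings)
    if has_skip and postings[pos + step] <= target:
        return pos + step   # every posting skipped over is smaller than target
    return pos + 1

def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists using sqrt(L)-spaced skips."""
    s1 = max(1, math.isqrt(len(p1)))
    s2 = max(1, math.isqrt(len(p2)))
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i = advance(p1, i, s1, p2[j])
        else:
            j = advance(p2, j, s2, p1[i])
    return answer

upper = [2, 4, 8, 41, 48, 64, 128]
lower = [1, 2, 3, 8, 11, 17, 21, 31]
print(intersect_with_skips(upper, lower))  # [2, 8]
```

A skip is only taken when its target does not overshoot the posting being searched for, so the result is the same as that of the plain merge; only the number of steps changes.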

Page 34: PHRASE QUERIES AND POSITIONAL INDEXES

Page 35: Phrase queries

- Want to be able to answer queries such as “stanford university” as a phrase
- Thus the sentence “I went to university at Stanford” is not a match.
  - The concept of phrase queries has proven easily understood by users; one of the few “advanced search” ideas that works
  - Many more queries are implicit phrase queries
- For this, it no longer suffices to store only <term : docs> entries

Sec. 2.4

Page 36: A first attempt: Biword indexes

- Index every consecutive pair of terms in the text as a phrase
- For example the text “Friends, Romans, Countrymen” would generate the biwords
  - friends romans
  - romans countrymen
- Each of these biwords is now a dictionary term
- Two-word phrase query-processing is now immediate.

Sec. 2.4.1
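
Building a biword index takes only a few lines. In this sketch the tokenization (lowercased runs of word characters) and the function names are assumptions made for illustration.

```python
import re
from collections import defaultdict

def biwords(text):
    """Return every consecutive pair of lowercased terms in the text."""
    terms = re.findall(r"\w+", text.lower())
    return [f"{a} {b}" for a, b in zip(terms, terms[1:])]

def build_biword_index(docs):
    """Map each biword to the sorted list of doc IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for bw in biwords(text):
            index[bw].add(doc_id)
    return {bw: sorted(ids) for bw, ids in index.items()}

print(biwords("Friends, Romans, Countrymen"))
# ['friends romans', 'romans countrymen']
```

A two-word phrase query is then a single dictionary lookup, for example `build_biword_index(docs).get("friends romans", [])`.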

Page 37: Longer phrase queries

- Longer phrases are processed as we did with wild-cards:
- stanford university palo alto can be broken into the Boolean query on biwords:

stanford university AND university palo AND palo alto

Without the docs, we cannot verify that the docs matching the above Boolean query do contain the phrase.

Can have false positives!

Sec. 2.4.1
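
A false positive is easy to construct. The sketch below checks a document against the AND of the query's biwords and, separately, against the phrase itself; the example document is made up so that it contains all three biwords without containing the phrase.

```python
import re

def terms(text):
    return re.findall(r"\w+", text.lower())

def biwords(text):
    ts = terms(text)
    return [f"{a} {b}" for a, b in zip(ts, ts[1:])]

def matches_biword_query(doc, phrase):
    """True if the doc contains every biword of the phrase (the Boolean AND)."""
    doc_biwords = set(biwords(doc))
    return all(bw in doc_biwords for bw in biwords(phrase))

def contains_phrase(doc, phrase):
    """Ground truth: does the term sequence of the phrase occur in the doc?"""
    d, p = terms(doc), terms(phrase)
    return any(d[i:i + len(p)] == p for i in range(len(d) - len(p) + 1))

# Made-up document: all three biwords occur, but never as one run of terms.
doc = "Palo Alto hosts Stanford University and the University Palo club"
query = "stanford university palo alto"
print(matches_biword_query(doc, query))  # True
print(contains_phrase(doc, query))       # False
```

The index alone cannot tell these two cases apart, which is why the matching documents would have to be rescanned to confirm the phrase.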

Page 38: Extended biwords

- Parse the indexed text and perform part-of-speech-tagging (POST).
- Bucket the terms into (say) Nouns (N) and articles/prepositions (X).
- Call any string of terms of the form NX*N an extended biword.
  - Each such extended biword is now made a term in the dictionary.
- Example: catcher in the rye → N X X N
- Query processing: parse it into N’s and X’s
  - Segment query into enhanced biwords
  - Look up in index: catcher rye

Sec. 2.4.1
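
Given tagged terms, extracting the NX*N patterns is a short scan. The tags in this sketch are supplied by hand as a stand-in for a part-of-speech tagger, and any tag other than N or X is assumed to break a pattern.

```python
def extended_biwords(tagged):
    """Return 'noun noun' terms for every N X* N pattern.

    `tagged` is a list of (term, tag) pairs with tag 'N' for nouns and
    'X' for articles/prepositions; other tags end a candidate pattern.
    """
    result = []
    for i, (term, tag) in enumerate(tagged):
        if tag != "N":
            continue
        j = i + 1
        while j < len(tagged) and tagged[j][1] == "X":
            j += 1                      # skip the X* run
        if j < len(tagged) and tagged[j][1] == "N":
            result.append(f"{term} {tagged[j][0]}")
    return result

# Hand-tagged example from the slide.
tagged = [("catcher", "N"), ("in", "X"), ("the", "X"), ("rye", "N")]
print(extended_biwords(tagged))  # ['catcher rye']
```

The same function is applied to the tagged query, and the resulting extended biwords are looked up as ordinary dictionary terms.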

Page 39: Issues for biword indexes

- False positives, as noted before
- Index blowup due to bigger dictionary
  - Infeasible for more than biwords, big even for them
- Biword indexes are not the standard solution (for all biwords) but can be part of a compound strategy

Sec. 2.4.1

Page 40: Solution 2: Positional indexes

- In the postings, store for each term the position(s) in which tokens of it appear:

<term, number of docs containing term;
doc1: position1, position2 … ;
doc2: position1, position2 … ;
etc.>

Sec. 2.4.2

Page 41: Positional index example

<be: 993427;
1: 7, 18, 33, 72, 86, 231;
2: 3, 149;
4: 17, 191, 291, 430, 434;
5: 363, 367, …>

Which of docs 1, 2, 4, 5 could contain “to be or not to be”?

- For phrase queries, we use a merge algorithm recursively at the document level
- But we now need to deal with more than just equality

Sec. 2.4.2

Page 42: Processing a phrase query

- Extract inverted index entries for each distinct term: to, be, or, not.
- Merge their doc:position lists to enumerate all positions with “to be or not to be”.
  - to:
    - 2: 1, 17, 74, 222, 551; 4: 8, 16, 190, 429, 433; 7: 13, 23, 191; ...
  - be:
    - 1: 17, 19; 4: 17, 191, 291, 430, 434; 5: 14, 19, 101; ...
- Same general method for proximity searches

Sec. 2.4.2
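
The two steps above can be sketched with a small in-memory positional index. For brevity this sketch uses Python dicts and set intersection rather than the linear merge over sorted lists described in the lecture; the function names, the whitespace tokenization, and the two toy documents are assumptions.

```python
from collections import defaultdict

def build_positional_index(docs):
    """term -> {doc_id: [positions]} using whitespace tokens, lowercased."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            index[token].setdefault(doc_id, []).append(pos)
    return index

def phrase_query(index, phrase):
    """Return the sorted doc IDs in which the terms occur consecutively."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    # Document level: keep only docs that contain every term.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = []
    for doc_id in sorted(candidates):
        # Position level: term k must sit exactly k places after the start.
        starts = set(index[terms[0]][doc_id])
        for offset, t in enumerate(terms[1:], start=1):
            starts &= {p - offset for p in index[t][doc_id]}
        if starts:
            hits.append(doc_id)
    return hits

docs = {1: "to be or not to be", 2: "not to be or to be"}  # toy documents
index = build_positional_index(docs)
print(phrase_query(index, "to be or not to be"))  # [1]
```

Document 2 contains all four terms, so a non-positional index would return it; the position check is what rules it out.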

Page 43: Proximity queries

- LIMIT! /3 STATUTE /3 FEDERAL /2 TORT
  - Again, here, /k means “within k words of”.
- Clearly, positional indexes can be used for such queries; biword indexes cannot.
- Exercise: Adapt the linear merge of postings to handle proximity queries. Can you make it work for any value of k?
  - This is a little tricky to do correctly and efficiently
  - See Figure 2.12 of IIR
  - There’s likely to be a problem on it!

Sec. 2.4.2
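
One possible starting point for the exercise is the position-level step for a single document: given the sorted position lists of two terms, list the pairs that lie within k words of each other. This is a sketch of one approach, not a reproduction of Figure 2.12 of IIR; the function name `within_k` is an assumption, and the sample positions are the doc 4 lists for “to” and “be” shown on the previous slide.

```python
def within_k(pos1, pos2, k):
    """Return all (p1, p2) with |p1 - p2| <= k from two sorted position lists."""
    pairs = []
    start = 0  # first index in pos2 that can still be within k of p1
    for p1 in pos1:
        # Positions left of the window can never match a later p1 either.
        while start < len(pos2) and pos2[start] < p1 - k:
            start += 1
        j = start
        while j < len(pos2) and pos2[j] <= p1 + k:
            pairs.append((p1, pos2[j]))
            j += 1
    return pairs

to_positions = [8, 16, 190, 429, 433]      # "to" in doc 4
be_positions = [17, 191, 291, 430, 434]    # "be" in doc 4
print(within_k(to_positions, be_positions, 1))
# [(16, 17), (190, 191), (429, 430), (433, 434)]
```

The `start` pointer only moves forward, so the scan is linear apart from the pairs it reports; with a large k the number of reported pairs can itself become large.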

Page 44: Positional index size

- You can compress position values/offsets: we’ll talk about that in lecture 5
- Nevertheless, a positional index expands postings storage substantially
- Nevertheless, a positional index is now standardly used because of the power and usefulness of phrase and proximity queries … whether used explicitly or implicitly in a ranking retrieval system.

Sec. 2.4.2

Page 45: Positional index size

- Need an entry for each occurrence, not just once per document
- Index size depends on average document size (Why?)
  - Average web page has <1000 terms
  - SEC filings, books, even some epic poems … easily 100,000 terms
- Consider a term with frequency 0.1%

Document size    Postings    Positional postings
1000             1           1
100,000          1           100

Sec. 2.4.2
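
The table follows from simple arithmetic: a non-positional index stores one posting per document that contains the term, while a positional index stores one entry per occurrence. A small sketch, in which the function name and the per-mille parameter are assumptions made to keep the arithmetic in integers:

```python
def postings_per_doc(doc_len, per_mille=1):
    """Return (non-positional, positional) posting counts for one document.

    `per_mille` is the term frequency in occurrences per 1000 tokens,
    so the default of 1 corresponds to the 0.1% in the example.
    """
    occurrences = doc_len * per_mille // 1000
    nonpositional = 1 if occurrences > 0 else 0
    return nonpositional, occurrences

for size in (1000, 100_000):
    print(size, postings_per_doc(size))
```

The non-positional count stays at 1 however long the document is, while the positional count grows in proportion to document length, which is why average document size drives positional index size.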

Page 46: Rules of thumb

- A positional index is 2 to 4 times as large as a non-positional index
- Positional index size 35-50% of volume of original text
- Caveat: all of this holds for “English-like” languages

Sec. 2.4.2

Page 47: Combination schemes

- These two approaches can be profitably combined
  - For particular phrases (“Michael Jackson”, “Britney Spears”) it is inefficient to keep on merging positional postings lists
    - Even more so for phrases like “The Who”
- Williams et al. (2004) evaluate a more sophisticated mixed indexing scheme
  - A typical web query mixture was executed in ¼ of the time of using just a positional index
  - It required 26% more space than having a positional index alone

Sec. 2.4.3

Page 48: Resources for today’s lecture

- IIR 2
- MG 3.6, 4.3; MIR 7.2
- Porter’s stemmer: http://www.tartarus.org/~martin/PorterStemmer/
- Skip Lists theory: Pugh (1990)
  - Multilevel skip lists give same O(log n) efficiency as trees
- H.E. Williams, J. Zobel, and D. Bahle. 2004. “Fast Phrase Querying with Combined Indexes”, ACM Transactions on Information Systems. http://www.seg.rmit.edu.au/research/research.php?author=4
- D. Bahle, H. Williams, and J. Zobel. Efficient phrase querying with an auxiliary index. SIGIR 2002, pp. 215-221.