Introducing Information Retrieval and Web Search
Borrowing from: Pandu Nayak

Information Retrieval
• Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
  – These days we frequently think first of web search, but there are many other cases:
    • E-mail search
    • Searching your laptop
    • Corporate knowledge bases
    • Legal information retrieval

Basic assumptions of Information Retrieval
• Collection: A set of documents
  – Assume it is a static collection for the moment
• Goal: Retrieve documents with information that is relevant to the user's information need and helps the user complete a task
Sec. 1.1

The classic search model
• User task: Get rid of mice in a politically correct way
  – Misconception?
• Info need: Info about removing mice without killing them
  – Misformulation?
• Query: how trap mice alive
• Search: the search engine runs the query against the collection and returns results
• Query refinement: the results feed back into a revised query

How good are the retrieved docs?
• Precision: Fraction of retrieved docs that are relevant to the user's information need
• Recall: Fraction of relevant docs in collection that are retrieved
• More precise definitions and measurements to follow later
Sec. 1.1
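The two fractions above can be made concrete with a small sketch. The function name `precision_recall` and the docID sets are made up for illustration; they are not from the slides.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: docIDs the system returned (illustrative input)
    relevant:  docIDs that actually satisfy the information need
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant docs that were retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 docs retrieved, 5 docs relevant, 2 in common.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 6, 8, 10])
print(p, r)
```

Here two of the four retrieved docs are relevant (precision 2/4) and two of the five relevant docs were found (recall 2/5).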

Term-document incidence matrices

Unstructured data in 1620
• Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
• One could grep all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia?
• Why is that not the answer?
  – Slow (for large corpora)
  – NOT Calpurnia is non-trivial
  – Other operations (e.g., find the word Romans near countrymen) not feasible
  – Ranked retrieval (best documents to return)
    • Later lectures
Sec. 1.1

Term-document incidence matrices

| | Antony and Cleopatra | Julius Caesar | The Tempest | Hamlet | Othello | Macbeth |
|---|---|---|---|---|---|---|
| Antony | 1 | 1 | 0 | 0 | 0 | 1 |
| Brutus | 1 | 1 | 0 | 1 | 0 | 0 |
| Caesar | 1 | 1 | 0 | 1 | 1 | 1 |
| Calpurnia | 0 | 1 | 0 | 0 | 0 | 0 |
| Cleopatra | 1 | 0 | 0 | 0 | 0 | 0 |
| mercy | 1 | 0 | 1 | 1 | 1 | 1 |
| worser | 1 | 0 | 1 | 1 | 1 | 0 |

1 if play contains word, 0 otherwise
Query: Brutus AND Caesar BUT NOT Calpurnia
Sec. 1.1

Incidence vectors
• So we have a 0/1 vector for each term.
• To answer query: take the vectors for Brutus, Caesar and Calpurnia (complemented) → bitwise AND.
  – 110100 AND
  – 110111 AND
  – 101111 =
  – 100100
Sec. 1.1
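The bitwise AND can be reproduced with Python integers as bit vectors. This is a minimal sketch: the names `PLAYS`, `INCIDENCE`, `MASK` and `decode` are illustrative, and the leftmost bit is taken to be the first play in the matrix.

```python
PLAYS = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# Rows of the incidence matrix, leftmost bit = first play.
INCIDENCE = {
    "Brutus": 0b110100,
    "Caesar": 0b110111,
    "Calpurnia": 0b010000,
}

MASK = (1 << len(PLAYS)) - 1  # keeps the complement to 6 bits


def decode(bits):
    """Turn a bit vector back into the list of matching plays."""
    width = len(PLAYS)
    return [play for i, play in enumerate(PLAYS)
            if bits & (1 << (width - 1 - i))]


not_calpurnia = ~INCIDENCE["Calpurnia"] & MASK  # 101111
answer = INCIDENCE["Brutus"] & INCIDENCE["Caesar"] & not_calpurnia
print(format(answer, "06b"), decode(answer))
```

The mask matters because Python integers are unbounded, so a bare `~` would yield a negative number rather than a 6-bit complement.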

Answers to query
• Antony and Cleopatra, Act III, Scene ii
  Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus, When Antony found Julius Caesar dead, He cried almost to roaring; and he wept When at Philippi he found Brutus slain.
• Hamlet, Act III, Scene ii
  Lord Polonius: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Sec. 1.1

Bigger collections
• Consider N = 1 million documents, each with about 1000 words.
• Avg 6 bytes/word including spaces/punctuation
  – 6 GB of data in the documents.
• Say there are M = 500K distinct terms among these.
Sec. 1.1

Can't build the matrix
• 500K x 1M matrix has half-a-trillion 0's and 1's.
• But it has no more than one billion 1's. (Why?)
  – matrix is extremely sparse.
• What's a better representation?
  – We only record the 1 positions.
Sec. 1.1
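The arithmetic behind these figures, as a short sketch. The constant names are illustrative. The bound on the number of 1's follows from the collection holding only N × 1000 word occurrences in total, each of which can set at most one cell.

```python
N_DOCS = 1_000_000        # N
WORDS_PER_DOC = 1_000     # about 1000 words each
BYTES_PER_WORD = 6        # average, including spaces/punctuation
M_TERMS = 500_000         # M distinct terms

collection_bytes = N_DOCS * WORDS_PER_DOC * BYTES_PER_WORD  # 6 GB
matrix_cells = M_TERMS * N_DOCS                             # half a trillion
max_ones = N_DOCS * WORDS_PER_DOC                           # at most one billion
max_density = max_ones / matrix_cells                       # upper bound on fraction of 1's

print(collection_bytes, matrix_cells, max_ones, max_density)
```

So at most 0.2% of the cells can be 1, which is why recording only the 1 positions is attractive.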

The Inverted Index
The key data structure underlying modern IR

Inverted index
• For each term t, we must store a list of all documents that contain t.
  – Identify each doc by a docID, a document serial number
• Can we use fixed-size arrays for this?
  – What happens if the word Caesar is added to document 14?
Brutus → 1 2 4 11 31 45 173 174
Caesar → 1 2 4 5 6 16 57 132
Calpurnia → 2 31 54 101
Sec. 1.2

Inverted index
• We need variable-size postings lists
  – On disk, a continuous run of postings is normal and best
  – In memory, can use linked lists or variable length arrays
    • Some tradeoffs in size/ease of insertion
• Dictionary: the terms. Postings: the docID lists, where each docID entry is a posting. Sorted by docID (more later on why).
Brutus → 1 2 4 11 31 45 173 174
Caesar → 1 2 4 5 6 16 57 132
Calpurnia → 2 31 54 101
Sec. 1.2
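A minimal in-memory sketch of variable-size postings lists, using a Python dict of sorted lists. The helper name `add_posting` is hypothetical, and the docIDs are the ones read off the figure for Brutus, Caesar and Calpurnia. It shows what adding Caesar to document 14 involves when the list must stay sorted by docID.

```python
import bisect

# Dictionary term -> postings list sorted by docID.
index = {
    "Brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "Caesar": [1, 2, 4, 5, 6, 16, 57, 132],
    "Calpurnia": [2, 31, 54, 101],
}


def add_posting(index, term, doc_id):
    """Insert doc_id into term's postings, keeping them sorted and unique."""
    postings = index.setdefault(term, [])
    pos = bisect.bisect_left(postings, doc_id)
    if pos == len(postings) or postings[pos] != doc_id:
        postings.insert(pos, doc_id)  # shifts later entries to the right


add_posting(index, "Caesar", 14)
print(index["Caesar"])
```

A Python list is a variable-length array, so the insert costs a shift of the tail; a linked list would avoid the shift at the price of pointer overhead, which is the tradeoff the slide mentions.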

Inverted index construction
• Documents to be indexed: Friends, Romans, countrymen.
• Tokenizer → Token stream: Friends Romans Countrymen
• Linguistic modules → Modified tokens: friend roman countryman
• Indexer → Inverted index:
friend → 2 4
roman → 1 2
countryman → 13 16
Sec. 1.2

Initial stages of text processing
• Tokenization
  – Cut character sequence into word tokens
    • Deal with "John's", a state-of-the-art solution
• Normalization
  – Map text and query term to same form
    • You want U.S.A. and USA to match
• Stemming
  – We may wish different forms of a root to match
    • authorize, authorization
• Stop words
  – We may omit very common words (or not)
    • the, a, to, of
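A rough sketch of the first stages, under loudly stated assumptions: the regular expression, the rule that strips a trailing 's, and the four-word stop list are illustrative choices, not the method from the lecture. Stemming is left out, since a real stemmer is considerably more involved.

```python
import re

STOP_WORDS = {"the", "a", "to", "of"}  # the slide's examples only


def tokenize(text):
    """Cut a character sequence into word tokens.

    Keeps internal periods and apostrophes so that "U.S.A" and "John's"
    survive as single tokens (one crude choice among many).
    """
    return re.findall(r"[A-Za-z0-9]+(?:[.'][A-Za-z0-9]+)*", text)


def normalize(token):
    """Map a token to a canonical form: lowercase, no periods, no 's."""
    form = token.lower().replace(".", "")
    if form.endswith("'s"):
        form = form[:-2]
    return form


def process(text, drop_stop_words=False):
    terms = [normalize(tok) for tok in tokenize(text)]
    if drop_stop_words:
        terms = [t for t in terms if t not in STOP_WORDS]
    return terms


print(process("John's visit to the U.S.A.", drop_stop_words=True))
```

The same `normalize` function would have to be applied to query terms, so that a query for USA matches text containing U.S.A.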

Indexer steps: Token sequence
• Sequence of (Modified token, Document ID) pairs.
Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
Sec. 1.2

Indexer steps: Sort
• Sort by terms
  – And then docID
• Core indexing step
Sec. 1.2

Indexer steps: Dictionary & Postings
• Multiple term entries in a single document are merged.
• Split into Dictionary and Postings
• Doc. frequency information is added.
  – Why frequency? Will discuss later.
Sec. 1.2
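The indexer steps (pairs, sort, merge, document frequency) can be sketched on the two example documents. The tokenizer here is a simplistic regular expression with lowercasing, chosen only to keep the example short, and `build_index` is a hypothetical name.

```python
import re
from itertools import groupby

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}


def build_index(docs):
    """Return term -> (document frequency, postings sorted by docID)."""
    # Step 1: sequence of (modified token, docID) pairs.
    pairs = [(tok.lower(), doc_id)
             for doc_id, text in docs.items()
             for tok in re.findall(r"[A-Za-z']+", text)]
    # Step 2: sort by term, then by docID (the core indexing step).
    pairs.sort()
    # Step 3: merge repeated entries per document and record doc. frequency.
    index = {}
    for term, group in groupby(pairs, key=lambda pair: pair[0]):
        postings = sorted({doc_id for _, doc_id in group})
        index[term] = (len(postings), postings)
    return index


index = build_index(docs)
print(index["caesar"], index["killed"])
```

Note that "killed" occurs twice in Doc 1 but yields a single posting, and "caesar" gets document frequency 2 even though it occurs three times overall.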

Where do we pay in storage?
• Terms and counts (the dictionary)
• Pointers (from dictionary entries to postings lists)
• Lists of docIDs (the postings)
IR system implementation
• How do we index efficiently?
• How much storage do we need?
Sec. 1.2

Query processing with an inverted index

The index we just built
• How do we process a query? (Our focus)
  – Later: what kinds of queries can we process?
Sec. 1.3

Query processing: AND
• Consider processing the query: Brutus AND Caesar
  – Locate Brutus in the Dictionary;
    • Retrieve its postings.
  – Locate Caesar in the Dictionary;
    • Retrieve its postings.
  – "Merge" the two postings (intersect the document sets):
Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 13 21 34
Sec. 1.3

The merge
• Walk through the two postings simultaneously, in time linear in the total number of postings entries
Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 13 21 34
• If the list lengths are x and y, the merge takes O(x+y) operations.
• Crucial: postings sorted by docID.
Sec. 1.3

Intersecting two postings lists (a "merge" algorithm)
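A two-pointer sketch of the merge, written in Python over plain lists. The function name `intersect` is an assumption, and the example postings are the Brutus and Caesar lists used in the AND query.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by docID.

    Each step advances at least one pointer, so the loop runs at most
    len(p1) + len(p2) times.
    """
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))
```

The sorted order is what makes it safe to discard the smaller docID at each step: it cannot appear later in the other list.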

The Boolean Retrieval Model & Extended Boolean Models

Boolean queries: Exact match
• The Boolean retrieval model lets us ask any query that is a Boolean expression:
  – Boolean queries use AND, OR and NOT to join query terms
    • Views each document as a set of words
    • Is precise: document matches condition or not.
  – Perhaps the simplest model to build an IR system on
• Primary commercial retrieval tool for 3 decades.
• Many search systems you still use are Boolean:
  – Email, library catalog, Mac OS X Spotlight
Sec. 1.3

Example: WestLaw http://www.westlaw.com/
• Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992; new federated search added 2010)
• Tens of terabytes of data; ~700,000 users
• Majority of users still use boolean queries
• Example query:
  – What is the statute of limitations in cases involving the federal tort claims act?
  – LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM
    • /3 = within 3 words, /S = in same sentence
Sec. 1.4

Example: WestLaw http://www.westlaw.com/
• Another example query:
  – Requirements for disabled people to be able to access a workplace
  – disabl! /p access! /s work-site work-place (employment /3 place)
• Note that SPACE is disjunction, not conjunction!
• Long, precise queries; proximity operators; incrementally developed; not like web search
• Many professional searchers still like Boolean search
  – You know exactly what you are getting
• But that doesn't mean it actually works better….
Sec. 1.4

Boolean queries: More general merges
• Exercise: Adapt the merge for the queries:
  – Brutus AND NOT Caesar
  – Brutus OR NOT Caesar
• Can we still run through the merge in time O(x+y)? What can we achieve?
Sec. 1.3
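One possible adaptation for the first query, offered as a sketch rather than the intended solution: a two-pointer walk that keeps docIDs from the first list that are absent from the second. The name `and_not` and the example lists are illustrative.

```python
def and_not(p1, p2):
    """Return docIDs in p1 that do not occur in p2 (both sorted by docID)."""
    answer = []
    i = j = 0
    while i < len(p1):
        if j == len(p2) or p1[i] < p2[j]:
            answer.append(p1[i])  # p1[i] cannot appear later in p2
            i += 1
        elif p1[i] == p2[j]:
            i += 1                # excluded by NOT
            j += 1
        else:
            j += 1                # catch p2 up to p1[i]
    return answer


brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(and_not(brutus, caesar))
```

Each iteration advances a pointer, so this stays within O(x+y). Brutus OR NOT Caesar is different in kind: its answer includes every document that lacks Caesar, so the output alone can be on the order of the whole collection rather than x+y.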

Query optimization
• What is the best order for query processing?
• Consider a query that is an AND of n terms.
• For each of the n terms, get its postings, then AND them together.
Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 16 21 34
Calpurnia → 13 16
Query: Brutus AND Calpurnia AND Caesar
Sec. 1.3

Query optimization example
• Process in order of increasing freq:
  – start with smallest set, then keep cutting further.
  – This is why we kept document freq. in dictionary
• Execute the query as (Calpurnia AND Brutus) AND Caesar.
Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 16 21 34
Calpurnia → 13 16
Sec. 1.3
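A sketch of processing an AND query in order of increasing document frequency. The function names are hypothetical, and the length of each postings list stands in for the document frequency kept in the dictionary.

```python
def intersect(p1, p2):
    """Two-pointer intersection of postings lists sorted by docID."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


def processing_order(index, terms):
    """Order terms by increasing document frequency."""
    return sorted(terms, key=lambda term: len(index[term]))


def intersect_all(index, terms):
    ordered = processing_order(index, terms)
    result = index[ordered[0]]
    for term in ordered[1:]:
        if not result:  # nothing left to cut further
            break
        result = intersect(result, index[term])
    return result


index = {
    "Brutus": [2, 4, 8, 16, 32, 64, 128],
    "Caesar": [1, 2, 3, 5, 8, 16, 21, 34],
    "Calpurnia": [13, 16],
}
query = ["Brutus", "Calpurnia", "Caesar"]
print(processing_order(index, query), intersect_all(index, query))
```

Starting from the shortest list keeps every intermediate result no longer than that list, so later merges have little to scan on one side.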

More general optimization
• e.g., (madding OR crowd) AND (ignoble OR strife)
• Get doc. freq.'s for all terms.
• Estimate the size of each OR by the sum of its doc. freq.'s (conservative).
• Process in increasing order of OR sizes.
Sec. 1.3

Exercise
• Recommend a query processing order for
  (tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)
• Which two terms should we process first?

| Term | Freq |
|---|---|
| eyes | 213312 |
| kaleidoscope | 87009 |
| marmalade | 107913 |
| skies | 271658 |
| tangerine | 46653 |
| trees | 316812 |
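Applying the conservative estimate (the sum of document frequencies per OR group) to the exercise can be sketched as follows. The names `DOC_FREQ`, `QUERY` and `or_size_estimate` are illustrative, and this is one reading of the heuristic, not an official answer key.

```python
DOC_FREQ = {
    "eyes": 213312,
    "kaleidoscope": 87009,
    "marmalade": 107913,
    "skies": 271658,
    "tangerine": 46653,
    "trees": 316812,
}

# Each tuple is one OR group; the groups are ANDed together.
QUERY = [("tangerine", "trees"), ("marmalade", "skies"), ("kaleidoscope", "eyes")]


def or_size_estimate(group):
    """Upper bound on the size of an OR: sum of its doc. frequencies."""
    return sum(DOC_FREQ[term] for term in group)


order = sorted(QUERY, key=or_size_estimate)
for group in order:
    print(group, or_size_estimate(group))
```

Under this estimate the (kaleidoscope OR eyes) group is the smallest at 300321, followed by (tangerine OR trees) at 363465 and (marmalade OR skies) at 379571.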

Phrase queries and positional indexes

Phrase queries
• We want to be able to answer queries such as "stanford university" – as a phrase
• Thus the sentence "I went to university at Stanford" is not a match.
  – The concept of phrase queries has proven easily understood by users; one of the few "advanced search" ideas that works
  – Many more queries are implicit phrase queries
• For this, it no longer suffices to store only <term : docs> entries
Sec. 2.4

A first attempt: Biword indexes
• Index every consecutive pair of terms in the text as a phrase
• For example the text "Friends, Romans, Countrymen" would generate the biwords
  – friends romans
  – romans countrymen
• Each of these biwords is now a dictionary term
• Two-word phrase query-processing is now immediate.
Sec. 2.4.1

Longer phrase queries
• Longer phrases can be processed by breaking them down
• stanford university palo alto can be broken into the Boolean query on biwords:
  stanford university AND university palo AND palo alto
• Without the docs, we cannot verify that the docs matching the above Boolean query do contain the phrase.
  – Can have false positives!
Sec. 2.4.1
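A small sketch of a biword index and of the false-positive problem. The two documents are invented for illustration, and the function names are hypothetical. Doc 2 contains all three biwords of the query but never the four-word phrase itself.

```python
def biwords(tokens):
    """Every consecutive pair of terms, joined into one dictionary term."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def build_biword_index(docs):
    index = {}
    for doc_id, tokens in docs.items():
        for biword in biwords(tokens):
            index.setdefault(biword, set()).add(doc_id)
    return index


def phrase_candidates(index, phrase_tokens):
    """AND together the postings of the phrase's biwords (needs >= 2 tokens)."""
    postings = [index.get(biword, set()) for biword in biwords(phrase_tokens)]
    return set.intersection(*postings)


# Hypothetical documents, already tokenized and normalized.
docs = {
    1: "stanford university palo alto".split(),
    2: "palo alto hosts stanford university while university palo is elsewhere".split(),
}
index = build_biword_index(docs)
print(phrase_candidates(index, "stanford university palo alto".split()))
```

Both documents come back as candidates, so a system relying on biwords alone would need to check the original text to rule out doc 2.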

Issues for biword indexes
• False positives, as noted before
• Index blowup due to bigger dictionary
  – Infeasible for more than biwords, big even for them
• Biword indexes are not the standard solution (for all biwords) but can be part of a compound strategy
Sec. 2.4.1

Solution 2: Positional indexes
• In the postings, store, for each term the position(s) in which tokens of it appear:
  <term, number of docs containing term; doc1: position1, position2 …; doc2: position1, position2 …; etc.>
Sec. 2.4.2

Positional index example
<be: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367, …>
• Which of docs 1, 2, 4, 5 could contain "to be or not to be"?
• For phrase queries, we use a merge algorithm recursively at the document level
• But we now need to deal with more than just equality
Sec. 2.4.2

Processing a phrase query
• Extract inverted index entries for each distinct term: to, be, or, not.
• Merge their doc:position lists to enumerate all positions with "to be or not to be".
  – to:
    • 2: 1, 17, 74, 222, 551; 4: 8, 16, 190, 429, 433; 7: 13, 23, 191; ...
  – be:
    • 1: 17, 19; 4: 17, 191, 291, 430, 434; 5: 14, 19, 101; ...
• Same general method for proximity searches
Sec. 2.4.2
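A simplified sketch of the positional step for the two-word fragment "to be", using only the postings shown for to and be (the elided "..." entries are left out). For brevity it uses set lookups instead of a linear merge of the position lists, and `adjacent_matches` is a hypothetical name.

```python
# Positional postings: docID -> sorted positions (only the entries shown).
TO = {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433], 7: [13, 23, 191]}
BE = {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]}


def adjacent_matches(first, second, offset=1):
    """Find positions where `second` occurs exactly `offset` after `first`.

    Document level: only docs containing both terms are considered.
    Position level: equality is replaced by a position-difference test.
    """
    matches = {}
    for doc_id in sorted(first.keys() & second.keys()):
        second_positions = set(second[doc_id])
        starts = [pos for pos in first[doc_id] if pos + offset in second_positions]
        if starts:
            matches[doc_id] = starts
    return matches


print(adjacent_matches(TO, BE))
```

Among the listed entries only doc 4 contains both terms, and "to be" starts there at positions 16, 190, 429 and 433. Relaxing the test to a window of k positions instead of an exact offset gives the proximity search mentioned above.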

Positional index size
• A positional index expands postings storage substantially
  – Even though indices can be compressed
• Nevertheless, a positional index is now standardly used because of the power and usefulness of phrase and proximity queries … whether used explicitly or implicitly in a ranking retrieval system.
Sec. 2.4.2

Positional index size
• Need an entry for each occurrence, not just once per document
• Index size depends on average document size (Why?)
  – Average web page has <1000 terms
  – SEC filings, books, even some epic poems … easily 100,000 terms
• Consider a term with frequency 0.1%

| Document size | Postings | Positional postings |
|---|---|---|
| 1000 | 1 | 1 |
| 100,000 | 1 | 100 |

Sec. 2.4.2
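The table's figures follow from simple arithmetic, sketched here with an illustrative helper. A term with frequency 0.1% is expected to occur once per thousand tokens, and a non-positional index stores one posting per document regardless of how often the term occurs.

```python
def postings_per_document(doc_length, occurrences_per_thousand=1):
    """Return (non-positional postings, positional postings) for one doc.

    occurrences_per_thousand=1 corresponds to a term frequency of 0.1%.
    """
    occurrences = doc_length * occurrences_per_thousand // 1000
    nonpositional = 1 if occurrences > 0 else 0  # one entry per document
    positional = occurrences                     # one entry per occurrence
    return nonpositional, positional


for size in (1000, 100_000):
    print(size, postings_per_document(size))
```

This is why index size depends on average document size: the non-positional cost per document is flat, while the positional cost grows with document length.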

Rules of thumb
• A positional index is 2–4 times as large as a non-positional index
• Positional index size 35–50% of volume of original text
  – Caveat: all of this holds for "English-like" languages
Sec. 2.4.2

Combination schemes
• These two approaches can be profitably combined
  – For particular phrases ("Michael Jackson", "Britney Spears") it is inefficient to keep on merging positional postings lists
    • Even more so for phrases like "The Who"
• Williams et al. (2004) evaluate a more sophisticated mixed indexing scheme
  – A typical web query mixture was executed in ¼ of the time of using just a positional index
  – It required 26% more space than having a positional index alone
Sec. 2.4.3

Structured vs. Unstructured Data

IR vs. databases: Structured vs unstructured data
• Structured data tends to refer to information in "tables"

| Employee | Manager | Salary |
|---|---|---|
| Smith | Jones | 50000 |
| Chang | Smith | 60000 |
| Ivy | Smith | 50000 |

• Typically allows numerical range and exact match (for text) queries, e.g., Salary < 60000 AND Manager = Smith.
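The example query can be evaluated directly over the three table rows. This is a plain-Python sketch with an illustrative `ROWS` list; a database would express the same condition in its own query language.

```python
ROWS = [
    {"Employee": "Smith", "Manager": "Jones", "Salary": 50000},
    {"Employee": "Chang", "Manager": "Smith", "Salary": 60000},
    {"Employee": "Ivy", "Manager": "Smith", "Salary": 50000},
]

# Salary < 60000 AND Manager = Smith
matches = [row["Employee"] for row in ROWS
           if row["Salary"] < 60000 and row["Manager"] == "Smith"]
print(matches)
```

Only Ivy satisfies both conditions: Chang's salary is not strictly below 60000, and Smith reports to Jones.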

Unstructured data
• Typically refers to free text
• Allows
  – Keyword queries including operators
  – More sophisticated "concept" queries e.g.,
    • find all web pages dealing with drug abuse
• Classic model for searching text documents

Semi-structured data
• In fact almost no data is "unstructured"
• E.g., this slide has distinctly identified zones such as the Title and Bullets
• … to say nothing of linguistic structure
• Facilitates "semi-structured" search such as
  – Title contains data AND Bullets contain search
• Or even
  – Title is about Object Oriented Programming AND Author something like stro*rup
  – where * is the wild-card operator
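A toy sketch of zone-restricted matching with a wild-card, using `fnmatchcase` from Python's standard library. The two records, the zone names and the helper functions are all hypothetical, and "is about" is approximated here by plain word containment, which is much weaker than a real concept query.

```python
from fnmatch import fnmatchcase

# Hypothetical semi-structured records with three zones each.
DOCS = [
    {"title": "Semi-structured data", "bullets": "zones facilitate search",
     "author": "Example Lecturer"},
    {"title": "Object Oriented Programming", "bullets": "classes and objects",
     "author": "Stroustrup"},
]


def zone_contains(doc, zone, word):
    """True if the given zone contains the word (case-insensitive)."""
    return word.lower() in doc[zone].lower().replace("-", " ").split()


def zone_like(doc, zone, pattern):
    """True if any word in the zone matches the wild-card pattern."""
    return any(fnmatchcase(w, pattern) for w in doc[zone].lower().split())


# Title contains data AND Bullets contain search
q1 = [d["title"] for d in DOCS
      if zone_contains(d, "title", "data") and zone_contains(d, "bullets", "search")]

# Title mentions "programming" AND Author something like stro*rup
q2 = [d["title"] for d in DOCS
      if zone_contains(d, "title", "programming") and zone_like(d, "author", "stro*rup")]

print(q1, q2)
```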