Comp-8380: Information Retrieval
what is IR | course schedule | grading scheme
Jianguo Lu
January 10, 2021
Outline
1 what is IR
2 course schedule
3 grading scheme
IR not so long ago
now IR is mostly about search engines
there are many search engines ...
IR is more than web search
These days we frequently think first of web search, but there are many other cases:
- digital library search
- e-mail search, searching your desktop and laptop computers
- corporate knowledge bases, local business search, expert search
- legal information retrieval, patent search
- news search
- image and video search
- (micro-)blog search
- product search, federated search
- social search, community Q&A, question answering
- recommender systems
- opinion mining
definition of information retrieval
Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
From the IIR book: Introduction to Information Retrieval, by C. Manning, P. Raghavan, and H. Schütze. Cambridge University Press. Book website: https://nlp.stanford.edu/IR-book/
Structured vs. unstructured data
[Figure: structured vs. unstructured data, in the 90's and today]
Information retrieval is finding material of an unstructured nature that satisfies an information need from within large collections.
other definitions
Jaime Arguello: Information retrieval (IR) is the science and practice of designing, developing, and evaluating systems that match information seekers with the information they seek.
Gerard Salton, 1968: Information retrieval is a field concerned with the structure, analysis, organization, storage, and retrieval of information.
The search task
Given a query and a corpus, find relevant items
- query: user's expression of their information need
- corpus: a repository of retrievable items
- relevance: satisfaction of the user's information need
Corpus: definition from Webster
a: all the writings or works of a particular kind or on a particular subject; especially: the complete works of an author
b: a collection or body of knowledge or evidence; especially: a collection of recorded utterances used as a basis for the descriptive analysis of a language
Why is IR fascinating?
Information retrieval is an uncertain process
Query
- users don't know what they want
- users don't know how to convey what they want
- computers can't elicit information like a librarian
- computers can't understand natural language text
Relevance
- the search engine can only guess what is relevant
- the search engine can only guess if a user is satisfied
- over time, we can only guess how users adjust their short- and long-term behavior
classic search model
A query is an impoverished description of the user's information need
- Highly ambiguous to anyone other than the user
Retrieval Model
- A formal method that predicts the degree of relevance of a document to a query
taxonomy of IR models
[Figure: taxonomy of IR models, organized by document property (text, links, multimedia)]
Text
- IR models: Boolean, vector, probabilistic
- Set theoretic: fuzzy, extended Boolean, set-based
- Algebraic: generalized vector, LSI, NN
- Probabilistic: BM25, language models, Bayesian networks
- Semi-structured text: proximal nodes, XML-based
Links
- Web: PageRank, hubs and authorities (HITS)
Multimedia
- image retrieval, audio, video
Boolean Retrieval Model
- The user describes their information need using Boolean constraints (e.g., AND, OR, and AND NOT)
- The burden is on the user to formulate a good Boolean query
Example
Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
One choice: use the grep command in Unix.
- grep all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia
Why is that not the answer?
- Slow (for large corpora)
- NOT Calpurnia is non-trivial
- Other operations (e.g., find the word Romans near countrymen) are not feasible
- Ranked retrieval (best documents to return) is not supported
so we need to index the text
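A minimal sketch of the grep-style linear scan, assuming a toy in-memory collection. The function name `linear_scan` and the play texts are hypothetical placeholders, not Shakespeare's actual words. The point is that every query rereads every document, which is the cost an index avoids.

```python
def linear_scan(plays, must_have, must_not_have):
    """Return titles whose text has every word in must_have and none in must_not_have."""
    hits = []
    for title, text in plays.items():
        words = set(text.lower().split())  # rebuilt for every query: this is the slow part
        if all(w in words for w in must_have) and not any(w in words for w in must_not_have):
            hits.append(title)
    return hits

# Hypothetical toy collection (placeholder text, not the real plays)
plays = {
    "Julius Caesar": "brutus caesar calpurnia",
    "Hamlet": "brutus caesar",
    "Macbeth": "macbeth",
}
result = linear_scan(plays, ["brutus", "caesar"], ["calpurnia"])
print(result)
```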
what is an index
index construction process
Initial stages of text processing
Tokenization
- Cut character sequence into word tokens
Normalization
- Map text and query term to same form
- You want U.S.A. and USA to match
Stemming
- We may wish different forms of a root to match
- authorize, authorization
Stop words
- We may want to omit very common words (modern methods may not)
- the, a, to, of
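The four stages above can be sketched as a small pipeline. This is an illustration under stated assumptions: the stop list is just the four words from the slide, and `crude_stem` is a hypothetical suffix stripper, not the Porter stemmer or any library's implementation.

```python
STOP_WORDS = {"the", "a", "to", "of"}        # the slide's examples; real lists are longer
SUFFIXES = ("ization", "ize", "ing", "s")    # assumed toy suffix list, longest first

def normalize(token):
    """Lowercase and drop periods so that U.S.A. and USA map to the same term."""
    return token.lower().replace(".", "")

def crude_stem(term):
    """Strip the first matching suffix, keeping at least three characters of the root."""
    for suffix in SUFFIXES:
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term

def preprocess(text):
    """Tokenize on whitespace, normalize, drop stop words, then stem."""
    tokens = [t.strip(",;:!?()\"'") for t in text.split()]
    terms = [normalize(t) for t in tokens if t]
    return [crude_stem(t) for t in terms if t and t not in STOP_WORDS]

print(preprocess("The U.S.A. and USA"))
print(preprocess("to authorize the authorization"))
```

The same `preprocess` function would have to be applied to queries as well, otherwise query terms and index terms would not match.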
postings
- Multiple term entries in a single document are merged.
- Split into Dictionary and Postings.
- Doc. frequency information is added.
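A minimal sketch of these three steps, assuming documents arrive as already-processed term lists. The function name `build_index` and the two toy documents are illustrative assumptions.

```python
from collections import defaultdict

def build_index(docs):
    """docs: dict doc_id -> list of terms.
    Returns dict term -> (document frequency, sorted postings list)."""
    postings = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            postings[term].add(doc_id)  # a set merges repeated entries within one document
    # The dictionary maps each term to its document frequency and its postings
    return {term: (len(ids), sorted(ids)) for term, ids in sorted(postings.items())}

# Hypothetical toy documents
docs = {
    1: ["brutus", "killed", "caesar", "caesar"],
    2: ["caesar", "was", "ambitious"],
}
index = build_index(docs)
print(index["caesar"])
```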
query processing
Consider processing the query: Brutus AND Caesar
- Locate Brutus in the Dictionary; retrieve its postings.
- Locate Caesar in the Dictionary; retrieve its postings.
- "Merge" the two postings (intersect the document sets):
brutus 1 2 4 11 31 45 173 174
caesar 1 2 4 5 6 16 57 132
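The merge can be sketched as a two-pointer walk over the two sorted postings lists, using the lists from the slide. The function name `intersect` is an assumption for illustration; the walk takes time linear in the total length of the two lists.

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists by advancing a pointer in each."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])  # doc id appears in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                # advance the list with the smaller doc id
        else:
            j += 1
    return answer

brutus = [1, 2, 4, 11, 31, 45, 173, 174]
caesar = [1, 2, 4, 5, 6, 16, 57, 132]
print(intersect(brutus, caesar))  # [1, 2, 4]
```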
tentative schedule
- Boolean model
- text transformation
- build a search engine using Lucene
- vector space model
- representation learning
- evaluation methods in information retrieval
- link analysis and PageRank
- document classification
- document clustering
- web crawling. Data cleaning (e.g. near-duplicate detection)
Text Book
[IIR] Introduction to Information Retrieval, by C. Manning, P. Raghavan, and H. Schütze. Cambridge University Press, 2008.
Other reference books
[SE] Search Engines: Information Retrieval in Practice, by Bruce Croft, Donald Metzler and Trevor Strohman.
[MIR] Modern Information Retrieval, by R. Baeza-Yates and B. Ribeiro-Neto. 2nd edition, 2010.
[MMD] Mining of Massive Datasets, by Anand Rajaraman and Jeff Ullman, 2013.
IIR 02: The term vocabulary and postings lists
- Phrase queries: "Stanford University"
- Proximity queries: Gates near Microsoft
- We need an index that captures position information for phrase queries and proximity queries.
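A minimal sketch of a positional index that supports both query types, assuming documents are given as term lists. The helper names (`build_positional_index`, `phrase`, `near`) and the toy documents are hypothetical. Here `near` is treated as "within k positions in either order", which is one possible reading of a proximity operator.

```python
from collections import defaultdict

def build_positional_index(docs):
    """docs: dict doc_id -> list of terms. Returns term -> {doc_id: [positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, terms in docs.items():
        for pos, term in enumerate(terms):
            index[term][doc_id].append(pos)
    return index

def _match(index, t1, t2, ok):
    """Doc ids where some position pair (p1 of t1, p2 of t2) satisfies ok(p1, p2)."""
    hits = []
    for doc_id, pos1 in index.get(t1, {}).items():
        pos2 = index.get(t2, {}).get(doc_id, [])
        if any(ok(p1, p2) for p1 in pos1 for p2 in pos2):
            hits.append(doc_id)
    return sorted(hits)

def phrase(index, t1, t2):
    """t2 occurs immediately after t1."""
    return _match(index, t1, t2, lambda p1, p2: p2 - p1 == 1)

def near(index, t1, t2, k):
    """t1 and t2 occur within k positions of each other, in either order."""
    return _match(index, t1, t2, lambda p1, p2: abs(p1 - p2) <= k)

# Hypothetical toy documents
docs = {
    1: ["stanford", "university", "campus"],
    2: ["university", "of", "stanford"],
    3: ["gates", "founded", "microsoft"],
}
index = build_positional_index(docs)
print(phrase(index, "stanford", "university"))
print(near(index, "gates", "microsoft", 2))
```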
IIR 04: Index construction
[Figure: distributed index construction. A master assigns splits to parsers (map phase); each parser writes segment files partitioned by term range (a-f, g-p, q-z); the master assigns each term range to an inverter (reduce phase), which produces the postings.]
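The map and reduce phases in the figure can be simulated in a single process. This is only a sketch under stated assumptions: terms are lowercase alphabetic, the function names (`partition_of`, `parse`, `invert`) and the toy splits are hypothetical, and a real deployment would run parsers and inverters on separate machines.

```python
from collections import defaultdict

PARTITIONS = ("a-f", "g-p", "q-z")

def partition_of(term):
    """Pick the term range by first letter (assumes lowercase alphabetic terms)."""
    first = term[0]
    if first <= "f":
        return "a-f"
    if first <= "p":
        return "g-p"
    return "q-z"

def parse(split):
    """Map phase: turn one split of (doc_id, text) pairs into segment files per term range."""
    segments = {p: [] for p in PARTITIONS}
    for doc_id, text in split:
        for term in text.lower().split():
            segments[partition_of(term)].append((term, doc_id))
    return segments

def invert(pairs):
    """Reduce phase: collect (term, doc_id) pairs of one term range into sorted postings."""
    postings = defaultdict(set)
    for term, doc_id in pairs:
        postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

# Hypothetical toy splits
splits = [[(1, "caesar came"), (2, "caesar conquered")], [(3, "brutus was noble")]]
segment_files = [parse(s) for s in splits]   # one parser per split
index = {}
for p in PARTITIONS:                         # one inverter per term range
    pairs = [pair for seg in segment_files for pair in seg[p]]
    index.update(invert(pairs))
print(index)
```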
statistical properties of text
[Figure: log-log plot of collection frequency (log10 cf) against rank (log10 rank)]
- Zipf's law, Heaps' law, power law.
- the mechanism: Yule process, preferential attachment
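A rank-frequency table like the one behind the plot can be computed in a few lines. Under Zipf's law the collection frequency of the i-th most frequent term is roughly proportional to 1/i, so the log-log points fall near a line of slope -1. The function names and the toy sentence are assumptions for illustration; a toy text this small will not show the law convincingly.

```python
from collections import Counter
import math

def rank_frequency(tokens):
    """Return (rank, term, cf) triples, most frequent term first (ties broken alphabetically)."""
    counts = Counter(tokens)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, term, cf) for rank, (term, cf) in enumerate(ordered, start=1)]

def log_log_points(table):
    """Points (log10 rank, log10 cf), the axes used in the plot."""
    return [(math.log10(rank), math.log10(cf)) for rank, _, cf in table]

# Hypothetical toy text
tokens = "the cat and the dog and the bird".split()
table = rank_frequency(tokens)
points = log_log_points(table)
print(table[:2])
```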
IIR 06: Scoring, term weighting and the vector space model
Ranking search results
- Boolean queries only give inclusion or exclusion of documents.
- For ranked retrieval, we measure the proximity between the query and each document.
- One formalism for doing this: the vector space model
Key challenge in ranked retrieval: evidence accumulation for a term in a document
- 1 vs. 0 occurrences of a query term in the document
- 3 vs. 2 occurrences of a query term in the document
- Usually: more is better
- But by how much?
- Need a scoring function that translates frequency into score or weight
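One common answer to "by how much?" is logarithmic term-frequency weighting, as described in the IIR book: the weight is 1 + log10(tf) when tf > 0 and 0 otherwise. The sketch below assumes that choice; the function names and the toy document are hypothetical, and other scoring functions are equally possible.

```python
import math

def log_tf_weight(tf):
    """Dampened term frequency: 0 -> 0, 1 -> 1, 10 -> 2, 1000 -> 4."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def score(query_terms, doc_terms):
    """Sum the log-tf weights of the query terms that occur in the document."""
    return sum(log_tf_weight(doc_terms.count(t)) for t in query_terms)

doc = ["caesar", "caesar", "brutus"]  # hypothetical toy document
print(log_tf_weight(0), log_tf_weight(1), log_tf_weight(10))
print(score(["caesar", "calpurnia"], doc))
```

With this weighting, going from 0 to 1 occurrence adds a full point, while going from 2 to 3 adds much less, which matches the intuition that more is better but with diminishing returns.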
Language models
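- assign a probability to a sequence of n words by means of a probability distribution.
- How to compute this joint probability:
\[ P(\text{its}, \text{water}, \text{is}, \text{so}, \text{transparent}, \text{that}) \tag{1} \]
\[ P(w_1 w_2 \dots w_n) \overset{?}{=} \prod_{i=1}^{n} P(w_i) \tag{2} \]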
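For reference, the exact factorization is given by the chain rule of probability; the product of individual word probabilities in (2) holds only under an independence assumption (the unigram model). The bigram line is included as a further standard approximation, not something stated on the slide.

```latex
% Chain rule: exact, no independence assumption
P(w_1 w_2 \dots w_n) = \prod_{i=1}^{n} P(w_i \mid w_1 \dots w_{i-1})

% Unigram model: assumes each word is independent of its history
P(w_1 w_2 \dots w_n) \approx \prod_{i=1}^{n} P(w_i)

% Bigram model: conditions only on the previous word
P(w_1 w_2 \dots w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})
```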
Text classification & Naive Bayes
Text classification = assigning documents automatically to predefined classes
Examples:
- CS vs. non-CS papers
- Papers in Software Engineering vs. Database
- positive/negative reviews
- Spam
Naive Bayes (multinomial and Bernoulli models), support vector machines, feature selection, representation learning, neural networks.
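A minimal sketch of a multinomial Naive Bayes classifier for the positive/negative review example, assuming add-one (Laplace) smoothing. The function names `train` and `classify` and the four training reviews are made up for illustration.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """labeled_docs: list of (label, list_of_terms).
    Returns (log priors, log conditional probabilities, vocabulary)."""
    class_docs = Counter(label for label, _ in labeled_docs)
    term_counts = defaultdict(Counter)
    for label, terms in labeled_docs:
        term_counts[label].update(terms)
    vocab = {t for _, terms in labeled_docs for t in terms}
    n = len(labeled_docs)
    log_prior = {c: math.log(k / n) for c, k in class_docs.items()}
    log_like = {}
    for c in class_docs:
        total = sum(term_counts[c].values())
        # Add-one smoothing so unseen (term, class) pairs do not get probability zero
        log_like[c] = {t: math.log((term_counts[c][t] + 1) / (total + len(vocab)))
                       for t in vocab}
    return log_prior, log_like, vocab

def classify(model, terms):
    """Return the class with the highest log posterior; out-of-vocabulary terms are ignored."""
    log_prior, log_like, vocab = model
    scores = {c: log_prior[c] + sum(log_like[c][t] for t in terms if t in vocab)
              for c in log_prior}
    return max(scores, key=scores.get)

# Hypothetical toy training set
training = [
    ("pos", ["great", "fun", "great"]),
    ("pos", ["good", "fun"]),
    ("neg", ["boring", "bad"]),
    ("neg", ["bad", "plot"]),
]
model = train(training)
print(classify(model, ["great", "plot", "fun"]))
print(classify(model, ["bad", "boring"]))
```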
Neural network based representation learning
Answer analogical questions, e.g.
Man : Woman = King : ?
The answer will be Queen.
An application of deep learning
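With word vectors, the analogy is usually answered by vector arithmetic: find the word whose vector is closest to vec(Woman) - vec(Man) + vec(King). The sketch below uses hand-made two-dimensional vectors (axes loosely meaning "royalty" and "gender") purely as an assumption for illustration; learned embeddings have hundreds of dimensions and are not this clean.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(vectors, a, b, c):
    """Solve a : b = c : ? with the word closest to vec(b) - vec(a) + vec(c)."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]  # exclude the question words
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Hypothetical hand-made vectors: [royalty, gender]
vectors = {
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "prince": [0.8, 1.0],
    "apple": [-1.0, 0.1],
}
print(analogy(vectors, "man", "woman", "king"))
```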
clustering
- Flat clustering
- Hierarchical agglomerative clustering (HAC)
- Single-link and complete-link clustering
- Centroid and group-average agglomerative clustering (GAAC)
- Bisecting K-means
- How to label clusters automatically
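A minimal sketch of HAC that contrasts single-link and complete-link merging, assuming one-dimensional points and absolute difference as the distance. The function names and the five points are hypothetical; this naive version recomputes all pairwise distances at every step and is meant only to show the merge logic.

```python
def single_link(c1, c2):
    """Distance between the two closest members of the clusters."""
    return min(abs(a - b) for a in c1 for b in c2)

def complete_link(c1, c2):
    """Distance between the two farthest members of the clusters."""
    return max(abs(a - b) for a in c1 for b in c2)

def hac(points, k, linkage=single_link):
    """Start with singleton clusters and merge the closest pair until k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]

points = [1.0, 1.5, 5.0, 5.2, 9.0]  # hypothetical toy data
print(hac(points, 2, single_link))
print(hac(points, 2, complete_link))
```

On this data the two criteria disagree on the last merge: single-link joins the two left groups because their nearest members are close, while complete-link attaches the outlier 9.0 to the middle group because that keeps the cluster diameter smaller.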
HAC
Latent Semantic Indexing
- how to find semantically related documents?
- matrix decomposition
- SVD
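For reference, the decomposition used in LSI can be written as follows, in the notation of the IIR book. The symbols are assumptions of this note: C is the term-document matrix, k is the number of retained dimensions, and q is a query vector in term space.

```latex
% SVD of the term-document matrix C (M terms x N documents)
C = U \Sigma V^{T}

% Rank-k approximation: keep only the k largest singular values
C_k = U_k \Sigma_k V_k^{T}

% Map a query vector into the k-dimensional LSI space
\vec{q}_k = \Sigma_k^{-1} U_k^{T} \vec{q}
```

Documents and queries are then compared in the reduced space, where terms that co-occur in similar documents end up close together.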
Crawling
Link analysis / PageRank
- which web page is more important?
- who are in a community?
- PageRank algorithm
- graph analysis and mining. Modularity maximization algorithms.
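A minimal sketch of PageRank by power iteration, assuming a damping factor of 0.85 (a common choice, not stated on the slide) and that pages without out-links spread their rank evenly. The function name `pagerank` and the four-page link graph are hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict page -> list of pages it links to. Returns dict page -> score."""
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}           # start from the uniform distribution
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in pages}   # teleportation share
        for p in pages:
            outs = links.get(p, [])
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:                                # dangling page: spread its rank evenly
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

# Hypothetical toy web graph
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))
```

Page c comes out on top here because three of the four pages link to it; page d, which nobody links to, keeps only the teleportation share.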
marking scheme
- exam 50%
- project 50%
project
- text analysis
- graph analysis
- build a search engine
- enhance the search engine by adding one or more features, such as:
  - semantic search
  - classification
  - clustering (returned results (papers) are clustered into several areas)
  - ranking (ranked by the PageRank algorithm)
  - personalization
  - recommendation (recommend most similar papers)
  - ...
The project
The tentative plan for the project is:
- 10%: Phase one. Implement one of the project topics discussed in class
  - Workable in Jupyter Notebook.
  - Have good explanation in Notebook Markdown
  - Presentations finish before reading week
  - Earlier presenters choose the topic they want.
  - Later presenters need to implement and present different features.
  - Example topics: text statistics, smoothing, Naive Bayes classification, word embedding, graph embedding.
- 25%: Phase two. Add one more topic and improve your first topic.
  - Rank documents with the PageRank algorithm using citation data
  - Return results by categories (by running clustering algorithms)
  - Search for similar papers (e.g., running doc2vec)
  - Finish by
- 15%: Phase three. Implement the search engine in Lucene, and possibly integrate the results from phases one and two (e.g., for similar documents, suggest search queries).
open source search engines
Lucene
- Java-based
- relatively simple IR techniques
Galago
- Java-based
- used by the book [SE] Search Engines: Information Retrieval in Practice, by Bruce Croft et al.