Ghislain Fourny Information Retrieval - Systems Group · 2019-07-15 · Term vocabulary and posting...
Ghislain Fourny
Information Retrieval: 12. Wrap-Up
Lecture Overview
- Introduction
- Boolean queries
- Term vocabulary and posting lists
- Tolerant retrieval
- Evaluation
- Scale up
- Index compression
- Vector space model
- Probabilistic information retrieval
- Language models
- Indexing the Web
Covering the basics of Information Retrieval, advanced topics, and alternate methodologies.
Data Shapes: Text
(Two paragraphs of Lorem ipsum placeholder text, illustrating free-form, unstructured text.)
Boolean retrieval
Query: lawyer AND Penang AND NOT silver
Input: a set of documents
Output: the subset of documents matching the query
Document
Documents
Term
Sherlock, lawyer, Switzerland, Unterwalden nid dem Wald, ETH Zürich, person, watch, run, paper, book, ...
Model and abstraction
Document as a list of words (with duplicates)
Simplification: document as a set of words
Document as a vector of Booleans:
(0 1 0 1 0 1 1 1 0 0 1 1 0 1 0 1 0 1 0 1 0)
Incidence Matrix
(Figure: a term-document incidence matrix, with terms s, t, u, v, w, x, y as rows and documents 1-10 as columns.)
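Such an incidence matrix, and a Boolean query over it, can be sketched in a few lines; the documents below are made up for illustration, and the query is the one from the Boolean retrieval slide.

```python
# Term-document incidence matrix over a toy (made-up) collection.
docs = {
    1: "lawyer silver watch",
    2: "lawyer Penang",
    3: "silver paper",
    4: "lawyer Penang book",
}
doc_ids = sorted(docs)
terms = sorted({t for text in docs.values() for t in text.split()})

# incidence[t] is the row of term t: one bit per document
incidence = {t: [1 if t in docs[d].split() else 0 for d in doc_ids] for t in terms}

def row(term):
    """A term's row packed into an integer bit vector (1 bit per document)."""
    bits = incidence.get(term, [0] * len(doc_ids))
    return int("".join(map(str, bits)), 2)

# lawyer AND Penang AND NOT silver, as bitwise operations on the rows
mask = (1 << len(doc_ids)) - 1
result = row("lawyer") & row("Penang") & (~row("silver") & mask)
answer = [d for d, bit in zip(doc_ids, format(result, f"0{len(doc_ids)}b")) if bit == "1"]
print(answer)  # [2, 4]
```

The bitwise AND/NOT on whole rows is exactly what makes the incidence-matrix view attractive for Boolean queries, but the matrix is far too sparse to store explicitly at scale, which motivates postings lists.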
Warm up
Postings lists and document frequencies (df):
a → 1 2 3 5 6 8  (df = 6)
b → 3 4 7 8 9    (df = 5)
c → 1 2 4 5 7    (df = 5)
d → 1 3 5 8 9    (df = 5)
e → 2 3 4 7      (df = 4)
f → 1 2 4 5 8 9  (df = 6)
g → 3 5 7 8      (df = 4)
Intersection algorithm
List A: 1 2 4 5 8 9 10 12
List B: 1 3 4 6 7 8 11 12
Intersection of A and B: 1 4 8 12
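The linear merge behind this intersection can be sketched as a two-pointer walk over the sorted lists:

```python
def intersect(a, b):
    """Merge-based intersection of two sorted postings lists, O(|a| + |b|)."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])  # docID occurs in both lists
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1            # advance the pointer on the smaller docID
        else:
            j += 1
    return out

# The two lists from the slide:
print(intersect([1, 2, 4, 5, 8, 9, 10, 12], [1, 3, 4, 6, 7, 8, 11, 12]))
# [1, 4, 8, 12]
```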
Index construction
1. Collect documents
2. Tokenize
3. Linguistic preprocessing
4. Build the index (postings lists)
Type
Token sequences: "You come most carefully upon your hour", "My hour is almost come", "Possess it merely", "That it should come to this"
Types: thine, betime, Laertes, hour, thy, fair, Take, ...
Type = equivalence class of tokens (same character sequence)
Stop words
a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of, on, that, the, to, was, were, will, with
Query expansion
Postings: Lift → 1 5 6, Elevator → 41
Upon querying: expand the query Lift to Lift OR Elevator and union the postings lists (1 5 6 and 41 → 1 5 6 41).
Upon indexing: apply the expansion while building the index, so occurrences of either word are posted under both terms.
Porter Stemmer
https://tartarus.org/martin/PorterStemmer/
(m>0) ENCI -> ENCE      valenci -> valence
(m>0) ANCI -> ANCE      hesitanci -> hesitance
(m>0) IZER -> IZE       digitizer -> digitize
(m>0) ABLI -> ABLE      conformabli -> conformable
(m>0) ALLI -> AL        radicalli -> radical
(m>0) ENTLI -> ENT      differentli -> different
(m>0) ELI -> E          vileli -> vile
(m>0) OUSLI -> OUS      analogousli -> analogous
(m>0) IZATION -> IZE    vietnamization -> vietnamize
(m>0) ATION -> ATE      predication -> predicate
(m>0) ATOR -> ATE       operator -> operate
(m>0) ALISM -> AL       feudalism -> feudal
(m>0) IVENESS -> IVE    decisiveness -> decisive
(m>0) FULNESS -> FUL    hopefulness -> hopeful
(m>0) OUSNESS -> OUS    callousness -> callous
Skip lists
Postings: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
In practice: for a postings list of length P, use √P evenly spaced skip pointers.
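A sketch of intersection with skip pointers, under the assumption that the √P evenly spaced skips are simulated by jumping within the array rather than stored as explicit pointers:

```python
import math

def intersect_with_skips(a, b):
    """Intersect two sorted postings lists; list a is walked with
    sqrt(len(a)) evenly spaced skip positions."""
    skip = max(1, int(math.sqrt(len(a))))
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # follow a skip pointer only when it does not overshoot b[j]
            if i % skip == 0 and i + skip < len(a) and a[i + skip] <= b[j]:
                i += skip
            else:
                i += 1
        else:
            j += 1
    return out

print(intersect_with_skips(list(range(1, 17)), [3, 7, 12, 16]))  # [3, 7, 12, 16]
```

Because a skip is only taken when its target is still ≤ the other pointer's docID, no match can be jumped over; the payoff is largest when one list is much shorter than the other.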
Bi-word indices (Phrase search feature)
Help ETH Zurich to flexibly react to new challenges and to set new accents in the future.
Index
Help ETH
ETH Zurich
Zurich to
to flexibly
flexibly react
react to
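A minimal biword index builder; the single toy document reuses the slide's sentence fragment, and every pair of consecutive tokens becomes a dictionary entry, so two-word phrase queries become exact lookups.

```python
from collections import defaultdict

def biword_index(docs):
    """Map each consecutive word pair to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = text.split()
        for w1, w2 in zip(tokens, tokens[1:]):
            index[f"{w1} {w2}"].add(doc_id)
    return index

docs = {1: "Help ETH Zurich to flexibly react"}
idx = biword_index(docs)
print(idx["ETH Zurich"])  # {1}
```

Longer phrases need to be decomposed into overlapping biwords and the candidates verified, and the dictionary grows quadratically in vocabulary richness, which is why positional indices are usually preferred.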
Positional index (phrase search feature)
Help     → C, tf 1: 1
ETH      → C, tf 1: 2
Zurich   → C, tf 1: 3
to       → C, tf 3: 4, 7, 11
flexibly → C, tf 1: 5
react    → C, tf 1: 6
Phrase query "ETH Zurich": the positions of ETH and Zurich in the same document must be consecutive.
Search structures
Hash tables; trees (B, B+)
B+-tree
(Figure: a B+-tree over the dictionary terms almost, be, carefully, come, fair, hour, is, it, Laertes, merely, most, my, possess, should, take, that, thine, this, thy, time, to, upon, you, your; the inner separator keys are come, is, merely, that, thy, upon.)
Each node holds between 2 and 4 children, but it's fine if the root has less.
Wildcard queries
foo*eth*bar (multiple wildcards)
Permuterm index
Rotations of plant$: $plant, t$plan, nt$pla, ant$pl, lant$p, plant$
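Generating the rotations is a one-liner over the $-terminated term:

```python
def permuterm_rotations(term):
    """All rotations of term + '$'; each rotation becomes a dictionary key
    pointing back to the original term."""
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

print(permuterm_rotations("plant"))
# ['plant$', 'lant$p', 'ant$pl', 'nt$pla', 't$plan', '$plant']
```

To answer a single-wildcard query such as pl*nt, rotate it so the wildcard ends up at the end (nt$pl*) and run a prefix search over the rotation dictionary.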
k-grams
1-grams: $, c, o, m, p, u, t, e, r, $
2-grams: $c, co, om, mp, pu, ut, te, er, r$
3-grams: $co, com, omp, mpu, put, ute, ter, er$
4-grams: $com, comp, ompu, mput, pute, uter, ter$
5-grams: $comp, compu, omput, mpute, puter, uter$
6-grams: $compu, comput, ompute, mputer, puter$
7-grams: $comput, compute, omputer, mputer$
...
Very small k: not very useful. Very large k: not space efficient. The usable zone lies in between.
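Extracting the k-grams of a $-padded term is a short sliding window:

```python
def kgrams(term, k):
    """All k-grams of '$' + term + '$' (for k >= 2; for k == 1 the padding
    characters simply become grams of their own)."""
    padded = "$" + term + "$"
    return [padded[i:i + k] for i in range(len(padded) - k + 1)]

print(kgrams("computer", 2))
# ['$c', 'co', 'om', 'mp', 'pu', 'ut', 'te', 'er', 'r$']
print(kgrams("computer", 3))
# ['$co', 'com', 'omp', 'mpu', 'put', 'ute', 'ter', 'er$']
```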
Edit distance
      #  a  t  e
  #   0  1  2  3
  c   1  1  2  3
  a   2  1  2  3
  t   3  2  1  2
The edit distance between cat and ate is 2.
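The table above is the standard Levenshtein dynamic-programming table; filling it in code:

```python
def edit_distance(s, t):
    """Levenshtein distance via the standard dynamic-programming table."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all of s[:i]
    for j in range(n + 1):
        d[0][j] = j  # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[m][n]

print(edit_distance("cat", "ate"))  # 2, the bottom-right cell of the table
```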
Jaccard coefficient
3-grams of $computer$: $co, com, omp, mpu, put, ute, ter, er$
3-grams of $cmputer$:  $cm, cmp, mpu, put, ute, ter, er$
|∩| / |∪| = 5 / 10 = 0.5
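The slide's computation, comparing the correct spelling with the misspelling cmputer over 3-gram sets:

```python
def kgrams(term, k=3):
    """The set of k-grams of '$' + term + '$'."""
    padded = "$" + term + "$"
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}

def jaccard(a, b, k=3):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| over the terms' k-gram sets."""
    ga, gb = kgrams(a, k), kgrams(b, k)
    return len(ga & gb) / len(ga | gb)

print(jaccard("computer", "cmputer"))  # 0.5
```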
Soundex algorithm
Change...          To...
A E H I O U W Y → 0
B F P V         → 1
C G J K Q S X Z → 2
D T             → 3
L               → 4
M N             → 5
R               → 6
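A basic Soundex sketch using the digit table above: keep the first letter, map the rest, collapse runs of identical digits, drop the zeros, and pad to four characters. (The full algorithm has extra refinements, e.g. for H and W between consonants, that this sketch ignores.)

```python
def soundex(name):
    """Basic Soundex code of a name (letter + three digits)."""
    codes = {}
    for letters, digit in [("AEHIOUWY", "0"), ("BFPV", "1"), ("CGJKQSXZ", "2"),
                           ("DT", "3"), ("L", "4"), ("MN", "5"), ("R", "6")]:
        for ch in letters:
            codes[ch] = digit
    name = name.upper()
    digits = [codes[ch] for ch in name if ch in codes]
    # collapse consecutive identical digits, then drop the 0s after the first letter
    collapsed = [d for i, d in enumerate(digits) if i == 0 or d != digits[i - 1]]
    tail = [d for d in collapsed[1:] if d != "0"]
    return (name[0] + "".join(tail) + "000")[:4]

print(soundex("Herman"))  # H655
```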
Memory hierarchy
Volatile: cache (CPU, levels 1 and 2), memory (RAM)
Non-volatile: disk (secondary storage), tapes and DVDs (tertiary storage)
TermIDs
(Figure: postings lists keyed by termID: t1 → 1 2 3, t2 → 3 4 7, t3 → 1 2 4, t4 → 1 3 5, t5 → 2 3 4, t6 → 1 2 4, t7 → 3 5 7; the termID column is repeated across index blocks.)
Blocked Sort-Based Indexing
Single-Pass In-Memory Indexing
MapReduce
(Figure: the map phase emits (term, docID) pairs for terms such as ETH, computer, data, CPU, information across documents 1 and 2; the reduce phase groups the pairs by term into postings lists.)
Logarithmic Merging
Indices I0, I1, I2, ... hold n, 2n, 4n, ... postings; whenever two indices of the same generation exist, they are merged (via Z0, Z1, ...) into the next generation.
Heap's law
The number of distinct terms M grows with the number of tokens T as
M = k T^b, with 30 ≤ k ≤ 100 (and b typically around 0.5).
(Plot: vocabulary size M as a function of the number of tokens T.)
Zipf's law
Frequency = k / Rank: the i-th most frequent term has a frequency proportional to 1/i.
Compression: Front coding
Consecutive sorted dictionary terms share long prefixes: automata, automate, automatic, automation can be stored as 8automat*a8○e9○ic10○ion, writing the common prefix automat only once. Term pointers are kept only every k terms.
37
Variable byte encodingvariable byte encoding000000010010001101000101011001111001 00001001 00011001 00101001 00111001 01001001 01011001 01101001 01111010 00001010 00011010 00101010 00111010 01001010 01011010 01101010 0111...1001 1000 0000
decimal01234567891011121314151617181920212223...64
binary011011100101110111100010011010101111001101111011111000010001100101001110100101011011010111...1000000
fits
on 3
bits
fits
on 6
bits
50%less space
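A sketch of variable byte encoding and decoding, assuming the common convention (as in Manning et al.) that the last byte of each number has its high bit set:

```python
def vb_encode(n):
    """Variable byte code of one gap: 7 payload bits per byte,
    high bit set on the final byte of the number."""
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128  # mark the terminating byte
    return bytes(out)

def vb_decode(stream):
    """Decode a concatenation of variable byte codes back into numbers."""
    numbers, n = [], 0
    for byte in stream:
        if byte < 128:
            n = 128 * n + byte          # continuation byte
        else:
            numbers.append(128 * n + (byte - 128))
            n = 0
    return numbers

encoded = b"".join(vb_encode(g) for g in [5, 64, 214577])
print(list(encoded))       # [133, 192, 13, 12, 177]
print(vb_decode(encoded))  # [5, 64, 214577]
```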
Gamma encoding
19 in binary: 10011
Length in unary: 11110; offset (the binary form without its leading 1): 0011
Gamma code of 19: 111100011
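The same construction in code: emit the offset's length in unary, then the offset itself.

```python
def gamma_encode(n):
    """Elias gamma code: unary length, then the binary offset
    (the binary form of n without its leading 1). Requires n >= 1."""
    assert n >= 1
    offset = bin(n)[3:]               # strip '0b' and the leading 1
    length = "1" * len(offset) + "0"  # unary code of the offset length
    return length + offset

def gamma_decode(bits):
    """Decode a single gamma code given as a bit string."""
    length = bits.index("0")          # number of leading 1s
    offset = bits[length + 1:length + 1 + length]
    return int("1" + offset, 2) if length else 1

print(gamma_encode(19))             # 111100011, as on the slide
print(gamma_decode("111100011"))    # 19
```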
Ranked retrieval
Query: lawyer Penang silver
Input: a set of documents
Output: a ranked subset of documents (ranks 1, 2, 3, 4, ...)
Parametric search
(Form: Title, e.g. Algorithms; Author; Publication Date; Language; Country; Cost from $ ... to $ ...; a Search button.)
Parametric indices
One search structure per field (Title, Author, Publication Date, Language, Country, Cost), each leading to posting lists.
Term frequency, (inverted) document frequency
        idf | tf(A) tf(B) | tf-idf(A) tf-idf(B)
foo       5 |   5     1   |    25        5
bar      10 |   0     4   |     0       40
foobar    3 |   2     1   |     6        3
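The table multiplies out as tf-idf(t, d) = tf(t, d) × idf(t); reproducing it with the slide's raw counts and given idf values:

```python
# The slide's term frequencies and idf values
idf = {"foo": 5, "bar": 10, "foobar": 3}
tf = {"A": {"foo": 5, "bar": 0, "foobar": 2},
      "B": {"foo": 1, "bar": 4, "foobar": 1}}

# tf-idf(t, d) = tf(t, d) * idf(t)
tf_idf = {doc: {t: counts[t] * idf[t] for t in counts}
          for doc, counts in tf.items()}
print(tf_idf["A"])  # {'foo': 25, 'bar': 0, 'foobar': 6}
print(tf_idf["B"])  # {'foo': 5, 'bar': 40, 'foobar': 3}
```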
Model and abstraction
Document as a list of words (with duplicates)
Simplification: document as a bag of words
Document as a vector of numbers:
(0. 1.2 0.15 0.34 2.4 23.5 4324.5 0.13)
Vector-Space Model
Documents d1, ..., d5 are vectors in the first quadrant of R^M.
Queries as vectors
Queries q1, q2 are points in the first quadrant of R^M; d3 is a good result of q2!
Inner product as score
x · y = Σ_{i=1}^{M} x_i y_i
Evidence accumulation
For each query term (ETH, computer, data), walk its postings list and accumulate, per document d, the contribution
(tf_{t,q} · idf_t · tf_{t,d} · idf_t) / (‖q‖ · ‖d‖)
using the term frequencies tf_{t,q} and tf_{t,d}, the inverse document frequency idf_t, and the norms ‖q‖ and ‖d‖.
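A sketch of this accumulation over a toy index; the idf values, term frequencies, and document norms below are made-up numbers for illustration.

```python
from collections import defaultdict
import math

# term -> (idf, {doc: tf}); toy values for illustration
index = {
    "ETH":      (1.5, {1: 2, 3: 1}),
    "computer": (1.0, {1: 1, 2: 3}),
    "data":     (2.0, {2: 1, 3: 2}),
}
doc_norm = {1: 2.5, 2: 3.1, 3: 1.8}  # precomputed ||d|| values (assumed)

def score(query_tf):
    """Accumulate tf_q * idf * tf_d * idf / (||q|| * ||d||) per document."""
    q_norm = math.sqrt(sum((tf * index[t][0]) ** 2 for t, tf in query_tf.items()))
    scores = defaultdict(float)
    for t, tf_q in query_tf.items():
        idf, postings = index[t]
        for doc, tf_d in postings.items():
            scores[doc] += tf_q * idf * tf_d * idf / (q_norm * doc_norm[doc])
    return sorted(scores.items(), key=lambda kv: -kv[1])

for doc, s in score({"ETH": 1, "computer": 1, "data": 1}):
    print(doc, round(s, 3))
```

Only documents that actually appear in some query term's postings list ever receive a score, which is what makes evidence accumulation efficient on sparse indexes.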
SMART notation
atc.lnb, where the query weights lnb stand for: sublinear term frequency (l), natural document frequency (n), byte-size normalization (b).
Probabilistic Information Retrieval
Sort the documents d, e, f, g, ... by P(R = 1 | D = d ∧ Q = q).
... falling back to Ranked Retrieval and evidence accumulation!
RSV_d = Σ_{t : d_t = 1 ∧ q_t = 1} log(N / df_t)
This justifies idf weighting in the Vector-Space Model!
Language models
A query q comes in.
Thought experiment: imagine that
- we picked a random document and built its model,
- we used this model to generate a new document,
- that document turns out to be q.
Which document is the most likely to have been picked and to have generated q?
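A query-likelihood sketch of this idea: estimate a unigram model per document, smooth it against the collection model (Jelinek-Mercer smoothing, an assumed choice here), and rank documents by P(q | d). The documents are made up for illustration.

```python
import math
from collections import Counter

docs = {
    1: "information retrieval systems",
    2: "database systems and information systems",
    3: "retrieval of music information",
}
models = {d: Counter(text.split()) for d, text in docs.items()}
collection = Counter(t for text in docs.values() for t in text.split())
total = sum(collection.values())

def log_p_query(query, d, lam=0.5):
    """log P(q | d) under a unigram model with Jelinek-Mercer smoothing."""
    counts = models[d]
    length = sum(counts.values())
    logp = 0.0
    for t in query.split():
        p = lam * counts[t] / length + (1 - lam) * collection[t] / total
        if p == 0:
            return float("-inf")  # term unseen in the whole collection
        logp += math.log(p)
    return logp

ranking = sorted(docs, key=lambda d: log_p_query("information retrieval", d), reverse=True)
print(ranking)  # [1, 3, 2]
```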
Precision
Precision = (relevant ∧ returned) / returned: the fraction of returned results that are relevant.
Recall
Recall = (relevant ∧ returned) / relevant: the fraction of relevant documents that are returned.
Specificity
Specificity = (not relevant ∧ not returned) / not relevant: the fraction of non-relevant documents that are correctly not returned.
F measure: harmonic mean
F_α = 1 / (α/P + (1-α)/R)
Weighting: α = 1 yields precision only; α = 0 yields recall only.
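Computing the three measures on a small made-up result set (the sketch assumes at least one returned and one relevant document, and a non-zero overlap so F is defined):

```python
def precision_recall_f(returned, relevant, alpha=0.5):
    """Precision, recall, and F_alpha = 1 / (alpha/P + (1-alpha)/R)."""
    returned, relevant = set(returned), set(relevant)
    tp = len(returned & relevant)           # relevant AND returned
    p = tp / len(returned)
    r = tp / len(relevant)
    f = 1 / (alpha / p + (1 - alpha) / r)   # alpha = 0.5 gives the F1 score
    return p, r, f

p, r, f = precision_recall_f(returned={1, 2, 3, 4}, relevant={3, 4, 5, 6, 7, 8})
print(p, r, f)  # precision 0.5, recall 1/3, F1 0.4
```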
Precision-Recall curves
(Plot: precision on the y-axis against recall on the x-axis.)
ROC Curves
(Plot: recall (sensitivity) on the y-axis against 1 - specificity on the x-axis.)