Learning to Match Using Local and Distributed
Representations of Text for Web Search
Nick Craswell, Microsoft (Bellevue, USA)
Fernando Diaz, Spotify* (New York, USA)
Bhaskar Mitra, Microsoft, UCL (Cambridge, UK)
*work done while at Microsoft
The Duet Model:
The document ranking task
Given a query, rank documents according to their relevance
The query text has few terms
The document representation can be
long (e.g., body text) or short (e.g., title)
query
ranked results
search engine w/ an
index of retrievable items
This paper is focused on ranking documents
based on their long body text
Many DNN models for short text ranking
(Huang et al., 2013)
(Severyn and Moschitti, 2015)
(Shen et al., 2014)
(Palangi et al., 2015)
(Hu et al., 2014)
(Tai et al., 2015)
But few for long document ranking…
(Guo et al., 2016)
(Salakhutdinov and Hinton, 2009)
Challenges in short vs. long text retrieval
Short-text
Vocabulary mismatch more serious problem
Long-text
Documents contain mixture of many topics
Matches in different parts of the document non-uniformly important
Term proximity is important
The “black swans” of Information Retrieval
The term black swan originally referred to impossible
events. In 1697, Dutch explorers encountered black
swans for the first time in Western Australia. Since
then, the term has been used to refer to surprisingly rare events.
In IR, many query terms and intents are
never observed in the training data
Exact matching is effective in making the
IR model robust to rare events
Desiderata of document ranking
Exact matching
Important if query term is rare / fresh
Frequency and positions of matches
good indicators of relevance
Term proximity is important
Inexact matching
Synonymy relationships
united states president ↔ Obama
Evidence for document aboutness
Documents about Australia likely to contain
related terms like Sydney and koala
Proximity and position is important
Different text representations for matching
Local representation
Terms are considered distinct entities
Term representation is local (one-hot vectors)
Matching is exact (term-level)
Distributed representation
Represent text as dense vectors (embeddings)
Inexact matching in the embedding space
Local (one-hot) representation Distributed representation
A tale of two queries
“pekarovic land company”
Hard to learn good representation for
rare term pekarovic
But easy to estimate relevance based
on patterns of exact matches
Proposal: Learn a neural model to
estimate relevance from patterns of
exact matches
“what channel are the seahawks on today”
Target document likely contains ESPN
or sky sports instead of channel
An embedding model can associate
ESPN in document to channel in query
Proposal: Learn embeddings of text
and match query with document in
the embedding space
The Duet Architecture
Use a neural network to model both functions and learn their parameters jointly
The Duet architecture
Linear combination of two models
trained jointly on labelled query-
document pairs
Local model operates on lexical
interaction matrix
Distributed model projects n-graph
vectors of text into an embedding
space and then estimates match
[Figure: the local model. Query text and document text are converted to query and document term vectors; these generate an interaction matrix, which is passed through fully connected layers for matching.]
[Figure: the distributed model. Query text and document text are mapped to query and document embeddings, which are combined via a Hadamard product and passed through fully connected layers for matching. The outputs of the two models are combined by a sum.]
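The combination above can be sketched in a few lines. This is a minimal illustration, not the paper's CNTK code; `local_model` and `distributed_model` are hypothetical callables standing in for the two trained sub-networks.

```python
def duet_score(query, doc, local_model, distributed_model):
    """The Duet score is the sum of the two sub-model scores.

    Both sub-models are trained jointly, so each learns to contribute
    the evidence the other cannot (exact vs. inexact matches).
    """
    return local_model(query, doc) + distributed_model(query, doc)
```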
Local model
Local model: term interaction matrix
$$X_{i,j} = \begin{cases} 1 & \text{if } q_i = d_j \\ 0 & \text{otherwise} \end{cases}$$
In relevant documents,
→Many matches, typically clustered
→Matches localized early in document
→Matches for all query terms
→In-order (phrasal) matches
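The binary interaction matrix defined above can be sketched with NumPy (this is an illustration, not the paper's CNTK code; whitespace tokenization and lowercasing are assumptions here):

```python
import numpy as np

def interaction_matrix(query, doc):
    """X[i, j] = 1 if the i-th query term equals the j-th document term."""
    q_terms = query.lower().split()
    d_terms = doc.lower().split()
    X = np.zeros((len(q_terms), len(d_terms)), dtype=np.float32)
    for i, q in enumerate(q_terms):
        for j, d in enumerate(d_terms):
            if q == d:
                X[i, j] = 1.0
    return X

X = interaction_matrix("united states president",
                       "the president of the united states")
```

Each row corresponds to a query term; clustered, early, in-order ones in a row pattern are exactly the signals the downstream layers can pick up.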
Local model: estimating relevance
[Figure: the interaction matrix, with document words along the horizontal axis]
Convolve using a window of size $n_d \times 1$
Each window instance compares one query term with the whole document
Fully connected layers aggregate evidence across query terms and can model phrasal matches
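Because the window spans the full document dimension ($n_d \times 1$), the convolution reduces to projecting each query-term row of the interaction matrix through shared filters. A toy NumPy sketch, with illustrative sizes rather than the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_q, n_d, n_filters = 10, 100, 32          # illustrative, not the paper's sizes

# Toy binary interaction matrix: sparse exact matches
X = (rng.random((n_q, n_d)) < 0.05).astype(np.float32)

# An (n_d x 1) window means every query-term row is projected through
# the same filter weights, yielding one n_filters vector per query term.
W = rng.standard_normal((n_d, n_filters)).astype(np.float32) * 0.01
H = np.tanh(X @ W)                          # shape: (n_q, n_filters)

# Fully connected layers would then consume the flattened representation,
# aggregating evidence across query terms (and hence phrasal patterns).
h = H.reshape(-1)
```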
Distributed model
Distributed model: input representation
dogs → [ d , o , g , s , #d , do , og , gs , s# , #do , dog , ogs , gs#, #dog, dogs, ogs#, #dogs, dogs# ]
(we consider 2K most popular n-graphs only for encoding)
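The n-graph expansion above can be sketched as follows (a sketch of the enumeration only; in the model each term is then encoded as counts over the 2K most frequent n-graphs):

```python
def n_graphs(word, max_n=5):
    """All character n-grams (n = 1..max_n) of the word padded with '#'
    boundary markers, e.g. 'dogs' -> '#dogs#'. Bare '#' unigrams and the
    fully padded word itself are excluded, matching the example."""
    padded = "#" + word + "#"
    grams = []
    for n in range(1, min(max_n, len(padded) - 1) + 1):
        for i in range(len(padded) - n + 1):
            g = padded[i:i + n]
            if n == 1 and g == "#":    # skip the bare boundary marker
                continue
            grams.append(g)
    return grams
```

For "dogs" this yields the 18 n-graphs listed above, from the unigrams d, o, g, s up to #dogs and dogs#.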
[Figure: the distributed model on the input text "dogs have owners cats have staff". The text is n-graph encoded and concatenated into a [words × channels] matrix (channels = 2K), followed by convolution and pooling to produce query and document embeddings; query and document are combined via a Hadamard product and passed through fully connected layers.]
Distributed model: estimating relevance
Convolve over query and
document terms
Match query with moving
windows over document
Learn text embeddings
specifically for the task
Matching happens in
embedding space
* Network architecture slightly simplified for visualization; refer to the paper for exact details
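The embedding-space match can be sketched with NumPy (an illustration with made-up sizes; the real model learns these embeddings end-to-end):

```python
import numpy as np

rng = np.random.default_rng(1)
dim, n_windows = 300, 20                    # illustrative sizes

q_emb = rng.standard_normal(dim)            # query embedding (one vector)
d_emb = rng.standard_normal((n_windows, dim))  # one embedding per document window

# The Hadamard (element-wise) product keeps a per-dimension match signal
# that the following fully connected layers can weight, instead of
# collapsing everything into a single dot-product score up front.
match = q_emb[None, :] * d_emb              # shape: (n_windows, dim)
```

Summing `match` over the embedding dimension would recover the plain dot product per window; keeping the dimensions separate is what lets the learned layers weight them.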
Putting the two models together…
The Duet model
Training sample: $(Q, D^+, D_1^-, D_2^-, D_3^-, D_4^-)$
$D^+$ = document rated Excellent or Good
$D^-$ = document rated 2 ratings worse than $D^+$
Optimize cross-entropy loss
Implemented using CNTK (GitHub link)
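The objective can be sketched in NumPy (the actual implementation is in CNTK): a softmax over the scores of the positive and the four negative documents, with cross-entropy against the positive.

```python
import numpy as np

def duet_loss(pos_score, neg_scores):
    """Cross-entropy of softmax over (D+, D1-, ..., D4-) scores,
    i.e. -log P(D+ | Q). Minimizing it pushes the positive document's
    score above the negatives'."""
    scores = np.array([pos_score] + list(neg_scores), dtype=np.float64)
    scores -= scores.max()                  # numerical stability
    p = np.exp(scores) / np.exp(scores).sum()
    return -np.log(p[0])
```

With all five scores equal the loss is log 5; as the positive score grows relative to the negatives, the loss approaches 0.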
Data
Need large-scale training data
(labels or clicks)
We use Bing human labelled
data for both train and test
Results
Key finding: Duet performs significantly better than local and distributed
models trained individually
Random negatives vs. judged negatives
Key finding: training with judged bad documents as negatives is significantly
better than training with random negatives
Local vs. distributed model
Key finding: the local and distributed models perform better on
different query segments, but their combination is always better
Effect of training data volume
Key finding: large quantity of training data necessary for learning good
representations, less impactful for training local model
Term importance
Local model
Only query terms have an impact
Earlier occurrences have bigger impact
Query: united states president
Visualizing impact of dropping terms on model score
Term importance
Distributed model
Non-query terms (e.g., Obama and
federal) have a positive impact on the score
Common words like ‘the’ and ‘of’ are
probably good indicators of well-formedness of content
Query: united states president
Visualizing impact of dropping terms on model score
Types of models
If we classify models by query-level performance, there is a clear clustering of lexical (local) and semantic (distributed) models
Duet on other IR tasks
Promising early results on TREC
2017 Complex Answer Retrieval
(TREC-CAR)
Duet performs significantly
better when trained on large
data (~32 million samples)
(PAPER UNDER REVIEW)
Summary
Both exact and inexact matching are important for IR
Deep neural networks can be used to model both types of matching
Local model more effective for queries containing rare terms
Distributed model benefits from training on large datasets
Combine local and distributed model to achieve state-of-the-art performance
Get the model:
https://github.com/bmitra-msft/NDRM/blob/master/notebooks/Duet.ipynb