Machine Learning for Information Retrieval
Hang Li
Noah’s Ark Lab
Huawei Technologies
The Third Asian Summer School in Information Access
Kyoto, Japan, Aug 5, 2016
Outline of Tutorial
• Introduction
• Part 1: Learning to Rank
• Part 2: Learning to Match
• Part 3: Deep Learning for Information Retrieval
• Summary
Overview of Information Retrieval
Information and Knowledge Base
Information Retrieval System
Query
Relevant Result
Intent: Key Words, Question
Content: Documents, Images, Relational Tables
Machine Learning can play an important role
• Key questions: how to represent intent and content, how to match intent and content
• Ranking, indexing, etc. are less essential
• Interactive IR is not particularly considered here
Approach in Traditional IR
Query: star wars the force awakens reviews
Document: Star Wars: Episode VII. Three decades after the defeat of the Galactic Empire, a new threat arises.
• Representing query and document as tf-idf vectors $q$ and $d$
• Calculating cosine similarity between them: $f_{VSM}(q, d) = \frac{\langle q, d \rangle}{\|q\|\,\|d\|}$
• BM25, LM4IR, etc. can be considered as non-linear variants of $f(q, d)$
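To make the traditional approach concrete, here is a minimal sketch (not from the slides) that builds tf-idf vectors for a toy query and two toy documents and scores them with cosine similarity; the corpus and the whitespace tokenization are illustrative assumptions.

```python
# Sketch: tf-idf vectors + cosine similarity for the VSM approach above.
import math
from collections import Counter

docs = [
    "star wars episode vii three decades after the defeat of the galactic empire",
    "reviews of the latest science fiction movies",
]
query = "star wars the force awakens reviews"

def tf_idf_vector(tokens, idf):
    counts = Counter(tokens)
    return {w: counts[w] * idf.get(w, 0.0) for w in counts}

def cosine(u, v):
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

tokenized = [d.split() for d in docs]
vocab = {w for toks in tokenized for w in toks}
n = len(tokenized)
# smoothed idf; the exact smoothing is an assumption
idf = {w: math.log(n / (1 + sum(w in toks for toks in tokenized))) + 1.0 for w in vocab}

q_vec = tf_idf_vector(query.split(), idf)
for d, toks in zip(docs, tokenized):
    print(round(cosine(q_vec, tf_idf_vector(toks, idf)), 3), d[:40])
```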
Approach in Modern IR
• Conducting query and document understanding
• Representing query and document as feature vectors $v_q = (v_{q,1}, \ldots, v_{q,m_q})$ and $v_d = (v_{d,1}, \ldots, v_{d,n_d})$
• Calculating multiple matching scores $f(q, d)$ between query and document using learning to match
• Training ranker with matching scores as features using learning to rank
Query: star wars the force awakens reviews
Document: Star Wars: Episode VII. Three decades after the defeat of the Galactic Empire, a new threat arises.
“Easy” Problems in IR
• Search
– Matching between query and document
• Question Answering from Documents
– Matching between question and answer
• Learning to match & learning to rank: well studied so far
• Deep Learning may not help so much
“Hard” Problems in IR
• Image Retrieval
– Matching between text and image
– Not the same as traditional setting
• Question Answering from Knowledge Base
– Complicated matching between question and fact in knowledge base
• Generation-based Question Answering
– Generating answer to question based on facts in knowledge base
• Not well studied so far
• Deep Learning for Information Retrieval can make a big difference
Part 1. Learning to Rank
1.1. Overview of Learning to Rank
Ranking Plays Key Role in Many Applications
Ranking Problem: Example = Document Retrieval
query: $q$
documents: $D = \{d_1, d_2, \ldots, d_N\}$
ranking of documents: $d_{q,1}, d_{q,2}, \ldots, d_{q,n_q}$, ordered by $f(q, d)$
ranking based on relevance, importance, preference
Ranking Problem: Example = Recommender System
Item1 Item2 Item3 ... ItemN
User1 5 4
User2 1 2 2
... ? ? ?
UserM 4 3
Ranking ProblemExample = Machine Translation
[Figure: a source-language sentence $f$ is fed to a Generative Model, which produces ranked sentence candidates $e_1, e_2, \ldots, e_{1000}$ in the target language; a Re-Ranking Model then outputs the re-ranked top sentence $\tilde{e}$ in the target language]
1.2. Problem and Approaches of Learning to Rank
Ranking Problem: Example = Document Search
query: $q$
documents: $D = \{d_1, d_2, \ldots, d_N\}$
ranking of documents: $d_{q,1}, d_{q,2}, \ldots, d_{q,n_q}$, ordered by $f(q, d)$
ranking based on relevance, importance, preference
Traditional Approach = Probabilistic Model
query: $q$
documents: $d_1, d_2, \ldots, d_N$
ranking of documents: each $d_i$ is scored by the relevance probability $P(r \mid q, d_i)$, with $r \in \{0, 1\}$, and the documents are ranked by this probability
BM25 [Robertson & Walker 94]
query: $q$; documents: $d_1, d_2, \ldots, d_N$
ranking function:
$f_{BM25}(q, d) = \sum_{w \in q \cap d} \mathrm{idf}(w) \cdot \frac{(k_1 + 1)\,\mathrm{tf}(w)}{k_1\left(1 - b + b\,\frac{dl}{avgdl}\right) + \mathrm{tf}(w)}$
where $\mathrm{tf}(w)$ is the frequency of $w$ in $d$, $dl$ is the document length, $avgdl$ is the average document length, and $k_1$, $b$ are parameters; documents are ranked by this score
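A minimal sketch of BM25 scoring following the formula above; the defaults $k_1 = 1.2$, $b = 0.75$, the particular idf variant, and the toy documents are assumptions, not values from the slide.

```python
# Sketch: BM25 score of a query against one document.
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freq, n_docs, avgdl, k1=1.2, b=0.75):
    tf = Counter(doc_terms)
    dl = len(doc_terms)
    score = 0.0
    for w in query_terms:
        if w not in tf:
            continue
        # a common smoothed idf variant (assumption)
        idf = math.log((n_docs - doc_freq.get(w, 0) + 0.5) / (doc_freq.get(w, 0) + 0.5) + 1.0)
        # term-frequency saturation with document-length normalization
        sat = ((k1 + 1) * tf[w]) / (k1 * (1 - b + b * dl / avgdl) + tf[w])
        score += idf * sat
    return score

docs = [["star", "wars", "reviews"], ["galactic", "empire", "threat"]]
doc_freq = Counter(w for d in docs for w in set(d))
avgdl = sum(len(d) for d in docs) / len(docs)
print(bm25_score(["star", "wars"], docs[0], doc_freq, len(docs), avgdl))
```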
PageRank[Page et al, 1999]
$P(d_i) = \alpha \sum_{d_j \in M(d_i)} \frac{P(d_j)}{L(d_j)} + (1 - \alpha)\,\frac{1}{n}$
where $M(d_i)$ is the set of pages linking to $d_i$, $L(d_j)$ is the number of out-links of $d_j$, $n$ is the number of pages, and $\alpha$ is the damping factor
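A minimal sketch of the PageRank iteration above by power iteration on a toy link graph; the graph and the damping factor alpha = 0.85 are illustrative assumptions.

```python
# Sketch: power iteration for the PageRank equation above.
import numpy as np

links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}   # page -> pages it links to (toy graph)
n, alpha = 4, 0.85
p = np.full(n, 1.0 / n)
for _ in range(50):
    new_p = np.full(n, (1 - alpha) / n)        # teleportation term (1 - alpha) / n
    for i, outs in links.items():
        for j in outs:
            new_p[j] += alpha * p[i] / len(outs)   # contribution P(d_i) / L(d_i)
    p = new_p
print(p, p.sum())
```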
New Approach = Learning to Rank
[Framework] Training data: queries $q_1, \ldots, q_m$; for each query $q_i$, associated documents $d_{i,1}, d_{i,2}, \ldots, d_{i,n_i}$ with relevance labels $y_{i,1}, y_{i,2}, \ldots, y_{i,n_i}$; each query-document pair is represented as a feature vector with its label, $(x_{i,j}, y_{i,j})$.
1. Data Labeling (rank)  2. Feature Extraction  3. Learning of $f(x)$
The Learning System learns the ranking model $f(q, d)$ from the labeled data; the Ranking System uses it to rank the documents $D = \{d_1, d_2, \ldots, d_N\}$ of a new query $q_{m+1}$ by $f(q_{m+1}, d)$.
Training Process
1. Data Labeling(rank)
2. Feature Extraction
Testing Process
3. Ranking with $f(x)$: the documents $d_{m+1,1}, \ldots, d_{m+1,n_{m+1}}$ of a new query $q_{m+1}$ are represented as feature vectors $x_{m+1,1}, \ldots, x_{m+1,n_{m+1}}$ and sorted by $f(x_{m+1,j})$
4. Evaluation: the predicted ranking is compared with the labels $y_{m+1,1}, \ldots, y_{m+1,n_{m+1}}$ to produce the evaluation result
Notes
• Features are functions of query and document
• Query and associated documents form a group
• Groups are i.i.d. data
• Feature vectors within group are not i.i.d. data
• Ranking model is function of features
• Several data labeling methods (here labeling of grade)
Issues in Learning to Rank
• Data Labeling
• Feature Extraction
• Evaluation Measure
• Learning Method (Model, Loss Function, Algorithm)
Data Labeling Problem
• E.g., relevance of documents w.r.t. query
Doc A
Doc B
Doc C
Query
Data Labeling Methods
• Labeling of Grades
– Multiple levels (e.g., relevant, partially relevant, irrelevant)
– Widely used in IR
• Labeling of Ordered Pairs
– Ordered pairs between documents (e.g. A>B, B>C)
– Implicit relevance judgment: derived from click-through data
• Creation of List
– List (or permutation) of documents is given
– Ideal but difficult to implement
Implicit Relevance Judgment
Doc A
Doc B
Doc C
ranking of documents at search system
users often clicked on Doc B
ordered pair: B > A
Feature Extraction
[Figure: for the query, each of Doc A, Doc B, and Doc C is represented by a feature vector containing query-document features (e.g., BM25) and document features (e.g., PageRank)]
Example Features
Evaluation Measures
• Important to rank top results correctly
• Measures
– NDCG (Normalized Discounted Cumulative Gain)
– MAP (Mean Average Precision)
– MRR (Mean Reciprocal Rank)
– WTA (Winners Take All)
– Kendall’s Tau
NDCG
• Evaluating ranking using labeled grades
• NDCG at position j
$\mathrm{NDCG}(j) = n_j^{-1} \sum_{i=1}^{j} \frac{2^{r(i)} - 1}{\log(1 + i)}$
where $r(i)$ is the grade of the document at position $i$ and $n_j$ is the normalizing factor (the DCG of the perfect ranking at position $j$)
NDCG (cont’)
• Example: perfect ranking
– (3, 3, 2, 2, 1, 1, 1) grade r=3,2,1
– (7, 7, 3, 3, 1, 1, 1) gain
– (1, 0.63, 0.5, 0.43, 0.39, 0.36, 0.33) position discount
– (7, 11.41, 12.91, …) DCG
– (1/7, 1/11.41, 1/12.91, …) normalizing factor
– (1, 1,1,1,1,1,1) NDCG for perfect ranking
gain: $2^{r(j)} - 1$; position discount: $1 / \log(1 + j)$; DCG at position $j$: $\sum_{i=1}^{j} \frac{2^{r(i)} - 1}{\log(1 + i)}$; normalizing factor: $n_j^{-1}$
NDCG (cont’)
• Example: imperfect ranking
– (2, 3, 2, 3, 1, 1, 1)
– (3, 7, 3, 7, 1, 1, 1) Gain
– (1, 0.63, 0.5, 0.43, 0.39, 0.36, 0.33) Position discount
– (3, 7.41, 8.91, … ) DCG
– (1/7, 1/11.41, 1/12.91, …) normalizing factor
– (0.43, 0.65, 0.69, ….) NDCG
• Imperfect ranking decreases NDCG
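A minimal sketch that reproduces the NDCG numbers of the worked examples above (gain $2^r - 1$, log base 2 discount); it assumes the grades are given as a plain list.

```python
# Sketch: NDCG at position j, as defined above.
import math

def dcg(grades, j):
    return sum((2 ** r - 1) / math.log2(1 + i) for i, r in enumerate(grades[:j], start=1))

def ndcg(grades, j):
    ideal = sorted(grades, reverse=True)   # perfect ranking of the same grades
    return dcg(grades, j) / dcg(ideal, j)

imperfect = [2, 3, 2, 3, 1, 1, 1]
print([round(ndcg(imperfect, j), 2) for j in range(1, 4)])   # ~[0.43, 0.65, 0.69]
```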
Relations with Other Learning Tasks
• No need to predict category (vs. classification)
• No need to predict value of $f(q, d)$ (vs. regression)
• Relative ranking order is more important (vs. ordinal regression)
• Learning to rank can be approximated by classification, regression, ordinal regression
Ordinal Regression (Ordinal Classification)
• Categories are ordered
– 5, 4, 3, 2, 1
– e.g., rating restaurants
• Prediction
– Map to ordered categories
Three Major Approaches
• Pointwise approach
• Pairwise approach
• Listwise approach
• SVM based
• Boosting based
• Neural Network based
• Others
Categorization of Learning to Rank Methods
Pointwise Approach
• Transforming ranking to regression, classification, or ordinal classification
• Query-document group structure is ignored
Pairwise Approach
• Transforming ranking to pairwise classification
• Query-document group structure is ignored
Listwise Approach
• List as instance
• Query-document group structure is used
• Straightforwardly represents learning to rank problem
Evaluation Results
• Pairwise approach and listwise approach perform better than pointwise approach
• LambdaMART performs best in the Yahoo! Learning to Rank Challenge
• No significant difference among pairwise and listwise methods
1.3. Methods of Learning to Rank
Ranking SVM
Herbrich et al., 1999
Pairwise Classification
• Converting document list to document pairs
[Figure: a query with Doc A, Doc B, and Doc C labeled with grades 3, 2, and 1 is converted into the ordered document pairs implied by the grades, e.g., (A, B), (A, C), (B, C)]
Transforming Ranking to Pairwise Classification
• Input space: $X$
• Ranking function: $f: X \rightarrow \mathbb{R}$
• Ranking: $x_i \succ x_j \Leftrightarrow f(x_i; w) > f(x_j; w)$
• Linear ranking function: $f(x; w) = \langle w, x \rangle$
• Transforming to pairwise classification: $x_i \succ x_j \Leftrightarrow \langle w, x_i - x_j \rangle > 0$; each pair becomes an example $(x_i - x_j, z)$ with $z = +1$ if $x_i \succ x_j$ and $z = -1$ if $x_j \succ x_i$
Ranking Problem
[Figure: feature vectors $x_1$, $x_2$, $x_3$ with grades 3, 2, 1 are ranked by their projections onto the weight vector $w$]
Transformed Pairwise Classification Problem
[Figure: pairwise classification with $f(x; w)$; positive examples ($+1$): $x_1 - x_3$, $x_2 - x_3$, $x_1 - x_2$; negative examples ($-1$): $x_3 - x_1$, $x_3 - x_2$, $x_2 - x_1$; the negative examples are redundant]
Ranking SVM
• Pairwise classification on differences of feature vectors
• Corresponding positive and negative examples
• Negative examples are redundant and can be discarded
• Hyperplane passes through the origin
• Soft margin and kernel can be used
• Ranking SVM = pairwise classification SVM
Learning of Ranking SVM
$\min_{w} \sum_{i=1}^{l} \left[ 1 - y_i \langle w, x_i^{(1)} - x_i^{(2)} \rangle \right]_{+} + \lambda \|w\|^2$
where $[s]_{+} = \max(0, s)$; this is equivalent to the quadratic program
$\min_{w, \xi} \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i$
s.t. $y_i \langle w, x_i^{(1)} - x_i^{(2)} \rangle \geq 1 - \xi_i$, $\xi_i \geq 0$, $i = 1, \ldots, N$
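A minimal sketch of the Ranking SVM objective above, trained by stochastic subgradient descent on pair differences $x_i^{(1)} - x_i^{(2)}$; the synthetic data and the hyperparameters are illustrative assumptions, not the original setup.

```python
# Sketch: pairwise hinge loss + L2 regularization (Ranking SVM style) via SGD.
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(200, 3))
scores = X @ w_true

# build pairs (x1, x2) where x1 should rank above x2 (label y = +1)
pairs = []
for _ in range(500):
    i, j = rng.integers(0, len(X), size=2)
    if scores[i] != scores[j]:
        hi, lo = (i, j) if scores[i] > scores[j] else (j, i)
        pairs.append((X[hi], X[lo]))

w, lam, lr = np.zeros(3), 0.01, 0.1
for epoch in range(20):
    for x1, x2 in pairs:
        margin = w @ (x1 - x2)
        # subgradient of [1 - <w, x1 - x2>]_+ + lam * ||w||^2
        grad = 2 * lam * w - ((x1 - x2) if margin < 1 else 0.0)
        w -= lr * grad
print(w / np.linalg.norm(w))   # direction should approximate w_true / ||w_true||
```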
IR SVM
Cao et al., 2006
Cost-sensitive Pairwise Classification
• Converting to document pairs
[Figure: as in Ranking SVM, a query with Doc A, Doc B, and Doc C labeled with grades 3, 2, and 1 is converted into ordered document pairs; pairs involving the top grades are critical, while the others are not critical]
Problems with Ranking SVM
• Not sufficient emphasis on correct ranking at the top
  grades: 3, 2, 1
  ranking 1: 2 3 2 1 1 1 1
  ranking 2: 3 2 1 2 1 1 1
  ranking 2 should be better than ranking 1, but Ranking SVM views them as the same
• Numbers of pairs vary according to queries
  q1: 3 2 2 1 1 1 1
  q2: 3 3 2 2 2 1 1 1 1 1
  number of pairs for q1: 2 (grade 3-2) + 4 (grade 3-1) + 8 (grade 2-1) = 14
  number of pairs for q2: 6 (grade 3-2) + 10 (grade 3-1) + 15 (grade 2-1) = 31
  Ranking SVM is biased toward q2
IR SVM
• Solving the two problems of Ranking SVM
• Higher weight $\tau_{k(i)}$ on important grade pairs
• Normalization weight $\mu_{q(i)}$ on pairs within a query
• IR SVM = Ranking SVM using modified hinge loss
Modified Hinge Loss function
$\min_{w} \sum_{i=1}^{l} \tau_{k(i)}\, \mu_{q(i)} \left[ 1 - y_i \langle w, x_i^{(1)} - x_i^{(2)} \rangle \right]_{+} + \lambda \|w\|^2$
[Figure: the modified hinge loss as a function of $y\, f(x^{(1)} - x^{(2)})$, with different slopes for different grade-pair weights]
Learning of IR SVM
$\min_{w} \sum_{i=1}^{l} \tau_{k(i)}\, \mu_{q(i)} \left[ 1 - y_i \langle w, x_i^{(1)} - x_i^{(2)} \rangle \right]_{+} + \lambda \|w\|^2$
equivalent to the quadratic program
$\min_{w, \xi} \frac{1}{2}\|w\|^2 + \sum_{i=1}^{l} C_i \xi_i$, with $C_i = \tau_{k(i)}\, \mu_{q(i)}\, C$
s.t. $y_i \langle w, x_i^{(1)} - x_i^{(2)} \rangle \geq 1 - \xi_i$, $\xi_i \geq 0$, $i = 1, \ldots, l$
AdaRank
Xu and Li, 2007
Listwise Loss
Training data: for each query $q_i$, a list of feature vectors with labels $(x_{i,1}, y_{i,1}), (x_{i,2}, y_{i,2}), \ldots, (x_{i,n_i}, y_{i,n_i})$, for $i = 1, \ldots, m$; the loss is defined on the whole list of each query
AdaRank
• Optimizing exponential loss function
• Algorithm: AdaBoost-like algorithm for ranking
Loss Function of AdaRank
Any evaluation measure taking values between [-1, +1] can be used in the loss
AdaRank Algorithm
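The algorithm slide itself did not survive extraction, so the following is only a rough AdaBoost-style sketch in the spirit of AdaRank: weak rankers are single features, the measure E is a toy per-query score in [-1, +1], and all data and settings are illustrative assumptions.

```python
# Sketch: AdaBoost-like boosting over queries with an evaluation measure E.
import numpy as np

rng = np.random.default_rng(1)
n_queries, n_docs, n_feats = 20, 5, 4
X = rng.normal(size=(n_queries, n_docs, n_feats))               # feature vectors
y = rng.integers(0, 2, size=(n_queries, n_docs)).astype(float)  # relevance labels

def measure(scores, labels):
    # toy evaluation in [-1, +1]: +1 if the top-scored doc is relevant, else -1
    return 1.0 if labels[np.argmax(scores)] > 0 else -1.0

P = np.full(n_queries, 1.0 / n_queries)   # weight distribution over queries
alphas, chosen = [], []
for t in range(10):
    # pick the feature (weak ranker) with the best weighted performance
    perf = [sum(P[i] * measure(X[i, :, k], y[i]) for i in range(n_queries))
            for k in range(n_feats)]
    k = int(np.argmax(perf))
    e = np.array([measure(X[i, :, k], y[i]) for i in range(n_queries)])
    alpha = 0.5 * np.log((P @ (1 + e) + 1e-12) / (P @ (1 - e) + 1e-12))
    alphas.append(alpha); chosen.append(k)
    # re-weight queries by the current ensemble's performance
    f = np.array([measure(sum(a * X[i, :, j] for a, j in zip(alphas, chosen)), y[i])
                  for i in range(n_queries)])
    P = np.exp(-f); P /= P.sum()
print(list(zip(chosen, np.round(alphas, 2))))
```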
Part 2. Learning to Match
2-1. Semantic Matching in Search
Query Document Mismatch is the Biggest Challenge in Search
Query Document Mismatch
• Same intent can be represented by different queries (representations)
• Search is still mainly based on term level matching
• Query document mismatch occurs when searcher and author use different representations
Examples of Query Document Mismatch
Query | Document | Term Matching | Semantic Matching
seattle best hotel | seattle best hotels | no | yes
pool schedule | swimming pool schedule | no | yes
natural logarithm transformation | logarithm transformation | partial | yes
china kong | china hong kong | partial | no
why are windows so expensive | why are macs so expensive | partial | no
Matching at Different Levels
Semantic Matching
Form Phrase Sense Topic Structure
Term Matching
Structure Identification
Topic Identification
Similar Query Finding
Phrase Identification
Spelling Error Correction
Sense
michael jordan berkele
query form: michael jordan berkeley
phrase: michael jordan
similar query: michael i. jordan
main phrase: michael jordan
Phrase
Term
Structure
topic: machine learning, berkeley
Topic
phrase: berkeley
Query Understanding
Title Structure Identification
Topic Identification
Key Phrase Identification
Phrase IdentificationTerm
Topic
Homepage of Michael Jordan
Michael Jordan is Professor in the
Department of Electrical Engineering
……
key phrase: michael jordan, professor,
electrical engineeringKey Phrase
topic: machine learning, berkeley
main phrase in title: michael jordan
phrase: michael jordan, professor,
department, electrical engineering
Phrase
Structure
Document Understanding
Query Representation
Document Representation
Query-Document Matching
Relevance Ranking
Query form: michael jordan berkeley
Similar query: michael i jordan
Main phrase: michael jordan
Phrase: michael jordan, berkeley
Topic : machine learning
Document: michael jordan homepage
Main phrase in title: michael jordan
Key phrase: michael jordan, berkeley
Phrase: michael jordan, professor,
department of electrical engineering
Topic : machine learning, berkeley
Semantic Matching
Matching in Different Ways
74
q d
q’
d’
c
Query Reformulation
Document transformation
Query and document transformation
No transformation
Web Search System
[Figure: web search system with components Crawling, Document Understanding, Indexing, Index, Retrieving, Query Understanding, Query-Document Matching, Ranking, and User Interface, connecting the Web and the User]
Machine Learning for Query Document Matching in Web Search
Learning for Matching between Query and Document
• Learning matching function $f_M(q, d)$ or $p_M(r \mid q, d)$
• Using training data $(q_1, d_1, r_1), \ldots, (q_N, d_N, r_N)$
• $q_1, \ldots, q_N$ and $d_1, \ldots, d_N$ can be ids or feature vectors
• $r_1, \ldots, r_N$ can be binary or numerical values
• Using relations in data and/or prior knowledge
Long Tail Challenge
• Head pages have rich anchor texts and click data
• Tail queries and pages suffer more from mismatch
• Problem of propagating information and knowledge from head to tail
Relation between Matching and Ranking
• In traditional IR:
– Ranking = matching, e.g., $f(q, d) = f_{BM25}(q, d)$ or $f(q, d) = f_{LMIR}(q, d) = P(q \mid d)$
• In web search:
– Ranking and matching become separated: ranking uses multiple features such as $f_{BM25}(q, d)$, $g_{PageRank}(d)$, ...
– Learning to rank becomes state of the art
– Matching = feature learning for ranking
Matching vs Ranking
           | Matching                                    | Ranking
Prediction | Matching degree between query and document | Ranking list of documents
Model      | f(q, d)                                     | f(q, d1), f(q, d2), ..., f(q, dn)
Challenge  | Mismatch                                    | Correct ranking on top
In search, first matching and then ranking
Matching Functions as Features in Learning to Rank
• Term level matching: $f_{BM25}(q, d)$
• Phrase level matching: $f_{P}(q, d)$
• Sense level matching: $f_{S}(q, d)$
• Topic level matching: $f_{T}(q, d)$
• Structure level matching: $f_{C}(q, d)$
• Term level matching (spelling, stemming): $f_{BM25}(q', d)$, where $q$ is rewritten to $q'$
2-2. Overview of Learning to Match
Matching between Heterogeneous Data is Everywhere
• Matching between user and product (collaborative filtering)
• Matching between text and image (image annotation)
• Matching between people (dating)
• Matching between languages (machine translation)
• Matching between receptor and ligand (drug design)
Formulation of Learning Problem
• Learning matching function $f(x, y)$
• Using training data $(x_1, y_1, r_1), \ldots, (x_N, y_N, r_N)$
• Generated according to $x \sim P(X)$, $y \sim P(Y \mid X)$, $r \sim P(R \mid X, Y)$
Formulation of Learning Problem
• Loss function: $L(r, f(x, y))$
• Risk function: $R(f) = \int_{\mathcal{X} \times \mathcal{Y} \times \mathcal{R}} L(r, f(x, y)) \, P(x, y, r)\, d(x, y, r)$
• Objective function in learning: $\min_{f \in \mathcal{F}} \sum_{i=1}^{N} L(r_i, f(x_i, y_i)) + \Omega(f)$
Matching Problem: Instance Matching
Instances
[Figure: a bipartite graph between instances $x_1, x_2, \ldots, x_m$ and $y_1, y_2, \ldots, y_n$ with observed matching values on the edges]
Can be represented as matching between nodes in bipartite graph
Matching Problem: Feature Matching
Features
[Figure: objects $x_1, \ldots, x_m$ and $y_1, \ldots, y_n$ represented by features in two different spaces, with observed matching values between them]
Can be represented as matching between objects in two spaces
Matching Problem: Structure Matching
Structures
[Figure: structured objects $x_1, \ldots, x_m$ and $y_1, \ldots, y_n$ with observed matching values between them]
2-3. Methods of Learning to Match
Regularized Latent Semantic Indexing
Wang et al. 2011
Regularized Latent Semantic Indexing
• Motivation
– Matching between query and document at topic level
– Scale up to large datasets (vs. existing methods)
• Approach
– Matrix factorization
– Regularization on topics and documents (vs. sparse coding)
– Learning problem can be easily decomposed
• Results
– l1 on topics leads to sparse topics and l2 on documents leads to accurate matching
– Comparable with existing methods in topic discovery and search relevance
– But can easily scale up to large document sets
Query and Document Matching in Topic Space
[Figure: the query $q$ and documents $d_1, d_2, \ldots, d_n$ are mapped from the document (term) space into the topic space, where matching is performed]
Regularized Latent Semantic Indexing
[Figure: the term-document matrix $D$ is factorized as $D \approx U V$; column $d_n$ of $D$ is the term representation of document $n$, $U$ is the term-topic matrix whose topics are sparse, and column $v_n$ of the topic-document matrix $V$ is the topic representation of document $n$; the document representations are smooth]
Optimization Strategy
Coordinate Descent
Analytic Solution
RLSI Algorithm
terms processed in parallel
docs processed in parallel
• Single machine multi-core version
• Multiple machine version (MapReduce and MPI)
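A minimal sketch of RLSI-style alternating optimization (not the paper's exact updates): the l2-regularized topic-document matrix V has an analytic ridge solution, and the l1-regularized term-topic matrix U is updated by coordinate descent with soft-thresholding; matrix sizes and regularization weights are illustrative assumptions.

```python
# Sketch: min ||D - UV||_F^2 + lam1*|U|_1 + lam2*||V||_F^2 by alternating updates.
import numpy as np

rng = np.random.default_rng(0)
n_terms, n_docs, k = 50, 30, 5
D = rng.random((n_terms, n_docs))          # term-document matrix
U = rng.random((n_terms, k))               # term-topic matrix (to become sparse)
lam1, lam2 = 0.1, 0.1

def soft_threshold(a, t):
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

for it in range(20):
    # V-step: ridge regression, analytic solution (documents could be solved in parallel)
    V = np.linalg.solve(U.T @ U + lam2 * np.eye(k), U.T @ D)
    # U-step: coordinate descent over topics with soft-thresholding (lasso-style)
    for j in range(k):
        R = D - U @ V + np.outer(U[:, j], V[j])   # residual excluding topic j
        denom = V[j] @ V[j] + 1e-12
        U[:, j] = soft_threshold(R @ V[j], lam1 / 2) / denom

print("sparsity of U:", np.mean(U == 0.0), "reconstruction error:", np.linalg.norm(D - U @ V))
```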
Regularized Matching in Latent Space (RMLS)
Wu et al., 2013
Matching in Latent Space
• Motivation
– Matching between query and document in latent space
• Assumption
– Queries have similarity
– Documents have similarity
– Click-through data represent “similarity” relations between queries and documents
• Approach
– Projection to latent space
– Regularization or constraints
• Results
– Significantly enhance accuracy of query document matching
Matching in Latent Space
[Figure: queries $q_1, q_2, \ldots, q_m$ in the query space and documents $d_1, d_2, \ldots, d_n$ in the document space are projected by mappings $L_q$ and $L_d$ into a common latent space, where they are matched]
• Matching between Heterogeneous Data
• Example: Image Annotation
hook
fishing
singer
soldier
warrior
microphone
Projecting Keywords and Images into Latent Space
Partial Least Square (PLS)
• Setting
– Two spaces: $q \in \mathcal{Q}$, $d \in \mathcal{D}$
• Input
– Training data: $\{(q_i, d_i, c_i)\}_{i=1}^{N}$
• Output
– Matching function $f(q, d)$
• Assumption
– Two linear (and orthonormal) transformations $L_q$, $L_d$
– Dot product as similarity function: $f(q, d) = \langle L_q q, L_d d \rangle$
• Optimization
$(L_q, L_d) = \arg\max_{L_q, L_d} \sum_i c_i \langle L_q q_i, L_d d_i \rangle$ s.t. $L_q^{T} L_q = I$, $L_d^{T} L_d = I$
Solution of Partial Least Square
• Non-convex optimization
• Global optimal solution exists
• Global optimum can be found by solving SVD (Singular Value Decomposition)
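A minimal sketch of the SVD solution: the orthonormal maps $L_q$ and $L_d$ are read off the top singular vectors of the weighted cross-covariance matrix $\sum_i c_i q_i d_i^T$; the toy data and the latent dimension are assumptions.

```python
# Sketch: solving the PLS objective above by SVD.
import numpy as np

rng = np.random.default_rng(0)
n, dim_q, dim_d, k = 100, 20, 30, 5
Q = rng.normal(size=(n, dim_q))            # query vectors q_i
D = rng.normal(size=(n, dim_d))            # document vectors d_i
c = rng.random(n)                          # click-through weights c_i

A = (Q * c[:, None]).T @ D                 # sum_i c_i q_i d_i^T
U, s, Vt = np.linalg.svd(A)
L_q = U[:, :k].T                           # maps query space -> latent space
L_d = Vt[:k]                               # maps document space -> latent space

def match(q, d):
    return (L_q @ q) @ (L_d @ d)           # f(q, d) = <L_q q, L_d d>

print(round(match(Q[0], D[0]), 3))
```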
Regularized Mapping to Latent Space (RMLS)
• Setting
– Two spaces: $q \in \mathcal{Q}$, $d \in \mathcal{D}$
• Input
– Training data: $\{(q_i, d_i, c_i)\}_{i=1}^{N}$
• Output
– Matching function $f(q, d)$
• Assumption
– Two linear (and sparse) transformations $L_q$, $L_d$
– Dot product as similarity function: $f(q, d) = \langle L_q q, L_d d \rangle$
• Optimization
$(L_q, L_d) = \arg\max_{L_q, L_d} \sum_i c_i \langle L_q q_i, L_d d_i \rangle$ s.t. $|l_q| \leq \lambda_q$, $|l_d| \leq \lambda_d$, $\|l_q\| \leq \theta_q$, $\|l_d\| \leq \theta_d$, where $l_q$ and $l_d$ are rows of $L_q$ and $L_d$
Solution of Regularized Mapping to Latent Space (RMLS)
• Coordinate Descent
• Repeat
– Fix $L_q$, update $L_d$
– Fix $L_d$, update $L_q$
• No guarantee to find global optimum
• Updates can be parallelized by rows
Comparison between PLS and RMLS
             | PLS                          | RMLS
Assumption   | Orthogonal                   | L1 and L2 regularization
Optimization | Singular Value Decomposition | Coordinate Descent
Optimality   | Global optimum               | Local optimum
Efficiency   | Low                          | High
Scalability  | Low                          | High
Part 3. Deep Learning for Information Retrieval
3-1: Overview of Deep Learning for Information Retrieval
Hard Problems in IR
Q: How tall is Yao Ming?
Q: A dog catching a ball
Name Height Weight
Yao Ming 2.29m 134kg
Liu Xiang 1.89m 85kg
Key Questions: How to Represent Intent and Content, How to Match Intent and Content
Image Retrieval
Question Answering from Knowledge Base
Q:How far is sun from earth?
The average distance between the Sun and the Earth is about 92,935,700 miles.
Generation-based Question Answering
A: It is about 93 million miles
(No tag on images)
Deep Learning and IR
Intent ContentMatching
Deep Learning
Recent Progress: Deep Learning Is Particularly Effective for Hard IR Problems
3-2: Basics of Deep Learning
Word Embedding
Word Embedding
• Motivation: representing words with low-dimensional real-valued vectors, utilizing them as input to deep learning methods, vs one-hot vectors
• Method: SGNS (Skip-Gram with Negative Sampling)
• Tool: Word2Vec
• Input: words and their contexts in documents
• Output: embeddings of words
• Assumption: similar words occur in similar contexts
• Interpretation: factorization of mutual information matrix
• Advantage: compact representations (usually around 100 dimensions)
Skip-Gram with Negative Sampling(Mikolov et al., 2013)
• Input: occurrences between words and contexts
• Probability model:
[Example: co-occurrence count matrix $M$ between words $w_1, w_2, w_3$ and contexts $c_1, \ldots, c_5$]
$P(D = 1 \mid w, c) = \sigma(\vec{w} \cdot \vec{c}) = \frac{1}{1 + e^{-\vec{w} \cdot \vec{c}}}$
$P(D = 0 \mid w, c) = 1 - \sigma(\vec{w} \cdot \vec{c}) = \frac{1}{1 + e^{\vec{w} \cdot \vec{c}}}$
Skip-Gram with Negative Sampling
• Word vector and context vector: lower-dimensional vectors (the dimensionality is a parameter)
• Goal: learning of the probability model from data
• Take co-occurrence data as positive examples
• Negative sampling: randomly sample k unobserved pairs
as negative examples
• Objective function in learning
• Algorithm: stochastic gradient descent
$L = \sum_{w} \sum_{c} \#(w, c) \left( \log \sigma(\vec{w} \cdot \vec{c}) + k \, \mathbb{E}_{c_N \sim P_N} \left[ \log \sigma(-\vec{w} \cdot \vec{c}_N) \right] \right)$
where $\#(w, c)$ is the number of co-occurrences of $w$ and $c$, and $(w, c_N)$ are the negative samples
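A minimal sketch of skip-gram with negative sampling trained by SGD on a toy corpus; the window size, dimensionality, number of negatives, and the uniform negative-sampling distribution are simplifying assumptions.

```python
# Sketch: SGNS with logistic loss on observed pairs plus k random negatives.
import numpy as np

corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, dim, window, k, lr = len(vocab), 16, 2, 3, 0.05

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(V, dim))   # word vectors
C = rng.normal(scale=0.1, size=(V, dim))   # context vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for epoch in range(200):
    for pos, word in enumerate(corpus):
        w = idx[word]
        for off in range(-window, window + 1):
            if off == 0 or not (0 <= pos + off < len(corpus)):
                continue
            c = idx[corpus[pos + off]]
            wv, cv = W[w].copy(), C[c].copy()
            g = sigmoid(wv @ cv) - 1.0          # positive pair: push sigma(w.c) to 1
            W[w] -= lr * g * cv
            C[c] -= lr * g * wv
            for c_neg in rng.integers(0, V, size=k):
                wv, nv = W[w].copy(), C[c_neg].copy()
                g = sigmoid(wv @ nv)            # negative pair: push sigma(w.c_N) to 0
                W[w] -= lr * g * nv
                C[c_neg] -= lr * g * wv

sims = W @ W[idx["cat"]]
print([vocab[i] for i in np.argsort(-sims)[:3]])   # nearest words to "cat"
```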
Interpretation as Matrix Factorization(Levy & Goldberg 2014)
• Pointwise Mutual Information Matrix
$\mathrm{PMI}(w, c) = \log \frac{P(w, c)}{P(w)\,P(c)}$
[Example: PMI matrix $M$ between words $w_1, w_2, w_3$ and contexts $c_1, \ldots, c_5$]
Interpretation as Matrix Factorization
$M = W C^{T}$
[Figure: the (shifted) PMI matrix $M$ is factorized into the word embedding matrix $W$ and the context embedding matrix $C$; this matrix factorization is equivalent to SGNS]
Recurrent Neural Network
Recurrent Neural Network
• Motivation: representing sequence of words and utilizing the representation in deep learning methods
• Input: sequence of word embeddings, denoting sequence of words (e.g., sentence)
• Output: sequence of internal representations (hidden states)
• Variants: LSTM and GRU, to deal with long distance dependency
• Learning of model: stochastic gradient descent
• Advantage: handling arbitrarily long sequence; can be used as part of deep model for sequence processing (e.g., language modeling)
Recurrent Neural Network (RNN)(Mikolov et al. 2010)
Example: "the cat sat on the mat"
$h_t = f(h_{t-1}, x_t)$
where $x_t$ is the input word embedding at position $t$ and $h_{t-1}$ is the previous hidden state
Recurrent Neural Network
$h_t = f(h_{t-1}, x_t) = \tanh(W_{hh} h_{t-1} + W_{hx} x_t + b_h)$
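A minimal sketch of the vanilla RNN recurrence above applied to a toy sentence; the embeddings and weights are randomly initialized, and the sizes are assumptions.

```python
# Sketch: forward pass of the recurrence h_t = tanh(W_hh h_{t-1} + W_hx x_t + b_h).
import numpy as np

rng = np.random.default_rng(0)
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
dim_x, dim_h = 8, 16
E = rng.normal(scale=0.1, size=(len(vocab), dim_x))      # word embeddings
W_hh = rng.normal(scale=0.1, size=(dim_h, dim_h))
W_hx = rng.normal(scale=0.1, size=(dim_h, dim_x))
b_h = np.zeros(dim_h)

def rnn_forward(tokens):
    h = np.zeros(dim_h)
    states = []
    for tok in tokens:
        x = E[vocab[tok]]
        h = np.tanh(W_hh @ h + W_hx @ x + b_h)            # the recurrence above
        states.append(h)
    return states

states = rnn_forward("the cat sat on the mat".split())
print(len(states), states[-1][:4])
```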
Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997)
$i_t = \sigma(W_{ih} h_{t-1} + W_{ix} x_t + b_i)$
$f_t = \sigma(W_{fh} h_{t-1} + W_{fx} x_t + b_f)$
$o_t = \sigma(W_{oh} h_{t-1} + W_{ox} x_t + b_o)$
$g_t = \tanh(W_{gh} h_{t-1} + W_{gx} x_t + b_g)$
$c_t = i_t \odot g_t + f_t \odot c_{t-1}$
$h_t = o_t \odot \tanh(c_t)$
• A memory (vector) $c_t$ to store values of the previous state
• Input gate $i_t$, output gate $o_t$, and forget gate $f_t$ to control the memory
• Gate: element-wise product with a vector of values in [0, 1]
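A minimal sketch of a single LSTM step implementing the gate equations above; the weights are random and all sizes are illustrative assumptions.

```python
# Sketch: one LSTM step (input, forget, output gates, memory, hidden state).
import numpy as np

rng = np.random.default_rng(0)
dim_x, dim_h = 8, 16
def init(shape): return rng.normal(scale=0.1, size=shape)
W = {g: (init((dim_h, dim_h)), init((dim_h, dim_x)), np.zeros(dim_h))
     for g in ("i", "f", "o", "g")}

def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))

def lstm_step(h_prev, c_prev, x):
    def gate(name, act):
        Wh, Wx, b = W[name]
        return act(Wh @ h_prev + Wx @ x + b)
    i = gate("i", sigmoid)            # input gate
    f = gate("f", sigmoid)            # forget gate
    o = gate("o", sigmoid)            # output gate
    g = gate("g", np.tanh)            # candidate memory
    c = i * g + f * c_prev            # memory update c_t = i*g + f*c_{t-1}
    h = o * np.tanh(c)                # hidden state h_t = o * tanh(c_t)
    return h, c

h, c = np.zeros(dim_h), np.zeros(dim_h)
for x in rng.normal(size=(5, dim_x)):   # a toy sequence of 5 input vectors
    h, c = lstm_step(h, c, x)
print(h[:4])
```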
Gated Recurrent Unit (GRU)(Cho et al., 2014)
$r_t = \sigma(W_{rh} h_{t-1} + W_{rx} x_t + b_r)$
$z_t = \sigma(W_{zh} h_{t-1} + W_{zx} x_t + b_z)$
$g_t = \tanh(W_{gh} (r_t \odot h_{t-1}) + W_{gx} x_t + b_g)$
$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot g_t$
• A memory (vector) to store values of the previous state
• Reset gate $r_t$ and update gate $z_t$ to control the memory
Recurrent Neural Network Language Model
$h_t = \tanh(W_{hh} h_{t-1} + W_{hx} x_t + b_h)$
$p_t = P(x_t \mid x_1, \ldots, x_{t-1}) = \mathrm{softmax}(W h_{t-1} + b)$
Objective of learning: $\frac{1}{T} \sum_{t=1}^{T} \log \hat{p}_t$
• Input one sequence and output another
• In training, input sequence is same as output sequence
Convolutional Neural Network
Convolutional Neural Network
• Motivation: representing sequence of words and utilizing the representation in deep learning methods
• Input: sequence of word embeddings, denoting sequence of words (e.g., sentence)
• Output: representation of input sequence
• Learning of model: stochastic gradient descent
• Advantage: robust extraction of n-gram features; can be used as part of deep model for sequence processing (e.g., sentence classification)
Convolutional Neural Network (CNN) (Kim 2014, Blunsom et al. 2014, Hu et al., 2014)
Example: "the cat sat on the mat"
• Convolution: filters slide over windows of consecutive words ("the cat", "the cat sat", "cat sat", "cat sat on", "sat on", "sat on the", "on the", "on the mat", "the mat", ...), producing feature maps
• Max pooling: selects the strongest features from the feature maps
• Concatenation: the pooled features are concatenated into the representation of the sentence
• Shared parameters on same level
• Fixed length, zero padding
Example: Image Convolution
[Figure: a binary image scanned by a 3x3 filter; dark pixel value = 1, light pixel value = 0; dot in filter = 1, others = 0]
(Figure credit: Leow Wee Kheng)
Example: Image Convolution
0 0 0 0 0
0 0 1 1 0
0 1 3 2 0
0 1 3 1 0
0 1 1 0 0
Feature Map
• Scanning image with filter having 3*3 cells, among them 3 are dot cells
• Counting number of dark pixels overlapping with dot cells at each position
• Creating feature map (matrix); each element represents similarity between filter pattern and pixel pattern at one position
• Equivalent to extracting feature using the filter
• Translation-invariant
Convolution Operation
Convolution: filter, feature map, neuron
$z_i^{(l,f)} = \sigma\left( w^{(l,f)} \cdot z_i^{(l-1)} + b^{(l,f)} \right), \quad f = 1, 2, \ldots, F_l$
where $z_i^{(l,f)}$ is the output of the neuron of type $f$ for location $i$ in layer $l$; $w^{(l,f)}$ and $b^{(l,f)}$ are the parameters of neurons of type $f$ in layer $l$; $\sigma$ is the sigmoid function; $z_i^{(l-1)}$ is the input to the neuron at location $i$ from layer $l-1$; and $z_i^{(0)} = [x_i^{T}, x_{i+1}^{T}, \ldots, x_{i+h-1}^{T}]^{T}$ is the concatenation of word vectors at location $i$
Convolution
Equivalent to n-gram feature extraction at each position
Max Pooling
$z_i^{(l,f)} = \max\left( z_{2i-1}^{(l-1,f)},\; z_{2i}^{(l-1,f)} \right)$
where $z_i^{(l,f)}$ is the output of pooling of type $f$ for location $i$ in layer $l$, and $z_{2i-1}^{(l-1,f)}, z_{2i}^{(l-1,f)}$ are its inputs
Max Pooling
Equivalent to n-gram feature selection
Sentence Classification Using Convolutional Neural Network
$z^{(L)} = \mathrm{CNN}(x)$
$y = f(x) = \mathrm{softmax}(W z^{(L)} + b)$
[Figure: word embeddings $x$ are passed through convolution, max pooling, and concatenation to obtain $z^{(L)}$, followed by a softmax layer producing the label $y$]
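A minimal sketch of a one-layer CNN sentence classifier in the spirit of this slide: concatenated word embeddings, a window-3 convolution, max pooling over positions, and a softmax output; the weights are random and all sizes are assumptions.

```python
# Sketch: forward pass of a tiny CNN text classifier, y = softmax(W z + b).
import numpy as np

rng = np.random.default_rng(0)
vocab = {w: i for i, w in enumerate("the cat sat on mat dog ran".split())}
dim, n_filters, win, n_classes = 8, 6, 3, 2
E = rng.normal(scale=0.1, size=(len(vocab), dim))
W_conv = rng.normal(scale=0.1, size=(n_filters, win * dim))   # one row per filter
b_conv = np.zeros(n_filters)
W_out = rng.normal(scale=0.1, size=(n_classes, n_filters))
b_out = np.zeros(n_classes)

def forward(sentence):
    x = np.array([E[vocab[w]] for w in sentence.split()])
    # convolution: apply each filter to every window of `win` consecutive words
    windows = [x[i:i + win].reshape(-1) for i in range(len(x) - win + 1)]
    feats = np.array([np.tanh(W_conv @ w + b_conv) for w in windows])
    z = feats.max(axis=0)                        # max pooling over positions
    logits = W_out @ z + b_out
    return np.exp(logits) / np.exp(logits).sum()  # softmax output

print(forward("the cat sat on the mat"))
```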
3-3. Methods of Deep Learning for Information Retrieval
Retrieval based Question Answering
Hu et al. 2014
Ji et al. 2014
Retrieval-based Question Answering
Q: What is the population of Hong Kong?
A: It is 7.18 million as in 2013.
Q: How many people are there in Hong Kong?
A: There are about 7 million.
Question Answering System
Q: Do you know Hong Kong's population?
A: There are about 7 million.
Learning System
Retrieval based Question Answering System
Index of Questions and
Answers
Matching Ranking
Question
Retrieval
RetrievedQuestions and Answers
RankedAnswers
MatchingModels
RankingModel
Online
Offline
Best Answer
MatchedAnswers
Deep Match CNN- Architecture I
MLP
…… ……
• First represent two sentences as vectors, and then match the vectors
Sentence X Sentence Y
Deep Match CNN - Architecture II
• Represent and match two sentences simultaneously
• Two dimensional model
MLP
Matching Degree
…
2D Convolution
More 2D Convolution & Pooling
Max-Pooling
1D Convolution
Sentence X
Sentence Y
Generation based Question Answering
Shang et al. 2015
Generation-based Question Answering
Q: What is the population of Hong Kong?
A: It is 7.18 million as in 2013.
Q: How many people are there in Hong Kong?
A: There are about 7 million.
Question Answering System
Q: Do you know Hong Kong's population?
A: It is 7 million
Learning System
Neural Responding Machine
• Encoding questions to internal representations
• Decoding internal representations to answers
• Using GRU
Question: $\mathbf{x} = (x_1, x_2, \ldots, x_T)$
Answer: $\mathbf{y} = (y_1, y_2, \ldots, y_{T'})$
[Figure: the Encoder reads the question into hidden states $h$; the Context Generator produces context vectors $c_1, \ldots, c_t, \ldots, c_{T'}$; the Decoder produces states $s_1, \ldots, s_{T'}$ and outputs $y_1, \ldots, y_{T'}$]
Decoder:
$P(y_t \mid y_1, \ldots, y_{t-1}, \mathbf{x}) = g(y_{t-1}, s_t, c_t)$
$s_t = f(y_{t-1}, s_{t-1}, c_t)$
where $y_t$ is a one-hot vector, $s_t$ is the hidden state of the decoder, $c_t$ is the context vector, $g(\cdot)$ is softmax, and $f(\cdot)$ is GRU
Similar to attention mechanism in RNN Encoder-Decoder
Encoder
[Figure: a Local Encoder and a Global Encoder run over the inputs $x_1, \ldots, x_T$, producing hidden states $h_1, \ldots, h_T$; the context vector $c_t$ is computed from them and the previous decoder state $s_{t-1}$]
$c_t = \sum_{j=1}^{T} \alpha_{tj} h_j$, $\quad \alpha_{tj} \propto q(h_j, s_{t-1})$, $\quad h_j = [h_j^{l} : h_j^{g}]$
where $c_t$ is the context vector, $\alpha_{tj}$ is the attention weight, and $h_j$ is the concatenation of local and global hidden states
Combination of global and local encoders
$h_t = f(x_t, h_{t-1})$
where $x_t$ is the word embedding, $h_t$ is the hidden state of the encoder, and $f(\cdot)$ is GRU
Question Answering from Relational Database
Yin et al. 2016
Question Answering from Relational Database
Relational Database
Q: How many people participated in the game in Beijing?
A: 4,200
SQL: select #_participants, where city=beijing
Q: When was the latest game hosted?
A: 2012
SQL: argmax(city, year)
Question Answering System
Q: Which city hosted the longest Olympic game before the game in Beijing?
A: Athens
Learning System
year city #_days #_medals
2000 Sydney 20 2,000
2004 Athens 35 1,500
2008 Beijing 30 2,500
2012 London 40 2,300
Neural Enquirer
• Query Encoder: encoding query
• Table Encoder: encoding entries in table
• Five Executors: executing query against table
Conducting matching between question and database entries multiple times
Query Encoder and Table Encoder
• Creating query embedding using RNN
• Creating table embedding for each entry using DNN
RNN
…
Query Representation
Query
DNN
Entry Representation
Field Value
Table Representation
Query Encoder Table Encoder
Executors
• Five layers; except the last layer, each layer has a reader, an annotator, and a memory
• Reader fetches important representation for each row, e.g., city=beijing
• Annotator encodes result representation for each row, e.g., row where city=beijing
Select #_participants where city = beijing
Question Answering from Knowledge Graph
Yin et al. 2016
Question Answering from Knowledge Graph
(Yao-Ming, spouse, Ye-Li)
(Yao-Ming, born, Shanghai)
(Yao-Ming, height, 2.29m)
... ...
(Ludwig van Beethoven, place of birth, Germany)
... ...
Knowledge Graph
Q: How tall is Yao Ming?
A: He is 2.29m tall and is visible from space. (Yao Ming, height, 2.29m)
Q: Which country was Beethoven from?
A: He was born in what is now Germany. (Ludwig van Beethoven, place of birth, Germany)
Question Answering System
Q: How tall is Liu Xiang? A: He is 1.89m tall
Learning System
Answer is generated
GenQA
• Interpreter: creates representation of question using RNN
• Enquirer: retrieves top k triples with highest matching scores using CNN model
• Generator: generates answer based on question and retrieved triples using attention-based RNN
• Attention model: controls generation of answer
Short Term Memory
Long Term Memory
(Knowledge Base)
How tall is Yao Ming?
Interpreter
Enquirer
Generator
He is 2.29m tall
Attention Model
Key idea:
• Generation of answer based on question and retrieved result
• Combination of neural processing and symbolic processing
Enquirer: Retrieval and Matching
• Retaining both symbolic representations and vector representations
• Using question words to retrieve top k triples
• Calculating matching scores between question and triples using CNN model
• Finding best matched triples
(how, tall, is, liu, xiang)
< liu xiang, height, 1.90m>
< yao ming, height, 2.26m>… …
<liu xiang, birth place, shanghai>
Retrieved Top k Triplesand Embeddings
Question and Embedding Matching
Generator: Answer Generation
• Generating answer using attention mechanism
• At each position, a variable decides whether to generate a word or use the object of the top triple
[Figure: given the question "How tall is Yao Ming?" and the retrieved triple <yao ming, height, 2.29m>, the decoder produces "He is 2.29m tall"; at position 3, the latent variable $z_3$ selects between generating a common word ($z_3 = 0$) and copying the object of the triple ($z_3 = 1$)]
Image Retrieval
Ma et al. 2015
Image Retrieval
a lady in a car
a man holds a cell phone
two ladies are chatting
Learning System
Retrieval System
Having dinner with friends in restaurant
Image Index
Multimodal CNN
• Represent text and image as vectors and then match the two vectors
• Word-level matching, phrase-level matching, sentence-level matching
• CNN model works better than RNN models (state of the art) for text
A dog is catching a ball
CNN CNN
MLP
Sentence-level Matching
• Combining image vector and sentence vector
……
CNN
MLP
a dog is catching a ball
Word-level Matching Model
• Adding image vector to word vectors
……
a dog is catching a ball
CNN
MLP
Summary
Summary
• Learning to Rank
– Approaches: pointwise, pairwise, listwise approaches
– Methods: Ranking SVM, IR SVM, AdaRank
• Learning to Match
– Semantic Matching
– Methods: RLSI, RMLS
• Deep Learning for Information Retrieval
– Deep learning is effective for hard IR problems
– Basic tools: word embedding, CNN, RNN
– Methods: Deep Match CNN, NRM, GenQA, Neural Enquirer, Multimodal CNN
References
• Jun Yin, Xin Jiang, Zhengdong Lu, Lifeng Shang, Hang Li, Xiaoming Li. Neural Generative Question Answering. Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI’16), 2972-2978, 2016.
• Pengcheng Yin, Zhengdong Lu, Hang Li, Ben Kao. Neural Enquirer: Learning to Query Tables with Natural Language. Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI’16), 2308-2314, 2016.
• Lin Ma, Zhengdong Lu, Lifeng Shang, Hang Li, Multimodal Convolutional Neural Networks for Matching Image and Sentence. Proceedings of 2015 IEEE International Conference on Computer Vision (ICCV), 2623-2631, 2015.
• Lifeng Shang, Zhengdong Lu, Hang Li. Neural Responding Machine for Short Text Conversation. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (ACL-IJCNLP'15), 2015.
• Baotian Hu, Zhengdong Lu, Hang Li, Qingcai Chen. Convolutional Neural Network Architectures for Matching Natural Language Sentences. Proceedings of Advances in Neural Information Processing Systems 27 (NIPS'14), 2042-2050, 2014.
• Hang Li, Jun Xu. Semantic Matching in Search. Foundations and Trends in Information Retrieval, Now Publishers, 2014.
• Wei Wu, Zhengdong Lu, Hang Li. Learning Bilinear Model for Matching Queries and Documents. Journal of Machine Learning Research (JMLR), 14: 2519-2548, 2013.
References
• Quan Wang, Jun Xu, Hang Li, Nick Craswell. Regularized Latent Semantic Indexing: A New Approach to Large Scale Topic Modeling. ACM Transactions on Information Systems (TOIS), 31(1): 5, 2013.
• Hang Li. A Short Introduction to Learning to Rank. IEICE Transactions on Information and Systems, E94-(10), 2011.
• Quan Wang, Jun Xu, Hang Li, Nick Craswell. Regularized Latent Semantic Indexing. Proceedings of the 34th Annual International ACM SIGIR Conference (SIGIR’11), 685-694, 2011.
• Hang Li. Learning to Rank for Information Retrieval and Natural Language Processing. Synthesis Lectures on Human Language Technology, Lecture 12, Morgan & Claypool Publishers, 2011.
• Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, Hang Li. Learning to Rank: From Pairwise Approach to Listwise Approach. Proceedings of the 24th International Conference on Machine Learning (ICML’07), 129-136, 2007.
• Jun Xu and Hang Li. AdaRank: A Boosting Algorithm for Information Retrieval. Proceedings of the 30th Annual International ACM SIGIR Conference (SIGIR’07), 391-398, 2007.
• Yunbo Cao, Jun Xu, Tie-Yan Liu, Hang Li, Yalou Huang, Hsiao-Wuen Hon. Adapting Ranking SVM to Document Retrieval. Proceedings of the 29th Annual International ACM SIGIR Conference (SIGIR’06), 186-193, 2006.
Thank you!