Machine Learning for Information Retrieval
Hang Li
Noah’s Ark Lab
Huawei Technologies
The Third Asian Summer School in Information Access
Kyoto, Japan, Aug 5, 2016
Outline of Tutorial
• Introduction
• Part 1: Learning to Rank
• Part 2: Learning to Match
• Part 3: Deep Learning for Information Retrieval
• Summary
Overview of Information Retrieval
Information and Knowledge Base
Information Retrieval System
Query
Relevant Result
Intent: Key Words, Question
Content: Documents, Images, Relational Tables
Machine Learning can play an important role
• Key questions: how to represent intent and content, how to match intent and content
• Ranking, indexing, etc. are less essential
• Interactive IR is not particularly considered here
Approach in Traditional IR
Query: star wars the force awakens reviews
Document: Star Wars: Episode VII. Three decades after the defeat of the Galactic Empire, a new threat arises.
• Representing query and document as tf-idf vectors $q$ and $d$
• Calculating cosine similarity between them: $f_{VSM}(q, d) = \frac{\langle q, d \rangle}{\|q\|\,\|d\|}$
• BM25, LM4IR, etc. can be considered as non-linear variants of $f(q, d)$
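To make the traditional approach concrete, here is a minimal sketch (not from the slides) that builds tf-idf vectors for a toy query and two toy documents and scores them with cosine similarity; the corpus and the whitespace tokenization are illustrative assumptions.

```python
# Sketch: tf-idf vectors + cosine similarity for the VSM approach above.
import math
from collections import Counter

docs = [
    "star wars episode vii three decades after the defeat of the galactic empire",
    "reviews of the latest science fiction movies",
]
query = "star wars the force awakens reviews"

def tf_idf_vector(tokens, idf):
    counts = Counter(tokens)
    return {w: counts[w] * idf.get(w, 0.0) for w in counts}

def cosine(u, v):
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

tokenized = [d.split() for d in docs]
vocab = {w for toks in tokenized for w in toks}
n = len(tokenized)
# smoothed idf; the exact smoothing is an assumption
idf = {w: math.log(n / (1 + sum(w in toks for toks in tokenized))) + 1.0 for w in vocab}

q_vec = tf_idf_vector(query.split(), idf)
for d, toks in zip(docs, tokenized):
    print(round(cosine(q_vec, tf_idf_vector(toks, idf)), 3), d[:40])
```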
Approach in Modern IR
• Conducting query and document understanding
• Representing query and document as feature vectors $v_q = (v_{q,1}, \ldots, v_{q,m_q})$ and $v_d = (v_{d,1}, \ldots, v_{d,n_d})$
• Calculating multiple matching scores $f(q, d)$ between query and document using learning to match
• Training ranker with matching scores as features using learning to rank
Query: star wars the force awakens reviews
Document: Star Wars: Episode VII. Three decades after the defeat of the Galactic Empire, a new threat arises.
“Easy” Problems in IR
• Search
– Matching between query and document
• Question Answering from Documents
– Matching between question and answer
• Learning to match & learning to rank: well studied so far
• Deep Learning may not help so much
“Hard” Problems in IR
• Image Retrieval
– Matching between text and image
– Not the same as traditional setting
• Question Answering from Knowledge Base
– Complicated matching between question and fact in knowledge base
• Generation-based Question Answering
– Generating answer to question based on facts in knowledge base
• Not well studied so far
• Deep Learning for Information Retrieval can make a big difference
Part 1. Learning to Rank
1.1. Overview of Learning to Rank
Ranking Plays Key Role in Many Applications
Ranking Problem: Example = Document Retrieval
query: $q$
documents: $D = \{d_1, d_2, \ldots, d_N\}$
ranking of documents: $d_{q,1}, d_{q,2}, \ldots, d_{q,n_q}$, ordered by $f(q, d)$
ranking based on relevance, importance, preference
Ranking Problem: Example = Recommender System
Item1 Item2 Item3 ... ItemN
User1 5 4
User2 1 2 2
... ? ? ?
UserM 4 3
Ranking ProblemExample = Machine Translation
[Figure: a source-language sentence $f$ is fed to a Generative Model, which produces ranked sentence candidates $e_1, e_2, \ldots, e_{1000}$ in the target language; a Re-Ranking Model then outputs the re-ranked top sentence $\tilde{e}$ in the target language]
1.2. Problem and Approaches of Learning to Rank
Ranking Problem: Example = Document Search
query: $q$
documents: $D = \{d_1, d_2, \ldots, d_N\}$
ranking of documents: $d_{q,1}, d_{q,2}, \ldots, d_{q,n_q}$, ordered by $f(q, d)$
ranking based on relevance, importance, preference
Traditional Approach = Probabilistic Model
query: $q$
documents: $d_1, d_2, \ldots, d_N$
ranking of documents: each $d_i$ is scored by the relevance probability $P(r \mid q, d_i)$, with $r \in \{0, 1\}$, and the documents are ranked by this probability
BM25 [Robertson & Walker 94]
query: $q$; documents: $d_1, d_2, \ldots, d_N$
ranking function:
$f_{BM25}(q, d) = \sum_{w \in q \cap d} \mathrm{idf}(w) \cdot \frac{(k_1 + 1)\,\mathrm{tf}(w)}{k_1\left(1 - b + b\,\frac{dl}{avgdl}\right) + \mathrm{tf}(w)}$
where $\mathrm{tf}(w)$ is the frequency of $w$ in $d$, $dl$ is the document length, $avgdl$ is the average document length, and $k_1$, $b$ are parameters; documents are ranked by this score
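A minimal sketch of BM25 scoring following the formula above; the defaults $k_1 = 1.2$, $b = 0.75$, the particular idf variant, and the toy documents are assumptions, not values from the slide.

```python
# Sketch: BM25 score of a query against one document.
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freq, n_docs, avgdl, k1=1.2, b=0.75):
    tf = Counter(doc_terms)
    dl = len(doc_terms)
    score = 0.0
    for w in query_terms:
        if w not in tf:
            continue
        # a common smoothed idf variant (assumption)
        idf = math.log((n_docs - doc_freq.get(w, 0) + 0.5) / (doc_freq.get(w, 0) + 0.5) + 1.0)
        # term-frequency saturation with document-length normalization
        sat = ((k1 + 1) * tf[w]) / (k1 * (1 - b + b * dl / avgdl) + tf[w])
        score += idf * sat
    return score

docs = [["star", "wars", "reviews"], ["galactic", "empire", "threat"]]
doc_freq = Counter(w for d in docs for w in set(d))
avgdl = sum(len(d) for d in docs) / len(docs)
print(bm25_score(["star", "wars"], docs[0], doc_freq, len(docs), avgdl))
```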
PageRank[Page et al, 1999]
$P(d_i) = \alpha \sum_{d_j \in M(d_i)} \frac{P(d_j)}{L(d_j)} + (1 - \alpha)\,\frac{1}{n}$
where $M(d_i)$ is the set of pages linking to $d_i$, $L(d_j)$ is the number of out-links of $d_j$, $n$ is the number of pages, and $\alpha$ is the damping factor
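A minimal sketch of the PageRank iteration above by power iteration on a toy link graph; the graph and the damping factor alpha = 0.85 are illustrative assumptions.

```python
# Sketch: power iteration for the PageRank equation above.
import numpy as np

links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}   # page -> pages it links to (toy graph)
n, alpha = 4, 0.85
p = np.full(n, 1.0 / n)
for _ in range(50):
    new_p = np.full(n, (1 - alpha) / n)        # teleportation term (1 - alpha) / n
    for i, outs in links.items():
        for j in outs:
            new_p[j] += alpha * p[i] / len(outs)   # contribution P(d_i) / L(d_i)
    p = new_p
print(p, p.sum())
```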
New Approach = Learning to Rank
[Framework] Training data: queries $q_1, \ldots, q_m$; for each query $q_i$, associated documents $d_{i,1}, d_{i,2}, \ldots, d_{i,n_i}$ with relevance labels $y_{i,1}, y_{i,2}, \ldots, y_{i,n_i}$; each query-document pair is represented as a feature vector with its label, $(x_{i,j}, y_{i,j})$.
1. Data Labeling (rank)  2. Feature Extraction  3. Learning of $f(x)$
The Learning System learns the ranking model $f(q, d)$ from the labeled data; the Ranking System uses it to rank the documents $D = \{d_1, d_2, \ldots, d_N\}$ of a new query $q_{m+1}$ by $f(q_{m+1}, d)$.
Training Process
1. Data Labeling(rank)
2. Feature Extraction
Testing Process
3. Ranking with $f(x)$: the documents $d_{m+1,1}, \ldots, d_{m+1,n_{m+1}}$ of a new query $q_{m+1}$ are represented as feature vectors $x_{m+1,1}, \ldots, x_{m+1,n_{m+1}}$ and sorted by $f(x_{m+1,j})$
4. Evaluation: the predicted ranking is compared with the labels $y_{m+1,1}, \ldots, y_{m+1,n_{m+1}}$ to produce the evaluation result
Notes
• Features are functions of query and document
• Query and associated documents form a group
• Groups are i.i.d. data
• Feature vectors within group are not i.i.d. data
• Ranking model is function of features
• Several data labeling methods (here labeling of grade)
Issues in Learning to Rank
• Data Labeling
• Feature Extraction
• Evaluation Measure
• Learning Method (Model, Loss Function, Algorithm)
Data Labeling Problem
• E.g., relevance of documents w.r.t. query
Doc A
Doc B
Doc C
Query
Data Labeling Methods
• Labeling of Grades
– Multiple levels (e.g., relevant, partially relevant, irrelevant)
– Widely used in IR
• Labeling of Ordered Pairs
– Ordered pairs between documents (e.g. A>B, B>C)
– Implicit relevance judgment: derived from click-through data
• Creation of List
– List (or permutation) of documents is given
– Ideal but difficult to implement
Implicit Relevance Judgment
Doc A
Doc B
Doc C
ranking of documents at search system
users often clicked on Doc B
ordered pair: B > A
Feature Extraction
[Figure: for the query, each of Doc A, Doc B, and Doc C is represented by a feature vector containing query-document features (e.g., BM25) and document features (e.g., PageRank)]
Example Features
Evaluation Measures
• Important to rank top results correctly
• Measures
– NDCG (Normalized Discounted Cumulative Gain)
– MAP (Mean Average Precision)
– MRR (Mean Reciprocal Rank)
– WTA (Winners Take All)
– Kendall’s Tau
NDCG
• Evaluating ranking using labeled grades
• NDCG at position j
$\mathrm{NDCG}(j) = n_j^{-1} \sum_{i=1}^{j} \frac{2^{r(i)} - 1}{\log(1 + i)}$
where $r(i)$ is the grade of the document at position $i$ and $n_j$ is the normalizing factor (the DCG of the perfect ranking at position $j$)
NDCG (cont’)
• Example: perfect ranking
– (3, 3, 2, 2, 1, 1, 1) grade r=3,2,1
– (7, 7, 3, 3, 1, 1, 1) gain
– (1, 0.63, 0.5, 0.43, 0.39, 0.36, 0.33) position discount
– (7, 11.41, 12.91, …) DCG
– (1/7, 1/11.41, 1/12.91, …) normalizing factor
– (1, 1,1,1,1,1,1) NDCG for perfect ranking
gain: $2^{r(j)} - 1$; position discount: $1 / \log(1 + j)$; DCG at position $j$: $\sum_{i=1}^{j} \frac{2^{r(i)} - 1}{\log(1 + i)}$; normalizing factor: $n_j^{-1}$
NDCG (cont’)
• Example: imperfect ranking
– (2, 3, 2, 3, 1, 1, 1)
– (3, 7, 3, 7, 1, 1, 1) Gain
– (1, 0.63, 0.5, 0.43, 0.39, 0.36, 0.33) Position discount
– (3, 7.41, 8.91, … ) DCG
– (1/7, 1/11.41, 1/12.91, …) normalizing factor
– (0.43, 0.65, 0.69, ….) NDCG
• Imperfect ranking decreases NDCG
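A minimal sketch that reproduces the NDCG numbers of the worked examples above (gain $2^r - 1$, log base 2 discount); it assumes the grades are given as a plain list.

```python
# Sketch: NDCG at position j, as defined above.
import math

def dcg(grades, j):
    return sum((2 ** r - 1) / math.log2(1 + i) for i, r in enumerate(grades[:j], start=1))

def ndcg(grades, j):
    ideal = sorted(grades, reverse=True)   # perfect ranking of the same grades
    return dcg(grades, j) / dcg(ideal, j)

imperfect = [2, 3, 2, 3, 1, 1, 1]
print([round(ndcg(imperfect, j), 2) for j in range(1, 4)])   # ~[0.43, 0.65, 0.69]
```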
Relations with Other Learning Tasks
• No need to predict category (vs. classification)
• No need to predict value of $f(q, d)$ (vs. regression)
• Relative ranking order is more important (vs. ordinal regression)
• Learning to rank can be approximated by classification, regression, ordinal regression
Ordinal Regression (Ordinal Classification)
• Categories are ordered
– 5, 4, 3, 2, 1
– e.g., rating restaurants
• Prediction
– Map to ordered categories
Three Major Approaches
• Pointwise approach
• Pairwise approach
• Listwise approach
• SVM based
• Boosting based
• Neural Network based
• Others
Categorization of Learning to Rank Methods
Pointwise Approach
• Transforming ranking to regression, classification, or ordinal classification
• Query-document group structure is ignored
Pairwise Approach
• Transforming ranking to pairwise classification
• Query-document group structure is ignored
Listwise Approach
• List as instance
• Query-document group structure is used
• Straightforwardly represents learning to rank problem
Evaluation Results
• Pairwise approach and listwise approach perform better than pointwise approach
• LambdaMART performs best in the Yahoo! Learning to Rank Challenge
• No significant difference among pairwise and listwise methods
1.3. Methods of Learning to Rank
Ranking SVM
Herbrich et al., 1999
Pairwise Classification
• Converting document list to document pairs
[Figure: a query with Doc A, Doc B, and Doc C labeled with grades 3, 2, and 1 is converted into the ordered document pairs implied by the grades, e.g., (A, B), (A, C), (B, C)]
Transforming Ranking to Pairwise Classification
• Input space: $X$
• Ranking function: $f: X \rightarrow \mathbb{R}$
• Ranking: $x_i \succ x_j \Leftrightarrow f(x_i; w) > f(x_j; w)$
• Linear ranking function: $f(x; w) = \langle w, x \rangle$
• Transforming to pairwise classification: $x_i \succ x_j \Leftrightarrow \langle w, x_i - x_j \rangle > 0$; each pair becomes an example $(x_i - x_j, z)$ with $z = +1$ if $x_i \succ x_j$ and $z = -1$ if $x_j \succ x_i$
Ranking Problem
[Figure: feature vectors $x_1$, $x_2$, $x_3$ with grades 3, 2, 1 are ranked by their projections onto the weight vector $w$]
Transformed Pairwise Classification Problem
[Figure: pairwise classification with $f(x; w)$; positive examples ($+1$): $x_1 - x_3$, $x_2 - x_3$, $x_1 - x_2$; negative examples ($-1$): $x_3 - x_1$, $x_3 - x_2$, $x_2 - x_1$; the negative examples are redundant]
Ranking SVM
• Pairwise classification on differences of feature vectors
• Corresponding positive and negative examples
• Negative examples are redundant and can be discarded
• Hyperplane passes through the origin
• Soft margin and kernel can be used
• Ranking SVM = pairwise classification SVM
Learning of Ranking SVM
$\min_{w} \sum_{i=1}^{l} \left[ 1 - y_i \langle w, x_i^{(1)} - x_i^{(2)} \rangle \right]_{+} + \lambda \|w\|^2$
where $[s]_{+} = \max(0, s)$; this is equivalent to the quadratic program
$\min_{w, \xi} \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i$
s.t. $y_i \langle w, x_i^{(1)} - x_i^{(2)} \rangle \geq 1 - \xi_i$, $\xi_i \geq 0$, $i = 1, \ldots, N$
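A minimal sketch of the Ranking SVM objective above, trained by stochastic subgradient descent on pair differences $x_i^{(1)} - x_i^{(2)}$; the synthetic data and the hyperparameters are illustrative assumptions, not the original setup.

```python
# Sketch: pairwise hinge loss + L2 regularization (Ranking SVM style) via SGD.
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(200, 3))
scores = X @ w_true

# build pairs (x1, x2) where x1 should rank above x2 (label y = +1)
pairs = []
for _ in range(500):
    i, j = rng.integers(0, len(X), size=2)
    if scores[i] != scores[j]:
        hi, lo = (i, j) if scores[i] > scores[j] else (j, i)
        pairs.append((X[hi], X[lo]))

w, lam, lr = np.zeros(3), 0.01, 0.1
for epoch in range(20):
    for x1, x2 in pairs:
        margin = w @ (x1 - x2)
        # subgradient of [1 - <w, x1 - x2>]_+ + lam * ||w||^2
        grad = 2 * lam * w - ((x1 - x2) if margin < 1 else 0.0)
        w -= lr * grad
print(w / np.linalg.norm(w))   # direction should approximate w_true / ||w_true||
```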
IR SVM
Cao et al., 2006
Cost-sensitive Pairwise Classification
• Converting to document pairs
[Figure: as in Ranking SVM, a query with Doc A, Doc B, and Doc C labeled with grades 3, 2, and 1 is converted into ordered document pairs; pairs involving the top grades are critical, while the others are not critical]
Problems with Ranking SVM
• Not sufficient emphasis on correct ranking at the top
  grades: 3, 2, 1
  ranking 1: 2 3 2 1 1 1 1
  ranking 2: 3 2 1 2 1 1 1
  ranking 2 should be better than ranking 1, but Ranking SVM views them as the same
• Numbers of pairs vary according to queries
  q1: 3 2 2 1 1 1 1
  q2: 3 3 2 2 2 1 1 1 1 1
  number of pairs for q1: 2 (grade 3-2) + 4 (grade 3-1) + 8 (grade 2-1) = 14
  number of pairs for q2: 6 (grade 3-2) + 10 (grade 3-1) + 15 (grade 2-1) = 31
  Ranking SVM is biased toward q2
IR SVM
• Solving the two problems of Ranking SVM
• Higher weight $\tau_{k(i)}$ on important grade pairs
• Normalization weight $\mu_{q(i)}$ on pairs within a query
• IR SVM = Ranking SVM using modified hinge loss
Modified Hinge Loss function
$\min_{w} \sum_{i=1}^{l} \tau_{k(i)}\, \mu_{q(i)} \left[ 1 - y_i \langle w, x_i^{(1)} - x_i^{(2)} \rangle \right]_{+} + \lambda \|w\|^2$
[Figure: the modified hinge loss as a function of $y\, f(x^{(1)} - x^{(2)})$, with different slopes for different grade-pair weights]
Learning of IR SVM
$\min_{w} \sum_{i=1}^{l} \tau_{k(i)}\, \mu_{q(i)} \left[ 1 - y_i \langle w, x_i^{(1)} - x_i^{(2)} \rangle \right]_{+} + \lambda \|w\|^2$
equivalent to the quadratic program
$\min_{w, \xi} \frac{1}{2}\|w\|^2 + \sum_{i=1}^{l} C_i \xi_i$, with $C_i = \tau_{k(i)}\, \mu_{q(i)}\, C$
s.t. $y_i \langle w, x_i^{(1)} - x_i^{(2)} \rangle \geq 1 - \xi_i$, $\xi_i \geq 0$, $i = 1, \ldots, l$
AdaRank
Xu and Li, 2007
Listwise Loss
Training data: for each query $q_i$, a list of feature vectors with labels $(x_{i,1}, y_{i,1}), (x_{i,2}, y_{i,2}), \ldots, (x_{i,n_i}, y_{i,n_i})$, for $i = 1, \ldots, m$; the loss is defined on the whole list of each query
AdaRank
• Optimizing exponential loss function
• Algorithm: AdaBoost-like algorithm for ranking
Loss Function of AdaRank
Any evaluation measure taking values between [-1, +1] can be used in the loss
AdaRank Algorithm
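The algorithm slide itself did not survive extraction, so the following is only a rough AdaBoost-style sketch in the spirit of AdaRank: weak rankers are single features, the measure E is a toy per-query score in [-1, +1], and all data and settings are illustrative assumptions.

```python
# Sketch: AdaBoost-like boosting over queries with an evaluation measure E.
import numpy as np

rng = np.random.default_rng(1)
n_queries, n_docs, n_feats = 20, 5, 4
X = rng.normal(size=(n_queries, n_docs, n_feats))               # feature vectors
y = rng.integers(0, 2, size=(n_queries, n_docs)).astype(float)  # relevance labels

def measure(scores, labels):
    # toy evaluation in [-1, +1]: +1 if the top-scored doc is relevant, else -1
    return 1.0 if labels[np.argmax(scores)] > 0 else -1.0

P = np.full(n_queries, 1.0 / n_queries)   # weight distribution over queries
alphas, chosen = [], []
for t in range(10):
    # pick the feature (weak ranker) with the best weighted performance
    perf = [sum(P[i] * measure(X[i, :, k], y[i]) for i in range(n_queries))
            for k in range(n_feats)]
    k = int(np.argmax(perf))
    e = np.array([measure(X[i, :, k], y[i]) for i in range(n_queries)])
    alpha = 0.5 * np.log((P @ (1 + e) + 1e-12) / (P @ (1 - e) + 1e-12))
    alphas.append(alpha); chosen.append(k)
    # re-weight queries by the current ensemble's performance
    f = np.array([measure(sum(a * X[i, :, j] for a, j in zip(alphas, chosen)), y[i])
                  for i in range(n_queries)])
    P = np.exp(-f); P /= P.sum()
print(list(zip(chosen, np.round(alphas, 2))))
```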
Part 2. Learning to Match
2-1. Semantic Matching in Search
Query Document Mismatch is the Biggest Challenge in Search
Query Document Mismatch
• Same intent can be represented by different queries (representations)
• Search is still mainly based on term level matching
• Query document mismatch occurs when searcher and author use different representations
Examples of Query Document Mismatch
Query | Document | Term Matching | Semantic Matching
seattle best hotel | seattle best hotels | no | yes
pool schedule | swimming pool schedule | no | yes
natural logarithm transformation | logarithm transformation | partial | yes
china kong | china hong kong | partial | no
why are windows so expensive | why are macs so expensive | partial | no
Matching at Different Levels
Semantic Matching
Form Phrase Sense Topic Structure
Term Matching
Structure Identification
Topic Identification
Similar Query Finding
Phrase Identification
Spelling Error Correction
Sense
michael jordan berkele
query form: michael jordan berkeley
phrase: michael jordan
similar query: michael i. jordan
main phrase: michael jordan
Phrase
Term
Structure
topic: machine learning, berkeley
Topic
phrase: berkeley
Query Understanding
Title Structure Identification
Topic Identification
Key Phrase Identification
Phrase IdentificationTerm
Topic
Homepage of Michael Jordan
Michael Jordan is Professor in the
Department of Electrical Engineering
……
key phrase: michael jordan, professor,
electrical engineeringKey Phrase
topic: machine learning, berkeley
main phrase in title: michael jordan
phrase: michael jordan, professor,
department, electrical engineering
Phrase
Structure
Document Understanding
Query Representation
Document Representation
Query-Document Matching
Relevance Ranking
Query form: michael jordan berkeley
Similar query: michael i jordan
Main phrase: michael jordan
Phrase: michael jordan, berkeley
Topic : machine learning
Document: michael jordan homepage
Main phrase in title: michael jordan
Key phrase: michael jordan, berkeley
Phrase: michael jordan, professor,
department of electrical engineering
Topic : machine learning, berkeley
Semantic Matching
Matching in Different Ways
74
q d
q’
d’
c
Query Reformulation
Document transformation
Query and document transformation
No transformation
Web Search System
[Figure: web search system with components Crawling, Document Understanding, Indexing, Index, Retrieving, Query Understanding, Query-Document Matching, Ranking, and User Interface, connecting the Web and the User]
Machine Learning for Query Document Matching in Web Search
Learning for Matching between Query and Document
• Learning matching function $f_M(q, d)$ or $p_M(r \mid q, d)$
• Using training data $(q_1, d_1, r_1), \ldots, (q_N, d_N, r_N)$
• $q_1, \ldots, q_N$ and $d_1, \ldots, d_N$ can be ids or feature vectors
• $r_1, \ldots, r_N$ can be binary or numerical values
• Using relations in data and/or prior knowledge
Long Tail Challenge
• Head pages have rich anchor texts and click data
• Tail queries and pages suffer more from mismatch
• Problem of propagating information and knowledge from head to tail
Relation between Matching and Ranking
• In traditional IR:
– Ranking = matching, e.g., $f(q, d) = f_{BM25}(q, d)$ or $f(q, d) = f_{LMIR}(q, d) = P(q \mid d)$
• In web search:
– Ranking and matching become separated: ranking uses multiple features such as $f_{BM25}(q, d)$, $g_{PageRank}(d)$, ...
– Learning to rank becomes state of the art
– Matching = feature learning for ranking
Matching vs Ranking
           | Matching                                    | Ranking
Prediction | Matching degree between query and document | Ranking list of documents
Model      | f(q, d)                                     | f(q, d1), f(q, d2), ..., f(q, dn)
Challenge  | Mismatch                                    | Correct ranking on top
In search, first matching and then ranking
Matching Functions as Features in Learning to Rank
• Term level matching: $f_{BM25}(q, d)$
• Phrase level matching: $f_{P}(q, d)$
• Sense level matching: $f_{S}(q, d)$
• Topic level matching: $f_{T}(q, d)$
• Structure level matching: $f_{C}(q, d)$
• Term level matching (spelling, stemming): $f_{BM25}(q', d)$, where $q$ is rewritten to $q'$
2-2. Overview of Learning to Match
Matching between Heterogeneous Data is Everywhere
• Matching between user and product (collaborative filtering)
• Matching between text and image (image annotation)
• Matching between people (dating)
• Matching between languages (machine translation)
• Matching between receptor and ligand (drug design)
Formulation of Learning Problem
• Learning matching function $f(x, y)$
• Using training data $(x_1, y_1, r_1), \ldots, (x_N, y_N, r_N)$
• Generated according to $x \sim P(X)$, $y \sim P(Y \mid X)$, $r \sim P(R \mid X, Y)$
Formulation of Learning Problem
• Loss function: $L(r, f(x, y))$
• Risk function: $R(f) = \int_{\mathcal{X} \times \mathcal{Y} \times \mathcal{R}} L(r, f(x, y)) \, P(x, y, r)\, d(x, y, r)$
• Objective function in learning: $\min_{f \in \mathcal{F}} \sum_{i=1}^{N} L(r_i, f(x_i, y_i)) + \Omega(f)$
Matching Problem: Instance Matching
Instances
[Figure: a bipartite graph between instances $x_1, x_2, \ldots, x_m$ and $y_1, y_2, \ldots, y_n$ with observed matching values on the edges]
Can be represented as matching between nodes in bipartite graph
Matching Problem: Feature Matching
Features
[Figure: objects $x_1, \ldots, x_m$ and $y_1, \ldots, y_n$ represented by features in two different spaces, with observed matching values between them]
Can be represented as matching between objects in two spaces
Matching Problem: Structure Matching
Structures
[Figure: structured objects $x_1, \ldots, x_m$ and $y_1, \ldots, y_n$ with observed matching values between them]
2-3. Methods of Learning to Match
Regularized Latent Semantic Indexing
Wang et al. 2011
Regularized Latent Semantic Indexing
• Motivation
– Matching between query and document at topic level
– Scale up to large datasets (vs. existing methods)
• Approach
– Matrix factorization
– Regularization on topics and documents (vs. sparse coding)
– Learning problem can be easily decomposed
• Results
– l1 on topics leads to sparse topics and l2 on documents leads to accurate matching
– Comparable with existing methods in topic discovery and search relevance
– But can easily scale up to large document sets
Query and Document Matching in Topic Space
[Figure: the query $q$ and documents $d_1, d_2, \ldots, d_n$ are mapped from the document (term) space into the topic space, where matching is performed]
Regularized Latent Semantic Indexing
[Figure: the term-document matrix $D$ is factorized as $D \approx U V$; column $d_n$ of $D$ is the term representation of document $n$, $U$ is the term-topic matrix whose topics are sparse, and column $v_n$ of the topic-document matrix $V$ is the topic representation of document $n$; the document representations are smooth]
Optimization Strategy
Coordinate Descent
Analytic Solution
RLSI Algorithm
terms processed in parallel
docs processed in parallel
• Single machine multi-core version
• Multiple machine version (MapReduce and MPI)
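A minimal sketch of RLSI-style alternating optimization (not the paper's exact updates): the l2-regularized topic-document matrix V has an analytic ridge solution, and the l1-regularized term-topic matrix U is updated by coordinate descent with soft-thresholding; matrix sizes and regularization weights are illustrative assumptions.

```python
# Sketch: min ||D - UV||_F^2 + lam1*|U|_1 + lam2*||V||_F^2 by alternating updates.
import numpy as np

rng = np.random.default_rng(0)
n_terms, n_docs, k = 50, 30, 5
D = rng.random((n_terms, n_docs))          # term-document matrix
U = rng.random((n_terms, k))               # term-topic matrix (to become sparse)
lam1, lam2 = 0.1, 0.1

def soft_threshold(a, t):
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

for it in range(20):
    # V-step: ridge regression, analytic solution (documents could be solved in parallel)
    V = np.linalg.solve(U.T @ U + lam2 * np.eye(k), U.T @ D)
    # U-step: coordinate descent over topics with soft-thresholding (lasso-style)
    for j in range(k):
        R = D - U @ V + np.outer(U[:, j], V[j])   # residual excluding topic j
        denom = V[j] @ V[j] + 1e-12
        U[:, j] = soft_threshold(R @ V[j], lam1 / 2) / denom

print("sparsity of U:", np.mean(U == 0.0), "reconstruction error:", np.linalg.norm(D - U @ V))
```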
Regularized Matching in Latent Space (RMLS)
Wu et al., 2013
Matching in Latent Space
• Motivation
– Matching between query and document in latent space
• Assumption
– Queries have similarity
– Documents have similarity
– Click-through data represent “similarity” relations between queries and documents
• Approach
– Projection to latent space
– Regularization or constraints
• Results
– Significantly enhance accuracy of query document matching
Matching in Latent Space
[Figure: queries $q_1, q_2, \ldots, q_m$ in the query space and documents $d_1, d_2, \ldots, d_n$ in the document space are projected by mappings $L_q$ and $L_d$ into a common latent space, where they are matched]
• Matching between Heterogeneous Data
• Example: Image Annotation
hook
fishing
singer
soldier
warrior
microphone
Projecting Keywords and Images into Latent Space
Partial Least Square (PLS)
• Setting
– Two spaces: $q \in \mathcal{Q}$, $d \in \mathcal{D}$
• Input
– Training data: $\{(q_i, d_i, c_i)\}_{i=1}^{N}$
• Output
– Matching function $f(q, d)$
• Assumption
– Two linear (and orthonormal) transformations $L_q$, $L_d$
– Dot product as similarity function: $f(q, d) = \langle L_q q, L_d d \rangle$
• Optimization
$(L_q, L_d) = \arg\max_{L_q, L_d} \sum_i c_i \langle L_q q_i, L_d d_i \rangle$ s.t. $L_q^{T} L_q = I$, $L_d^{T} L_d = I$
Solution of Partial Least Square
• Non-convex optimization
• Global optimal solution exists
• Global optimum can be found by solving SVD (Singular Value Decomposition)
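A minimal sketch of the SVD solution: the orthonormal maps $L_q$ and $L_d$ are read off the top singular vectors of the weighted cross-covariance matrix $\sum_i c_i q_i d_i^T$; the toy data and the latent dimension are assumptions.

```python
# Sketch: solving the PLS objective above by SVD.
import numpy as np

rng = np.random.default_rng(0)
n, dim_q, dim_d, k = 100, 20, 30, 5
Q = rng.normal(size=(n, dim_q))            # query vectors q_i
D = rng.normal(size=(n, dim_d))            # document vectors d_i
c = rng.random(n)                          # click-through weights c_i

A = (Q * c[:, None]).T @ D                 # sum_i c_i q_i d_i^T
U, s, Vt = np.linalg.svd(A)
L_q = U[:, :k].T                           # maps query space -> latent space
L_d = Vt[:k]                               # maps document space -> latent space

def match(q, d):
    return (L_q @ q) @ (L_d @ d)           # f(q, d) = <L_q q, L_d d>

print(round(match(Q[0], D[0]), 3))
```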
Regularized Mapping to Latent Space (RMLS)
• Setting
– Two spaces: $q \in \mathcal{Q}$, $d \in \mathcal{D}$
• Input
– Training data: $\{(q_i, d_i, c_i)\}_{i=1}^{N}$
• Output
– Matching function $f(q, d)$
• Assumption
– Two linear (and sparse) transformations $L_q$, $L_d$
– Dot product as similarity function: $f(q, d) = \langle L_q q, L_d d \rangle$
• Optimization
$(L_q, L_d) = \arg\max_{L_q, L_d} \sum_i c_i \langle L_q q_i, L_d d_i \rangle$ s.t. $|l_q| \leq \lambda_q$, $|l_d| \leq \lambda_d$, $\|l_q\| \leq \theta_q$, $\|l_d\| \leq \theta_d$, where $l_q$ and $l_d$ are rows of $L_q$ and $L_d$
Solution of Regularized Mapping to Latent Space (RMLS)
• Coordinate Descent
• Repeat
– Fix $L_q$, update $L_d$
– Fix $L_d$, update $L_q$
• No guarantee to find global optimum
• Updates can be parallelized by rows
Comparison between PLS and RMLS
             | PLS                          | RMLS
Assumption   | Orthogonal                   | L1 and L2 regularization
Optimization | Singular Value Decomposition | Coordinate Descent
Optimality   | Global optimum               | Local optimum
Efficiency   | Low                          | High
Scalability  | Low                          | High
Part 3. Deep Learning for Information Retrieval
3-1: Overview of Deep Learning for Information Retrieval
Hard Problems in IR
Q: How tall is Yao Ming?
Q: A dog catching a ball
Name Height Weight
Yao Ming 2.29m 134kg
Liu Xiang 1.89m 85kg
Key Questions: How to Represent Intent and Content, How to Match Intent and Content
Image Retrieval
Question Answering from Knowledge Base
Q:How far is sun from earth?
The average distance between the Sun and the Earth is about 92,935,700 miles.
Generation-based Question Answering
A: It is about 93 million miles
(No tag on images)
Deep Learning and IR
Intent ContentMatching
Deep Learning
Recent Progress: Deep Learning Is Particularly Effective for Hard IR Problems
3-2: Basics of Deep Learning
Word Embedding
Word Embedding
• Motivation: representing words with low-dimensional real-valued vectors, utilizing them as input to deep learning methods, vs one-hot vectors
• Method: SGNS (Skip-Gram with Negative Sampling)
• Tool: Word2Vec
• Input: words and their contexts in documents
• Output: embeddings of words
• Assumption: similar words occur in similar contexts
• Interpretation: factorization of mutual information matrix
• Advantage: compact representations (usually around 100 dimensions)
Skip-Gram with Negative Sampling(Mikolov et al., 2013)
• Input: occurrences between words and contexts
• Probability model:
[Example: co-occurrence count matrix $M$ between words $w_1, w_2, w_3$ and contexts $c_1, \ldots, c_5$]
$P(D = 1 \mid w, c) = \sigma(\vec{w} \cdot \vec{c}) = \frac{1}{1 + e^{-\vec{w} \cdot \vec{c}}}$
$P(D = 0 \mid w, c) = 1 - \sigma(\vec{w} \cdot \vec{c}) = \frac{1}{1 + e^{\vec{w} \cdot \vec{c}}}$
Skip-Gram with Negative Sampling
• Word vector and context vector: lower-dimensional vectors (the dimensionality is a parameter)
• Goal: learning of the probability model from data
• Take co-occurrence data as positive examples
• Negative sampling: randomly sample k unobserved pairs
as negative examples
• Objective function in learning
• Algorithm: stochastic gradient descent
$L = \sum_{w} \sum_{c} \#(w, c) \left( \log \sigma(\vec{w} \cdot \vec{c}) + k \, \mathbb{E}_{c_N \sim P_N} \left[ \log \sigma(-\vec{w} \cdot \vec{c}_N) \right] \right)$
where $\#(w, c)$ is the number of co-occurrences of $w$ and $c$, and $(w, c_N)$ are the negative samples
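A minimal sketch of skip-gram with negative sampling trained by SGD on a toy corpus; the window size, dimensionality, number of negatives, and the uniform negative-sampling distribution are simplifying assumptions.

```python
# Sketch: SGNS with logistic loss on observed pairs plus k random negatives.
import numpy as np

corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, dim, window, k, lr = len(vocab), 16, 2, 3, 0.05

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(V, dim))   # word vectors
C = rng.normal(scale=0.1, size=(V, dim))   # context vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for epoch in range(200):
    for pos, word in enumerate(corpus):
        w = idx[word]
        for off in range(-window, window + 1):
            if off == 0 or not (0 <= pos + off < len(corpus)):
                continue
            c = idx[corpus[pos + off]]
            wv, cv = W[w].copy(), C[c].copy()
            g = sigmoid(wv @ cv) - 1.0          # positive pair: push sigma(w.c) to 1
            W[w] -= lr * g * cv
            C[c] -= lr * g * wv
            for c_neg in rng.integers(0, V, size=k):
                wv, nv = W[w].copy(), C[c_neg].copy()
                g = sigmoid(wv @ nv)            # negative pair: push sigma(w.c_N) to 0
                W[w] -= lr * g * nv
                C[c_neg] -= lr * g * wv

sims = W @ W[idx["cat"]]
print([vocab[i] for i in np.argsort(-sims)[:3]])   # nearest words to "cat"
```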
Interpretation as Matrix Factorization(Levy & Goldberg 2014)
• Pointwise Mutual Information Matrix
$\mathrm{PMI}(w, c) = \log \frac{P(w, c)}{P(w)\,P(c)}$
[Example: PMI matrix $M$ between words $w_1, w_2, w_3$ and contexts $c_1, \ldots, c_5$]
Interpretation as Matrix Factorization
$M = W C^{T}$
[Figure: the (shifted) PMI matrix $M$ is factorized into the word embedding matrix $W$ and the context embedding matrix $C$; this matrix factorization is equivalent to SGNS]
Recurrent Neural Network
Recurrent Neural Network
• Motivation: representing sequence of words and utilizing the representation in deep learning methods
• Input: sequence of word embeddings, denoting sequence of words (e.g., sentence)
• Output: sequence of internal representations (hidden states)
• Variants: LSTM and GRU, to deal with long distance dependency
• Learning of model: stochastic gradient descent
• Advantage: handling arbitrarily long sequence; can be used as part of deep model for sequence processing (e.g., language modeling)
Recurrent Neural Network (RNN)(Mikolov et al. 2010)
Example: "the cat sat on the mat"
$h_t = f(h_{t-1}, x_t)$
where $x_t$ is the input word embedding at position $t$ and $h_{t-1}$ is the previous hidden state
Recurrent Neural Network
$h_t = f(h_{t-1}, x_t) = \tanh(W_{hh} h_{t-1} + W_{hx} x_t + b_h)$
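A minimal sketch of the vanilla RNN recurrence above applied to a toy sentence; the embeddings and weights are randomly initialized, and the sizes are assumptions.

```python
# Sketch: forward pass of the recurrence h_t = tanh(W_hh h_{t-1} + W_hx x_t + b_h).
import numpy as np

rng = np.random.default_rng(0)
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
dim_x, dim_h = 8, 16
E = rng.normal(scale=0.1, size=(len(vocab), dim_x))      # word embeddings
W_hh = rng.normal(scale=0.1, size=(dim_h, dim_h))
W_hx = rng.normal(scale=0.1, size=(dim_h, dim_x))
b_h = np.zeros(dim_h)

def rnn_forward(tokens):
    h = np.zeros(dim_h)
    states = []
    for tok in tokens:
        x = E[vocab[tok]]
        h = np.tanh(W_hh @ h + W_hx @ x + b_h)            # the recurrence above
        states.append(h)
    return states

states = rnn_forward("the cat sat on the mat".split())
print(len(states), states[-1][:4])
```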
Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997)
$i_t = \sigma(W_{ih} h_{t-1} + W_{ix} x_t + b_i)$
$f_t = \sigma(W_{fh} h_{t-1} + W_{fx} x_t + b_f)$
$o_t = \sigma(W_{oh} h_{t-1} + W_{ox} x_t + b_o)$
$g_t = \tanh(W_{gh} h_{t-1} + W_{gx} x_t + b_g)$
$c_t = i_t \odot g_t + f_t \odot c_{t-1}$
$h_t = o_t \odot \tanh(c_t)$
• A memory (vector) $c_t$ to store values of the previous state
• Input gate $i_t$, output gate $o_t$, and forget gate $f_t$ to control the memory
• Gate: element-wise product with a vector of values in [0, 1]
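A minimal sketch of a single LSTM step implementing the gate equations above; the weights are random and all sizes are illustrative assumptions.

```python
# Sketch: one LSTM step (input, forget, output gates, memory, hidden state).
import numpy as np

rng = np.random.default_rng(0)
dim_x, dim_h = 8, 16
def init(shape): return rng.normal(scale=0.1, size=shape)
W = {g: (init((dim_h, dim_h)), init((dim_h, dim_x)), np.zeros(dim_h))
     for g in ("i", "f", "o", "g")}

def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))

def lstm_step(h_prev, c_prev, x):
    def gate(name, act):
        Wh, Wx, b = W[name]
        return act(Wh @ h_prev + Wx @ x + b)
    i = gate("i", sigmoid)            # input gate
    f = gate("f", sigmoid)            # forget gate
    o = gate("o", sigmoid)            # output gate
    g = gate("g", np.tanh)            # candidate memory
    c = i * g + f * c_prev            # memory update c_t = i*g + f*c_{t-1}
    h = o * np.tanh(c)                # hidden state h_t = o * tanh(c_t)
    return h, c

h, c = np.zeros(dim_h), np.zeros(dim_h)
for x in rng.normal(size=(5, dim_x)):   # a toy sequence of 5 input vectors
    h, c = lstm_step(h, c, x)
print(h[:4])
```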
Gated Recurrent Unit (GRU)(Cho et al., 2014)
$r_t = \sigma(W_{rh} h_{t-1} + W_{rx} x_t + b_r)$
$z_t = \sigma(W_{zh} h_{t-1} + W_{zx} x_t + b_z)$
$g_t = \tanh(W_{gh} (r_t \odot h_{t-1}) + W_{gx} x_t + b_g)$
$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot g_t$
• A memory (vector) to store values of the previous state
• Reset gate $r_t$ and update gate $z_t$ to control the memory
Recurrent Neural Network Language Model
$h_t = \tanh(W_{hh} h_{t-1} + W_{hx} x_t + b_h)$
$p_t = P(x_t \mid x_1, \ldots, x_{t-1}) = \mathrm{softmax}(W h_{t-1} + b)$
Objective of learning: $\frac{1}{T} \sum_{t=1}^{T} \log \hat{p}_t$
• Input one sequence and output another
• In training, input sequence is same as output sequence
Convolutional Neural Network
Convolutional Neural Network
• Motivation: representing sequence of words and utilizing the representation in deep learning methods
• Input: sequence of word embeddings, denoting sequence of words (e.g., sentence)
• Output: representation of input sequence
• Learning of model: stochastic gradient descent
• Advantage: robust extraction of n-gram features; can be used as part of deep model for sequence processing (e.g., sentence classification)
Convolutional Neural Network (CNN) (Kim 2014, Blunsom et al. 2014, Hu et al., 2014)
Example: "the cat sat on the mat"
• Convolution: filters slide over windows of consecutive words ("the cat", "the cat sat", "cat sat", "cat sat on", "sat on", "sat on the", "on the", "on the mat", "the mat", ...), producing feature maps
• Max pooling: selects the strongest features from the feature maps
• Concatenation: the pooled features are concatenated into the representation of the sentence
• Shared parameters on same level
• Fixed length, zero padding
Example: Image Convolution
[Figure: a binary image scanned by a 3x3 filter; dark pixel value = 1, light pixel value = 0; dot in filter = 1, others = 0]
(Figure credit: Leow Wee Kheng)
Example: Image Convolution
0 0 0 0 0
0 0 1 1 0
0 1 3 2 0
0 1 3 1 0
0 1 1 0 0
Feature Map
• Scanning image with filter having 3*3 cells, among them 3 are dot cells
• Counting number of dark pixels overlapping with dot cells at each position
• Creating feature map (matrix); each element represents similarity between filter pattern and pixel pattern at one position
• Equivalent to extracting feature using the filter
• Translation-invariant
Convolution Operation
Convolution: filter, feature map, neuron
$z_i^{(l,f)} = \sigma\left( w^{(l,f)} \cdot z_i^{(l-1)} + b^{(l,f)} \right), \quad f = 1, 2, \ldots, F_l$
where $z_i^{(l,f)}$ is the output of the neuron of type $f$ for location $i$ in layer $l$; $w^{(l,f)}$ and $b^{(l,f)}$ are the parameters of neurons of type $f$ in layer $l$; $\sigma$ is the sigmoid function; $z_i^{(l-1)}$ is the input to the neuron at location $i$ from layer $l-1$; and $z_i^{(0)} = [x_i^{T}, x_{i+1}^{T}, \ldots, x_{i+h-1}^{T}]^{T}$ is the concatenation of word vectors at location $i$
Convolution
Equivalent to n-gram feature extraction at each position
Max Pooling
$z_i^{(l,f)} = \max\left( z_{2i-1}^{(l-1,f)},\; z_{2i}^{(l-1,f)} \right)$
where $z_i^{(l,f)}$ is the output of pooling of type $f$ for location $i$ in layer $l$, and $z_{2i-1}^{(l-1,f)}, z_{2i}^{(l-1,f)}$ are its inputs
Max Pooling
Equivalent to n-gram feature selection
Sentence Classification Using Convolutional Neural Network
$z^{(L)} = \mathrm{CNN}(x)$
$y = f(x) = \mathrm{softmax}(W z^{(L)} + b)$
[Figure: word embeddings $x$ are passed through convolution, max pooling, and concatenation to obtain $z^{(L)}$, followed by a softmax layer producing the label $y$]
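A minimal sketch of a one-layer CNN sentence classifier in the spirit of this slide: concatenated word embeddings, a window-3 convolution, max pooling over positions, and a softmax output; the weights are random and all sizes are assumptions.

```python
# Sketch: forward pass of a tiny CNN text classifier, y = softmax(W z + b).
import numpy as np

rng = np.random.default_rng(0)
vocab = {w: i for i, w in enumerate("the cat sat on mat dog ran".split())}
dim, n_filters, win, n_classes = 8, 6, 3, 2
E = rng.normal(scale=0.1, size=(len(vocab), dim))
W_conv = rng.normal(scale=0.1, size=(n_filters, win * dim))   # one row per filter
b_conv = np.zeros(n_filters)
W_out = rng.normal(scale=0.1, size=(n_classes, n_filters))
b_out = np.zeros(n_classes)

def forward(sentence):
    x = np.array([E[vocab[w]] for w in sentence.split()])
    # convolution: apply each filter to every window of `win` consecutive words
    windows = [x[i:i + win].reshape(-1) for i in range(len(x) - win + 1)]
    feats = np.array([np.tanh(W_conv @ w + b_conv) for w in windows])
    z = feats.max(axis=0)                        # max pooling over positions
    logits = W_out @ z + b_out
    return np.exp(logits) / np.exp(logits).sum()  # softmax output

print(forward("the cat sat on the mat"))
```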
3-3. Methods of Deep Learning for Information Retrieval
Retrieval based Question Answering
Hu et al. 2014
Ji et al. 2014
Retrieval-based Question Answering
Q: What is the population of Hong Kong?
A: It is 7.18 million as in 2013.
Q: How many people are there in Hong Kong?
A: There are about 7 million.
Question Answering System
Q: Do you know Hong Kong's population?
A: There are about 7 million.
Learning System
Retrieval based Question Answering System
Index of Questions and
Answers
Matching Ranking
Question
Retrieval
RetrievedQuestions and Answers
RankedAnswers
MatchingModels
RankingModel
Online
Offline
Best Answer
MatchedAnswers
Deep Match CNN- Architecture I
MLP
…… ……
• First represent two sentences as vectors, and then match the vectors
Sentence X Sentence Y
Deep Match CNN - Architecture II
• Represent and match two sentences simultaneously
• Two dimensional model
MLP
Matching Degree
…
2D Convolution
More 2D Convolution & Pooling
Max-Pooling
1D Convolution
Sentence X
Sentence Y
Generation based Question Answering
Shang et al. 2015
Generation-based Question Answering
Q: What is the population of Hong Kong?
A: It is 7.18 million as in 2013.
Q: How many people are there in Hong Kong?
A: There are about 7 million.
Question Answering System
Q: Do you know Hong Kong's population?
A: It is 7 million
Learning System
Neural Responding Machine
• Encoding questions to internal representations
• Decoding internal representations to answers
• Using GRU
Question: $\mathbf{x} = (x_1, x_2, \ldots, x_T)$
Answer: $\mathbf{y} = (y_1, y_2, \ldots, y_{T'})$
[Figure: the Encoder reads the question into hidden states $h$; the Context Generator produces context vectors $c_1, \ldots, c_t, \ldots, c_{T'}$; the Decoder produces states $s_1, \ldots, s_{T'}$ and outputs $y_1, \ldots, y_{T'}$]
Decoder:
$P(y_t \mid y_1, \ldots, y_{t-1}, \mathbf{x}) = g(y_{t-1}, s_t, c_t)$
$s_t = f(y_{t-1}, s_{t-1}, c_t)$
where $y_t$ is a one-hot vector, $s_t$ is the hidden state of the decoder, $c_t$ is the context vector, $g(\cdot)$ is softmax, and $f(\cdot)$ is GRU
Similar to attention mechanism in RNN Encoder-Decoder
Encoder
[Figure: a Local Encoder and a Global Encoder run over the inputs $x_1, \ldots, x_T$, producing hidden states $h_1, \ldots, h_T$; the context vector $c_t$ is computed from them and the previous decoder state $s_{t-1}$]
$c_t = \sum_{j=1}^{T} \alpha_{tj} h_j$, $\quad \alpha_{tj} \propto q(h_j, s_{t-1})$, $\quad h_j = [h_j^{l} : h_j^{g}]$
where $c_t$ is the context vector, $\alpha_{tj}$ is the attention weight, and $h_j$ is the concatenation of local and global hidden states
Combination of global and local encoders
$h_t = f(x_t, h_{t-1})$
where $x_t$ is the word embedding, $h_t$ is the hidden state of the encoder, and $f(\cdot)$ is GRU
Question Answering from Relational Database
Yin et al. 2016
Question Answering from Relational Database
Relational Database
Q: How many people participated in the game in Beijing?
A: 4,200
SQL: select #_participants, where city=beijing
Q: When was the latest game hosted?
A: 2012
SQL: argmax(city, year)
Question Answering System
Q: Which city hosted the longest Olympic game before the game in Beijing?
A: Athens
Learning System
year city #_days #_medals
2000 Sydney 20 2,000
2004 Athens 35 1,500
2008 Beijing 30 2,500
2012 London 40 2,300
Neural Enquirer
• Query Encoder: encoding query
• Table Encoder: encoding entries in table
• Five Executors: executing query against table
Conducting matching between question and database entries multiple times
Query Encoder and Table Encoder
• Creating query embedding using RNN
• Creating table embedding for each entry using DNN
RNN
…
Query Representation
Query
DNN
Entry Representation
Field Value
Table Representation
Query Encoder Table Encoder
Executors
• Five layers; except the last layer, each layer has a reader, an annotator, and a memory
• Reader fetches important representation for each row, e.g., city=beijing
• Annotator encodes result representation for each row, e.g., row where city=beijing
Select #_participants where city = beijing
Question Answering from Knowledge Graph
Yin et al. 2016
Question Answering from Knowledge Graph
(Yao-Ming, spouse, Ye-Li)
(Yao-Ming, born, Shanghai)
(Yao-Ming, height, 2.29m)
... ...
(Ludwig van Beethoven, place of birth, Germany)
... ...
Knowledge Graph
Q: How tall is Yao Ming?
A: He is 2.29m tall and is visible from space. (Yao Ming, height, 2.29m)
Q: Which country was Beethoven from?
A: He was born in what is now Germany. (Ludwig van Beethoven, place of birth, Germany)
Question Answering System
Q: How tall is Liu Xiang? A: He is 1.89m tall
Learning System
Answer is generated
GenQA
• Interpreter: creates representation of question using RNN
• Enquirer: retrieves top k triples with highest matching scores using CNN model
• Generator: generates answer based on question and retrieved triples using attention-based RNN
• Attention model: controls generation of answer
Short Term Memory
Long Term Memory
(Knowledge Base)
How tall is Yao Ming?
Interpreter
Enquirer
Generator
He is 2.29m tall
Attention Model
Key idea:
• Generation of answer based on question and retrieved result
• Combination of neural processing and symbolic processing
Enquirer: Retrieval and Matching
• Retaining both symbolic representations and vector representations
• Using question words to retrieve top k triples
• Calculating matching scores between question and triples using CNN model
• Finding best matched triples
(how, tall, is, liu, xiang)
< liu xiang, height, 1.90m>
< yao ming, height, 2.26m>… …
<liu xiang, birth place, shanghai>
Retrieved Top k Triplesand Embeddings
Question and Embedding Matching
Generator: Answer Generation
• Generating answer using attention mechanism
• At each position, a variable decides whether to generate a word or use the object of the top triple
[Figure: given the question "How tall is Yao Ming?" and the retrieved triple <yao ming, height, 2.29m>, the decoder produces "He is 2.29m tall"; at position 3, the latent variable $z_3$ selects between generating a common word ($z_3 = 0$) and copying the object of the triple ($z_3 = 1$)]
Image Retrieval
Ma et al. 2015
Image Retrieval
a lady in a car
a man holds a cell phone
two ladies are chatting
Learning System
Retrieval System
Having dinner with friends in restaurant
Image Index
Multimodal CNN
• Represent text and image as vectors and then match the two vectors
• Word-level matching, phrase-level matching, sentence-level matching
• CNN model works better than RNN models (state of the art) for text
A dog is catching a ball
CNN CNN
MLP
Sentence-level Matching
• Combining image vector and sentence vector
……
CNN
MLP
a dog is catching a ball
Word-level Matching Model
• Adding image vector to word vectors
……
a dog is catching a ball
CNN
MLP
Summary
Summary
• Learning to Rank
– Approaches: pointwise, pairwise, listwise approaches
– Methods: Ranking SVM, IR SVM, AdaRank
• Learning to Match
– Semantic Matching
– Methods: RLSI, RMLS
• Deep Learning for Information Retrieval
– Deep learning is effective for hard IR problems
– Basic tools: word embedding, CNN, RNN
– Methods: Deep Match CNN, NRM, GenQA, Neural Enquirer, Multimodal CNN
References
• Jun Yin, Xin Jiang, Zhengdong Lu, Lifeng Shang, Hang Li, Xiaoming Li. Neural Generative Question Answering. Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI’16), 2972-2978, 2016.
• Pengcheng Yin, Zhengdong Lu, Hang Li, Ben Kao. Neural Enquirer: Learning to Query Tables with Natural Language. Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI’16), 2308-2314, 2016.
• Lin Ma, Zhengdong Lu, Lifeng Shang, Hang Li, Multimodal Convolutional Neural Networks for Matching Image and Sentence. Proceedings of 2015 IEEE International Conference on Computer Vision (ICCV), 2623-2631, 2015.
• Lifeng Shang, Zhengdong Lu, Hang Li. Neural Responding Machine for Short Text Conversation. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (ACL-IJCNLP'15), 2015.
• Baotian Hu, Zhengdong Lu, Hang Li, Qingcai Chen. Convolutional Neural Network Architectures for Matching Natural Language Sentences. Proceedings of Advances in Neural Information Processing Systems 27 (NIPS'14), 2042-2050, 2014.
• Hang Li, Jun Xu. Semantic Matching in Search. Foundations and Trends in Information Retrieval, Now Publishers, 2014.
• Wei Wu, Zhengdong Lu, Hang Li. Learning Bilinear Model for Matching Queries and Documents. Journal of Machine Learning Research (JMLR), 14: 2519-2548, 2013.
References
• Quan Wang, Jun Xu, Hang Li, Nick Craswell. Regularized Latent Semantic Indexing: A New Approach to Large Scale Topic Modeling. ACM Transactions on Information Systems (TOIS), 31(1): 5, 2013.
• Hang Li. A Short Introduction to Learning to Rank. IEICE Transactions on Information and Systems, E94-(10), 2011.
• Quan Wang, Jun Xu, Hang Li, Nick Craswell. Regularized Latent Semantic Indexing. Proceedings of the 34th Annual International ACM SIGIR Conference (SIGIR’11), 685-694, 2011.
• Hang Li. Learning to Rank for Information Retrieval and Natural Language Processing. Synthesis Lectures on Human Language Technology, Lecture 12, Morgan & Claypool Publishers, 2011.
• Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, Hang Li. Learning to Rank: From Pairwise Approach to Listwise Approach. Proceedings of the 24th International Conference on Machine Learning (ICML’07), 129-136, 2007.
• Jun Xu and Hang Li. AdaRank: A Boosting Algorithm for Information Retrieval. Proceedings of the 30th Annual International ACM SIGIR Conference (SIGIR’07), 391-398, 2007.
• Yunbo Cao, Jun Xu, Tie-Yan Liu, Hang Li, Yalou Huang, Hsiao-Wuen Hon. Adapting Ranking SVM to Document Retrieval. Proceedings of the 29th Annual International ACM SIGIR Conference (SIGIR’06), 186-193, 2006.
Thank you!