Information Retrieval as Structured Prediction (University of Massachusetts Amherst Machine Learning Seminar)
Information Retrieval as Structured Prediction
University of Massachusetts Amherst Machine Learning Seminar, April 29th, 2009
Yisong Yue, Cornell University
Joint work with: Thorsten Joachims, Filip Radlinski, and Thomas Finley
Supervised Learning
• Find a function from input space X to output space Y such that the prediction error is low.

• Examples:
  – x: "Microsoft announced today that they acquired Apple for the amount equal to the gross national product of Switzerland. Microsoft officials stated that they first wanted to buy Switzerland, but eventually were turned off by the mountains and the snowy winters…"  →  y = 1
  – x: "GATACAACCTATCCCCGTATATATATTCTATGGGTATAGTATTAAATCAATACAACCTATCCCCGTATATATATTCTATGGGTATAGTATTAAATCAATACAACCTATCCCCGTATATATATTCTATGGGTATAGTATTAAATCAGATACAACCTATCCCCGTATATATATTCTATGGGTATAGTATTAAATCACATTTA"  →  y = −1
  – x: …  →  y = 7.3
Examples of Complex Output Spaces
• Natural Language Parsing
  – Given a sequence of words x, predict the parse tree y.
  – Dependencies from structural constraints, since y has to be a tree.

  Example: x = "The dog chased the cat"  →  y = parse tree (S → NP VP; NP → Det N; VP → V NP)
Examples of Complex Output Spaces

• Part-of-Speech Tagging
  – Given a sequence of words x, predict a sequence of tags y.
  – Dependencies from tag-tag transitions in the Markov model.
  – Similarly for other sequence labeling problems, e.g., RNA Intron/Exon Tagging.

  Example: x = "The rain wet the cat"  →  y = Det N V Det N
Examples of Complex Output Spaces
• Information Retrieval
  – Given a query x, predict a ranking y.
  – Dependencies between results (e.g., avoid redundant hits)
  – Loss function over rankings (e.g., Average Precision)

  Example: x = "SVM"  →  y =
    1. Kernel-Machines
    2. SVM-Light
    3. Learning with Kernels
    4. SV Meppen Fan Club
    5. Service Master & Co.
    6. School of Volunteer Management
    7. SV Mattersburg Online
    …
Goals of this Talk
• Learning to Rank
  – Optimizing Average Precision
  – Diversified Retrieval

• Structured Prediction
  – Complex Retrieval Goals
  – Structural SVMs (Supervised Learning)
Learning to Rank
• At the intersection of ML and IR
• First generate input features, e.g., for query q and document d:

  x_{q,d} = ( q appears in title of d,
              q appears in first paragraph of d,
              q appears in anchor text linking to d,
              cos(q, d),
              pagerank(d),
              … )
Learning to Rank
• Design a retrieval function f(x) = w^T x
  – (weighted average of features)

• For each query q
  – Score all s_{q,d} = w^T x_{q,d}
  – Sort by s_{q,d} to produce the ranking
• Which weight vector w is best?
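The score-and-sort procedure above can be sketched as follows; the weights and feature values are made-up illustrations, not from the talk.

```python
# Sketch of a linear retrieval function f(x) = w^T x: score every
# candidate document for a query, then sort by score to produce a ranking.
# Weights and feature vectors below are hypothetical.

def rank_documents(w, docs):
    """docs maps document id -> feature vector x_{q,d}."""
    scores = {d: sum(wi * xi for wi, xi in zip(w, x)) for d, x in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)

w = [0.7, 0.3]                                  # learned weight vector (illustrative)
docs = {"d1": [0.2, 0.9], "d2": [0.8, 0.4], "d3": [0.5, 0.5]}
ranking = rank_documents(w, docs)               # best-scoring document first
```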
Conventional SVMs
• Input: x (high dimensional point)
• Target: y (either +1 or -1)
• Prediction: sign(wTx)
• Training:
  w = argmin_w (1/2)||w||^2 + (C/N) Σ_i ξ_i
  subject to: ∀i: y_i (w^T x_i) ≥ 1 − ξ_i and ξ_i ≥ 0

• The sum of slacks Σ_i ξ_i is a smooth upper bound on the 0/1 loss.
Optimizing Pairwise Agreements

• Example (from figure): a ranking with 2 pairwise disagreements
Pairwise Preferences SVM
  w = argmin_w (1/2)||w||^2 + (C/N) Σ_{i,j} ξ_{i,j}

Such that:
  ∀(i, j) with y_i > y_j:  w^T x_i ≥ w^T x_j + 1 − ξ_{i,j}
  ∀(i, j):  ξ_{i,j} ≥ 0
Large Margin Ordinal Regression [Herbrich et al., 1999]
Can be reduced to O(n log n) time [Joachims, 2005]
Pairs can be reweighted to more closely model IR goals [Cao et al., 2006]
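Counting the pairwise disagreements this formulation penalizes can be sketched as below; the relevance labels and scores are made up.

```python
# Count pairwise disagreements: pairs (i, j) where document i is labeled
# more relevant than document j, but the model scores them in the
# opposite (or tied) order. Labels and scores are illustrative.

def pairwise_disagreements(labels, scores):
    n = len(labels)
    return sum(
        1
        for i in range(n)
        for j in range(n)
        if labels[i] > labels[j] and scores[i] <= scores[j]
    )

labels = [1, 0, 1, 0]            # relevance labels
scores = [0.9, 0.8, 0.3, 0.5]    # model scores w^T x
count = pairwise_disagreements(labels, scores)
```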
Mean Average Precision
• Consider the rank position of each relevant doc: K1, K2, … KR

• Compute Precision@K for each K1, K2, … KR

• Average Precision = average of P@K

• Ex: a ranking with relevant documents at positions 1, 3, and 5 has
  AvgPrec = (1/1 + 2/3 + 3/5) / 3 ≈ 0.76

• MAP is Average Precision averaged across multiple queries/rankings
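Average Precision as defined above can be computed directly; a ranking with relevant documents at ranks 1, 3, and 5 reproduces the ≈ 0.76 example.

```python
# Average Precision: mean of Precision@K taken at the rank K of each
# relevant document in the ranking.

def average_precision(relevance):
    """relevance: list of 0/1 relevance judgments in ranked order."""
    hits, precisions = 0, []
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)   # Precision@K at this relevant doc
    return sum(precisions) / len(precisions)

ap = average_precision([1, 0, 1, 0, 1])   # relevant at ranks 1, 3, 5
```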
Optimization Challenges
• Rank-based measures are multivariate
  – Cannot decompose (additively) into document pairs
  – Need to exploit other structure

• Defined over rankings
  – Rankings do not vary smoothly
  – Discontinuous w.r.t. model parameters
  – Need some kind of relaxation/approximation
[Yue & Burges, 2007]
Structured Prediction
• Let x be a structured input (candidate documents)
• Let y be a structured output (ranking)

• Use a joint feature map Ψ(x, y) ∈ R^F to encode the compatibility of predicting y for a given x.
  – Captures all the structure of the prediction problem

• Consider linear models: after learning w, we can make predictions via
  ŷ = argmax_y w^T Ψ(x, y)
Linear Discriminant for Ranking
• Let x = (x1,…xn) denote candidate documents (features)
• Let y_jk ∈ {+1, −1} encode pairwise rank orders

• Feature map is a linear combination of documents:
  Ψ(x, y) = Σ_{j: rel} Σ_{k: !rel} y_jk (x_j − x_k)

• Prediction made by sorting on document scores w^T x_i:
  ŷ = argmax_y w^T Ψ(x, y)
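The pairwise joint feature map Ψ(x, y) = Σ_{j,k} y_jk (x_j − x_k) can be sketched as below; the document ids and feature vectors are illustrative.

```python
# Pairwise joint feature map for ranking:
#   Psi(x, y) = sum over relevant j and non-relevant k of y_jk * (x_j - x_k),
# where y_jk = +1 if j is ranked above k, else -1.
# Documents and features below are made up.

def joint_feature_map(rank_pos, rel, nonrel, x):
    """rank_pos: doc -> position; rel/nonrel: doc id lists; x: doc -> feature vec."""
    dim = len(next(iter(x.values())))
    psi = [0.0] * dim
    for j in rel:
        for k in nonrel:
            y_jk = 1 if rank_pos[j] < rank_pos[k] else -1
            for t in range(dim):
                psi[t] += y_jk * (x[j][t] - x[k][t])
    return psi

x = {"a": [1.0, 0.0], "b": [0.5, 0.5], "c": [0.0, 1.0]}
rank_pos = {"a": 0, "c": 1, "b": 2}   # predicted ranking: a, c, b
psi = joint_feature_map(rank_pos, rel=["a", "b"], nonrel=["c"], x=x)
```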
Structural SVM
• Let x denote a structured input (candidate documents)
• Let y denote a structured output (ranking)

• Standard objective function:
  min_w (1/2)||w||^2 + (C/N) Σ_i ξ_i

• Constraints are defined for each incorrect labeling y' over the set of documents x^(i):
  ∀i, ∀y' ≠ y^(i):  w^T Ψ(x^(i), y^(i)) ≥ w^T Ψ(x^(i), y') + Δ(y') − ξ_i
[Tsochantaridis et al., 2005]
Minimizing Hinge Loss
Suppose for an incorrect y':
  w^T Ψ(x^(i), y^(i)) − w^T Ψ(x^(i), y') = 0.75

Then the constraint
  w^T Ψ(x^(i), y^(i)) ≥ w^T Ψ(x^(i), y') + Δ(y') − ξ_i
forces ξ_i ≥ Δ(y') − 0.75, so minimizing
  (1/2)||w||^2 + (C/N) Σ_i ξ_i
minimizes an upper bound (hinge loss) on the training loss.

[Tsochantaridis et al., 2005]
Structural SVM for MAP
• Minimize
  (1/2)||w||^2 + (C/N) Σ_i ξ_i

  subject to
  ∀i, ∀y' ≠ y^(i):  w^T Ψ(x^(i), y^(i)) ≥ w^T Ψ(x^(i), y') + Δ(y') − ξ_i

  where ( y_jk ∈ {−1, +1} )
  Ψ(x^(i), y) = Σ_{j: rel} Σ_{k: !rel} y_jk (x_j^(i) − x_k^(i))

  and
  Δ(y') = 1 − AvgPrec(y')

• The sum of slacks is a smooth upper bound on MAP loss.
[Yue et al., SIGIR 2007]
Too Many Constraints!
• For Average Precision, the true labeling is a ranking where the relevant documents are all ranked in the front, e.g.,
• An incorrect labeling would be any other ranking, e.g.,
• This ranking has Average Precision of about 0.8, with Δ(y') ≈ 0.2
• Intractable number of rankings, thus an intractable number of constraints!
Cutting Plane Method
Original SVM Problem• Exponential constraints
• Most are dominated by a small set of “important” constraints
Structural SVM Approach• Repeatedly finds the next most
violated constraint…
• …until set of constraints is a good approximation.
[Tsochantaridis et al., 2005]
![Page 22: Information Retrieval as Structured Prediction University of Massachusetts Amherst Machine Learning Seminar April 29 th, 2009 Yisong Yue Cornell University.](https://reader036.fdocuments.us/reader036/viewer/2022062618/55146660550346284e8b5b5d/html5/thumbnails/22.jpg)
Cutting Plane Method
Original SVM Problem• Exponential constraints
• Most are dominated by a small set of “important” constraints
Structural SVM Approach• Repeatedly finds the next most
violated constraint…
• …until set of constraints is a good approximation.
[Tsochantaridis et al., 2005]
![Page 23: Information Retrieval as Structured Prediction University of Massachusetts Amherst Machine Learning Seminar April 29 th, 2009 Yisong Yue Cornell University.](https://reader036.fdocuments.us/reader036/viewer/2022062618/55146660550346284e8b5b5d/html5/thumbnails/23.jpg)
Cutting Plane Method
Original SVM Problem• Exponential constraints
• Most are dominated by a small set of “important” constraints
Structural SVM Approach• Repeatedly finds the next most
violated constraint…
• …until set of constraints is a good approximation.
[Tsochantaridis et al., 2005]
![Page 24: Information Retrieval as Structured Prediction University of Massachusetts Amherst Machine Learning Seminar April 29 th, 2009 Yisong Yue Cornell University.](https://reader036.fdocuments.us/reader036/viewer/2022062618/55146660550346284e8b5b5d/html5/thumbnails/24.jpg)
Cutting Plane Method
Original SVM Problem• Exponential constraints
• Most are dominated by a small set of “important” constraints
Structural SVM Approach• Repeatedly finds the next most
violated constraint…
• …until set of constraints is a good approximation.
[Tsochantaridis et al., 2005]
Finding Most Violated Constraint
• A constraint is violated when
  w^T Ψ(x, y') − w^T Ψ(x, y) + Δ(y') − ξ_i > 0

• Finding the most violated constraint reduces to
  ŷ = argmax_{y'} w^T Ψ(x, y') + Δ(y')

• Highly related to inference/prediction:
  ŷ = argmax_y w^T Ψ(x, y)
Finding Most Violated Constraint
Observations
• MAP is invariant to the order of documents within a relevance class
  – Swapping two relevant or two non-relevant documents does not change MAP.
• Joint SVM score is optimized by sorting by document score, w^T x_j
• Reduces to finding an interleaving between two sorted lists of documents

  ŷ = argmax_{y'} Δ(y') + Σ_{j: rel} Σ_{k: !rel} y'_jk (w^T x_j − w^T x_k)
[Yue et al., SIGIR 2007]
Finding Most Violated Constraint
• Start with perfect ranking
• Consider swapping adjacent relevant/non-relevant documents
• Find the best feasible ranking of the non-relevant document
• Repeat for next non-relevant document
• Never want to swap past previous non-relevant document
• Repeat until all non-relevant documents have been considered

  ŷ = argmax_{y'} Δ(y') + Σ_{j: rel} Σ_{k: !rel} y'_jk (w^T x_j − w^T x_k)
[Yue et al., SIGIR 2007]
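The greedy interleaving above computes this argmax efficiently. As an illustrative cross-check only, the same loss-augmented argmax can be found by brute-force enumeration on a tiny made-up instance (with Δ(y') = 1 − AvgPrec(y')):

```python
# Brute-force version of the loss-augmented argmax
#   argmax_{y'} Delta(y') + sum_{j rel, k nonrel} y'_jk (w^T x_j - w^T x_k)
# for a tiny example. The talk's interleaving algorithm avoids this
# enumeration; scores below are made up.

from itertools import permutations

def avg_prec(ranking, rel):
    hits, precs = 0, []
    for k, d in enumerate(ranking, start=1):
        if d in rel:
            hits += 1
            precs.append(hits / k)
    return sum(precs) / len(precs)

def most_violated(scores, rel, nonrel):
    def objective(ranking):
        pos = {d: i for i, d in enumerate(ranking)}
        pair_term = sum(
            (1 if pos[j] < pos[k] else -1) * (scores[j] - scores[k])
            for j in rel for k in nonrel
        )
        return (1 - avg_prec(ranking, set(rel))) + pair_term
    return max(permutations(rel + nonrel), key=objective)

scores = {"r1": 2.0, "r2": 0.5, "n1": 1.0}     # w^T x per document
worst = most_violated(scores, rel=["r1", "r2"], nonrel=["n1"])
```

Here the non-relevant document outscores one relevant document, so the most violated constraint comes from an interleaved ranking rather than the perfect one.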
Comparison with other SVM methods
[Chart: Mean Average Precision by dataset (TREC 9 Indri, TREC 10 Indri, TREC 9 Submissions, TREC 10 Submissions, TREC 9 Submissions without best, TREC 10 Submissions without best); y-axis 0.10–0.30; methods: SVM-MAP, SVM-ROC, SVM-ACC, SVM-ACC2, SVM-ACC3, SVM-ACC4]
Structural SVM for MAP
• Treats rankings as structured objects

• Optimizes hinge-loss relaxation of MAP
  – Provably minimizes the empirical risk
  – Performance improvement over conventional SVMs

• Relies on a subroutine to find the most violated constraint
  – Computationally compatible with the linear discriminant
  ŷ = argmax_{y'} w^T Ψ(x, y') + Δ(y')
Structural SVM Recipe
• Joint feature map: Ψ(x, y)
• Inference method: ŷ = argmax_y w^T Ψ(x, y)
• Loss function: Δ(y)
• Loss-augmented inference (most violated constraint):
  ŷ = argmax_{y'} w^T Ψ(x, y') + Δ(y')
Need for Diversity (in IR)
• Ambiguous Queries
  – Users with different information needs issuing the same textual query
  – "Jaguar" or "Apple"
  – At least one relevant result for each information need

• Learning Queries
  – User interested in "a specific detail or entire breadth of knowledge available" [Swaminathan et al., 2008]
  – Results with high information diversity
Example
• Choose K documents with maximal information coverage.
• For K = 3, the optimal set is {D1, D2, D10}
Diversity via Set Cover
• Documents cover information
  – Assume information is partitioned into discrete units.
• Documents overlap in the information covered.

• Selecting K documents with maximal coverage is a set cover problem
  – NP-complete in general
  – Greedy gives a (1-1/e) approximation [Khuller et al., 1997]
Diversity via Subtopics
• Current datasets use manually determined subtopic labels
  – E.g., "Use of robots in the world today"
    • Nanorobots
    • Space mission robots
    • Underwater robots
  – Manual partitioning of the total information
  – Relatively reliable
  – Used as training data
Weighted Word Coverage
• Use words to represent units of information
• More distinct words = more information
• Weight word importance
• Does not depend on human labeling
• Goal: select K documents which collectively cover as many distinct (weighted) words as possible
  – Greedy selection yields a (1-1/e) bound.
  – Need to find a good weighting function (learning problem).
Example
Marginal Benefit:
         D1   D2   D3   Best
Iter 1   12   11   10   D1
Iter 2   --    2    3   D3

Word Benefit:
V1 = 1, V2 = 2, V3 = 3, V4 = 4, V5 = 5

Document Word Counts:
      V1   V2   V3   V4   V5
D1               X    X    X
D2          X         X    X
D3     X    X    X    X
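The greedy selection in the example above can be sketched as follows; the word-to-document assignment is reconstructed to be consistent with the marginal benefits shown (12/11/10 in the first iteration, then 2/3 in the second).

```python
# Greedy weighted word coverage, reproducing the worked example:
# word benefits V1..V5 = 1..5; each iteration picks the document with the
# largest marginal benefit over the words already covered.
# The word-to-document assignment is reconstructed from the table's totals.

benefit = {"V1": 1, "V2": 2, "V3": 3, "V4": 4, "V5": 5}
words = {
    "D1": {"V3", "V4", "V5"},          # total benefit 12
    "D2": {"V2", "V4", "V5"},          # total benefit 11
    "D3": {"V1", "V2", "V3", "V4"},    # total benefit 10
}

def greedy_select(k):
    covered, picked = set(), []
    for _ in range(k):
        gain = {
            d: sum(benefit[v] for v in ws - covered)
            for d, ws in words.items() if d not in picked
        }
        best = max(gain, key=gain.get)  # largest marginal benefit
        picked.append(best)
        covered |= words[best]
    return picked

picked = greedy_select(2)               # iteration 1 picks D1, iteration 2 picks D3
```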
Related Work Comparison
• Essential Pages [Swaminathan et al., 2008]
  – Uses a fixed function of word benefit
  – Depends on word frequency in the candidate set

• Our goals
  – Automatically learn a word benefit function
    • Learn to predict set covers
    • Use training data
    • Minimize subtopic loss
  – No prior ML approach (to our knowledge)
Linear Discriminant
• x = (x1,x2,…,xn) - candidate documents
• y – subset of x
• V(y) – union of words from the documents in y

• Discriminant Function:
  w^T Ψ(x, y) = Σ_{v ∈ V(y)} w^T φ(v, x)

• φ(v, x) – frequency features (e.g., ≥10%, ≥20%, etc.)
• Benefit of covering word v is then w^T φ(v, x)

  ŷ = argmax_y w^T Ψ(x, y)
[Yue & Joachims, ICML 2008]
Linear Discriminant
• Does NOT reward redundancy
  – Benefit of each word only counted once

• Greedy has (1-1/e)-approximation bound

• Linear (joint feature space)
  – Allows for SVM optimization

• (Used more sophisticated discriminant in experiments.)

  w^T Ψ(x, y) = Σ_{v ∈ V(y)} w^T φ(v, x)
[Yue & Joachims, ICML 2008]
Weighted Subtopic Loss
• Example:
  – x1 covers t1
  – x2 covers t1, t2, t3
  – x3 covers t1, t3

• Motivation
  – Higher penalty for not covering popular subtopics

Subtopic   # Docs   Loss
t1         3        1/2
t2         1        1/6
t3         2        1/3
[Yue & Joachims, ICML 2008]
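The loss weights in the table follow from the coverage counts (each subtopic's weight is proportional to how many documents cover it), e.g.:

```python
# Weighted subtopic loss weights for the example above: subtopic t1 is
# covered by 3 documents, t2 by 1, t3 by 2, so weights are 3/6, 1/6, 2/6.
# Missing a popular subtopic therefore costs more.

from fractions import Fraction

covers = {"x1": {"t1"}, "x2": {"t1", "t2", "t3"}, "x3": {"t1", "t3"}}

counts = {}
for topics in covers.values():
    for t in topics:
        counts[t] = counts.get(t, 0) + 1

total = sum(counts.values())                       # 6 document-subtopic pairs
weights = {t: Fraction(c, total) for t, c in counts.items()}
```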
Finding Most Violated Constraint
• Encode each subtopic as an additional “word” to be covered.
• Use greedy algorithm: (1-1/e)-approximation
  Ψ'(x, y) = ( Σ_{v ∈ V1(y)} φ1(v, x), …, Σ_{v ∈ VL(y)} φL(v, x) )

  ŷ = argmax_{y'} w^T Ψ'(x, y')
TREC Experiments
• TREC 6-8 Interactive Track Queries
• Documents labeled into subtopics

• 17 queries used
  – considered only relevant docs
  – decouples relevance problem from diversity problem
• 45 docs/query, 20 subtopics/query, 300 words/doc
• Trained using LOO cross validation
• TREC 6-8 Interactive Track
• Retrieving 5 documents
Can expect further benefit from having more training data.
Learning Set Cover Representations
• Given:
  – Manual partitioning of a space (subtopics)
  – Weighting for how items cover manual partitions (subtopic labels + subtopic loss)
  – Automatic partitioning of the space (words)

• Goal:
  – Weighting for how items cover automatic partitions
  – The (greedy) optimal covering solutions agree
Summary
• Information Retrieval as Structured Prediction
  – Can be used to reason about complex retrieval goals
  – E.g., mean average precision, diversity
  – Software & papers available at www.yisongyue.com

• Beyond Supervised Learning
  – Training data expensive to gather
  – Current work on interactive learning (with users)
• Thanks to Andrew McCallum, David Mimno & Marc Cartwright
• Work supported by NSF IIS-0713483, Microsoft Graduate Fellowship, and Yahoo! KTC Grant.
Extra Slides
Structural SVM Training

• STEP 1: Solve the SVM objective function using only the current working set of constraints.
• STEP 2: Using the model learned in STEP 1, find the most violated constraint from the global set of constraints.
• STEP 3: If the constraint returned in STEP 2 is violated by more than epsilon, add it to the working set.
• Repeat STEP 1-3 until no additional constraints are added. Return the most recent model that was trained in STEP 1.
STEP 1-3 is guaranteed to loop for at most O(1/epsilon^2) iterations. [Tsochantaridis et al., 2005]
*This is known as a “cutting plane” method.
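A runnable toy version of STEPs 1-3 is sketched below. To keep it self-contained, the QP solve of STEP 1 is replaced with a simple additive (perceptron-style) update and STEP 2 uses brute-force enumeration, so this illustrates the control flow only, not the actual solver; the 1-D data is made up.

```python
# Toy sketch of the cutting-plane loop (STEPs 1-3) on a tiny 1-D ranking
# problem with MAP loss. STEP 1's QP solve is replaced by an additive
# update; STEP 2's argmax is brute force. Illustrative only.

from itertools import permutations

rel, nonrel = ["r1", "r2"], ["n1"]
x = {"r1": 2.0, "r2": 1.5, "n1": 1.0}     # 1-D document features (made up)

def avg_prec(ranking):
    hits, precs = 0, []
    for k, d in enumerate(ranking, start=1):
        if d in rel:
            hits += 1
            precs.append(hits / k)
    return sum(precs) / len(precs)

def psi(ranking):                          # pairwise joint feature (1-D)
    pos = {d: i for i, d in enumerate(ranking)}
    return sum((1 if pos[j] < pos[k] else -1) * (x[j] - x[k])
               for j in rel for k in nonrel)

def most_violated(w):                      # STEP 2: loss-augmented argmax
    return max(permutations(rel + nonrel),
               key=lambda r: (1 - avg_prec(r)) + w * psi(r))

w, eps = 0.0, 1e-6
y_true = tuple(rel + nonrel)               # perfect ranking
working_set = []
for _ in range(100):
    y_bad = most_violated(w)
    violation = w * psi(y_bad) + (1 - avg_prec(y_bad)) - w * psi(y_true)
    if violation <= eps:                   # STEP 3: nothing violated by > eps
        break
    working_set.append(y_bad)              # add constraint to working set
    w += psi(y_true) - psi(y_bad)          # stand-in for re-solving the QP
```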
Structural SVM Training
• Suppose we only solve the SVM objective over a small subset of constraints (working set).

• Some constraints from the global set might be violated.
• When finding a violated constraint, only y' is free; everything else is fixed
  – y's and x's fixed from training
  – w and slack variables fixed from solving the SVM objective

• Degree of violation of a constraint is measured by:
  w^T Ψ(x^(i), y') + Δ(y^(i), y') − w^T Ψ(x^(i), y^(i)) − ξ_i

  (constraints: ∀y' ≠ y^(i): w^T Ψ(x^(i), y^(i)) ≥ w^T Ψ(x^(i), y') + Δ(y^(i), y') − ξ_i)
SVM-map
Experiments
• Used TREC 9 & 10 Web Track corpus.
• Features of document/query pairs computed from outputs of existing retrieval functions.
(Indri Retrieval Functions & TREC Submissions)
• Goal is to learn a recombination of outputs which improves mean average precision.
Comparison with Best Base methods
[Chart: Mean Average Precision by dataset (TREC 9 Indri, TREC 10 Indri, TREC 9 Submissions, TREC 10 Submissions, TREC 9 Submissions without best, TREC 10 Submissions without best); y-axis 0.10–0.30; methods: SVM-MAP, Base 1, Base 2, Base 3]
Comparison with other SVM methods
[Chart: Mean Average Precision by dataset (TREC 9 Indri, TREC 10 Indri, TREC 9 Submissions, TREC 10 Submissions, TREC 9 Submissions without best, TREC 10 Submissions without best); y-axis 0.10–0.30; methods: SVM-MAP, SVM-ROC, SVM-ACC, SVM-ACC2, SVM-ACC3, SVM-ACC4]
SVM-div
[Screenshot: search results for the query "Jaguar", retrieved 11/27/2007, showing the top and bottom of the first results page; result #18 marked]
More Sophisticated Discriminant
• Documents "cover" words to different degrees: a document with 5 copies of "Microsoft" might cover it better than another document with only 2 copies.
• Use multiple word sets V1(y), V2(y), …, VL(y).
• Each Vi(y) contains only the words satisfying certain importance criteria.
[Yue & Joachims, ICML 2008]
More Sophisticated Discriminant
The joint feature map stacks one coverage feature vector per importance level:

$\Phi(\mathbf{x}, \mathbf{y}) = \left( \sum_{v \in V_1(\mathbf{y})} \phi_1(v, \mathbf{x}), \;\ldots,\; \sum_{v \in V_L(\mathbf{y})} \phi_L(v, \mathbf{x}) \right)$

$\hat{\mathbf{y}} = \underset{\mathbf{y}}{\operatorname{argmax}} \; \mathbf{w}^T \Phi(\mathbf{x}, \mathbf{y})$

• Separate $\phi_i$ for each importance level $i$.
• The joint feature map is the vector composition of all the $\phi_i$.
• Greedy inference has a $(1 - 1/e)$-approximation bound.
• Still uses a linear feature space.
[Yue & Joachims, ICML 2008]
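The greedy inference step can be sketched as weighted word coverage: repeatedly add the document with the largest marginal gain across importance levels. This is an illustrative sketch, not the authors' code; the TF thresholds and per-level weights are stand-ins for what the full method learns.

```python
def word_sets(doc_tf, thresholds):
    """V_i(doc): words whose term frequency meets importance threshold i."""
    return [{w for w, tf in doc_tf.items() if tf >= t} for t in thresholds]


def greedy_select(docs_tf, K, thresholds, weights):
    """Greedily pick K documents maximizing weighted word coverage.

    docs_tf: list of {word: term frequency} dicts, one per candidate document
    thresholds: TF cutoffs defining the importance levels V_1, ..., V_L
    weights: per-level weights (learned in the full method)
    """
    selected = []
    covered = [set() for _ in thresholds]  # words already covered, per level
    for _ in range(min(K, len(docs_tf))):
        best_j, best_gain = None, float("-inf")
        for j, doc in enumerate(docs_tf):
            if j in selected:
                continue
            # Marginal gain: newly covered words at each level, weighted.
            gain = sum(w * len(s - c)
                       for w, s, c in zip(weights, word_sets(doc, thresholds), covered))
            if gain > best_gain:
                best_j, best_gain = j, gain
        selected.append(best_j)
        for c, s in zip(covered, word_sets(docs_tf[best_j], thresholds)):
            c |= s
    return selected
```

Because the objective is monotone submodular, this greedy loop is exactly the kind of inference that carries the $(1 - 1/e)$ guarantee mentioned above.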
Essential Pages
x = (x1, x2, …, xn): the set of candidate documents for a query; y: a subset of x of size K (our prediction).

$\hat{\mathbf{y}} = \underset{\mathbf{y}}{\operatorname{argmax}} \sum_{v} C(v, \mathbf{y}, \mathbf{x})$

$C(v, x_i, \mathbf{x}) = \underbrace{TF(v, x_i)}_{\text{benefit of covering word } v \text{ with } x_i} \cdot \underbrace{r(v, \mathbf{x}) \log_2 \tfrac{1}{r(v, \mathbf{x})}}_{\text{importance of covering word } v}$

where r(v, x) is the fraction of candidate documents containing v.

[Swaminathan et al., 2008]

Intuition: frequent words cannot encode information diversity; infrequent words do not provide significant information.
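A minimal sketch of this benefit function, treating documents as word lists (details such as the exact TF normalization are assumptions, not taken from the paper):

```python
import math


def word_frequency(v, docs):
    """r(v, x): fraction of candidate documents containing word v."""
    return sum(1 for d in docs if v in d) / len(docs)


def importance(v, docs):
    """r * log2(1/r): peaks for words of intermediate frequency,
    matching the intuition that very frequent and very rare words
    contribute little to diversity."""
    r = word_frequency(v, docs)
    return r * math.log2(1.0 / r) if 0.0 < r < 1.0 else 0.0


def coverage_benefit(v, doc, docs):
    """C(v, x_i, x) = TF(v, x_i) * importance(v, x)."""
    tf = doc.count(v) / len(doc) if doc else 0.0  # normalized term frequency
    return tf * importance(v, docs)
```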
Approximate Constraint Generation
• Theoretical guarantees no longer hold: we might not find an epsilon-close approximation to the feasible region boundary.
• Performs well in practice.
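The idea can be illustrated with a deliberately simplified training loop: the exact most-violated-constraint search is replaced by a pluggable, possibly approximate oracle, and here the cutting-plane machinery is collapsed into plain subgradient updates. This is a toy stand-in for the actual SVM-struct solver, not the talk's implementation.

```python
def train_approximate(examples, phi, approx_oracle, epochs=10, lr=0.1):
    """Structured learning with an approximate separation oracle (sketch).

    approx_oracle(w, x, y) stands in for the exact loss-augmented argmax;
    when it is only approximate (e.g., greedy), the epsilon-approximation
    guarantee of cutting-plane training is lost, but learning still proceeds.
    Takes a subgradient step whenever the oracle finds a violating output.
    """
    dim = len(phi(*examples[0]))
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in examples:
            y_hat = approx_oracle(w, x, y)  # possibly approximate search
            if y_hat != y:
                f_true, f_hat = phi(x, y), phi(x, y_hat)
                # Move w toward the true output's features.
                w = [wi + lr * (a - b) for wi, a, b in zip(w, f_true, f_hat)]
    return w
```

In the diversified-retrieval setting of this talk, the oracle role would be played by the greedy subset-selection routine; the sketch above just makes the control flow explicit.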
TREC Experiments
• 12/4/1 train/validation/test split (approx. 500 documents in the training set)
• Permuted until all 17 queries were tested once
• Set K = 5 (some queries have very few documents)
• SVM-div: uses term-frequency thresholds to define importance levels
• SVM-div2: additionally uses TF-IDF thresholds
TREC Interactive Track Results
| Method | Loss |
| --- | --- |
| Random | 0.469 |
| Okapi | 0.472 |
| Unweighted Model | 0.471 |
| Essential Pages | 0.434 |
| SVM-div | 0.349 |
| SVM-div2 | 0.382 |

| Methods | W / T / L |
| --- | --- |
| SVM-div vs. Ess. Pages | 14 / 0 / 3 ** |
| SVM-div2 vs. Ess. Pages | 13 / 0 / 4 |
| SVM-div vs. SVM-div2 | 9 / 6 / 2 |
Approximate constraint generation appears to perform well in practice.
Synthetic Dataset
• The TREC dataset is very small, so we built a synthetic dataset that lets us vary the retrieval size K.
• 100 queries; 100 documents/query, 25 subtopics/query, 300 words/document
• 15/10/75 train/validation/test split
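A query of this shape can be generated and scored as follows; this is a guess at a plausible generator with the slide's sizes as defaults, not the authors' actual data-generation code, and the loss shown is simple uncovered-subtopic fraction.

```python
import random


def make_query(n_docs=100, n_subtopics=25, max_per_doc=5, seed=0):
    """One synthetic query: each document covers 1..max_per_doc subtopics."""
    rng = random.Random(seed)
    return [frozenset(rng.sample(range(n_subtopics), rng.randint(1, max_per_doc)))
            for _ in range(n_docs)]


def subtopic_loss(selected, docs, n_subtopics):
    """Fraction of the query's subtopics left uncovered by the K selected docs."""
    covered = set()
    for j in selected:
        covered |= docs[j]
    return 1.0 - len(covered) / n_subtopics
```

Varying K simply means scoring `subtopic_loss` for selections of different sizes on the same generated queries.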
Consistently outperforms Essential Pages