Introduction to Machine Learning for Information Retrieval
Xiaolong Wang
What is Machine Learning
• In short: tricks of maths
• Two major tasks:
  – Supervised Learning: a.k.a. regression, classification, …
  – Unsupervised Learning: a.k.a. data manipulation, clustering, …
Supervised Learning
• Label: usually manually labeled
• Data: the data representation, usually a vector
• Prediction function: selecting, from a predefined family of functions, the one with the best prediction

[Figure: examples of classification and regression]
Supervised Learning
• Two formulations:
  – F1: given a set of (Xi, Yi), learn a function f such that Yi ≈ f(Xi)
• Yi:
  – Binary: spam vs. non-spam
  – Numeric: very relevant (5), somewhat relevant (4), marginally relevant (3), somewhat irrelevant (2), very irrelevant (1)
• Xi:
  – Number of words, occurrence of each word, …
• f:
  – Usually a linear function, f(Xi) = wᵀXi (see the sketch below)
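As a concrete illustration of F1 (a sketch of mine, not from the slides): the vocabulary and weights below are hypothetical stand-ins for a learned spam classifier, features are per-word counts, and the prediction is the sign of wᵀX.

```python
# Minimal sketch: emails as word-count vectors X_i, scored by f(X) = w^T X.
# Vocabulary and weights are made up for illustration.
import numpy as np

vocabulary = ["free", "meeting", "viagra", "lunch"]  # hypothetical vocabulary

def featurize(text):
    """Map a document to a vector of per-word counts."""
    words = text.lower().split()
    return np.array([words.count(v) for v in vocabulary], dtype=float)

w = np.array([2.0, -1.0, 3.0, -0.5])  # hypothetical learned weights

X = featurize("free viagra free")
score = w @ X                     # f(X) = w^T X
label = 1 if score > 0 else -1    # binary decision: spam (+1) vs. non-spam (-1)
print(score, label)
```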
Supervised Learning
• Two formulations:
  – F2: given a set of (Xi, Yi), learn a function f(X, Y) such that Yi = argmaxY f(Xi, Y)
    • Yi: a more complex label than binary or numeric
      – Multiclass learning: entertainment vs. sports vs. politics, …
      – Structural learning: syntactic parsing

[Figure: F2 is more general than F1; f scores (X, Y) pairs jointly]
Supervised Learning
• Training
  – Optimization
    • Loss: difference between the true label Yi and the predicted label wᵀXi
      – Squared loss (regression): (Yi − wᵀXi)²
      – Hinge loss (classification): max(0, 1 − Yi · wᵀXi)
      – Logistic loss (classification): log(1 + exp(−Yi · wᵀXi))
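For concreteness, a small sketch (mine, not the slides') of the three losses, for labels y ∈ {−1, +1} and a linear score s = wᵀx:

```python
import numpy as np

def squared_loss(y, s):
    return (y - s) ** 2

def hinge_loss(y, s):
    return np.maximum(0.0, 1.0 - y * s)

def logistic_loss(y, s):
    return np.log1p(np.exp(-y * s))   # log(1 + exp(-y*s))

y, s = 1.0, 0.3   # a mildly confident correct prediction
print(squared_loss(y, s), hinge_loss(y, s), logistic_loss(y, s))
```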
Supervised Learning
• Training
  – Optimization
    • Regularization

[Figure: without regularization, the model overfits the training data]
Supervised Learning
• Training
  – Optimization
    • Regularization

[Figure: large margin corresponds to small ||w||]
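A minimal sketch of how loss and regularization combine, assuming hinge loss with an L2 penalty (i.e., a linear SVM; my choice of example, not the slides'). The term λ‖w‖² keeps ‖w‖ small, which corresponds to a large margin.

```python
import numpy as np

def svm_objective(w, X, y, lam=0.1):
    """Average hinge loss plus L2 penalty; X is (n, d), y in {-1, +1}^n."""
    margins = y * (X @ w)
    hinge = np.maximum(0.0, 1.0 - margins).mean()
    return hinge + lam * np.dot(w, w)   # loss + regularizer
```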
Supervised Learning
• Optimization: the art of minimization
  • Unconstrained:
    – First order: gradient descent
    – Second order: Newton's method
    – Stochastic: stochastic gradient descent (SGD)
  • Constrained:
    – Active set method
    – Interior point method
    – Alternating Direction Method of Multipliers (ADMM)
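As an illustration of SGD (my sketch, using the logistic loss from the earlier slide):

```python
import numpy as np

def sgd_epoch(w, X, y, lr=0.1):
    """One pass of SGD on the logistic loss log(1 + exp(-y * w^T x))."""
    for i in np.random.permutation(len(y)):
        s = X[i] @ w
        # gradient of log(1 + exp(-y*s)) with respect to w
        g = -y[i] * X[i] / (1.0 + np.exp(y[i] * s))
        w = w - lr * g
    return w
```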
Unsupervised Learning
• Clustering: e.g., k-means
• Dimensionality reduction: e.g., PCA
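A short sketch of both tools via scikit-learn on random data (illustration only; the data and parameters are made up):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X = np.random.rand(100, 20)          # 100 documents, 20 features

clusters = KMeans(n_clusters=3, n_init=10).fit_predict(X)  # group documents
X_2d = PCA(n_components=2).fit_transform(X)                # project to 2-D
```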
Machine Learning for Information Retrieval
• Learning to Rank
• Topic Modeling
Learning to Rank
Reference: Hang Li's ACL-IJCNLP 2009 tutorial on learning to rank: http://research.microsoft.com/en-us/people/hangli/li-acl-ijcnlp-2009-tutorial.pdf
Learning to Rank
• X = (q, d)
  – Features: e.g., matching between query and document
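A hypothetical illustration of such matching features (my sketch; real systems use many more, e.g., BM25 or language-model scores):

```python
def matching_features(query, doc):
    """A few simple query-document matching features."""
    q_terms = set(query.lower().split())
    d_terms = doc.lower().split()
    overlap = sum(1 for t in d_terms if t in q_terms)
    return [
        overlap,                         # raw count of matched terms
        overlap / max(len(d_terms), 1),  # normalized by document length
        len(q_terms & set(d_terms)),     # number of distinct matched terms
    ]

print(matching_features("machine learning", "machine learning for retrieval"))
```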
Learning to Rank
• Labels:
  – Pointwise: relevant vs. irrelevant; 5, 4, 3, 2, 1
  – Pairwise: doc A > doc B, doc C > doc D
  – Listwise: permutation
• Acquisition:
  – Expert annotation
  – Clickthrough: a clicked result is preferred over results skipped above it ("click > skip above")
Learning to Rank
• Prediction function:
  – Extract Xq,d from (q, d)
  – Rank documents by sorting wᵀXq,d (see the sketch below)
• Loss function:
  – Pointwise
  – Pairwise
  – Listwise
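A minimal sketch (mine) of the prediction step: score each candidate document with wᵀXq,d and sort in descending order of score.

```python
import numpy as np

def rank(w, docs, features):
    """docs: list of doc ids; features: dict id -> feature vector X_{q,d}."""
    scores = {d: float(np.dot(w, features[d])) for d in docs}
    return sorted(docs, key=scores.get, reverse=True)
```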
Learning to Rank
• Pointwise:
  – Regression: squared loss
• Pairwise:
  – Classification: (q, d1) > (q, d2) ⇒ positive example Xq,d1 − Xq,d2
• Listwise:
  – Optimization: NDCG@j

NDCG@j = DCG@j / IDCG@j, where DCG@j = Σ_{i=1..j} rel_i / log₂(i + 1)

Here rel_i is the relevance (0/1) of the document at rank i, 1/log₂(i + 1) is the discount of rank i, the sum is the cumulative gain, and dividing by the ideal DCG normalizes the score.
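A small sketch computing DCG@j and NDCG@j for 0/1 relevance, matching the formula above (the code index i is 0-based, so rank i + 1 gets discount log₂(i + 2)):

```python
import math

def dcg_at(rels, j):
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:j]))

def ndcg_at(rels, j):
    ideal = dcg_at(sorted(rels, reverse=True), j)
    return dcg_at(rels, j) / ideal if ideal > 0 else 0.0

print(ndcg_at([1, 0, 1, 1, 0], 5))
```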
Topic Modeling
• Topic modeling
  – Factorization of the words × documents matrix
• Clustering of documents
  – Project documents (vectors of length #vocabulary) into a lower dimension (vectors of length #topics)
• What is a topic?
  – A linear combination of words
    • Nonnegative weights that sum to 1 ⇒ a probability distribution over words
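One way to make the factorization concrete (my sketch, using nonnegative matrix factorization rather than any specific model from the slides): each column of W is a nonnegative weighting over words that can be normalized into a distribution.

```python
import numpy as np
from sklearn.decomposition import NMF

V = np.random.poisson(1.0, size=(500, 50)).astype(float)  # words x documents

nmf = NMF(n_components=5, init="nndsvd", max_iter=500)
W = nmf.fit_transform(V)        # words  x topics
H = nmf.components_             # topics x documents

topics = W / W.sum(axis=0, keepdims=True)   # normalize columns: P(word | topic)
```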
Topic Modeling
• Generative models: story-telling
  – Latent Semantic Analysis (LSA)
  – Probabilistic Latent Semantic Analysis (PLSA)
  – Latent Dirichlet Allocation (LDA)
Topic Modeling
• Latent Semantic Analysis (LSA):
  – Deerwester et al. (1990)
  – Singular Value Decomposition (SVD) applied to the words × documents matrix
  – How to interpret negative values?
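A sketch (not from the slides) of LSA via truncated SVD; the negative entries in the singular vectors are exactly the interpretability issue raised above.

```python
import numpy as np

V = np.random.rand(500, 50)                 # words x documents
U, s, Vt = np.linalg.svd(V, full_matrices=False)

k = 5                                       # number of latent topics
topics_k = U[:, :k]                         # words x k (entries may be negative)
docs_k = (np.diag(s[:k]) @ Vt[:k, :]).T     # documents x k embedding
```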
Topic Modeling

• Probabilistic Latent Semantic Analysis (PLSA):
  – Thomas Hofmann (1999)
  – Models how words/documents are generated (as described by probability)

[Figure: the words × documents matrix factorized into (words × topics) × (topics × documents); example generated (document, word) pairs: (d1, fish), (d1, boat), (d1, voyage), (d2, voyage), (d2, sky), (d3, trip), …]

Maximum likelihood: max Σ_{d,w} n(d, w) log Σ_z P(w | z) P(z | d), where n(d, w) is the count of word w in document d and z ranges over topics.
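A compact EM sketch for PLSA (my own illustration of the maximum-likelihood objective above; a dense implementation, suitable only for tiny matrices):

```python
import numpy as np

def plsa(N, k, iters=50, seed=0):
    """EM for PLSA on a words x documents count matrix N, with k topics."""
    rng = np.random.default_rng(seed)
    W, D = N.shape
    Pwz = rng.random((W, k)); Pwz /= Pwz.sum(axis=0)   # P(w|z)
    Pzd = rng.random((k, D)); Pzd /= Pzd.sum(axis=0)   # P(z|d)
    for _ in range(iters):
        # E-step: responsibilities P(z | w, d), shape (W, D, k)
        joint = Pwz[:, None, :] * Pzd.T[None, :, :]
        Pz_wd = joint / joint.sum(axis=2, keepdims=True)
        # M-step: re-estimate parameters from expected counts
        C = N[:, :, None] * Pz_wd
        Pwz = C.sum(axis=1); Pwz /= Pwz.sum(axis=0)
        Pzd = C.sum(axis=0).T; Pzd /= Pzd.sum(axis=0)
    return Pwz, Pzd
```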
Topic Modeling

• Latent Dirichlet Allocation (LDA)
  – David Blei et al. (2003)
  – PLSA with a Dirichlet prior

• What is Bayesian inference? Conjugate prior? Posterior? Frequentist vs. Bayesian: tossing a coin
  – Posterior probability of the parameter r to be estimated, given the data:

    g(r | data) ∝ likelihood × prior = f(data | r) · g(r)

  – Canonical maximum likelihood (frequentist) is a special case of Bayesian maximum a posteriori (MAP) when g(r) is a uniform prior
• Bayesian as an inference method:
  – Estimate r: posterior mean, or MAP
  – Estimate the chance that a new toss is heads: average over the posterior
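The coin example in code (my sketch): with a conjugate Beta(a, b) prior on r, observing h heads and t tails gives a Beta(a + h, b + t) posterior in closed form.

```python
a, b = 1.0, 1.0          # Beta(1,1) = uniform prior
h, t = 7, 3              # observed tosses

post_a, post_b = a + h, b + t
posterior_mean = post_a / (post_a + post_b)          # Bayesian estimate of r
map_estimate = (post_a - 1) / (post_a + post_b - 2)  # = MLE h/(h+t) under uniform prior

print(posterior_mean, map_estimate)
```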
Topic Modeling

• Latent Dirichlet Allocation (LDA)
  – David Blei et al. (2003)
  – PLSA with a Dirichlet prior

• What additional information do we know about the parameters?
  – Sparsity:
    • each topic has nonzero probability on few words;
    • each document has nonzero probability on few topics;

[Figure: words × documents matrix factorized into (words × topics) × (topics × documents)]

• The parameters of a multinomial are nonnegative and sum to 1, i.e., they lie on a simplex; the Dirichlet distribution defines a probability over the simplex and can encourage sparsity.
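To see how a Dirichlet can encourage sparsity (my sketch): samples lie on the simplex, and a concentration parameter below 1 pushes the mass onto a few entries.

```python
import numpy as np

rng = np.random.default_rng(0)

sparse = rng.dirichlet(alpha=[0.1] * 10)   # mass piles on a few entries
dense = rng.dirichlet(alpha=[10.0] * 10)   # mass spread across entries

print(np.round(sparse, 3), sparse.sum())   # sums to 1: a point on the simplex
print(np.round(dense, 3), dense.sum())
```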