Active Learning for Probabilistic Models
Transcript of Active Learning for Probabilistic Models
LARC-IMS Workshop
Active Learning for Probabilistic Models
Lee Wee Sun, Department of Computer Science, National University of Singapore
Probabilistic Models in Networked Environments
• Probabilistic graphical models are powerful tools in networked environments
• Example task: Given some labeled nodes, what are the labels of remaining nodes?
• May also need to learn parameters of model (later)
[Figure: labeling university web pages with a CRF; a graph of pages, some labeled Faculty, Student, or Project, and the rest unknown]
Active Learning
• Given a budget of k queries, which nodes to query to maximize performance on remaining nodes?
• What are reasonable performance measures with provable guarantees for greedy methods?
Entropy
• First consider a non-adaptive policy
• Chain rule of entropy: $H(Y_G) = H(Y_1) + H(Y_2 \mid Y_1)$, where $Y_G$ is the full labeling, $Y_1$ the selected variables, and $Y_2$ the remaining variables
• $H(Y_G)$ is constant, so maximizing the entropy $H(Y_1)$ of the selected variables minimizes the conditional entropy $H(Y_2 \mid Y_1)$ of the target variables
• Greedy method: given the already selected set $S$, add the variable $Y_i$ that maximizes the conditional entropy $H(Y_i \mid Y_S)$
• Near-optimality: the greedily selected set $S_k$ satisfies $H(Y_{S_k}) \ge (1 - 1/e) \max_{|S| \le k} H(Y_S)$ because of the submodularity of entropy
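As a concrete sketch of this non-adaptive greedy rule (using a made-up joint distribution over three binary variables, not the CRF from the talk):

```python
import math

# Hypothetical joint distribution over three binary variables.
p = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.10,
    (0, 1, 0): 0.05, (0, 1, 1): 0.05,
    (1, 0, 0): 0.10, (1, 0, 1): 0.10,
    (1, 1, 0): 0.10, (1, 1, 1): 0.20,
}

def joint_entropy(subset):
    """Shannon entropy (in nats) of the marginal over the given variable indices."""
    marg = {}
    for assignment, prob in p.items():
        key = tuple(assignment[i] for i in subset)
        marg[key] = marg.get(key, 0.0) + prob
    return -sum(q * math.log(q) for q in marg.values() if q > 0)

def greedy_select(k):
    """Repeatedly add the variable that most increases H(Y_S),
    i.e. the one with the largest conditional entropy given S."""
    S = ()
    for _ in range(k):
        best = max((i for i in range(3) if i not in S),
                   key=lambda i: joint_entropy(S + (i,)))
        S += (best,)
    return S

selected = greedy_select(2)
```

The greedy set is within a $1 - 1/e$ factor of the best entropy achievable with the same budget, by the submodularity argument above.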
Submodularity
• Diminishing return property: for sets $S \subseteq T$ and a variable $i \notin T$, $f(S \cup \{i\}) - f(S) \ge f(T \cup \{i\}) - f(T)$
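The diminishing return property of entropy can be spot-checked numerically; the sketch below builds a random joint distribution over three binary variables (an illustration, not from the slides) and verifies the inequality on a few chains $S \subseteq T$:

```python
import itertools
import math
import random

random.seed(1)

# Random joint distribution over three binary variables.
outcomes = list(itertools.product((0, 1), repeat=3))
weights = [random.random() for _ in outcomes]
total = sum(weights)
p = {o: w / total for o, w in zip(outcomes, weights)}

def H(subset):
    """Entropy (in nats) of the marginal over the variable indices in subset."""
    marg = {}
    for o, pr in p.items():
        key = tuple(o[i] for i in subset)
        marg[key] = marg.get(key, 0.0) + pr
    return -sum(q * math.log(q) for q in marg.values() if q > 0)

# Spot-check diminishing returns on a few chains S ⊆ T with i ∉ T:
#   H(S ∪ {i}) − H(S) ≥ H(T ∪ {i}) − H(T)
ok = all(
    H(S + (i,)) - H(S) >= H(T + (i,)) - H(T) - 1e-9
    for S, T, i in [((), (0,), 1), ((), (0, 1), 2), ((0,), (0, 1), 2)]
)
assert ok
```

The inequality holds for any joint distribution because conditioning on more variables can only reduce the marginal gain $H(Y_i \mid Y_S)$.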
Adaptive Policy
• What about adaptive policies?
• A non-adaptive policy fixes all $k$ queries in advance; an adaptive policy chooses each query based on the answers received so far, forming a policy tree of depth $k$
• Let $\rho$ be a path down the policy tree, and let the policy entropy be $H(\pi) = -\sum_\rho p(\rho) \log p(\rho)$. Then we can show $H(\pi) + \sum_\rho p(\rho)\, H(Y_G \mid \rho) = H(Y_G)$, where $Y_G$ is the graph labeling
• This corresponds to the chain rule in the non-adaptive case: maximizing the policy entropy minimizes the expected conditional entropy
• Recap: the greedy algorithm is near-optimal in the non-adaptive case
• For the adaptive case, consider the greedy algorithm that selects the variable with the largest entropy conditioned on the observations so far
• Unfortunately, for the adaptive case we can show that for every $\alpha > 0$ there is a probabilistic model such that $H(\pi_{\mathrm{greedy}}) < \alpha \max_\pi H(\pi)$, so no constant-factor approximation guarantee is possible
Tsallis Entropy and Gibbs Error
• In statistical mechanics, the Tsallis entropy $S_q(p) = \frac{1 - \sum_i p_i^q}{q - 1}$ is a generalization of Shannon entropy
• Shannon entropy is the special case $q \to 1$
• We call the case $q = 2$, namely $1 - \sum_i p_i^2$, the Gibbs error
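A minimal implementation of these quantities (the example distribution is made up for illustration):

```python
import math

def tsallis(probs, q):
    """Tsallis entropy S_q(p) = (1 - sum_i p_i^q) / (q - 1),
    with the Shannon entropy (in nats) as the q -> 1 limit."""
    if abs(q - 1.0) < 1e-9:
        return -sum(pi * math.log(pi) for pi in probs if pi > 0)
    return (1.0 - sum(pi ** q for pi in probs)) / (q - 1.0)

def gibbs_error(probs):
    """Gibbs error is the q = 2 special case: 1 - sum_i p_i^2."""
    return tsallis(probs, 2.0)

p = [0.5, 0.3, 0.2]  # hypothetical label distribution
print(tsallis(p, 1.0), gibbs_error(p))
```

For this distribution the Gibbs error is $1 - (0.25 + 0.09 + 0.04) = 0.62$, below the Shannon entropy of about $1.03$ nats.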
Properties of Gibbs Error
• Gibbs error is the expected error of the Gibbs classifier
– Gibbs classifier: draw a labeling from the distribution and use that labeling as the prediction
• Gibbs error is at most twice the Bayes (best possible) error
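A quick Monte Carlo check of both properties on a made-up three-label posterior: the simulated Gibbs classifier error matches the closed form $1 - \sum_i p_i^2$ and stays within twice the Bayes error:

```python
import random

random.seed(0)

labels = [0, 1, 2]
p = [0.5, 0.3, 0.2]  # hypothetical posterior over labelings

# Monte Carlo: both the truth and the Gibbs prediction are independent
# draws from the same posterior.
trials = 200_000
errors = sum(
    random.choices(labels, p)[0] != random.choices(labels, p)[0]
    for _ in range(trials)
)
mc_gibbs = errors / trials

gibbs = 1.0 - sum(pi * pi for pi in p)  # closed-form Gibbs error
bayes = 1.0 - max(p)                    # error of predicting the mode

assert abs(mc_gibbs - gibbs) < 0.01
assert bayes <= gibbs <= 2.0 * bayes
```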
• Gibbs error is a lower bound to Shannon entropy: $1 - \sum_i p_i^2 \le -\sum_i p_i \ln p_i$
– Maximizing the policy Gibbs error therefore maximizes a lower bound to the policy entropy
• Policy Gibbs error: the $q = 2$ analogue of the policy entropy, $\sum_\rho p(\rho)\,(1 - p(\rho))$, where $\rho$ ranges over paths of the policy tree
• Maximizing the policy Gibbs error minimizes the expected weighted posterior Gibbs error
• Each query makes progress on either the version space or the posterior Gibbs error
Gibbs Error and Adaptive Policies
• Greedy algorithm: select the node $i$ with the largest Gibbs error conditioned on the observations so far
• Near-optimality holds for policy Gibbs error (in contrast to policy entropy)
• Proof idea:
– Show that the policy Gibbs error equals the expected version space reduction
– The version space is the total probability of the remaining labelings of the unlabeled nodes (labelings consistent with the labeled nodes)
– The version space reduction function is adaptive submodular, which gives the required result for policy Gibbs error (using a result of Golovin and Krause)
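A one-query sketch of the first step of the proof idea, on a made-up prior over labelings of two nodes: computing the expected surviving probability mass directly gives the same number as the Gibbs error of the queried node's marginal:

```python
# Hypothetical prior over full labelings of two web pages.
prior = {
    ('Faculty', 'Student'): 0.4, ('Faculty', 'Project'): 0.2,
    ('Student', 'Student'): 0.3, ('Student', 'Project'): 0.1,
}

def marginal(node):
    """Marginal label distribution of one node."""
    m = {}
    for labeling, pr in prior.items():
        m[labeling[node]] = m.get(labeling[node], 0.0) + pr
    return m

def expected_vs_reduction(node):
    """Expected drop in version space (surviving probability mass)
    from querying one node, averaged over its possible answers."""
    m = marginal(node)
    # Observing value v leaves mass m[v] surviving; v occurs w.p. m[v].
    return sum(pv * (1.0 - pv) for pv in m.values())

def gibbs_error(node):
    """Gibbs error of the node's marginal: 1 - sum_v p(v)^2."""
    return 1.0 - sum(pv * pv for pv in marginal(node).values())

# For a single query the two quantities coincide.
assert abs(expected_vs_reduction(0) - gibbs_error(0)) < 1e-12
```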
Adaptive Submodularity
• Diminishing return property
– Let $\Delta(x_i \mid \rho)$ be the expected change in version space when $x_i$ is concatenated to path $\rho$ and its label $y$ is received
– Adaptive submodular because $\Delta(x_i \mid \rho) \ge \Delta(x_i \mid \rho')$ whenever $\rho'$ extends $\rho$
Worst Case Version Space
• Maximizing the policy Gibbs error maximizes the expected version space reduction
• A related greedy algorithm: select the least confident variable
– That is, select the variable with the smallest maximum label probability
• This approximately maximizes the worst case version space reduction
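The least-confident rule itself is a one-liner; the marginals below are made up for illustration:

```python
# Hypothetical marginal label distributions for each unlabeled node.
marginals = {
    'page1': {'Faculty': 0.9, 'Student': 0.1},
    'page2': {'Faculty': 0.5, 'Student': 0.3, 'Project': 0.2},
    'page3': {'Faculty': 0.6, 'Student': 0.4},
}

# Least-confident rule: query the node whose most likely label has the
# smallest probability.
query = min(marginals, key=lambda n: max(marginals[n].values()))
assert query == 'page2'  # its max label probability 0.5 is the smallest
```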
• Let $V(\pi)$ be the worst case version space reduction achieved by policy $\pi$
• The greedy strategy that selects the least confident variable achieves $V(\pi_{\mathrm{greedy}}) \ge (1 - 1/e) \max_\pi V(\pi)$, because the version space reduction function is pointwise submodular
Pointwise Submodularity
• Let $V(S, y)$ be the version space remaining if $y$ is the true labeling of all nodes and the subset $S$ has been labeled
• $1 - V(S, y)$ is pointwise submodular, as it is submodular in $S$ for every fixed labeling $y$
Summary So Far …

| Greedy Algorithm | Criteria | Optimality | Property |
| --- | --- | --- | --- |
| Select maximum entropy variable | Entropy of selected variables | No constant factor approximation | |
| Select maximum Gibbs error variable | Policy Gibbs error (expected version space reduction) | 1 − 1/e | Adaptive submodular |
| Select least confident variable | Worst case version space reduction | 1 − 1/e | Pointwise submodular |
| … | … | … | … |
Learning Parameters
• Take a Bayesian approach
• Put a prior over the parameters
• Integrate away the parameters when computing the probability of a labeling
• Also works in the commonly encountered pool-based active learning scenario (independent instances, with no dependencies other than on the parameters)
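A toy sketch of integrating parameters away, using a Beta-Bernoulli model rather than the Bayesian CRF from the talk: the posterior predictive has a closed form, and a Monte Carlo average over posterior samples of the parameter recovers it:

```python
import random

random.seed(0)

# Beta(a, b) prior on a Bernoulli parameter theta; after observing `ones`
# positive and `zeros` negative labels, the predictive probability of the
# next label has theta integrated away in closed form.
def predictive(a, b, ones, zeros):
    """Posterior predictive P(next = 1 | data)."""
    return (a + ones) / (a + b + ones + zeros)

# Monte Carlo version of the same integral: average P(next = 1 | theta)
# over samples of theta from the Beta posterior.
samples = [random.betavariate(1 + 3, 1 + 1) for _ in range(100_000)]
mc = sum(samples) / len(samples)

exact = predictive(1, 1, 3, 1)  # uniform prior, 3 ones and 1 zero observed
assert abs(mc - exact) < 0.01
```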
Experiments
• Named entity recognition with a Bayesian CRF on the CoNLL 2003 dataset
• The greedy algorithms perform similarly to one another, and better than passive learning (random selection)

[Figure: performance on NER; F1 AUC between roughly 71 and 77 for the Passive, MaxEnt, Least Conf, and Gibbs Err strategies]
Weakness of Gibbs Error
• A labeling is considered incorrect if even one component disagrees with the true labeling

[Figure: two labelings of the university web page graph (Faculty, Student, Project) that differ in only a single node are treated as entirely different labelings]
Generalized Gibbs Error
• Generalize the Gibbs error to use a loss function $L$
• Examples: Hamming loss, 1 − F-score, etc.
• Reduces to the Gibbs error when $L(y, y') = 1 - \delta(y, y')$, where
– $\delta(y, y') = 1$ when $y = y'$, and
– $\delta(y, y') = 0$ otherwise
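A small sketch of the generalized Gibbs error on a made-up posterior over labelings, checking that the 0-1 loss recovers the ordinary Gibbs error:

```python
# Hypothetical posterior over labelings of three nodes.
posterior = {
    ('Faculty', 'Student', 'Project'): 0.5,
    ('Faculty', 'Student', 'Student'): 0.3,
    ('Student', 'Project', 'Project'): 0.2,
}

def hamming(y, z):
    """Fraction of positions on which the two labelings disagree."""
    return sum(a != b for a, b in zip(y, z)) / len(y)

def zero_one(y, z):
    """0-1 loss, i.e. 1 - delta(y, z)."""
    return float(y != z)

def generalized_gibbs(loss):
    """Expected loss between two independent draws from the posterior."""
    return sum(
        p * q * loss(y, z)
        for y, p in posterior.items()
        for z, q in posterior.items()
    )

# With 0-1 loss this reduces to the ordinary Gibbs error, 1 - sum_y p(y)^2.
assert abs(
    generalized_gibbs(zero_one)
    - (1 - sum(p * p for p in posterior.values()))
) < 1e-12
# Hamming loss only penalizes the disagreeing components, so it is never larger.
assert generalized_gibbs(hamming) <= generalized_gibbs(zero_one)
```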
• Generalized policy Gibbs error (to maximize): maximizing it minimizes the remaining weighted generalized Gibbs error among the labelings that agree with the true labeling $y$ on the observed path $\rho$
• The generalized policy Gibbs error is the average, over labelings, of the generalized version space reduction
• Call this function the generalized version space reduction function
• Unfortunately, it is not adaptive submodular for arbitrary $L$
• However, the generalized version space reduction function is pointwise submodular
– So the greedy algorithm has a good approximation guarantee in the worst case
• Hedging against the worst case labeling may be too conservative
• Can instead hedge against the total generalized version space among the surviving labelings
• Call this the total generalized version space reduction function
• The total generalized version space reduction function is pointwise submodular
– So the greedy algorithm has a good approximation guarantee in the worst case
Summary

| Greedy Algorithm | Criteria | Optimality | Property |
| --- | --- | --- | --- |
| Select maximum entropy variable | Entropy of selected variables | No constant factor approximation | |
| Select maximum Gibbs error variable | Policy Gibbs error (expected version space reduction) | 1 − 1/e | Adaptive submodular |
| Select least confident variable | Worst case version space reduction | 1 − 1/e | Pointwise submodular |
| Select variable that maximizes worst case generalized version space reduction | Worst case generalized version space reduction | 1 − 1/e | Pointwise submodular |
| Select variable that maximizes worst case total generalized version space reduction | Worst case total generalized version space reduction | 1 − 1/e | Pointwise submodular |
Experiments
• Text classification on the 20 Newsgroups dataset
• Classify 7 pairs of newsgroups
• AUC of the classification error curve as the metric
• Maximum Gibbs error vs. total generalized version space reduction with Hamming loss

[Figure: scatter plot of AUC for Gibbs (x-axis) against AUC for Hamming (y-axis), both ranging from about 74 to 88]
Acknowledgements
• Joint work with
– Nguyen Viet Cuong (NUS)
– Ye Nan (NUS)
– Adam Chai (DSO)
– Chieu Hai Leong (DSO)