Three Assertions about Interactive Machine Learning
Xiaojin Zhu
Department of Computer Sciences, University of Wisconsin-Madison
Assertion 1: Humans can be modeled with statistical learning theory
▶ Unifying math behind cognitive science and machine learning
Example 1a: Human Rademacher Complexity
(grenade, A), (meadow, A), (skull, B), (conflict, B), (queen, B)
▶ “learning random labels” (x_1, σ_1), …, (x_n, σ_n)
▶ Rademacher complexity (similar to VC dimension)

$\mathrm{Rad}_n(F) \approx \left|\frac{2}{n}\sum_{i=1}^{n} \sigma_i f(x_i)\right|$ … of our mind!

▶ Larger Rademacher complexity → worse generalization error bound (overfitting) [ZRG NIPS09]; a simulation sketch follows the figure below
[Figure: estimated human Rademacher complexity (y-axis, 0 to 2) vs. number of training items n (x-axis, 0 to 40), one panel per stimulus type; the Word materials span a valence scale (rape, killer, funeral, …, fun, laughter, joy).]
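As a minimal sketch of how the per-subject quantity above could be computed, here is the estimator applied to a simulated subject. The `subject_fit` callable is a hypothetical stand-in for the human in the loop; in the actual experiment of [ZRG NIPS09], f(x_i) comes from subjects' test responses after studying the randomly labeled items.

```python
import random

def human_rademacher(items, subject_fit, n_reps=50, rng=random):
    """Estimate the slide's quantity |(2/n) sum_i sigma_i f(x_i)| for one
    subject, averaged over independently drawn random +/-1 labelings."""
    n = len(items)
    total = 0.0
    for _ in range(n_reps):
        sigma = [rng.choice((-1, 1)) for _ in items]   # random A/B labels
        f = subject_fit(items, sigma)                  # subject's labels at test
        total += abs(2.0 / n * sum(s * y for s, y in zip(sigma, f)))
    return total / n_reps

# A stand-in "subject" that memorizes the study labels perfectly: a mind
# with maximal capacity scores the ceiling value of 2.
memorizer = lambda items, sigma: list(sigma)
print(human_rademacher(["grenade", "meadow", "skull", "conflict", "queen"],
                       memorizer))    # -> 2.0
```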
Example 1b: Human Semi-Supervised Learning
▶ Humans learn supervised first, then
▶ … decision boundary shifts to the distribution trough in test data
▶ Can be explained by a variety of semi-supervised machine learning models [GRZ ToCS13]; a toy sketch follows the figure below
[Figure: left, the left-shifted Gaussian mixture p(x) on x ∈ [−3, 3], with range examples and test examples marked; right, percent class-2 responses vs. x on [−1, 1] for test-1 (all subjects) and test-2 (L-subjects vs. R-subjects).]
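One concrete instance of such a model, as a hedged sketch (not the specific models compared in [GRZ ToCS13]; assumes scikit-learn is available): fit a two-component Gaussian mixture to the unlabeled test items and place the decision boundary at the density trough between the components. All numbers below are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Supervised phase: two labeled examples at x = -1 and x = +1
# (a purely supervised learner would put the boundary at x = 0).
labeled_x, labeled_y = np.array([-1.0, 1.0]), np.array([0, 1])

# Test phase: unlabeled items from a LEFT-SHIFTED two-component mixture.
test_x = np.concatenate([rng.normal(-1.4, 0.4, 300), rng.normal(0.6, 0.4, 300)])

# Mixture-model learner: fit p(x) to the unlabeled data, put the boundary
# at the density trough between the two component means.
gmm = GaussianMixture(n_components=2, random_state=0).fit(test_x[:, None])
lo, hi = sorted(gmm.means_.ravel())
grid = np.linspace(lo, hi, 1001)
boundary = grid[np.argmin(gmm.score_samples(grid[:, None]))]
print(f"supervised boundary: 0.0, semi-supervised boundary: {boundary:.2f}")
# the boundary shifts left, toward the trough of the test distribution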
Example 1c: Human Active Learning
[CKNQRZ NIPS08]
Passive learning:

$\inf_{\hat\theta_n}\ \sup_{\theta\in[0,1]} E\left[|\hat\theta_n-\theta|\right] \ \ge\ \frac{1}{4}\left(\frac{1+2\epsilon}{1-2\epsilon}\right)^{2\epsilon}\frac{1}{n+1}$

Active learning:

$\sup_{\theta\in[0,1]} E\left[|\hat\theta_n-\theta|\right] \ \le\ 2\left(\sqrt{\frac{1}{2}+\sqrt{\epsilon(1-\epsilon)}}\right)^{n}$

[Figure: a 2 × 5 grid of learning curves (x-axis: 10 to 40 queries; y-axis: −1 to 1), rows Human Passive and Human Active, columns noise ε = 0, 0.05, 0.1, 0.2, 0.4.]
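A small calculator for the two rates, transcribed directly from the bounds above; the constants are as reconstructed from the slide and should be checked against [CKNQRZ NIPS08]. It makes the gap concrete: the passive lower bound decays polynomially while the active upper bound decays exponentially (and becomes loose as ε grows).

```python
import math

def passive_lower_bound(n, eps):
    """Minimax lower bound on E|theta_hat - theta| for passive learning."""
    return 0.25 * ((1 + 2 * eps) / (1 - 2 * eps)) ** (2 * eps) / (n + 1)

def active_upper_bound(n, eps):
    """Upper bound for active (bisection-style) querying: exponential decay."""
    return 2.0 * math.sqrt(0.5 + math.sqrt(eps * (1 - eps))) ** n

for eps in (0.0, 0.05, 0.1, 0.2, 0.4):
    print(f"eps={eps}: passive >= {passive_lower_bound(40, eps):.4f}   "
          f"active <= {active_upper_bound(40, eps):.2e}")
```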
Assertion 2: There is a theoretically optimal way to teach
Human teaches machine (interactive ML)
Machine teaches human (education)
Example 2: 1D threshold function
▶ Passive learning: $(x_i, y_i) \overset{iid}{\sim} p$, risk ≈ $O(\frac{1}{n})$
▶ Active learning: risk ≈ $\frac{1}{2^n}$
▶ Minimum teaching: n = 2 (teaching dimension)
▶ Alternatively: easy to hard (curriculum learning, fading, parentese)
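A minimal sketch contrasting the first three regimes on the 1D threshold task. The true threshold value and the learner (midpoint of the innermost labeled pair) are illustrative assumptions, not a construction from the slides.

```python
import random

theta = 0.62                      # true threshold on [0, 1] (illustrative value)
label = lambda x: int(x >= theta)
rng = random.Random(0)

# Passive: n iid examples; the best guess lies between the innermost pair.
xs = sorted(rng.random() for _ in range(20))
lo = max((x for x in xs if label(x) == 0), default=0.0)
hi = min((x for x in xs if label(x) == 1), default=1.0)
print("passive:", (lo + hi) / 2)  # expected error O(1/n)

# Active: binary search halves the hypothesis interval per query.
lo, hi = 0.0, 1.0
for _ in range(20):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if label(mid) == 0 else (lo, mid)
print("active:", (lo + hi) / 2)   # error ~ 1/2^n

# Teaching: a clairvoyant teacher needs only n = 2 tight examples.
eps = 1e-6
D = [(theta - eps, 0), (theta + eps, 1)]
print("taught:", (D[0][0] + D[1][0]) / 2)   # error ~ eps, with n = 2
```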
A formula for optimal teaching
1. World: p(x, y | θ*), loss function ℓ(f(x), y)
2. Learner: makes prediction f(x | data)
3. Teacher:
   ▶ clairvoyant, knows everything above
   ▶ can only teach by examples (x, y)
   ▶ goal: choose the least-effort teaching set D = (x, y)_{1:n} to minimize the learner's future loss (risk):

$\min_D\ E_{\theta^*}\left[\ell(f(x \mid D), y)\right] + \mathrm{effort}(D)$

▶ if the future loss approaches the Bayes risk, D is a teaching set and n is the (generalized) teaching dimension
[KZM NIPS11, Z arXiv13]
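A brute-force toy instantiation of the objective above, as a sketch only: the candidate pool, the per-example effort cost, and the midpoint learner are all illustrative assumptions, not the formulation of [KZM NIPS11, Z arXiv13].

```python
import itertools, statistics

# Toy instantiation of  min_D  E_{theta*}[ loss(f(x|D), y) ] + effort(D).
# World: 1D threshold at theta_star; learner: midpoint of the innermost
# labeled pair (it refuses to commit until it has seen both classes).
theta_star = 0.62
pool = [i / 10 for i in range(11)]         # candidate teaching inputs
test_xs = [i / 100 for i in range(101)]    # where future 0/1 loss is measured

def learner(D):
    neg = [x for x, y in D if y == 0]
    pos = [x for x, y in D if y == 1]
    return (max(neg) + min(pos)) / 2 if neg and pos else None

def risk(D):
    t = learner(D)
    if t is None:
        return 1.0                         # no committed hypothesis yet
    return statistics.mean(int((x >= t) != (x >= theta_star)) for x in test_xs)

effort = lambda D: 0.01 * len(D)           # assumed per-example effort cost

candidates = [[(x, int(x >= theta_star)) for x in xs]
              for n in (1, 2, 3)
              for xs in itertools.combinations(pool, n)]
best = min(candidates, key=lambda D: risk(D) + effort(D))
print("optimal teaching set:", best)       # two examples bracketing theta_star
```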
Assertion 3: Even when human teachers are not optimal, they are not iid
… and machine learners should take advantage of that non-iidness.
Example 3: Feature Volunteering (Interactive ML)
[JZSR ICML13]
Example 3: Sampling with Reduced Replacement
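The slides leave this example to the figures. As a rough sketch of the idea named in the title, as I read [JZSR ICML13]: items are drawn in proportion to preference weights, but a drawn item's weight is discounted rather than fully replaced, interpolating between iid sampling and sampling without replacement. The weights and the discount `alpha` below are illustrative assumptions, not the paper's fitted parameters.

```python
import random

def sample_reduced_replacement(weights, n, alpha=0.3, rng=random):
    """Draw n indices, each in proportion to the current weights; after a
    draw, the chosen item's weight is multiplied by alpha in [0, 1].
    alpha = 1 gives iid sampling with replacement; alpha = 0, without."""
    w = list(weights)
    out = []
    for _ in range(n):
        i = rng.choices(range(len(w)), weights=w)[0]
        out.append(i)
        w[i] *= alpha
    return out

# Human-generated lists (e.g. "name some animals") repeat items far less
# often than iid sampling from the same preference weights would predict.
print(sample_reduced_replacement([5, 3, 1, 1], n=6))
```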
References
R. Castro, C. Kalish, R. Nowak, R. Qian, T. Rogers, and X. Zhu. Human active learning. In Advances in Neural Information Processing Systems (NIPS) 22, 2008.
B. R. Gibson, T. T. Rogers, and X. Zhu. Human semi-supervised learning. Topics in Cognitive Science, 5(1):132–172, 2013.
K.-S. Jun, X. Zhu, B. Settles, and T. Rogers. Learning from human-generated lists. In The 30th International Conference on Machine Learning (ICML), 2013.
F. Khan, X. Zhu, and B. Mutlu. How do humans teach: On curriculum learning and teaching dimension. In Advances in Neural Information Processing Systems (NIPS) 25, 2011.
X. Zhu, T. T. Rogers, and B. Gibson. Human Rademacher complexity. In Advances in Neural Information Processing Systems (NIPS) 23, 2009.
Three Assertions
1. Humans can be modeled with statistical learning theory.
2. There is a theoretically optimal way to teach.
3. Even when human teachers are not optimal, they are not iid.
Capacity
VC-dimension
▶ F: a family of binary classifiers
▶ VC-dimension VC(F): size of the largest set that F can shatter
▶ With probability at least 1 − δ,

$\sup_{f\in F}\ R(f) - R_n(f) \ \le\ 2\sqrt{\frac{2\,VC(F)\log n + VC(F)\log\frac{2e}{VC(F)} + \log\frac{2}{\delta}}{n}}.$

▶ R(f): error of f in the future
▶ R_n(f): error of f on a training set of size n
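A brute-force shattering check over a finite sub-sample of a hypothesis class; a sketch of why the slide later calls VC dimension "combinatorial (verify shattering)", not how VC dimension is computed in general.

```python
def shatters(hypotheses, points):
    """True iff the class realizes all 2^|points| labelings of `points`."""
    realized = {tuple(h(x) for x in points) for h in hypotheses}
    return len(realized) == 2 ** len(points)

# Threshold classifiers f_t(x) = 1[x >= t] have VC dimension 1:
thresholds = [lambda x, t=t: int(x >= t) for t in (0.0, 0.25, 0.5, 0.75, 1.0)]
print(shatters(thresholds, [0.3]))        # True: any single point is shattered
print(shatters(thresholds, [0.3, 0.6]))   # False: labeling (1, 0) is impossible
```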
Capacity
Rademacher complexity
▶ σ_1, …, σ_n: P(σ_i = 1) = P(σ_i = −1) = 1/2
▶ Rademacher complexity

$\mathrm{Rad}_n(F) = E_{\sigma,x}\left(\sup_{f\in F}\left|\frac{1}{n}\sum_{i=1}^{n}\sigma_i f(x_i)\right|\right).$

▶ With probability at least 1 − δ,

$\sup_{f\in F} |R_n(f) - R(f)| \ \le\ 2\,\mathrm{Rad}_n(F) + \sqrt{\frac{\log(2/\delta)}{2n}}.$
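A Monte Carlo estimator of the quantity just defined, as a sketch for a finite class. Note it estimates the empirical version conditional on a fixed sample x_1, …, x_n; the slide's Rad_n(F) additionally averages over x.

```python
import random, statistics

def rademacher(hypotheses, xs, n_trials=2000, rng=random):
    """Monte Carlo estimate of E_sigma[ sup_f |(1/n) sum_i sigma_i f(x_i)| ]
    for a finite class, conditional on the sample xs."""
    n = len(xs)
    draws = []
    for _ in range(n_trials):
        sigma = [rng.choice((-1, 1)) for _ in range(n)]
        draws.append(max(abs(sum(s * f(x) for s, x in zip(sigma, xs)) / n)
                         for f in hypotheses))
    return statistics.mean(draws)

# Finite class of +/-1 threshold classifiers on 10 sample points:
fs = [lambda x, t=t: 1 if x >= t else -1 for t in (0.2, 0.4, 0.6, 0.8)]
print(rademacher(fs, [i / 10 for i in range(10)]))  # shrinks as n grows
```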
Machine learning → human learning
▶ f: you categorize x by f(x)
▶ F: all the classifiers in your mind
▶ R_n(f): how you did in class
▶ R(f): how well you can do outside class
▶ Capacity: can we measure it in humans?
▶ VC(F): too brittle (find one dataset of size n) and combinatorial (verify shattering)
▶ Others may behave better, e.g., $\mathrm{Rad}_n(F)$
Overfitting indicator
[Figure: observed overfitting e − ê vs. Rademacher complexity (0 to 2) for the Shape,40; Word,40; Shape,5; and Word,5 conditions, plotted against the theoretical bound.]
▶ e: test set error, ê: training set error
▶ the generalization error bound holds
▶ actual overfitting tracks the bound (nice, but not predicted by theory)
The study of capacity may
▶ constrain cognitive models
▶ help us understand groups that differ in age, health, education, etc.
Human semi-supervised learning, the other way around
Human unsupervised learning first
[Figure: four unsupervised-exposure conditions, each shown as a density p(x) on x ∈ [−8, 8] and as the example sequence over time: trough, peak, uniform, converge.]

… influences subsequent (identical) supervised learning task

[Figure: mean accuracy ± std err (roughly 0.6 to 1.0) for the trough, converge, uniform, and peak conditions.]
Human teacher behaviors
strategy                  boundary   curriculum   linear   positive
“graspability” (n = 31)       0%          48%       42%        10%
“lines” (n = 32)             56%          19%       25%         0%