Dr. Perceptron - cs.cmu.edu
Transcript of Dr. Perceptron - cs.cmu.edu
¹ http://futurama.wikia.com/wiki/Dr._Perceptron
Where we are…
• Experiments with a hash-trick implementation of logistic regression
• Next question:
  – how do you parallelize SGD, or more generally, this kind of streaming algorithm?
  – each example affects the next prediction → order matters → parallelization changes the behavior
  – we will step back to perceptrons and then step forward to parallel perceptrons
  – then another nice parallel learning algorithm – then a midterm
Recap: perceptrons
The perceptron
A sends B an instance xi; B computes ŷi = sign(vk · xi) and returns ŷi to A; A returns the true label yi to B.
If mistake: vk+1 = vk + yi xi
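The mistake-driven update above can be sketched in a few lines of Python (`perceptron_step` is an illustrative name, not from the slides):

```python
import numpy as np

def perceptron_step(v, x, y):
    """One round: predict sign(v . x); on a mistake, add y*x to v."""
    y_hat = 1 if np.dot(v, x) >= 0 else -1
    if y_hat != y:
        v = v + y * x  # v_{k+1} = v_k + y_i x_i
    return v, y_hat

# tiny run over a two-example stream
v = np.zeros(2)
stream = [(np.array([1.0, 1.0]), 1), (np.array([-1.0, -1.0]), -1)]
for x, y in stream:
    v, _ = perceptron_step(v, x, y)
```

After the one mistake (on the negative example), v separates both points correctly.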
The perceptron

A sends B an instance xi; B computes ŷi = sign(vk · xi) and returns ŷi to A; A returns the true label yi to B.
If mistake: vk+1 = vk + yi xi

Mistake bound: k ≤ (R/γ)²

(Figure: positive and negative examples separated by a margin of 2γ around the separator u.)

A lot like SGD update for logistic regression!
On-line to batch learning
1. Pick a vk at random according to mk/m, the fraction of examples it was used for.
2. Predict using the vk you just picked.
3. (Actually, use some sort of deterministic approximation to this.)

(Figure: e.g. m1=3, m2=4, …, m=10.)
predict using sign(v* · x)
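The randomized on-line-to-batch conversion (pick vk with probability mk/m) can be sketched as follows; `sample_hypothesis`, the string labels, and the fixed seed are all illustrative choices, not from the slides:

```python
import random

def sample_hypothesis(vs, counts, rng):
    """Pick vk with probability mk/m, where mk is the number of
    examples vk survived and m is the total number of examples."""
    m = sum(counts)
    r = rng.uniform(0, m)
    acc = 0.0
    for v, mk in zip(vs, counts):
        acc += mk
        if r <= acc:
            return v
    return vs[-1]

# the slide's picture: m1=3, m2=4, ..., m=10 overall
rng = random.Random(0)
vs, counts = ["v1", "v2", "v3"], [3, 4, 3]
draws = [sample_hypothesis(vs, counts, rng) for _ in range(10000)]
```

Over many draws, each vk appears with frequency close to mk/m, which is what step 1 asks for.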
predict using sign(v* · x), where v* is either the last perceptron or an average/vote over all the vk.

Also: there’s a sparsification trick that makes learning the averaged perceptron fast.
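One common version of that trick keeps a second accumulator updated only on mistakes, so the running average costs extra work proportional to the number of mistakes rather than the number of examples. A minimal sketch (the function name and the exact bookkeeping are one standard variant, not necessarily the slides'):

```python
import numpy as np

def averaged_perceptron(stream, dim):
    """Averaged perceptron via the two-vector trick: u accumulates
    t * (update) at each mistake, so the average of the v's over the
    stream is recovered as v - u/t at the end."""
    v = np.zeros(dim)
    u = np.zeros(dim)
    t = 1
    for x, y in stream:
        if y * np.dot(v, x) <= 0:   # mistake (or zero margin)
            v = v + y * x
            u = u + t * y * x
        t += 1
    return v - u / t

w = averaged_perceptron(
    [(np.array([1.0, 0.0]), 1), (np.array([-1.0, 0.0]), -1)], 2)
```

The returned w is the averaged weight vector, which is what the voted/averaged predictor uses at test time.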
KERNELS AND PERCEPTRONS
The kernel perceptron
A sends B an instance xi; B computes ŷi = vk · xi and returns ŷi to A; A returns the true label yi to B.
If mistake: vk+1 = vk + yi xi

Compute: ŷ = Σ_{xi∈FN} x · xi − Σ_{xi∈FP} x · xi
If false positive (ŷ too high): add xi to FP
If false negative (ŷ too low): add xi to FN

Mathematically the same as before … but allows use of the kernel trick
The kernel perceptron
A sends B an instance xi; B computes ŷi = vk · xi and returns ŷi to A; A returns the true label yi to B.
If mistake: vk+1 = vk + yi xi

Compute: ŷ = Σ_{xi∈FN} K(x, xi) − Σ_{xi∈FP} K(x, xi), where K(x, xk) ≡ x · xk
If false positive (ŷ too high): add xi to FP
If false negative (ŷ too low): add xi to FN

Mathematically the same as before … but allows use of the “kernel trick”

Other kernel methods (SVM, Gaussian processes) aren’t constrained to a limited set (+1/−1/0) of weights on the K(x,v) values.
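The kernel perceptron above can be sketched directly from the mistake rule; the function names and the brute-force training loop are illustrative, not from the slides:

```python
import numpy as np

def kernel_perceptron_predict(x, FP, FN, K):
    """y-hat = sum over FN of K(x, xi) minus sum over FP of K(x, xi)."""
    return sum(K(x, xi) for xi in FN) - sum(K(x, xi) for xi in FP)

def kernel_perceptron_train(data, K, epochs=5):
    """Mistake-driven training: a false negative (score too low) adds
    x to FN; a false positive (score too high) adds x to FP."""
    FP, FN = [], []
    for _ in range(epochs):
        for x, y in data:
            score = kernel_perceptron_predict(x, FP, FN, K)
            if y > 0 and score <= 0:
                FN.append(x)   # raise future scores near x
            elif y < 0 and score > 0:
                FP.append(x)   # lower future scores near x
    return FP, FN

K = lambda a, b: float(np.dot(a, b))   # linear kernel recovers the plain perceptron
data = [(np.array([1.0, 0.0]), 1), (np.array([-1.0, 0.0]), -1)]
FP, FN = kernel_perceptron_train(data, K)
```

Note the hypothesis is stored entirely as the mistake sets FP and FN; the weight vector never appears, which is what makes the kernel trick possible.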
Some common kernels
• Linear kernel: K(x, x′) ≡ x · x′
• Polynomial kernel: K(x, x′) ≡ (x · x′ + 1)^d
• Gaussian kernel: K(x, x′) ≡ e^(−||x − x′||² / σ)
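The three kernels written as plain functions (function names and default parameter values are illustrative; the slide writes the Gaussian exponent as ||x − x′||²/σ):

```python
import numpy as np

def linear_kernel(x, xp):
    # K(x, x') = x . x'
    return float(np.dot(x, xp))

def polynomial_kernel(x, xp, d=2):
    # K(x, x') = (x . x' + 1)^d
    return float((np.dot(x, xp) + 1) ** d)

def gaussian_kernel(x, xp, sigma=1.0):
    # K(x, x') = exp(-||x - x'||^2 / sigma)
    return float(np.exp(-np.sum((x - xp) ** 2) / sigma))
```

Each takes two vectors and returns a scalar similarity, which is all the kernel perceptron needs.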
Some common kernels
• Polynomial kernel: K(x, x′) ≡ (x · x′ + 1)^d
• for d = 2:

(⟨(x1, x2), (x′1, x′2)⟩ + 1)²
= (x1x′1 + x2x′2 + 1)²
= (x1x′1 + x2x′2 + 1)(x1x′1 + x2x′2 + 1)
= (x1x′1)² + 2x1x′1x2x′2 + 2x1x′1 + (x2x′2)² + 2x2x′2 + 1
≅ ⟨(1, x1, x2, x1x2, x1², x2²), (1, x′1, x′2, x′1x′2, x′1², x′2²)⟩
= ⟨(1, √2·x1, √2·x2, √2·x1x2, x1², x2²), (1, √2·x′1, √2·x′2, √2·x′1x′2, x′1², x′2²)⟩
Some common kernels
• Polynomial kernel: K(x, x′) ≡ (x · x′ + 1)^d
• for d = 2:

(⟨(x1, x2), (x′1, x′2)⟩ + 1)²
= ⟨(1, √2·x1, √2·x2, √2·x1x2, x1², x2²), (1, √2·x′1, √2·x′2, √2·x′1x′2, x′1², x′2²)⟩
Similarity with the kernel on x is equivalent to dot-product similarity on a transformed feature vector 𝜙(𝒙)
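This equivalence is easy to verify numerically: the explicit d=2 feature map φ from the slide gives exactly the same value as the kernel (the function names are illustrative):

```python
import numpy as np

def phi(x):
    """Explicit feature map for the d=2 polynomial kernel."""
    x1, x2 = x
    r2 = np.sqrt(2.0)
    return np.array([1.0, r2 * x1, r2 * x2, r2 * x1 * x2, x1**2, x2**2])

def poly2(x, xp):
    """K(x, x') = (x . x' + 1)^2."""
    return (np.dot(x, xp) + 1.0) ** 2

x = np.array([1.0, 2.0])
xp = np.array([3.0, -1.0])
k_implicit = poly2(x, xp)                  # kernel on the original 2-d vectors
k_explicit = np.dot(phi(x), phi(xp))       # dot product in the 6-d feature space
```

The kernel computes a 6-dimensional inner product while only ever touching 2-dimensional vectors.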
Kernels 101
• Duality: two ways to look at this

Observation about perceptron:
ŷ = x · w = K(x, w), where w = Σ_{xk∈FN} xk − Σ_{xk∈FP} xk

Generalization of perceptron:
ŷ = φ(x) · w, where w = Σ_{xk∈FN} φ(xk) − Σ_{xk∈FP} φ(xk)
equivalently ŷ = Σ_{xk∈FN} K(x, xk) − Σ_{xk∈FP} K(x, xk), where K(x, xk) ≡ φ(x) · φ(xk)

Same behavior, but compute time/space are different:
– Explicitly map from x to φ(x) – i.e. to the point corresponding to x in the Hilbert space (RKHS).
– Implicitly map from x to φ(x) by changing the kernel function K.

Generalization: add weights to the sums for w.
Kernels 101
• Duality
• Gram matrix K: kij = K(xi, xj)
  – K(x, x′) = K(x′, x) → the Gram matrix is symmetric
  – K(x, x) > 0 → the diagonal of K is positive → K is “positive semi-definite” → zᵀ K z ≥ 0 for all z
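Both properties can be checked numerically for any kernel: build the Gram matrix and look at its symmetry and eigenvalues. A minimal sketch with a Gaussian kernel (the helper name and the sample points are illustrative):

```python
import numpy as np

def gram(xs, K):
    """Gram matrix: G[i][j] = K(xs[i], xs[j])."""
    return np.array([[K(xi, xj) for xj in xs] for xi in xs])

K = lambda a, b: float(np.exp(-np.sum((a - b) ** 2)))
xs = [np.array([0.0]), np.array([1.0]), np.array([2.5])]
G = gram(xs, K)

symmetric = np.allclose(G, G.T)
eigenvalues = np.linalg.eigvalsh(G)   # all >= 0 (up to rounding) iff PSD
```

For a valid kernel, the smallest eigenvalue is non-negative up to floating-point rounding, which is exactly the zᵀKz ≥ 0 condition.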
A FAMILIAR KERNEL
Learning as optimization for regularized logistic regression + hashes
• Algorithm:
• Initialize arrays W, A of size R and set k=0
• For each iteration t=1,…,T
  – For each example (xi, yi)
    • V is a hash table
    • For j : xj > 0, increment V[h[j]] by xj
    • pi = … ; k++
    • For each hash value h with V[h] > 0:
      » W[h] *= (1 − λ2µ)^(k−A[h])
      » W[h] = W[h] + λ(yi − pi)V[h]
      » A[h] = k

V[h] = Σ_{j: hash(j) % R == h} xij
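The hashed vector V from the formula above can be sketched as follows; `bucket` and `hash_vector` are illustrative names, and md5 stands in for whatever hash function the real implementation uses:

```python
import hashlib

def bucket(j, R):
    """Deterministic hash of feature name j into R buckets."""
    return int(hashlib.md5(j.encode()).hexdigest(), 16) % R

def hash_vector(x, R):
    """V[h] = sum of x_j over features j with hash(j) % R == h,
    where x is a sparse dict {feature name: value}."""
    V = [0.0] * R
    for j, xj in x.items():
        V[bucket(j, R)] += xj
    return V

x = {"word=cohen": 1.0, "word=perceptron": 2.0, "bias": 1.0}
V = hash_vector(x, R=8)
```

Colliding features share a bucket, so their values add; the total mass of the vector is preserved.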
Some details
φ[h] = Σ_{j: hash(j) % m == h} ξ(j) xij, where ξ(j) ∈ {−1, +1}

Slightly different hash to avoid systematic bias.

V[h] = Σ_{j: hash(j) % R == h} xij

m is the number of buckets you hash into (R in my discussion).
Some details
φ[h] = Σ_{j: hash(j) % m == h} ξ(j) xij, where ξ(j) ∈ {−1, +1}

Slightly different hash to avoid systematic bias.
I.e., for large feature sets the variance should be low
Some details
I.e. – a hashed vector is probably close to the original vector
Some details
I.e. the inner products between x and x’ are probably not changed too much by the hash function: a classifier will probably still work.
The Voted Perceptron for Ranking and Structured Classification
William Cohen
The voted perceptron for ranking
A sends B instances x1, x2, x3, x4, …; B computes ŷi = vk · xi and returns b*, the index of the “best” xi; A returns b, the index of the actual best.
If mistake: vk+1 = vk + xb − xb*
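One round of this ranking interaction can be sketched as follows (`ranking_perceptron_step` is an illustrative name):

```python
import numpy as np

def ranking_perceptron_step(v, xs, b):
    """B returns b* = argmax_i v . x_i; if b* != b (the index A says
    is actually best), update v_{k+1} = v_k + x_b - x_{b*}."""
    b_star = int(np.argmax([np.dot(v, x) for x in xs]))
    if b_star != b:
        v = v + xs[b] - xs[b_star]
    return v, b_star

v = np.zeros(2)
xs = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
v, _ = ranking_perceptron_step(v, xs, b=1)   # mistake: v moved toward x_b
v2, b_star = ranking_perceptron_step(v, xs, b=1)
```

After the one corrective update, B's argmax agrees with A's choice and no further update is made.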
(Figure) Ranking some x’s with the target vector u; margin γ.
(Figure) Ranking some x’s with some guess vector v – part 1.
(Figure) Ranking some x’s with some guess vector v – part 2. The purple-circled x is xb* – the one the learner has chosen to rank highest. The green-circled x is xb, the right answer.
(Figure) Correcting v by adding xb − xb*.
(Figure) Correcting v by adding xb − xb* (part 2): vk becomes vk+1.
(Figure) (3a) The guess v2 after the two positive examples: v2 = v1 + x2 (margin 2γ around u; progress > γ).
Notice this doesn’t depend at all on the number of x’s being ranked
Neither proof depends on the dimension of the x’s.
Ranking perceptrons → structured perceptrons

• The API:
  – A sends B a (maybe huge) set of items to rank
  – B finds the single best one according to the current weight vector
  – A tells B which one was actually best
• Structured classification on a sequence:
  – Input: list of words: x = (w1, …, wn)
  – Output: list of labels: y = (y1, …, yn)
  – If there are K classes, there are K^n labels possible for x
Borkar et al.: HMMs for segmentation
– Example: addresses, bib records
– Problem: some DBs may split records up differently (e.g. no “mail stop” field, combine address and apt #, …) or not at all
– Solution: learn to segment the textual form of records

P.P. Wangikar, T.P. Graycar, D.A. Estell, D.S. Clark, J.S. Dordick (1993) Protein and Solvent Engineering of Subtilisin BPN' in Nearly Anhydrous Organic Media. J. Amer. Chem. Soc. 115, 12231-12237.
Author | Year | Title | Journal | Volume | Page
IE with Hidden Markov Models
(Figure) An HMM for bibliography fields, with states Author, Title, Journal, Year, …
Transition probabilities along the arcs, e.g. 0.9, 0.5, 0.8, 0.2, 0.1.
Emission probabilities per state, e.g.: Title emits “Learning” 0.06, “Convex” 0.03, …; Journal emits “Comm.” 0.04, “Trans.” 0.02, “Chemical” 0.004, …; Author emits “Smith” 0.01, “Cohen” 0.05, “Jordan” 0.3, …; Year emits digit patterns “dddd” 0.8, “dd” 0.2.
Inference for linear-chain CRFs
When will prof Cohen post the notes …
Idea 1: features are properties of two adjacent tokens, and the pair of labels assigned to them (Begin, Inside, Outside):
• (y(i)==B or y(i)==I) and (token(i) is capitalized)
• (y(i)==I and y(i-1)==B) and (token(i) is hyphenated)
• (y(i)==B and y(i-1)==B) – e.g. “tell Rose William is on the way”

Idea 2: construct a graph where each path is a possible sequence labeling.
Inference for a linear-chain CRF
(Figure) A trellis with one column per token of “When will prof Cohen post the notes …”, each column containing the states B, I, O.
• Inference: find the highest-weight path given a weighting of features.
• This can be done efficiently using dynamic programming (Viterbi).
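The dynamic program over the trellis can be sketched as follows; the `score(i, prev, cur)` interface is an illustrative stand-in for the CRF's weighted feature sum, and `viterbi` is just a descriptive name:

```python
def viterbi(n, states, score):
    """Highest-weight path through an n-column trellis.  score(i, prev,
    cur) is the weight of labeling token i as `cur` after `prev`
    (prev is None at position 0)."""
    # best[s] = (best total weight of a path ending in s, that path)
    best = {s: (score(0, None, s), [s]) for s in states}
    for i in range(1, n):
        nxt = {}
        for cur in states:
            total, path = max(
                (best[prev][0] + score(i, prev, cur), best[prev][1])
                for prev in states)
            nxt[cur] = (total, path + [cur])
        best = nxt
    return max(best.values())[1]

# toy scoring: reward B at position 0, I everywhere after
path = viterbi(3, ["B", "I", "O"],
               lambda i, prev, cur: 1.0 if (i == 0 and cur == "B")
               or (i > 0 and cur == "I") else 0.0)
```

This is O(n · K²) for K states rather than the K^n cost of scoring every labeling.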
Ranking perceptrons → structured perceptrons

• New API:
  – A sends B the word sequence x
  – B finds the single best y according to the current weight vector, using Viterbi
  – A tells B which y was actually best
  – This is equivalent to ranking pairs g = (x, y′)
• Structured classification on a sequence:
  – Input: list of words: x = (w1, …, wn)
  – Output: list of labels: y = (y1, …, yn)
  – If there are K classes, there are K^n labels possible for x
The voted perceptron for ranking
A sends B instances x1, x2, x3, x4, …; B computes ŷi = vk · xi and returns b*, the index of the “best” xi; A returns b, the index of the actual best.
If mistake: vk+1 = vk + xb − xb*

Change number one is notation: replace x with g.
The voted perceptron for structured classification tasks
A sends B instances g1, g2, g3, g4, …; B computes ŷi = vk · gi and returns b*, the index of the “best” gi; A returns b.
If mistake: vk+1 = vk + gb − gb*

1. A sends B feature functions, and instructions for creating the instances g:
   • A sends a word vector xi. Then B could create the instances g1 = F(xi, y1), g2 = F(xi, y2), …
   • but instead B just returns the y* that gives the best score for the dot product vk · F(xi, y*), by using Viterbi.
2. A sends B the correct label sequence yi.
3. On errors, B sets vk+1 = vk + gb − gb* = vk + F(xi, yi) − F(xi, y*)
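One round of this protocol can be sketched as follows; here B finds y* by brute force over an explicit candidate list instead of Viterbi, and the feature function F, the candidates, and the function name are all illustrative stand-ins:

```python
import numpy as np

def structured_perceptron_step(v, F, x, y_true, candidates):
    """B finds y* = argmax_y v . F(x, y); on error,
    v_{k+1} = v_k + F(x, y_true) - F(x, y*)."""
    y_star = max(candidates, key=lambda y: np.dot(v, F(x, y)))
    if y_star != y_true:
        v = v + F(x, y_true) - F(x, y_star)
    return v, y_star

# toy feature function: one-hot on the label sequence
F = lambda x, y: np.array([1.0, 0.0]) if y == "a" else np.array([0.0, 1.0])
v = np.zeros(2)
v, _ = structured_perceptron_step(v, F, "some words", "b", ["a", "b"])
v2, y_star = structured_perceptron_step(v, F, "some words", "b", ["a", "b"])
```

Replacing the brute-force argmax with Viterbi is what makes the same update tractable when the candidate set has K^n elements.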
EMNLP 2002, Best paper
Results from the original paper….
Collins’ experiments
• POS tagging
• NP chunking (words and POS tags from Brill’s tagger as features) and BIO output tags
• Compared logistic regression methods (MaxEnt) and “voted-perceptron-trained HMMs”
  – with and w/o averaging
  – with and w/o feature selection (count > 5)
Collins’ results
Where we are…
• Experiments with a hash-trick implementation of logistic regression
• Next question:
  – how do you parallelize SGD, or more generally, this kind of streaming algorithm?
  – each example affects the next prediction → order matters → parallelization changes the behavior
  – we will step back to perceptrons and then step forward to parallel perceptrons
  – then another nice parallel learning algorithm – then a midterm