Post on 11-Jan-2016
Magic Moments: Moment-based Approaches to Structured Output Prediction
Elisa Ricci, joint work with Nobuhisa Ueda, Tijl De Bie, Nello Cristianini
Thursday, October 25th
The Analysis of Patterns
Outline
Learning in structured output spaces
New algorithms based on Z-score
Experimental results and computational issues
Conclusions
Structured data everywhere! Many problems involve highly structured data which can be represented by sequences, trees and graphs.
Temporal, spatial and structural dependencies between objects are modeled.
This phenomenon is observed in several fields such as computational biology, computer vision, natural language processing or web data analysis.
Learning with structured data
Machine learning and data mining algorithms must be able to analyze efficiently and automatically a vast amount of complex and structured data.
The goal of structured learning algorithms is to predict complex structures, such as sequences, trees, or graphs.
Using traditional algorithms to cope with problems involving structured data often implies a loss of information about the structure.
Supervised learning: data are available in the form of examples and their associated correct answers.

Training set: $T = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_\ell, y_\ell)\}$, $\mathbf{x}_i \in \mathcal{X}$, $y_i \in \mathcal{Y}$
Hypothesis space: $\mathcal{H} = \{h : \mathcal{X} \rightarrow \mathcal{Y}\}$
Learning: find $h \in \mathcal{H}$ s.t. $h(\mathbf{x}_i) = y_i$, $i = 1, \ldots, \ell$
Prediction: $y = h(\mathbf{x})$ on a new test sample $\mathbf{x}$.
Classification: a typical supervised learning task.
Named entity recognition (NER): locate named entities in text. Entities of interest are person names, location names, organization names, miscellaneous (dates, times...)
Label: entity tag.
Observed variable: word in a sentence.
PP ESTUDIA YA PROYECTO LEY TV REGIONAL REMITIDO POR LA JUNTA Merida.
O N N N M m m N N N O L
(Diagram: in multiclass classification the whole input x is mapped to a single label y.)
Sequence labeling
Sequence labeling: given an input sequence x, reconstruct the associated label sequence y of equal length.
Label sequence: entity tags.
Observed sequence: words in a sentence.
Can we consider the interactions between adjacent words?
Goal: realize a joint labeling for all the words in the sentence.
PP ESTUDIA YA PROYECTO LEY TV REGIONAL REMITIDO POR LA JUNTA Merida.
O N N N M m m N N N O L
y = (y1...yn)
x = (x1...xn)
Sequence alignment
Biological sequence alignment is used to determine the similarity between biological sequences.
ACTGATTACGTGAACTGGATCCA
ACTC--TAGGTGAAGTG-ATCCA?
Σ = {A, T, G, C}; S1, S2 ∈ Σ*.
Given two sequences S1, S2 ∈ Σ*, a global alignment is an assignment of gaps so as to line up each letter in one sequence with either a gap or a letter in the other sequence.
Example: S1 = ATGCTTTC, S2 = CTGTCGCC (their alignment is shown graphically on the slide).
Sequence alignment: given a pair of sequences x, predict the correct sequence y of alignment operations (e.g. matches, mismatches, gaps).
Alignments can be represented as paths from the upper-left to the lower-right corner in the alignment graph.
Sequence alignment
(Figure: the pair x = (S1, S2) = (ATGCTTTC, CTGTCGCC) and its alignment y, drawn as a path in the alignment graph.)
RNA secondary structure prediction
RNA secondary structure prediction: given an RNA sequence, predict the most likely secondary structure.
The study of RNA structure is important in understanding its functions.
AUGAGUAUAAGUUAAUGGUUAAAGUAAAUGUCUUCCACACAUUCCAUCUGAUUUCGAUUCUCACUACUCAU
Sequence parsing
Sequence parsing: given an input sequence x, determine the associated parse tree y given an underlying context-free grammar.
Example: x = GAUCGAUCGAUC; y is the associated parse tree (drawn on the slide).
Context-free grammar G = {V, Σ, R, S}:
V = {S}: set of non-terminal symbols
Σ = {G, A, U, C}: set of terminal symbols
R = {S → SS | GSC | CSG | ASU | USA | ε}: set of production rules
Traditionally HMMs have been used for sequence labeling.
Two main drawbacks:
- The conditional independence assumptions are often too restrictive: HMMs cannot represent multiple interacting features or long-range dependencies between the observations.
- They are typically trained by maximum likelihood (ML) estimation.
Label sequence y = (y1...yn), observed sequence x = (x1...xn).
(Graphical model: HMM chain over states y1, y2, y3 with observations x1, x2, x3.)
Generative models
Sequence labeling:
Discriminative models
Specify the probability of a possible output y given an observation x (the conditional probability P(y|x) is considered rather than the joint probability P(y,x)).
Do not require strict independence assumptions of generative models.
Arbitrary features of the observations are considered.
Conditional Random Fields (CRFs) [Lafferty et al., 01]
(Graphical model: CRF chain over labels y1, y2, y3 conditioned on the observations x1, x2, x3.)
Learning in structured output spaces
Several discriminative algorithms have emerged recently in order to predict complex structures, such as sequences, trees, or graphs.
New discriminative approaches.
Problems analyzed:
- Given a training set of correct pairs of sentences and their associated entity tags, learn to extract entities from a new sentence.
- Given a training set of correct biological alignments, learn to align two unknown sequences.
- Given a training set of correct RNA secondary structures associated to a set of sequences, learn to determine the secondary structure of a new sequence.
This is not an exhaustive list of possible applications.
Learning in structured output spaces: multilabel supervised classification (output: y = (y1...yn)).

Training set: $T = \{(\mathbf{x}_1, \mathbf{y}_1), \ldots, (\mathbf{x}_\ell, \mathbf{y}_\ell)\}$, $\mathbf{x}_i \in \mathcal{X}$, $\mathbf{y}_i \in \mathcal{Y}$
Hypothesis space: $\mathcal{H} = \{h : \mathcal{X} \rightarrow \mathcal{Y}\}$
Learning: find $h \in \mathcal{H}$ s.t. $h(\mathbf{x}_i) = \mathbf{y}_i$, $i = 1, \ldots, \ell$
Prediction: $\mathbf{y} = h(\mathbf{x})$ on a new test sample $\mathbf{x}$, with

$$h(\mathbf{x}) = \arg\max_{\mathbf{y}} \mathrm{Score}(\mathbf{x}, \mathbf{y}), \qquad \mathrm{Score}(\mathbf{x}, \mathbf{y}) = s(\mathbf{x}, \mathbf{y}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}, \mathbf{y}), \quad \mathbf{w} \in \mathbb{R}^d.$$
Three main phases:
Encoding: define a suitable feature map (x,y).
Compression: characterize the output space in a synthetic and compact way.
Optimization: define a suitable objective function and use it for learning.
Encoding
Features must be defined in a way such that prediction can be computed efficiently.
The feature vector φ(x,y) decomposes as a sum of elementary features over "parts".
Parts are typically edges or nodes in graphs.
$$h(\mathbf{x}) = \arg\max_{\mathbf{y}} \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}, \mathbf{y})$$

The set of candidate outputs y is typically huge.
(Figure: alignment graph for S1 = ATGCTTTC, S2 = CTGTCGCC.)
Encoding
Transition features: $\phi^t_{pz}(\mathbf{y}) = \sum_{k} I(y_{k-1}=p)\,I(y_k=z)$
Emission features: $\phi^e_{pq}(\mathbf{x},\mathbf{y}) = \sum_{k} I(x_k=q)\,I(y_k=p)$
In general features reflect long range interactions (when labeling xi past and future observations are taken into account).
Arbitrary features of the observations are considered (e.g. spelling properties in NER).
Sequence labeling:
Example: CRF with HMM features
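As a concrete illustration of these counting features, here is a minimal sketch; the tag set and the toy sentence fragment are invented for illustration:

```python
from collections import Counter

def hmm_features(x, y):
    """Count HMM-style features for a labeled sequence:
    transitions (y[k-1], y[k]) and emissions (y[k], x[k])."""
    transitions = Counter(zip(y[:-1], y[1:]))
    emissions = Counter(zip(y, x))
    return transitions, emissions

# Toy example (invented): words tagged O (outside) / N (name)
x = ["PP", "ESTUDIA", "YA", "PROYECTO"]
y = ["O", "N", "N", "N"]
t, e = hmm_features(x, y)
print(t[("N", "N")])   # 2: two N -> N transitions
print(e[("N", "YA")])  # 1: state N emits "YA" once
```

The two counters together play the role of the transition and emission components of φ(x, y).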
3-parameter model: one cost each for matches, mismatches and gaps.
In practice more complex models are used:
4-parameter model: affine function for gap penalties, i.e. different costs if the gap starts in a given position (gap opening penalty) or if it continues (gap extension penalty).
211/212-parameter model: φ(x,y) contains the statistics associated to the gap penalties and all the possible pairs of amino acids.
Encoding
Example: for the alignment of ATGCTTTC and CTGTCGCC shown on the slide,
$\boldsymbol{\phi}(\mathbf{x},\mathbf{y})$ = (#matches, #mismatches, #gaps) = (4, 1, 4).
Sequence alignment:
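These alignment statistics can be read directly off an aligned pair of rows; a small sketch, where the two aligned strings are made up and '-' marks gaps:

```python
def alignment_features(a, b):
    """Count (matches, mismatches, gaps) in two aligned rows of
    equal length, where '-' denotes a gap in either row."""
    assert len(a) == len(b)
    matches = sum(1 for p, q in zip(a, b) if p == q and p != '-')
    gaps = sum(1 for p, q in zip(a, b) if p == '-' or q == '-')
    mismatches = len(a) - matches - gaps
    return matches, mismatches, gaps

# Invented aligned rows, just to exercise the counts
print(alignment_features("ACTGA-T", "AC-GACT"))  # (5, 0, 2)
```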
Encoding
For the example parse, $\boldsymbol{\phi}(\mathbf{x},\mathbf{y})$ collects the rule counts (2, 2, 2, 1, 1, 1 on the slide) for the rules:
S → SS
S → GSC
S → CSG
S → ASU
S → USA
S → ε
Sequence parsing:
(Figure: x = GAUCGAUCGAUC and its parse tree y.)
The feature vector contains the statistics associated to the occurrences of the rules.
Encoding: having defined these features, predictions can be computed efficiently with dynamic programming (DP):
- Sequence labeling → Viterbi algorithm
- Sequence alignment → Needleman-Wunsch algorithm
- Sequence parsing → Cocke-Younger-Kasami (CYK) algorithm
(Figure: DP table for aligning ATGCTTTC and CTGTCGCC.)
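For sequence labeling the argmax decomposes along the chain, so Viterbi applies directly. A minimal additive-score sketch; the states and the transition/emission scores are invented toy values, not taken from the slides:

```python
def viterbi(obs, states, trans, emit):
    """Find the highest-scoring label sequence for obs.
    trans[(p, z)] and emit[(p, q)] are additive scores (default 0)."""
    # delta[s] = best score of a label prefix ending in state s
    delta = {s: emit.get((s, obs[0]), 0.0) for s in states}
    back = []
    for o in obs[1:]:
        prev, delta, ptr = delta, {}, {}
        for s in states:
            best = max(states, key=lambda p: prev[p] + trans.get((p, s), 0.0))
            delta[s] = prev[best] + trans.get((best, s), 0.0) + emit.get((s, o), 0.0)
            ptr[s] = best
        back.append(ptr)
    # Backtrack from the best final state
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

# Toy scores (invented)
states = ["O", "N"]
trans = {("O", "O"): 1.0, ("N", "N"): 1.0}
emit = {("N", "Merida"): 2.0, ("O", "LA"): 2.0}
print(viterbi(["LA", "LA", "Merida"], states, trans, emit))  # ['O', 'O', 'N']
```

With weights w learned by any of the methods in these slides, the same recursion computes $\arg\max_{\mathbf{y}} \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x},\mathbf{y})$.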
Computing moments
The number N of possible output vectors y_k given an observation x is typically huge.
To characterize the distribution of the scores, its mean and its variance are considered.
The mean vector and the covariance matrix C can be computed efficiently with DP techniques.
$$\mu_s(\mathbf{x}) = \frac{1}{N}\sum_{k=1}^{N} \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}, \mathbf{y}_k) = \mathbf{w}^T \bar{\boldsymbol{\phi}}(\mathbf{x}), \qquad \bar{\boldsymbol{\phi}}(\mathbf{x}) = \frac{1}{N}\sum_{k=1}^{N} \boldsymbol{\phi}(\mathbf{x}, \mathbf{y}_k)$$

$$\sigma_s^2(\mathbf{x}) = \frac{1}{N}\sum_{k=1}^{N} \left(\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}, \mathbf{y}_k) - \mathbf{w}^T \bar{\boldsymbol{\phi}}(\mathbf{x})\right)^2 = \mathbf{w}^T C \mathbf{w}$$
Forward-style recursion for the mean of the emission feature $\phi^e_{pq}$:

Input: x = (x1, x2, ..., xn), state p, symbol q.
N(i, 1) := 1 for every state i
F(i, 1) := 1 if (q = x1) and (p = i), else 0
for j = 2 to n
  for each state i
    M := 1 if (q = xj) and (p = i), else 0
    N(i, j) := Σ_{i'} N(i', j-1)
    F(i, j) := Σ_{i'} F(i', j-1) + M · N(i, j)
  endfor
endfor
Output: mean value $\mu^e_{pq} = \sum_i F(i, n) \big/ \sum_i N(i, n)$
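A forward-style DP for the mean emission-feature count can be sketched in code: N(i, j) counts the label prefixes ending in state i at position j, and F(i, j) accumulates emission-feature counts over those prefixes. This is a sketch assuming a uniform distribution over label sequences, not the authors' exact implementation; the states and input are toy values, checked against brute-force enumeration:

```python
from itertools import product

def mean_emission_count(x, states, p, q):
    """Mean of sum_k I(x_k == q and y_k == p) over all |states|^n
    label sequences y, computed with a forward-style DP."""
    N = {i: 1 for i in states}  # prefixes of length 1 ending in state i
    F = {i: 1.0 if (q == x[0] and p == i) else 0.0 for i in states}
    for xj in x[1:]:
        M = 1.0 if q == xj else 0.0
        totN, totF = sum(N.values()), sum(F.values())
        N = {i: totN for i in states}
        F = {i: totF + (M if p == i else 0.0) * totN for i in states}
    return sum(F.values()) / sum(N.values())

def brute(x, states, p, q):
    """Brute-force check: enumerate every label sequence."""
    n, vals = len(x), []
    for y in product(states, repeat=n):
        vals.append(sum(1 for k in range(n) if x[k] == q and y[k] == p))
    return sum(vals) / len(vals)

x = ["a", "b", "a"]
print(mean_emission_count(x, ["s1", "s2"], "s1", "a"))  # 1.0
print(brute(x, ["s1", "s2"], "s1", "a"))                # 1.0
```

The DP visits each position once per state, while the brute force enumerates exponentially many sequences; both agree on the toy input.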
Computing moments
The number N of possible label sequences y_k given an observation sequence x is exponential in the length of the sequence.
An algorithm similar to the forward algorithm is used to compute the mean vector and C.
Sequence labeling: recursive formula for the mean value $\mu^e_{pq} = \frac{1}{N}\sum_k \phi^e_{pq}(\mathbf{x}, \mathbf{y}_k)$, associated to the feature which represents the emission of a symbol q at state p.
Computing moments
Basic idea behind the recursive formulas:

Mean values are computed incrementally:
$$E\left[\sum_{i=1}^{k} a_i\right] = E\left[\sum_{i=1}^{k-1} a_i\right] + E[a_k]$$

Second-order moments are expanded in the same incremental fashion:
$$E\left[\Big(\sum_{i=1}^{k} a_i\Big)^2\right] = E\left[\Big(\sum_{i=1}^{k-1} a_i\Big)^2\right] + 2\,E\left[a_k \sum_{i=1}^{k-1} a_i\right] + E\big[a_k^2\big]$$

Variances are computed by centering the second-order moments:
$$\mathrm{Var}\left[\sum_{i=1}^{k} a_i\right] = E\left[\Big(\sum_{i=1}^{k} a_i\Big)^2\right] - E\left[\sum_{i=1}^{k} a_i\right]^2$$
Computing moments
Problem: high computational cost for large feature spaces.
1st solution: exploit the structure and the sparseness of the covariance matrix C. In sequence labeling for a CRF with HMM features, the number of different values in C is linear in the size of the observation alphabet.
2nd solution: sampling strategy.
Z-score
New optimization criterion particularly suited for non-separable cases.
Minimize the number of output vectors with score higher than the score of the correct pairs.
Maximize the Z-score:
$$Z(\mathbf{x},\mathbf{y}) = \frac{s(\mathbf{x},\mathbf{y}) - \mu_s(\mathbf{x})}{\sigma_s(\mathbf{x})}$$
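On tiny problems the Z-score can be checked by enumerating every candidate output explicitly; the feature map and the weights below are invented for illustration:

```python
import math
from itertools import product

def z_score(w, phi, x, y_true, candidates):
    """Z = (s(x, y) - mean of scores) / std of scores, where the
    scores s(x, y') = w . phi(x, y') range over all candidates y'."""
    scores = [sum(wi * fi for wi, fi in zip(w, phi(x, yc))) for yc in candidates]
    mu = sum(scores) / len(scores)
    var = sum((s - mu) ** 2 for s in scores) / len(scores)
    s_true = sum(wi * fi for wi, fi in zip(w, phi(x, y_true)))
    return (s_true - mu) / math.sqrt(var)

# Toy labeling problem (invented): phi counts positions matching x
# and adjacent (1, 1) label pairs
def phi(x, y):
    return [sum(1 for a, b in zip(x, y) if a == b),
            sum(1 for a, b in zip(y, y[1:]) if a == b == 1)]

x = (1, 0, 1)
candidates = list(product([0, 1], repeat=3))
print(round(z_score([1.0, 0.5], phi, x, (1, 0, 1), candidates), 4))  # 1.3363
```

A large Z means the correct output's score sits far above the bulk of the score distribution, which is exactly what the learning criterion maximizes.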
Z-score
The Z-score can be expressed as a function of the parameters w.
Two equivalent optimization problems:
$$Z(\mathbf{x},\mathbf{y}) = \frac{s(\mathbf{x},\mathbf{y}) - \mu_s(\mathbf{x})}{\sigma_s(\mathbf{x})} = \frac{\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x},\mathbf{y}) - \mathbf{w}^T\bar{\boldsymbol{\phi}}(\mathbf{x})}{\sqrt{\mathbf{w}^T C \mathbf{w}}} = \frac{\mathbf{w}^T \mathbf{b}}{\sqrt{\mathbf{w}^T C \mathbf{w}}}, \qquad \mathbf{b} = \boldsymbol{\phi}(\mathbf{x},\mathbf{y}) - \bar{\boldsymbol{\phi}}(\mathbf{x})$$

$$\max_{\mathbf{w}} \frac{\mathbf{w}^T \mathbf{b}}{\sqrt{\mathbf{w}^T C \mathbf{w}}} \qquad \Longleftrightarrow \qquad \min_{\mathbf{w}} \frac{1}{N}\sum_{k=1}^{N} \left[\mathbf{w}^T\big(\boldsymbol{\phi}(\mathbf{x},\mathbf{y}) - \boldsymbol{\phi}(\mathbf{x},\mathbf{y}_k)\big)\right]^2 \ \ \text{s.t. } \mathbf{w}^T\mathbf{b} = 1$$
Z-score
Ranking loss: the number of output vectors with score higher than the score of the correct pair,
$$L_{rk}(\mathbf{x},\mathbf{y}) = \frac{1}{N}\sum_{k=1}^{N} I\!\left(s(\mathbf{x},\mathbf{y}_k) \geq s(\mathbf{x},\mathbf{y})\right)$$

An upper bound on the ranking loss is minimized:
$$L_{rk}(\mathbf{x},\mathbf{y}) \leq L_{urk}(\mathbf{w},\mathbf{x},\mathbf{y}) = \frac{1}{N}\sum_{k=1}^{N}\left[1 - \mathbf{w}^T\!\left(\boldsymbol{\phi}(\mathbf{x},\mathbf{y}) - \boldsymbol{\phi}(\mathbf{x},\mathbf{y}_k)\right)\right]^2$$
Previous approaches:
- Minimize the number of incorrect macrolabels y, $L_{0/1}(\mathbf{x},\mathbf{y}) = I(h(\mathbf{x}) \neq \mathbf{y})$: CRFs [Lafferty et al., 01], HMSVM [Altun et al., 03], averaged perceptron [Collins, 02].
- Minimize the number of incorrect microlabels $y_j$, $L_{m}(\mathbf{x},\mathbf{y}) = \sum_j I(h(\mathbf{x})_j \neq y_j)$: M3Ns [Taskar et al., 03], SVMISO [Tsochantaridis et al., 04].
SODA
Given a training set T, the empirical risk associated to the upper bound on the ranking loss is minimized:

$$\min_{\mathbf{w}} \sum_{i=1}^{\ell} \frac{1}{N_i}\sum_{k=1}^{N_i} \left[\mathbf{w}^T\!\left(\boldsymbol{\phi}(\mathbf{x}_i,\mathbf{y}_i) - \boldsymbol{\phi}(\mathbf{x}_i,\mathbf{y}_{ik})\right)\right]^2 \quad \text{s.t. } \sum_{i=1}^{\ell} \mathbf{w}^T \mathbf{b}_i = 1$$

An equivalent formulation in terms of C and b is considered to solve it. SODA (Structured Output Discriminant Analysis):

$$\max_{\mathbf{w}} \frac{\sum_{i=1}^{\ell} \mathbf{w}^T \mathbf{b}_i}{\sqrt{\sum_{i=1}^{\ell} \mathbf{w}^T \left(C_i + \mathbf{b}_i \mathbf{b}_i^T\right)\mathbf{w}}}$$
SODA convex optimization:

$$\max_{\mathbf{w}} \frac{\mathbf{w}^T \mathbf{b}^*}{\sqrt{\mathbf{w}^T C^* \mathbf{w}}} \quad \Longleftrightarrow \quad \min_{\mathbf{w}} \mathbf{w}^T C^* \mathbf{w} \ \ \text{s.t. } \mathbf{w}^T \mathbf{b}^* = 1$$

If C* is not PSD, regularization can be introduced.
Solution: simple matrix inversion, $\mathbf{w} \propto {C^*}^{-1}\mathbf{b}^*$.
Fast conjugate gradient methods are available.
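The slides note that the solution reduces to a simple matrix inversion, with fast conjugate gradient methods available; below is a pure-Python sketch of that route. The toy C and b, and the regularization constant lam, are assumptions for illustration:

```python
def conjugate_gradient(A, b, iters=50, tol=1e-10):
    """Solve A x = b for a symmetric positive-definite matrix A
    (lists of lists) by the standard conjugate gradient iteration."""
    n = len(b)
    x = [0.0] * n
    r = b[:]          # residual b - A x, with x = 0
    p = r[:]
    rs = sum(v * v for v in r)
    for _ in range(iters):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(v * v for v in r)
        if rs_new < tol:
            break
        p = [r[i] + (rs_new / rs) * p[i] for i in range(n)]
        rs = rs_new
    return x

def soda_weights(C, b, lam=1e-6):
    """w proportional to (C + lam*I)^(-1) b, rescaled so that w.b = 1.
    lam regularizes C when it is not positive definite (assumed value)."""
    n = len(b)
    A = [[C[i][j] + (lam if i == j else 0.0) for j in range(n)] for i in range(n)]
    w = conjugate_gradient(A, b)
    scale = sum(w[i] * b[i] for i in range(n))
    return [wi / scale for wi in w]

# Synthetic moments (invented), just to exercise the solver
C = [[2.0, 0.3], [0.3, 1.0]]
b = [1.0, 0.5]
w = soda_weights(C, b)
print(round(sum(wi * bi for wi, bi in zip(w, b)), 6))  # 1.0
```

In practice b* and C* would come from the moment computations of the previous slides; only the linear solve changes with problem size.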
Rademacher bound: the bound shows that learning based on the upper bound on the ranking loss is effectively achieved.
The bound holds also in the case where b* and C* are estimated by sampling. Two directions of sampling:
- For each $(\mathbf{x},\mathbf{y}) \in T$, only a limited number n of incorrect outputs is considered to estimate b* and C*.
- Only a finite number ℓ of input-output pairs is given in the training set.
The empirical expectation of the estimated loss $\hat{E}[\hat{L}_{urk}(\mathbf{x},\mathbf{y})]$ (estimated by computing b* and C* by random sampling) is a good approximate upper bound for the expected loss $E[L_{urk}(\mathbf{x},\mathbf{y})]$.
The latter is an upper bound for the ranking loss $L_{rk}(\mathbf{x},\mathbf{y})$, such that the Rademacher bound is also a bound on the expectation of the ranking loss.
Rademacher bound
Theorem (Rademacher bound for SODA). With probability at least 1-δ over the joint draw of the random sample T and of the random samples from the output space, taken for each $(\mathbf{x},\mathbf{y}) \in T$ to approximate b* and C*, the following bound holds for any w with squared norm smaller than c:

$$E\!\left[L_{urk}(\mathbf{x},\mathbf{y})\right] \leq \hat{E}\!\left[\hat{L}_{urk}(\mathbf{x},\mathbf{y})\right] + \hat{R}_1 + \hat{R}_2 + M\sqrt{\frac{\log(3/\delta)}{2n}} + M\sqrt{\frac{\log(3/\delta)}{2\ell}}$$

whereby M is a constant and we assume that the number of random samples for each training pair is equal to n. The Rademacher complexity terms $\hat{R}_1$ and $\hat{R}_2$ decrease with n and ℓ respectively, such that the bound becomes tight for increasing n and ℓ, as long as n grows faster than log(ℓ).
Z-score approach
How to define the Z-score of a training set? Another possible approach (independence assumption):
Convex optimization problem which can be solved again by simple matrix inversion.
Maximizing the Z-score, most of the linear constraints
$$\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_i,\mathbf{y}_i) \geq \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_i,\mathbf{y}_{ik}), \qquad \forall\, \mathbf{y}_{ik} \neq \mathbf{y}_i,\ i = 1,\ldots,\ell$$
are satisfied.

Z-score approach:
$$\max_{\mathbf{w}} \frac{\sum_{i=1}^{\ell} \mathbf{w}^T \mathbf{b}_i}{\sqrt{\sum_{i=1}^{\ell} \mathbf{w}^T C_i \mathbf{w}}}$$
One may want to impose explicitly the violated constraints.
This is again a convex optimization problem that can be solved with an iterative algorithm similar to previous approaches (HMSVM [Altun et al., 03], averaged perceptron [Collins, 02]).
The constraints can possibly be relaxed (e.g. by adding slack variables for non-separable problems).
Iterative approach
QP:
$$\min_{\mathbf{w}} \mathbf{w}^T C^* \mathbf{w} \quad \text{s.t. } \mathbf{w}^T \mathbf{b}^* = 1, \quad \mathbf{w}^T\!\left(\boldsymbol{\phi}(\mathbf{x}_i,\mathbf{y}_i) - \boldsymbol{\phi}(\mathbf{x}_i,\mathbf{y}'_i)\right) \geq 0 \ \ \forall\, (\mathbf{x}_i,\mathbf{y}'_i) \in \mathcal{C}$$

Input: training set T
1: 𝒞 ← ∅
2: Compute b_i, C_i for all i = 1...ℓ
3: Compute b* = sum(b_i), C* = sum(C_i)
4: Find w solving QP.
5: repeat
6:   for i = 1...ℓ do
7:     Compute y'_i = argmax_y w^T φ(x_i, y)
8:     if w^T φ(x_i, y'_i) > w^T φ(x_i, y_i)
9:       𝒞 ← 𝒞 ∪ { w^T (φ(x_i, y_i) - φ(x_i, y'_i)) ≥ 0 }
10:      Find w solving QP s.t. 𝒞
11:    endif
12:  endfor
13: until 𝒞 is not changed during the current iteration.
Iterative approach
Flow of the iterative approach: moments computation → Z-score maximization → identify the most violated constraint → constrained Z-score maximization (the last two steps repeat until convergence).
Experimental results
Chain CRF with HMM features. Sequence length: 50. Training set size: 20 pairs. Test set size: 100 pairs. Comparison with SVMISO [Tsochantaridis et al., 04], Perceptron [Collins, 02], CRFs [Lafferty et al., 01]. Average number of incorrect labels varying the level of noise p.
Sequence labeling: artificial data.
(Plots: test error as a function of the noise level p for SODA, Perceptron, SVMISO, CRFs and Z-score.)
HMM features. Noise level p = 0.2. Average number of incorrect labels and computational time as functions of the training set size.
Experimental results
Sequence labeling: artificial data.
(Plots: test error vs. training set size for CRFs, SVMISO, Perceptron and SODA; training time vs. training set size for SODA and SVMISO.)
Chain CRF with HMM features. Sequence length: 10. Training set size: 50 pairs. Test set size: 100 pairs. Level of noise p = 0.2. Comparison with SVMISO [Tsochantaridis et al., 04]. Labeling error on test set and average training time as functions of the observation alphabet size.
(Plots: training time and test error vs. observation alphabet size for SODA (50 paths), SODA (200 paths), SODA (DP) and SVMISO.)

Sequence labeling: artificial data.
Experimental results
Chain CRF with HMM features.
Adding constraints is not very useful when data are noisy and not linearly separable.
(Plot: average number of correct hidden sequences (%) vs. number of constraints for Z-score (constr), SVMISO and Perceptron.)
Sequence labeling: artificial data.
Experimental results
Sequence labeling:
NER
Spanish news wire articles - Special Session of CoNLL-02.
300 sentences with an average length of 30 words. 9 labels: non-name, beginning and continuation of persons, organizations, locations and miscellaneous names.
Two sets of binary features: S1 (HMM features) and S2 (S1 plus HMM features for the previous and the next word).
Labeling error on test set (5-fold cross-validation):
Method S1 S2
Z-score 11.07 7.89
SODA 10.13 8.27
SVMISO 10.97 8.11
Perceptron 20.99 13.78
CRFs 12.01 8.29
Experimental results
Sequence alignment: artificial sequences.
Training set size 5 10 20 50 100
SODA 78.6 62.85 44.6 36.7 30.84
Generative 96.4 94.39 87.12 45.31 31.05
Test error (number of incorrectly aligned pairs) as function of the training set size.
Original and reconstructed substitution matrices.
Experimental results
Sequence parsing:
G6 grammar in [Dowell and Eddy, 2004]. RNA sequences of five families extracted from the Rfam database [Griffiths-Jones et al., 2003].
Prediction on five-fold cross-validation.
Family | Z-score with constraints (sensitivity / specificity / constraints) | Generative (sensitivity / specificity) | Perceptron (sensitivity / specificity)
RF00032 | 100 / 95.98 / 2 | 100 / 95.53 | 100 / 95.59
RF00260 | 98.77 / 94.80 / 6 | 98.97 / 100 | 98.57 / 98.90
RF00436 | 91.11 / 90.61 / 27.6 | 44.16 / 53.30 | 90.27 / 86.53
RF00164 | 76.14 / 73.74 / 37.8 | 65.51 / 62.55 | 87.06 / 78.32
RF00480 | 99.08 / 89.89 / 78.2 | 99.88 / 86.43 | 98.83 / 94.78
Conclusions
New methods for learning in structured output spaces:
- Accuracy comparable with state-of-the-art techniques.
- Easy to implement (DP for matrix computations and a simple optimization problem).
- Fast for large training sets and a reasonable number of features.
- Mean and variance computations are parallelizable for large training sets.
- Conjugate gradient techniques are used in the optimization phase.
Three applications analyzed: sequence labeling, sequence parsing and sequence alignment.
Future work: test the scalability of this approach using approximate techniques; develop a dual version with kernels.
Thank you