Post on 11-Jan-2016
Magic Moments: Moment-based Approaches to Structured Output Prediction
Elisa Ricci, joint work with Nobuhisa Ueda, Tijl De Bie, Nello Cristianini
Thursday, October 25th
The Analysis of Patterns
Outline
Learning in structured output spaces
New algorithms based on Z-score
Experimental results and computational issues
Conclusions
Structured data everywhere! Many problems involve highly structured data which can be represented by sequences, trees and graphs.
Temporal, spatial and structural dependencies between objects are modeled.
This phenomenon is observed in several fields such as computational biology, computer vision, natural language processing or web data analysis.
Learning with structured data
Machine learning and data mining algorithms must be able to analyze efficiently and automatically a vast amount of complex and structured data.
The goal of structured learning algorithms is to predict complex structures, such as sequences, trees, or graphs.
Using traditional algorithms to cope with problems involving structured data often implies a loss of information about the structure.
Supervised learning: data are available in the form of examples and their associated correct answers.

Training set: $T = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_\ell, y_\ell)\}$, $\mathbf{x}_i \in \mathcal{X}$, $y_i \in \mathcal{Y}$
Hypothesis space: $\mathcal{H} = \{h : \mathcal{X} \rightarrow \mathcal{Y}\}$
Learning: find $h \in \mathcal{H}$ s.t. $h(\mathbf{x}_i) = y_i$, $i = 1, \ldots, \ell$
Prediction: $y = h(\mathbf{x})$ on a new test sample $\mathbf{x}$.
Classification: a typical supervised learning task.
Named entity recognition (NER): locate named entities in text. Entities of interest are person names, location names, organization names, miscellaneous (dates, times...)
Label: entity tag.
Observed variable: word in a sentence.
PP ESTUDIA YA PROYECTO LEY TV REGIONAL REMITIDO POR LA JUNTA Merida.
O N N N M m m N N N O L
(Diagram: in multiclass classification the whole input x is mapped to a single label y.)
Sequence labeling
Sequence labeling: given an input sequence x, reconstruct the associated label sequence y of equal length.
Label sequence: entity tags.
Observed sequence: words in a sentence.
Can we consider the interactions between adjacent words?
Goal: realize a joint labeling for all the words in the sentence.
PP ESTUDIA YA PROYECTO LEY TV REGIONAL REMITIDO POR LA JUNTA Merida.
O N N N M m m N N N O L
y = (y1...yn)
x = (x1...xn)
Sequence alignment
Biological sequence alignment is used to determine the similarity between biological sequences.
ACTGATTACGTGAACTGGATCCA
ACTC--TAGGTGAAGTG-ATCCA?
Σ = {A, T, G, C}; S1, S2 ∈ Σ*.
Given two sequences S1, S2 ∈ Σ*, a global alignment is an assignment of gaps so as to line up each letter in one sequence with either a gap or a letter in the other sequence.
Example: S1 = ATGCTTTC, S2 = CTGTCGCC (their alignment is shown graphically on the slide).
Sequence alignment: given a pair of sequences x, predict the correct sequence y of alignment operations (e.g. matches, mismatches, gaps).
Alignments can be represented as paths from the upper-left to the lower-right corner in the alignment graph.
Sequence alignment
(Figure: the pair x = (S1, S2) = (ATGCTTTC, CTGTCGCC) and its alignment y, drawn as a path in the alignment graph.)
RNA secondary structure prediction
RNA secondary structure prediction: given an RNA sequence, predict the most likely secondary structure.
The study of RNA structure is important in understanding its functions.
AUGAGUAUAAGUUAAUGGUUAAAGUAAAUGUCUUCCACACAUUCCAUCUGAUUUCGAUUCUCACUACUCAU
Sequence parsing
Sequence parsing: given an input sequence x, determine the associated parse tree y given an underlying context-free grammar.
Example: x = GAUCGAUCGAUC; y is the associated parse tree (drawn on the slide).
Context-free grammar G = {V, Σ, R, S}:
V = {S}: set of non-terminal symbols
Σ = {G, A, U, C}: set of terminal symbols
R = {S → SS | GSC | CSG | ASU | USA | ε}: set of production rules
Traditionally HMMs have been used for sequence labeling.
Two main drawbacks:
- The conditional independence assumptions are often too restrictive: HMMs cannot represent multiple interacting features or long-range dependencies between the observations.
- They are typically trained by maximum likelihood (ML) estimation.
Label sequence y = (y1...yn), observed sequence x = (x1...xn).
(Graphical model: HMM chain over states y1, y2, y3 with observations x1, x2, x3.)
Generative models
Sequence labeling:
Discriminative models
Specify the probability of a possible output y given an observation x (the conditional probability P(y|x) is considered rather than the joint probability P(y,x)).
Do not require strict independence assumptions of generative models.
Arbitrary features of the observations are considered.
Conditional Random Fields (CRFs) [Lafferty et al., 01]
(Graphical model: CRF chain over labels y1, y2, y3 conditioned on the observations x1, x2, x3.)
Learning in structured output spaces
Several discriminative algorithms have emerged recently in order to predict complex structures, such as sequences, trees, or graphs.
New discriminative approaches.
Problems analyzed:
- Given a training set of correct pairs of sentences and their associated entity tags, learn to extract entities from a new sentence.
- Given a training set of correct biological alignments, learn to align two unknown sequences.
- Given a training set of correct RNA secondary structures associated to a set of sequences, learn to determine the secondary structure of a new sequence.
This is not an exhaustive list of possible applications.
Learning in structured output spaces: multilabel supervised classification (output: y = (y1...yn)).

Training set: $T = \{(\mathbf{x}_1, \mathbf{y}_1), \ldots, (\mathbf{x}_\ell, \mathbf{y}_\ell)\}$, $\mathbf{x}_i \in \mathcal{X}$, $\mathbf{y}_i \in \mathcal{Y}$
Hypothesis space: $\mathcal{H} = \{h : \mathcal{X} \rightarrow \mathcal{Y}\}$
Learning: find $h \in \mathcal{H}$ s.t. $h(\mathbf{x}_i) = \mathbf{y}_i$, $i = 1, \ldots, \ell$
Prediction: $\mathbf{y} = h(\mathbf{x})$ on a new test sample $\mathbf{x}$, with

$$h(\mathbf{x}) = \arg\max_{\mathbf{y}} \mathrm{Score}(\mathbf{x}, \mathbf{y}), \qquad \mathrm{Score}(\mathbf{x}, \mathbf{y}) = s(\mathbf{x}, \mathbf{y}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}, \mathbf{y}), \quad \mathbf{w} \in \mathbb{R}^d.$$
Three main phases:
Encoding: define a suitable feature map (x,y).
Compression: characterize the output space in a synthetic and compact way.
Optimization: define a suitable objective function and use it for learning.
Encoding
Features must be defined in a way such that prediction can be computed efficiently.
The feature vector φ(x,y) decomposes as a sum of elementary features over "parts".
Parts are typically edges or nodes in graphs.
$$h(\mathbf{x}) = \arg\max_{\mathbf{y}} \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}, \mathbf{y})$$

The set of candidate outputs y is typically huge.
(Figure: alignment graph for S1 = ATGCTTTC, S2 = CTGTCGCC.)
Encoding
Transition features: $\phi^t_{pz}(\mathbf{y}) = \sum_{k} I(y_{k-1}=p)\,I(y_k=z)$
Emission features: $\phi^e_{pq}(\mathbf{x},\mathbf{y}) = \sum_{k} I(x_k=q)\,I(y_k=p)$
In general features reflect long range interactions (when labeling xi past and future observations are taken into account).
Arbitrary features of the observations are considered (e.g. spelling properties in NER).
Sequence labeling:
Example: CRF with HMM features
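As a concrete illustration of these counting features, here is a minimal sketch; the tag set and the toy sentence fragment are invented for illustration:

```python
from collections import Counter

def hmm_features(x, y):
    """Count HMM-style features for a labeled sequence:
    transitions (y[k-1], y[k]) and emissions (y[k], x[k])."""
    transitions = Counter(zip(y[:-1], y[1:]))
    emissions = Counter(zip(y, x))
    return transitions, emissions

# Toy example (invented): words tagged O (outside) / N (name)
x = ["PP", "ESTUDIA", "YA", "PROYECTO"]
y = ["O", "N", "N", "N"]
t, e = hmm_features(x, y)
print(t[("N", "N")])   # 2: two N -> N transitions
print(e[("N", "YA")])  # 1: state N emits "YA" once
```

The two counters together play the role of the transition and emission components of φ(x, y).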
3-parameter model: one cost each for matches, mismatches and gaps.
In practice more complex models are used:
4-parameter model: affine function for gap penalties, i.e. different costs if the gap starts in a given position (gap opening penalty) or if it continues (gap extension penalty).
211/212-parameter model: φ(x,y) contains the statistics associated to the gap penalties and all the possible pairs of amino acids.
Encoding
Example: for the alignment of ATGCTTTC and CTGTCGCC shown on the slide,
$\boldsymbol{\phi}(\mathbf{x},\mathbf{y})$ = (#matches, #mismatches, #gaps) = (4, 1, 4).
Sequence alignment:
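These alignment statistics can be read directly off an aligned pair of rows; a small sketch, where the two aligned strings are made up and '-' marks gaps:

```python
def alignment_features(a, b):
    """Count (matches, mismatches, gaps) in two aligned rows of
    equal length, where '-' denotes a gap in either row."""
    assert len(a) == len(b)
    matches = sum(1 for p, q in zip(a, b) if p == q and p != '-')
    gaps = sum(1 for p, q in zip(a, b) if p == '-' or q == '-')
    mismatches = len(a) - matches - gaps
    return matches, mismatches, gaps

# Invented aligned rows, just to exercise the counts
print(alignment_features("ACTGA-T", "AC-GACT"))  # (5, 0, 2)
```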
Encoding
For the example parse, $\boldsymbol{\phi}(\mathbf{x},\mathbf{y})$ collects the rule counts (2, 2, 2, 1, 1, 1 on the slide) for the rules:
S → SS
S → GSC
S → CSG
S → ASU
S → USA
S → ε
Sequence parsing:
(Figure: x = GAUCGAUCGAUC and its parse tree y.)
The feature vector contains the statistics associated to the occurrences of the rules.
Encoding: having defined these features, predictions can be computed efficiently with dynamic programming (DP):
- Sequence labeling → Viterbi algorithm
- Sequence alignment → Needleman-Wunsch algorithm
- Sequence parsing → Cocke-Younger-Kasami (CYK) algorithm
(Figure: DP table for aligning ATGCTTTC and CTGTCGCC.)
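For sequence labeling the argmax decomposes along the chain, so Viterbi applies directly. A minimal additive-score sketch; the states and the transition/emission scores are invented toy values, not taken from the slides:

```python
def viterbi(obs, states, trans, emit):
    """Find the highest-scoring label sequence for obs.
    trans[(p, z)] and emit[(p, q)] are additive scores (default 0)."""
    # delta[s] = best score of a label prefix ending in state s
    delta = {s: emit.get((s, obs[0]), 0.0) for s in states}
    back = []
    for o in obs[1:]:
        prev, delta, ptr = delta, {}, {}
        for s in states:
            best = max(states, key=lambda p: prev[p] + trans.get((p, s), 0.0))
            delta[s] = prev[best] + trans.get((best, s), 0.0) + emit.get((s, o), 0.0)
            ptr[s] = best
        back.append(ptr)
    # Backtrack from the best final state
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

# Toy scores (invented)
states = ["O", "N"]
trans = {("O", "O"): 1.0, ("N", "N"): 1.0}
emit = {("N", "Merida"): 2.0, ("O", "LA"): 2.0}
print(viterbi(["LA", "LA", "Merida"], states, trans, emit))  # ['O', 'O', 'N']
```

With weights w learned by any of the methods in these slides, the same recursion computes $\arg\max_{\mathbf{y}} \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x},\mathbf{y})$.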
Computing moments
The number N of possible output vectors y_k given an observation x is typically huge.
To characterize the distribution of the scores, its mean and its variance are considered.
The mean vector and the covariance matrix C can be computed efficiently with DP techniques.
$$\mu_s(\mathbf{x}) = \frac{1}{N}\sum_{k=1}^{N} \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}, \mathbf{y}_k) = \mathbf{w}^T \bar{\boldsymbol{\phi}}(\mathbf{x}), \qquad \bar{\boldsymbol{\phi}}(\mathbf{x}) = \frac{1}{N}\sum_{k=1}^{N} \boldsymbol{\phi}(\mathbf{x}, \mathbf{y}_k)$$

$$\sigma_s^2(\mathbf{x}) = \frac{1}{N}\sum_{k=1}^{N} \left(\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}, \mathbf{y}_k) - \mathbf{w}^T \bar{\boldsymbol{\phi}}(\mathbf{x})\right)^2 = \mathbf{w}^T C \mathbf{w}$$
Forward-style recursion for the mean of the emission feature $\phi^e_{pq}$:

Input: x = (x1, x2, ..., xn), state p, symbol q.
N(i, 1) := 1 for every state i
F(i, 1) := 1 if (q = x1) and (p = i), else 0
for j = 2 to n
  for each state i
    M := 1 if (q = xj) and (p = i), else 0
    N(i, j) := Σ_{i'} N(i', j-1)
    F(i, j) := Σ_{i'} F(i', j-1) + M · N(i, j)
  endfor
endfor
Output: mean value $\mu^e_{pq} = \sum_i F(i, n) \big/ \sum_i N(i, n)$
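A forward-style DP for the mean emission-feature count can be sketched in code: N(i, j) counts the label prefixes ending in state i at position j, and F(i, j) accumulates emission-feature counts over those prefixes. This is a sketch assuming a uniform distribution over label sequences, not the authors' exact implementation; the states and input are toy values, checked against brute-force enumeration:

```python
from itertools import product

def mean_emission_count(x, states, p, q):
    """Mean of sum_k I(x_k == q and y_k == p) over all |states|^n
    label sequences y, computed with a forward-style DP."""
    N = {i: 1 for i in states}  # prefixes of length 1 ending in state i
    F = {i: 1.0 if (q == x[0] and p == i) else 0.0 for i in states}
    for xj in x[1:]:
        M = 1.0 if q == xj else 0.0
        totN, totF = sum(N.values()), sum(F.values())
        N = {i: totN for i in states}
        F = {i: totF + (M if p == i else 0.0) * totN for i in states}
    return sum(F.values()) / sum(N.values())

def brute(x, states, p, q):
    """Brute-force check: enumerate every label sequence."""
    n, vals = len(x), []
    for y in product(states, repeat=n):
        vals.append(sum(1 for k in range(n) if x[k] == q and y[k] == p))
    return sum(vals) / len(vals)

x = ["a", "b", "a"]
print(mean_emission_count(x, ["s1", "s2"], "s1", "a"))  # 1.0
print(brute(x, ["s1", "s2"], "s1", "a"))                # 1.0
```

The DP visits each position once per state, while the brute force enumerates exponentially many sequences; both agree on the toy input.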
Computing moments
The number N of possible label sequences y_k given an observation sequence x is exponential in the length of the sequence.
An algorithm similar to the forward algorithm is used to compute the mean vector and C.
Sequence labeling: recursive formula for the mean value $\mu^e_{pq} = \frac{1}{N}\sum_k \phi^e_{pq}(\mathbf{x}, \mathbf{y}_k)$, associated to the feature which represents the emission of a symbol q at state p.
Computing moments
Basic idea behind the recursive formulas:

Mean values are computed incrementally:
$$E\left[\sum_{i=1}^{k} a_i\right] = E\left[\sum_{i=1}^{k-1} a_i\right] + E[a_k]$$

Second-order moments are expanded in the same incremental fashion:
$$E\left[\Big(\sum_{i=1}^{k} a_i\Big)^2\right] = E\left[\Big(\sum_{i=1}^{k-1} a_i\Big)^2\right] + 2\,E\left[a_k \sum_{i=1}^{k-1} a_i\right] + E\big[a_k^2\big]$$

Variances are computed by centering the second-order moments:
$$\mathrm{Var}\left[\sum_{i=1}^{k} a_i\right] = E\left[\Big(\sum_{i=1}^{k} a_i\Big)^2\right] - E\left[\sum_{i=1}^{k} a_i\right]^2$$
Computing moments
Problem: high computational cost for large feature spaces.
1st solution: exploit the structure and the sparseness of the covariance matrix C. In sequence labeling for a CRF with HMM features, the number of different values in C is linear in the size of the observation alphabet.
2nd solution: sampling strategy.
Z-score
New optimization criterion particularly suited for non-separable cases.
Minimize the number of output vectors with score higher than the score of the correct pairs.
Maximize the Z-score:
$$Z(\mathbf{x},\mathbf{y}) = \frac{s(\mathbf{x},\mathbf{y}) - \mu_s(\mathbf{x})}{\sigma_s(\mathbf{x})}$$
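On tiny problems the Z-score can be checked by enumerating every candidate output explicitly; the feature map and the weights below are invented for illustration:

```python
import math
from itertools import product

def z_score(w, phi, x, y_true, candidates):
    """Z = (s(x, y) - mean of scores) / std of scores, where the
    scores s(x, y') = w . phi(x, y') range over all candidates y'."""
    scores = [sum(wi * fi for wi, fi in zip(w, phi(x, yc))) for yc in candidates]
    mu = sum(scores) / len(scores)
    var = sum((s - mu) ** 2 for s in scores) / len(scores)
    s_true = sum(wi * fi for wi, fi in zip(w, phi(x, y_true)))
    return (s_true - mu) / math.sqrt(var)

# Toy labeling problem (invented): phi counts positions matching x
# and adjacent (1, 1) label pairs
def phi(x, y):
    return [sum(1 for a, b in zip(x, y) if a == b),
            sum(1 for a, b in zip(y, y[1:]) if a == b == 1)]

x = (1, 0, 1)
candidates = list(product([0, 1], repeat=3))
print(round(z_score([1.0, 0.5], phi, x, (1, 0, 1), candidates), 4))  # 1.3363
```

A large Z means the correct output's score sits far above the bulk of the score distribution, which is exactly what the learning criterion maximizes.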
Z-score
The Z-score can be expressed as a function of the parameters w.
Two equivalent optimization problems:
$$Z(\mathbf{x},\mathbf{y}) = \frac{s(\mathbf{x},\mathbf{y}) - \mu_s(\mathbf{x})}{\sigma_s(\mathbf{x})} = \frac{\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x},\mathbf{y}) - \mathbf{w}^T\bar{\boldsymbol{\phi}}(\mathbf{x})}{\sqrt{\mathbf{w}^T C \mathbf{w}}} = \frac{\mathbf{w}^T \mathbf{b}}{\sqrt{\mathbf{w}^T C \mathbf{w}}}, \qquad \mathbf{b} = \boldsymbol{\phi}(\mathbf{x},\mathbf{y}) - \bar{\boldsymbol{\phi}}(\mathbf{x})$$

$$\max_{\mathbf{w}} \frac{\mathbf{w}^T \mathbf{b}}{\sqrt{\mathbf{w}^T C \mathbf{w}}} \qquad \Longleftrightarrow \qquad \min_{\mathbf{w}} \frac{1}{N}\sum_{k=1}^{N} \left[\mathbf{w}^T\big(\boldsymbol{\phi}(\mathbf{x},\mathbf{y}) - \boldsymbol{\phi}(\mathbf{x},\mathbf{y}_k)\big)\right]^2 \ \ \text{s.t. } \mathbf{w}^T\mathbf{b} = 1$$
Z-score
Ranking loss: the number of output vectors with score higher than the score of the correct pair,
$$L_{rk}(\mathbf{x},\mathbf{y}) = \frac{1}{N}\sum_{k=1}^{N} I\!\left(s(\mathbf{x},\mathbf{y}_k) \geq s(\mathbf{x},\mathbf{y})\right)$$

An upper bound on the ranking loss is minimized:
$$L_{rk}(\mathbf{x},\mathbf{y}) \leq L_{urk}(\mathbf{w},\mathbf{x},\mathbf{y}) = \frac{1}{N}\sum_{k=1}^{N}\left[1 - \mathbf{w}^T\!\left(\boldsymbol{\phi}(\mathbf{x},\mathbf{y}) - \boldsymbol{\phi}(\mathbf{x},\mathbf{y}_k)\right)\right]^2$$
Previous approaches:
- Minimize the number of incorrect macrolabels y, $L_{0/1}(\mathbf{x},\mathbf{y}) = I(h(\mathbf{x}) \neq \mathbf{y})$: CRFs [Lafferty et al., 01], HMSVM [Altun et al., 03], averaged perceptron [Collins, 02].
- Minimize the number of incorrect microlabels $y_j$, $L_{m}(\mathbf{x},\mathbf{y}) = \sum_j I(h(\mathbf{x})_j \neq y_j)$: M3Ns [Taskar et al., 03], SVMISO [Tsochantaridis et al., 04].
SODA
Given a training set T, the empirical risk associated to the upper bound on the ranking loss is minimized:

$$\min_{\mathbf{w}} \sum_{i=1}^{\ell} \frac{1}{N_i}\sum_{k=1}^{N_i} \left[\mathbf{w}^T\!\left(\boldsymbol{\phi}(\mathbf{x}_i,\mathbf{y}_i) - \boldsymbol{\phi}(\mathbf{x}_i,\mathbf{y}_{ik})\right)\right]^2 \quad \text{s.t. } \sum_{i=1}^{\ell} \mathbf{w}^T \mathbf{b}_i = 1$$

An equivalent formulation in terms of C and b is considered to solve it. SODA (Structured Output Discriminant Analysis):

$$\max_{\mathbf{w}} \frac{\sum_{i=1}^{\ell} \mathbf{w}^T \mathbf{b}_i}{\sqrt{\sum_{i=1}^{\ell} \mathbf{w}^T \left(C_i + \mathbf{b}_i \mathbf{b}_i^T\right)\mathbf{w}}}$$
SODA convex optimization:

$$\max_{\mathbf{w}} \frac{\mathbf{w}^T \mathbf{b}^*}{\sqrt{\mathbf{w}^T C^* \mathbf{w}}} \quad \Longleftrightarrow \quad \min_{\mathbf{w}} \mathbf{w}^T C^* \mathbf{w} \ \ \text{s.t. } \mathbf{w}^T \mathbf{b}^* = 1$$

If C* is not PSD, regularization can be introduced.
Solution: simple matrix inversion, $\mathbf{w} \propto {C^*}^{-1}\mathbf{b}^*$.
Fast conjugate gradient methods are available.
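The slides note that the solution reduces to a simple matrix inversion, with fast conjugate gradient methods available; below is a pure-Python sketch of that route. The toy C and b, and the regularization constant lam, are assumptions for illustration:

```python
def conjugate_gradient(A, b, iters=50, tol=1e-10):
    """Solve A x = b for a symmetric positive-definite matrix A
    (lists of lists) by the standard conjugate gradient iteration."""
    n = len(b)
    x = [0.0] * n
    r = b[:]          # residual b - A x, with x = 0
    p = r[:]
    rs = sum(v * v for v in r)
    for _ in range(iters):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(v * v for v in r)
        if rs_new < tol:
            break
        p = [r[i] + (rs_new / rs) * p[i] for i in range(n)]
        rs = rs_new
    return x

def soda_weights(C, b, lam=1e-6):
    """w proportional to (C + lam*I)^(-1) b, rescaled so that w.b = 1.
    lam regularizes C when it is not positive definite (assumed value)."""
    n = len(b)
    A = [[C[i][j] + (lam if i == j else 0.0) for j in range(n)] for i in range(n)]
    w = conjugate_gradient(A, b)
    scale = sum(w[i] * b[i] for i in range(n))
    return [wi / scale for wi in w]

# Synthetic moments (invented), just to exercise the solver
C = [[2.0, 0.3], [0.3, 1.0]]
b = [1.0, 0.5]
w = soda_weights(C, b)
print(round(sum(wi * bi for wi, bi in zip(w, b)), 6))  # 1.0
```

In practice b* and C* would come from the moment computations of the previous slides; only the linear solve changes with problem size.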
Rademacher bound: the bound shows that learning based on the upper bound on the ranking loss is effectively achieved.
The bound holds also in the case where b* and C* are estimated by sampling. Two directions of sampling:
- For each $(\mathbf{x},\mathbf{y}) \in T$, only a limited number n of incorrect outputs is considered to estimate b* and C*.
- Only a finite number ℓ of input-output pairs is given in the training set.
The empirical expectation of the estimated loss $\hat{E}[\hat{L}_{urk}(\mathbf{x},\mathbf{y})]$ (estimated by computing b* and C* by random sampling) is a good approximate upper bound for the expected loss $E[L_{urk}(\mathbf{x},\mathbf{y})]$.
The latter is an upper bound for the ranking loss $L_{rk}(\mathbf{x},\mathbf{y})$, such that the Rademacher bound is also a bound on the expectation of the ranking loss.
Rademacher bound
Theorem (Rademacher bound for SODA). With probability at least 1-δ over the joint draw of the random sample T and of the random samples from the output space, taken for each $(\mathbf{x},\mathbf{y}) \in T$ to approximate b* and C*, the following bound holds for any w with squared norm smaller than c:

$$E\!\left[L_{urk}(\mathbf{x},\mathbf{y})\right] \leq \hat{E}\!\left[\hat{L}_{urk}(\mathbf{x},\mathbf{y})\right] + \hat{R}_1 + \hat{R}_2 + M\sqrt{\frac{\log(3/\delta)}{2n}} + M\sqrt{\frac{\log(3/\delta)}{2\ell}}$$

whereby M is a constant and we assume that the number of random samples for each training pair is equal to n. The Rademacher complexity terms $\hat{R}_1$ and $\hat{R}_2$ decrease with n and ℓ respectively, such that the bound becomes tight for increasing n and ℓ, as long as n grows faster than log(ℓ).
Z-score approach
How to define the Z-score of a training set? Another possible approach (independence assumption):
Convex optimization problem which can be solved again by simple matrix inversion.
Maximizing the Z-score, most of the linear constraints
$$\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_i,\mathbf{y}_i) \geq \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_i,\mathbf{y}_{ik}), \qquad \forall\, \mathbf{y}_{ik} \neq \mathbf{y}_i,\ i = 1,\ldots,\ell$$
are satisfied.

Z-score approach:
$$\max_{\mathbf{w}} \frac{\sum_{i=1}^{\ell} \mathbf{w}^T \mathbf{b}_i}{\sqrt{\sum_{i=1}^{\ell} \mathbf{w}^T C_i \mathbf{w}}}$$
One may want to impose explicitly the violated constraints.
This is again a convex optimization problem that can be solved with an iterative algorithm similar to previous approaches (HMSVM [Altun et al., 03], averaged perceptron [Collins, 02]).
The constraints can possibly be relaxed (e.g. by adding slack variables for non-separable problems).
Iterative approach
QP:
$$\min_{\mathbf{w}} \mathbf{w}^T C^* \mathbf{w} \quad \text{s.t. } \mathbf{w}^T \mathbf{b}^* = 1, \quad \mathbf{w}^T\!\left(\boldsymbol{\phi}(\mathbf{x}_i,\mathbf{y}_i) - \boldsymbol{\phi}(\mathbf{x}_i,\mathbf{y}'_i)\right) \geq 0 \ \ \forall\, (\mathbf{x}_i,\mathbf{y}'_i) \in \mathcal{C}$$

Input: training set T
1: 𝒞 ← ∅
2: Compute b_i, C_i for all i = 1...ℓ
3: Compute b* = sum(b_i), C* = sum(C_i)
4: Find w solving QP.
5: repeat
6:   for i = 1...ℓ do
7:     Compute y'_i = argmax_y w^T φ(x_i, y)
8:     if w^T φ(x_i, y'_i) > w^T φ(x_i, y_i)
9:       𝒞 ← 𝒞 ∪ { w^T (φ(x_i, y_i) - φ(x_i, y'_i)) ≥ 0 }
10:      Find w solving QP s.t. 𝒞
11:    endif
12:  endfor
13: until 𝒞 is not changed during the current iteration.
Iterative approach
Flow of the iterative approach: moments computation → Z-score maximization → identify the most violated constraint → constrained Z-score maximization (the last two steps repeat until convergence).
Experimental results
Chain CRF with HMM features. Sequence length: 50. Training set size: 20 pairs. Test set size: 100 pairs. Comparison with SVMISO [Tsochantaridis et al., 04], Perceptron [Collins, 02], CRFs [Lafferty et al., 01]. Average number of incorrect labels varying the level of noise p.
Sequence labeling: artificial data.
(Plots: test error as a function of the noise level p for SODA, Perceptron, SVMISO, CRFs and Z-score.)
HMM features. Noise level p = 0.2. Average number of incorrect labels and computational time as functions of the training set size.
Experimental results
Sequence labeling: artificial data.
(Plots: test error vs. training set size for CRFs, SVMISO, Perceptron and SODA; training time vs. training set size for SODA and SVMISO.)
Chain CRF with HMM features. Sequence length: 10. Training set size: 50 pairs. Test set size: 100 pairs. Level of noise p = 0.2. Comparison with SVMISO [Tsochantaridis et al., 04]. Labeling error on test set and average training time as functions of the observation alphabet size.
(Plots: training time and test error vs. observation alphabet size for SODA (50 paths), SODA (200 paths), SODA (DP) and SVMISO.)

Sequence labeling: artificial data.
Experimental results
Chain CRF with HMM features.
Adding constraints is not very useful when data are noisy and not linearly separable.
(Plot: average number of correct hidden sequences (%) vs. number of constraints for Z-score (constr), SVMISO and Perceptron.)
Sequence labeling: artificial data.
Experimental results
Sequence labeling:
NER
Spanish news wire articles - Special Session of CoNLL-02.
300 sentences with an average length of 30 words. 9 labels: non-name, beginning and continuation of persons, organizations, locations and miscellaneous names.
Two sets of binary features: S1 (HMM features) and S2 (S1 plus HMM features for the previous and the next word).
Labeling error on test set (5-fold cross-validation):
Method S1 S2
Z-score 11.07 7.89
SODA 10.13 8.27
SVMISO 10.97 8.11
Perceptron 20.99 13.78
CRFs 12.01 8.29
Experimental results
Sequence alignment: artificial sequences.
Training set size 5 10 20 50 100
SODA 78.6 62.85 44.6 36.7 30.84
Generative 96.4 94.39 87.12 45.31 31.05
Test error (number of incorrectly aligned pairs) as function of the training set size.
Original and reconstructed substitution matrices.
Experimental results
Sequence parsing:
G6 grammar in [Dowell and Eddy, 2004]. RNA sequences of five families extracted from the Rfam database [Griffiths-Jones et al., 2003].
Prediction on five-fold cross-validation.
Family | Z-score with constraints (sensitivity / specificity / constraints) | Generative (sensitivity / specificity) | Perceptron (sensitivity / specificity)
RF00032 | 100 / 95.98 / 2 | 100 / 95.53 | 100 / 95.59
RF00260 | 98.77 / 94.80 / 6 | 98.97 / 100 | 98.57 / 98.90
RF00436 | 91.11 / 90.61 / 27.6 | 44.16 / 53.30 | 90.27 / 86.53
RF00164 | 76.14 / 73.74 / 37.8 | 65.51 / 62.55 | 87.06 / 78.32
RF00480 | 99.08 / 89.89 / 78.2 | 99.88 / 86.43 | 98.83 / 94.78
Conclusions
New methods for learning in structured output spaces:
- Accuracy comparable with state-of-the-art techniques.
- Easy to implement (DP for matrix computations and a simple optimization problem).
- Fast for large training sets and a reasonable number of features.
- Mean and variance computations are parallelizable for large training sets.
- Conjugate gradient techniques are used in the optimization phase.
Three applications analyzed: sequence labeling, sequence parsing and sequence alignment.
Future work: test the scalability of this approach using approximate techniques; develop a dual version with kernels.
Thank you