Transcript of Ph.D. Thesis Oral Defense, Stanford Artificial Intelligence Laboratory (SAIL)

  • Ph.D. Thesis Oral Defense

    Stanford Artificial Intelligence Laboratory (SAIL)

    Valentin I. Spitkovsky

    Grammar Induction and Parsingwith Dependency-and-Boundary Models

    V.I. Spitkovsky (Stanford & Google) Ph.D. Thesis Oral Defense Gates #415 (2013-08-14) 1 / 60

  • The Problem

    Problem: Unsupervised Parsing (and Grammar Induction)

    Input: Raw Text (Sentences, Tokens and POS-tags)

    ... By most measures, the nation’s industrial sector is now
    growing very slowly — if at all. Factory payrolls fell in
    September. So did the Federal Reserve ...

    NN      NNS      VBD  IN NN        ♦
    Factory payrolls fell in September .

    Output: Syntactic Structures (and a Probabilistic Grammar)

    V.I. Spitkovsky (Stanford & Google) Ph.D. Thesis Oral Defense Gates #415 (2013-08-14) 2 / 60

  • The Problem

    Motivation: Unsupervised (Dependency) Parsing

    Insert your favorite reason(s) why you’d like to parse
    anything in the first place...

    ... adjust for any data without reference treebanks
    — i.e., understudied languages or genres (e.g., legal).

    Potential applications:
    ◮ machine translation
      — word alignment, phrase extraction, reordering;
    ◮ web search
      — retrieval, query refinement;
    ◮ question answering, speech recognition, etc.

    V.I. Spitkovsky (Stanford & Google) Ph.D. Thesis Oral Defense Gates #415 (2013-08-14) 3 / 60

  • The Problem

    Evaluation: Directed Dependency Accuracy

    Scoring example:

    NN      NNS      VBD  IN NN        ♦
    Factory payrolls fell in September .

    Directed Score: 3/5 = 60% (left/right-branching baselines: 2/5 = 40%);

    Undirected Score: 4/5 = 80% (left/right-branching baselines: 4/5 = 80%).

    V.I. Spitkovsky (Stanford & Google) Ph.D. Thesis Oral Defense Gates #415 (2013-08-14) 4 / 60
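The directed and undirected scores above can be computed by comparing each token's proposed head index with the gold head (a minimal sketch; the gold heads assume the conventional analysis of the example sentence, and the candidate parse is a hypothetical one that happens to reproduce the 3/5 and 4/5 scores):

```python
def dependency_scores(gold_heads, guess_heads):
    """Directed and undirected dependency accuracy.

    Heads are given per token as the index of the parent
    (0 = the root symbol; tokens are 1-indexed)."""
    assert len(gold_heads) == len(guess_heads)
    n = len(gold_heads)
    gold_edges = {frozenset((i + 1, h)) for i, h in enumerate(gold_heads)}
    directed = sum(g == p for g, p in zip(gold_heads, guess_heads))
    undirected = sum(frozenset((i + 1, p)) in gold_edges
                     for i, p in enumerate(guess_heads))
    return directed / n, undirected / n

# "Factory payrolls fell in September" (punctuation excluded):
gold  = [2, 3, 0, 3, 4]   # Factory<-payrolls<-fell (root); fell->in->September
guess = [2, 3, 0, 5, 3]   # hypothetical candidate parse: two arcs differ
print(dependency_scores(gold, guess))  # -> (0.6, 0.8)
```

Undirected scoring credits an arc whose direction is flipped (here, in vs. September), which is why the undirected baselines on the slide are so much stronger.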

  • Prior Art

    Prior Art: A Brief History

    1992 — word classes (Carroll and Charniak)
    1998 — greedy linkage via mutual information (Yuret)
    2001 — iterative re-estimation with EM (Paskin)
    2004 — right-branching baseline; valence (DMV) (Klein and Manning)
    2004 — annealing techniques (Smith and Eisner)
    2005 — contrastive estimation (Smith and Eisner)
    2006 — structural biasing (Smith and Eisner)
    2007 — common cover link representation (Seginer)
    2008 — logistic normal priors (Cohen et al.)
    2009 — lexicalization and smoothing (Headden et al.)
    2009 — soft parameter tying (Cohen and Smith)

    V.I. Spitkovsky (Stanford & Google) Ph.D. Thesis Oral Defense Gates #415 (2013-08-14) 5 / 60

  • Prior Art

    Prior Art: Dependency Model with Valence

    a head-outward model, with word classes
    and valence/adjacency (Klein and Manning, 2004)

    [Diagram: a head h (e.g., a verb) generates argument phrases
    a1, a2, etc. outward in each direction, then STOP.]

    P(t_h) = \prod_{dir \in \{L,R\}} P_{STOP}\bigl(c_h, dir, \overbrace{1_{n=0}}^{adj}\bigr)
             \prod_{i=1}^{n} P(t_{a_i}) \, P_{ATTACH}(c_h, dir, c_{a_i})
             \bigl(1 - P_{STOP}(c_h, dir, \overbrace{1_{i=1}}^{adj})\bigr),
    where n = |args(h, dir)|.

    V.I. Spitkovsky (Stanford & Google) Ph.D. Thesis Oral Defense Gates #415 (2013-08-14) 6 / 60
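The head-outward factorization above can be turned into a recursive subtree-probability computation (a minimal sketch; the P_STOP and P_ATTACH tables and the example tree are hypothetical toy values, not trained DMV parameters):

```python
# Recursive DMV subtree probability, following the head-outward
# factorization: in each direction a head decides to attach
# arguments (adjacency-conditioned stop decisions) and finally
# stops. All parameter values below are made-up toy numbers.

P_STOP = {  # (head class, direction, adjacent?) -> P(stop)
    ('VBD', 'L', True): 0.3, ('VBD', 'L', False): 0.8,
    ('VBD', 'R', True): 0.4, ('VBD', 'R', False): 0.9,
    ('NNS', 'L', True): 0.5, ('NNS', 'L', False): 0.9,
    ('NNS', 'R', True): 1.0, ('NNS', 'R', False): 1.0,
    ('NN',  'L', True): 1.0, ('NN',  'L', False): 1.0,
    ('NN',  'R', True): 1.0, ('NN',  'R', False): 1.0,
}
P_ATTACH = {  # (head class, direction, argument class) -> P(attach)
    ('VBD', 'L', 'NNS'): 0.6,
    ('NNS', 'L', 'NN'):  0.7,
}

def p_tree(head, left=(), right=()):
    """P(t_h): probability of the subtree rooted at `head`.

    `left`/`right` hold the head's argument subtrees, each a
    (head_class, left_args, right_args) triple, ordered
    outward from the head."""
    p = 1.0
    for direction, args in (('L', left), ('R', right)):
        n = len(args)
        p *= P_STOP[(head, direction, n == 0)]  # final stop (adjacent iff n = 0)
        for i, (a_head, a_left, a_right) in enumerate(args, start=1):
            p *= 1 - P_STOP[(head, direction, i == 1)]  # continue (adj iff i = 1)
            p *= P_ATTACH[(head, direction, a_head)]    # choose argument class
            p *= p_tree(a_head, a_left, a_right)        # recurse: P(t_{a_i})
    return p

# "Factory payrolls fell": NN <- NNS <- VBD (all left attachments)
tree = ('VBD', (('NNS', (('NN', (), ()),), ()),), ())
print(p_tree(*tree))
```

A dictionary of conditional tables keeps the adjacency conditioning explicit: the only context a stop or attach decision sees is the head's class, the direction, and whether the head has attached anything in that direction yet.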

  • Prior Art

    Prior Art: Unsupervised Learning Engine
    parameter fitting via expectation-maximization (EM)

    ◮ by means of inside-outside re-estimation (Baker, 1979)

    [Figure: inside-outside chart for a sentence w_1 ... w_m:
    outside probability α and inside probability β of a
    nonterminal N^j spanning w_p ... w_q (Manning and Schütze, 1999).]

    V.I. Spitkovsky (Stanford & Google) Ph.D. Thesis Oral Defense Gates #415 (2013-08-14) 7 / 60

  • Prior Art

    Prior Art: Unsupervised Learning Engine
    parameter fitting via expectation-maximization (EM)

    BLACK BOX

    V.I. Spitkovsky (Stanford & Google) Ph.D. Thesis Oral Defense Gates #415 (2013-08-14) 7 / 60

  • Prior Art

    Prior Art: The Standard Corpus — WSJ10

    [Figure: sizes of the WSJk corpora (sentences of up to k
    tokens) as a function of the cutoff k = 5, ..., 45:
    sentences (5-45 thousand) and tokens (100-900 thousand).]

    V.I. Spitkovsky (Stanford & Google) Ph.D. Thesis Oral Defense Gates #415 (2013-08-14) 8 / 60

  • Outline

    Outline

    Introduction to the Grammar Induction Problem
    ◮ The Inputs and Outputs (text → dependency parses)
    ◮ A Probabilistic Model (DMV)
    ◮ Parameter Fitting (EM)

    Part I: Non-Convex Optimization

    Part II: Parsing Constraints

    Part III: Generative Models

    State-of-the-Art Integrated Solution (in submission)

    V.I. Spitkovsky (Stanford & Google) Ph.D. Thesis Oral Defense Gates #415 (2013-08-14) 9 / 60

  • Outline

    Optimization

    Viterbi Training

    Baby Steps

    Lateen EM

    V.I. Spitkovsky (Stanford & Google) Ph.D. Thesis Oral Defense Gates #415 (2013-08-14) 10 / 60

  • Issues...

    Issue I: Why so little data?

    [Figure: directed dependency accuracy (%) on WSJ40 as a
    function of the training set WSJk, for Uninformed and
    Oracle initializers.]

    V.I. Spitkovsky (Stanford & Google) Ph.D. Thesis Oral Defense Gates #415 (2013-08-14) 11 / 60

  • Issues...

    Issue II: Non-convex objectives...

    maximizing the probability of data (sentence strings):

    \hat{\theta}_{UNS} = \arg\max_{\theta} \sum_{s} \log \underbrace{\sum_{t \in T(s)} P_{\theta}(t)}_{P_{\theta}(s)}

    supervised objective would be convex:

    \hat{\theta}_{SUP} = \arg\max_{\theta} \sum_{s} \log P_{\theta}(t^{*}(s))

    could optimize likeliest parse trees (also non-convex):

    \hat{\theta}_{VIT} = \arg\max_{\theta} \sum_{s} \max_{t \in T(s)} \log P_{\theta}(t)

    V.I. Spitkovsky (Stanford & Google) Ph.D. Thesis Oral Defense Gates #415 (2013-08-14) 12 / 60
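The contrast between the soft objective (sum over all trees of a sentence) and the Viterbi objective (only the likeliest tree) can be made concrete on a toy example (a sketch; the per-sentence tree probabilities below are made-up numbers, not grammar-derived):

```python
import math

# Hypothetical tree probabilities P_theta(t) for t in T(s),
# with two candidate parses per sentence (toy numbers):
corpus = [
    {"tree_a": 0.06, "tree_b": 0.02},   # sentence 1
    {"tree_c": 0.10, "tree_d": 0.10},   # sentence 2
]

# Soft objective: sum_s log sum_{t in T(s)} P_theta(t) = sum_s log P_theta(s)
soft = sum(math.log(sum(trees.values())) for trees in corpus)

# Viterbi objective: sum_s max_{t in T(s)} log P_theta(t)
viterbi = sum(max(math.log(p) for p in trees.values()) for trees in corpus)

# The Viterbi objective lower-bounds the soft one, with equality
# only when all of a sentence's mass sits on a single tree.
print(soft, viterbi)
```

Both objectives become non-convex in θ once the tree probabilities are products of shared grammar parameters, which is exactly the optimization difficulty this part of the talk addresses.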

  • Issues... Classic EM

    Classic EM:

    [Figure: classic EM's directed dependency accuracy (%) on
    WSJ40 vs. training set WSJk, with Uninformed and Oracle
    initializers.]

    V.I. Spitkovsky (Stanford & Google) Ph.D. Thesis Oral Defense Gates #415 (2013-08-14) 13 / 60

  • Issues... Viterbi EM

    Viterbi EM:

    [Figure: Viterbi EM's directed dependency accuracy (%) on
    WSJ40 vs. training set WSJk, with Uninformed and Oracle
    initializers.]

    V.I. Spitkovsky (Stanford & Google) Ph.D. Thesis Oral Defense Gates #415 (2013-08-14) 14 / 60

  • Issues... Viterbi EM

    Hard vs. Soft EM:

    Viterbi EM: zoom in on likeliest tree
    ◮ does not degrade with input length
    ◮ better retains supervised solutions
    ◮ actually learns from scratch
    — see also Cohen and Smith (2010)

    Classic EM: “focus across the board”
    ◮ hard to see the trees for the forest...
      ... but known to work with shorter inputs!

    V.I. Spitkovsky (Stanford & Google) Ph.D. Thesis Oral Defense Gates #415 (2013-08-14) 15 / 60

  • Baby Steps

    Idea I: Baby Steps ... as Non-convex Optimization

    start with an easy (convex) case

    slowly extend it to the fully complex target task

    take tiny (cautious) steps in the problem space

    ... try not to stray far from relevant
    neighborhoods in the solution space

    base case: sentences of length one (trivial — no init)

    incremental step: smooth WSJk; re-init WSJ(k + 1)

    V.I. Spitkovsky (Stanford & Google) Ph.D. Thesis Oral Defense Gates #415 (2013-08-14) 16 / 60
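The base case and incremental step above amount to a curriculum loop over length-filtered corpora (a sketch; `train_em` and `smooth` are hypothetical placeholders for the real DMV trainer and smoother, and the demo stand-ins below are trivial):

```python
# Baby Steps as a curriculum loop: train on WSJ1, smooth, then
# re-initialize training on WSJ2, WSJ3, ... so each stage's
# solution seeds the next, slightly harder problem.

def baby_steps(corpus, max_len, train_em, smooth):
    model = None                                     # no initializer needed
    for k in range(1, max_len + 1):
        wsj_k = [s for s in corpus if len(s) <= k]   # sentences up to k tokens
        if wsj_k:
            model = train_em(wsj_k, init=model)      # re-init from last stage
            model = smooth(model)                    # smooth before extending
    return model

# Demo: the stand-in "trainer" just appends how many sentences
# each stage saw, showing the gradually growing training sets.
corpus = [["Atone"], ["It", "is"], ["But", "many", "have"]]
model = baby_steps(
    corpus, max_len=3,
    train_em=lambda data, init: (init or []) + [len(data)],
    smooth=lambda m: m,
)
print(model)  # -> [1, 2, 3]
```

The key design point is that each stage's model is passed as the initializer of the next, so the learner never faces the full non-convex problem from a cold start.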

  • Baby Steps

    Idea I: Baby Steps ... as Graduated Learning

    WSJ1 — Atone (verbs!)

    WSJ2 — It is. Darkness fell. (pronouns!)

    WSJ3 — Become a Lobbyist. But many have. They didn’t. (determiners!)

    V.I. Spitkovsky (Stanford & Google) Ph.D. Thesis Oral Defense Gates #415 (2013-08-14) 17 / 60

  • Baby Steps

    Idea I: Baby Steps ... and Related Notions

    Psychology / Cognitive Science:
    ◮ shaping (Skinner, 1938)
    ◮ less is more (Kail, 1984; Newport, 1988; 1990)
    ◮ starting small (Elman, 1993)
      — scaffold on model complexity [restrict memory]
      — scaffold on data complexity [restrict input]

    NLP / AI:
    ◮ stepping stones (Brown et al., 1993)
    ◮ coarse-to-fine (Charniak and Johnson, 2005)
    ◮ curriculum learning (Bengio et al., 2009)

    Math / OR:
    ◮ continuation methods (Allgower and Georg, 1990)
    ◮ deterministic annealing (Rose, 1998)

    successive approximations!

    V.I. Spitkovsky (Stanford & Google) Ph.D. Thesis Oral Defense Gates #415 (2013-08-14) 18 / 60

  • Baby Steps

    Aside I: Speaking of successive approximations...

    V.I. Spitkovsky (Stanford & Google) Ph.D. Thesis Oral Defense Gates #415 (2013-08-14) 19 / 60

  • Baby Steps

    Idea I: Baby Steps ... Observations & Results!

    [Figure: directed dependency accuracy (%) on WSJ40 vs.
    training set WSJk; curves: Uninformed, Oracle, Baby Steps,
    and Less is More (with the K&M∗ initializer).]

    V.I. Spitkovsky (Stanford & Google) Ph.D. Thesis Oral Defense Gates #415 (2013-08-14) 20 / 60

  • Leapfrog

    Idea II: Less is More & Leapfrog ... a Hack!

    [Figure: directed dependency accuracy (%) on WSJk, training
    and evaluating at the same cutoff k; curves: Uninformed,
    Oracle, Baby Steps, K&M∗, and Leapfrog.]

    V.I. Spitkovsky (Stanford & Google) Ph.D. Thesis Oral Defense Gates #415 (2013-08-14) 21 / 60

  • Leapfrog

    Aside II: Also, to be fair...

    [Figure: undirected dependency accuracy (%) on WSJk for the
    Uninformed run.]

    V.I. Spitkovsky (Stanford & Google) Ph.D. Thesis Oral Defense Gates #415 (2013-08-14) 22 / 60

  • Lateen EM

    Lateen EM: Intuition

    Use both objectives (a primary and a secondary).

    As a captain can’t count on favorable winds, so an
    unsupervised learner can’t rely on co-operative gradients.

    Lateen strategies de-emphasize fixed points, e.g., by
    tacking around local attractors, in a zig-zag fashion.

    V.I. Spitkovsky (Stanford & Google) Ph.D. Thesis Oral Defense Gates #415 (2013-08-14) 23 / 60

  • Lateen EM

    Lateen EM: Simple Algorithm

    Alternate ordinary soft and hard EM algorithms:
    switching when stuck helps escape local optima.

    E.g., Italian grammar induction improves
    from 41.8% to 56.2% after three lateen alternations:

    [Figure: bits per token (bpt, 3.0-4.5) over 300 iterations;
    successive local optima of the primary (top) and secondary
    (bottom) curves: 3.39/3.26, 3.42/3.19, 3.33/3.23, ...]

    Pumping action:
    – hard EM pushes down the top curve (primary objective);
    – soft EM pushes down the bottom curve (the secondary
      objective), often at the expense of the primary.

    V.I. Spitkovsky (Stanford & Google) Ph.D. Thesis Oral Defense Gates #415 (2013-08-14) 24 / 60

  • Lateen EM

    Lateen EM: Early-Stopping

    Use one objective to validate moves proposed by the
    other: stop if the secondary objective gets worse.

    30% faster, on average, than either standard EM.

    V.I. Spitkovsky (Stanford & Google) Ph.D. Thesis Oral Defense Gates #415 (2013-08-14) 25 / 60
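The simple alternation algorithm can be sketched as a generic driver around any pair of EM steps (a sketch; `soft_step`, `hard_step`, and their toy stand-ins below are hypothetical, not the DMV trainer):

```python
# Simple lateen EM: run soft EM to a local optimum of the primary
# objective, then hard EM, and keep alternating while switching
# still improves the primary. `objective` is a loss (lower is
# better, e.g. bits per token).

def run_to_convergence(step, model, objective):
    """Apply `step` until the objective stops improving."""
    obj = objective(model)
    while True:
        new = step(model)
        new_obj = objective(new)
        if new_obj >= obj:        # stuck at a local optimum
            return model, obj
        model, obj = new, new_obj

def lateen_em(soft_step, hard_step, objective, model, max_alternations=10):
    model, best = run_to_convergence(soft_step, model, objective)
    for _ in range(max_alternations):
        model, _ = run_to_convergence(hard_step, model, objective)  # tack
        model, obj = run_to_convergence(soft_step, model, objective)
        if obj >= best:           # switching no longer helps: terminate
            break
        best = obj
    return model, best

# Toy stand-ins: each step lowers the loss by 1 but stalls on its
# own plateaus (multiples of 10 for soft, 5 mod 10 for hard), so
# each objective can pull the other off its local optimum.
def soft_step(x):
    return x if x % 10 == 0 else x - 1

def hard_step(x):
    return x if x % 10 == 5 or x <= 0 else x - 1

print(lateen_em(soft_step, hard_step, objective=lambda x: x, model=23))
# -> (0, 0): alternation escapes plateaus at 20, 15, 10, and 5
```

The early-stopping variant would instead watch the other objective after each step and halt as soon as it worsens, trading some accuracy pumping for speed.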

  • Lateen EM

    Optimization: Summary

    Take guesswork out of tuning EM:◮ Lateen EM ties termination to a sign change;◮ Viterbi EM and Baby Steps don’t require initializers.

    Exploit multiple views and iterate (Blum and Mitchell, 1998):

    ◮ Lateen EM optimizes for Viterbi parse trees and forests;◮ Baby Steps focuses on simpler sentences in the data.

    Capitalize on EM’s advantages:◮ guarantee to not harm a primary likelihood

    V.I. Spitkovsky (Stanford & Google) Ph.D. Thesis Oral Defense Gates #415 (2013-08-14) 26 / 60

  • Lateen EM

    Optimization: Summary

    Take guesswork out of tuning EM:◮ Lateen EM ties termination to a sign change;◮ Viterbi EM and Baby Steps don’t require initializers.

    Exploit multiple views and iterate (Blum and Mitchell, 1998):

    ◮ Lateen EM optimizes for Viterbi parse trees and forests;◮ Baby Steps focuses on simpler sentences in the data.

    Capitalize on EM’s advantages:◮ guarantee to not harm a primary likelihood;◮ begin with large steps in a parameter space.

    V.I. Spitkovsky (Stanford & Google) Ph.D. Thesis Oral Defense Gates #415 (2013-08-14) 26 / 60

  • Lateen EM

    Optimization: Summary

    Take guesswork out of tuning EM:◮ Lateen EM ties termination to a sign change;◮ Viterbi EM and Baby Steps don’t require initializers.

    Exploit multiple views and iterate (Blum and Mitchell, 1998):

    ◮ Lateen EM optimizes for Viterbi parse trees and forests;◮ Baby Steps focuses on simpler sentences in the data.

    Capitalize on EM’s advantages:◮ guarantee to not harm a primary likelihood;◮ begin with large steps in a parameter space.

    Ameliorate EM’s disadvantages:

    V.I. Spitkovsky (Stanford & Google) Ph.D. Thesis Oral Defense Gates #415 (2013-08-14) 26 / 60

  • Lateen EM

    Optimization: Summary

    Take guesswork out of tuning EM:◮ Lateen EM ties termination to a sign change;◮ Viterbi EM and Baby Steps don’t require initializers.

    Exploit multiple views and iterate (Blum and Mitchell, 1998):

    ◮ Lateen EM optimizes for Viterbi parse trees and forests;◮ Baby Steps focuses on simpler sentences in the data.

    Capitalize on EM’s advantages:◮ guarantee to not harm a primary likelihood;◮ begin with large steps in a parameter space.

    Ameliorate EM’s disadvantages:◮ avoid taking disproportionately many (and ever-smaller)

    steps to approach a likelihood’s fixed point

    V.I. Spitkovsky (Stanford & Google) Ph.D. Thesis Oral Defense Gates #415 (2013-08-14) 26 / 60

  • Lateen EM

    Optimization: Summary

    Take guesswork out of tuning EM:
    ◮ Lateen EM ties termination to a sign change;
    ◮ Viterbi EM and Baby Steps don’t require initializers.

    Exploit multiple views and iterate (Blum and Mitchell, 1998):
    ◮ Lateen EM optimizes for Viterbi parse trees and forests;
    ◮ Baby Steps focuses on simpler sentences in the data.

    Capitalize on EM’s advantages:
    ◮ guarantee to not harm a primary likelihood;
    ◮ begin with large steps in a parameter space.

    Ameliorate EM’s disadvantages:
    ◮ avoid taking disproportionately many (and ever-smaller)
      steps to approach a likelihood’s fixed point;
    ◮ escape by changing objectives and mixing solutions!

    V.I. Spitkovsky (Stanford & Google) Ph.D. Thesis Oral Defense Gates #415 (2013-08-14) 26 / 60

  • Lateen EM

    Constraints

    Web Markup

    Punctuation

    Capitalization

    V.I. Spitkovsky (Stanford & Google) Ph.D. Thesis Oral Defense Gates #415 (2013-08-14) 27 / 60

  • Overview

    Constraints: Supervised and Unsupervised

    compact summaries of high-level insights into a domain
    — can significantly reduce the search space
    — often easier to enforce than to model

    relevant to unsupervised learning (less rope to hang self)
    — in general, steer at the “right” regularities in data
    — linguistic structure underdetermined by raw text

    partial bracketings (Pereira and Schabes, 1992)

    synchronous grammars (Alshawi and Douglas, 2000)

    linear-time parsing, skewness of trees, etc. (Seginer, 2007)

    sparse posterior regularization (Ganchev et al., 2009)

    V.I. Spitkovsky (Stanford & Google) Ph.D. Thesis Oral Defense Gates #415 (2013-08-14) 28 / 60


  • Linguistic Analysis

    Syntax of Markup: Common Constituents

    ..., but [S [NP the Toronto Star][VP reports [NP this][PP in the softest possible way],[S stating ...]]]

    S → NP VP
      → DT NNP NNP VBZ NP PP S

    V.I. Spitkovsky (Stanford & Google) Ph.D. Thesis Oral Defense Gates #415 (2013-08-14) 29 / 60

  • Linguistic Analysis

    Syntax of Markup: Constituent Productions

                             %
    NP → NNP NNP           9.6
    NP → NNP               4.6
    NP → NP PP             3.4
    NP → NNP NNP NNP       2.4
    NP → DT NNP NNP        2.1
    NP → NN                1.8
    NP → DT NNP NNP NNP    1.7
    NP → DT NN             1.7
    NP → DT NNP NNP        1.6
    S → NP VP              1.4
    NP → DT NNP NNP NNP    1.2
    NP → DT JJ NN          1.1
    NP → NNS               1.0
    NP → JJ NN             0.8
    NP → NP NP             0.8
    (total)               35.3

    V.I. Spitkovsky (Stanford & Google) Ph.D. Thesis Oral Defense Gates #415 (2013-08-14) 30 / 60



  • Linguistic Analysis

    Syntax of Markup: Common Dependencies

    ..., but [S [NP the Toronto Star][VP reports [NP this][PP in the softest possible way],[S stating ...]]]

    DT NNP NNP VBZ DT IN DT JJS JJ NN

    DT NNP VBZ

    “the Star reports”

    V.I. Spitkovsky (Stanford & Google) Ph.D. Thesis Oral Defense Gates #415 (2013-08-14) 31 / 60

  • Linguistic Analysis

    Syntax of Markup: Head-Outward Spawns

                   %
    NNP         24.4
    NN           8.1
    DT NNP       6.1
    DT NN        5.9
    NNS          4.5
    NNPS         1.4
    VBG          1.3
    NNP NNP NN   1.2
    VBD          1.0
    IN           1.0
    VBN          1.0
    DT JJ NN     0.9
    VBZ          0.9
    POS NNP      0.9
    JJ           0.8
    (total)     59.4

    V.I. Spitkovsky (Stanford & Google) Ph.D. Thesis Oral Defense Gates #415 (2013-08-14) 32 / 60


  • Transition Formatting

    Formatting:

    strong connection between markup and syntax

    yields a suite of accurate parsing constraints

    improves grammar induction over web data, but...
    ◮ works better with more natural-language resources!

    missing structural cues:
    — e.g., punctuation (will make heavy use)
      and capitalization (will not discuss today)

    raw word streams often difficult even for humans
    — e.g., transcribed utterances (Kim and Woodland, 2002)

    V.I. Spitkovsky (Stanford & Google) Ph.D. Thesis Oral Defense Gates #415 (2013-08-14) 33 / 60

  • Example Raw Text

    Example: Raw Word Stream

    ALTHOUGH IT PROBABLY HAS REDUCED THE LEVEL OF EXPENDITURES FOR SOME
    PURCHASERS UTILIZATION MANAGEMENT LIKE MOST OTHER COST CONTAINMENT
    STRATEGIES DOESN’T APPEAR TO HAVE ALTERED THE LONG-TERM RATE OF
    INCREASE IN HEALTH-CARE COSTS THE INSTITUTE OF MEDICINE AN AFFILIATE OF
    THE NATIONAL ACADEMY OF SCIENCES CONCLUDED AFTER A TWO-YEAR STUDY

    V.I. Spitkovsky (Stanford & Google) Ph.D. Thesis Oral Defense Gates #415 (2013-08-14) 34 / 60


  • Example Formatted Text

    Example:

    [SBAR Although it probably has reduced the level of
    expenditures for some purchasers], [NP utilization
    management] — [PP like most other cost
    containment strategies] — [VP doesn’t appear to
    have altered the long-term rate of increase in
    health-care costs], [NP the Institute of Medicine],
    [NP an affiliate of the National Academy of
    Sciences], [VP concluded after a two-year study].

    V.I. Spitkovsky (Stanford & Google) Ph.D. Thesis Oral Defense Gates #415 (2013-08-14) 35 / 60


  • Example Strong Assumption

    Intuition:

    strong constraint: (head ← head) in training

    Other countries , including West Germany ,

    may have a hard time justifying continued membership .

    V.I. Spitkovsky (Stanford & Google) Ph.D. Thesis Oral Defense Gates #415 (2013-08-14) 36 / 60


  • Example Weak Assumption

    Intuition:

    weak constraint: (head ← external word) in inference

    IFI also has nonvoting preferred shares ,

    which are quoted on the Milan stock exchange .

    V.I. Spitkovsky (Stanford & Google) Ph.D. Thesis Oral Defense Gates #415 (2013-08-14) 37 / 60
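One reading of the strong constraint above is that each inter-punctuation fragment is "sealed": exactly one of its tokens attaches outside the fragment, and every other token attaches internally. The following is a hypothetical sketch of that check over a flat head-index encoding, not the thesis implementation.

```python
# Hypothetical sketch (one reading of the slides' strong constraint, not
# the thesis code): an inter-punctuation fragment is "sealed" if exactly
# one of its tokens attaches outside the fragment.

def sealed(heads, span):
    """heads[i] is the parent index of token i (-1 for the root);
    span is a (start, end) half-open token range for one fragment."""
    start, end = span
    outward = [i for i in range(start, end)
               if not (start <= heads[i] < end)]   # parent outside the span
    return len(outward) == 1

# Toy sentence of 6 tokens split as [0, 3) and [3, 6):
# token 1 heads the first fragment and attaches to token 4, the root.
heads = [1, 4, 1, 4, -1, 4]
print(sealed(heads, (0, 3)), sealed(heads, (3, 6)))
```

A constraint like this prunes candidate trees during training rather than being modeled probabilistically, which is what makes it cheap to enforce.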


  • Linguistic Analysis Constituents

    Linguistic Analysis:

    punctuation and syntax are related
    (Nunberg, 1990; Briscoe, 1994;
    Jones, 1994; Doran, 1998, inter alia)

    49.4% of inter-punctuation
    fragments are constituents

    many fragments derive
    precisely themselves:

               %
    IN       9.6
    NN       7.2
    NNP      6.3
    CD       3.8
    VBD      3.2
    VBZ      3.0
    RB       2.8
    VBG      2.2
    VBP      1.9
    NNS      1.8
    WDT      1.6
    MD       1.1
    VBN      1.1
    IN VBD   1.0
    JJ       0.7
    (total) 52.8

    V.I. Spitkovsky (Stanford & Google) Ph.D. Thesis Oral Defense Gates #415 (2013-08-14) 38 / 60
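The inter-punctuation fragments whose constituency the slide reports are just maximal punctuation-free token runs. A minimal sketch, assuming a small hand-picked punctuation inventory (the actual tokenization and mark set are assumptions here):

```python
# Minimal sketch: split a tokenized sentence into inter-punctuation
# fragments, i.e. maximal runs of non-punctuation tokens.

PUNCT = {",", ".", ";", ":", "--", "``", "''", "(", ")"}  # assumed inventory

def fragments(tokens):
    """Yield maximal runs of non-punctuation tokens."""
    run = []
    for tok in tokens:
        if tok in PUNCT:
            if run:
                yield run
            run = []
        else:
            run.append(tok)
    if run:
        yield run

sent = "Factory payrolls fell in September .".split()
print(list(fragments(sent)))
```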

  • Linguistic Analysis Strong Dependencies

    Training:

    strong constraint, e.g.,

    ... arrests followed a “ Snake Day ” at Utrecht ...

    — already 74.0% agreement with head-percolated trees.

    V.I. Spitkovsky (Stanford & Google) Ph.D. Thesis Oral Defense Gates #415 (2013-08-14) 39 / 60

  • Linguistic Analysis Weak Dependencies

    Inference:

    weak constraint, e.g.,

    Maryland Club also distributes tea , which ...

    — now 92.9% agreement with head-percolated trees!

    V.I. Spitkovsky (Stanford & Google) Ph.D. Thesis Oral Defense Gates #415 (2013-08-14) 40 / 60

  • Linguistic Analysis Weak Dependencies

    Modeling

    Context-Sensitive Unsupervised Tags

    Dependency-and-Boundary Models

    Reduced Models of Grammar

    V.I. Spitkovsky (Stanford & Google) Ph.D. Thesis Oral Defense Gates #415 (2013-08-14) 41 / 60

  • Gold Tags Unlexicalized Tokens

    Example: Actually, state-of-the-art uses gold tags!

    IN PRP RB VBZ VBN DT NN IN NNS IN DT
    NNS NN NN IN RBS JJ NN NN NNS VBZ RB
    VB TO VB VBN DT JJ NN IN NN IN JJ NNS
    DT NNP IN NNP DT NN IN DT NNP NNP IN
    NNPS VBD IN DT JJ NN

    V.I. Spitkovsky (Stanford & Google) Ph.D. Thesis Oral Defense Gates #415 (2013-08-14) 42 / 60


  • Gold Tags Unlexicalized Tokens

    Word Categories: Two benefits of tag-sets...

    Grouping: pooling statistics of words that play similar
    syntactic roles improves generalization (reduces sparsity):

    ◮ indeed, working directly with words does not do well.

    Disambiguation: for words that take on multiple parts of
    speech, knowing gold tags limits the parsing search space:

    ◮ reducing manual tags to monosemous clusterings lowers
      performance below that of the unsupervised categories
      constructed by Finkel and Manning (2009) / Clark (2003).

    Can improve with context-sensitive unsupervised clusters!

    1. start with a hard assignment; (standard word clustering)
    2. inject context-colored noise; (get out of the local optimum)
    3. Viterbi-train a bitag HMM. (the unTagger)

    V.I. Spitkovsky (Stanford & Google) Ph.D. Thesis Oral Defense Gates #415 (2013-08-14) 43 / 60
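Step 3 above Viterbi-trains a bitag HMM, and the decode at its core is standard Viterbi search. A minimal sketch with toy clusters and hand-set log-probabilities (the numbers and the two-cluster inventory are illustrative assumptions, not learned values):

```python
import math

# Sketch of the Viterbi decode at the core of step 3 (bitag HMM hard EM);
# the probability tables below are toy numbers, not trained parameters.

def viterbi(obs, states, start, trans, emit):
    """Most likely state sequence for `obs` under log-prob tables."""
    V = [{s: start[s] + emit[s].get(obs[0], -math.inf) for s in states}]
    back = []
    for o in obs[1:]:
        row, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: V[-1][p] + trans[p][s])
            row[s] = V[-1][prev] + trans[prev][s] + emit[s].get(o, -math.inf)
            ptr[s] = prev
        V.append(row)
        back.append(ptr)
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for ptr in reversed(back):      # follow backpointers to recover the path
        path.append(ptr[path[-1]])
    return path[::-1]

lp = math.log
states = ["D", "N"]                 # toy determiner-like / noun-like clusters
start = {"D": lp(0.7), "N": lp(0.3)}
trans = {"D": {"D": lp(0.1), "N": lp(0.9)},
         "N": {"D": lp(0.4), "N": lp(0.6)}}
emit = {"D": {"the": lp(0.9), "check": lp(0.1)},
        "N": {"the": lp(0.05), "check": lp(0.95)}}
print(viterbi(["the", "check"], states, start, trans, emit))
```

In hard EM, the decoded tag sequences would then be counted to re-estimate the tables, and the loop repeated.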


  • Gold Tags Unlexicalized Tokens

    Word Categories: Breaking the barrier...

    Swapped out gold tags for unsupervised word categories:

    System Description                                   Accuracy
    “punctuation” with gold tags                             58.4
    “punctuation” with monosemous induced tags               58.2 (−0.2)
    “punctuation” with context-sensitive induced tags        59.1 (+0.7)

    ◮ only a small drop from switching to (monosemous)
      unsupervised clusters — previous systems lost ∼5 pts;

    ◮ first state-of-the-art system to improve with
      (context-sensitive) unsupervised clusters!

    V.I. Spitkovsky (Stanford & Google) Ph.D. Thesis Oral Defense Gates #415 (2013-08-14) 44 / 60


  • Dependency-and-Boundary Models

    Dependency-and-Boundary Models (DBMs):

    Use boundary cues in head-driven dependency grammars.

    E.g., induce structure by working inwards from edges:

        DT    NN    VBZ  IN  DT   NN
        [The check]  is  in  [the mail].
         Subject NP          Object NP

    ◮ learn from left fringe (determiner DT) to parse object NP
    ◮ based on right fringe (noun NN), correctly parse subject NP
    ◮ between them, glean make-up of larger phrases (e.g., VP)
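The boundary cues above are easy to picture in code. A toy sketch (hypothetical helper, not the thesis implementation): a span's fringe categories are simply its first and last word classes, so both bracketed NPs share the same DT…NN boundary signature.

```python
def fringes(span):
    """Return the (left, right) fringe categories of a tagged span."""
    return span[0], span[-1]

# POS tags for: [The check] is in [the mail].
tags = ["DT", "NN", "VBZ", "IN", "DT", "NN"]
subject_np = tags[0:2]   # [The check]
object_np = tags[4:6]    # [the mail]

assert fringes(subject_np) == ("DT", "NN")
assert fringes(object_np) == ("DT", "NN")
```

Having learned what a DT…NN span looks like from one fringe, the model can recognize the same signature at the other.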

  • Dependency-and-Boundary Models

    Dependency-and-Boundary Models: Highlights

    DBM-1: Use words at the fringes.
    ◮ truly head-outward model (Alshawi, 1996)
    ◮ conditions on what can more often be seen!

    DBM-2: Fragments differ from complete sentences.
    ◮ incomplete fragments are uncharacteristically short
    ◮ roots of fragments are generally not verbs or modals

    DBM-3: Learn to stitch together fragments.

  • Dependency-and-Boundary Models

    Class-based, head-outward generation (Alshawi, 1996)

    [diagram: a head of class ch generates dependents to its right (dir = R): a first dependent cd1 while adjacent (adj = T), a second dependent cd2 once non-adjacent (adj = F), then STOP; ce is the class at the fringe of the derivation]

    PROOT(ch | comp)    PATTACH(cd | ch, dir, cross)    PSTOP(· | dir, adj, ce, comp)
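This generative story can be sketched in a few lines of Python. The sketch below is a simplified toy with made-up probability tables: stopping conditions only on (dir, adj), omitting the DBM's fringe class ce and completeness flag comp, and the attachment table is hypothetical.

```python
import random

def generate_dependents(head, p_attach, p_stop, direction, rng):
    """Head-outward generation on one side of a head: repeatedly decide
    whether to STOP (with a probability that depends on adjacency); if
    not stopping, attach a dependent class. After the first dependent
    is generated, adj flips from True to False."""
    dependents = []
    adjacent = True  # adj = T until the first dependent exists
    while rng.random() >= p_stop[(direction, adjacent)]:
        classes, weights = zip(*p_attach[(head, direction)].items())
        dependents.append(rng.choices(classes, weights=weights)[0])
        adjacent = False  # adj = F for all subsequent decisions
    return dependents

# Toy parameters, invented for illustration only.
p_stop = {("R", True): 0.4, ("R", False): 0.8}
p_attach = {("VBZ", "R"): {"IN": 0.7, "NN": 0.3}}

deps = generate_dependents("VBZ", p_attach, p_stop, "R", random.Random(0))
```

Because the stop probability rises once adj = F, heads tend to take few dependents per side, mirroring the geometric flavor of the head-outward process.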

  • Reduced Models

    How to exploit more data better?

    more text in long sentences, but those can be hard...
    ◮ require Viterbi training, punctuation constraints, etc.

    could we “start small” and still use more data?
    ◮ ... and wouldn’t it be nice if we could just split things up!

  • Reduced Models What If

    What if we chopped up input at punctuation?

    impact on quantity of data (with a 15-token threshold):
    ◮ more and simpler word sequences incorporated earlier
    ◮ much more dense coverage of available data:
      ⋆ number of training inputs goes up 2.2x
      ⋆ number of tokens increases 4.3x

    but, also impact on quality of data:
    ◮ many fewer complete sentences exhibiting full structure
    ◮ even less representative than short inputs:
      ⋆ but mostly phrases and clauses...

    ... however, we have an appropriate model family!
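The chopping step itself is simple. A rough sketch under assumptions: the input is already tokenized, the punctuation set is a hypothetical choice, and the real setup would additionally apply the 15-token threshold mentioned above.

```python
import re

def split_at_punctuation(tokens):
    """Chop a token sequence into inter-punctuation fragments:
    punctuation tokens act as separators and are discarded."""
    fragments, current = [], []
    for tok in tokens:
        if re.fullmatch(r"[,.;:!?()\"'-]+", tok):
            if current:
                fragments.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        fragments.append(current)
    return fragments

sent = "Bell , based in Los Angeles , makes and distributes things .".split()
frags = split_at_punctuation(sent)
# three fragments: "Bell" / "based in Los Angeles" / "makes and distributes things"
```

One long sentence thus yields several short, simpler training inputs, which is exactly the quantity-for-quality trade the slide describes.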

  • Reduced Models Example Continued

    Example (cont’d)

                               length   type   left   right
    complete                     51      S      IN     NN
    incomplete                   12      SBAR   IN     NNS
                                  2      NP     NN     NN
                                  6      PP     IN     NNS
    reduced model (DBM-0)        14      VP     VBZ    NNS
                                  4      NP     DT     NNP
                                  8      NP     DT     NNPS
                                  5      VP     VBD    NN

    partial parse forests, “easy-first” (Goldberg and Elhadad, 2010)

  • Reduced Models Example Continued

    Integrated System


  • Reduced Models Example Continued

    System: Combiners... (from Leapfrog)

    [flowchart: candidate grammars C1 and C2 are each lexicalized, C∗1 = L(C1) and C∗2 = L(C2), then combined as C∗1 + C∗2 = C+; arrows are scored by LD, with the output chosen via arg max]

  • Reduced Models Example Continued

    System: Inductors... (from Baby Steps and Lateen EM)

    [flowchart: inductor pipeline; recoverable node labels include HL·DBM, SDBM, and SDBM0 (each over data D), stages C, F, S, and D0, candidate grammars C1, C2, C′1, C′2, and split data D^l_split → D^{l+1}_split]

  • Reduced Models Example Continued

    System: Iterating... (from Baby Steps, Less is More, etc.)

    [diagram: iterations l = 1, 2, ..., 14, 15: each step runs HL·DBM over D^{l+1}_split, mapping C_l (initially ∅) to C_{l+1}; the final steps run HL·DBM over D^{45}_split and D^{45}_C]

  • Reduced Models Example Continued

    System: Results ... on Section 23 of WSJ

    Right-Branching (Klein and Manning, 2004)           31.7%
    DMV @10                                             34.2%
    Baby Steps @15                                      39.2%
    Baby Steps @45                                      39.4%
    Soft Parameter Tying (Cohen and Smith, 2009)        42.2%
    Less is More @15                                    44.1%
    Leapfrog @45                                        45.0%
    (Gimpel and Smith, 2012)                            53.1%
    (Gillenwater et al., 2010)                          53.3%
    (Bisk and Hockenmaier, 2012)                        53.3%
    (Blunsom and Cohn, 2010)                            55.7%
    (Tu and Honavar, 2012)                              57.0%
    Integrated System (no initializers or gold tags)    64.4%
    V.I. Spitkovsky (Stanford & Google) Ph.D. Thesis Oral