Statistical Relational Learning for NLP


Page 1: Statistical Relational Learning for NLP

Statistical Relational Learning for NLP

Ray Mooney & Razvan BunescuStatistical Relational Learning

Presented by Michele Banko

Page 2: Statistical Relational Learning for NLP

Outline

Quick intro to NLP
– Problems, History
– Why SRL?

SRL for two NLP tasks
– Information Extraction using RMNs
– Semantic Parsing using ILP

Page 3: Statistical Relational Learning for NLP

Quick Tour of NLP

Systems
– Translation, Question-Answering/NL Search, Reading Comprehension

Sub-tasks
– Part-of-Speech Tagging, Phrase Finding, Parsing [syntax, semantics], Named-entity Recognition

Page 4: Statistical Relational Learning for NLP

Quick Tour of NLP [2]

Knowledge engineering/linguistics vs. statistics

Recently, many NLP tasks have been treated as sequence labeling problems
– POS-tagging: The/DT cat/NN can/MD walk/VB
– NP-finding: Stanford/B-NP University/I-NP is/O in/O California/B-NP
– In effect, SRL with a single relation: adjacency

HMMs, CRFs to model, Viterbi to label (see the sketch below)
– Find the most probable assignment of labels
– States = part-of-speech tags
– Compute P(w|t): the probability of emitting word w in a particular state (tag) t
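A minimal Viterbi sketch for the HMM tagger just described. This is my own illustration, not from the slides; the dictionary-based probability tables and the tiny smoothing constant for unseen words are assumptions, and all probabilities are assumed nonzero.

    from math import log

    def viterbi(words, tags, start_p, trans_p, emit_p):
        # best[i][t]: log-prob of the best tag sequence for words[:i+1] ending in tag t
        best = [{t: log(start_p[t]) + log(emit_p[t].get(words[0], 1e-9)) for t in tags}]
        back = [{}]
        for i in range(1, len(words)):
            best.append({})
            back.append({})
            for t in tags:
                # best previous tag, scoring transition into t
                prev = max(tags, key=lambda s: best[i-1][s] + log(trans_p[s][t]))
                back[i][t] = prev
                best[i][t] = (best[i-1][prev] + log(trans_p[prev][t])
                              + log(emit_p[t].get(words[i], 1e-9)))  # emit P(w|t)
        # follow back-pointers from the best final tag
        t = max(tags, key=lambda s: best[-1][s])
        path = [t]
        for i in range(len(words) - 1, 0, -1):
            t = back[i][t]
            path.append(t)
        return list(reversed(path))

For the slide's example, tags would be ['DT','NN','MD','VB'] and words ['the','cat','can','walk'], with start_p, trans_p, and emit_p estimated from tagged training data.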

Page 5: Statistical Relational Learning for NLP

Why SRL?

NLP involves...

Complex reasoning about entities and the relationships between them
– Predicate logic

Resolving & integrating ambiguities on multiple levels (morphology, syntax, semantics...)
– More than just adjacency!
– Bayesian methods, graphical models

Page 6: Statistical Relational Learning for NLP

Intro to Information Extraction

Early 90s, DARPA Message Understanding Conference
– Identify references to named entities (people, companies, locations...)
– Multi-lingual, multi-document
– Attributes, relationships, events

Page 7: Statistical Relational Learning for NLP

Attributes

NAME:        Fletcher Maddox, Dr. Maddox
DESCRIPTORS: former Dean of the UCSD Business School, his father, the firm's CEO
CATEGORY:    PERSON

Facts

PERSON Employee_Of: Fletcher Maddox → UCSD Business School
                    Fletcher Maddox → La Jolla Genomatics

Source text: "Fletcher Maddox, former Dean of the UCSD Business School, announced the formation of La Jolla Genomatics together with his two sons. Dr. Maddox will be the firm's CEO. His son, Oliver, is the Chief Scientist and holds patents on many of the algorithms used in Geninfo."

Page 8: Statistical Relational Learning for NLP

IE for Protein Identification

Medline DB: 750K abstracts, 3700 proteins

"Production of nitric oxide ( NO ) in endothelial cells is regulated by direct interactions of endothelial nitric oxide synthase ( eNOS ) with effector proteins such as Ca2+ - calmodulin ... which avidly binds to the carboxyl terminal region of the eNOS oxygenase domain."

Approaches: rule-based, HMM, SVM, MaxEnt
– CRFs outperform the rest (Ramani et al., 2005)
– But may fail to capture long-distance dependencies

Page 9: Statistical Relational Learning for NLP

Collective IE using RMNs (Bunescu & Mooney, 2004)

Typical IE: extractions are made in isolation

Want to consider influences between extractions
– If the context surrounding one occurrence strongly indicates a protein, that should affect the tagging of other occurrences in different contexts
– Acronyms & their long forms

Use a sequence-labeling treatment to get all substrings that make up protein names
– Tags: Start, End, Continue, Unique, Other
– Classification of token sequences, not solo tokens (see the sketch below)

RMNs (Relational Markov Networks)
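A small sketch of how such a tagging is turned back into extracted names. The tag names come from the slide; the decoding code itself is my own illustration.

    def extract_names(tokens, tags):
        # tags: one of Start / Continue / End / Unique / Other per token
        names, current = [], []
        for tok, tag in zip(tokens, tags):
            if tag == 'Unique':          # single-token protein name
                names.append([tok])
            elif tag == 'Start':         # open a multi-token name
                current = [tok]
            elif tag == 'Continue' and current:
                current.append(tok)
            elif tag == 'End' and current:
                current.append(tok)      # close the name
                names.append(current)
                current = []
            else:                        # 'Other' or a malformed sequence
                current = []
        return [' '.join(n) for n in names]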

Page 10: Statistical Relational Learning for NLP

RMNs (Taskar et al., 2002)

For each document d in the collection
– Associate d with a set of candidate entities d.E
  Entities = token sequences → too many possible phrases
  Either constrain length or form (base NPs)
– Characterize each entity e in d.E with a predefined set of boolean features e.F
  e.label = 1 if e is a valid extraction
  e.HeadWord, e.POS_Left, e.BigramRight, e.Prefix (candidate generation sketched below)
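A minimal sketch of length-constrained candidate generation with a few boolean features. Feature names follow the slide where possible; max_len, the prefix length, and the exact feature encodings are my assumptions, and e.POS_Left / e.BigramRight would additionally need a POS tagger.

    def candidate_entities(tokens, max_len=4):
        # every token subsequence of up to max_len tokens is a candidate entity
        for i in range(len(tokens)):
            for j in range(i + 1, min(i + max_len, len(tokens)) + 1):
                yield (i, j)

    def boolean_features(tokens, i, j):
        # a tiny subset of e.F, returned as the set of active boolean features
        return {
            'HeadWord=' + tokens[j - 1],                              # e.HeadWord
            'Prefix=' + tokens[i][:3],                                # e.Prefix
            'WordLeft=' + (tokens[i - 1] if i > 0 else '<BOS>'),      # left context
        }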

Page 11: Statistical Relational Learning for NLP

Clique Templates

Find all subsets of entities satisfying a given constraint; for each such subset, form a clique c

Local Templates
– Number of hidden labels = 1
– Model correlations between an entity's observed features and its label

Global Templates
– Number of hidden labels > 1
– Model influences among multiple entities
– Overlap Template, Repeat Template, Acronym Template (a Repeat-style matcher is sketched below)
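As one concrete instance, a Repeat-style clique template groups candidates with the same text so that their label nodes can be tied by a shared potential. A minimal sketch; the grouping criterion and the (span, text) data layout are my assumptions.

    from collections import defaultdict

    def repeat_template_cliques(entities):
        # entities: list of (span, text) candidates from one document's d.E
        groups = defaultdict(list)
        for span, text in entities:
            groups[text.lower()].append(span)
        # each group of repeated phrases forms one clique over their label nodes
        return [spans for spans in groups.values() if len(spans) > 1]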

Page 12: Statistical Relational Learning for NLP

Using Local Templates

Variable nodes: labels of all candidate entities in the document

Potential nodes: represent correlations between entity attributes and labels by linking to variable nodes
– Edges: created by matching clique templates against d.E

[Figure: the RMN built by local templates and the equivalent factor graph; feature potentials such as φ HEAD=enzyme, φ POS_LEFT=NOUN, φ WORD_LEFT=the, φ PREFIX=A0 each connect to the entity's label node e.label]

Page 13: Statistical Relational Learning for NLP

Using Global Templates

Connect the label nodes of two or more entities

[Figure: acronym factor graph for "The antioxidant superoxide dismutase-1 (SOD1)"; the overlapping long-form candidates ("antioxidant superoxide dismutase-1", "superoxide dismutase-1", "dismutase-1") are tied to the entity V through an acronym potential φAT and OR potentials φOR via an auxiliary node uOR]
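For intuition, the surface pattern the Acronym template keys on is a long form followed by a parenthesized short form. A rough detector might look like the following; the matching heuristic is entirely my own assumption, and the real template only ties the candidates' labels rather than deciding them.

    import re

    def acronym_pairs(text):
        pairs = []
        for m in re.finditer(r'\(([A-Z][A-Za-z0-9-]{1,9})\)', text):
            short = m.group(1)
            words = text[:m.start()].split()
            # scan back for a word starting with the acronym's first letter,
            # within len(short) words of the parenthesis
            for k in range(1, min(len(short), len(words)) + 1):
                if words[-k][0].lower() == short[0].lower():
                    pairs.append((' '.join(words[-k:]), short))
                    break
        return pairs

On the slide's example this yields the pair ("superoxide dismutase-1", "SOD1").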

Page 14: Statistical Relational Learning for NLP

Inference & Learning in RMNs Inference

– Labels are only hidden features– Probability distribution over hidden entity

labels computed using Gibbs distribution– Find most probable assignment of values to

labels using max-product algorithm Learning

– Structure defined by clique templates– Find clique potentials that maximize likelihood

over training data using Voted Perceptron
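The Gibbs distribution in question has the standard RMN form (Taskar et al., 2002): the conditional probability of the label assignment d.Y given the observed features d.X factorizes over the cliques C instantiated by the templates,

    P(d.Y \mid d.X) = \frac{1}{Z(d.X)} \prod_{c \in C} \phi_c(d.X_c, d.Y_c),
    \qquad
    Z(d.X) = \sum_{Y'} \prod_{c \in C} \phi_c(d.X_c, Y'_c)

where Z is the partition function and cliques matched by the same template share potential parameters.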

Page 15: Statistical Relational Learning for NLP

Experiments

Datasets
– Yapex, Aimed: ~200 Medline abstracts, ~4000 protein references each

Systems
– LT-RMN: RMN with local + overlap templates
– GLT-RMN: RMN with local + global templates
– CRF: McCallum 2002

Evaluation
– Position-based precision, recall

Page 16: Statistical Relational Learning for NLP

Results

Yapex

Method     Precision   Recall   F-Measure
LT-RMN       70.79     53.81      61.14
GLT-RMN      69.71     65.76      67.68
CRF          72.45     58.64      64.81

Aimed

Method     Precision   Recall   F-Measure
LT-RMN       81.33     72.79      76.82
GLT-RMN      82.79     80.04      81.39
CRF          85.37     75.90      80.36

Page 17: Statistical Relational Learning for NLP

Intro to Semantic Parsing

NL input → logical form
– Assignment of semantic roles
– "John gives the book to Mary" → gives1(subj: John2, dobj: book3, iobj: Mary4)

Ambiguities: "Every man loves a woman"
– Is there one woman loved by all, or does each man love some (possibly different) woman?
  ∃Y ∀X: man(X) ^ woman(Y) -> loves(X,Y)
  ∀X ∃Y: man(X) ^ woman(Y) -> loves(X,Y)
– The LF just says: loves1(every m1:man(m1), a w1:woman(w1))

Page 18: Statistical Relational Learning for NLP

Previous Work in Semantic Parsing

Hand-built/linguistic systems
– Grammar development
– NLPwin (MSR)

Data-driven approaches
– Tagging problem: for each NP, tag its role
  Gildea & Jurafsky (2002): NB classifier combination using lexical and syntactic features, previous labels
  Jurafsky et al. (2004): SVMs
– Sentence + LF → parser
  CHILL: ILP to learn a generalized Prolog parser

Page 19: Statistical Relational Learning for NLP

ILP in CHILL

Parser induction = learning the control rules of a shift-reduce parser
– Shift symbols from the input string onto a stack
– Reduce items on the stack by applying a matching grammar rule

Can be encoded as a Prolog program:
– parse(S,Parse) :- parse([],S,[Parse],[]).

Start with 3 generic operators (plus the basic SHIFT):
– INTRODUCE pushes a predicate onto the stack based on the word input
  Lexicon: 'capital' -> capital(_,_)
– COREF_VARS unifies two variables under consideration
– DROP_CONJ embeds one predicate as an argument of another
– SHIFT pushes the word input onto the stack

Page 20: Statistical Relational Learning for NLP

What is the capital of Texas?

Stack / Input Buffer / Action used to reach this state:

1. Stack:  [answer(_,_):[]]
   Input:  [what,is,the,capital,of,texas,?]
   Action: (initial state)

2. Stack:  [answer(_,_):[the,is,what]]
   Input:  [capital,of,texas,?]
   Action: SHIFT, SHIFT, SHIFT

3. Stack:  [capital(_,_):[], answer(_,_):[the,is,what]]
   Input:  [capital,of,texas,?]
   Action: INTRODUCE

4. Stack:  [capital(C,_):[], answer(C,_):[the,is,what]]
   Input:  [capital,of,texas,?]
   Action: COREF_VARS

5. Stack:  [capital(C,_):[of,capital], answer(C,_):[the,is,what]]
   Input:  [texas,?]
   Action: SHIFT, SHIFT

6. Stack:  [const(S,stateid(texas)):[], capital(C,S):[of,capital], answer(C,_):[the,is,what]]
   Input:  [texas,?]
   Action: INTRODUCE, COREF_VARS

7. Stack:  [answer(C, (capital(C,S),const(S,stateid(texas)))):[?,texas,of,capital,the,is,what]]
   Input:  []
   Action: DROP_CONJ, SHIFT, SHIFT

Page 21: Statistical Relational Learning for NLP

Learning Control Rules

Operator Generation
– Initial parser is too general, will produce spurious parses
– Use training data to extend the program

Example Analysis (see the sketch below)
– Use the general parser, recording parse states
– Positive examples
  Parse states to which an operator should be applied
  Find the 1st correct parse of a training pair; the operators used to achieve its subgoals become positive examples
– Negative examples, for single-parse systems
  S is a negative example for the current action if S is a positive example for a previous action
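One way to read the slide's rule in code; this encoding is my own, and the state representation, the operator ordering, and the trace format are all assumptions.

    def collect_examples(trace, op_order):
        # trace: (state, operator) pairs recorded along the 1st correct parse
        pos = {op: [] for op in op_order}
        for state, op in trace:
            pos[op].append(state)           # state is positive for the op applied there
        # single-parse rule: a state is a negative example for an operator
        # if it is a positive example for a previous operator in the ordering
        neg = {op: [] for op in op_order}
        for i, op in enumerate(op_order):
            for earlier in op_order[:i]:
                neg[op].extend(pos[earlier])
        return pos, neg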

Page 22: Statistical Relational Learning for NLP

Induction in CHILL

Control-Rule Induction
– Cover all positive examples, none of the negatives
– Bottom-up: compact the rule set by forming Least-General Generalizations of clause pairs
– Top-down: specialize overly-general rules by adding literals
– Invention of new predicates:

  op([X,[Y,det:the]], [the|Z],A,B) :- animate(Y).
  animate(man). animate(boy). animate(girl). ...

Fold the learned constraints back into the general parser

Page 23: Statistical Relational Learning for NLP

ProbCHILL: Parsing via ILP Ensembles

ILP + Statistical Parsing
– ILP: not forced to make decisions based on predetermined features
– SP: handles multiple sources of uncertainty

Ensemble classifier
– Combine the outputs of more than one classifier
– Bagging, boosting

TABULATE: generate several ILP hypotheses
– Use SP to estimate a probability for each potential operator
– Find the most probable semantic parse via beam search (sketched below), where
  P(parse state) = product of the probabilities of the operators used to reach that state
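A sketch of this kind of beam search, in my own code: the score of a parse state is the product of the probabilities of the operators used to reach it, kept as a sum of negative log-probabilities. The callback signatures (applicable_ops, op_prob, is_final) and the beam/step limits are assumptions.

    import heapq
    from math import log

    def beam_parse(init_state, applicable_ops, op_prob, is_final, beam=5, max_steps=50):
        # each frontier item: (cost, state); cost = -sum of log operator probs,
        # so minimizing cost maximizes the product of operator probabilities
        frontier = [(0.0, init_state)]
        for _ in range(max_steps):
            done = [(c, s) for c, s in frontier if is_final(s)]
            if done:
                return min(done, key=lambda x: x[0])[1]   # most probable parse
            expanded = [(c - log(op_prob(op, s)), s2)
                        for c, s in frontier
                        for op, s2 in applicable_ops(s)]
            if not expanded:
                return None                               # dead end
            frontier = heapq.nsmallest(beam, expanded, key=lambda x: x[0])
        return None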

Page 24: Statistical Relational Learning for NLP

Experiments

U.S. Geography DB with 800 Prolog facts

250 questions from 50 humans, annotated with Prolog queries
  "What is the capital of the state with the highest population?"
  answer(C, (capital(S,C), largest(P, (state(S), population(S,P)))))

10-fold CV, measuring:

  Recall    = # correct queries produced / # test sentences
  Precision = # correct queries produced / # complete parses produced

Page 25: Statistical Relational Learning for NLP

Results

System      Recall   Precision   F-measure
Geobase      56.00     96.40       70.85
CHILL        68.50     97.65       80.52
ProbCHILL    80.40     88.16       84.10