Relation Extraction Pierre Bourreau LSI-UPC PLN-PTM.
Relation Extraction: Description
Finding relations between entities in a text
Filling the slots of pre-defined templates: one-value-per-field or multi-value
Depends on prior analysis: chunking, tokenization, sentence parsing…
Plan
Relation Extraction description
Sampling templates (Cox, Nicolson, Finkel, Manning)
Reducing deep analysis errors…
Conclusion
First Example: Sampling Templates
Example: workshop announcements (PASCAL corpus)
Relations to extract: dates of events; workshop and conference names, acronyms and URLs
Domain knowledge: constraints on dates, constraints on names
PASCAL Corpus: semi-structured corpus
<[email protected]>
Type: cmu.andrew.academic.bio
Topic: "MHC Class II: A Target for Specific Immunomodulation of the Immune Response"
Dates: 3-May-95
Time: <stime>3:30 PM</stime>
Place: <location>Mellon Institute Conference Room</location>
PostedBy: Helena R. Frey on 26-Apr-95 at 11:09 from andrew.cmu.edu
Abstract:
Seminar: Departments of Biological Sciences Carnegie Mellon and University of Pittsburgh
Name: <speaker>Dr. Jeffrey D. Hermes</speaker>
Affiliation: Department of Autoimmune Diseases Research & Biophysical Chemistry Merck Research Laboratories
Title: "MHC Class II: A Target for Specific Immunomodulation of the Immune Response"
Host/e-mail: Robert Murphy, [email protected]
Date: Wednesday, May 3, 1995
Time: <stime>3:30 p.m.</stime>
Place: <location>Mellon Institute Conference Room</location>
Sponsor: MERCK RESEARCH LABORATORIES
Schedule for 1995 follows: (as of 4/26/95)
Biological Sciences Seminars 1994-1995
Date Speaker Host
April 26 Helen Salz Javier L~pez
May 3 Jefferey Hermes Bob Murphy MERCK RESEARCH LABORATORIES
PASCAL Corpus: semi-structured corpus
<[email protected]>
Type: cmu.andrew.org.heinz.great-lake
Topic: Re: Presentation
CC:
Dates: 25-Oct-93
Time: <stime>12:30</stime>
PostedBy: Richard Florida on 21-Oct-93 at 17:00 from andrew.cmu.edu
Abstract:
Folks:
<paragraph> <sentence>Our client has requested that the presentation be postponed until Monday during regular class-time</sentence>. <sentence>He has been asked to make a presentaion for the Governor of Michigan and Premier of Ontario tommorrow morning in Canada, and was afraid he could not catch a plane in time to make our presentation</sentence>. <sentence>After consulting with Rafael and a sub group of project managers, it was decided that Monday was the best feasible presentation alternative</sentence>. <sentence>Greg has been able to secure Room 2503 in Hamburg Hall for our presentation Monday during regular class-time</sentence>. </paragraph>
<paragraph><sentence>We will meet tommmorow in <location>2110</location> at <stime>12:30</stime> (lunch provided) to finalize presentation and briefing book</sentence>. <sentence>Also, the client has faxed a list of reactions and questions for discussion which we should review</sentence>. <sentence>Thanks very much for your hard work and understanding</sentence>. <sentence>Look forward to seeing you tommorrow</sentence>.</paragraph>
Richard
Idea
Sampling Templates: generate all available templates, then give a probability to each of them
Relational model:
Constraints on dates (order): 1. submission dates 2. acceptance dates 3. workshop dates / camera-ready dates
Constraints on names: the slots are name, acronym, URL; the URL is generated from the acronym
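The date-ordering constraint can be checked mechanically. A minimal sketch, assuming a template is a dict from slot name to a date (or None for an empty slot). The slot names follow the date templates of this presentation; the helper `violations` and the exact pairs in `ORDER_CONSTRAINTS` are illustrative assumptions, not the original system.

```python
from datetime import date
from typing import Dict, Optional

# Assumed reading of the ordering: submission, then acceptance,
# then workshop / camera-ready (both in third position).
ORDER_CONSTRAINTS = [
    ("SUB_DATE", "ACC_DATE"),
    ("ACC_DATE", "WORK_DATE"),
    ("ACC_DATE", "CAMREADY_DATE"),
]


def violations(template: Dict[str, Optional[date]]) -> int:
    """Count the ordering constraints broken by the filled slots."""
    count = 0
    for earlier, later in ORDER_CONSTRAINTS:
        a, b = template.get(earlier), template.get(later)
        # An empty slot cannot violate a constraint.
        if a is not None and b is not None and a > b:
            count += 1
    return count


# Hypothetical templates, for illustration only.
good = {
    "SUB_DATE": date(2005, 1, 15),
    "ACC_DATE": date(2005, 2, 1),
    "WORK_DATE": date(2005, 3, 1),
    "CAMREADY_DATE": None,
}
bad = {
    "SUB_DATE": date(2005, 3, 1),
    "ACC_DATE": date(2005, 2, 1),
    "WORK_DATE": date(2005, 1, 15),
    "CAMREADY_DATE": None,
}
print(violations(good), violations(bad))
```

A template with empty slots is never penalized by this check alone, which is why a separate present/absent probability is needed when scoring.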
Baselines
CRF: cliques of maximum size 2; Viterbi algorithm; tokens from GATE tokenization
CMM: same setup, with a window of the four previous tokens
Templates sampling
Tokens p(Li|Li-1) or p(Li|Li-1,…, Li-4) on 100 of documents
Template: Each slot holds one/no filler value -> date templates:
SUB_DATE ACC_DATE WORK_DATE CAMREADY_DATE
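Since each slot holds one filler or none, the space of candidate templates is a Cartesian product. A minimal sketch, assuming the candidate fillers per slot have already been proposed by the token model; `enumerate_templates` is a hypothetical helper, and exhaustive enumeration stands in here for the sampling procedure, which the slides do not detail.

```python
from itertools import product


def enumerate_templates(candidates):
    """Build every template where each slot takes one candidate or stays empty."""
    slots = list(candidates)
    # None represents an unfilled slot.
    options = [[None] + list(candidates[slot]) for slot in slots]
    return [dict(zip(slots, combo)) for combo in product(*options)]


# Hypothetical candidate fillers, for illustration only.
candidates = {
    "SUB_DATE": ["1 March"],
    "ACC_DATE": ["1 April", "5 April"],
    "WORK_DATE": [],
    "CAMREADY_DATE": ["20 April"],
}
templates = enumerate_templates(candidates)
print(len(templates))  # 2 * 3 * 1 * 2 combinations
```

The count grows multiplicatively with the number of candidates per slot, which is one reason to sample templates instead of listing them all.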
Templates sampling
Tokens p(Li|Li-1) or p(Li|Li-1,…, Li-4) on 100 of documents
Template: Each slot holds one/no filler value -> name templates:
CONF_NAME CONF_ACRO CONF_URL WORK_NAME WORK_ACRO WORK_URL
Templates scoring: Date Model
PA/P: probability of present/absent fields, set from training data.
Po: ordering probability; constraint violations are penalized.
Prel = PA/P * Po
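The date score is the product of the present/absent probability PA/P and the ordering probability Po. A minimal sketch under stated assumptions: PA/P is taken as a product of independent per-slot presence probabilities, and Po as a fixed penalty raised to the number of violated constraints. Both choices, the function names and the numbers are illustrative; the slides only say that PA/P is set from training data and that violations are penalized.

```python
def presence_prob(template, p_present):
    """PA/P: product over slots of P(filled) or P(empty), assumed independent."""
    prob = 1.0
    for slot, p in p_present.items():
        prob *= p if template.get(slot) is not None else 1.0 - p
    return prob


def ordering_prob(n_violations, penalty=0.1):
    """Po: assumed to shrink by a constant factor per violated constraint."""
    return penalty ** n_violations


def date_score(template, p_present, n_violations):
    """Prel = PA/P * Po."""
    return presence_prob(template, p_present) * ordering_prob(n_violations)


# Hypothetical presence probabilities and template.
p_present = {"SUB_DATE": 0.9, "WORK_DATE": 0.8}
template = {"SUB_DATE": "1 March", "WORK_DATE": None}
print(date_score(template, p_present, n_violations=0))
print(date_score(template, p_present, n_violations=1))
```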
Templates scoring: Name Model
Name->Acronym: independent module (likelihood score – Chang 2002): Pnam->acr
Acronym->URL: empirical probability from training: Pacr->url
Problem: a missing entry gives an advantage to incomplete templates.
PA/P: weights the templates (in training, most values are filled)
Prel = Pnam->acr * Pacr->url * PA/P
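The role of PA/P in the name model can be seen with a small numeric example. A minimal sketch, assuming a missing link simply contributes no factor to the product; `name_score` and all probabilities are hypothetical values chosen for illustration.

```python
def name_score(p_nam_acr, p_acr_url, p_presence):
    """Prel = Pnam->acr * Pacr->url * PA/P, skipping factors for empty slots."""
    score = p_presence
    for p in (p_nam_acr, p_acr_url):
        # None means the corresponding slot is empty (assumed convention).
        if p is not None:
            score *= p
    return score


# Without the presence weight (PA/P fixed to 1), dropping the URL wins,
# because every probability factor is at most 1.
raw_full = name_score(0.6, 0.5, p_presence=1.0)
raw_partial = name_score(0.6, None, p_presence=1.0)

# With PA/P learned from training data, where most values are filled,
# the complete template is preferred again.
full = name_score(0.6, 0.5, p_presence=0.7)
partial = name_score(0.6, None, p_presence=0.1)
print(raw_full < raw_partial, full > partial)
```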
Results
No improvement over CRF
CRF accepts variation (e.g. in names) => lower recall
The relational model does not improve on CRF (not shown on the graph)
Small CRF window => less information in the distribution
Substantial improvement over CMM (5%)
Plan
Relation Extraction description
Sampling templates
Reducing deep analysis errors (Zhao, Grishman)
Conclusion
Problem
Several levels of syntactic analysis are available for the task: tokenization, chunking, sentence parsing…
The more information an analysis gives, the less accurate it is.
=> Combine them to correct errors
ACE task… remember
Entities: PERson – ORGanisation – FACility – GeoPolitical Entity (GPE) – LOCation – WEApon – VEHicle
Mentions: NAM (proper), NOM (nominal), PRO (pronoun)
Relations: EMP-ORG, PHYS, GPE-AFF, PER-SOC, DISC, ART, Other
Kernel, SVM … nice properties
Kernel: a function that replaces the scalar product of vectors. It lets us map a problem into a higher-dimensional space to solve it. Sums and products of kernels are kernels.
SVM: can pick out the features that give the best separation
The relational model
R=(arg1, arg2, seq, link, path)
  arg1, arg2: the two entities to compare
  seq=(t1, …, tn): sequence of intervening tokens
  link=(t1, …, tm): like seq, but with only the important words
  path: a dependency path…
T=(word, pos, base)
  pos: part-of-speech tag
  base: morphological base form
E=(tk, type, subtype, mtype)
  type: the ACE entity type
  subtype: refinement of the type
  mtype: the way the entity is mentioned
DT=(T, dseq)
  dseq=(arc1, …, arcn)
ARC=(w, dw, label, e)
  w: current token
  dw: token connected to w
  label: role label of this arc
  e: direction of the arc
The relational model: example
arg1=((“areas”, “NNS”, “area”, dseq), “LOC”, “region”, “NOM”)
arg1.dseq=((OBJ, areas, in, 1), (OBJ, areas, controlled, 1))
path=((OBJ, areas, controlled, 1), (SBJ, controlled, troops, 0))
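The tuples of the relational model map naturally onto small record types. A minimal sketch using dataclasses, with the `arg1` and `path` values of the example. The class names are assumptions. Note that the example lists arcs as (label, w, dw, e) while the ARC definition lists (w, dw, label, e), so keyword arguments are used to avoid depending on field order.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Arc:
    w: str      # current token
    dw: str     # token connected to w
    label: str  # role label of the arc
    e: int      # direction of the arc


@dataclass(frozen=True)
class Token:
    word: str
    pos: str   # part-of-speech tag
    base: str  # morphological base form


@dataclass(frozen=True)
class DepToken:
    token: Token
    dseq: Tuple[Arc, ...]  # arcs attached to this token


@dataclass(frozen=True)
class Entity:
    tk: DepToken
    type: str
    subtype: str
    mtype: str


obj_in = Arc(w="areas", dw="in", label="OBJ", e=1)
obj_controlled = Arc(w="areas", dw="controlled", label="OBJ", e=1)
sbj_troops = Arc(w="controlled", dw="troops", label="SBJ", e=0)

arg1 = Entity(
    tk=DepToken(Token("areas", "NNS", "area"), (obj_in, obj_controlled)),
    type="LOC",
    subtype="region",
    mtype="NOM",
)
path = (obj_controlled, sbj_troops)
print(arg1.tk.token.base, [arc.dw for arc in path])
```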
Kernels
1. Argument kernel: matches two tokens by comparing each of their fixed attributes (word, pos, type…)
2. Bigram kernel: matches tokens within a window of size 1
3. Link sequence kernel: relations often occur in a short context.
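Attribute matching and sequence matching can be sketched in a few lines. This is an assumed simplification, not the authors' exact definitions: tokens are (word, pos, base) tuples, a token kernel counts the attributes on which two tokens agree, and a sequence kernel sums token kernels position by position. The example sequences are hypothetical.

```python
def token_kernel(t1, t2):
    """Number of attributes (word, pos, base) on which two tokens agree."""
    return sum(a == b for a, b in zip(t1, t2))


def link_sequence_kernel(seq1, seq2):
    """Sum of token kernels over aligned positions, up to the shorter sequence."""
    return sum(token_kernel(a, b) for a, b in zip(seq1, seq2))


t1 = ("areas", "NNS", "area")
t2 = ("area", "NN", "area")

link1 = [("controlled", "VBN", "control"), ("by", "IN", "by")]
link2 = [("held", "VBN", "hold"), ("by", "IN", "by")]

print(token_kernel(t1, t2), token_kernel(t1, t1))
print(link_sequence_kernel(link1, link2))
```

Two links that share only part-of-speech tags still get a non-zero score, which is how the kernel generalizes across different wordings of the same relation.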
Kernels (2)
4. Dependency path kernel: how similar are two paths?
5. Local dependency kernel: same idea as the path kernel, but more informative. Helpful if no dependency path exists.
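One way to answer "how similar are two paths?" is to compare their arcs pairwise. A minimal sketch under stated assumptions: arcs are written as (label, w, dw, e) tuples, as in the example of this presentation; two arcs match only if label and direction agree, with a bonus when the connected word is the same. The scoring scheme and the second path are hypothetical, not taken from the original paper.

```python
def arc_match(a, b):
    """Score two arcs: 0 if label or direction differ, else 1 (+1 if same dw)."""
    label_a, _, dw_a, e_a = a
    label_b, _, dw_b, e_b = b
    if label_a != label_b or e_a != e_b:
        return 0
    return 1 + (dw_a == dw_b)


def path_kernel(p1, p2):
    """Sum of arc scores over all pairs of arcs from the two paths."""
    return sum(arc_match(a, b) for a in p1 for b in p2)


path = (("OBJ", "areas", "controlled", 1), ("SBJ", "controlled", "troops", 0))
other = (("OBJ", "cities", "held", 1), ("SBJ", "held", "troops", 0))

print(path_kernel(path, other), path_kernel(path, path))
```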
Results: adding information to the SVM
The more information we give, the better the result.
The link sequence kernel boosts results.
Results: SVM or KNN
SVM behaves better overall
The polynomial extension has no effect on KNN.
Training problem in the last three.
… good results on the official ACE task… but the scores are not public, so no comparison is available
Conclusion
A really simple method
Nice properties of kernels/SVMs
This method is generic (tested on annotated text)
SVMs seem to perform better on this task…
… but the two methods are hard to compare, as their goals are different.