QuASI: Question Answering using Statistics, Semantics, and Inference
Transcript of QuASI: Question Answering using Statistics, Semantics, and Inference
QuASI: Question Answering using
Statistics, Semantics, and Inference
Marti Hearst, Jerry Feldman, Chris Manning, Srini Narayanan
Univ. of California-Berkeley / ICSI / Stanford University
Outline
Project Overview
Three topics:
• Assigning semantic relations via lexical hierarchies
• From sentences to meanings via syntax
• From text analysis to inference using conceptual schemas
Main Goals
Support Question-Answering and NLP in general by:
Deepening our understanding of concepts that underlie all languages
Creating empirical approaches to identifying semantic relations from free text
Developing probabilistic inferencing algorithms
Two Main Thrusts
Text-based:
Use empirical corpus-based techniques to extract simple semantic relations
Combine these relations to perform simple inferences
• “statistical semantic grammar”
Concept-based:
Determine language-universal conceptual principles
Determine how inferences are made among these
Relation Recognition
Abbreviation Definition Recognition
Semantic Relation Identification
UCB, Sept-Nov, 2002

Abbreviation Definition Recognition
Developed and evaluated a new algorithm
Better results than existing approaches
Simpler and faster as well
Semantic Relation Identification
Developed syntactic chunker
Analyzed sample relations
Began development of a new computational model
• Incorporates syntax and semantic labels
• Test example: identify “treatment for disease”
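As a loose illustration of the kind of relation being targeted, here is a toy cue-phrase matcher for “treatment for disease” pairs. The regular expression, function name, and example sentence are all invented for this sketch; the model described above combines syntactic chunks with semantic labels rather than surface patterns like this.

```python
import re

# Toy lexical pattern: "<treatment> is a treatment for <disease>".
# Illustrative only -- far weaker than a syntax-plus-semantics model.
TREAT_DIS = re.compile(
    r"(?P<treatment>\w[\w-]*(?: \w[\w-]*)?)\s+is\s+a\s+treatment\s+for\s+"
    r"(?P<disease>\w[\w-]*(?: \w[\w-]*)?)",
    re.IGNORECASE,
)

def extract(sentence):
    """Return a (treatment, disease) pair if the cue phrase matches."""
    m = TREAT_DIS.search(sentence)
    return (m.group("treatment"), m.group("disease")) if m else None
```

A surface pattern like this misses most real instances (and overmatches others), which is precisely why the slide proposes a computational model over chunks and semantic labels.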
Abbreviation Examples
“Heat-shock protein 40 (Hsp40) enables Hsp70 to play critical roles in a number of cellular processes, such as protein folding, assembly, degradation and translocation in vivo.”
“Glutathione S-transferase pull-down experiments showed the direct interaction of in vitro translated p110, p64, and p58 of the essential CBF3 kinetochore protein complex with Cbf1p, a basic region helix-loop-helix zipper protein (bHLHzip) that specifically binds to the CDEI region on the centromere DNA.”
“Hpa2 is a member of the Gcn5-related N-acetyltransferase (GNAT) superfamily, a family of enzymes with diverse substrates including histones, other proteins, arylalkylamines and aminoglycosides.”
Related Work
Pustejovsky et al. present a solution based on hand-built regular expressions and syntactic information. Achieved 72% recall at 98% precision.
Chang et al. use linear regression on a pre-selected set of features. Achieved 83% recall at 80%* precision, and 75% recall at 95% precision.
Park and Byrd present a rule-based algorithm for extraction of abbreviation definitions in general text.
Yoshida et al. present an approach close to ours, trying to first match characters on word and syllable boundaries.
* Counting partial matches and abbreviations missing from the “gold standard”, their algorithm achieved 83% recall at 98% precision.
The Algorithm
Much simpler than other approaches.
Extracts abbreviation-definition candidates adjacent to parentheses.
Finds correct definitions by matching characters in the abbreviation to characters in the definition, starting from the right.
The first character in the abbreviation must match a character at the beginning of a word in the definition.
To increase precision, a few simple heuristics are applied to eliminate incorrect pairs.
Example: Heat shock transcription factor (HSF). The algorithm finds the correct definition, but not the correct alignment: Heat shock transcription factor
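The right-to-left matching step described above can be sketched as follows (in the spirit of the Schwartz & Hearst abbreviation algorithm). This is an illustrative reconstruction, not the evaluated implementation: candidate extraction around parentheses and the precision heuristics are omitted.

```python
def find_best_long_form(short_form, long_form):
    """Match abbreviation characters against the candidate definition,
    scanning both right to left.  The first character of the abbreviation
    must align with the beginning of a word in the definition."""
    s = len(short_form) - 1          # index into the abbreviation
    l = len(long_form) - 1           # index into the candidate definition
    while s >= 0:
        c = short_form[s].lower()
        if not c.isalnum():          # skip punctuation in the abbreviation
            s -= 1
            continue
        # Scan left for a matching character; the match for the first
        # abbreviation character must sit at the start of a word.
        while l >= 0 and (long_form[l].lower() != c or
                          (s == 0 and l > 0 and long_form[l - 1].isalnum())):
            l -= 1
        if l < 0:
            return None              # no consistent alignment exists
        s -= 1
        l -= 1
    # Trim the definition to whole words.
    start = long_form.rfind(" ", 0, l + 1) + 1
    return long_form[start:]
```

On the slide's example, `find_best_long_form("HSF", "Heat shock transcription factor")` recovers the full definition even though the character alignment it finds is not the intended one.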
Results
On the “gold standard” the algorithm achieved 83% recall at 96% precision.*
On a larger test collection the results were 82% recall at 95% precision.
These results show that a very simple algorithm produces results comparable to those of the existing, more complex algorithms.
* Counting partial matches and abbreviations missing from the “gold standard”, our algorithm achieved 83% recall at 99% precision.
From sentences to meanings via syntax
Factored A* Parsing
Relational approaches to semantic relations
Learning aspectual distinctions
Factored A* Parsing
Goal: develop a lexicalized parser that is fast, accurate, and exact [finds the model’s best parse]
Technology exists to get any two, but not all three
Approximate Parsing – Fast but Inexact
• Beam or “Best-First” Parsing [Charniak, Collins, etc.]
Factored: represent tree and dependencies separately
• Simple, modular, extensible design
• Permits fast, high-accuracy, exact inference
• A* estimates combined from product-of-experts model
Available from: http://nlp.stanford.edu/ [Java, src]
Factored A* Parsing

Dependency Accuracy
Semantic Model    Syntactic: Basic    Syntactic: Best
None              80.8                82.4
Basic             83.8                84.7
Best              85.4                86.0

Labeled Bracketing Accuracy (F1)
Semantic Model    Syntactic: None    Syntactic: Basic    Syntactic: Best
Basic             78.9               88.6                89.0
Best              84.0               89.9                90.2
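Why factored estimates still permit exact search can be seen in a toy shortest-path setting: if every edge carries two additive costs (stand-ins for the tree and dependency factors), the sum of exact cost-to-goal values computed in each factor separately is an admissible heuristic for the combined cost, so A* returns the true combined optimum. The graph and costs below are invented for illustration; this is not the Stanford parser's code.

```python
import heapq

# node -> [(successor, syntactic_cost, semantic_cost)]; invented example.
graph = {
    "s": [("a", 1, 2), ("b", 4, 0)],
    "a": [("g", 5, 1), ("b", 1, 1)],
    "b": [("g", 2, 2)],
    "g": [],
}

def to_goal(cost_index, goal="g"):
    """Exact single-factor cost-to-goal via backward Dijkstra."""
    rev = {n: [] for n in graph}
    for u, edges in graph.items():
        for v, syn, sem in edges:
            rev[v].append((u, (syn, sem)[cost_index]))
    dist = {n: float("inf") for n in graph}
    dist[goal] = 0
    frontier = [(0, goal)]
    while frontier:
        d, u = heapq.heappop(frontier)
        if d > dist[u]:
            continue
        for v, w in rev[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(frontier, (dist[v], v))
    return dist

def astar(start="s", goal="g"):
    h_syn, h_sem = to_goal(0), to_goal(1)
    h = {n: h_syn[n] + h_sem[n] for n in graph}  # factored A* estimate
    frontier = [(h[start], 0, start)]
    best = {start: 0}
    while frontier:
        _, g_cost, u = heapq.heappop(frontier)
        if u == goal:
            return g_cost                        # exact combined optimum
        for v, syn, sem in graph[u]:
            ng = g_cost + syn + sem
            if ng < best.get(v, float("inf")):
                best[v] = ng
                heapq.heappush(frontier, (ng + h[v], ng, v))
    return None
```

Each per-factor bound never exceeds that factor's share of any remaining path, so their sum never overestimates the combined remaining cost: exactness with far less search than exhaustive dynamic programming.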
Work Done
Learning Semantic Relations
FrameNet as starting point and training data
Constraint Resolution for Entire Relations
• Logical Relations
• Probabilistic Models
• Combinations of the Two
Bootstrap to new domains
Building blocks for Q/A-relevant tasks:
• Semantic Roles in Text
• Inference
• Improved Syntactic Parsing
Learning Aspect: The Perfect
English perfect has experiential, relevant, and durative readings: have been to Bali vs. have just eaten lunch
Disambiguation is necessary for text understanding: John has traveled to Malta [now, or in the past?]
Siegel (2000) looked at inherent but not contextual aspect
Current status: annotation underway for training a statistical classifier
[Timeline diagrams: reference time vs. event span]
John has lived in Miami for ten years now. (durative)
John has lived in Miami before. (experiential)
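A minimal cue-based sketch of the disambiguation task on the two example sentences. The cue lists below are invented for illustration; the project's stated plan is a statistical classifier trained on annotated data, not hand-written cues.

```python
# Hypothetical contextual cues for readings of the English perfect.
DURATIVE_CUES = ("for ", "since ", "now")
EXPERIENTIAL_CUES = ("before", "ever", "never", "once")

def perfect_reading(sentence):
    """Guess a reading of the perfect from surface cues (toy heuristic)."""
    s = sentence.lower()
    if any(cue in s for cue in DURATIVE_CUES):
        return "durative"
    if any(cue in s for cue in EXPERIENTIAL_CUES):
        return "experiential"
    return "ambiguous"
```

Sentences with no contextual cue, like "John has traveled to Malta", stay ambiguous here, which is exactly the case that motivates a trained classifier over richer features.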
Concept-based Analysis
From text analysis to inference using conceptual schemas
Relational Probabilistic Models
Open Domain Conceptual Relations
Inference and Conceptual Schemas
Hypothesis: Linguistic input is converted into a mental simulation based on bodily-grounded structures.
Components:
Semantic schemas
• image schemas and executing schemas are abstractions over neurally grounded perceptual and motor representations
Linguistic units
• lexical and phrasal construction representations invoke schemas, in part through metaphor
Inference links these structures and provides parameters for a simulation engine
Conceptual Schemas
Much is known about conceptual schemas, particularly image schemas
However, this understanding has not yet been formalized
We will develop such a formalism
They have also not been checked extensively against other languages
We will examine Chinese, Russian, and other languages in addition to English
Schema Formalism
SCHEMA <name>
SUBCASE OF <schema>
EVOKES <schema> AS <local name>
ROLES
<self role name>: <role restriction>
<self role name> <-> <role name>
CONSTRAINTS
<role name> <- <value>
<role name> <-> <role name>
<setting name> :: <role name> <-> <role name>
<setting name> :: <predicate> | <predicate>
A Simple Example
SCHEMA hypotenuse
SUBCASE OF line-segment
EVOKES right-triangle AS rt
ROLES (inherited from line-segment)
CONSTRAINTS
SELF <-> rt.long-side
Source-Path-Goal
SCHEMA: spg
ROLES:
source: Place
path: Directed Curve
goal: Place
trajector: Entity
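The two examples above can be rendered as data; here is a hypothetical Python encoding in which the field names mirror the SUBCASE OF / EVOKES / ROLES / CONSTRAINTS slots. Nothing here is from the project's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Schema:
    """Illustrative container mirroring the schema formalism's slots."""
    name: str
    subcase_of: list = field(default_factory=list)
    evokes: dict = field(default_factory=dict)        # local name -> schema
    roles: dict = field(default_factory=dict)         # role -> restriction
    constraints: list = field(default_factory=list)   # (lhs, op, rhs) triples

# The Source-Path-Goal schema from the slide.
spg = Schema(
    name="spg",
    roles={"source": "Place", "path": "Directed Curve",
           "goal": "Place", "trajector": "Entity"},
)

# The hypotenuse example: roles inherited, one binding constraint.
hypotenuse = Schema(
    name="hypotenuse",
    subcase_of=["line-segment"],
    evokes={"rt": "right-triangle"},
    constraints=[("SELF", "<->", "rt.long-side")],
)
```

A concrete encoding like this makes the later step explicit: constraints become edges along which an inference engine can propagate bindings and parameters.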
Extending Inferential Capabilities
Given the formalization of the conceptual schemas, how do we use them for inferencing?
Earlier pilot systems:
• Used metaphor and Bayesian belief networks
• Successfully construed certain inferences
• But don’t scale
New approach:
• Probabilistic relational models
• Support an open ontology
A Common Representation
Representation should support:
• Uncertainty, probability
• Conflicts, contradictions
Current plan:
• Probabilistic Relational Models (Koller et al.)
• DAML+OIL
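The core PRM idea (Koller et al.) is that class-level probabilistic dependencies plus a relational skeleton ground out into an ordinary Bayesian network over concrete objects. The toy domain, names, and numbers below are invented to illustrate just that grounding step.

```python
# Relational skeleton (invented): each paper is written by one author,
# and a paper's topic depends on its author's expertise.
papers = {"p1": "a1", "p2": "a2"}                 # paper -> author

P_expertise = {                                   # P(Author.expertise)
    "a1": {"nlp": 0.6, "vision": 0.4},
    "a2": {"nlp": 0.2, "vision": 0.8},
}
P_topic = {                                       # P(Paper.topic | expertise)
    "nlp":    {"qa": 0.8, "segmentation": 0.2},
    "vision": {"qa": 0.3, "segmentation": 0.7},
}

def p_topic(paper, topic):
    """Marginal of Paper.topic in the ground network (author unobserved):
    sum over the related author's expertise values."""
    author = papers[paper]                        # the relational slot
    return sum(P_expertise[author][e] * P_topic[e][topic]
               for e in P_expertise[author])
```

One class-level dependency serves every paper in the skeleton; adding objects grows the ground network without adding new model parameters, which is what makes the representation attractive for an open ontology.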
Status of PRM for AQUAINT
Fall 2002
• Designed the basic PRM code-base/infrastructure
• Packages for BNs, OOBNs
• Designed PRM inference algorithm
Spring-Summer 2003
• Implement the PRM inference algorithm
• Design Dynamic Probabilistic Relational Models (DPRM)
• Implement DPRM to replace the Pilot System DBN
• Test DPRM for QA
Related Work
• Probabilistic OWL (PrOWL)
• Probabilistic FrameNet
An Open Ontology for Conceptual Relations
Build a formal markup language for conceptual schemas
We propose to use DAML+OIL/OWL as the base.
Advantages of the approach:
• Common framework for extension and reuse
• Closer ties to other efforts within AQUAINT as well as the larger research community on the Semantic Web
Some Issues:
• Expressiveness of DAML+OIL
• Representing probabilistic information
• Extension to MetaNet, to capture abstract concepts
Current Status
Summer/Fall 2002
• FrameNet-1 is available in DAML+OIL (http://www.icsi.berkeley.edu/~framenet)
• Image schemas have been formalized and a DAML+OIL representation designed
• An initial set of metaphors and an SQL metaphor database are in place
Spring 2003
• Populate Metaphor Database
• Populate Image Schema Database
Summer 2003
• Test inferencing with image schemas for QA
Putting it all Together
We have proposed two different types of semantics:
• Universal conceptual schemas
• Semantic relations
In Phase I they will remain separate
However, we are exploring using PRMs as a common representational format
In later phases they will be combined