QuASI: Question Answering using Statistics, Semantics, and Inference
Transcript of QuASI: Question Answering using Statistics, Semantics, and Inference
QuASI: Question Answering using
Statistics, Semantics, and Inference
Marti Hearst, Jerry Feldman, Chris Manning, Srini Narayanan
Univ. of California-Berkeley / ICSI / Stanford University
Outline
Project Overview
Three topics:
• Assigning semantic relations via lexical hierarchies
• From sentences to meanings via syntax
• From text analysis to inference using conceptual schemas
Main Goals
Support Question-Answering and NLP in general by:
Deepening our understanding of concepts that underlie all languages
Creating empirical approaches to identifying semantic relations from free text
Developing probabilistic inferencing algorithms
Two Main Thrusts
Text-based:
Use empirical corpus-based techniques to extract simple semantic relations
Combine these relations to perform simple inferences
• “statistical semantic grammar”
Concept-based:
Determine language-universal conceptual principles
Determine how inferences are made among these
Relation Recognition
Abbreviation Definition Recognition
Semantic Relation Identification
UCB, Sept-Nov, 2002

Abbreviation Definition Recognition
Developed and evaluated a new algorithm
Better results than existing approaches
Simpler and faster as well
Semantic Relation Identification
Developed syntactic chunker
Analyzed sample relations
Began development of a new computational model
• Incorporates syntax and semantic labels
• Test example: identify “treatment for disease”
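As a loose illustration of the kind of relation being targeted, here is a toy cue-phrase matcher for “treatment for disease” pairs. The regular expression, function name, and example sentence are all invented for this sketch; the model described above combines syntactic chunks with semantic labels rather than surface patterns like this.

```python
import re

# Toy lexical pattern: "<treatment> is a treatment for <disease>".
# Illustrative only -- far weaker than a syntax-plus-semantics model.
TREAT_DIS = re.compile(
    r"(?P<treatment>\w[\w-]*(?: \w[\w-]*)?)\s+is\s+a\s+treatment\s+for\s+"
    r"(?P<disease>\w[\w-]*(?: \w[\w-]*)?)",
    re.IGNORECASE,
)

def extract(sentence):
    """Return a (treatment, disease) pair if the cue phrase matches."""
    m = TREAT_DIS.search(sentence)
    return (m.group("treatment"), m.group("disease")) if m else None
```

A surface pattern like this misses most real instances (and overmatches others), which is precisely why the slide proposes a computational model over chunks and semantic labels.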
Abbreviation Examples
“Heat-shock protein 40 (Hsp40) enables Hsp70 to play critical roles in a number of cellular processes, such as protein folding, assembly, degradation and translocation in vivo.”
“Glutathione S-transferase pull-down experiments showed the direct interaction of in vitro translated p110, p64, and p58 of the essential CBF3 kinetochore protein complex with Cbf1p, a basic region helix-loop-helix zipper protein (bHLHzip) that specifically binds to the CDEI region on the centromere DNA.”
“Hpa2 is a member of the Gcn5-related N-acetyltransferase (GNAT) superfamily, a family of enzymes with diverse substrates including histones, other proteins, arylalkylamines and aminoglycosides.”
Related Work
Pustejovsky et al. present a solution based on hand-built regular expressions and syntactic information. Achieved 72% recall at 98% precision.
Chang et al. use linear regression on a pre-selected set of features. Achieved 83% recall at 80%* precision, and 75% recall at 95% precision.
Park and Byrd present a rule-based algorithm for extraction of abbreviation definitions in general text.
Yoshida et al. present an approach close to ours, trying to first match characters on word and syllable boundaries.
* Counting partial matches and abbreviations missing from the “gold standard”, their algorithm achieved 83% recall at 98% precision.
The Algorithm
Much simpler than other approaches.
Extracts abbreviation-definition candidates adjacent to parentheses.
Finds correct definitions by matching characters in the abbreviation to characters in the definition, starting from the right.
The first character in the abbreviation must match a character at the beginning of a word in the definition.
To increase precision, a few simple heuristics are applied to eliminate incorrect pairs.
Example: Heat shock transcription factor (HSF). The algorithm finds the correct definition, but not the correct alignment: Heat shock transcription factor
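The right-to-left matching step described above can be sketched as follows (in the spirit of the Schwartz & Hearst abbreviation algorithm). This is an illustrative reconstruction, not the evaluated implementation: candidate extraction around parentheses and the precision heuristics are omitted.

```python
def find_best_long_form(short_form, long_form):
    """Match abbreviation characters against the candidate definition,
    scanning both right to left.  The first character of the abbreviation
    must align with the beginning of a word in the definition."""
    s = len(short_form) - 1          # index into the abbreviation
    l = len(long_form) - 1           # index into the candidate definition
    while s >= 0:
        c = short_form[s].lower()
        if not c.isalnum():          # skip punctuation in the abbreviation
            s -= 1
            continue
        # Scan left for a matching character; the match for the first
        # abbreviation character must sit at the start of a word.
        while l >= 0 and (long_form[l].lower() != c or
                          (s == 0 and l > 0 and long_form[l - 1].isalnum())):
            l -= 1
        if l < 0:
            return None              # no consistent alignment exists
        s -= 1
        l -= 1
    # Trim the definition to whole words.
    start = long_form.rfind(" ", 0, l + 1) + 1
    return long_form[start:]
```

On the slide's example, `find_best_long_form("HSF", "Heat shock transcription factor")` recovers the full definition even though the character alignment it finds is not the intended one.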
Results
On the “gold standard” the algorithm achieved 83% recall at 96% precision.*
On a larger test collection the results were 82% recall at 95% precision.
These results show that a very simple algorithm produces results comparable to those of the existing, more complex algorithms.
* Counting partial matches and abbreviations missing from the “gold standard”, our algorithm achieved 83% recall at 99% precision.
From sentences to meanings via syntax
Factored A* Parsing
Relational approaches to semantic relations
Learning aspectual distinctions
Factored A* Parsing
Goal: develop a lexicalized parser that is fast, accurate, and exact [finds the model’s best parse]
Technology exists to get any two, but not all three
Approximate Parsing – Fast but Inexact
• Beam or “Best-First” Parsing [Charniak, Collins, etc.]
Factored: represent tree and dependencies separately
• Simple, modular, extensible design
• Permits fast, high-accuracy, exact inference
• A* estimates combined from product-of-experts model
Available from: http://nlp.stanford.edu/ [Java, src]
Factored A* Parsing

Dependency Accuracy
Semantic Model    Syntactic: Basic    Syntactic: Best
None              80.8                82.4
Basic             83.8                84.7
Best              85.4                86.0

Labeled Bracketing Accuracy (F1)
Semantic Model    Syntactic: None    Syntactic: Basic    Syntactic: Best
Basic             78.9               88.6                89.0
Best              84.0               89.9                90.2
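Why factored estimates still permit exact search can be seen in a toy shortest-path setting: if every edge carries two additive costs (stand-ins for the tree and dependency factors), the sum of exact cost-to-goal values computed in each factor separately is an admissible heuristic for the combined cost, so A* returns the true combined optimum. The graph and costs below are invented for illustration; this is not the Stanford parser's code.

```python
import heapq

# node -> [(successor, syntactic_cost, semantic_cost)]; invented example.
graph = {
    "s": [("a", 1, 2), ("b", 4, 0)],
    "a": [("g", 5, 1), ("b", 1, 1)],
    "b": [("g", 2, 2)],
    "g": [],
}

def to_goal(cost_index, goal="g"):
    """Exact single-factor cost-to-goal via backward Dijkstra."""
    rev = {n: [] for n in graph}
    for u, edges in graph.items():
        for v, syn, sem in edges:
            rev[v].append((u, (syn, sem)[cost_index]))
    dist = {n: float("inf") for n in graph}
    dist[goal] = 0
    frontier = [(0, goal)]
    while frontier:
        d, u = heapq.heappop(frontier)
        if d > dist[u]:
            continue
        for v, w in rev[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(frontier, (dist[v], v))
    return dist

def astar(start="s", goal="g"):
    h_syn, h_sem = to_goal(0), to_goal(1)
    h = {n: h_syn[n] + h_sem[n] for n in graph}  # factored A* estimate
    frontier = [(h[start], 0, start)]
    best = {start: 0}
    while frontier:
        _, g_cost, u = heapq.heappop(frontier)
        if u == goal:
            return g_cost                        # exact combined optimum
        for v, syn, sem in graph[u]:
            ng = g_cost + syn + sem
            if ng < best.get(v, float("inf")):
                best[v] = ng
                heapq.heappush(frontier, (ng + h[v], ng, v))
    return None
```

Each per-factor bound never exceeds that factor's share of any remaining path, so their sum never overestimates the combined remaining cost: exactness with far less search than exhaustive dynamic programming.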
Work Done
Learning Semantic Relations
FrameNet as starting point and training data
Constraint Resolution for Entire Relations
• Logical Relations
• Probabilistic Models
• Combinations of the Two
Bootstrap to new domains
Building blocks for Q/A-relevant tasks:
• Semantic Roles in Text
• Inference
• Improved Syntactic Parsing
Learning Aspect: The Perfect
English perfect has experiential, relevant, and durative readings: have been to Bali vs. have just eaten lunch
Disambiguation is necessary for text understanding: John has traveled to Malta [now, or in the past?]
Siegel (2000) looked at inherent but not contextual aspect
Current status: annotation underway for training a statistical classifier
[Timeline diagrams: reference time vs. event span]
John has lived in Miami for ten years now. (durative)
John has lived in Miami before. (experiential)
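A minimal cue-based sketch of the disambiguation task on the two example sentences. The cue lists below are invented for illustration; the project's stated plan is a statistical classifier trained on annotated data, not hand-written cues.

```python
# Hypothetical contextual cues for readings of the English perfect.
DURATIVE_CUES = ("for ", "since ", "now")
EXPERIENTIAL_CUES = ("before", "ever", "never", "once")

def perfect_reading(sentence):
    """Guess a reading of the perfect from surface cues (toy heuristic)."""
    s = sentence.lower()
    if any(cue in s for cue in DURATIVE_CUES):
        return "durative"
    if any(cue in s for cue in EXPERIENTIAL_CUES):
        return "experiential"
    return "ambiguous"
```

Sentences with no contextual cue, like "John has traveled to Malta", stay ambiguous here, which is exactly the case that motivates a trained classifier over richer features.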
Concept-based Analysis
From text analysis to inference using conceptual schemas
Relational Probabilistic Models
Open Domain Conceptual Relations
Inference and Conceptual Schemas
Hypothesis: Linguistic input is converted into a mental simulation based on bodily-grounded structures.
Components:
Semantic schemas
• image schemas and executing schemas are abstractions over neurally grounded perceptual and motor representations
Linguistic units
• lexical and phrasal construction representations invoke schemas, in part through metaphor
Inference links these structures and provides parameters for a simulation engine
Conceptual Schemas
Much is known about conceptual schemas, particularly image schemas
However, this understanding has not yet been formalized
We will develop such a formalism
They have also not been checked extensively against other languages
We will examine Chinese, Russian, and other languages in addition to English
Schema Formalism
SCHEMA <name>
SUBCASE OF <schema>
EVOKES <schema> AS <local name>
ROLES
<self role name>: <role restriction>
<self role name> <-> <role name>
CONSTRAINTS
<role name> <- <value>
<role name> <-> <role name>
<setting name> :: <role name> <-> <role name>
<setting name> :: <predicate> | <predicate>
A Simple Example
SCHEMA hypotenuse
SUBCASE OF line-segment
EVOKES right-triangle AS rt
ROLES (inherited from line-segment)
CONSTRAINTS
SELF <-> rt.long-side
Source-Path-Goal
SCHEMA: spg
ROLES:
source: Place
path: Directed Curve
goal: Place
trajector: Entity
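The two examples above can be rendered as data; here is a hypothetical Python encoding in which the field names mirror the SUBCASE OF / EVOKES / ROLES / CONSTRAINTS slots. Nothing here is from the project's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Schema:
    """Illustrative container mirroring the schema formalism's slots."""
    name: str
    subcase_of: list = field(default_factory=list)
    evokes: dict = field(default_factory=dict)        # local name -> schema
    roles: dict = field(default_factory=dict)         # role -> restriction
    constraints: list = field(default_factory=list)   # (lhs, op, rhs) triples

# The Source-Path-Goal schema from the slide.
spg = Schema(
    name="spg",
    roles={"source": "Place", "path": "Directed Curve",
           "goal": "Place", "trajector": "Entity"},
)

# The hypotenuse example: roles inherited, one binding constraint.
hypotenuse = Schema(
    name="hypotenuse",
    subcase_of=["line-segment"],
    evokes={"rt": "right-triangle"},
    constraints=[("SELF", "<->", "rt.long-side")],
)
```

A concrete encoding like this makes the later step explicit: constraints become edges along which an inference engine can propagate bindings and parameters.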
Extending Inferential Capabilities
Given the formalization of the conceptual schemas, how do we use them for inferencing?
Earlier pilot systems:
• Used metaphor and Bayesian belief networks
• Successfully construed certain inferences
• But don’t scale
New approach:
• Probabilistic relational models
• Support an open ontology
A Common Representation
Representation should support:
• Uncertainty, probability
• Conflicts, contradictions
Current plan:
• Probabilistic Relational Models (Koller et al.)
• DAML+OIL
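The core PRM idea (Koller et al.) is that class-level probabilistic dependencies plus a relational skeleton ground out into an ordinary Bayesian network over concrete objects. The toy domain, names, and numbers below are invented to illustrate just that grounding step.

```python
# Relational skeleton (invented): each paper is written by one author,
# and a paper's topic depends on its author's expertise.
papers = {"p1": "a1", "p2": "a2"}                 # paper -> author

P_expertise = {                                   # P(Author.expertise)
    "a1": {"nlp": 0.6, "vision": 0.4},
    "a2": {"nlp": 0.2, "vision": 0.8},
}
P_topic = {                                       # P(Paper.topic | expertise)
    "nlp":    {"qa": 0.8, "segmentation": 0.2},
    "vision": {"qa": 0.3, "segmentation": 0.7},
}

def p_topic(paper, topic):
    """Marginal of Paper.topic in the ground network (author unobserved):
    sum over the related author's expertise values."""
    author = papers[paper]                        # the relational slot
    return sum(P_expertise[author][e] * P_topic[e][topic]
               for e in P_expertise[author])
```

One class-level dependency serves every paper in the skeleton; adding objects grows the ground network without adding new model parameters, which is what makes the representation attractive for an open ontology.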
Status of PRM for AQUAINT
Fall 2002
• Designed the basic PRM code-base/infrastructure
• Packages for BNs, OOBNs
• Designed PRM inference algorithm
Spring-Summer 2003
• Implement the PRM inference algorithm
• Design Dynamic Probabilistic Relational Models (DPRM)
• Implement DPRM to replace the Pilot System DBN
• Test DPRM for QA
Related Work
• Probabilistic OWL (PrOWL)
• Probabilistic FrameNet
An Open Ontology for Conceptual Relations
Build a formal markup language for conceptual schemas
We propose to use DAML+OIL/OWL as the base.
Advantages of the approach:
• Common framework for extension and reuse
• Closer ties to other efforts within AQUAINT as well as the larger research community on the Semantic Web
Some Issues:
• Expressiveness of DAML+OIL
• Representing probabilistic information
• Extension to MetaNet, to capture abstract concepts
Current Status
Summer/Fall 2002
• FrameNet-1 is available in DAML+OIL (http://www.icsi.berkeley.edu/~framenet)
• Image schemas have been formalized and a DAML+OIL representation designed
• An initial set of metaphors and an SQL metaphor database are in place
Spring 2003
• Populate Metaphor Database
• Populate Image Schema Database
Summer 2003
• Test inferencing with image schemas for QA
Putting it all Together
We have proposed two different types of semantics:
• Universal conceptual schemas
• Semantic relations
In Phase I they will remain separate
However, we are exploring using PRMs as a common representational format
In later phases they will be combined