1 31 March 2004 Global Autonomous Language Exploitation.


Transcript of 1 31 March 2004 Global Autonomous Language Exploitation.

Page 1: 1 31 March 2004 Global Autonomous Language Exploitation.

1

GALE

31 March 2004

Global Autonomous Language Exploitation

Page 2: 1 31 March 2004 Global Autonomous Language Exploitation.

2

High Level Goal

Smart, speedy, supremely capable

Superhuman Assistant

Effectively exploit massive data

Accurately infer analyst need, make maximally productive

Learn and improve continually

Make analysts 10 times more productive

Page 3: 1 31 March 2004 Global Autonomous Language Exploitation.

3

Challenge / Opportunity

Actual Analysts (working individually)

Fantasized Analysts (working as one)

Massive Sea of Data (speech & text, multiple languages)

Page 4: 1 31 March 2004 Global Autonomous Language Exploitation.

4

Vision

Analyst

Massive Sea of Data (speech & text, multiple languages)

English Speaker +

Superhuman Assistant (big mind to see big picture)

Page 5: 1 31 March 2004 Global Autonomous Language Exploitation.

5

POWERFUL Technology to Exploit Human Language

FY01 FY02 FY03 FY04 FY05 FY06 FY07 FY08

TIDES

EARS

GALE

Page 6: 1 31 March 2004 Global Autonomous Language Exploitation.

6

POWERFUL Technology to Exploit Human Language

Vision

Inputs: Multiple Media, Multiple Sources, Multiple Languages (Streaming, Changing, Huge Volumes)

Text + Speech → EARS (Speech-to-Text) → Text → TIDES → Requested Information (in English)

Timeline FY01 to FY08: TIDES, EARS, GALE

Page 7: 1 31 March 2004 Global Autonomous Language Exploitation.

7

POWERFUL Technology to Exploit Human Language

Vision

Inputs: Multiple Media, Multiple Sources, Multiple Languages (Streaming, Changing, Huge Volumes)

Text + Speech → EARS (Speech-to-Text) → Text → TIDES, GALE → Critical Information (Fused, in English) + Alerts

• Fewer analysts

• Faster awareness

Timeline FY01 to FY08: TIDES, EARS, GALE

Page 8: 1 31 March 2004 Global Autonomous Language Exploitation.

8

What Sets GALE Apart

Automatic, autonomous analysis of massive amounts of human language from around world (multiple media, sources, languages)

Rapid discovery, fusion, and alerting – delivering information critical to national defense

Efficient, intelligent interaction with analysts

– Discern interests of individual analysts

– Point out new information & new events

– Organize information for ready access

– Constantly improve, learn how job is done (get smarter with time)

Page 9: 1 31 March 2004 Global Autonomous Language Exploitation.

9

How TIDES Works

Powerful Tools for English-Speaking Analysts

[Diagram: Multiple media, multiple sources, multiple languages (streaming, changing, huge volumes) supply text + speech to TIDES. TIDES delivers requested information (in English) to English-speaking analysts, who receive tasking from and send reports to commanders & policy makers.]

Page 10: 1 31 March 2004 Global Autonomous Language Exploitation.

10

How GALE Will Work

Precise Information Proactively

[Diagram: Multiple media, multiple sources, multiple languages (streaming, changing, huge volumes) supply text + speech to GALE. GALE delivers needed information (fused, in English) and alerts ("hot stuff") to an English-speaking analyst, who receives tasking from and sends reports to commanders & policy makers.]

System learns via painless automatic relevance feedback loops

– Info examined, retained

– Explicit requests

– Tasking received

– Reports written

Page 11: 1 31 March 2004 Global Autonomous Language Exploitation.

11

What Will Happen Inside

Data, Information, Knowledge, Wisdom

Stages: Conversion, Extraction, Discovery, Analysis

Components: Transcription, Translation, Information Extraction, Novelty Detection, Information Fusion, Needs Discernment

Inputs: Text, Speech, Video (English and non-English)

Intermediate products:

– Linked video, speech, text, English text

– Facts (names, numbers, dates, entities, relations, events)

– New Events, New Information, Contradictions

– Fused Facts, Associations, Inferences, Summaries

Analyst side (English-speaking analyst): Tasking, Request, Report, Info Used/Retained, Alerts, Fused Info

Page 12: 1 31 March 2004 Global Autonomous Language Exploitation.

12

What System Does

• Acts automatically, autonomously (takes initiative)

• Reads & interprets human language (source data, tasking, reports)

• Observes user actions (data viewed, used, retained, discarded)

• Adds to its global knowledge base + model of user needs

• Organizes, prioritizes, notifies (pushes precise information)

What Analyst Sees

• Fused information

• Prioritized information

– Summaries, headlines, alerts

– Tabular, graphic, telegraphic

• Source data + transcripts + translations in English (linked)

Page 13: 1 31 March 2004 Global Autonomous Language Exploitation.

13

Research Communities

• Information retrieval

• Computational linguistics

• Machine learning

• Data storage & retrieval

• Knowledge representation & reasoning

• Social network theory

• Human-computer interaction

= tiger team

TIDES

Page 14: 1 31 March 2004 Global Autonomous Language Exploitation.

14

Key Technologies

• Translation (readable, actionable English)

• Extraction (facts about entities, relations, events)

• Novelty detection (new events, new information)

• Information fusion (concise, integrative summaries)

• Needs discernment (analyst / commander needs)

• Language understanding (meaning representation)

= real power

Page 15: 1 31 March 2004 Global Autonomous Language Exploitation.

15

Translation

[Diagram: Chinese, Arabic → English]

• Essential for foreign languages (most of world, DoD mission)

• Encouraging results on newswire (big leap last year)

• Not yet good enough

– Need higher accuracy, fluency, confidence measures

– Must handle varied data sources

• Text that is more difficult than news

• Speech that is imperfectly transcribed

• Promising ideas

– Trainable rules to translate names & numbers

– Morphology & syntax for general translation

– Error-driven training & development (big boost last year)

– Automatic data acquisition

– Semantics-enabled translation

Page 16: 1 31 March 2004 Global Autonomous Language Exploitation.

16

State of the Art in Translation

Newswire

Cairo 4-6 (AFP) - said an official at the Egyptian Aviation Company today that the company egyptair may resume as of tomorrow, Wednesday its flights to Libya after the International Security Council resolution to the suspension of the embargo imposed on Libya.

Editorials

Want only chance to meet with President Clinton his address to Congress on the way the nation "lift across the pace of indictment" of Hezbollah "to peak included him in the" axis of evil "terrorist, in the House of Representatives will continue to Lebanese autopsy if Lebanese nation submerged under growing crises.

Broadcasts

Said Al-Aqsa Martyrs Brigade of the Fatah movement martyr Abu Ali Mustafa brigades of the Popular Front for the liberation of Palestine (PFLP) joint responsibility in the implementation of the process of Kfar Saba guerrilla

Page 17: 1 31 March 2004 Global Autonomous Language Exploitation.

17

Information Extraction

Events

Relations

Entities

What, Who, When, Where

Meeting

• Essential to pull out key facts (structure from unstructured data)

• Not nearly good enough

– Low coverage (5 entity types, 24 relations, 0 events)

– Low accuracy (72% for entities, 25% for relations)

– Even lower accuracy on audio & foreign languages (different styles, imperfect transcription + translation)

• Promising ideas

– Lightly supervised/unsupervised training algorithms

– Synthesis of statistical learning, semantically annotated data, and linguistically motivated taxonomy of logical terms

– Automatically customized language model from user’s document collection, corporate collection, and media sources

– New mathematical learning algorithms (see next slide)

– Semantics-enabled extraction

Page 18: 1 31 March 2004 Global Autonomous Language Exploitation.

18

New Training Method for Extraction

• Combine

– New discriminative learning algorithms

• Breakthrough in accuracy of interpretation

– Small supervised training with massive amounts of unsupervised training

• Breakthrough in customizability & language portability

[Chart: Name Extraction. Score (axis 50 to 100) versus Training Size (axis 1,000 to 1,000,000), comparing Supervised Approach and New Approach]

Page 19: 1 31 March 2004 Global Autonomous Language Exploitation.

19

Novelty Detection

[Timeline diagram: stories from CNN, NPR, Al Jazeera, Xin Hua, AP, and Reuters plotted over time, marking the first story about a new event and subsequent stories (some new info, lots of old info)]

• Must keep on top of changing world

• Need to know when something is new

– New events, new info

– User cannot specify the search

• Need to thread related items

• Promising ideas

– Combine shallow semantics (from extraction) and term similarity for document comparisons, novelty determination

– Machine learning to pinpoint relevance and recognize novelty

– Rapid, focused feedback from searcher to direct search and define what is novel, important

– Semantics-enabled detection
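The term-similarity part of novelty determination can be sketched as follows. This is a minimal illustration, not the GALE method: it assumes plain bag-of-words cosine similarity, and the function names and the 0.3 threshold are hypothetical choices made for the example.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lowercased bag-of-words counts (a stand-in for real term weighting)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def is_first_story(story, earlier_stories, threshold=0.3):
    """Flag a story as new when no earlier story is similar enough.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    vec = term_vector(story)
    return all(cosine(vec, term_vector(old)) < threshold for old in earlier_stories)
```

A story that shares most of its terms with an earlier one is treated as a follow-up; one with little overlap is treated as the first story about a new event. Shallow semantics from extraction would replace or supplement the raw term counts.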

Page 20: 1 31 March 2004 Global Autonomous Language Exploitation.

20

Information Fusion

• Information is

– Fragmented, fragmentary

– Repeated (across docs, languages)

– Contradicted (across sources, time)

– Perceived differently (by different people)

• Need concise synopses of reliable information, highlighting contradictions and different views

• Promising ideas

– Normalize names, terminology; update master knowledge base

– Exploit redundant information from multiple sources (including imperfect transcriptions, translations) as confirmation

– Utilize shallow semantic features (e.g., who, what, when, where, how) to identify contradictions, changes over time, different perspectives

– Combine semantic interpretation & opinion identification

– Semantics-enabled fusion

Page 21: 1 31 March 2004 Global Autonomous Language Exploitation.

21

Needs Discernment

• To meet commander’s needs, must save analyst time & ensure analyst sees important things quickly

• New thrust: not done now

• Promising ideas

– Instrument user interface, find out what analyst does (what used/saved/discarded)

– See what analyst tasked to do, produces

– Build up model of analyst interests/needs

– Interpret analyst requests in terms of model

– Extend functionality of “more like these” (from word-level to logical content level)

– Learn (continually adapt)

• Maximize good reports

• Minimize unwanted data

• Minimize need for explicit requests

[Diagram: Intent Recognition and an Analyst Needs Model connect the English-speaking analyst to the system. Labels: Request, Tasking, Report, Info Used/Retained, Alerts, Fused Info]
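One way to picture a model of analyst interests built from instrumented actions is the toy sketch below. It is an assumption-laden illustration: the class name, the per-term scoring scheme, and the action weights are all hypothetical and are not taken from the program description.

```python
from collections import defaultdict

# Hypothetical weights for observed analyst actions; the values are illustrative.
ACTION_WEIGHTS = {"saved": 2.0, "used": 1.0, "discarded": -1.0}


class AnalystNeedsModel:
    """Toy interest model: one running score per topic term."""

    def __init__(self):
        self.interest = defaultdict(float)

    def observe(self, terms, action):
        """Update interest from one observed action on an item with these terms."""
        weight = ACTION_WEIGHTS[action]
        for term in terms:
            self.interest[term] += weight

    def score(self, terms):
        """Sum of learned interest over an item's terms."""
        return sum(self.interest.get(term, 0.0) for term in terms)

    def rank(self, items):
        """Order candidate items (name -> terms) from most to least interesting."""
        return sorted(items, key=lambda name: self.score(items[name]), reverse=True)
```

Because the model updates from what the analyst saves, uses, or discards, no explicit request is needed to shift the ranking, which is the sense of "minimize need for explicit requests" above.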

Page 22: 1 31 March 2004 Global Autonomous Language Exploitation.

22

Language Understanding

• Terrible bottleneck (for many applications)

• Key technologies & many applications would work much better if machines could actually understand meaning of language

• Promising ideas

– Logical representation of text meaning (semantics)

– Linguistically motivated taxonomy of terms (ontology)

– Lightly supervised / unsupervised training

– User document collection, corporate collection, media as a language model

Highest risk, highest payoff

Page 23: 1 31 March 2004 Global Autonomous Language Exploitation.

23

Prior Work on Language Understanding

Applications/evaluations

• Typed queries

• Spoken queries (ATIS)

• Semantics (SemEval)

• Message understanding (MUC)

• Information extraction (ACE)

• Question answering (AQUAINT)

Must attack challenges seriously, systematically -- NOW

Resources

• Many corpora

• WordNet

• TreeBank

• PropBank

• OntoBank

Page 24: 1 31 March 2004 Global Autonomous Language Exploitation.

24

Levels of Representation

Words

Syntax

Explicit Semantics

Full Semantics

Morphology

TIDES

GALE

Page 25: 1 31 March 2004 Global Autonomous Language Exploitation.

25

Plan of Attack

• Annotate diverse corpus in terms of meaning (OntoBank)

• Sponsor research to develop algorithms to —

– Mimic human annotations

– Identify predicates (relations & events) and their arguments (entities with their names & descriptions)

– Distinguish word senses

• Establish public competitions to evaluate the above

• Apply successes to improve performance of translation, extraction, detection, & fusion algorithms

Page 26: 1 31 March 2004 Global Autonomous Language Exploitation.

26

What is OntoBank?

• Large (1 million words) collection of English texts from numerous domains and genres

• Important aspects of semantics manually added into each sentence:

– Semantic term(s) representing each major element of meaning (persons, events, objects, locations…)

– Semantic relationships connecting them together (Agent, Theme, Instrument…)

– Additional semantic aspects for major phenomena of meaning (pronominal reference, etc.)

• For each sentence, its meaning representation frame is a set of connected propositions

Page 27: 1 31 March 2004 Global Autonomous Language Exploitation.

27

Meaning Representation Example

The founder of Pakistan’s nuclear department, Abdul Qadeer Khan, has admitted he transferred nuclear technology to Iran, Libya, and North Korea.

[A1 :act Acknowledge
    :agent [E1 :object Founder
               :name “Abdul Qadeer Khan”
               :description [A3 :act Establish
                                :agent E1
                                :organization [E2 :object Agency
                                                  :description “Pakistan’s nuclear department”]]]
    :proposition [A2 :act Transfer
                     :agent E1
                     :theme [E4 :object Know-How
                                :description “nuclear technology”]
                     :destination (and [E5 :object Nation :name “Iran”]
                                       [E6 :object Nation :name “Libya”]
                                       [E7 :object Nation :name “North Korea”])]]
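A frame like this maps naturally onto nested data structures. The sketch below is only an illustration: the dict layout, the key names, and the `collect` helper are assumptions made for the example, and the destinations follow the source sentence (Iran, Libya, North Korea).

```python
# The example frame as nested dicts; a repeated entity is referenced by its ID string.
frame = {
    "id": "A1", "act": "Acknowledge",
    "agent": {
        "id": "E1", "object": "Founder", "name": "Abdul Qadeer Khan",
        "description": {
            "id": "A3", "act": "Establish", "agent": "E1",
            "organization": {"id": "E2", "object": "Agency",
                             "description": "Pakistan's nuclear department"},
        },
    },
    "proposition": {
        "id": "A2", "act": "Transfer", "agent": "E1",
        "theme": {"id": "E4", "object": "Know-How", "description": "nuclear technology"},
        "destination": [
            {"id": "E5", "object": "Nation", "name": "Iran"},
            {"id": "E6", "object": "Nation", "name": "Libya"},
            {"id": "E7", "object": "Nation", "name": "North Korea"},
        ],
    },
}


def collect(node, key):
    """Walk the frame depth-first and yield (id, value) for every node that has `key`."""
    if isinstance(node, dict):
        if key in node and "id" in node:
            yield node["id"], node[key]
        for value in node.values():
            yield from collect(value, key)
    elif isinstance(node, list):
        for item in node:
            yield from collect(item, key)


acts = list(collect(frame, "act"))    # the propositions in the frame
names = list(collect(frame, "name"))  # the named entities in the frame
```

Listing every `act` recovers the connected propositions (Acknowledge, Establish, Transfer), which is the "set of connected propositions" view of a sentence's meaning.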

Page 28: 1 31 March 2004 Global Autonomous Language Exploitation.

28

Role/Value of OntoBank

• OntoBank is a static (non-procedural) resource, coupled to Ontology

• Direct (first-order) use: enable training of various semantics-based systems (parsers, etc.)

• Indirect (second-order) use: use these semantic systems for semantics-enabled translation, extraction, detection, fusion

Page 29: 1 31 March 2004 Global Autonomous Language Exploitation.

29

Programmatics

• Data

• Schedule

• Evaluations

• Go/No-Go Criteria

• Contracts

• Budget

Page 30: 1 31 March 2004 Global Autonomous Language Exploitation.

30

Data GALE Will Use

Media

– Radio

– Television

– Newswire

– Newsgroups

– Weblogs

– Websites

Languages

– English

– Chinese

– Arabic

+ Readily available

+ Usefully diverse

+ No privacy problems

[Figure labels: TIDES, TIDES]

Page 31: 1 31 March 2004 Global Autonomous Language Exploitation.

31

Data Others Could Use

Media

– Intercepts

– Intelligence reports

– Record traffic

● ● ●

Languages

● ● ● ●

Classified

Page 32: 1 31 March 2004 Global Autonomous Language Exploitation.

32

Schedule

[Timeline, FY04 to FY09: BAA and Awards, then Key Technology Development, System Integration & Experimentation, and Transition]

Milestones: Technology Evaluations, Integrated System Evaluations, Field Tests, Opportunistic Spin-offs

Downselects following formal evaluations (best efforts boosted)

Page 33: 1 31 March 2004 Global Autonomous Language Exploitation.

33

Evaluation Methodology

Field Tests

• Defined in consultation with transition partners

Integrated System Evaluations

• Productivity calculated as number of useful reports produced per unit time by an analyst using GALE

• Productivity improvements measured relative to best productivity obtained at end of Phase 1

Technology Evaluations

• Algorithmic errors determined via official NIST tests

• Error reductions measured relative to best results at start of GALE program

Page 34: 1 31 March 2004 Global Autonomous Language Exploitation.

34

Go / No-Go Criteria

Phase 1 (18 months)

– Integrated systems built, productivity baseline established

– Algorithmic errors reduced by 20%

Phase 2 (12 months)

– Productivity improved by 50%

– Algorithmic errors reduced by additional 20%

Phase 3 (12 months)

– Productivity improved by additional 75%

– Algorithmic errors reduced by additional 20%

– First system transitioned for field testing

Phase 4 (18 months)

– Productivity improved by additional 100%

– Algorithmic errors reduced by additional 20%

– Two additional systems transitioned for field testing
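The cumulative effect of these criteria depends on how "additional" is read, which the slides do not spell out. The sketch below works the arithmetic under two assumed readings: each percentage compounds on the previous phase's result, or each percentage is measured against the fixed reference point (the Phase 1 productivity baseline, or the error rate at program start). Function and variable names are illustrative.

```python
# Per-phase targets from the criteria above.
productivity_gains = [0.50, 0.75, 1.00]       # Phases 2, 3, 4 (Phase 1 sets the baseline)
error_reductions = [0.20, 0.20, 0.20, 0.20]   # Phases 1 through 4


def compounded(rates, sign):
    """Reading 1 (assumed): each change applies to the previous phase's result."""
    total = 1.0
    for rate in rates:
        total *= 1.0 + sign * rate
    return total


def additive(rates, sign):
    """Reading 2 (assumed): each change is a fraction of the fixed reference point."""
    return 1.0 + sign * sum(rates)


print(f"Productivity, compounded: {compounded(productivity_gains, +1):.2f}x baseline")
print(f"Productivity, additive:   {additive(productivity_gains, +1):.2f}x baseline")
print(f"Errors left, compounded:  {compounded(error_reductions, -1):.1%} of start")
print(f"Errors left, additive:    {additive(error_reductions, -1):.1%} of start")
```

Under the compounding reading, productivity ends at 1.5 × 1.75 × 2.0 = 5.25 times the Phase 1 baseline and about 41% of the starting errors remain. Under the additive reading, the figures are 3.25 times and 20%.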

Page 35: 1 31 March 2004 Global Autonomous Language Exploitation.

35

Efforts

System Integration

Technology Development: Translation, Information Extraction, Novelty Detection, Information Fusion, Needs Discernment, Language Understanding

Support: NIST, Evaluation, Data

Page 36: 1 31 March 2004 Global Autonomous Language Exploitation.

36

Contracts

[Diagram: the Efforts chart annotated with contract labels #1, #2, #3, and ?]

#1 System Integration

Technology Development: Translation, Information Extraction, Novelty Detection, Information Fusion, Needs Discernment, Language Understanding

Support: NIST, Evaluation, Data

Page 37: 1 31 March 2004 Global Autonomous Language Exploitation.

37

FY04 FY05 FY06 FY07 FY08 FY09

Budget

Phase 1

Phase 2

Phase 3

Phase 4

Page 38: 1 31 March 2004 Global Autonomous Language Exploitation.

38

Summary

Vision

Superhuman Assistant

− Ingest huge volumes of unstructured speech & text

− Populate comprehensive knowledge base

− Discover trends & deviations

− Discern analyst needs

− Provide critical alerts & fused data (tailored to analyst needs)

Deliverables

• Powerful, reusable, multipurpose technology

• Field-testable GALE systems (3)

• Opportunistic spin-offs

+

Page 39: 1 31 March 2004 Global Autonomous Language Exploitation.

39

GALE

Page 40: 1 31 March 2004 Global Autonomous Language Exploitation.

40

Backup Slides

Page 41: 1 31 March 2004 Global Autonomous Language Exploitation.

41

ACE / TIDES Entities and Relations

Source sentence

The founder of Pakistan’s nuclear department, Abdul Qadeer Khan, has admitted he transferred nuclear technology to Iran, Libya, and North Korea.

Entities

E1: Person. Names: “Abdul Qadeer Khan”. Descriptions: “The founder of Pakistan’s nuclear department”, “he”

E2: Organization. Descriptions: “Pakistan’s nuclear department”

E3: GPE. Names: “Pakistan”

E4: GPE. Names: “Iran”

E5: GPE. Names: “Libya”

E6: GPE. Names: “North Korea”

Relations

Founder (Arg 1, Arg 2)

Subsidiary (Arg 1, Arg 2)

ACE

• Covers certain important entities and relations

• Captures too little of meaning

• Entity names & descriptions not normalized

Page 42: 1 31 March 2004 Global Autonomous Language Exploitation.

42

TreeBank Representation

The founder of Pakistan’s nuclear department, Abdul Qadeer Khan, has admitted he transferred nuclear technology to Iran, Libya, and North Korea.

[Parse tree figure: constituents labeled NP, PP, VP, SBAR, and S over the sentence]

TreeBank includes

• Part of speech

• Syntactic structure

Page 43: 1 31 March 2004 Global Autonomous Language Exploitation.

43

PropBank Representation

The founder of Pakistan’s nuclear department, Abdul Qadeer Khan, has admitted he transferred nuclear technology to Iran, Libya, and North Korea.

[Parse tree figure: constituents labeled NP, PP, VP, SBAR, and S over the sentence, with predicate-argument annotations]

Admitted (Arg0, Arg1)

Transferred (Arg0, Arg1, Arg2)

PropBank adds

• Shallow semantic information

Page 44: 1 31 March 2004 Global Autonomous Language Exploitation.

44

OntoBank Representation

The founder of Pakistan’s nuclear department, Abdul Qadeer Khan, has admitted he transferred nuclear technology to Iran, Libya, and North Korea.

Entities

E1: Founder. Names: “Abdul Qadeer Khan”. Descriptions: “The founder of Pakistan’s nuclear department”, “he”

E2: Agency. Descriptions: “Pakistan’s nuclear department”

E3: Nation. Names: “Pakistan”

E4: Know-How. Descriptions: “nuclear technology”

E5: Nation. Names: “Iran”

E6: Nation. Names: “Libya”

E7: Nation. Names: “North Korea”

Relations and events

Establish (Agent, Org)

Subsidiary (SubOrg, SuperOrg)

Acknowledge (Person, Fact)

Transfer (Agent, Item, Dest)

OntoBank adds

• Representation of all entities, relations, and events

• Terms from a formal representation (an ontology)

Page 45: 1 31 March 2004 Global Autonomous Language Exploitation.

45

Global Concept Ontology

• Represent meaning (in a usable way)

• Use terms within a formal system (not just text strings)

• Include all major entities, events, and relations (far more than ACE)

• Support downstream processing (other IPTO systems, many DoD applications)

• Employ frame notation (propositional logic)

• Start with existing ontology for definition of logical terms (Omega from ISI)

– Text-based, linguistically motivated

• Refine & extend as necessary

• Use DAML to accommodate other ontologies

– Compatible with customer legacy

Page 46: 1 31 March 2004 Global Autonomous Language Exploitation.

46

Entities Connected to Ontology

E1: Founder. Names: “Abdul Qadeer Khan”. Descriptions: “The founder of Pakistan’s nuclear department”, “he”

E2: Agency. Descriptions: “Pakistan’s nuclear department”

E3: Nation. Names: “Pakistan”

E4: Know-How. Descriptions: “nuclear technology”

E5: Nation. Names: “Iran”

E6: Nation. Names: “Libya”

E7: Nation. Names: “North Korea”

[Ontology figure: entity types link to nodes including Thing, Living Being, Founder/Beginner, Creator, Person, Political Unit, Social Group, Organization, Unit, Adm Unit, Nation, Agency, Activity, Application, Know-How/Technology]

Page 47: 1 31 March 2004 Global Autonomous Language Exploitation.

47

Relations Connected to Ontology

Establish. Agent: E1; Org: E2

Subsidiary. SubOrg: E2; SuperOrg: E3

Acknowledge. Person: E1; Fact:

Transfer. Agent: E1; Item: E4; Dest: E5, E6, E7

[Ontology figure: relation and event types link to nodes including Component, Open, Thing, Action, Transfer/transmit, Move, Admit, Declare, Think, Abstraction, Relation, Establish/set up]

Page 48: 1 31 March 2004 Global Autonomous Language Exploitation.

48

Semantic Representation

The semantic representation captures paraphrases of what is explicitly stated

Example texts

Jeff Bezos, Amazon’s founder

Bezos, was setting up Amazon.com

Jeff Bezos started the company in his garage

Entities

E1027: Person. Names: “Jeff Bezos”, “Bezos”. Descriptions: “Amazon’s founder”

E13043: Corporation. Names: “Amazon”, “Amazon.com”. Descriptions: “The company”

Relations

Establish (Agent, Org)

Originate (Agent, Originated)

[Ontology figure: concept nodes establish/found/launch, originate/initiate, make/create]

Page 49: 1 31 March 2004 Global Autonomous Language Exploitation.

49

Semantic Network

E67394: Name: Barbara Jordan; DOD: 1996

E473915: Name: Barbara Jordan; DOD: 1999

Barbara Jordan would keep the extent of her health problems secret until the day she died early in 1996

Barbara Jordan (1936–1996)

(1999-10-04) Jordan died Sunday of cancer

The semantic representation supports detection of apparent contradiction, in this case deconfliction of two different “Barbara Jordan” references.
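The deconfliction step can be sketched as a check on a single-valued attribute such as date of death (DOD). This is a simplified illustration under stated assumptions: the tuple layout, the document IDs, and both helper functions are hypothetical, and a real system would weigh source reliability before splitting an entity.

```python
from collections import defaultdict

# Attribute assertions extracted from different sources: (name, attribute, value, source).
# The source IDs are made up for the example.
assertions = [
    ("Barbara Jordan", "DOD", 1996, "doc-1"),
    ("Barbara Jordan", "DOD", 1996, "doc-2"),
    ("Barbara Jordan", "DOD", 1999, "doc-3"),
]


def find_conflicts(assertions):
    """Return {(name, attribute): {value: [sources]}} where more than one value is asserted."""
    values = defaultdict(lambda: defaultdict(list))
    for name, attribute, value, source in assertions:
        values[(name, attribute)][value].append(source)
    return {key: dict(by_value) for key, by_value in values.items() if len(by_value) > 1}


def split_entities(name, by_value):
    """Resolve a conflict on a single-valued attribute by proposing one entity per value."""
    return [{"name": name, "value": value, "sources": sources}
            for value, sources in sorted(by_value.items())]


conflicts = find_conflicts(assertions)
candidates = split_entities("Barbara Jordan", conflicts[("Barbara Jordan", "DOD")])
```

Since a person can have only one date of death, two asserted values suggest either an error or, as in this example, two different people with the same name.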

Page 50: 1 31 March 2004 Global Autonomous Language Exploitation.

50

Detecting Novelty

[Tel Aviv] A lone gunman attacked pedestrians near the American embassy.

An assailant wielding an AK-47 killed three people in downtown Tel Aviv Thursday.

Event4329: shooting. Agent: gunman; Victim: 3 pedestrians; Loc: Tel Aviv; Instr: AK-47

[Ontology figure: nodes include wrongdoer, criminal, murderer, gunman, assailant, person, traveler, pedestrian]

The semantic representation supports novelty detection by recognizing two accounts of one event and highlighting what is new in the 2nd.
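Once two accounts are known to describe the same event, "what is new" reduces to comparing slots. The sketch below assumes that matching has already been done (for example, by relating "gunman" and "assailant" through the ontology). The slot names follow the event frame above, while the function and the dict layout are illustrative.

```python
def merge_account(event, account):
    """Merge a new account into an event frame; return the slots that were new.

    Slots already filled are kept as-is; real systems would also need to
    reconcile differing values, which this sketch does not attempt.
    """
    new_slots = {}
    for slot, value in account.items():
        if slot not in event:
            event[slot] = value
            new_slots[slot] = value
    return new_slots


# First account: "A lone gunman attacked pedestrians near the American embassy."
event = {"type": "shooting", "Agent": "gunman", "Victim": "pedestrians", "Loc": "Tel Aviv"}

# Second account: "An assailant wielding an AK-47 killed three people in downtown Tel Aviv."
second = {"type": "shooting", "Agent": "assailant", "Victim": "three people",
          "Loc": "Tel Aviv", "Instr": "AK-47"}

novel = merge_account(event, second)
```

Here the only previously unfilled slot is the instrument, so the AK-47 is what gets highlighted. Refinements to existing slots, such as the victim count, would need value-level comparison on top of this.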