1 31 March 2004 Global Autonomous Language Exploitation.
Transcript of 1 31 March 2004 Global Autonomous Language Exploitation.
1
GALE
31 March 2004
Global Autonomous Language Exploitation
2
High Level Goal
Smart, speedy, supremely capable
Superhuman Assistant
Effectively exploit massive data
Accurately infer analyst need, make maximally productive
Learn and improve continually
Make analysts 10 times more productive
3
Challenge / Opportunity
Actual Analysts (working individually)
Fantasized Analysts (working as one)
Massive Sea of Data (speech & text, multiple languages)
4
Vision
Analyst
Massive Sea of Data (speech & text, multiple languages)
English Speaker +
Superhuman Assistant (big mind to see big picture)
5
POWERFUL Technology to Exploit Human Language
FY01 FY02 FY03 FY04 FY05 FY06 FY07 FY08
TIDES
EARS
GALE
6
POWERFUL Technology to Exploit Human Language
Vision
Multiple Media, Multiple Sources, Multiple Languages
Streaming, Changing, Huge Volumes
[Diagram labels: Text + Speech; EARS: Speech-to-Text; Text; TIDES; Requested Information (in English)]
[Timeline: FY01 FY02 FY03 FY04 FY05 FY06 FY07 FY08; TIDES, EARS, GALE]
7
POWERFUL Technology to Exploit Human Language
Vision
Multiple Media, Multiple Sources, Multiple Languages
Streaming, Changing, Huge Volumes
[Diagram labels: Text + Speech; EARS: Speech-to-Text; Text; TIDES, GALE; Critical Information (Fused, in English); Alerts]
• Fewer analysts
• Faster awareness
[Timeline: FY01 FY02 FY03 FY04 FY05 FY06 FY07 FY08; TIDES, EARS, GALE]
8
What Sets GALE Apart
Automatic, autonomous analysis of massive amounts of human language from around the world (multiple media, sources, languages)
Rapid discovery, fusion, and alerting – delivering information critical to national defense
Efficient, intelligent interaction with analysts
– Discern interests of individual analysts
– Point out new information & new events
– Organize information for ready access
– Constantly improve, learn how job is done (get smarter with time)
9
How TIDES Works
Powerful Tools for English-Speaking Analysts
Multiple Media, Multiple Sources, Multiple Languages
Streaming, Changing, Huge Volumes
[Diagram labels: Text + Speech; TIDES; Requested Information (in English); English-speaking analysts; Tasking; Report; Commanders & Policy Makers]
10
How GALE Will Work
Precise Information Proactively
Multiple Media, Multiple Sources, Multiple Languages
Streaming, Changing, Huge Volumes
[Diagram labels: Text + Speech; GALE; Needed Information (Fused, in English); Alerts; Hot Stuff; English-speaking analyst; Tasking; Report; Commanders & Policy Makers]
System learns via painless automatic relevance feedback loops
– Info examined, retained
– Explicit requests
– Tasking received
– Reports written
11
What Will Happen Inside
Data Information Knowledge Wisdom
Conversion
Extraction
Discovery
Analysis
[Diagram labels: Text, Speech, Video (non-English and English); Transcription; Translation; English text; Linked video, speech, text, English text; Information Extraction; Facts (names, numbers, dates, entities, relations, events); Novelty Detection; New Events, New Information, Contradictions; Information Fusion; Fused Facts, Associations, Inferences, Summaries; Needs Discernment; Tasking; Report; Request; Info Used, Retained; Alerts; Fused Info; English-speaking analyst]
12
What System Does
• Acts automatically, autonomously (takes initiative)
• Reads & interprets human language (source data, tasking, reports)
• Observes user actions (data viewed, used, retained, discarded)
• Adds to its global knowledge base + model of user needs
• Organizes, prioritizes, notifies (pushes precise information)
What Analyst Sees
• Fused information
• Prioritized information
– Summaries, headlines, alerts
– Tabular, graphic, telegraphic
• Source data + transcripts + translations in English (linked)
13
Research Communities
• Information retrieval
• Computational linguistics
• Machine learning
• Data storage & retrieval
• Knowledge representation & reasoning
• Social network theory
• Human-computer interaction
= tiger team
TIDES
14
Key Technologies
• Translation (readable, actionable English)
• Extraction (facts about entities, relations, events)
• Novelty detection (new events, new information)
• Information fusion (concise, integrative summaries)
• Needs discernment (analyst / commander needs)
• Language understanding (meaning representation)
= real power
15
Translation
[Diagram labels: Chinese, Arabic, English]
• Essential for foreign languages (most of world, DoD mission)
• Encouraging results on newswire (big leap last year)
• Not yet good enough
– Need higher accuracy, fluency, confidence measures
– Must handle varied data sources
  • Text that is more difficult than news
  • Speech that is imperfectly transcribed
• Promising ideas
– Trainable rules to translate names & numbers
– Morphology & syntax for general translation
– Error-driven training & development (big boost last year)
– Automatic data acquisition
– Semantics-enabled translation
16
State of the Art in Translation
Newswire
Cairo 4-6 (AFP) - said an official at the Egyptian Aviation Company today that the company egyptair may resume as of tomorrow, Wednesday its flights to Libya after the International Security Council resolution to the suspension of the embargo imposed on Libya.
Editorials
Want only chance to meet with President Clinton his address to Congress on the way the nation "lift across the pace of indictment" of Hezbollah "to peak included him in the" axis of evil "terrorist, in the House of Representatives will continue to Lebanese autopsy if Lebanese nation submerged under growing crises.
Broadcasts
Said Al-Aqsa Martyrs Brigade of the Fatah movement martyr Abu Ali Mustafa brigades of the Popular Front for the liberation of Palestine (PFLP) joint responsibility in the implementation of the process of Kfar Saba guerrilla
17
Information Extraction
[Diagram labels: Events, Relations, Entities; What, Who, When, Where; Meeting]
• Essential to pull out key facts (structure from unstructured data)
• Not nearly good enough
– Low coverage (5 entity types, 24 relations, 0 events)
– Low accuracy (72% for entities, 25% for relations)
– Even lower accuracy on audio & foreign languages (different styles, imperfect transcription + translation)
• Promising ideas
– Lightly supervised/unsupervised training algorithms
– Synthesis of statistical learning, semantically annotated data, and linguistically motivated taxonomy of logical terms
– Automatically customized language model from user’s document collection, corporate collection, and media sources
– New mathematical learning algorithms (see next slide)
– Semantics-enabled extraction
18
New Training Method for Extraction
• Combine
– New discriminative learning algorithms
  • Breakthrough in accuracy of interpretation
– Small supervised training with massive amounts of unsupervised training
  • Breakthrough in customizability & language portability
[Chart: Name Extraction. Score (axis 50 to 100) vs. Training Size (axis 1,000 to 1,000,000); curves: Supervised Approach, New Approach]
19
Novelty Detection
[Diagram labels: CNN, NPR, Al Jazeera, Xin Hua, AP, Reuters; Time; First story about a new event; Subsequent stories (some new info, lots of old info)]
• Must keep on top of changing world
• Need to know when something is new
– New events, new info
– User cannot specify the search
• Need to thread related items
• Promising ideas
– Combine shallow semantics (from extraction) and term similarity for document comparisons, novelty determination
– Machine learning to pinpoint relevance and recognize novelty
– Rapid, focused feedback from searcher to direct search and define what is novel, important
– Semantics-enabled detection
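The "term similarity for document comparisons" idea can be illustrated with a very small sketch. This is not the GALE method, only a common baseline assumed here for illustration: each story becomes a bag-of-words vector, and a story is flagged as a possible first story when its cosine similarity to every earlier story falls below a threshold. The tokenizer, the threshold of 0.3, and the sample headlines are all assumptions.

```python
import math
import re
from collections import Counter


def term_vector(text: str) -> Counter:
    """Lowercased bag-of-words term counts (a deliberately crude tokenizer)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def flag_first_stories(stories: list[str], threshold: float = 0.3) -> list[bool]:
    """Mark each story True if it resembles no earlier story (a possible new event)."""
    seen: list[Counter] = []
    flags = []
    for story in stories:
        vec = term_vector(story)
        # Highest similarity to anything already seen; 0.0 for the very first story.
        best = max((cosine(vec, old) for old in seen), default=0.0)
        flags.append(best < threshold)
        seen.append(vec)
    return flags


# Hypothetical headlines, loosely modeled on examples elsewhere in the deck.
stories = [
    "Gunman attacks pedestrians near embassy in Tel Aviv",
    "Tel Aviv gunman attacks pedestrians near embassy, three dead",
    "Airline may resume flights to Libya after embargo suspended",
]
print(flag_first_stories(stories))  # [True, False, True]
```

Pure term overlap misses paraphrases ("gunman" vs. "assailant"), which is the gap the slide's shallow-semantics and semantics-enabled ideas are meant to close.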
20
Information Fusion
• Information is
– Fragmented, fragmentary
– Repeated (across docs, languages)
– Contradicted (across sources, time)
– Perceived differently (by different people)
• Need concise synopses of reliable information, highlighting contradictions and different views
• Promising ideas
– Normalize names, terminology; update master knowledge base
– Exploit redundant information from multiple sources (including imperfect transcriptions, translations) as confirmation
– Utilize shallow semantic features (e.g., who, what, when, where, how) to identify contradictions, changes over time, different perspectives
– Combine semantic interpretation & opinion identification
– Semantics-enabled fusion
21
Needs Discernment
• To meet commander’s needs, must save analyst time & ensure analyst sees important things quickly
• New thrust (not done now)
• Promising ideas
– Instrument user interface, find out what analyst does (what used/saved/discarded)
– See what analyst tasked to do, produces
– Build up model of analyst interests/needs
– Interpret analyst requests in terms of model
– Extend functionality of “more like these” (from word-level to logical content level)
– Learn (continually adapt)
  • Maximize good reports
  • Minimize unwanted data
  • Minimize need for explicit requests
[Diagram labels: Intent Recognition; English-speaking analyst; Request; Tasking; Report; Alerts; Fused Info; Info Used, Retained; Analyst Needs Model]
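As a rough illustration of building a model of analyst interests from what is used, saved, or discarded, the sketch below keeps one running score per term and nudges it on every observed action. The class name, the action weights, and the word-level representation are all assumptions made for this example; the slide itself aims beyond word level, at logical content.

```python
from collections import defaultdict

# Hypothetical weights for each observed analyst action; real values would be learned.
ACTION_WEIGHTS = {"used": 1.0, "saved": 0.5, "discarded": -1.0}


class AnalystNeedsModel:
    """Toy interest profile: one running score per term."""

    def __init__(self) -> None:
        self.term_scores = defaultdict(float)

    def observe(self, document_terms: list[str], action: str) -> None:
        """Shift the score of each distinct term according to what the analyst did."""
        weight = ACTION_WEIGHTS[action]
        for term in set(document_terms):
            self.term_scores[term] += weight

    def score(self, document_terms: list[str]) -> float:
        """Sum of learned term scores; higher means closer to observed interests."""
        return sum(self.term_scores.get(term, 0.0) for term in set(document_terms))


model = AnalystNeedsModel()
model.observe(["libya", "embargo", "flights"], "used")
model.observe(["sports", "scores"], "discarded")

# Rank unseen documents by how well they match the observed behavior.
candidates = {
    "doc_a": ["libya", "nuclear", "technology"],
    "doc_b": ["sports", "finals"],
}
ranked = sorted(candidates, key=lambda name: model.score(candidates[name]), reverse=True)
print(ranked)  # ['doc_a', 'doc_b']
```

No explicit request is needed to produce the ranking, which is the point of the "minimize need for explicit requests" bullet.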
22
Language Understanding
• Terrible bottleneck (for many applications)
• Key technologies & many applications would work much better if machines could actually understand meaning of language
• Promising ideas
– Logical representation of text meaning (semantics)
– Linguistically motivated taxonomy of terms (ontology)
– Lightly supervised / unsupervised training
– User document collection, corporate collection, media as a language model
Highest risk, highest payoff
23
Prior Work on Language Understanding
Applications/evaluations
• Typed queries
• Spoken queries (ATIS)
• Semantics (SemEval)
• Message understanding (MUC)
• Information extraction (ACE)
• Question answering (AQUAINT)
Must attack challenges seriously, systematically -- NOW
Resources
• Many corpora
• WordNet
• TreeBank
• PropBank
• OntoBank
24
Levels of Representation
Words
Syntax
Explicit Semantics
Full Semantics
Morphology
TIDES
GALE
25
Plan of Attack
• Annotate diverse corpus in terms of meaning (OntoBank)
• Sponsor research to develop algorithms to —
– Mimic human annotations
– Identify predicates (relations & events) and their arguments (entities with their names & descriptions)
– Distinguish word senses
• Establish public competitions to evaluate the above
• Apply successes to improve performance of translation, extraction, detection, & fusion algorithms
26
What is OntoBank?
• Large (1 million words) collection of English texts from numerous domains and genres
• Important aspects of semantics manually added into each sentence:
– Semantic term(s) representing each major element of meaning (persons, events, objects, locations…)
– Semantic relationships connecting them together (Agent, Theme, Instrument…)
– Additional semantic aspects for major phenomena of meaning (pronominal reference, etc.)
• For each sentence, its meaning representation frame is a set of connected propositions
27
Meaning Representation Example
The founder of Pakistan’s nuclear department, Abdul Qadeer Khan, has admitted he transferred nuclear technology to Iran, Libya, and North Korea.
[A1 :act Acknowledge
    :agent [E1 :object Founder
               :name “Abdul Qadeer Khan”
               :description [A3 :act Establish
                                :agent E1
                                :organization [E2 :object Agency
                                                  :description “Pakistan’s nuclear department”]]]
    :proposition [A2 :act Transfer
                     :agent E1
                     :theme [E4 :object Know-How
                                :description “nuclear technology”]
                     :destination (and [E5 :object Nation :name “Iran”]
                                       [E6 :object Nation :name “Libya”]
                                       [E7 :object Nation :name “North Korea”])]]
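To make the frame notation concrete, here is a minimal sketch that transcribes the frame above into nested Python dictionaries and walks it to list its propositions and entities. The dictionary layout, the key names, and the `collect` helper are assumptions for illustration and are not part of OntoBank. Destination names follow the source sentence (Iran, Libya, North Korea).

```python
# The frame transcribed as nested dicts; a bare string such as "E1" is a back-reference.
frame = {
    "id": "A1", "act": "Acknowledge",
    "agent": {
        "id": "E1", "object": "Founder", "name": "Abdul Qadeer Khan",
        "description": {
            "id": "A3", "act": "Establish", "agent": "E1",
            "organization": {"id": "E2", "object": "Agency",
                             "description": "Pakistan's nuclear department"},
        },
    },
    "proposition": {
        "id": "A2", "act": "Transfer", "agent": "E1",
        "theme": {"id": "E4", "object": "Know-How", "description": "nuclear technology"},
        "destination": [
            {"id": "E5", "object": "Nation", "name": "Iran"},
            {"id": "E6", "object": "Nation", "name": "Libya"},
            {"id": "E7", "object": "Nation", "name": "North Korea"},
        ],
    },
}


def collect(node, acts, entities):
    """Recursively record every proposition (has 'act') and entity (has 'object')."""
    if isinstance(node, dict):
        if "act" in node:
            acts[node["id"]] = node["act"]
        if "object" in node:
            # Prefer the name; fall back to the textual description.
            entities[node["id"]] = node.get("name") or node.get("description")
        for value in node.values():
            collect(value, acts, entities)
    elif isinstance(node, list):
        for item in node:
            collect(item, acts, entities)


acts, entities = {}, {}
collect(frame, acts, entities)
print(acts)      # {'A1': 'Acknowledge', 'A3': 'Establish', 'A2': 'Transfer'}
print(entities)  # six entities, E1 through E7 without E3
```

The single sentence yields three connected propositions, matching the statement that a sentence's meaning representation frame is a set of connected propositions.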
28
Role/Value of OntoBank
• OntoBank is a static (non-procedural) resource, coupled to Ontology
• Direct (first-order) use: enable training of various semantics-based systems (parsers, etc.)
• Indirect (second-order) use: use these semantic systems for semantics-enabled translation, extraction, detection, fusion
29
Programmatics
• Data
• Schedule
• Evaluations
• Go/No-Go Criteria
• Contracts
• Budget
30
Data GALE Will Use
Media
– Radio
– Television
– Newswire
– Newsgroups
– Weblogs
– Websites
Languages
– English
– Chinese
– Arabic
+ Readily available
+ Usefully diverse
+ No privacy problems
TIDES
TIDES
31
Data Others Could Use
Media
– Intercepts
– Intelligence reports
– Record traffic
● ● ●
Languages
● ● ● ●
Classified
Schedule
[Timeline: FY04 FY05 FY06 FY07 FY08 FY09]
Key Technology Development
System Integration & Experimentation
Transition
BAA
Downselects following formal evaluations (best efforts boosted)
Integrated System Evaluations
Technology Evaluations
Awards
Field Tests
Opportunistic Spin-offs
33
Evaluation Methodology
Field Tests
• Defined in consultation with transition partners
Integrated System Evaluations
• Productivity calculated as number of useful reports produced per unit time by an analyst using GALE
• Productivity improvements measured relative to best productivity obtained at end of Phase 1
Technology Evaluations
• Algorithmic errors determined via official NIST tests
• Error reductions measured relative to best results at start of GALE program
34
Go / No-Go Criteria
Phase 1 (18 months)
– Integrated systems built, productivity baseline established
– Algorithmic errors reduced by 20%
Phase 2 (12 months)
– Productivity improved by 50%
– Algorithmic errors reduced by additional 20%
Phase 3 (12 months)
– Productivity improved by additional 75%
– Algorithmic errors reduced by additional 20%
– First system transitioned for field testing
Phase 4 (18 months)
– Productivity improved by additional 100%
– Algorithmic errors reduced by additional 20%
– Two additional systems transitioned for field testing
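The slide does not say whether each "additional" target compounds on the previous phase or is added to a fixed baseline, so the arithmetic below works out both readings. It is a sketch under that stated uncertainty: productivity is expressed as a multiple of the Phase 1 baseline and algorithmic error as a fraction of the error at program start, and the `cumulative_targets` helper is hypothetical.

```python
# Targets per phase from the Go/No-Go slide: (name, productivity gain, error reduction).
# Phase 1 only establishes the productivity baseline, so its gain is 0.
PHASES = [
    ("Phase 1", 0.00, 0.20),
    ("Phase 2", 0.50, 0.20),
    ("Phase 3", 0.75, 0.20),
    ("Phase 4", 1.00, 0.20),
]


def cumulative_targets(phases, compound):
    """Return (phase, productivity multiple, remaining error fraction) per phase.

    compound=True  : each target applies to the previous phase's result.
    compound=False : each target is a share of the original baseline, summed.
    """
    productivity = 1.0
    error = 1.0
    rows = []
    for name, gain, reduction in phases:
        if compound:
            productivity *= 1.0 + gain
            error *= 1.0 - reduction
        else:
            productivity += gain
            error -= reduction
        rows.append((name, round(productivity, 4), round(error, 4)))
    return rows


for label, compound in (("compounding", True), ("additive", False)):
    print(label)
    for row in cumulative_targets(PHASES, compound):
        print("  ", row)
# compounding ends at ('Phase 4', 5.25, 0.4096)
# additive    ends at ('Phase 4', 3.25, 0.2)
```

Under the compounding reading the program would end at 5.25 times baseline productivity with about 41% of the starting error remaining; under the additive reading, 3.25 times and 20%.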
35
Efforts
System Integration
Technology Development
Translation
Needs Discernment
Information Extraction
Language Understanding
Support NIST
Evaluation Data
Information Fusion
Novelty Detection
36
Contracts
#1
System Integration
Technology Development
#2
#3
Translation
Needs Discernment
Information Extraction
Language Understanding
Support NIST
Evaluation Data
Information Fusion
Novelty Detection
?
37
FY04 FY05 FY06 FY07 FY08 FY09
Budget
Phase 1
Phase 2
Phase 3
Phase 4
38
Summary
Vision
Superhuman Assistant
− Ingest huge volumes of unstructured speech & text
− Populate comprehensive knowledge base
− Discover trends & deviations
− Discern analyst needs
− Provide critical alerts & fused data (tailored to analyst needs)
Deliverables
• Powerful, reusable, multipurpose technology
• Field-testable GALE systems (3)
• Opportunistic spin-offs
+
39
GALE
40
Backup Slides
41
ACE / TIDES Entities and Relations
Source sentence
The founder of Pakistan’s nuclear department, Abdul Qadeer Khan, has admitted he transferred nuclear technology to Iran, Libya, and North Korea.
Entities
E1: Person
Names: “Abdul Qadeer Khan”
Descriptions: “The founder of Pakistan’s nuclear department”, “he”
E2: Organization
Descriptions: “Pakistan’s nuclear department”
E3: GPE
Names: “Pakistan”
E4: GPE
Names: “Iran”
E5: GPE
Names: “Libya”
E6: GPE
Names: “North Korea”
Relations
Subsidiary
Arg 1:
Arg 2:
Founder
Arg 1:
Arg 2:
ACE
• Covers certain important entities and relations
• Captures too little of meaning
• Entity names & descriptions not normalized
42
TreeBank Representation
The founder of Pakistan’s nuclear department, Abdul Qadeer Khan, has admitted he transferred nuclear technology to Iran, Libya, and North Korea.
[Parse tree: nodes labeled NP, PP, VP, S, SBAR]
TreeBank includes
• Part of speech
• Syntactic structure
43
PropBank Representation
The founder of Pakistan’s nuclear department, Abdul Qadeer Khan, has admitted he transferred nuclear technology to Iran, Libya, and North Korea.
[Parse tree: nodes labeled NP, PP, VP, S, SBAR]
Admitted
Arg0:
Arg1:
Transferred
Arg0:
Arg1:
Arg2:
PropBank adds
• Shallow semantic information
44
OntoBank Representation
The founder of Pakistan’s nuclear department, Abdul Qadeer Khan, has admitted he transferred nuclear technology to Iran, Libya, and North Korea.
Establish
Agent:
Org:
Subsidiary
SubOrg:
SuperOrg:
E1: Founder
Names: “Abdul Qadeer Khan”
Descriptions: “The founder of Pakistan’s nuclear department”, “he”
E2: Agency
Descriptions: “Pakistan’s nuclear department”
E3: Nation
Names: “Pakistan”
E5: Nation
Names: “Iran”
E6: Nation
Names: “Libya”
E7: Nation
Names: “North Korea”
E4: Know-How
Descriptions: “nuclear technology”
Acknowledge
Person:
Fact:
Transfer
Agent:
Item:
Dest:
OntoBank adds
• Representation of all entities, relations, and events
• Terms from a formal representation (an ontology)
45
Global Concept Ontology
• Represent meaning (in a usable way)
• Use terms within a formal system (not just text strings)
• Include all major entities, events, and relations (far more than ACE)
• Support downstream processing (other IPTO systems, many DoD applications)
• Employ frame notation (propositional logic)
• Start with existing ontology for definition of logical terms (Omega from ISI)
– Text-based, linguistically motivated
• Refine & extend as necessary
• Use DAML to accommodate other ontologies
– Compatible with customer legacy
46
Entities Connected to Ontology
E1: Founder
Names: “Abdul Qadeer Khan”
Descriptions: “The founder of Pakistan’s nuclear department”, “he”
E2: Agency
Descriptions: “Pakistan’s nuclear department”
E3: Nation
Names: “Pakistan”
E5: Nation
Names: “Iran”
E6: Nation
Names: “Libya”
E7: Nation
Names: “North Korea”
E4: Know-How
Descriptions: “nuclear technology”
[Ontology tree labels: Thing; Living Being; Founder, Beginner; Creator; Person; Political Unit; Social Group; Organization; Unit; Adm Unit; Nation; Agency; Activity; Application; Know-How, Technology]
47
Relations Connected to Ontology
Establish
Agent: E1
Org: E2
Subsidiary
SubOrg: E2
SuperOrg: E3
Acknowledge
Person: E1
Fact:
Transfer
Agent: E1
Item: E4
Dest: E5, E6, E7
[Ontology tree labels: Component; Open; Thing; Action; Transfer, transmit; Move; Admit; Declare; Think; Abstraction; Relation; Establish, set up]
48
Semantic Representation
The semantic representation captures paraphrases of what is explicitly stated
E1027: Person
Names: “Jeff Bezos”, “Bezos”
Descriptions: “Amazon’s founder”
E13043: Corporation
Names: “Amazon”, “Amazon.com”
Descriptions: “The company”
Jeff Bezos, Amazon’s founder
Bezos, was setting up Amazon.com
Jeff Bezos started the company in his garage
Establish
Agent:
Org:
Originate
Agent:
Originated:
[Ontology labels: establish/found/launch; originate/initiate; make/create]
49
Semantic Network
E67394:
Name: Barbara Jordan
DOD: 1996
E473915:
Name: Barbara Jordan
DOD: 1999
Barbara Jordan would keep the extent of her health problems secret until the day she died early in 1996
Barbara Jordan (1936–1996)
(1999-10-04) Jordan died Sunday of cancer
[Diagram: the two entity records are marked with an x as conflicting]
The semantic representation supports detection of apparent contradiction, in this case deconfliction of two different “Barbara Jordan” references.
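A minimal sketch of how such an apparent contradiction could be surfaced: group entity records by name and report any name whose records disagree on an attribute. The record layout, the `dod` field name, and the `conflicting_names` helper are assumptions for illustration; a real system would then decide whether the conflict is an error or, as here, two distinct people.

```python
from collections import defaultdict

# Records mirror the slide's two entities; the field names are illustrative.
entities = [
    {"id": "E67394", "name": "Barbara Jordan", "dod": 1996},
    {"id": "E473915", "name": "Barbara Jordan", "dod": 1999},
]


def conflicting_names(records, attribute):
    """Return names whose records disagree on `attribute`, with the competing values."""
    values_by_name = defaultdict(dict)
    for record in records:
        # Map each distinct attribute value to the entity ids that assert it.
        values_by_name[record["name"]].setdefault(record[attribute], []).append(record["id"])
    return {name: values for name, values in values_by_name.items() if len(values) > 1}


print(conflicting_names(entities, "dod"))
# {'Barbara Jordan': {1996: ['E67394'], 1999: ['E473915']}}
```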
50
Detecting Novelty
[Tel Aviv] A lone gunman attacked pedestrians near the American embassy.
An assailant wielding an AK-47 killed three people in downtown Tel Aviv Thursday.
Event4329: shooting
Agent: gunman
Victim: 3 pedestrians
Loc: Tel Aviv
Instr: AK-47
[Ontology labels: wrongdoer; criminal; murderer; gunman; assailant; person; traveler; pedestrian]
The semantic representation supports novelty detection by recognizing two accounts of one event and highlighting what is new in the second.
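The idea can be sketched with two hand-filled event frames. The slot names, the slot values, and the rule that two accounts describe one event when their type and location match are all assumptions for this example. Matching "gunman" to "assailant" would need the ontology shown on the slide, which this sketch leaves out.

```python
# Hypothetical slot values, filled by hand from the two accounts on the slide.
first_account = {
    "type": "shooting", "agent": "gunman",
    "victim": "pedestrians", "loc": "Tel Aviv",
}
second_account = {
    "type": "shooting", "agent": "assailant", "victim": "three people",
    "loc": "Tel Aviv", "instr": "AK-47", "time": "Thursday",
}


def same_event(a, b, keys=("type", "loc")):
    """Crude identity test: the two frames agree on every key slot."""
    return all(a.get(key) == b.get(key) for key in keys)


def new_information(known, incoming):
    """Slots the incoming account fills that the known account left empty."""
    return {slot: value for slot, value in incoming.items() if slot not in known}


if same_event(first_account, second_account):
    print(new_information(first_account, second_account))
# {'instr': 'AK-47', 'time': 'Thursday'}
```

Slots that both accounts fill with different wording (agent, victim) are not reported as new here; deciding whether "three people" refines "pedestrians" is the harder, semantics-dependent step.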