Natural Language Processing for Underground Communications
Dan Klein
MURI Kickoff, 11/20/2009
Underground Communications
Example Data
Example Data, Manual Extraction
Processing: Information Extraction
Observation Graphs
http://www.spam-reklama.ru/contact.html
http://www.rossmail.ru/offline.htm
http://www.fax-reklama.ru/contact.html
http://www.f-mail.ru/kontact/
Underlying Entities and Relations
Person 1211: Alias: Steakcap; ICQ: 598199837; Location: France
Referral: From: Person 2133; To: Person 1211; Product: 3319
Person 2133: Alias: Thunderelvi; ICQ: 787659871; Location: USA
Product 3319: Type: FB Harvester; Contact: 709-324-0989
Person 9876: Alias: Zakar; ICQ: 234150301; Email: zakar@e-...
Employee: Person: Person 9876; Product: 5621; Role: Developer
Product 5621: Type: Spam Sender; Contact: 495-210-4423
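The entity and relation records on this slide can be written down as a small typed schema. The sketch below is only an illustration: the class and field names (`Person`, `Product`, `Referral`, `Employee`, `from_person`, and so on) are assumptions chosen to mirror the slide's labels, not a schema given in the talk. The truncated email address is kept exactly as it appears.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Person:
    id: int
    alias: str
    icq: Optional[str] = None
    location: Optional[str] = None
    email: Optional[str] = None


@dataclass(frozen=True)
class Product:
    id: int
    type: str
    contact: Optional[str] = None


@dataclass(frozen=True)
class Referral:
    # Relation: one person refers another person to a product.
    from_person: int
    to_person: int
    product: int


@dataclass(frozen=True)
class Employee:
    # Relation: a person works on a product in some role.
    person: int
    product: int
    role: str


people = {
    1211: Person(1211, "Steakcap", icq="598199837", location="France"),
    2133: Person(2133, "Thunderelvi", icq="787659871", location="USA"),
    # The email is truncated on the slide and is left truncated here.
    9876: Person(9876, "Zakar", icq="234150301", email="zakar@e-..."),
}
products = {
    3319: Product(3319, "FB Harvester", contact="709-324-0989"),
    5621: Product(5621, "Spam Sender", contact="495-210-4423"),
}
referrals = [Referral(from_person=2133, to_person=1211, product=3319)]
employees = [Employee(person=9876, product=5621, role="Developer")]

# Example query: who develops which product type?
developers = [
    (people[e.person].alias, products[e.product].type)
    for e in employees
    if e.role == "Developer"
]
```

Keeping relations as separate records that point at entity ids, rather than nesting them inside entities, makes it straightforward to add new relation types later.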
Extraction Goal
Existing NLP Tasks
Discourse Structure
Example words: sign, deliver, vote
General Approach
An Entity Reference Model
Our Existing Approach
Adding Semantic Knowledge
Example: America Online, company
Our Current Work
Evaluation: Reference
[Chart: MUC F1 (cluster similarity), unsupervised vs. supervised systems. Bars: Unsupervised Baseline; Bengtson & Roth 08; Preliminary Current Work]
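For readers unfamiliar with the MUC F1 score used in this evaluation, here is a minimal sketch of the standard link-based MUC metric. It assumes each clustering is given as a list of sets of mention ids; the function names and the toy data are illustrative assumptions, not taken from the talk.

```python
def muc_recall(key, response):
    """Link-based MUC recall of `response` against gold clusters `key`.

    For each key cluster K, count how many pieces the response splits it
    into; mentions missing from the response each count as their own piece.
    """
    mention_to_cluster = {}
    for idx, cluster in enumerate(response):
        for mention in cluster:
            mention_to_cluster[mention] = idx

    numerator = 0
    denominator = 0
    for cluster in key:
        pieces = set()
        unaligned = 0
        for mention in cluster:
            if mention in mention_to_cluster:
                pieces.add(mention_to_cluster[mention])
            else:
                unaligned += 1
        numerator += len(cluster) - (len(pieces) + unaligned)
        denominator += len(cluster) - 1
    return numerator / denominator if denominator else 0.0


def muc_f1(key, response):
    recall = muc_recall(key, response)
    # Precision is recall with the roles of key and response swapped.
    precision = muc_recall(response, key)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example (hypothetical mention ids).
gold = [{"a", "b", "c"}, {"d", "e"}]
predicted = [{"a", "b"}, {"c", "d", "e"}]
score = muc_f1(gold, predicted)
```

In the toy example the prediction breaks one gold link and adds one spurious link, so precision and recall are both 2/3.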
Does it Work?
Cross-Document Identity
What’s Coming Up
Extracting Global Entities
Subsequent Goals
Summary
Goal: systems which simultaneously extract and dedupe
  Train in an unsupervised / discovery manner
  Requires: both new statistical machinery and good models of underlying domain structure (transactions, etc.)
  Requires: processing domain-specific language (domain adaptation, grammar induction)
Evaluation: are the entities and relations correct?
  First steps: measure general approach on newswire, etc., where we know the right answers
  Also: evaluate on underground network data
Near term: increased accuracy in identity resolution, begin to extract simple relations, better basic analysis
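To make "dedupe" and "identity resolution" concrete, here is a deliberately simple baseline: merge any two extracted records that share an exact identifier value, using union-find. This is not the statistical approach the talk proposes; it is a deterministic stand-in for illustration. The record layout, the `resolve_identities` name, and the alias "steak_cap" are all assumptions.

```python
from collections import defaultdict


def resolve_identities(records, keys=("icq",)):
    """Group record indices that share an exact value for any field in `keys`."""
    parent = list(range(len(records)))

    def find(i):
        # Path-halving find.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    first_seen = {}
    for i, record in enumerate(records):
        for key in keys:
            value = record.get(key)
            if value is None:
                continue
            if (key, value) in first_seen:
                union(i, first_seen[(key, value)])
            else:
                first_seen[(key, value)] = i

    clusters = defaultdict(list)
    for i in range(len(records)):
        clusters[find(i)].append(i)
    return sorted(clusters.values())


# Hypothetical per-document observations; "steak_cap" is an invented alias.
observations = [
    {"alias": "Steakcap", "icq": "598199837"},
    {"alias": "steak_cap", "icq": "598199837"},
    {"alias": "Zakar", "icq": "234150301"},
]
groups = resolve_identities(observations)
```

Exact matching fails as soon as identifiers are missing, reused, or obfuscated, which is one reason a learned model of entities and domain structure is needed.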
Thanks!