Information Extraction. Information Extraction System Converts unstructured text into a form that...

Information Extraction

Information Extraction System

• Converts unstructured text into a form that can be loaded into a database table

• Mentions of entities extracted without deep understanding

• Identifies useful/relevant text in a document

  – Text segment and its associated attributes

Morita said that to overcome the same currency problems, Japan needs to restructure its economy in order to live less from exports and more from domestic demand.

Extracted entities: Morita, Japan

Information Extraction

• Names can be identified without deep parse or complete text understanding

• Pattern recognition, machine learning algorithms

• Part of a higher-level application

  – Question answering

  – Summarization

Information Extraction : Example

List the news reports of car bombings in Basra and surrounding areas between June 2004 and December 2004.

Need semantic information

Date Format

Subattributes of attributes

Conversion from unstructured to structured form

Natural language questions are open-ended whereas SQL queries are NOT!

Polysemy/Synonymy
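To make the conversion from an open-ended question to a structured query concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name `incident`, its columns, the rows, and the `BASRA_AREA` list are all made up for illustration; they are not from the slides. The sketch assumes dates have already been normalized to ISO format and that "surrounding areas" has been resolved to an explicit list of places, which is exactly the semantic information the question requires.

```python
import sqlite3

# Hypothetical table of already-extracted incidents (all rows are invented).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE incident (event_type TEXT, city TEXT, event_date TEXT, source TEXT)"
)
conn.executemany(
    "INSERT INTO incident VALUES (?, ?, ?, ?)",
    [
        ("car bombing", "Basra", "2004-07-14", "report-1"),
        ("car bombing", "Az Zubayr", "2004-11-02", "report-2"),
        ("car bombing", "Basra", "2005-01-20", "report-3"),    # outside date range
        ("roadside bomb", "Basra", "2004-08-09", "report-4"),  # wrong event type
        ("car bombing", "Baghdad", "2004-09-30", "report-5"),  # wrong location
    ],
)

# "Basra and surrounding areas" must be resolved to concrete places.
# This list is an assumption standing in for real geographic knowledge.
BASRA_AREA = ["Basra", "Az Zubayr"]


def car_bombings_near_basra(conn):
    """Structured equivalent of the natural-language question."""
    placeholders = ",".join("?" for _ in BASRA_AREA)
    query = (
        "SELECT source FROM incident "
        "WHERE event_type = ? AND city IN (" + placeholders + ") "
        "AND event_date BETWEEN ? AND ? ORDER BY event_date"
    )
    # ISO dates (YYYY-MM-DD) compare correctly as plain strings.
    params = ["car bombing", *BASRA_AREA, "2004-06-01", "2004-12-31"]
    return [row[0] for row in conn.execute(query, params)]


print(car_bombings_near_basra(conn))
```

The query only works because event type, location, and date were extracted into separate, normalized attributes beforehand.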

IE Applications

• Financial, legal, medical industries produce/use large amounts of text

• Medicine

  – Information extracted from papers

    • Patients’ response to drugs

    • Summarize symptoms

    • Capture gene-drug interactions

• Law

  – Case reports can be mined for

    • Description of individual cases

    • Case type(?)-ruling

    • Court opinion on different case categories

    • Related cases

• Finance

  – Financial reports or news articles can be mined for

    • Company revenues

    • Earnings

    • Sales/assets over a period

IE Applications

E-Recruitment

Extracting Sales Information

Intelligence Collection for news articles

Positions Available

• Company: Overwatch

• Location: Fort Leavenworth, KS 66027

• Status: Full Time, Employee

• Job Category: IT/Software Development

• Career Level: Experienced (Non-Manager)

• Position Description:

• Database Administrator

– The overall goal of the TRADOC Intelligence Support Activity is to develop and apply the Contemporary Operational Environment (COE) to training, leader development, and combat developments in order to enhance operational capabilities of Army units. Duty is at TRISA in Fort Leavenworth, KS.

– Develop a culture-based data standards database construct incorporating all schema (data model) types to be considered as a candidate for the Army’s common culture data standard—i.e., culture data standards to drive army non-kinetic simulations/models

Positions Available (cont)

1. Develop a conceptual schema (data model) consisting of entity classes (representing things of significance in the domain) and relationships (assertions about associations between pairs of entity classes).

2. Develop a logical schema (data model) consisting of descriptions of tables and columns, object oriented classes, and XML tags, among other things.

3. Develop a physical schema (data model) consisting of partitions, CPUs, table-spaces, and the like.

Positions Available (cont)

• 5-10 years DB administration experience

• BS Computer Science minimum; MCDBA certification or any related certification(s) a plus

• Expert at designing, integrating, all three database schema (conceptual, logical, and physical).

• Familiar with data mining techniques and various methodologies of translating text files into data models (Visual Basic.NET, Advanced knowledge of SQL, MS Access, and Postgres).

• Top Secret security clearance with SCI access.

• Be willing to learn/ramp-up on BLUFOR and OPFOR/COE doctrine, organization, tactics, techniques, and procedures.

• Become familiar with Future Force organization and maneuver concepts.

• Excellent briefing and writing skills; familiar with Microsoft Word, EXCEL, Power Point, and ACCESS.

• Be a team player/builder.
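For the e-recruitment case, a job posting like the one above is semi-structured: some attributes carry explicit labels, others are buried in free text. The following is a rough sketch, not a production extractor, of pulling both kinds into one record. The names `FIELD`, `YEARS`, and `extract_posting`, and the choice of which fields to capture, are assumptions made for this example.

```python
import re

# A few lines taken from the posting, one attribute per line.
POSTING = """\
Company: Overwatch
Location: Fort Leavenworth, KS 66027
Status: Full Time, Employee
Job Category: IT/Software Development
Career Level: Experienced (Non-Manager)
5-10 years DB administration experience
Top Secret security clearance with SCI access.
"""

# Labeled attributes: "Label: value" at the start of a line.
FIELD = re.compile(
    r"^(Company|Location|Status|Job Category|Career Level):\s*(.+)$", re.M
)
# Unlabeled attribute: an experience range such as "5-10 years".
YEARS = re.compile(r"(\d+)-(\d+) years")


def extract_posting(text):
    """Return a flat record suitable for loading into a database row."""
    record = {
        label.lower().replace(" ", "_"): value.strip()
        for label, value in FIELD.findall(text)
    }
    years = YEARS.search(text)
    if years:
        record["min_years"] = int(years.group(1))
        record["max_years"] = int(years.group(2))
    return record


print(extract_posting(POSTING))
```

Labeled fields are the easy part; requirements such as the clearance level or the skills list would need patterns or learned models of their own.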

Sales Information

HP zv6000 Notebook Featuring AMD Sempron Processor

DELL XPS 3.6 GHz 1GB RAID 0 DVDRW CDRW 20” LCD XP PRO $500 OFF Reflected in Price $2,499.99

DELL 4700 P4 540 3.2 GHz 512MB, 160GB DVD+/-RW, DVD 19” LCD XP Home/Works Suite $1,199.99

Dell 8400 P4 3.4 GHz 1 GB, 250GB DVDRW, DVD 19” LCD XP Pro Works Suite $1,699.99
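Ad text like this can often be handled with simple pattern recognition. The sketch below turns two of the listings into records with a few regular expressions. The patterns and the field names (`brand`, `model`, `cpu_ghz`, `price`) are illustrative assumptions; real ads vary far more than these patterns allow.

```python
import re

ADS = [
    'DELL 4700 P4 540 3.2 GHz 512MB, 160GB DVD+/-RW, DVD 19" LCD XP Home/Works Suite $1,199.99',
    'Dell 8400 P4 3.4 GHz 1 GB, 250GB DVDRW, DVD 19" LCD XP Pro Works Suite $1,699.99',
]

PRICE = re.compile(r"\$(\d+(?:,\d{3})*\.\d{2})")  # e.g. $1,199.99
CPU = re.compile(r"(\d+(?:\.\d+)?)\s*GHz")        # e.g. 3.2 GHz
BRAND_MODEL = re.compile(r"^(\w+)\s+(\w+)")       # first two words of the ad


def extract_sale(ad):
    """Map one ad line to a record; missing attributes become None."""
    prices = PRICE.findall(ad)
    cpu = CPU.search(ad)
    head = BRAND_MODEL.match(ad)
    return {
        "brand": head.group(1).upper() if head else None,
        "model": head.group(2) if head else None,
        "cpu_ghz": float(cpu.group(1)) if cpu else None,
        # Assume the final price on the line is the selling price.
        "price": float(prices[-1].replace(",", "")) if prices else None,
    }


for ad in ADS:
    print(extract_sale(ad))
```

Normalizing the brand to upper case is a small example of merging variant spellings ("DELL", "Dell") before the record reaches a database table.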

Intelligence from news Articles v1

BAGHDAD, Iraq – Police reported that insurgents in two separate attacks had killed the head of a Baghdad police station and four officers on Monday. The head of the Balat-al-Shouhada police station, Col. Abdul Kahrim Fahad and his driver were killed in a drive-by shooting on Monday morning. The attack occurred when Col. Fahad was on his way to the police station in southeastern Baghdad.

Intelligence from news Articles v2

• Insurgents struck Iraqi security forces Monday, killing the head of a Baghdad police station and four other officers in separate attacks, police said.

• Col. Abdul Kahrim Fahad, head of the Balat al-Shouhada police station, and his driver were gunned down in a drive-by shooting.

• Earlier Monday, a roadside bomb exploded near an Iraqi police patrol in southwestern Baghdad, killing one Iraqi policeman and wounding five other people, including three Iraqi police.

Entity Extraction

• Classification problem

  – Not every word is associated with a semantic class

• Two phases

  1. Identify potential entity words

  2. Classify into entity types

• Lists of entities

  – All inclusive list of entities (?)

  – Names in more than one list (?)

• Machine Learning Techniques
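The two phases and the list-based approach can be sketched in a few lines. This is a toy under stated assumptions: the `GAZETTEER` contents are invented, phase 1 uses capitalization as its only cue, and phase 2 is a plain list lookup rather than a learned classifier.

```python
import re

# Hypothetical, deliberately tiny entity lists. "Washington" sits in two
# lists on purpose to show the "names in more than one list" problem.
GAZETTEER = {
    "PERSON": {"Morita", "Washington"},
    "LOCATION": {"Japan", "Washington"},
}


def candidates(sentence):
    """Phase 1: identify potential entity words (here: capitalized words)."""
    return re.findall(r"\b[A-Z][a-z]+\b", sentence)


def classify(token):
    """Phase 2: return every entity type whose list contains the token."""
    return sorted(etype for etype, names in GAZETTEER.items() if token in names)


def extract_entities(sentence):
    found = {}
    for token in candidates(sentence):
        types = classify(token)
        if types:  # not every candidate word belongs to a semantic class
            found[token] = types
    return found


SENTENCE = (
    "Morita said that to overcome the same currency problems, Japan needs to "
    "restructure its economy in order to live less from exports and more from "
    "domestic demand."
)
print(extract_entities(SENTENCE))
print(extract_entities("Washington issued a statement."))
```

The second call returns two types for "Washington", which a list lookup cannot resolve on its own. Ambiguity of that kind, together with the impossibility of an all-inclusive list, is one reason to use machine learning techniques that take context into account.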

Entity Extraction

• Tokens/tags

• Sentence analysis

• Merging of multiple references to the same entity

• Extraction

• Population of db tables
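The steps listed above can be strung together as a small pipeline. The sketch below runs on two sentences from the news excerpt quoted earlier. Everything in it is a simplifying assumption: tags are just TITLE/CAP/OTHER instead of real part-of-speech tags, a person mention is a rank abbreviation followed by capitalized words, mentions are merged by last name, and the `person` table is a hypothetical schema.

```python
import re
import sqlite3

TEXT = (
    "The head of the Balat-al-Shouhada police station, Col. Abdul Kahrim Fahad "
    "and his driver were killed in a drive-by shooting on Monday morning. "
    "The attack occurred when Col. Fahad was on his way to the police station "
    "in southeastern Baghdad."
)

TITLES = {"Col."}  # hypothetical list of rank abbreviations


def tokenize_and_tag(text):
    """Tokens/tags: split into words and attach a coarse tag to each."""
    tagged = []
    for tok in re.findall(r"[A-Za-z][A-Za-z\-]*\.?", text):
        if tok in TITLES:
            tagged.append((tok, "TITLE"))
        elif tok[0].isupper():
            tagged.append((tok.rstrip("."), "CAP"))
        else:
            tagged.append((tok.rstrip("."), "OTHER"))
    return tagged


def group_mentions(tagged):
    """Sentence analysis and extraction: TITLE followed by CAP words is a person."""
    mentions, i = [], 0
    while i < len(tagged):
        if tagged[i][1] == "TITLE":
            j = i + 1
            while j < len(tagged) and tagged[j][1] == "CAP":
                j += 1
            if j > i + 1:
                mentions.append([tok for tok, _ in tagged[i + 1:j]])
            i = j
        else:
            i += 1
    return mentions


def merge_mentions(mentions):
    """Merging: mentions sharing a last name refer to the same entity."""
    merged = {}
    for name in mentions:
        entry = merged.setdefault(name[-1], {"name": name, "count": 0})
        if len(name) > len(entry["name"]):
            entry["name"] = name  # keep the most complete form of the name
        entry["count"] += 1
    return [(" ".join(e["name"]), e["count"]) for e in merged.values()]


def populate(conn, rows):
    """Population of db tables."""
    conn.execute("CREATE TABLE person (name TEXT, mentions INTEGER)")
    conn.executemany("INSERT INTO person VALUES (?, ?)", rows)


conn = sqlite3.connect(":memory:")
populate(conn, merge_mentions(group_mentions(tokenize_and_tag(TEXT))))
print(conn.execute("SELECT name, mentions FROM person").fetchall())
```

"Col. Abdul Kahrim Fahad" and "Col. Fahad" end up as one row with two mentions, which is the point of the merging step.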

IE Systems

[Figure: IE system pipeline. Stages: Tokenization and Tagging, Sentence Analysis, Extractor, Merging, Template Generation. Labels in the diagram: TEXT, Tokens, POS Tags, Combined Entities, Assigned Entities, POS Groups.]

Difficulties in Entity Extraction

• Words in multiple lists

• Boundary problem

• Use of conjunction/disjunction

• Embedded NEs

• Abbreviations

• Acronyms

MUC-6

• Markup Description

• The output of the systems to be evaluated will be in the form of SGML text markup. The only insertions allowed during tagging are tags enclosed in angled brackets. No extra whitespace or carriage returns are to be inserted; otherwise, the offset count would change, which would adversely affect scoring. The markup will have the following form:

• <ELEMENT-NAME ATTR-NAME="ATTR-VALUE" ...>text-string</ELEMENT-NAME>

• Example:

• <ENAMEX TYPE="ORGANIZATION">Taga Co.</ENAMEX>

• The markup is defined in SGML Document Type Descriptions (DTDs), written for MUC-6 use by personnel at MITRE and maintained by personnel at NRaD. The DTDs enable annotators and system developers to use SGML validation tools to check the correctness of the SGML-tagged texts produced by the annotator or the system. The validation tools are available to MUC-6 participants in the file called muc6-sgml-tools. Annotators are using a software tool provided for MUC-6 by SRA Corporation to assist in generating the answer keys to be used for system training and testing.
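The offset rule is easier to see in code. The sketch below parses markup of the form described above and recovers, for each tag, the character offsets into the untagged text. It is a regex-based illustration, not the MUC-6 validation or scoring tooling: the sentence around the `Taga Co.` element is invented, and the pattern assumes a single TYPE attribute and no nested tags.

```python
import re

# Only the ENAMEX element comes from the slide; the rest of the sentence is made up.
TAGGED = 'Shares of <ENAMEX TYPE="ORGANIZATION">Taga Co.</ENAMEX> rose.'

# Assumes exactly one TYPE attribute and no nesting.
TAG = re.compile(r'<(ENAMEX|TIMEX|NUMEX) TYPE="([A-Z]+)">(.*?)</\1>')


def parse_markup(tagged):
    """Return (plain_text, annotations); offsets index into plain_text."""
    parts, annotations = [], []
    pos = 0     # read position in the tagged string
    offset = 0  # length of plain text emitted so far
    for m in TAG.finditer(tagged):
        before = tagged[pos:m.start()]
        parts.append(before)
        offset += len(before)
        text = m.group(3)
        annotations.append((m.group(1), m.group(2), offset, offset + len(text), text))
        parts.append(text)
        offset += len(text)
        pos = m.end()
    parts.append(tagged[pos:])
    return "".join(parts), annotations


plain, annotations = parse_markup(TAGGED)
print(plain)
for element, etype, start, end, text in annotations:
    print(element, etype, start, end, repr(plain[start:end]))
```

Because only tags are inserted, stripping them gives back the original text exactly, and every offset lines up with the source document. Any extra whitespace added during tagging would shift those offsets.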

MUC-6

• Named Entities (ENAMEX tag element)

  – This subtask is limited to proper names, acronyms, and perhaps miscellaneous other unique identifiers, which are categorized via the TYPE attribute as follows:

  – ORGANIZATION: named corporate, governmental, or other organizational entity

  – PERSON: named person or family

  – LOCATION: name of politically or geographically defined location (cities, provinces, countries, international regions, bodies of water, mountains, etc.)

MUC-6

• Temporal Expressions (TIMEX tag element)

  – This subtask is for "absolute" temporal expressions only; explanation is provided in appendix B. The tagged tokens are categorized via the TYPE attribute as follows:

  – DATE: complete or partial date expression

  – TIME: complete or partial expression of time of day

MUC-6

• Number Expressions (NUMEX tag element)

  – This subtask is for two useful types of numeric expressions, monetary expressions and percentages. The numbers may be expressed in either numeric or alphabetic form. The task covers the complete expression, which is categorized via the TYPE attribute as follows:

  – MONEY: monetary expression

  – PERCENT: percentage
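As a rough illustration of the NUMEX subtask, the sketch below wraps monetary expressions and percentages in NUMEX tags. It covers only the numeric forms (dollar amounts and `%` or "percent"); numbers in alphabetic form, which the task also includes, are out of scope here. The patterns and the function name `tag_numex` are assumptions for this example and do not reproduce the MUC-6 guidelines.

```python
import re

MONEY = r"\$\d+(?:,\d{3})*(?:\.\d+)?(?:\s(?:million|billion))?"
PERCENT = r"\d+(?:\.\d+)?\s?(?:%|percent)"

# Named groups double as the TYPE attribute value.
NUMEX = re.compile(f"(?P<MONEY>{MONEY})|(?P<PERCENT>{PERCENT})")


def tag_numex(text):
    """Insert NUMEX tags around the complete expression, adding nothing else."""
    def wrap(match):
        return f'<NUMEX TYPE="{match.lastgroup}">{match.group(0)}</NUMEX>'
    return NUMEX.sub(wrap, text)


print(tag_numex("Sales rose 15% to $1,199.99."))
```

The tagger inserts only the tags themselves and leaves whitespace untouched, in line with the markup rules described earlier.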

Excerpt from MUC-6 dataset

Filled Template