Flexible Text Mining using Interactive Information Extraction David Milward...

Flexible Text Mining using Interactive Information Extraction

David Milward

david.milward@linguamatics.com

Text mining vs. Data Mining

• Text mining

– getting nuggets of information from text

– extracting relationships

– structured results to feed into data mining, visualisation or databases

company    activity    company
Sanofi     bid         Aventis
Roche      partner     Antisoma

• Data mining

– getting new knowledge from databases

– suggesting new relationships, trends, patterns

Text Data Mining

• Emphasizes finding new knowledge from text

• Typically knowledge that is implicit within multiple documents

What is the relationship to IR?

• IR finds the most relevant documents

• Text mining finds information from within documents, or across documents

– What drugs are used for psoriasis treatment?

– Who is associated directly or indirectly with the Board of Exxon?

• There is overlap …

– we often search to answer a question, not to find a document

Traditional Information Extraction

• Uses natural language processing to distinguish

– Sanofi bid for Aventis

– Aventis bid for Sanofi

• Provides structured results for easy review and analysis

• Uses normalised terminology to allow integration with databases, e.g.

– Preferred term: Sanofi

– Synonyms: Sanofi Pasteur, Sanofi Synthelabo, Sanofi Synthélabo …

• But:

– typically limited to patterns on a single sentence

– constructing, testing and running queries can take days

• Appropriate if you always ask the same question, e.g. you want to run a query over a newsfeed every night

company    activity    company
Sanofi     bid         Aventis
Roche      partner     Antisoma
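The points on this slide (direction of a relationship, normalised terminology, structured rows) can be illustrated with a minimal Python sketch. It is not how any particular product works: a regular expression stands in for real natural language processing, and the `TERMINOLOGY` and `ACTIVITIES` tables, the surface phrases and the function name `extract_relations` are all assumptions made for illustration.

```python
import re

# Hypothetical terminology: preferred term -> synonyms (taken from the slide's example).
TERMINOLOGY = {
    "Sanofi": ["Sanofi", "Sanofi Pasteur", "Sanofi Synthelabo", "Sanofi Synthélabo"],
    "Aventis": ["Aventis"],
    "Roche": ["Roche"],
    "Antisoma": ["Antisoma"],
}

# Hypothetical activity vocabulary: surface phrase -> normalised activity.
ACTIVITIES = {"bid for": "bid", "partners with": "partner"}

SYNONYM_TO_PREFERRED = {
    synonym: preferred
    for preferred, synonyms in TERMINOLOGY.items()
    for synonym in synonyms
}


def _alternation(phrases):
    # Longest phrases first, so "Sanofi Synthelabo" is tried before "Sanofi".
    ordered = sorted(phrases, key=len, reverse=True)
    return "|".join(re.escape(p) for p in ordered)


COMPANY = _alternation(SYNONYM_TO_PREFERRED)
PATTERN = re.compile(
    rf"\b(?P<subj>{COMPANY})\s+(?P<act>{_alternation(ACTIVITIES)})\s+(?P<obj>{COMPANY})\b"
)


def extract_relations(sentence):
    """Return (company, activity, company) rows found in one sentence."""
    return [
        (
            SYNONYM_TO_PREFERRED[m.group("subj")],  # normalise to preferred term
            ACTIVITIES[m.group("act")],
            SYNONYM_TO_PREFERRED[m.group("obj")],
        )
        for m in PATTERN.finditer(sentence)
    ]
```

Because subject and object are captured separately, "Sanofi bid for Aventis" and "Aventis bid for Sanofi" yield different rows, and any listed synonym is reported under its preferred term.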

I2E: Interactive Information Extraction

• A new concept

• Encompasses

– keywords → documents

– patterns → relationships (structured output)

• Queries ranging from:

– General Motors

– General Motors & acquisition in the same document

– Automotive companies & acquisitions in the same sentence

– What companies is General Motors associated with?

• Not limited to patterns within sentences, e.g.

– Merger and acquisition activity in documents mentioning Japan

• Fast, scalable, versatile
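The difference between "in the same document" and "in the same sentence" in the queries above can be sketched as follows. This is a rough illustration only: the sentence splitter is deliberately naive, and the function names `sentences`, `mentions` and `cooccur` are hypothetical, not part of I2E.

```python
import re


def sentences(document):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", document.strip()) if s]


def mentions(text, term):
    """Case-insensitive whole-phrase match."""
    return re.search(rf"\b{re.escape(term)}\b", text, re.IGNORECASE) is not None


def cooccur(document, terms, scope="document"):
    """True if all terms occur together within one unit of the chosen scope."""
    if scope == "document":
        units = [document]
    elif scope == "sentence":
        units = sentences(document)
    else:
        raise ValueError(f"unknown scope: {scope!r}")
    return any(all(mentions(unit, term) for term in terms) for unit in units)
```

A document that mentions General Motors in one sentence and an acquisition in another matches at document scope but not at sentence scope, which is the trade-off between broad and precise queries that the slide describes.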

[Diagram labels: I2E; Information Extraction; NLP; Taxonomies/Ontologies; Text Search; Structured Output]

Linguistic Processing

Example sentences:

We find that p42mapk phosphorylates c-Myb on serine and threonine .

Purified recombinant p42 MAPK was found to phosphorylate Wee1 .

• Groups words into meaningful units

• Morphology allows search for different forms of words

[Figure annotations: sentences; morphology (different forms); noun phrases match entities; verb groups match actions]
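As a rough illustration of how morphology lets one query match different forms of a word, the sketch below strips a few English verb endings before comparing words. It is an assumption-laden toy: a real system would use a proper morphological analyser, and `SUFFIXES`, `crude_stem` and `find_forms` are hypothetical names.

```python
import re

# Hypothetical, very small list of inflectional endings, tried in this order.
SUFFIXES = ("ing", "ed", "es", "e", "s")


def crude_stem(word):
    """Lower-case a word and strip one ending, keeping at least 3 characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def find_forms(text, query):
    """Return the tokens in text that share a crude stem with the query word."""
    target = crude_stem(query)
    tokens = re.findall(r"[A-Za-z0-9-]+", text)
    return [tok for tok in tokens if crude_stem(tok) == target]
```

With this, a search for "phosphorylate" finds both "phosphorylates" in the first example sentence and "phosphorylate" in the second, since both reduce to the same stem.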

Monitoring Merger and Acquisition Activity

Company Positions

Using I2E in the Life Sciences

• Good resources

– Scientific abstracts are readily available in XML

– Large number of existing taxonomies/terminologies

• Very large scale

– 16 million abstracts relevant to life sciences. Growing ???? a year

– Large numbers of internal reports and full-text articles

– Internal documents often > 1000 pages, may be PDF images

– Taxonomies/terminologies are large, often deeply structured e.g.

• 350K nodes, ??? synonyms

– Still need to augment terminology for specific areas

• Relatively large scale

– 17 million abstracts

– Large numbers of internal reports and full-text articles

– Internal documents can be >1000 pages, may be PDF images

– Taxonomies/terminologies are large, often deeply structured: > 100K concepts, > 400K synonyms

– Still need to augment terminology for specific areas

Examples of Pharma Questions

• R&D

– Which proteins interact with metabolite X?

– What are the reaction kinetics for canonical pathway Y?

– What attributes are common to sets of biomarker genes?

– What are the known associations between expressed genes and environmental factors?

– What dosages of compound B cause adverse reactions?

• Competitive Intelligence

– Which companies are working on technology C?

– What compounds are available for in-licensing in a disease area?

– Which research groups are my competitors collaborating with?

Linking Drugs to Adverse Events

Measurements

• Extraction of numerical parameters

– e.g. amounts, dosages, concentrations
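A minimal sketch of extracting numerical parameters such as dosages might look like the following. The unit list is a small hypothetical sample, and a regular expression is used here only as a stand-in for whatever matching a real system performs; `UNITS`, `MEASUREMENT` and `extract_measurements` are illustrative names.

```python
import re

# Hypothetical sample of units; a real terminology would be far larger.
UNITS = ("mg/kg", "mg", "µg", "ug", "g", "ml", "nM", "µM", "mM")

MEASUREMENT = re.compile(
    r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>"
    # Longest units first, so "mg/kg" is tried before "mg".
    + "|".join(re.escape(u) for u in sorted(UNITS, key=len, reverse=True))
    # Do not stop in the middle of a longer word or compound unit.
    + r")(?![A-Za-z/])"
)


def extract_measurements(text):
    """Return (value, unit) pairs found in the text, in order of appearance."""
    return [
        (float(m.group("value")), m.group("unit"))
        for m in MEASUREMENT.finditer(text)
    ]
```

Returning the value as a number and the unit as a separate field is what makes the results usable for later filtering, for example when asking which dosages are linked to adverse reactions.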

Benefits of Flexible Text Mining

• The ideal final query may use

– co-occurrence of terms within a document or sentence

– a precise linguistic pattern

– a mixture of both

• It depends on

– the nature of the task

– the availability of terminologies

– the kind of documents (news vs. science, abstract vs. full text)

– the time available to check results

• Flexibility to mix different techniques is also critical for fast development of queries

– e.g. start with broad queries to explore the “results space”, then home in

I2E: Better Results, Faster

• Fast query creation

• Fast return of results

• Fast review and analysis

[Figure labels. Genes: BCL2, CDKN1A, DMPK, EPHB2, INS, MAP2K1, MAPK1, MAPK3, MAPK7, RB1, STK3, VIM. Relationship verbs: activate, co-express, inactivate, induce, inhibit, interact, mediate, phosphorylate, regulate, suppress. Label: [c] Reln]

Impact of I2E

• Significant reduction in time spent searching/reading the literature

– weeks reduced to days or hours

• Structure the unstructured to

– provide systematic and comprehensive review of information content

– enable integration with traditional structured data

– allow complex analysis of literature-derived information

– generate hypotheses, gain insight
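One simple form of analysis over structured output is to count how often each relationship is reported for each target. The sketch below assumes extracted results arrive as (source, relation, target) triples; that shape and the function name `relation_matrix` are assumptions for illustration, and the placeholder rows in the usage note are invented, not results from the talk.

```python
from collections import Counter


def relation_matrix(triples):
    """Count (relation, target) pairs across extracted triples.

    Returns (relations, targets, matrix), where matrix[i][j] is the number
    of triples with relations[i] and targets[j].
    """
    counts = Counter((relation, target) for _source, relation, target in triples)
    relations = sorted({relation for relation, _ in counts})
    targets = sorted({target for _, target in counts})
    # Counter returns 0 for pairs that were never seen.
    matrix = [[counts[(r, t)] for t in targets] for r in relations]
    return relations, targets, matrix
```

For example, the made-up rows `[("X", "activate", "GENE_B"), ("X", "inhibit", "GENE_A"), ("X", "activate", "GENE_B")]` produce relations `["activate", "inhibit"]`, targets `["GENE_A", "GENE_B"]` and the matrix `[[0, 2], [1, 0]]`, which can then be passed to a visualisation or data mining tool.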
