Computer Curation of Patents & Scientific Literature

1 1

Source – J Kreulen

2 2

Computer Curation of Patents & Scientific Literature

Information Analytics

Transforming Information Into Value

Stephen K. Boyer, Ph.D. [email protected]

408-858-5544

Moneyball

Medicine

3 3

The Problem

All content and no discovery?

4 4

The Question

Can we use computers to “read” documents, identify critical entities, and make meaningful associations that can help us with our work?

5 5

For example: patents and scientific papers contain molecular data in many different forms:

- As text: chemical names found in the text of documents
- As bitmap images: pictures of chemicals found in the document images

6 6

Chemical & biological information derived from text analytics

Massive computing environment, equivalent to 240K simultaneous Google searches:

- Find and compute the 3D structures
- Identify every protein
- Identify every disease
- Identify every Medline MeSH code
- Identify occurrence of every biomarker

Data warehouse: compute properties and find relationships

7 7

Computer Curation Process Overview

[Figure: process diagram. The labels recovered from the diagram are grouped below.]

Data Sources:
- U.S. Patents (1976-2009)
- U.S. Pre-Grants (All)
- PCT & EPO Apps
- Medline Abstracts (>18 M)
- Selected Internet Content
- In-House Content

Annotation Factory:
- Parse & Extract data
- Annotator 1
- Annotator 2

Computational Analytics (Semantic Associations):
- eClassifier & Other Data Associations

Databases:
- IP Database (e.g. DB2)
- Database + computed Meta Data
- ChemVerse db
- ADU*

User Applications:
- View selected Documents & Reports
- Knime or Pipeline Pilot
- BIW
- SIMPLE
- ChemAxon Search
- Cognos/DDQB/Other Apps

ChemVerse Services Hosted at IBM Almaden

* ADU = Automated Data Update

8 8

Current Activities …

9 9

Current activity: “in line” entity tagging & classification

[Figure: Text is converted to Annotated Text. Legend: Chemical, Target, Disease, Assay data. The annotated text feeds a dB and a SOLR index.]

Steps:
- Identify chemical names
- Convert chemical names to chemical structures (SMILES), then convert these into InChI & InChIKeys
- Replace all chemical names with the term “inchikey_” plus the unique InChIKey for that chemical
- Re-index the InChIKeys with SOLR

Example:
- aspirin = CC(=O)OC1=CC=CC=C1C(=O)O (SMILES)
- aspirin = BSYNRYMUTXBXSQ-UHFFFAOYSA-N (InChIKey)

http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=search&db=pccompound&term=%22BSYNRYMUTXBXSQ-UHFFFAOYSA-N%22%5BInChIKey%5D
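The tagging step on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary `NAME_TO_INCHIKEY`, the function names, and the exact `inchikey_<KEY>` token format are hypothetical, and a real pipeline would derive the keys by converting names to SMILES and then to InChI/InChIKey with a cheminformatics toolkit. Only the aspirin entry and the PubChem URL form come from the slide.

```python
import re
from urllib.parse import quote

# Hypothetical lookup table (keys must be lowercase). Only the aspirin entry
# is taken from the slide; a real system would compute keys from structures.
NAME_TO_INCHIKEY = {
    "aspirin": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
}


def tag_chemicals(text, name_to_key=NAME_TO_INCHIKEY):
    """Replace each known chemical name with an 'inchikey_<KEY>' token."""
    if not name_to_key:
        return text
    # Longest names first so a longer name wins over a name it contains.
    names = sorted(name_to_key, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(
        lambda m: "inchikey_" + name_to_key[m.group(1).lower()], text
    )


def pubchem_search_url(inchikey):
    """Build a PubChem Compound search URL in the form shown on the slide."""
    term = quote(f'"{inchikey}"[InChIKey]')
    return (
        "http://www.ncbi.nlm.nih.gov/sites/entrez"
        "?cmd=search&db=pccompound&term=" + term
    )


tagged = tag_chemicals("Patients received Aspirin daily.")
print(tagged)
print(pubchem_search_url("BSYNRYMUTXBXSQ-UHFFFAOYSA-N"))
```

Because every synonym of a compound collapses to the same token, a text index built over the tagged text can retrieve documents by structure rather than by name.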





10 10

Current activity: “in line” entity tagging & classification

In-line text tagging (classification) coupled with computational & experimental data

[Figure: annotated text linked to a dB. Legend: Chemical (tagged in line as “inchikey BSYNRYMUTXBXSQ-UHFFFAOYSA-N”), Target (= [target_gene name]), Disease, Assay data. A chemical compound is shown connected to Targets 1 through 5.]

Compound-target associations:
- Known from the literature
- Known from the SEA computations
- Known from NIH experimental HTS (NIH HTS assay data)
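One way to picture how the three kinds of compound-target evidence could be combined is the sketch below. It is a minimal illustration, not the system's implementation: the compound and target names are placeholders, and which source reports which target is invented purely to exercise the code.

```python
from collections import defaultdict

# Placeholder data only. The pairing of targets with sources is an assumption
# made for illustration and is not taken from the slide.
LITERATURE = {("compound_1", "Target 1"), ("compound_1", "Target 2"),
              ("compound_1", "Target 3")}
SEA_PREDICTED = {("compound_1", "Target 2"), ("compound_1", "Target 4")}
NIH_HTS = {("compound_1", "Target 2"), ("compound_1", "Target 5")}


def merge_associations(**sources):
    """Map each (compound, target) pair to the sorted sources that report it."""
    evidence = defaultdict(set)
    for source_name, pairs in sources.items():
        for pair in pairs:
            evidence[pair].add(source_name)
    return {pair: sorted(names) for pair, names in evidence.items()}


merged = merge_associations(
    literature=LITERATURE, sea=SEA_PREDICTED, hts=NIH_HTS
)
for (compound, target), names in sorted(merged.items()):
    print(compound, target, names)
```

Keeping the source names alongside each pair lets a user separate associations reported in text from those that are only predicted or only measured.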

11 11

Medline co-occurrence of statin structures vs. MeSH Signs & Symptoms (C23)

Chemical Structures vs. Signs and Symptoms
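A co-occurrence table of this kind can be computed by counting, for each annotated abstract, every pairing of a structure token with a MeSH term. The sketch below uses toy documents with placeholder identifiers; the field names `structures` and `mesh` are assumptions, and no real Medline counts are implied.

```python
from collections import Counter
from itertools import product

# Toy annotated documents. Every identifier here is a placeholder.
DOCS = [
    {"structures": {"inchikey_A"}, "mesh": {"symptom_1", "symptom_2"}},
    {"structures": {"inchikey_A", "inchikey_B"}, "mesh": {"symptom_1"}},
    {"structures": {"inchikey_B"}, "mesh": {"symptom_3"}},
]


def cooccurrence(docs):
    """Count the documents in which each (structure, MeSH term) pair co-occurs."""
    counts = Counter()
    for doc in docs:
        # Sets guarantee each pair is counted at most once per document.
        for pair in product(doc["structures"], doc["mesh"]):
            counts[pair] += 1
    return counts


table = cooccurrence(DOCS)
for (structure, term), n in sorted(table.items()):
    print(structure, term, n)
```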

12 12

Search: Chemical Search using ChemAxon w/ DB2

Proximal Search: Nearest Neighbor Search
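The slides do not say how the nearest neighbor search is scored. A common choice for chemical similarity is the Tanimoto coefficient over structural fingerprints, so the sketch below assumes that metric and represents each fingerprint as a set of "on" bit positions. The library contents and function names are hypothetical, and this is not ChemAxon's API.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of fingerprint bits."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def nearest_neighbors(query, library, k=2):
    """Return the k library entries most similar to the query fingerprint."""
    scored = [(tanimoto(query, fp), name) for name, fp in library.items()]
    # Highest similarity first; ties broken by name for a stable order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(name, score) for score, name in scored[:k]]


# Placeholder fingerprints: each set holds the positions of the "on" bits.
LIBRARY = {
    "mol_1": frozenset({1, 2, 3, 4}),
    "mol_2": frozenset({1, 2, 5}),
    "mol_3": frozenset({7, 8}),
}

hits = nearest_neighbors(frozenset({1, 2, 3}), LIBRARY, k=2)
print(hits)
```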

13 13

BioTerm Analysis

Clustering Claims Originality

Discovery

14 14

Landscape Analysis

Visualization

Networks

16 16

backup

17 17

Molecules have various attributes (from different sources)

- IP Attributes (Orange Book): Legal status, Assignee, Foreign filings, Expiration Date
- Spectral Attributes (NIST dB): IR spectra, NMR, Mass Spec, etc.
- Physical Attributes (Computational): MW, MF, Bp, Mp, etc.
- Drug Attributes and Screening Attributes:
  - Drugbank: Activity, Pharm data, Protein Binding, half life
  - WomBat: Activity, Pharm data, Target data for SRA, Literature references
  - PubChem: Activity, Pharm data, Target data for SRA, Literature references
- Toxicity Attributes (EPA databases): Toxicity studies, LD50, Literature references

Molecular Attributes: attributes derived from different sources

18 18

Cross mapping attributes from different sources

Semantic association of attributes

[Figure: “The Tank”. Internet data sources (Orange Book, PubChem, Drugbank, FDA, Others) each contribute attributes under their own schema (Data Source 1 / Schema 1, Data Source 2 / Schema 2). Databases shown: Structure (trusted database), Database A (Medline), Database C (Tox).]

Example attributes: SMILES, InChi_id, Binding site, Code name, Target Activity, app_id, Trade Name, Location, Geo, Country, Pathway, Tox, IP status, Certifications, Licensing

Queries run in both directions:
- Input list of attributes, output file list of SMILES
- Input list of SMILES, output file list of attributes
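The two query directions on this slide (SMILES in, attributes out; attributes in, SMILES out) can be sketched with two small sources that use different schemas. Everything in the sketch is hypothetical: the attribute names and values are placeholders, and keying on raw SMILES is a simplification, since one molecule can be written as many different SMILES strings and a canonical identifier such as the InChIKey would be a safer join key.

```python
ASPIRIN = "CC(=O)OC1=CC=CC=C1C(=O)O"
ETHANOL = "CCO"

# Two placeholder sources with different schemas, both keyed by SMILES.
SOURCE_1 = {
    ASPIRIN: {"trade_name": "trade_name_1", "ip_status": "status_A"},
    ETHANOL: {"ip_status": "status_B"},
}
SOURCE_2 = {
    ASPIRIN: {"tox": "tox_record_1"},
}


def merge_sources(*sources):
    """Combine per-source attribute records into one record per structure."""
    merged = {}
    for source in sources:
        for key, attrs in source.items():
            merged.setdefault(key, {}).update(attrs)
    return merged


def attributes_for(smiles_list, merged):
    """Input list of SMILES, output attributes for each."""
    return {s: merged.get(s, {}) for s in smiles_list}


def smiles_for(wanted, merged):
    """Input attribute values, output the SMILES whose records match all of them."""
    return sorted(
        s for s, attrs in merged.items()
        if all(attrs.get(k) == v for k, v in wanted.items())
    )


merged = merge_sources(SOURCE_1, SOURCE_2)
print(attributes_for([ASPIRIN], merged))
print(smiles_for({"ip_status": "status_B"}, merged))
```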

19 19

Watson

20 20

IBM’s Massively Parallel Probabilistic Architecture

[Figure: DeepQA pipeline. Question, then Question/Topic Analysis, then Query Decomposition, then Hypothesis Generation (Primary Search over Answer Sources, Candidate Answer Generation), then Soft Filtering, then Hypothesis & Evidence Scoring (Supporting Evidence Retrieval over Evidence Sources, Deep Evidence Scoring, Answer Scoring), then Synthesis, then Final Merging & Ranking with Trained Models, then Answer and Confidence. The Hypothesis Generation, Soft Filtering, and Hypothesis & Evidence Scoring stages are drawn several times in parallel.]

Watson generates and scores many hypotheses using an extensible collection of Natural Language Processing, Machine Learning and Reasoning Algorithms. These gather and weigh evidence over both unstructured and structured content to determine the answer with the best confidence.

Source – J Kreulen
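The generate, filter, score, and rank flow in the diagram can be illustrated with a toy example. This is not Watson's implementation: the two scoring functions, their fixed weights (standing in for trained models), and the evidence passages are all invented for illustration.

```python
def overlap_score(question, passage):
    """Fraction of question words that also appear in the passage."""
    q = set(question.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0


def brevity_score(question, passage):
    """Trivial second scorer: shorter passages score slightly higher."""
    return 1.0 / (1 + len(passage.split()))


# Fixed weights stand in for the trained models that merge evidence scores.
SCORERS = [(overlap_score, 0.8), (brevity_score, 0.2)]


def answer(question, evidence, keep=2):
    """Return (candidate, confidence) pairs, best first."""
    # Hypothesis generation: every candidate with an evidence passage.
    candidates = list(evidence)
    # Soft filtering: a cheap score prunes the list before deeper scoring.
    candidates.sort(
        key=lambda c: overlap_score(question, evidence[c]), reverse=True
    )
    candidates = candidates[:keep]
    # Evidence scoring and final merging: weighted sum over all scorers.
    merged = {
        c: sum(w * scorer(question, evidence[c]) for scorer, w in SCORERS)
        for c in candidates
    }
    total = sum(merged.values()) or 1.0
    return sorted(
        ((c, score / total) for c, score in merged.items()),
        key=lambda item: item[1],
        reverse=True,
    )


EVIDENCE = {
    "compound A": "compound A inhibits enzyme X",
    "compound B": "compound B binds receptor Y",
    "compound C": "weather report",
}
ranked = answer("which compound inhibits enzyme X", EVIDENCE)
print(ranked)
```

The point of the sketch is the shape of the computation: many candidates are kept alive, each is scored by several independent scorers, and the final answer carries a confidence rather than a bare yes or no.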

21 21

DeepQA Application (Java/C++)

Apache Hadoop + Apache UIMA

SUSE Linux Enterprise Server 11

Watson Infrastructure

• 90 Power 750 Servers

• Each server: 3.5 GHz POWER7 8-core processor with 4 threads/core

• Total: 2880 POWER7 cores with 16 TB RAM

• Processing speed: 500 GB/sec; 80 TeraFLOPS

• 94th on Top 500 Supercomputers

• Note: This hardware is for Jeopardy. Any other application of Watson will require appropriate sizing and optimization for purpose.
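The infrastructure figures can be cross-checked with a little arithmetic. The slide does not state how many processors each server holds; the value of four below is derived from the stated totals and is not quoted from the slide. Memory per server assumes 1 TB = 1024 GB.

```python
# Figures taken from the slide.
SERVERS = 90
TOTAL_CORES = 2880
CORES_PER_PROCESSOR = 8
THREADS_PER_CORE = 4
TOTAL_RAM_TB = 16

# Derived values (not stated on the slide).
cores_per_server = TOTAL_CORES // SERVERS
processors_per_server = cores_per_server // CORES_PER_PROCESSOR
hardware_threads = TOTAL_CORES * THREADS_PER_CORE
ram_per_server_gb = TOTAL_RAM_TB * 1024 / SERVERS  # assumes 1 TB = 1024 GB

print(cores_per_server, processors_per_server, hardware_threads)
print(round(ram_per_server_gb))
```

A sizing exercise for any other application would start from the same kind of breakdown, as the note on the slide suggests.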

Technical issues to consider when applying QA systems like Watson

Nature of Domain: Open vs. Closed

Closed domain implies that all knowledge is contained within a specific domain characterized by ontologies, and there is no need to go outside the domain. Jeopardy is an open-domain example, where the questions draw on general knowledge.

Knowledge/Data Sources: Availability

QA systems are natural language search engines. Watson goes beyond NL search. If knowledge sources are incomplete, unavailable, insufficient or inadequate, then it is not possible for the system to provide an answer. In some cases one would need to envisage interactive QA that requires human interaction to guide the search. Another very important consideration is the availability of sufficient sample data for training (i.e. a training corpus).

Need for multi-modality

Is there a need for transcription from speech to text before a question is answered? This would require integration of speech-to-text capabilities that are not really ready for real-time applications.

Latency

Watson is capable of processing 500 GB of information per second with a 3 sec response to questions, and it kept most of its knowledge sources in memory (as opposed to on disk) for speed. What is the latency requirement for the application?

Multi-Lingual or Cross-Lingual Support

Watson can support only English at this time; with language-specific parsers, other languages can be supported. If knowledge sources or QA are required in multiple languages, then the application would not be a good candidate. Additionally, if cultural context has to be accommodated in the answer, then it would not be prudent to deploy QA systems that interact directly with users.

Question Type

Decomposition and classification of the question is critical to how QA systems work. The bulk of the question types in Jeopardy were factoid questions. Watson did not include 2 question categories: one is audio/video questions that require looking at a video to answer, and the other is questions that require special instructions (e.g. verbal instructions to explain a question).

Answer Types

Watson is not designed to curate a task-oriented system. It can handle temporal and geo-spatial reasoning in its answers. As it stands, it cannot handle business-process reasoning (to do task B, tasks A and C must be completed, etc.).

22 22

I would like to acknowledge the IBM Almaden Research team

Jeff Kreulen

Ying Chen

Scott Spangler

Alfredo Alba

Tom Griffin

Eric Louie

Su Yan

Issic Cheng

Prasad Ramachandran

Bin He

Ana Lelescu

Brian Langston

Qi He

Linda Kato

Brad Wade

John Colino

Meenakshi Nagarajan

Timothy J Bethea

German Attanasio

Cassidy Kelly

Jack Labrie

Fredrick Eduardo

Ionia Stanoi

+ a host of folks from IBM China Labs
