Computer Curation of Patents & Scientific Literature
1 1
Source – J Kreulen
2 2
Computer Curation of Patents & Scientific Literature
Information Analytics
Transforming Information Into Value
Stephen K. Boyer, Ph.D. [email protected]
408-858-5544
Moneyball
Medicine
3 3
The Problem
All content and no discovery?
4 4
The Question
Can we use computers to “read” documents, identify critical entities, and make meaningful associations that can help us with our work?
5 5
For example: patents and scientific papers contain molecular data in many different forms:
- As text: chemical names found in the text of documents
- As bitmap images: pictures of chemicals found in the document
6 6
Massive Computing Environment
Chemical & biological information derived from text analytics
- Find and compute the 3D structures
- Identify every protein
- Identify every disease
- Identify every Medline MeSH code
- Identify the occurrence of every biomarker
- Equivalent to 240K simultaneous Google searches
Data warehouse: compute properties and find relationships
7 7
Computer Curation Process Overview
[Diagram labels, grouped by component]
Data Sources:
- U.S. Patents (1976-2009)
- U.S. Pre-Grants (All)
- PCT & EPO Apps
- Medline Abstracts (>18 M)
- Selected Internet Content
- In-House Content
Annotation Factory:
- Parse & Extract data
- Annotator 1
- Annotator 2
- Classifier & Other Data Associations
- Computational Analytics (Semantic Associations)
Database:
- IP Database (e.g. DB2) + computed metadata
- ADU*
- ChemVerse db
User Applications:
- View selected Documents & Reports
- Knime or Pipeline Pilot
- BIW
- SIMPLE
- ChemAxon Search
- Cognos/DDQB/Other Apps
ChemVerse Services Hosted at IBM Almaden
* ADU = Automated Data Update
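As a rough illustration of the annotation step in this overview, the sketch below chains two dictionary-based annotators over a piece of text and collects typed annotations. It is only a minimal sketch: the `Annotation` record, the `dictionary_annotator` helper, and the term lists are hypothetical and are not taken from the system described in the slides.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Annotation:
    """One tagged span of text (hypothetical record layout)."""
    start: int
    end: int
    text: str
    entity_type: str


Annotator = Callable[[str], List[Annotation]]


def dictionary_annotator(terms, entity_type) -> Annotator:
    """Build an annotator that tags whole-word, case-insensitive matches of `terms`."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b", re.IGNORECASE
    )

    def annotate(text: str) -> List[Annotation]:
        return [
            Annotation(m.start(), m.end(), m.group(0), entity_type)
            for m in pattern.finditer(text)
        ]

    return annotate


def run_annotators(text: str, annotators: List[Annotator]) -> List[Annotation]:
    """Run every annotator over the text and return annotations in reading order."""
    annotations: List[Annotation] = []
    for annotate in annotators:
        annotations.extend(annotate(text))
    return sorted(annotations, key=lambda a: a.start)


# Illustrative term lists only; a real system would use curated dictionaries.
chemical_annotator = dictionary_annotator(["aspirin"], "Chemical")
disease_annotator = dictionary_annotator(["fever"], "Disease")

result = run_annotators("Aspirin reduces fever.", [chemical_annotator, disease_annotator])
for annotation in result:
    print(annotation)
```

Keeping each annotator as an independent callable means new entity types can be added without touching the others, which is one plausible reading of the "Annotator 1", "Annotator 2" boxes in the diagram.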
8 8
Current Activities …
9 9
Current activity: “in line” entity tagging & classification
[Diagram: Text converted to Annotated Text, stored in a dB with a SOLR index]
Legend (entity types): Chemical, Target, Disease, Assay data
- Identify chemical names
- Convert chemical names to chemical structures (SMILES), then convert these into InChI and InChIKeys
- Replace all chemical names with the term “inchikey_” followed by the unique InChIKey for that chemical
- Re-index InChIKeys with SOLR
Example: aspirin = CC(=O)OC1=CC=CC=C1C(=O)O (SMILES); aspirin = BSYNRYMUTXBXSQ-UHFFFAOYSA-N (InChIKey)
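The in-line tagging idea on this slide can be sketched in a few lines. This is a minimal sketch under stated assumptions: the name-to-structure conversion is replaced by a small hypothetical lookup table (`NAME_TO_INCHIKEY`), and the function names are illustrative. Only the aspirin InChIKey comes from the slide.

```python
import re

# Hypothetical lookup standing in for the name -> SMILES -> InChI -> InChIKey
# conversion; only the aspirin key below appears in the slides.
NAME_TO_INCHIKEY = {
    "aspirin": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
    "acetylsalicylic acid": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
}


def build_pattern(names):
    """Compile one whole-word, case-insensitive pattern; longest names first."""
    ordered = sorted(names, key=len, reverse=True)
    return re.compile(
        r"\b(" + "|".join(re.escape(n) for n in ordered) + r")\b", re.IGNORECASE
    )


PATTERN = build_pattern(NAME_TO_INCHIKEY)


def tag_chemicals(text: str) -> str:
    """Replace each known chemical name with an 'inchikey_<KEY>' token."""

    def replace(match):
        return "inchikey_" + NAME_TO_INCHIKEY[match.group(1).lower()]

    return PATTERN.sub(replace, text)


print(tag_chemicals("Aspirin reduces fever."))
```

Because every synonym collapses to the same token, a full-text index over the tagged text can retrieve all mentions of a compound with a single query term, whichever name the author used.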
10 10
Current activity: “in line” entity tagging & classification
In-line text tagging (classification) coupled with computational & experimental data
Legend (entity types in the tagged text):
- Chemical, tagged as “inchikey_BSYNRYMUTXBXSQ-UHFFFAOYSA-N”
- Target, tagged as [target_gene name]
- Disease
- Assay data
[Diagram: a compound linked to Target 1, Target 2, Target 3, Target 4 and Target 5, with the associations stored in a dB]
Compound-target associations come from three sources:
- Known from the literature
- Known from the SEA computations
- Known from NIH experimental HTS (NIH HTS assay data)
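One way to picture the coupling of literature, SEA, and HTS associations is a merged map keyed by InChIKey that remembers which source supports each compound-target pair. The sketch below assumes that layout. The target assignments per source are invented placeholders for illustration, not data from the slides.

```python
from collections import defaultdict

ASPIRIN = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"


def merge_associations(*sources):
    """Merge (source_name, [(inchikey, target), ...]) inputs into
    {inchikey: {target: set_of_source_names}}."""
    merged = defaultdict(lambda: defaultdict(set))
    for source_name, pairs in sources:
        for inchikey, target in pairs:
            merged[inchikey][target].add(source_name)
    return merged


def targets_with_support(merged, inchikey, min_sources=2):
    """Targets for a compound that at least `min_sources` sources agree on."""
    return sorted(
        target
        for target, names in merged[inchikey].items()
        if len(names) >= min_sources
    )


# Placeholder associations (hypothetical, for illustration only).
literature = ("literature", [(ASPIRIN, "Target 1"), (ASPIRIN, "Target 2"), (ASPIRIN, "Target 3")])
sea = ("SEA", [(ASPIRIN, "Target 1"), (ASPIRIN, "Target 4")])
hts = ("NIH HTS", [(ASPIRIN, "Target 2"), (ASPIRIN, "Target 5")])

merged = merge_associations(literature, sea, hts)
print(targets_with_support(merged, ASPIRIN))
```

Keeping the provenance as a set makes it easy to separate targets that are only predicted computationally from those also seen in the literature or in assay data.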
11 11
Medline co-occurrence of statin structures vs. MeSH Signs & Symptoms (C23)
Chemical Structures vs. Signs and Symptoms
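A co-occurrence table of this kind can be built by counting, per abstract, every pairing of a tagged compound with an assigned MeSH term. The sketch below assumes each abstract has already been reduced to a set of compound identifiers and a set of MeSH terms. The record contents are hypothetical placeholders, not real Medline data.

```python
from collections import Counter
from itertools import product


def cooccurrence_counts(records):
    """Count abstracts in which a compound and a MeSH term appear together.

    `records` is an iterable of (compounds, mesh_terms) pairs, one per abstract.
    Each pair is counted at most once per abstract.
    """
    counts = Counter()
    for compounds, mesh_terms in records:
        for pair in product(set(compounds), set(mesh_terms)):
            counts[pair] += 1
    return counts


# Placeholder abstracts (hypothetical identifiers and terms).
records = [
    ({"statin_A"}, {"symptom_X"}),
    ({"statin_A", "statin_B"}, {"symptom_X", "symptom_Y"}),
    ({"statin_B"}, {"symptom_Y"}),
]

counts = cooccurrence_counts(records)
for (compound, term), n in sorted(counts.items()):
    print(compound, term, n)
```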
12 12
Search
- Chemical search using ChemAxon with DB2
- Proximal search
- Nearest neighbor search
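The slides do not say how nearest neighbors are computed. A common approach for chemical structures is to rank by Tanimoto similarity between fingerprints, and the sketch below illustrates that idea under this assumption. Fingerprints are modelled as plain sets of "on" bit positions, and the library entries are hypothetical. This is not the ChemAxon API.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity between two fingerprints given as bit sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def nearest_neighbors(query: set, library: dict, k: int = 3):
    """Return the k library entries most similar to the query, best first."""
    scored = [(tanimoto(query, fp), name) for name, fp in library.items()]
    scored.sort(key=lambda item: (-item[0], item[1]))  # ties broken by name
    return [(name, score) for score, name in scored[:k]]


# Hypothetical fingerprints for illustration.
library = {
    "compound_A": {1, 2, 3, 4},
    "compound_B": {1, 2, 3},
    "compound_C": {7, 8},
}

print(nearest_neighbors({1, 2, 3, 4}, library, k=2))
```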
13 13
Discovery
- BioTerm Analysis
- Clustering
- Claims Originality
14 14
Landscape Analysis
Visualization
Networks
15 15
Computer Curation Process Overview (same diagram as slide 7)
16 16
backup
17 17
Molecular Attributes
Attributes derived from different sources
Molecules have various attributes (from different sources):
- IP Attributes (Orange Book): legal status, assignee, foreign filings, expiration date
- Spectral Attributes (NIST dB): IR spectra, NMR, mass spec, etc.
- Physical Attributes (Computational): MW, MF, Bp, Mp, etc.
- Drug Attributes and Screening Attributes:
  - Drugbank: activity, pharm data, protein binding, half life
  - WomBat: activity, pharm data, target data for SRA, literature references
  - PubChem: activity, pharm data, target data for SRA, literature references
- Toxicity Attributes (EPA databases): toxicity studies, LD50, literature references
18 18
Cross mapping attributes from different sources
Semantic association of attributes
[Diagram labels, grouped]
The Tank
Data Sources (Internet): Orange Book, PubChem, Drugbank, FDA, Others
Each data source has its own schema and attributes:
- Data Source 1: Schema 1
- Data Source 2: Schema 2
Databases: Structure (trusted database), Database A (Medline), Database C (Tox)
Attributes: SMILES, InChi_id, Binding site, Code name, Target, Activity, app_id, Trade Name, Geo, Country, Location, Pathway, Tox, IP status, Certifications, Licensing
Two query directions:
- Input list of attributes, output file list of SMILES
- Input list of SMILES, output file list of attributes
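The cross-mapping idea can be sketched as renaming each source's fields to a shared vocabulary and merging records that describe the same structure. Everything below is an assumption made for illustration: the field maps, source names, and attribute values are hypothetical, and records are matched on the raw SMILES string, whereas a real system would more likely match on a canonical identifier such as the InChI id.

```python
ASPIRIN = "CC(=O)OC1=CC=CC=C1C(=O)O"

# Hypothetical per-source schemas: source field name -> unified attribute name.
FIELD_MAPS = {
    "source_1": {"SMILES": "smiles", "TradeName": "trade_name", "Country": "country"},
    "source_2": {"structure": "smiles", "tox_class": "tox", "ip": "ip_status"},
}


def unify(records_by_source):
    """Merge records from differently named schemas into {smiles: attributes}."""
    unified = {}
    for source, records in records_by_source.items():
        mapping = FIELD_MAPS[source]
        for record in records:
            renamed = {mapping[k]: v for k, v in record.items() if k in mapping}
            smiles = renamed.pop("smiles")
            unified.setdefault(smiles, {}).update(renamed)
    return unified


def attributes_for(unified, smiles_list):
    """Input list of SMILES, output attributes for each one."""
    return {s: unified.get(s, {}) for s in smiles_list}


def smiles_with(unified, attribute, value):
    """Input an attribute value, output the list of matching SMILES."""
    return sorted(s for s, attrs in unified.items() if attrs.get(attribute) == value)


# Placeholder records (values are invented for illustration).
unified = unify({
    "source_1": [{"SMILES": ASPIRIN, "TradeName": "ExampleBrand", "Country": "US"}],
    "source_2": [
        {"structure": ASPIRIN, "tox_class": "example-low", "ip": "expired"},
        {"structure": "CCO", "tox_class": "example-low"},
    ],
})

print(attributes_for(unified, [ASPIRIN]))
print(smiles_with(unified, "tox", "example-low"))
```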
19 19
Watson
20 20
IBM’s Massively Parallel Probabilistic Architecture
[Diagram: pipeline stages, in order]
- Question
- Question/Topic Analysis
- Query Decomposition
- Hypothesis Generation: Primary Search and Candidate Answer Generation, drawing on A. Sources
- Soft Filtering
- Hypothesis & Evidence Scoring: Answer Scoring, plus Supporting Evidence Retrieval and Deep Evidence Scoring, drawing on E. Sources
- Synthesis
- Final Merging & Ranking, using Trained Models
- Answer, Confidence
The Hypothesis Generation, Soft Filtering, and Hypothesis & Evidence Scoring stages are drawn several times in the diagram.
Watson generates and scores many hypotheses using an extensible collection of Natural Language Processing,
Machine Learning and Reasoning Algorithms. These gather and weigh evidence over both unstructured and
structured content to determine the answer with the best confidence.
Source – J Kreulen
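To make the generate, filter, score, and rank flow concrete, here is a toy sketch. It is not Watson's implementation: the feature names, weights, bias, and threshold are hypothetical, and a simple logistic combination stands in for the trained models that merge evidence scores into a confidence.

```python
import math

# Hypothetical weights standing in for trained models.
WEIGHTS = {"passage_support": 2.0, "type_match": 1.5, "popularity": 0.5}
BIAS = -2.0


def confidence(features: dict) -> float:
    """Combine evidence scores into a confidence in (0, 1) with a logistic function."""
    z = BIAS + sum(WEIGHTS[name] * features.get(name, 0.0) for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


def soft_filter(candidates, min_type_match=0.2):
    """Cheaply drop candidates that are unlikely to be the right answer type."""
    return [c for c in candidates if c["features"].get("type_match", 0.0) >= min_type_match]


def rank(candidates):
    """Score the surviving candidates and return them best first."""
    scored = [(c["answer"], confidence(c["features"])) for c in soft_filter(candidates)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Placeholder candidate answers with made-up evidence scores.
candidates = [
    {"answer": "candidate_A", "features": {"passage_support": 0.9, "type_match": 1.0, "popularity": 0.3}},
    {"answer": "candidate_B", "features": {"passage_support": 0.4, "type_match": 0.8, "popularity": 0.9}},
    {"answer": "candidate_C", "features": {"passage_support": 0.7, "type_match": 0.1, "popularity": 0.5}},
]

ranked = rank(candidates)
for answer, score in ranked:
    print(answer, round(score, 2))
```

The soft filter runs before the more expensive scoring step, which mirrors the ordering of the stages in the diagram.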
21 21
DeepQA Application (Java/C++)
Watson Infrastructure
• 90 Power 750 Servers
• Each server: 3.5 GHz POWER7 8-core processor with 4 threads/core
• Total: 2880 POWER7 cores with 16 TB RAM
• Processing speed: 500 GB/sec; 80 TeraFLOPS
• 94th on Top 500 Supercomputers
• Note: This hardware is for Jeopardy. Any other application of Watson will require appropriate sizing and optimization for purpose.
SUSE Linux Enterprise Server 11
Apache Hadoop + Apache UIMA
Technical Issues to consider when applying QA systems like Watson
Nature of Domain: Open vs. Closed
A closed domain implies that all knowledge is contained within a specific domain characterized by ontologies, and there is no need to go outside the domain. Jeopardy is an open-domain example, where the questions draw on general knowledge.
Knowledge/Data Sources: Availability
QA systems are natural language search engines; Watson goes beyond NL search. If knowledge sources are incomplete, unavailable, insufficient or inadequate, the system cannot provide an answer. In some cases one would need to envisage interactive QA that requires human interaction to guide the search. Another very important consideration is the availability of sufficient sample data for training (i.e. a training corpus).
Need for multi-modality
Is there a need for transcription from speech to text before a question is answered? This would require integrating speech-to-text capabilities that are not really ready for real-time applications.
Latency
Watson is capable of processing 500 GB of information per second with a 3 sec response to questions, and it kept most of its knowledge sources in memory (as opposed to on disk) for speed. What is the latency requirement for the application?
Multi-Lingual or Cross-Lingual Support
Watson can support only English at this time; with language-specific parsers, other languages can be supported. If knowledge sources or QA are required in multiple languages, the application would not be a good candidate. Additionally, if cultural context has to be accommodated in the answer, it would not be prudent to deploy QA systems that interact directly with users.
Question Type
Decomposition and classification of the question is critical to how QA systems work. The bulk of the question types in Jeopardy were factoid questions. Watson did not include 2 question categories: audio/video questions that require looking at a video to answer, and questions that require special instructions (e.g. verbal instructions to explain a question).
Answer Types
Watson is not designed to curate a task-oriented system. It can handle temporal and geo-spatial reasoning in its answers. As it stands, it cannot handle business-process reasoning (e.g. to do task B, tasks A and C must be completed).
22 22
I would like to acknowledge the IBM Almaden Research team
Jeff Kreulen
Ying Chen
Scott Spangler
Alfredo Alba
Tom Griffin
Eric Louie
Su Yan
Issic Cheng
Prasad Ramachandran
Bin He
Ana Lelescu
Brian Langston
Qi He
Linda Kato
Brad Wade
John Colino
Meenakshi Nagarajan
Timothy J Bethea
German Attanasio
Cassidy Kelly
Jack Labrie
Fredrick Eduardo
Ionia Stanoi
+ a host of folks from IBM China Labs