Improving Interoperability of Text Mining Tools with BioC

Ritu Khare, Chih-Hsuan Wei, Yuqing Mao, Robert Leaman, Zhiyong Lu
National Center for Biotechnology Information (NCBI), National Institutes of Health

¡  Motivation
¡  Our Text Mining Tools
¡  Building BioC Compatible Tools
¡  Results and Conclusions

¡  Building complex text mining applications requires combining different tools developed by different groups

¡  Each tool is developed independently
§  Group conventions: data representation, programming, execution environments

¡  Heterogeneity in data/text representations limits and slows down
§  tool interoperability, application development, and research and innovation

EXISTING SOLUTIONS
¡  Unstructured Information Management Architecture (UIMA) - 2004
¡  General Architecture for Text Engineering (GATE) - 2009
¡  Steep learning curve
¡  Substantial development and re-development time

BIOC
¡  Minimal change requirement to existing applications and datasets
¡  BioC family
§  XML formats to present text documents and annotations
§  Functions (C++, Java) to read/write documents in BioC format

[Figure: each tool takes a PubMed abstract and produces annotations for one type of bioconcept]
DNorm: Disease Mentions with MEDIC IDs
tmVar: Mutation Mentions
SR4GN: Species Mentions with Taxonomy IDs
tmChem: Chemical Mentions
GenNorm: Gene Mentions with Entrez IDs

Concept Recognition and Annotation Toolkit

Input: PubMed abstracts or full-text articles
¡  DNorm: Disease mentions with MEDIC IDs (F-measure = 80.90%)
¡  tmVar: Mutation mentions (F-measure = 91.39%)
¡  SR4GN: Species mentions with Taxonomy IDs (F-measure = 85.42%)
¡  tmChem: Chemical mentions (F-measure = 88.27%)
¡  GenNorm: Gene mentions with Entrez IDs (F-measure = 92.89%)
Output: annotations with various bioconcepts

NER tool | Programming language | Method | Formats (PubMed/PMC XML, Free Text, PubTator Format, GenNorm Format)
tmChem (Chemical) | Java, Perl, C++ | CRF | √ √
DNorm (Disease) | Java | CRF | √ √
tmVar (Mutation) | Perl, C++ | CRF | √ √ √
SR4GN (Species) | Perl | Rule-based | √ √ √
GenNorm (Gene) | Perl | Statistical | √ √ √
PubTator | Perl, JavaScript | Web server | √ √

¡  Official corpus for BioCreative IV GO Task
¡  200 full-text articles along with their Gene Ontology (GO) annotations
§  evidence sentences
§  gene/protein entities, GO terms, GO evidence codes

¡  Developed by expert GO curators using a web-based annotation tool.

¡  Motivation
¡  The NCBI Text Mining Toolkit
¡  Building BioC Compatible Tools
¡  Results and Conclusions

¡  The BioC family
§  XML DTD
▪  how to present text documents and annotations (higher-level semantics)
§  C++ and Java libraries
▪  functions/classes to read/write documents in BioC format

¡  BioC Recommendations
§  Full-text articles and annotations
▪  Present in BioC XML format
▪  Keep in separate files
§  Key file
▪  describes how data should be interpreted in the annotation file (lower-level semantics)
▪  needs to be created for a specific type of data
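As a rough illustration of the "present in BioC XML format" recommendation, the sketch below builds a small article file with Python's standard `xml.etree.ElementTree`. The element names follow the BioC DTD (`collection`, `source`, `date`, `key`, `document`, `passage`, `infon`, `offset`, `text`); the PMID, date, key file name, and the one-character gap between title and abstract are placeholders and assumptions, not values from the toolkit.

```python
import xml.etree.ElementTree as ET


def build_article_collection(pmid, title, abstract, key_file, date):
    """Build a BioC <collection> holding one document with two passages."""
    collection = ET.Element("collection")
    ET.SubElement(collection, "source").text = "PubMed"
    ET.SubElement(collection, "date").text = date
    ET.SubElement(collection, "key").text = key_file

    document = ET.SubElement(collection, "document")
    ET.SubElement(document, "id").text = pmid

    # Assumption: the abstract starts one character after the title ends.
    passages = [("title", 0, title), ("abstract", len(title) + 1, abstract)]
    for passage_type, offset, text in passages:
        passage = ET.SubElement(document, "passage")
        ET.SubElement(passage, "infon", key="type").text = passage_type
        ET.SubElement(passage, "offset").text = str(offset)
        ET.SubElement(passage, "text").text = text
    return collection


collection = build_article_collection(
    pmid="00000000",            # placeholder, not a real PMID
    title="An example title.",
    abstract="An example abstract.",
    key_file="example.key",     # hypothetical key file name
    date="20130101",            # placeholder date
)
print(ET.tostring(collection, encoding="unicode"))
```

Annotations produced by a tool would go into a second file with the same skeleton, in line with the recommendation to keep articles and annotations separate.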

¡  Steps taken to make our tools BioC-compliant
§  Created the key file
§  Modified the input/output formats of the tools
▪  Added the BioC format as a new option for input/output

¡  Challenges
§  Defining an appropriate key file
§  Offset calculation
§  Translating web-based annotation file to BioC annotation file (Unicode to ASCII conversion)
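The offset and Unicode challenges interact, which a short sketch can make concrete. The helpers below are illustrative only (their names are not from the toolkit): one turns a position inside a passage into a document-level offset, and one folds text to ASCII with `unicodedata`. The folding strategy shown (NFKD decomposition, then dropping anything non-ASCII) is an assumption, not necessarily what the authors used.

```python
import unicodedata


def passage_offsets(title, separator_length=1):
    """Return (title_offset, abstract_offset), assuming the abstract
    follows the title after `separator_length` characters."""
    return 0, len(title) + separator_length


def document_offset(passage_offset, passage_text, mention):
    """Convert a mention's position within a passage to a document offset."""
    local = passage_text.find(mention)
    if local < 0:
        raise ValueError(f"mention not found: {mention!r}")
    return passage_offset + local


def to_ascii(text):
    """Fold text to ASCII: strip accents, drop characters with no ASCII form."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


original = "TGF-β1 signaling"
folded = to_ascii(original)

# The Greek letter has no ASCII form and is dropped, so every later
# character moves one position to the left.
print(document_offset(0, original, "signaling"))  # 7
print(document_offset(0, folded, "signaling"))    # 6
```

Offsets computed before the conversion therefore have to be recomputed, or the conversion has to preserve length, for annotations to line up with the text.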

¡  Common key file for all tools since they are designed for similar types of data

id: PubMed id.

Passage: e.g., title, abstract

Offset of the passage

Id of the bioconcept

Offset of the bioconcept

Length of the bioconcept

Mention of the bioconcept

date: the time the annotation was created
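To show how the key file fields map onto an annotation file, here is a minimal reader using `xml.etree.ElementTree`. The sample XML is hand-written for illustration; the PMID is a placeholder, and the infon keys `type` and `identifier` are assumptions about how the bioconcept type and ID might be stored, not a statement of the toolkit's actual key file.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

TITLE = "Mutations in BRCA1 and breast cancer risk"

# Hand-written sample; identifiers and infon keys are illustrative.
SAMPLE = f"""<collection>
  <source>PubMed</source><date>20130101</date><key>example.key</key>
  <document>
    <id>00000000</id>
    <passage>
      <infon key="type">title</infon>
      <offset>0</offset>
      <text>{TITLE}</text>
      <annotation id="0">
        <infon key="type">Disease</infon>
        <infon key="identifier">MESH:D001943</infon>
        <location offset="23" length="13"/>
        <text>breast cancer</text>
      </annotation>
    </passage>
  </document>
</collection>"""


@dataclass
class Mention:
    pmid: str          # id: PubMed ID
    passage_type: str  # passage: e.g., title, abstract
    concept_id: str    # ID of the bioconcept
    concept_type: str  # type of the bioconcept
    offset: int        # offset of the bioconcept
    length: int        # length of the bioconcept
    text: str          # mention of the bioconcept


def read_mentions(xml_text):
    """Collect every annotation in a BioC collection as a Mention."""
    root = ET.fromstring(xml_text)
    mentions = []
    for document in root.iter("document"):
        pmid = document.findtext("id")
        for passage in document.iter("passage"):
            passage_type = passage.findtext("infon[@key='type']")
            for annotation in passage.iter("annotation"):
                location = annotation.find("location")
                mentions.append(Mention(
                    pmid=pmid,
                    passage_type=passage_type,
                    concept_id=annotation.findtext("infon[@key='identifier']", default=""),
                    concept_type=annotation.findtext("infon[@key='type']", default=""),
                    offset=int(location.get("offset")),
                    length=int(location.get("length")),
                    text=annotation.findtext("text"),
                ))
    return mentions


for mention in read_mentions(SAMPLE):
    print(mention)
```

A quick consistency check is that slicing the passage text with the stored offset and length returns the mention text.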

NER tool | Bioconcept | Formats (PubMed/PMC XML, BioC, Free Text, PubTator, GenNorm)
tmChem | Chemical | √ √ √
DNorm | Disease | √ √ √
tmVar | Mutation | √ √ √ √
SR4GN | Species | √ √ √ √
GenNorm | Gene | √ √ √ √
PubTator | N/A | √ √ √

Our Text Mining Toolkit available for public access: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/

[Figure: BioC Article File → tools → BioC Annotation File]
DNorm: Identifying Disease
tmVar: Identifying Mutation
tmChem: Identifying Chemical
SR4GN: Identifying Species
GenNorm: Identifying Gene
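The flow from article file to annotation file relies on every tool reading and writing the same structure. The sketch below illustrates that idea with toy dictionary taggers standing in for the real tools; the classes, function names, and term lists are all hypothetical, and the real tools use CRF, rule-based, or statistical methods rather than string lookup.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Annotation:
    concept_type: str
    offset: int
    length: int
    text: str


@dataclass
class Passage:
    offset: int
    text: str
    annotations: List[Annotation] = field(default_factory=list)


def make_dictionary_tagger(concept_type, terms) -> Callable[[Passage], Passage]:
    """Return a toy tagger that marks exact occurrences of `terms`."""
    def tag(passage: Passage) -> Passage:
        for term in terms:
            start = passage.text.find(term)
            while start != -1:
                passage.annotations.append(
                    Annotation(concept_type, passage.offset + start, len(term), term)
                )
                start = passage.text.find(term, start + len(term))
        return passage
    return tag


def run_pipeline(passage: Passage, tools) -> Passage:
    """Pass one shared passage through each tool in turn."""
    for tool in tools:
        passage = tool(passage)
    return passage


tools = [
    make_dictionary_tagger("Disease", ["breast cancer"]),
    make_dictionary_tagger("Species", ["human"]),
    make_dictionary_tagger("Gene", ["BRCA1"]),
]
result = run_pipeline(Passage(0, "BRCA1 mutations in human breast cancer"), tools)
for annotation in result.annotations:
    print(annotation)
```

Because each stand-in tool only appends to the shared annotation list, tools can be added, removed, or reordered without changing the others.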

id: PubMed id.

passage: title

date: the time the file was downloaded

passage: abstract

Id of the bioconcept

Offset of the bioconcept

Length of the bioconcept

Mention of the bioconcept

Type of the bioconcept

Time: Time annotation created.

ID: PMID of the article.

GO term: e.g., receptor-‐mediated endocytosis

GO evidence code: e.g., Inferred from Mutant Phenotype (IMP)

Curatable entity: i.e., gene or gene product

Text: GO evidence text
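One way to picture these fields is as infons on a single BioC `<annotation>` element. The sketch below is an assumption-laden illustration: the infon key names (`go-term`, `evidence-code`, `entity`), the class, the gene name, and the evidence sentence are all made up for the example and are not taken from the corpus key file.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class GoAnnotation:
    go_term: str        # GO term
    evidence_code: str  # GO evidence code
    entity: str         # curatable entity (gene or gene product)
    evidence_text: str  # GO evidence text
    offset: int         # document offset of the evidence text

    def to_element(self, annotation_id):
        """Serialize as a BioC <annotation>; infon keys are hypothetical."""
        annotation = ET.Element("annotation", id=str(annotation_id))
        infons = {
            "go-term": self.go_term,
            "evidence-code": self.evidence_code,
            "entity": self.entity,
        }
        for key, value in infons.items():
            ET.SubElement(annotation, "infon", key=key).text = value
        ET.SubElement(
            annotation,
            "location",
            offset=str(self.offset),
            length=str(len(self.evidence_text)),
        )
        ET.SubElement(annotation, "text").text = self.evidence_text
        return annotation


example = GoAnnotation(
    go_term="receptor-mediated endocytosis",
    evidence_code="IMP",
    entity="GENE_X",  # placeholder entity name
    evidence_text="Mutants lacking GENE_X fail to internalize the receptor.",  # invented sentence
    offset=0,
)
element = example.to_element("G1")
print(ET.tostring(element, encoding="unicode"))
```

The PMID and creation time from the key file would sit at the document and collection level rather than on the annotation itself.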

¡  Our experience with BioC
§  Minimal changes required to prepare BioC versions
§  Easy to learn and use
§  Improved interoperability within the toolkit

¡  Implications
§  Improved interoperability
▪  With other tools to build sophisticated applications
§  The key file could evolve as a standard for concept recognition and normalization tasks
§  Anticipate broader usage of our tools as BioC gains popularity

¡  BioC Developers
§  W. John Wilbur
§  Rezarta Islamaj Doğan
§  Donald Comeau

¡  Intramural Research Program of the NIH, National Library of Medicine

¡  Chih-Hsuan Wei
§  weic4@ncbi.nlm.nih.gov
§  +1 301-594-5290
