Improving Interoperability of Text Mining Tools with BioC

Post on 10-May-2015


Ritu Khare, Chih-Hsuan Wei, Yuqing Mao, Robert Leaman, Zhiyong Lu
National Center for Biotechnology Information (NCBI)
National Institutes of Health

• Motivation
• Our Text Mining Tools
• Building BioC Compatible Tools
• Results and Conclusions

• Building complex text mining applications requires combining different tools developed by different groups
• Each tool is developed independently
  ◦ Group conventions: data representation, programming, execution environments
• Heterogeneity in data/text representations limits and slows down
  ◦ tool interoperability, application development, and research and innovation.

EXISTING SOLUTIONS
• Unstructured Information Management Architecture (UIMA), 2004
• General Architecture for Text Engineering (GATE), 2009
• Steep learning curve
• Substantial development and re-development time

BIOC
• Minimal changes required to existing applications and datasets
• BioC family
  ◦ XML formats to present text documents and annotations
  ◦ Functions (C++, Java) to read/write documents in BioC format


Concept Recognition and Annotation Toolkit

Input: PubMed abstracts or full-text articles
Output: annotations for various bioconcepts

• DNorm: disease mentions with MEDIC IDs (F-measure = 80.90%)
• tmVar: mutation mentions (F-measure = 91.39%)
• SR4GN: species mentions with Taxonomy IDs (F-measure = 85.42%)
• tmChem: chemical mentions (F-measure = 88.27%)
• GenNorm: gene mentions with Entrez IDs (F-measure = 92.89%)

Formats considered: PubMed/PMC XML, free text, PubTator format, GenNorm format

NER tool            Programming language   Method        Formats supported
tmChem (Chemical)   Java, Perl, C++        CRF           √ √
DNorm (Disease)     Java                   CRF           √ √
tmVar (Mutation)    Perl, C++              CRF           √ √ √
SR4GN (Species)     Perl                   Rule-based    √ √ √
GenNorm (Gene)      Perl                   Statistical   √ √ √
PubTator            Perl, JavaScript       Web server    √ √

• Official corpus for the BioCreative IV GO Task
• 200 full-text articles along with their Gene Ontology (GO) annotations
  ◦ evidence sentences
  ◦ gene/protein entities, GO terms, GO evidence codes
• Developed by expert GO curators via a web-based annotation tool.


• The BioC family
  ◦ XML DTD
    ▪ how to present text documents and annotations (higher-level semantics)
  ◦ C++ and Java libraries
    ▪ functions/classes to read/write documents in BioC format
• BioC recommendations
  ◦ Full-text articles and annotations
    ▪ Present in BioC XML format
    ▪ Keep in separate files
  ◦ Key file
    ▪ describes how data should be interpreted in the annotation file (lower-level semantics)
    ▪ needs to be created for a specific type of data.
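
To make the XML side of this concrete, here is a minimal Python sketch that builds a BioC-style collection holding one document with a title and an abstract passage. The slides only mention C++ and Java libraries, so Python's standard xml.etree module is a substitute chosen for illustration. The element names follow the public BioC DTD as commonly described; the function name build_collection, the key file name, the placeholder PMID, and the one-character gap between passages are all assumptions.

```python
import xml.etree.ElementTree as ET


def build_collection(source, date, key, pmid, passages):
    """Build a BioC-style <collection> holding one document.

    `passages` is a list of (passage_type, text) pairs, e.g. title and
    abstract. Offsets assume passages are joined by one separator character.
    """
    collection = ET.Element("collection")
    ET.SubElement(collection, "source").text = source
    ET.SubElement(collection, "date").text = date
    ET.SubElement(collection, "key").text = key  # name of the key file

    document = ET.SubElement(collection, "document")
    ET.SubElement(document, "id").text = pmid

    offset = 0
    for passage_type, text in passages:
        passage = ET.SubElement(document, "passage")
        infon = ET.SubElement(passage, "infon", key="type")
        infon.text = passage_type
        ET.SubElement(passage, "offset").text = str(offset)
        ET.SubElement(passage, "text").text = text
        offset += len(text) + 1  # assumed: one separator between passages
    return collection


collection = build_collection(
    source="PubMed",
    date="2013-10-07",      # illustrative date
    key="example.key",      # hypothetical key file name
    pmid="000000",          # placeholder, not a real PMID
    passages=[("title", "A short title."), ("abstract", "A short abstract.")],
)
xml_text = ET.tostring(collection, encoding="unicode")
```

The annotation file would use the same skeleton, with annotation elements added under each passage, which is how articles and annotations can be kept in separate files while still lining up by document id and offset.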


• Steps taken to make our tools compliant with BioC
  ◦ Created the key file
  ◦ Modified the input/output formats of the tools
    ▪ Added the BioC format as a new option for input/output
• Challenges
  ◦ Defining an appropriate key file
  ◦ Offset calculation
  ◦ Translating the web-based annotation file to a BioC annotation file (Unicode to ASCII conversion)
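
The offset and Unicode-to-ASCII challenges interact: folding text to ASCII can change its length, so offsets computed on one version do not apply to the other. The sketch below shows one possible way to see this in Python. The slides do not say how the authors performed the conversion, so the NFKD-based folding, the helper names to_ascii and find_mention, and the sample sentence are assumptions for illustration only.

```python
import unicodedata


def to_ascii(text):
    """Fold text to ASCII; characters with no ASCII form become '?'.

    NFKD splits accented letters into base letter + combining mark and
    expands ligatures; the combining marks are then dropped. The result
    can differ in length from the input, which makes offsets drift.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.encode("ascii", "replace").decode("ascii")


def find_mention(passage_text, passage_offset, mention):
    """Return (document-level offset, length) of the first match, or None."""
    start = passage_text.find(mention)
    if start < 0:
        return None
    return passage_offset + start, len(mention)


# Sample text with an accented letter and an "fi" ligature (U+FB01).
raw = "Na\u00efve cystic \ufb01brosis models"
ascii_text = to_ascii(raw)

raw_hit = find_mention(raw, 0, "models")
ascii_hit = find_mention(ascii_text, 0, "models")
```

Here the ligature expands to two characters, so the same mention starts one position later in the ASCII text than in the original. Offsets therefore need to be computed on exactly the text that is stored in the BioC file.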


• Common key file for all tools, since they are designed for similar types of data

id: PubMed ID
passage: e.g., title, abstract
offset of the passage
id of the bioconcept
offset of the bioconcept
length of the bioconcept
mention of the bioconcept
date: the time the annotation was created
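
As a rough illustration of how these fields appear in a file, the Python sketch below reads annotations from a small BioC-style XML string and checks that each mention matches the text at its stated offset and length. The sample document is made up, the infon key "type" is an assumption about the key file, and read_annotations is a hypothetical helper rather than part of the BioC libraries.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<collection>
  <source>PubMed</source>
  <date>2013-10-07</date>
  <key>example.key</key>
  <document>
    <id>000000</id>
    <passage>
      <infon key="type">title</infon>
      <offset>0</offset>
      <text>BRCA1 mutations in breast cancer.</text>
      <annotation id="0">
        <infon key="type">Gene</infon>
        <location offset="0" length="5"/>
        <text>BRCA1</text>
      </annotation>
      <annotation id="1">
        <infon key="type">Disease</infon>
        <location offset="19" length="13"/>
        <text>breast cancer</text>
      </annotation>
    </passage>
  </document>
</collection>"""


def read_annotations(xml_text):
    """Yield one dict per annotation, checking mention text against offsets."""
    root = ET.fromstring(xml_text)
    for document in root.iter("document"):
        pmid = document.findtext("id")
        for passage in document.iter("passage"):
            passage_offset = int(passage.findtext("offset"))
            passage_text = passage.findtext("text") or ""
            for annotation in passage.iter("annotation"):
                location = annotation.find("location")
                start = int(location.get("offset"))
                length = int(location.get("length"))
                local = start - passage_offset  # position inside the passage
                mention = annotation.findtext("text")
                yield {
                    "pmid": pmid,
                    "type": annotation.findtext("infon[@key='type']"),
                    "offset": start,
                    "length": length,
                    "mention": mention,
                    "consistent": passage_text[local:local + length] == mention,
                }


rows = list(read_annotations(SAMPLE))
```

A consistency check of this kind is a cheap way to catch offset errors before annotations are passed on to another tool.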

Formats considered: PubMed/PMC XML, BioC, free text, PubTator, GenNorm

NER tool   Bioconcept   Formats supported
tmChem     Chemical     √ √ √
DNorm      Disease      √ √ √
tmVar      Mutation     √ √ √ √
SR4GN      Species      √ √ √ √
GenNorm    Gene         √ √ √ √
PubTator   N/A          √ √ √
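
Adding BioC as an output option largely means re-expressing information a tool already emits in another layout. As a hedged sketch, the Python below parses one record in the PubTator-style text layout (title and abstract lines marked with |t| and |a|, then tab-separated annotation lines with start and end positions) into the offset/length form that BioC uses. The layout is described here from common PubTator exports and should be treated as an assumption; the record itself and the function name parse_pubtator are illustrative.

```python
def parse_pubtator(block):
    """Parse one PubTator-style record into passages and annotations."""
    passages, annotations = {}, []
    for line in block.strip().splitlines():
        head = line.split("|", 2)
        if len(head) == 3 and head[1] in ("t", "a"):
            passages[head[1]] = head[2]  # "t" = title, "a" = abstract
            continue
        fields = line.split("\t")
        if len(fields) >= 5:
            start, end = int(fields[1]), int(fields[2])
            annotations.append({
                "pmid": fields[0],
                "offset": start,
                "length": end - start,  # BioC stores length, not end position
                "mention": fields[3],
                "type": fields[4],
                "identifier": fields[5] if len(fields) > 5 else None,
            })
    return passages, annotations


# Made-up record with a placeholder PMID.
RECORD = (
    "000000|t|BRCA1 mutations in breast cancer.\n"
    "000000|a|A short abstract.\n"
    "000000\t0\t5\tBRCA1\tGene\t672\n"
    "000000\t19\t32\tbreast cancer\tDisease\tD001943\n"
)

passages, annotations = parse_pubtator(RECORD)
```

The resulting dictionaries carry the same fields as a BioC annotation (offset, length, mention, type), so writing them out as BioC XML is a serialization step rather than a change to the tool's logic.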

Our text mining toolkit is available for public access: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/

BioC article file → DNorm, tmVar, tmChem, SR4GN, GenNorm → BioC annotation file

• DNorm: identifying diseases
• tmVar: identifying mutations
• tmChem: identifying chemicals
• SR4GN: identifying species
• GenNorm: identifying genes
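
The practical benefit of a shared format is that several taggers can work on the same document structure. The Python sketch below illustrates the idea with toy dictionary taggers standing in for the real tools; it does not reproduce how DNorm, tmVar, or the others work, and the slides do not state whether the tools are run in sequence or independently. The function names, the term lists, and the in-memory document layout are all assumptions.

```python
import re


def make_dictionary_tagger(concept_type, terms):
    """Return a toy tagger; a stand-in for a real tool such as DNorm."""
    pattern = re.compile("|".join(re.escape(t) for t in terms))

    def tag(document):
        for match in pattern.finditer(document["text"]):
            document["annotations"].append({
                "type": concept_type,
                "offset": match.start(),
                "length": match.end() - match.start(),
                "mention": match.group(0),
            })
        return document

    return tag


def run_pipeline(document, taggers):
    """Pass one shared document through every tagger in turn."""
    for tag in taggers:
        document = tag(document)
    document["annotations"].sort(key=lambda a: a["offset"])
    return document


document = {
    "id": "000000",  # placeholder, not a real PMID
    "text": "BRCA1 mutations in human breast cancer.",
    "annotations": [],
}
taggers = [
    make_dictionary_tagger("Disease", ["breast cancer"]),
    make_dictionary_tagger("Species", ["human"]),
    make_dictionary_tagger("Gene", ["BRCA1"]),
]
result = run_pipeline(document, taggers)
```

Because every tagger reads and writes the same fields, the order in which they run does not matter here, and a new tagger can be added without touching the others.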


id: PubMed ID
passage: title
date: the time the file was downloaded
passage: abstract

id of the bioconcept
offset of the bioconcept
length of the bioconcept
mention of the bioconcept
type of the bioconcept
time: the time the annotation was created

ID: PMID of the article
GO term: e.g., receptor-mediated endocytosis
GO evidence code: e.g., Inferred from Mutant Phenotype (IMP)
Curatable entity: i.e., gene or gene product
Text: GO evidence text

• Our experience with BioC
  ◦ Minimal changes required to prepare BioC versions
  ◦ Easy to learn and use
  ◦ Improved interoperability within the toolkit
• Implications
  ◦ Improved interoperability
    ▪ With other tools to build sophisticated applications
  ◦ The key file could evolve into a standard for concept recognition and normalization tasks
  ◦ We anticipate broader usage of our tools as BioC gains popularity

• BioC developers
  ◦ W. John Wilbur
  ◦ Rezarta Islamaj Doğan
  ◦ Donald Comeau
• Intramural Research Program of the NIH, National Library of Medicine

• Chih-Hsuan Wei
  ◦ weic4@ncbi.nlm.nih.gov
  ◦ +1 301-594-5290