Conference Review ISMB 2003 Text Mining SIG meeting report

Comparative and Functional GenomicsComp Funct Genom 2003; 4: 667–673.Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/cfg.338

Conference Review

ISMB 2003 Text Mining SIG meeting reportISMB’03, Brisbane, Australia, 27 June 2003

Christian Blaschke1*, Alexander Yeh2, Lynette Hirschman2 and Alfonso Valencia1

1Protein Design Group, CNB/CSIC, Madrid, Spain2The MITRE Corporation, Bedford, MA 01730, USA

*Correspondence to:Christian Blaschke, ProteinDesign Group, CNB/CSIC,Madrid, Spain.E-mail: [email protected]

Received: 18 September 2003Revised: 24 September 2003Accepted: 25 September 2003

Introduction

The third meeting of the Special Interest Group(SIG) for Text Mining was held in conjunc-tion with ISMB in Australia this year, follow-ing the 2001 meeting in Copenhagen and the2002 meeting in Edmonton. The Text MiningSIG has been organized by the BioLINK group(http://www.pdg.cnb.uam.es/BioLINK), with itsmain contributors Lynette Hirschman (MITRE,Bedford, MA, USA) and Alfonso Valencia (CNB,Madrid, Spain), together with this year’s localorganizers Christian Blaschke (CNB), Marc Light(University of Iowa, USA) and Alexander Yeh(MITRE). The SIG’s main goal has been to fos-ter communication in text mining and informationextraction applied to biology and biomedicine. Tothis end, the BioLINK group holds regular openmeetings to bring together researchers from thefield to interchange ideas and share them with awider community interested in the latest develop-ments. In the past two meetings, the Text MiningSIG has included reports from related SIGs (e.g.BioOntologies and BioPathways).

Information extraction (IE) is an outgrowth ofwork in automated natural language processing,which began in the 1950s with work on transfor-mational grammar by Zellig Harris and later Noam

Chomsky. Information extraction technology maderapid progress starting in the late 1980s, thanks to aseries of conferences focused on evaluation of IE:the Message Understanding Conferences (MUCs).There has also been a long history of researchon applications in medicine. Applications to themedical field focus on two distinct sub-problems:(a) improved access to the medical literature; and(b) extraction of information from patient records.

Despite the successes in other fields, NaturalLanguage Processing (NLP) techniques were notintroduced into biology until the late 1990s. Thefield has been dominated by two, not necessarilyconvergent, approaches: (a) application-orientated,where simple methods are used (possibly toosimple) to address ‘real’ biological problems; and(b) tool-orientated, where complex, state-of-the-artNLP methods are used to address problems that arenot always relevant to biologists.

During the SIG meeting, it became appar-ent that three major bottlenecks hinder currentdevelopment:

1. The complex and non-standardized nomencla-ture of genes and proteins in the scientific liter-ature. This makes it difficult to identify the basiccontent of a document, in particular the entitiesmentioned.

Copyright 2003 John Wiley & Sons, Ltd.

668 C. Blaschke et al.

2. The absence of large, annotated standard cor-pora for training and evaluation of alternativemethods.

3. The lack of common standards and evaluationcriteria that allow researchers to compare theperformance of different methodologies.

To begin to address these problems, the BioLINKgroup is organizing a critical assessment of textmining methods later this year (see http://www.pdg.cnb.uam.es/BioLink/BioCreative.eval.html).The assessment is inspired by the CASP eval-uations and will be carried out in collaborationwith SwissProt, HighWire Press, FlyBase and othergroups.

Talks

In this report, we review some of the presentationsgiven at the SIG meeting (Table 1). For more

information and copies of the submitted abstractsand presentations, visit http://www.pdg.cnb.uam.es/BioLink/SpecialInterestTextMining/HAND-OUTS/1 BioLINK handouts May28.html.

Ontologies in Bioinformatics (RobertStevens)

The basic function of human language is to com-municate efficiently (between human beings). Todo so, symbols have been created (composed ofwords) that stand for things. The meaning triangle(Figure 1 [6]) describes the relationship betweensymbols and things.

Ontologies are conceptual models of the sharedand common understanding of a domain and theycapture knowledge in a computationally amenablemanner. Therefore, they are of specific interestfor the NLP community because they provide a

Table 1. Talks given at the Text Mining SIG meeting

Speakers and Affiliation Title

Robert Stevens. Univ. Manchester, UK Report from BioOntologies SIGIan Donaldson, Joel Martin, Berry de Bruijn,Christopher W.V. Hogue. Univ. Toronto, Canada

PreBIND and Textomy—mining the biomedicalliterature for protein–protein interactions usinga support vector machine

Andy Fulmer, Jun Xu, Steven Zhao. Procter &Gamble, USA

An overview of text mining in the biologydomain at P&G

George Demetriou, Robert Gaizauskas. Univ.Sheffield, UK

Corpus resources for development andevaluation of a biological text mining system

Yuka Tateisi, Tomoko Ohta, Jin-dong Kim, HuaquingHong, Su Jian, Jun-ichi Tsujii. CREST, Japan Science andTechnology Corporation

The GENIA corpus: MEDLINE abstractsannotated with linguistic information

Seth Kulick, Mark Liberman, Andrew Schein. Univ.Pennsylvania, USA

Shallow semantic annotation of biomedicalcorpora for information extraction

Maria Samsonova. St.Petersburg State PolytechnicalUniversity, Russia

Processing of the natural language queries to arelational database

Tony C. Smith, John G. Cleary. Univ. Waikato, NewZealand

Automatically linking MEDLINE abstracts to theGene Ontology

Yoshimasa Tsuruoka, Teruyoshi Hishiki, OsamuOgasawara, Kousaku Okubo. CREST, Japan Science andTechnology Corporation

Integration of diverse knowledge and data intobiomedical knowledge matrices

Eunji Yi, Gary G. Lee, Soo-Jun Park. Pohang Univ.Science and Technology, Korea

HMM-based protein name recognition with editdistance using automatically annotated corpus

Francisco M. Couto, Mario J. Silva, Pedro Coutinho.Univ Lisbon, Portugal

Curating extracted information through thecorrelation between structure and function

Rune Linding, Peter O’Hanlon, Ulrich Reincke,Toby Gibson. EMBL, Heidelberg, Germany

Profiling and classification of scientificdocuments with SAS Text Miner

Alex Yeh, Lynette Hirschman, Alex Morgan. MITRE,USA

BioCreAtIvE: entity extraction

Christian Blaschke, Alfonso Valencia. CNB, Spain BioCreAtIvE: functional extraction

Bold type indicates the person who spoke at the meeting.

Copyright 2003 John Wiley & Sons, Ltd. Comp Funct Genom 2003; 4: 667–673.

ISMB 2003 Text Mining SIG meeting 669

symbol: “dog”stands for

concept

thing/object

refers to

evok

es

Figure 1. The meaning triangle describes how symbols(part of the human language), concepts (abstract meanings)and things (real world objects) are related

framework into which information extracted fromtext can be mapped. On the other hand, textanalysis can support building ontologies by makinginformation in the literature more easily accessible.

There are a number of ontologies that areof specific interest to biology, e.g. the GeneOntology (GO [1]), the Disease Ontology (DO;http://diseaseontology.sourceforge.net/), theMouse Anatomy Ontology (http://www.informatics.jax.org/searches/anatdict form.shtml), theontology of E. coli metabolism (EcoCyc [4]),TAMBIS [7] and PharmGKB (http://www.pharmgkb.org/).

The GO ontology covers molecular function, bio-logical process and cellular components for geneproducts. The first release consisted of about 3500terms; it now contains around 15 000 terms and isstill growing. Currently it covers some 15 (model)organisms. GO has been criticized (mainly by non-biologists) because of its ad hoc construction andimpoverished model of relationships: it containsonly ‘is-a’ and ‘part-of ‘ relationships. However,its success and wide use has proved that the impor-tant issues for an ontology are that it contain usefulknowledge, modelled in a way that can be easilyapplied.

Protein name detection (Eunji Yi)

Before knowledge models such as ontologies canbe applied to text analysis, one first has to detectthe basic symbols (named entities) that represent

the concepts in the text. In the biomedical domain,protein names are among the concepts of centralimportance but they are notoriously difficult todetect because of the absence of an accepted (andused) standard nomenclature. There have been twoapproaches for the detection of protein names inscientific text: machine learning methods and rule-based systems.

Machine learning approaches suffer from thelack of (large) annotated corpora for training andthe excessive spelling variations in names. Yi et al.used a hidden Markov model (HMM)-based systemand showed that training corpora can be createdautomatically from MedLine (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db = PubMed)abstracts tagged with protein names extracted fromSwiss-Prot [2]. Their results indicate that systemstrained on such corpora do not suffer in precisionand show improvements in recall. This makes auto-matically annotated corpora an interesting solutionwhen sufficiently large hand-created datasets arenot available.

The authors propose the use of a distance mea-sure between strings (edit distance, introduced byWagner and Fisher [8]) integrated in the HMM toallow shallow matching between strings to accountfor spelling variants. As expected, this extensionincreases the recall of the system but precision suf-fers somewhat.

In conclusion, automatically annotated trainingcorpora and relaxation in string matching improvethe performance of machine learning methods inthe task of protein name detection in scientific text.

Text mining at P & G (Andrew Fulmer)

Andrew Fulmer from Procter & Gamble gaveinsights in how text mining is used in the con-text of a pharmaceutical company. P&G conductsresearch in biology to support its businesses inpharmaceutical drugs, personal health care, pet careand consumer products. They have established thecapabilities to conduct high-throughput Affymetrixgene chip expression studies, where a single exper-iment generates data on ∼10 000 different genes.The demand to interpret these datasets in a timelymanner has motivated their entry into the text min-ing field.

In 1999, in collaboration with scientists at IowaState University, P&G started to develop Path-Binder, which is now used to harvest signal



transduction–gene regulatory pathway interactions.These interactions are curated by their project biol-ogists into a pathways knowledge base, built on asimple logical interaction model, with a separatesuite of tools to build and analyse pathways. Toincrease recall, better NLP filters to post-processthe PathBinder sentences are now being developedwith a group of linguists at Los Alamos NationalLabs.

In 2001, with the initial pathway text mining ini-tiatives under way, more attention was paid to min-ing ‘functional context’ to characterize membersof a statistically filtered list of genes from a genechip study. A gene chip experiment involves about10 000 genes, which are filtered down to ∼1000genes by statistical data analysis and external infor-mation stemming from the experimental design. Todiscover a ‘useful story’ and create a biologicalmodel, this number has to be reduced by anotherone to two orders of magnitude. One approach is todevelop tools to annotate the genes with GO terms,using information mined from Medline, then ‘clus-ter’ the gene list in ontology space. An ontologyclustering tool is under development with the LosAlamos National Labs, while the GOMedlineMineris still ‘in the intellectual incubator’.

PreBIND and Textomy (Ian Donaldson)

The majority of experimentally verified molecu-lar interaction and biological pathway data arepresent in the unstructured text of biomedicaljournal articles where they are inaccessible tocomputational methods. The Biomolecular Inter-action Network Database (BIND) is a curatedcatalogue of biomolecular interactions, complexesand pathways and seeks to capture these data ina machine-readable format. It currently containsabout 17 000 protein–protein interactions, ∼48 000protein–small molecule interactions, ∼1300 molec-ular complexes and eight pathways.

PreBIND and Textomy are two components ofa literature-mining system designed to find pro-tein–protein interaction information and presentthis to curators or public users for review and sub-mission to the BIND database. This system couplesa co-occurrence network of protein names withSupport Vector Machine (SVM) technology thatidentifies abstracts describing biomolecular inter-actions.

Performance analyses estimated that the SVMF-measure was 92% and that the system would beable to recall up to 60% of all non-high-throughputinteractions present in the MIPS yeast–proteininteraction database. Finally, this system wasapplied to a real-world curation problem and itsuse was found to reduce the task duration by 70%.

Machine learning methods are useful as toolsto direct interaction and pathway database back-filling; however, this potential can only be realizedif these techniques are coupled with human reviewand entry into a factual database such as BIND.The PreBIND system described here is availableto the public. Current capabilities allow searchingfor human, mouse and yeast protein-interactioninformation.

Corpus work (Andrew Schein, RobertGaizauskas, Jin-dong Kim)

It is generally accepted that the free availability ofsuitably annotated text corpora is a prerequisite fordevelopment and evaluation of language processingsystems. Such corpora are linguistic resourcesfundamental for a number of tasks:

1. Definition of a text analysis task. Annotating textfor a specific purpose aids in defining the taskmore precisely, it shows which entities mustbe taken into account, which relationships existbetween them, and it encourages refinement ofthe task guidelines.

2. System development. Two basic methods existfor building NLP systems: hand-crafted rule-based methods or machine learning (ML)-basedmethods that make heavy use of statistics andpattern analysis. ML methods depend on suit-able training corpora (in general the larger thebetter) for setting up the system.

3. System evaluation. To assess performance, dif-ferent NLP systems are applied to the samecorpora and the results are compared. The lackof freely available corpora that are sufficientlylarge and general in focus has hindered the com-parison of NLP systems applied to biology.

The different works presented at the SIG meetinghighlighted a number of technical and theoreticalissues that have to be taken into account forbuilding annotated text corpora:



1. Definition of the ‘correct’ annotation scheme.The scheme has to be specific enough to beuseful, but not overly specific (trying to capturemore information than is actually in the text).

2. The previous point influences the level of inter-annotator agreement. The more specific thescheme, the less the annotators will agree onhow to annotate a specific text and no homoge-neous results will be produced.

3. The tension between the domain experts andcomputational linguists needs to be resolved.Domain experts want to capture some kindof information, but have limited understandingof the linguistic basics (they tend to choosecomputationally impractical solutions); whilethe computational linguists have limited domainknowledge and tend to create systems that aretheoretically elegant but difficult to use by thedomain experts.

4. Text corpora can be annotated at very differentlevels:i. Part-of-speech (POS) of words.

ii. Entities (e.g. genes, proteins, chemical sub-stances, diseases, protein structure elements,etc.).

iii. Partial or full parse-tree structures to expressrelations between the tagged entities or moregeneral subject–object relationships.

iv. Co-references (pronouns like ‘it’ or ‘they’that refer to something explicitly expressedearlier in the text).

5. Evaluation tools developed for different domainshave to be adapted to the biological literature.

Kulick et al. are developing new linguisticresources in three categories: a large corpus ofbiomedical text annotated with syntactic structure(Treebank [5]) and predicate–argument structure(’proposition bank’ or Propbank); a large set ofbiomedical abstracts and full-text articles annotatedwith entities and relations of interest to researchers,such as enzyme inhibition, or mutation/cancer con-nections (Factbanks); and broad-coverage lexiconsand tools for the analysis of biomedical texts. Fur-thermore, they are developing and adapting soft-ware tools that allow human experts to annotatebiomedical texts for entity tagging, as well as fortreebanking and propbanking. The project focusesinitially on two applications: drug development (incollaboration with researchers in the Knowledge

Integration and Discovery Systems group at Glax-oSmithKline) and paediatric oncology (in collab-oration with researchers in the eGenome group atChildren’s Hospital of Philadelphia). These appli-cations, worthwhile in their own right, provideexcellent test beds for broader research effortsin natural language processing and data integra-tion.

Tateisi et al. are developing the GENIA cor-pus as part of a project for building NLP sys-tems for information extraction of biological reac-tions. The corpus is a collection of articles con-cerning the reactions of transcription factors inhuman blood cells, extracted from MEDLINE,which are annotated in XML format. The first focuswas to annotate the actors of biological events interms of an ontology specifically created for theproject; the ontology includes organic and inor-ganic substances, nucleic acids and proteins andalso classes for the place of action such as cellcomponent, tissue or body parts. The authors alsoreported on current developments to annotate part-of-speech, (partial) parse tree structures and co-references.

Demetriou and Gaizauskas reported on thedevelopment of the PASTA system. PASTA, forProtein Active Site Template Acquisition system,is a text mining system for the automatic extrac-tion of information relating to protein structuresfrom the biological literature. A corpus of textsrelevant to the study of protein structures wasassembled. The primary source of informationfor retrieving Medline abstracts relevant to pro-tein structures was the Protein Data Bank (PDB[3]). The papers for all macromolecular struc-tures deposited in the PDB during the years1994–1998 were extracted from Medline. Thefinal corpus consisted of 1514 abstracts, with414 257 words. Texts were annotated with termi-nology class information (e.g. species, regions inprotein structures, secondary structures, residues,etc.) in SGML markup and a number of tem-plates to capture key information about proteinstructure, in the style of MUC (http://www-nlpir.nist.gov/related projects/muc/proceedings/muc 7 toc.html), were defined. The resources thatwere created during the PASTA project are freelyavailable.



Commercial systems (John G. Cleary,Rune Linding, Ulrich Reincke)

In recent years, companies have been developingNLP systems for biology. These efforts are nowbearing fruit and the first commercial systems(although rather narrow in their focus) are enteringthe market.

Smith and Cleary presented the ‘Gene OntologyKnowledge Discovery System’ (GO-KDS), a pub-licly available web application that automaticallyconnects biomedical documents to terms from theGene Ontology, thereby amplifying the potentialof GO to elucidate the knowledge embedded withinbiomedical literature. GO-KDS uses machine learn-ing techniques to infer general semantic modelsfor each GO term from training documents gleanedfrom the references available in public gene/proteindatabases. The expert gives the learning algorithmsome number of documents deemed exemplars of aparticular semantic class. The algorithm identifiesall salient features of the documents and weightsthose that are the best indicators for determiningwhether or not each document is an instance of theconcept being learned. The result is a characteristiccomputer model that can subsequently be used topredict how likely it is that any future novel doc-ument also belongs in that semantic class and toclassify these abstracts to appropriate GO terms.

Linding et al. presented a cooperative workbetween the SAS Institute and the ELM Con-sortium at the European Molecular Biology Lab-oratory on the development of a text mining-application for the automated identification andranking of scientific articles. The ‘topic scoring’engine is based on the SAS Text Miner. The topicscoring engine identifies documents with similarcontent and creates search-profiles to capture thecongruencies of the documents. The topic scoringengine replaces keyword querying of bibliographicdatabases, such as PubMed, with a structured auto-mated process by means of a ‘document-basedretrieval’. This will reduce research time whileimproving the quality of the results. The topic scor-ing engine does not look for a pre-defined vocabu-lary (which is what a search engine would do) buttests, with different types of singular value decom-positions, all possible information resolutions ofthe concepts underlying the text. These profiles aresubsequently applied as filters to new publications.This allows the user to seek publications matching

these profiles without having to submit complexqueries. SAS and EMBL plan to provide this as apublic service to the scientific community after atrial period.

BioCreative: evaluation of text miningsystems (Alexander Yeh, ChristianBlaschke)

Many groups are now working in the area of textmining. However, despite the increased activity inthis area, the absence of common standards andevaluation criteria has resulted in a situation inwhich it is not possible to compare the differentapproaches because the various groups involved areaddressing different problems, often using privatedata sets. As a result, it is impossible to determinehow good the existing systems are and whatperformance can be expected in real applications.

This is similar to the situation in text processingin the early 1990s, prior to the introduction ofthe Message Understanding Conferences (MUC).With the introduction of a common evaluation andstandardized evaluation metrics, it became possibleto compare approaches, to assess which techniquesdid and did not work, and to make progress. Thisprogress resulted in the creation of standard toolsavailable to the general research community. Thefield of biology is ripe for a similar experiment.

Therefore, the BioLINK group (Biological Lit-erature, Information and Knowledge, http://www.pdg.cnb.uam.es/BioLINK/) is organizing a CASP-like evaluation for the text data mining communityapplied to biology. The two main tasks specifi-cally address the currently existing bottlenecks inthe field: (a) the correct detection of gene and pro-tein names in text (named ‘entity detection’); and(b) the extraction of functional information relatedto proteins based on the GO classification system,based on full-text documents.

Acknowledgements

The organizers of the workshop would like to thank espe-cially all the speakers for their presentations, Marc Lightfor co-organizing the event, Steven Leard and the ISCB(The International Society for Computational Biology) fororganizing the logistics and the infrastructure, and the EUprojects ORIEL (Contract number: IST-2001-32688) andTEMBLOR (Contract number: VEQLRT 2001 00015) forfinancial support.



References

1. Ashburner M, Ball CA, Blake JA et al. 2000. Gene ontology:tool for the unification of biology. Nature Genet 25: 25–29.

2. Bairoch A, Apweiler R. 2000. The SWISS-PROT proteinsequence database and its supplement TrEMBL. Nucleic AcidsRes 28: 45–48.

3. Berman HM, Westbrook J, Feng Z et al. 2000. The protein databank. Nucleic Acids Res 28: 235–242.

4. Karp PD, Riley M, Saier M, et al. 2002. The Ecocyc Database.Nucleic Acids Res 30: 56–58.

5. Marcus MP, Santorini B, Marcinkiewicz MA. 1993. Building alarge annotated corpus of English: the Penn treebank. ComputLinguist 19: 313–330.

6. Ogden C, Richards I. 1923. The Meaning of Meaning: A Studyof the Influence of Language upon Thought and of the Scienceof Symbolism. Routledge & Kegan Paul: London.

7. Stevens R, Baker P, Bechhofer S, et al. 2000. TAMBIS:transparent access to multiple bioinformatics informationsources. Bioinformatics 16: 184–186.

8. Wagner RA, Fisher MJ. 1974. The string to string correctionproblem. J Assoc Comput Mach 21: 168–173.


Submit your manuscripts athttp://www.hindawi.com

Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014

Anatomy Research International

PeptidesInternational Journal of


Hindawi Publishing Corporation http://www.hindawi.com

International Journal of

Volume 2014

Zoology


Molecular Biology International

GenomicsInternational Journal of


The Scientific World JournalHindawi Publishing Corporation http://www.hindawi.com Volume 2014


BioinformaticsAdvances in

Marine BiologyJournal of



Signal TransductionJournal of


BioMed Research International

Evolutionary BiologyInternational Journal of



Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttp://www.hindawi.com Volume 2014


Genetics Research International


Advances in

Virolog y

Hindawi Publishing Corporationhttp://www.hindawi.com

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational



Enzyme Research


International Journal of

Microbiology

Conference Review ISMB 2003 Text Mining SIG meeting report

Documents

Transcript of Conference Review ISMB 2003 Text Mining SIG meeting report