Matthew Cockerill, Technical Director, BioMed Central: Text mining and Open Access publishing

Transcript of talk

Page 1: talk

Matthew Cockerill, Technical Director, BioMed Central

Text mining and Open Access publishing

Page 2: talk

March 30th 2004 BioCreative 2004

Summary

What is Open Access publishing?
Open Access publishing and text mining
About BMC Bioinformatics
The BioCreative supplement

Page 3: talk

Summary

What is Open Access publishing?
Open Access publishing and text mining
About BMC Bioinformatics
The BioCreative supplement

Page 4: talk

The current model of publishing scientific research

Scientists carry out research
They write up their results
They submit them to a journal
Other scientists act as peer reviewers and editorial advisers
Finally, the publisher sells access to that research back to the scientific community

Page 5: talk

What’s wrong with this status quo?

Restricted access to scientific research is contrary to the interests of
– the scientists who do the research
– the funders who pay for it
– society as a whole

It is an historical artefact of the economics of print publishing

It is a serious obstacle to mining of full text information

Page 6: talk

BioMed Central
The Open Access publisher

Commercial organization
Published first article in mid-2000
Strict policy of immediate Open Access to all research articles

Page 7: talk

Growth of BioMed Central

[Chart: Open Access research article publications per year, 2000 to 2003; vertical axis 0 to 2000]
[Chart: Fulltext accesses to Open Access articles per year, 2000 to 2003; vertical axis 0m to 5m]

Page 8: talk

Momentum for Open Access

PubMed Central
Public Library of Science
Open Access declarations: Budapest/Bethesda/Berlin
Software open-source movement
Mass cancellation of titles from traditional publishers

Page 9: talk

BioMed Central’s business model for open access publishing

Keep costs down via
– Online submission and peer review
– Automated tools to streamline article processing, conversion and layout
Processing charge (currently $525) for accepted articles
No processing charge for authors at member institutions

Page 10: talk

Institutional membership

More than 400 institutions are members of BioMed Central, including, to name just a few:

CalTech
Cancer Research UK
Columbia University
Cornell University
University of California
Dana-Farber Cancer Institute
Harvard University
INSERM
Imperial College
Institut Pasteur
John Innes Centre
Johns Hopkins University
Kyoto University
Max Planck Institutes
Memorial Sloan-Kettering Cancer Center
MRC Laboratory of Molecular Biology
National Institutes of Health
National Institute for Medical Research
NHS England
Princeton University
Rockefeller University
TIGR
TSRI
Tufts University
Wellcome Trust Sanger Institute
University of Wisconsin
World Health Organization
Yale University

Page 11: talk

Summary

What is Open Access publishing?
Open Access publishing and text mining
About BMC Bioinformatics
The BioCreative supplement

Page 12: talk

Mining the full text

Analysing results of high-throughput experiments means biologists increasingly need text-mining tools

PubMed is currently the primary resource for text mining (“it’s what’s available”) but:
– Abstracts omit critical information
– Techniques developed for abstracts may not effectively use extra information in full text
Fully Open Access corpora, in standard XML formats, will help

Page 13: talk

Data mining - BioMed Central

Entire corpus of full text XML downloadable by ftp as a single zip file

Various groups working with the data
– E.g. Pre-BIND (automatic extraction of possible protein-protein interaction information from full text)
No restrictions on redistribution
This means other groups can use same corpus to repeat and build on results

http://www.biomedcentral.com/info/about/datamining
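
As a rough illustration of working with a corpus delivered as one zip file of XML articles, the sketch below walks an archive and pulls the plain text out of each XML file. It is only a sketch under stated assumptions: the tiny in-memory archive, the file names and the element names (`art`, `ti`, `p`) are invented stand-ins and do not reflect the actual BioMed Central file layout or schema.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def iter_article_texts(zip_bytes):
    """Yield (filename, plain_text) for every XML file in a zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if not name.endswith(".xml"):
                continue  # skip anything that is not an article file
            root = ET.fromstring(archive.read(name))
            # itertext() walks all text nodes, so no element names are assumed here.
            text = " ".join(" ".join(root.itertext()).split())
            yield name, text


def make_demo_corpus():
    """Build a tiny in-memory zip standing in for the downloaded corpus (hypothetical content)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(
            "a1.xml",
            "<art><ti>Kinase study</ti><p>ABC1 binds XYZ2.</p></art>",
        )
        archive.writestr("readme.txt", "not an article")
    return buffer.getvalue()


corpus = dict(iter_article_texts(make_demo_corpus()))
```

With a real download, the bytes of the zip file read from disk would replace `make_demo_corpus()`; because every group can fetch the same archive, the same loop gives every group the same input.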

Page 14: talk

Data mining - BioMed Central (screen shot)

Page 15: talk

Data mining - PubMed Central

Standard NLM archiving/interchange XML DTD: common format across multiple publishers

Only a subset of PubMed Central participating publishers allow download of full text XML
– BioMed Central
– Public Library of Science
Hopefully, more will follow…
XML made available via OAI interface

http://www.pubmedcentral.com/about/oai.html
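
An OAI interface is normally harvested with the OAI-PMH protocol: a `ListRecords` request, followed by further requests that pass back the `resumptionToken` returned with each partial response. The sketch below shows that loop's two building blocks. The base URL and the sample response are hypothetical placeholders, not the PubMed Central service; `oai_dc` is the Dublin Core prefix that the protocol requires every repository to support, while the prefix for full-text XML is repository-specific and is not assumed here.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
BASE_URL = "https://example.org/oai"  # hypothetical endpoint, for illustration only


def list_records_url(base_url, metadata_prefix=None, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token:
        # In OAI-PMH, resumptionToken is an exclusive argument: send it alone with the verb.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def next_token(response_xml):
    """Return the resumptionToken from a response, or None when the list is complete."""
    root = ET.fromstring(response_xml)
    token = root.find(f".//{OAI_NS}resumptionToken")
    return token.text if token is not None and token.text else None


# Invented, heavily shortened response used only to exercise next_token().
SAMPLE_RESPONSE = (
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
    "<ListRecords><resumptionToken>page2</resumptionToken></ListRecords>"
    "</OAI-PMH>"
)

first_url = list_records_url(BASE_URL, metadata_prefix="oai_dc")
follow_up_url = list_records_url(BASE_URL, resumption_token=next_token(SAMPLE_RESPONSE))
```

A harvester would fetch `first_url`, process the records, and keep requesting follow-up URLs until `next_token` returns `None`.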

Page 16: talk

Data mining - PubMed Central

Page 17: talk

Adding structure to full text data

Some examples of useful structure:

1. Structure of article itself (figure legends, materials and methods, references etc.)
2. MathML, CML etc.
3. Disambiguated references to genes/proteins…
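
To make the first kind of structure concrete, the sketch below shows how a text-mining tool might use article structure to read only the materials and methods section, or only the figure legends. The sample document and its element names (`sec`, `title`, `p`, `fig`, `caption`) are assumptions chosen for illustration; a real corpus should be checked against its own DTD.

```python
import xml.etree.ElementTree as ET

# Invented sample article; element names are illustrative assumptions.
SAMPLE = """
<article>
  <sec><title>Background</title><p>Why the work was done.</p></sec>
  <sec><title>Materials and methods</title><p>Cells were lysed.</p><p>Blots were run.</p></sec>
  <fig><caption>Figure 1. Western blot.</caption></fig>
</article>
"""


def section_paragraphs(root, wanted_title):
    """Return paragraph texts of the first section whose title matches (case-insensitive)."""
    for sec in root.iter("sec"):
        title = sec.findtext("title", default="")
        if title.strip().lower() == wanted_title.lower():
            return ["".join(p.itertext()).strip() for p in sec.findall("p")]
    return []


def figure_legends(root):
    """Return the text of every figure caption in the article."""
    return ["".join(c.itertext()).strip() for c in root.iter("caption")]


root = ET.fromstring(SAMPLE)
methods = section_paragraphs(root, "materials and methods")
legends = figure_legends(root)
```

Without this markup, the same selection would have to be guessed from headings in flat text, which is one reason structured full text is easier to mine than abstracts or PDFs.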

Page 18: talk

Authoring tools are key

Manuscript structure
– EndNote, TeX/BibTeX pretty good already

MathML
– Publicon, TeX etc.

CML
– Chemsketch etc.

Gene/protein reference markup?
– Semi-automatic markup during authoring
– Author reviews and confirms markup
– System prompts author to clarify ambiguity
– c.f. grammar checker, code intelligence
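
One way to picture the semi-automatic markup idea is a lexicon lookup that sorts each recognised name into "author just confirms" or "author must clarify". The sketch below is a toy under loud assumptions: the lexicon, the identifiers (`GENE:0001` and so on) and the sample sentence are all invented, and a real tool would need far better name recognition than exact token matching.

```python
import re

# Hypothetical lexicon: surface name -> candidate database identifiers (all invented).
LEXICON = {
    "ABC1": ["GENE:0001"],
    "CAT": ["GENE:0002", "GENE:0003"],  # deliberately ambiguous
}


def propose_markup(text, lexicon):
    """Return (confirm, clarify): mentions with one candidate, and mentions with several."""
    confirm, clarify = [], []
    for match in re.finditer(r"\b[A-Za-z0-9]+\b", text):
        candidates = lexicon.get(match.group())
        if not candidates:
            continue  # token is not a known gene/protein name
        mention = (match.start(), match.group(), candidates)
        if len(candidates) == 1:
            confirm.append(mention)  # author only reviews and confirms
        else:
            clarify.append(mention)  # system prompts author to choose
    return confirm, clarify


confirm, clarify = propose_markup("ABC1 represses CAT expression.", LEXICON)
```

Here `confirm` holds the unambiguous mention and `clarify` holds the one the author would be asked about, much as a grammar checker flags only the passages it is unsure of.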

Page 19: talk

Summary

What is Open Access publishing?
Open Access publishing and text mining
BMC Bioinformatics
The BioCreative supplement

Page 20: talk

BMC series of online journals

BMC Biochemistry
BMC Bioinformatics
BMC Biotechnology
BMC Cell Biology
BMC Chemical Biology
BMC Developmental Biology
BMC Ecology
BMC Evolutionary Biology
BMC Genetics
BMC Genomics
BMC Immunology
BMC Microbiology
BMC Molecular Biology
BMC Neuroscience
BMC Pharmacology
BMC Physiology
BMC Plant Biology
BMC Structural Biology

BMC Anesthesiology
BMC Blood Disorders
BMC Cancer
BMC Cardiovascular Disorders
BMC Clinical Pathology
BMC Clinical Pharmacology
BMC Complementary and Alternative Medicine
BMC Dermatology
BMC Ear, Nose and Throat Disorders
BMC Emergency Medicine
BMC Endocrine Disorders
BMC Family Practice
BMC Gastroenterology
BMC Geriatrics
BMC Health Services Research
BMC Infectious Diseases
BMC International Health and Human Rights
BMC Medical Education
BMC Medical Ethics
BMC Medical Genetics

BMC Medical Imaging
BMC Medical Informatics and Decision Making
BMC Medical Research Methodology
BMC Musculoskeletal Disorders
BMC Nephrology
BMC Neurology
BMC Nuclear Medicine
BMC Nursing
BMC Ophthalmology
BMC Oral Health
BMC Palliative Care
BMC Pediatrics
BMC Pregnancy and Childbirth
BMC Psychiatry
BMC Public Health
BMC Pulmonary Medicine
BMC Surgery
BMC Urology
BMC Women's Health

Page 21: talk

BMC Bioinformatics

Page 22: talk

RSS feeds

Page 23: talk

Open access leads to high visibility

Indexing/Linking
– PubMed
– MEDLINE
– ISI
– BIOSIS
– CAS
– CrossRef
– Scirus
– Open Archives Initiative
– Citebase
– Google

Archiving
– PubMed Central
– INIST
– LOCKSS
– Max Planck
– OhioLINK

Page 24: talk

BMC Bioinformatics - citation impact

[Chart: BMC Bioinformatics, number of articles published and times cited (ISI) per year, 2001 to 2003 (projected); vertical axis 0 to 400]

Page 25: talk

Summary

What is Open Access publishing?
Open Access publishing and text mining
About BMC Bioinformatics
The BioCreative supplement

Page 26: talk

Process for publishing in BMC Bioinformatics supplement

Follow BMC Bioinformatics ‘Research Article’ instructions for authors
Send articles to BioCreative organizers who will coordinate peer review [do not submit articles online]
Supplement passed on to BioMed Central for XML markup and publication
$400 processing charge/article

Page 27: talk

Instructions for authors

Page 28: talk

Access to supplement

All articles in supplement covered by BioMed Central’s Open Access licence agreement
– Free access
– Free re-distribution/re-use

Supplement indexed in PubMed and permanently archived in PubMed Central

Page 29: talk

That’s it