Scholarly communication today

Oxford e-Research Centre and Department of Zoology, University of Oxford, UK

Fifth Conference on Open Access Scholarly

Riga, Latvia, 20 September 2013

The Open Citations Corpus – freeing scholarly citation data

© David Shotton, 2013. Published under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 Licence

David Shotton

Scholarly articles haven't really changed much in 346 years

4th Aug 1666

1st Jan 1888

19th March 2012

Scholarly communication – an analogy

Scholarly communication, at this mid-point in the digital revolution, is in an ill-defined transitional state (a ‘horseless carriage’ state) that lies somewhere between the world of print and paper and the world of the web and computers, with the former still exercising significantly more influence than the latter

We started here:

We’re now here (online): Great – that’s a significant start

. . . but this is really where we need to be!

The importance of citations

What is a citation?

The performative act of citing a published work that is relevant to the current work, typically made by including a reference in a reference list

Why are citations important?

The act of bibliographic citation is central to scholarly communication – bibliographic references are the links that knit together independent scholarship

Citations unify the whole world of scholarship into a giant citation network

Citation networks reveal the development of academic disciplines

Sir Isaac Newton: “If I have seen a little further, it is by standing on the shoulders of Giants”

How is the present situation imperfect?

The present scholarly citation system inadequately exposes the knowledge networks that exist within the scholarly literature, linking papers, authors, funders, research projects and datasets

Citation data are hidden behind subscription firewalls of commercial companies

Academics are not free to use their own citation data as they please

In this Open Access age, it is a scandal that reference lists from journal articles, the core elements of the academic data cycle, are not freely available for use by the scholars who created them

Citation data now need to be recognized as a part of the Commons – those works that are freely and legally available for sharing

Nomenclature and metadata

[Figure: four different items, each labelled “a reference”]

Current citation practice

Well-formed references in reference lists

. . . relate to clearly defined entities

But extreme ambiguity in terminology!

Citing article cito:cites Cited article

c4o:InTextReferencePointer c4o:denotes biro:BibliographicReference

biro:BibliographicReference biro:references Cited article

This is the nomenclature used in our SPAR (Semantic Publishing and Referencing) Ontologies

http://purl.org/spar/

Recommended nomenclature for references and citations

Generic structured metadata required to record a citation

entities: Citing paper and Cited paper, each with

Unique identifier

type, e.g. Journal article

bibliographic metadata, e.g. Title, Publication date

relationship: Bibliographic citation (cito:cites)

provenance: Source of citation info, e.g. CrossRef
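The metadata elements named on this slide (entities with identifier, type and bibliographic metadata, a typed relationship, and provenance) can be sketched as a small data structure. This is only an illustration: the class names, field names and the example DOIs are assumptions made for the sketch, not the OCC's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entity:
    """A citing or cited work (field names are illustrative assumptions)."""
    identifier: str                          # unique identifier, e.g. a DOI
    entity_type: str                         # e.g. "journal article"
    title: Optional[str] = None              # bibliographic metadata
    publication_date: Optional[str] = None   # bibliographic metadata


@dataclass
class Citation:
    """One citation: two entities, a typed relationship, and provenance."""
    citing: Entity
    cited: Entity
    relationship: str = "cito:cites"
    source: str = "CrossRef"                 # provenance of the citation info


def as_triple(citation: Citation) -> tuple:
    """Render the citation as a (subject, predicate, object) triple."""
    return (citation.citing.identifier,
            citation.relationship,
            citation.cited.identifier)


# Hypothetical example data; these DOIs do not refer to real papers.
citing = Entity("doi:10.9999/example.citing", "journal article",
                "A hypothetical citing paper", "2013-09-20")
cited = Entity("doi:10.9999/example.cited", "journal article")
citation = Citation(citing, cited)
print(as_triple(citation))
```

Keeping provenance on the citation itself, rather than on either entity, reflects the slide's point that the source of the citation information is recorded separately from the bibliographic metadata.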

The Open Citations Corpus

The original Open Citations Corpus

An open repository of bibliographic citation data created in 2011

available at http://opencitations.net

Created with JISC funding of the Open Citations Project

project blog: http://opencitations.wordpress.com/

Originally populated with ~6.4 million individual references from the reference lists of ~200,000 articles in the Open Access Subset of PubMed Central (as of January 2011)

These references point to >3 million unique papers

~20% of all PubMed papers published between 1950 and 2010, including all the highly cited papers in every biomedical field

Multiple citations of the same well-cited papers permitted us to perform error correction of the harvested citations (approx 1% erroneous)

These citations are encoded as Linked Open Data using the SPAR ontologies, and are freely available under a CC0 waiver from http://opencitations.net/data/

Viewing citation networks at http://opencitations.net

The outward citation network of Reis et al. (2008)

Limitations of the original Open Citations Corpus

A snapshot in time of the citation data in PubMed Central as of January 2011, becoming increasingly out of date

Contains references from open access articles only

Limited to the biomedical domain

Expanding the Open Citations Corpus

Expanding the Open Citations Corpus - Objectives

Redesign the OCC data model

Update the current ingest

Increase the domain coverage

Include reference lists from subscription-access journals

Harvest references on a continuing basis, as articles are published

Improve the user interface and the user experience

Publish the citation data both in BibJSON and in RDF as Linked Open Data

Build added value services over the citation data

Redesigning the Open Citations Corpus data model

Three record types: Entity Records, Personal Records and Citation Records

A clear separation is made between potentially erroneous citation information 'as received' in text strings from article reference lists

ReferenceTextRecords containing NameTextRecords (of authors, editors)

and authoritative bibliographic metadata derived from trustworthy sources such as CrossRef, PubMed and the web pages of published articles

BibliographicRecords and PersonalRecords (of authors, editors)

A distinction is also made between an UnmatchedCitationRecord, where no BibliographicRecord exists within the OCC for the cited entity, and a MatchedCitationRecord, where the cited entity has a BibliographicRecord within the OCC

A unique internal identifier is created for each OCC record

Provenance information details the source of each citation, the date it was acquired, its format, and the name of the curator responsible for its ingestion

Reconfiguring the Open Citations Corpus

Underlying technical implementation being revised

Bibliographic information encoded in BibJSON

Data stored in BibServer, which handles BibJSON natively

Data from different sources brought into a common BibJSON format as soon as possible

Processing the whole ingest from either source takes over 24 hours

Work still to be done on the ingest pipeline, since the parsing of citation information from the reference list entries is not yet 100% accurate

Matching citation strings to bibliographic records

When a new reference has been extracted from a reference list, a ReferenceTextRecord is created for the citation target, and an UnmatchedCitationRecord is created between the BibliographicRecord of the citing paper and the citation target’s ReferenceTextRecord

The ReferenceTextRecord is then compared with existing BibliographicRecords

If a match is found, a new MatchedCitationRecord is created within the OCC between the BibliographicRecords of the citing and cited entities, and the pre-existing UnmatchedCitationRecord between the citing and cited entities is deprecated

Similarly, a new NameTextRecord is created for each author and editor named in the new ReferenceTextRecord, and the OCC is then searched for matches to existing PersonalRecords within the OCC
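The matching steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the OCC implementation: plain dictionaries stand in for the record types, the function names are hypothetical, and the match rule (a known title appearing inside the normalised reference string) is an assumption chosen only to keep the example short.

```python
import re


def normalise(text):
    """Lower-case and strip punctuation so trivial differences are ignored."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()


def ingest_reference(citing_id, reference_text, bibliographic_records, citation_records):
    """Record an unmatched citation, then upgrade it if a known record matches.

    bibliographic_records: dict of record id -> title (stand-in for BibliographicRecords)
    citation_records: list of dicts (stand-in for the OCC citation store)
    """
    unmatched = {"kind": "UnmatchedCitationRecord", "citing": citing_id,
                 "reference_text": reference_text, "deprecated": False}
    citation_records.append(unmatched)

    wanted = normalise(reference_text)
    for record_id, title in bibliographic_records.items():
        key = normalise(title)
        if key and key in wanted:
            matched = {"kind": "MatchedCitationRecord", "citing": citing_id,
                       "cited": record_id, "deprecated": False}
            unmatched["deprecated"] = True   # kept for provenance, no longer current
            citation_records.append(matched)
            return matched
    return unmatched


# Hypothetical example data.
known = {"bib:1": "The structure of DNA: a review"}
store = []
result = ingest_reference(
    "bib:42",
    "Smith J (2001) The Structure of DNA: A Review. J Hypothetical Biol 1:1-10.",
    known, store)
print(result["kind"], len(store))
```

The unmatched record is flagged rather than deleted, mirroring the slide's wording that the pre-existing UnmatchedCitationRecord is deprecated.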

Citation error correction

Examples of errors in reference list entries vary

from the trivial: a non-English name with incorrect accents, or an article title containing “beta” instead of the correct “β”

to the serious: two papers in the same reference list with the same DOI

Such errors can be detected by comparing a new ReferenceTextRecord with pre-existing BibliographicRecords, and a new NameTextRecord with pre-existing PersonalRecords

Where there are several OCC ReferenceTextRecords referencing the same multiply-cited paper for which an authoritative OCC BibliographicRecord does not yet exist, we use voting algorithms for reference disambiguation and error correction, enabling the creation of a reliable BibliographicRecord for that entity even when we can find no external authority to provide it

In future, we wish to offer an automated OCC reference correction service to third parties such as authors and journal editors, enabling them to spot and correct errors in the reference lists of submitted papers before publication
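The voting idea mentioned above can be illustrated with a simple per-field majority vote. The slides do not say which voting algorithms the OCC uses, so this is an assumed, simplified stand-in; the function names and the example records are hypothetical.

```python
from collections import Counter


def vote(values):
    """Return the most common non-empty value, or None if there is none."""
    counts = Counter(v for v in values if v)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def consensus_record(reference_records):
    """Build one record by majority vote on each field across all versions."""
    fields = sorted({key for record in reference_records for key in record})
    return {field: vote(record.get(field) for record in reference_records)
            for field in fields}


# Three hypothetical versions of the same multiply-cited reference.
versions = [
    {"title": "Growth factor β signalling", "year": "1998"},
    {"title": "Growth factor beta signalling", "year": "1998"},
    {"title": "Growth factor β signalling", "year": "1989"},
]
best = consensus_record(versions)
print(best)
```

A plain majority vote only works when the correct form is also the most frequent one; a real service would presumably weight sources and handle ties explicitly.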

New relationship types in the Open Citations Corpus

Entity type relationships

The nature of the source entity and that of the target entity (e.g. journal article, book, dataset) are separately recorded in the OCC. We can thus infer the nature of each entity type relationship, for example:

Article-to-article bibliographic citation

Article-to-database data citation

Data_repository-to-article bibliographic citation

Relationships other than bibliographic citations

Additional relationship types between entities in the OCC may be encoded using CiTO, the Citation Typing Ontology, if that information is available:

Citation :EntityA cito:cites :EntityB .

Shared authorship :EntityA cito:sharesAuthorsWith :EntityB .

Common funding :EntityA cito:sharesFundingAgencyWith :EntityB .

Common institution :EntityA cito:sharesAuthorInstitutionWith :EntityB .

Related :EntityA dcterms:relation :EntityB .
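Inferring the entity type relationship from the separately recorded source and target types might look like the sketch below. The rule used here (a citation is a data citation when the target is a database or dataset, otherwise a bibliographic citation) is an assumption made for illustration, as are the names DATA_TARGET_TYPES and relationship_label.

```python
# Target types treated as data rather than literature (an assumption).
DATA_TARGET_TYPES = {"database", "dataset"}


def relationship_label(source_type, target_type):
    """Derive a label such as 'Article-to-database data citation'."""
    target = target_type.lower()
    kind = "data citation" if target in DATA_TARGET_TYPES else "bibliographic citation"
    return f"{source_type.capitalize()}-to-{target} {kind}"


for source, target in [("article", "article"),
                       ("article", "database"),
                       ("data_repository", "article")]:
    print(relationship_label(source, target))
```

Because the label is derived on demand, only the two entity types need to be stored with each record.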

Expansion of the Open Citations Corpus coverage

Ingest from the Open Access Subset of PubMed Central is being updated from ~200,000 articles in Jan 2011 to the current ~658,000 articles in September 2013

Domain coverage is being expanded to include the physical sciences and mathematics, by the ingest of the reference lists from all ~872,000 preprints in the arXiv preprint repository at Cornell University Library

This will bring the total number of references from ~6.4 million to ~40 million

We then intend to ingest all the references in CiteSeer and from Wikipedia, marking these with clear provenance information

To this we will add citations from data repositories such as Dryad, which contain literature references associated with the datasets they hold, and from DataCite, which issues DOIs for datasets and harvests metadata that contain literature references

Citations from heritage literature – ‘The Future of the Past’

Funding application just submitted to harvest references from the pre-digital biodiversity / biological taxonomy literature, where papers have lasting value

We will use the Biodiversity Heritage Library (http://www.biodiversitylibrary.org/) as a source of references

David King, a text mining colleague at the Open University, will use advanced text mining techniques to dig references out of ‘dirty’ OCR’d page images

We will then ingest these data into the Open Citations Corpus and make them freely available

This will be the only source of digital citation data from a major fraction of the world's heritage literature in the field of biodiversity / biological taxonomy, which is simply not available in digital form anywhere else

Additional citations from PubMed Central

There are ~2.2 million articles in PubMed Central that are not part of the Open Access Subset, presently missing from the Open Citations Corpus

These contain citations not only to other papers, but also to datasets, typically in the form of database accession numbers, buried within the full text or footnotes

Recent text mining initiatives undertaken by Europe PubMed Central (EPMC) have extracted both the bibliographic citations and the data citations from all ~2.8 million PubMed Central articles, which are now freely available

We propose to ingest all these EPMC literature and data citations into the expanded and improved Open Citations Corpus

This will increase the number of PMC articles for which the OCC holds citation information by about 330%

In addition, it will further expand the nature of the citation data held to include the data citations contained within these PMC articles

However, these are just a fraction of the total scholarly citations, most of which are locked behind the pay walls of commercial providers

http://europepmc.org/
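As a quick check of the "about 330%" figure, the increase can be recomputed from the approximate article counts quoted in these slides (~658,000 in the Open Access Subset, ~2.2 million outside it, ~2.8 million in total). The variable names are illustrative, and since the inputs are rounded estimates the result is only indicative.

```python
oa_subset_articles = 658_000      # Open Access Subset, September 2013
non_oa_articles = 2_200_000       # PMC articles outside the Open Access Subset
all_pmc_articles = 2_800_000      # all PMC articles mined by Europe PMC

# Relative increase, computed two ways from the rounded figures.
increase_from_total = (all_pmc_articles - oa_subset_articles) / oa_subset_articles * 100
increase_from_added = non_oa_articles / oa_subset_articles * 100

print(f"From the total: about {increase_from_total:.0f}%")
print(f"From the added articles: about {increase_from_added:.0f}%")
```

Both routes give a value within a few percentage points of 330%, so the slide's figure is consistent with the counts quoted.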

Reference lists from subscription-access articles

All fully open access publishers already publish article reference lists openly

I am working to persuade other major scholarly publishers to do the same, i.e. to put article reference lists outside the subscription pay-wall, in the same way as abstracts and bibliographic metadata are freely available

Last January, I published an Open Letter to Publishers requesting this

Claire Redhead kindly distributed it to all OASPA members

The letter is available at http://imageweb.zoo.ox.ac.uk/pub/2013/letters/Letter_to_all_scholarly_journal_publishers_re_open_citations.pdf

A number of leading STM publishers have expressed their willingness to open the reference lists from subscription-access journal articles

Nature, Science, Taylor & Francis, Royal Society Publishing, Portland Press, MIT Press and Oxford University Press are among the first

another has expressed willingness verbally, but has yet to commit formally

http://opencitations.wordpress.com

Opening article reference lists via CrossRef

How can these be ingested into the Open Citations Corpus?

Most publishers already submit their reference lists to CrossRef as part of its CitedBy Linking Service

If you do not at present, you should use this free service!

With the publisher’s permission, CrossRef can enable reference lists to be ‘opened’

on a publisher-by-publisher basis based on DOI prefixes

on a journal-by-journal basis

on an article-by-article basis for hybrid journals

References are then available via the CrossRef API for ingest into the OCC

However, because the default CrossRef CitedBy Linking Service agreement is not to publish reference lists, even Open Access publishers must specifically inform CrossRef that the reference lists of their journal articles should be open

Geoff Bilder has a new CrossRef Metadata Best Practice Document that I will circulate, explaining how to specify this choice in your article metadata.
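Pulling opened reference lists out of a CrossRef response might look like the sketch below. The slides do not describe the API itself; the URL pattern and the JSON field names ("message", "reference", "DOI", "unstructured") follow the present-day Crossref REST API as I understand it and should be treated as assumptions to check against the current documentation. The sample payload and its DOIs are invented, and no network call is made.

```python
from urllib.parse import quote

API_ROOT = "https://api.crossref.org/works/"   # assumed present-day endpoint


def works_url(doi):
    """Build the metadata URL for one DOI (slashes in the DOI are kept)."""
    return API_ROOT + quote(doi)


def extract_references(payload):
    """Return the open references in a works response, if any were deposited."""
    references = payload.get("message", {}).get("reference", [])
    return [{"doi": ref.get("DOI"), "text": ref.get("unstructured")}
            for ref in references]


# Invented sample shaped like a works response; not real data.
sample = {"message": {"DOI": "10.9999/example.citing", "reference": [
    {"key": "ref1", "DOI": "10.9999/example.cited"},
    {"key": "ref2", "unstructured": "Hypothetical A (2001) An example title."},
]}}

print(works_url("10.9999/example.citing"))
for reference in extract_references(sample):
    print(reference)
```

An article whose references have not been opened would simply yield an empty list here, which is why the lookup uses defaults rather than assuming the field is present.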

Summary - Benefits of the Open Citations Corpus

Created by scholars for scholars using scholarly data

No profit motive constraining free publication of the data

Will bring particular benefit to those who are NOT members of First World academic institutions whose libraries subscribe to commercial citation data from Thomson-Reuters or Elsevier

Will provide integrated access to citation data from a variety of sources, both inside and outside traditional scholarly publishing, with provenance information

Data are semantically described using the SPAR bibliographic ontologies

Citations thus become part of the Web of Linked Open Data

Data available in a variety of formats including BibJSON, BibTeX and RDF for download by third parties for their own use or to build into cool services

indexing, search and browse (in prototype)

timeline visualizations (in prototype)

analysis of citation networks, co-authorship networks, etc.

trend identification, recommendation services, etc.

Sustainability

The development of the Open Citations Corpus has been enabled by short-term grant funding, but this does not provide a sustainable financial model

For the future, we seek one of the following long-term arrangements:

Adoption by a major institutional or national library

Adoption by a publishing organization such as CrossRef, with indirect support from publishers

Direct support by the scholarly publishing community

Social investment, i.e. the provision of capital to generate social as well as financial returns, to support open access to scholarly information

Income support by charging for added-value services over the open data

I would be grateful for your views on the value of the Open Citations Corpus and the manner in which its ongoing development might be supported

Acknowledgements and thanks

Alex Dutton, who developed the original Open Citations Corpus

Richard Jones, Martyn Whitwell and Mark MacGillivray of Cottage Labs, who have undertaken more recent development work

Silvio Peroni, my colleague in developing the suite of SPAR (Semantic Publishing and Referencing) Ontologies

The JISC, who have funded the development of the Open Citations Corpus
