Deuterogate: Causes and consequences of automated...

www.guidetopharmacology.org
Deuterogate: Causes and consequences of automated extraction of patent-specified virtual deuterated drugs feeding into PubChem
Christopher Southan, IUPHAR/BPS Guide to PHARMACOLOGY, Center for Integrative Physiology, University of Edinburgh
ACS Boston CINF session: Enabling Machines to "Read" the Chemical Literature: Techniques
http://www.slideshare.net/cdsouthan/causes-and-consequences-of-automated-extraction-of-patentspecified-virtual-deuterated-drugs

Transcript of Deuterogate: Causes and consequences of automated...

Page 1: Deuterogate: Causes and consequences of automated ...bulletin.acscinf.org/PDFs/250nm/2015-fall_CINF56.pdf · SciFinder results indicate invention by consortium • SciFinder facilitated



Abstract

The strategy of deuterating drugs to improve clinical profiles via the kinetic isotope effect has been known for over 50 years. However, recent development candidates have been predicated on a surge of opportunistic patent filings between 2008 and 2011. These filings present particular challenges for automated chemical named entity recognition (CNER), which are investigated in this work by comparing the sources of the 80K deuterated compounds in PubChem. Of these, 45K originate from the patent CNER submissions of SCRIPDB, IBM and SureChEMBL, plus 23K from Thomson Pharma via manual expert curation (MEXC). For CNER there are three options: image extraction, recognition of [2H] in IUPAC text forms, or Complex Work Unit (CWU) molfiles obtained from the USPTO. For images, conversion to structures using OSRA with explicit H and D positions failed. Tests with chemicalize.org and OPSIN established that the text form “deuterio” did convert. The SureChEMBL pipeline also handles the “dx” prefix (e.g. methyl-d3). These tests, combined with inspection of SureChEMBL export records, confirmed that deuteration feeding into PubChem from patents was predominantly image-only derived. It was also clear that CWUs had provided the majority of these via molfiles. However, despite conceptually similar pipelines, the three CNER sources showed divergent capture. Importantly, inspection of patents from the three major applicants in the deuteration IP Gold Rush indicated little reduction to practice. The unexpected consequence is that most of the ~25K derivatives in PubChem of ~500 established drugs are virtual (i.e. the structures do not exist). This Achilles heel of CNER will be discussed, since it presents database users with a dilemma: virtual swamping but possible IP significance on the one hand, versus the permanent absence of linked bioactivity data on the other.


Introduction


Dalbavancin

FDA approved May 2014


SciFinder extraction


US20090062182: Deuterium-enriched dalbavancin


Protia portfolio


OSRA fails on explicit “D-” image-to-structure conversion


The extraction problem for deuts

• The majority of patents are image-only, so there is no conversion
• IUPAC specification of “deutero” and “deuterio” is rare, but OPSIN, SureChEMBL and chemicalize.org will do the name-to-structure conversion
• Thomson (Derwent) and SciFinder draw them in manually for conversion
• SureChEMBL, SCRIPDB and IBM use the Complex Work Units from the USPTO
• These include the molfiles drawn by the contractors and are the major source of deuteration in PubChem
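As a rough illustration of the text-recognition route mentioned above, the following Python sketch flags deuterium markers in a chemical name or a SMILES string. The regular expressions and the function names `mentions_deuterium` and `count_deuterium_atoms` are assumptions made for this example; they are not the rules used by OPSIN, SureChEMBL or chemicalize.org, which parse names with full grammars rather than pattern matching.

```python
import re

# Illustrative patterns only (assumptions, not the rules of any real pipeline).
NAME_PATTERNS = [
    re.compile(r"deuteri?o", re.IGNORECASE),  # "deuterio" or "deutero" in a name
    re.compile(r"-d\d+\b"),                   # "dx" style, e.g. "methyl-d3"
    re.compile(r"[\[(]2H\d*[\])]"),           # isotope descriptor, e.g. "(2H3)" or "[2H]"
]
SMILES_DEUTERIUM = re.compile(r"\[2H\]")      # explicit deuterium atom in SMILES


def mentions_deuterium(text: str) -> bool:
    """Return True if any of the illustrative deuterium markers occurs in text."""
    return any(pattern.search(text) for pattern in NAME_PATTERNS)


def count_deuterium_atoms(smiles: str) -> int:
    """Count explicit [2H] atoms in a SMILES string (no chemistry validation)."""
    return len(SMILES_DEUTERIUM.findall(smiles))


if __name__ == "__main__":
    for text in ["trideuteriomethyl", "methyl-d3", "[2H]C([2H])([2H])O", "codeine"]:
        print(text, mentions_deuterium(text), count_deuterium_atoms(text))
```

A pattern match of this kind is deliberately crude: the "-d3" pattern would also fire on a name such as "vitamin-d3", and nothing here can help with image-only patents, which is the situation the slide describes for most filings.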


Codeine: the enumeration record from US20080045558

The left panel shows a section from one of approximately 55 pages of images. The right panel shows the first three examples from the 520 in the intersect between the 992 CIDs retrieved via the patent number and the 551 from “Same, Connectivity” for codeine (CID 5284371), ranked by molecular weight. Thomson Pharma extracted only three examples from this patent.


SureChEMBL indexing


The first structure in the list, SCHEMBL12905541, corresponds to CID 237918906, which has merged the SureChEMBL SID 237918906 with SCRIPDB SID 141460523.


Deuterated source splits


Source divergence in deuteration capture

Thomson Pharma (TRP), SCRIPDB (SCR) and SureChEMBL (SCH) show an approximate three-way split, with their union of 64195 covering 81% of PubChem deuteration (77882 as of March 2015).
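The kind of comparison behind a source split like this can be sketched with plain set operations. The Python example below uses invented toy identifiers, not the real PubChem CID lists, and the helper name `capture_summary` is hypothetical. It reports the union, the compounds unique to each source, the compounds shared by all, and the fraction of a reference set covered by the union.

```python
def capture_summary(sources, universe):
    """Summarise how several submitters cover a reference set of identifiers.

    sources:  dict mapping a source label to an iterable of compound IDs
    universe: iterable of all IDs in the reference collection
    """
    sets = {name: set(ids) for name, ids in sources.items()}
    universe = set(universe)

    union = set().union(*sets.values())
    shared_by_all = set.intersection(*sets.values())

    unique = {}
    for name, ids in sets.items():
        # Everything submitted by any other source
        others = set().union(*(v for k, v in sets.items() if k != name))
        unique[name] = ids - others

    coverage = len(union & universe) / len(universe)
    return {
        "union": union,
        "shared_by_all": shared_by_all,
        "unique": unique,
        "coverage": coverage,
    }


# Toy data for illustration only (assumed values, not taken from PubChem)
toy_sources = {
    "TRP": {1, 2, 3, 4},
    "SCR": {3, 4, 5, 6},
    "SCH": {4, 6, 7, 8},
}
toy_universe = set(range(1, 11))
summary = capture_summary(toy_sources, toy_universe)

if __name__ == "__main__":
    print(summary)
```

With real data the inputs would be the CID lists retrieved per source; how duplicates and retrieval dates are handled will affect the coverage figure.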


Propagation: UniChem indexing


Deuteration over time: patent surge in Thomson Pharma

TRP deuteration in PubChem on a per-year basis (left-hand vertical axis and hatched bars), with patent publication dates taken from the USPTO for Auspex, Concert and Protia combined (right-hand vertical axis and solid lines with triangles).


Picking off drug structures


SciFinder results indicate invention by consortium

• SciFinder facilitated certain queries orthogonal to PubChem (e.g. assignee query for substances)
• 19841 isotopic substances were derived from 165 Auspex patents
• Concert 6766 from 189
• Protia 1959 from 252
• Remarkably, the substance union query gave 28076, with an intersect of only 30 (deuteration reagents)
• This means the assignees somehow contrived to divide up ~600 drug filings (i.e. to avoid each other's claims)


Consequences and problems of virtual deuteration

• Classic case of unintended consequences
• Confounding drug analogue searching
• Breaking the PubChem unofficial rule of extant-only compounds
• Extant and virtual structures cannot be computationally separated
• Secondary submitters cause intra-PubChem proliferation
• Persistence as no-data entries
• Proliferation between open databases
• Both commercial sources of patent chemistry and source aggregation projects within pharmaceutical companies will be affected
• Annotation can be confounded (e.g. the attribution of biological study in SciFinder)
• Equivocal IP situation
