Download - surechembl poster - Copy - OpenPHACTS · 2016. 2. 22. · are automated options, including SureChEMBL, which is available via the Open PHACTS discovery platform. But how reliable


[Figure: bar chart of the number of compounds (axis 0 to 450) for Reaxys, SureChEMBL and IBM-SIIP; series "SureChEMBL and IBM set" and "SureChEMBL only set"; data labels 64.2%, 55.7%, 54.5%]

[Figure: bar chart of the number of compounds (axis 0 to 3000) for SciFinder, SureChEMBL and IBM-SIIP; series "All compounds" and "Biologically relevant"; data labels 59.0%, 50.6%, 60.8, 49.4]

Comparison of automated and manual patent chemistry extraction methods

Luca Bartek*, Stefan Senger, George Papadatos, Anna Gaulton

Introduction

As new chemical entities are often first published in patents, and some new compounds may not even be featured elsewhere, patents have become an important source of information for researchers.

With more and more patents granted each year, it becomes increasingly difficult to extract the chemistry manually. There are automated options, including SureChEMBL, which is available via the Open PHACTS discovery platform. But how reliable are they compared to manually curated sources? We looked at the following use cases:

Use case 1:

In the second comparison, we used the 1740 unique patent-compound pairs we had retrieved from Reaxys. We checked how many of these patent-compound pairs could also be found in SureChEMBL and IBM SIIP, respectively.
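The pair-level comparison can be illustrated with a minimal sketch, assuming each database is reduced to a set of (patent, compound) tuples. The function name, patent numbers and compound identifiers below are hypothetical and are not taken from the poster.

```python
def pair_recall(reference_pairs, candidate_pairs):
    """Fraction of reference (patent, compound) pairs also present in candidate_pairs."""
    reference = set(reference_pairs)
    if not reference:
        raise ValueError("reference set is empty")
    found = reference & set(candidate_pairs)
    return len(found) / len(reference)


# Toy data (hypothetical identifiers): the manually curated reference set...
reaxys_pairs = {
    ("US-1", "cmpd-A"),
    ("US-1", "cmpd-B"),
    ("WO-2", "cmpd-A"),
    ("EP-3", "cmpd-C"),
}
# ...and an automatically generated set, which may also contain extra pairs.
surechembl_pairs = {
    ("US-1", "cmpd-A"),
    ("WO-2", "cmpd-A"),
    ("US-9", "cmpd-Z"),
}

recall = pair_recall(reaxys_pairs, surechembl_pairs)
print(f"{recall:.1%} of the reference pairs were found")
```

Pairs that appear only in the automated set do not affect this figure; it measures coverage of the reference set, not precision.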

Another interesting question is the source of the compounds: whether they are present in the patent as text, structural depictions or Markush structures. When we compared the subset of WO patents for which images were recognised to those for which they were not, we found only a 7.7% increase in efficiency, which is less than we had expected. This could be because automated systems currently have no way of recognising Markush structures, which are in fact very common in the patent literature.

In our binary comparison, we found that chemicals with a higher patent corpus count were much more likely to be found in either of the automatically created databases. This was in line with our expectations. However, one might argue that “unique” compounds, that is, those with a low corpus count, are more relevant.

We also found a vast difference between the success rates for single- and multi-component compounds. For single-component structures, the success rate of SureChEMBL was over 80%, while for compounds containing more than two components it was 0%.
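A breakdown by component count could be sketched as follows, assuming each compound is available as a SMILES string, where a dot separates disconnected components. How components were actually counted for the poster is not stated; the function names and toy records are hypothetical.

```python
from collections import defaultdict


def component_count(smiles):
    # Assumption: every "." separates two disconnected components.
    # (Ring-closure digits that bond across a dot are rare and ignored here.)
    return smiles.count(".") + 1


def success_rate_by_components(records):
    """records: iterable of (smiles, found) -> {n_components: fraction found}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for smiles, found in records:
        n = component_count(smiles)
        totals[n] += 1
        hits[n] += int(found)
    return {n: hits[n] / totals[n] for n in sorted(totals)}


# Toy records: (SMILES, whether the compound was found in the automated database).
records = [
    ("CCO", True),                 # one component
    ("c1ccccc1", True),            # one component
    ("CC(=O)O", False),            # one component
    ("CC(=O)[O-].[Na+]", True),    # two components (a salt)
    ("CCN.Cl", False),             # two components
    ("NCCN.Cl.Cl", False),         # three components
]

rates = success_rate_by_components(records)
print(rates)
```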

We noticed that the highest success rate was achieved with US patents, so we extended the search to patent families to examine whether alternative patent numbers could improve the results. After retrieving all US, WO and EP patent family members of the patents retrieved from Reaxys (this was done using SureChEMBL), we found only a moderate increase in the success rate of both SureChEMBL and IBM SIIP.
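One way to picture the family extension is the sketch below: a reference pair counts as found if the compound is linked to the original patent or to any of its family members. The mapping `families`, the function name and all identifiers are hypothetical; the poster does not describe the matching logic in this detail.

```python
def found_with_family(pair, database_pairs, families):
    """True if the compound is linked to the patent or to any of its family members."""
    patent, compound = pair
    members = families.get(patent, set()) | {patent}
    return any((member, compound) in database_pairs for member in members)


# Toy data (hypothetical): WO-1 has a US and an EP family member.
families = {"WO-1": {"US-7", "EP-4"}}
reference_pairs = [("WO-1", "cmpd-A"), ("US-2", "cmpd-B")]
database_pairs = {("US-7", "cmpd-A")}

direct_hits = sum(pair in database_pairs for pair in reference_pairs)
family_hits = sum(
    found_with_family(pair, database_pairs, families) for pair in reference_pairs
)
print(direct_hits, family_hits)
```

In this toy example the direct lookup finds neither pair, while the family-aware lookup recovers the first pair through its US family member.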

Conclusion

On average, 50-66% of the content of the “gold standard” manually curated patent chemistry databases can also be found in automatically generated databases. The latter are also freely available; for example, SureChEMBL will soon be available through the Open PHACTS API (http://dev.openphacts.org). IBM SIIP is also freely available, but it is a static database covering patents until 2010, whereas SureChEMBL is updated daily.

References

1. Senger et al., J. Cheminf., 2015, 7:49
2. Akhondi et al., PLoS One, 2014, 9:e107477
3. http://www.uspto.gov/

[Figure: USPTO grants per year, 1963 to 2013 (axis 0 to 350,000)]

[Diagram: PATENT → COMPOUND 1, COMPOUND 2, COMPOUND 3]

[Diagram: COMPOUND → PATENT 1, PATENT 2, PATENT 3]

Use case 2:

PATENT → COMPOUNDS

[Flowchart: 46 patents (Akhondi et al.) → SciFinder compounds (manually curated); SureChEMBL compounds and IBM SIIP compounds (automatically generated)]

COMPOUND → PATENTS

[Flowchart: Maybridge Hitfinder Collection (# heavy atoms > 19, mw < 500; 9274 compounds) → Reaxys patents (for 543 compounds), at least 1 US, WO or EP patent → incorrect or ambiguous → “SureChEMBL only” set; “SureChEMBL&IBM” set, 438 compounds]

PATENT → COMPOUNDS

When stereochemistry was removed, the results improved somewhat, with SureChEMBL returning 64.9% of the SciFinder molecules. This number was 67.1% for the “Biologically annotated” subset.
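The poster does not say how stereochemistry was removed. One common approach, shown here purely as an assumption, is to compare only the first block of the standard InChIKey, which hashes the connectivity and therefore ignores stereochemistry (and isotopes). The keys below are made-up placeholders in InChIKey format, except that `UHFFFAOYSA` is the standard second block for a structure without stereo or isotope information.

```python
def connectivity_block(inchikey):
    # The first 14 characters of a standard InChIKey hash the connectivity layer.
    return inchikey.split("-")[0]


def recall(reference_keys, candidate_keys, ignore_stereo=False):
    """Fraction of reference structures that also occur among the candidates."""
    normalise = connectivity_block if ignore_stereo else (lambda key: key)
    reference = {normalise(key) for key in reference_keys}
    candidates = {normalise(key) for key in candidate_keys}
    return len(reference & candidates) / len(reference)


# Placeholder keys (not real compounds): the first structure is stored with
# stereochemistry in the reference set and without it in the candidate set.
reference_keys = {
    "AAAAAAAAAAAAAA-BBBBBBBBSA-N",
    "CCCCCCCCCCCCCC-DDDDDDDDSA-N",
}
candidate_keys = {
    "AAAAAAAAAAAAAA-UHFFFAOYSA-N",
    "CCCCCCCCCCCCCC-DDDDDDDDSA-N",
}

exact = recall(reference_keys, candidate_keys)
flattened = recall(reference_keys, candidate_keys, ignore_stereo=True)
print(exact, flattened)
```

Note that with `ignore_stereo=True` stereoisomers in the reference set collapse into one entry, so the denominator can shrink as well as the numerator grow.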

COMPOUND → PATENTS

The first comparison we performed was of a binary nature: we looked at whether a compound was found at all in the SureChEMBL and IBM SIIP databases.

Of the 438 compounds, 67.1% were found in at least one of the two databases: 52.7% were found in both, 2.9% only in IBM SIIP and 11.4% only in SureChEMBL.
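The three categories are disjoint, so they should add up to the "at least one" figure: 52.7 + 2.9 + 11.4 = 67.0, which agrees with the reported 67.1% up to rounding. A minimal sketch of such a breakdown with plain sets follows; the function names and toy identifiers are hypothetical.

```python
def coverage_breakdown(compounds, db_a, db_b):
    """Split a reference compound set into both / only-A / only-B / neither."""
    compounds = set(compounds)
    in_a = compounds & set(db_a)
    in_b = compounds & set(db_b)
    return {
        "both": in_a & in_b,
        "only_a": in_a - in_b,
        "only_b": in_b - in_a,
        "neither": compounds - in_a - in_b,
    }


def as_percentages(breakdown):
    # The four groups partition the reference set, so their sizes sum to its size.
    total = sum(len(group) for group in breakdown.values())
    return {name: 100 * len(group) / total for name, group in breakdown.items()}


# Toy data: ten reference compounds, two partly overlapping databases.
reference = {f"cmpd-{i}" for i in range(10)}
surechembl = {f"cmpd-{i}" for i in range(0, 6)}   # cmpd-0 .. cmpd-5
ibm_siip = {f"cmpd-{i}" for i in range(4, 8)}     # cmpd-4 .. cmpd-7

groups = coverage_breakdown(reference, surechembl, ibm_siip)
shares = as_percentages(groups)
found_in_either = shares["both"] + shares["only_a"] + shares["only_b"]
print(shares, found_in_either)
```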

[Figure: bar chart of the number of patent-compound pairs (axis 0 to 1800) for Reaxys, SureChEMBL and IBM-SIIP; data labels 61.6% and 59.3%]

PATENT → COMPOUNDS

The roughly 60% efficiency of SureChEMBL would most likely seem low to a researcher who expects every single compound of interest to be extracted from each patent. This is why it is surprising that the coverage was not greatly increased for “Biologically annotated” molecules. But which compounds are of interest?

SureChEMBL returned nearly 5 times more compounds than SciFinder.

What is noise?


*  [email protected]

The research leading to these results has received support from the Innovative Medicines Initiative Joint Undertaking under grant agreement n° 115191, resources of which are composed of financial contribution from the European Union's Seventh Framework Programme (FP7/2007-2013) and EFPIA companies’ in kind contribution.