surechembl poster - Copy - OpenPHACTS · 2016. 2. 22.



[Figure: bar chart of the number of compounds (axis 0 to 450) for Reaxys, SureChEMBL and IBM-SIIP; series "SureChEMBL and IBM set" and "SureChEMBL only set"; bar labels 64.2%, 55.7%, 54.5%.]

[Figure: bar chart of the number of compounds (axis 0 to 3000) for SciFinder, SureChEMBL and IBM-SIIP; series "All compounds" and "Biologically relevant"; bar labels 59.0%, 50.6%, 60.8, 49.4.]

Comparison of automated and manual patent chemistry extraction methods

Luca Bartek*, Stefan Senger, George Papadatos, Anna Gaulton

[Poster section headings: Introduction, Methods, Results, Discussion, Conclusion, References]

Introduction

As new chemical entities are often first published in patents, and some new compounds may not even be featured elsewhere, patents have become an important source of information for researchers.

With more and more patents granted each year, it becomes increasingly difficult to extract the chemistry manually. There are automated options, including SureChEMBL, which is available via the Open PHACTS discovery platform. But how reliable are they compared to manually curated sources? We looked at the following use cases:

Use case 1:

In the second comparison, we used the 1740 unique patent-compound pairs we had retrieved from Reaxys. We checked how many of these patent-compound pairs could also be found in SureChEMBL and IBM SIIP, respectively.
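A pair-level comparison of this kind can be sketched as a set intersection over (patent, compound) tuples. This is only an illustration under stated assumptions: the function name `pair_coverage`, the patent numbers and the compound keys are hypothetical and are not taken from the poster.

```python
def pair_coverage(reference_pairs, database_pairs):
    """Fraction of reference (patent, compound) pairs also present in a database."""
    reference_pairs = set(reference_pairs)
    if not reference_pairs:
        raise ValueError("reference set is empty")
    found = reference_pairs & set(database_pairs)
    return len(found) / len(reference_pairs)


# Toy data (hypothetical identifiers, not poster data).
reference = {
    ("US-1", "cmpdA"),
    ("US-1", "cmpdB"),
    ("WO-2", "cmpdA"),
    ("EP-3", "cmpdC"),
}
automated = {
    ("US-1", "cmpdA"),
    ("WO-2", "cmpdB"),  # right compound, wrong patent: not counted as a hit
    ("EP-3", "cmpdC"),
}

coverage = pair_coverage(reference, automated)
print(f"{coverage:.1%} of reference pairs found")
```

Matching on the pair rather than on the compound alone means that a compound extracted from a different patent does not count as a hit.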

Another interesting question is the source of the compounds: whether they are present in the patent as text, structural depictions or Markush structures. When we compared the subset of WO patents for which images were recognised to those for which they were not, we found only a 7.7% increase in efficiency, which is less than we had expected. This could be because automated systems currently have no way of recognising Markush structures, which are in fact very common in the patent literature.

In our binary comparison, we found that chemicals with a higher patent corpus count were much more likely to be found in either of the automatically created databases. This was in line with our expectations. One might argue, though, that “unique” compounds, that is, those with a low corpus count, are more relevant.

We also found a vast difference between the success rates of single-component and multi-component compounds. When looking only at single-component structures, the success rate of SureChEMBL was over 80%, while for compounds containing more than two components it was 0%.
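A breakdown by component count could be computed along the following lines. The sketch assumes that structures are available as SMILES strings and that each "." separates one disconnected component (which holds as long as no ring-closure bond spans the dot); the poster does not say how components were counted, and the records below are toy data.

```python
from collections import defaultdict


def component_count(smiles):
    """Number of dot-separated components in a SMILES string (assumption: no bonds across dots)."""
    return smiles.count(".") + 1


def success_by_components(records):
    """Map component count -> fraction of structures found, for (smiles, found) records."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for smiles, found in records:
        n = component_count(smiles)
        totals[n] += 1
        hits[n] += bool(found)
    return {n: hits[n] / totals[n] for n in sorted(totals)}


# Toy records (not poster data): (SMILES, found in the automated database?)
records = [
    ("CCO", True),
    ("c1ccccc1", True),
    ("CC(=O)[O-].[Na+]", True),
    ("[Na+].[Cl-]", False),
    ("CCO.O.Cl", False),
]

rates = success_by_components(records)
print(rates)
```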

We noticed that the highest success rate was achieved with US patents, so we extended the search to patent families to examine whether alternative patent numbers could improve the results. After retrieving all US, WO and EP patent family members of the patents retrieved from Reaxys (this was done using SureChEMBL), we found only a moderate increase in the success rate of both SureChEMBL and IBM SIIP.
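The family-based extension can be illustrated as follows: a reference pair counts as found if the compound is recorded under the original patent or under any of its family members. The `families` mapping, the function name and all identifiers are hypothetical; in the poster the family members were retrieved using SureChEMBL, which this sketch does not attempt to reproduce.

```python
def family_coverage(reference_pairs, database_pairs, families):
    """Fraction of reference pairs found when any patent family member may match."""
    reference_pairs = list(reference_pairs)
    if not reference_pairs:
        raise ValueError("reference set is empty")
    database_pairs = set(database_pairs)
    hits = 0
    for patent, compound in reference_pairs:
        # Always include the original patent number alongside its family members.
        members = set(families.get(patent, ())) | {patent}
        if any((member, compound) in database_pairs for member in members):
            hits += 1
    return hits / len(reference_pairs)


# Toy data (hypothetical identifiers, not poster data).
reference = [("WO-2", "cmpdA"), ("EP-3", "cmpdB")]
database = {("US-9", "cmpdA")}
families = {"WO-2": {"US-9"}}  # US-9 is assumed to be a family member of WO-2

without_families = family_coverage(reference, database, {})
with_families = family_coverage(reference, database, families)
print(without_families, with_families)
```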

On average, 50-66% of the content of the “gold standard” manually curated patent chemistry databases can also be found in automatically generated databases. The latter are also freely available; for example, SureChEMBL will soon be available through the Open PHACTS API (http://dev.openphacts.org). IBM SIIP is also freely available, but it is a static database covering patents until 2010, whereas SureChEMBL is updated daily.

1. Senger et al., J. Cheminf., 2015, 7:49
2. Akhondi et al., PLoS One, 2014, 9:e107477
3. http://www.uspto.gov/

[Figure: USPTO grants per year, 1963 to 2013 (axis 0 to 350,000).]

[Diagram: one PATENT linked to COMPOUND 1, COMPOUND 2 and COMPOUND 3.]

[Diagram: one COMPOUND linked to PATENT 1, PATENT 2 and PATENT 3.]

Use case 2:

[Workflow diagram, PATENT COMPOUNDS: 46 patents (Akhondi et al.); SciFinder compounds (manually curated); SureChEMBL compounds and IBM SIIP compounds (automatically generated).]

[Workflow diagram, COMPOUND PATENTS, boxes in order: Maybridge Hitfinder Collection; # heavy atoms > 19; mw < 500; 9274 compounds; Reaxys patents (for 543 compounds); at least 1 US, WO or EP patent; incorrect or ambiguous; “SureChEMBL only”; “SureChEMBL & IBM”; 438 compounds.]

PATENT COMPOUNDS

When stereochemistry was removed, the results improved somewhat, with SureChEMBL returning 64.9% of the SciFinder molecules. This number was 67.1% for the “Biologically annotated” subset.
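The poster does not state how stereochemistry was removed. One possible approach, sketched here purely as an assumption, is to compare only the first block of a standard InChIKey, which encodes the connectivity skeleton, while the second block hashes the remaining layers, including stereochemistry. The keys below are made-up placeholders in InChIKey layout, not real compounds.

```python
def skeleton(inchikey):
    """First (connectivity) block of an InChIKey, i.e. the part before the first hyphen."""
    return inchikey.split("-")[0]


def coverage(reference_keys, database_keys, ignore_stereo=False):
    """Fraction of reference keys found in the database, optionally ignoring stereochemistry."""
    reference_keys = list(reference_keys)
    if not reference_keys:
        raise ValueError("reference set is empty")
    project = skeleton if ignore_stereo else (lambda key: key)
    database = {project(key) for key in database_keys}
    hits = sum(project(key) in database for key in reference_keys)
    return hits / len(reference_keys)


# Placeholder keys (hypothetical): the first pair differs only in the second block.
reference = ["AAAAAAAAAAAAAA-BBBBBBBBSA-N", "CCCCCCCCCCCCCC-DDDDDDDDSA-N"]
database = ["AAAAAAAAAAAAAA-UHFFFAOYSA-N", "CCCCCCCCCCCCCC-DDDDDDDDSA-N"]

exact = coverage(reference, database)
flat = coverage(reference, database, ignore_stereo=True)
print(exact, flat)
```

Counting hits per reference key, rather than collapsing the reference set first, keeps the denominator the same in both modes, so the two rates stay directly comparable.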

COMPOUND PATENTS

The first comparison we performed was of a binary nature: we looked at whether a compound was found at all in the SureChEMBL and IBM SIIP databases.

Of the 438 compounds, 67.1% were found in at least one of the two databases: 52.7% were found in both, 2.9% only in IBM SIIP and 11.4% only in SureChEMBL.
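A breakdown of this kind (found in either, in both, or in only one database) follows directly from set operations on compound identifiers. The sketch below uses toy identifiers and hypothetical names; it does not reproduce the poster's data or matching procedure.

```python
def overlap_breakdown(reference, db_a, db_b):
    """Percentages of reference compounds found in either, both, or only one of two databases."""
    reference = set(reference)
    if not reference:
        raise ValueError("reference set is empty")
    in_a = reference & set(db_a)
    in_b = reference & set(db_b)

    def pct(subset):
        return round(100.0 * len(subset) / len(reference), 1)

    return {
        "either": pct(in_a | in_b),
        "both": pct(in_a & in_b),
        "only_a": pct(in_a - in_b),
        "only_b": pct(in_b - in_a),
    }


# Toy identifiers (not poster data).
reference = {"c1", "c2", "c3", "c4", "c5"}
surechembl = {"c1", "c2", "c3", "x9"}  # x9 is not in the reference set and is ignored
ibm_siip = {"c2", "c3", "c4"}

result = overlap_breakdown(reference, surechembl, ibm_siip)
print(result)
```

By construction, "both", "only_a" and "only_b" partition the "either" group, so their percentages should sum to the "either" value up to rounding, as the reported 52.7%, 2.9% and 11.4% do against 67.1%.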

[Figure: bar chart of the number of patent-compound pairs (axis 0 to 1800) for Reaxys, SureChEMBL and IBM-SIIP; bar labels 61.6% and 59.3%.]

PATENT COMPOUNDS

The roughly 60% efficiency of SureChEMBL would most likely seem low to a researcher who expects every single compound of interest to be extracted from each patent. This is why it is surprising that the coverage was not greatly increased for “Biologically annotated” molecules. But which compounds are of interest?

SureChEMBL returned nearly 5 times more compounds than SciFinder.

What is noise?

COMPOUND PATENTS

* [email protected]

The research leading to these results has received support from the Innovative Medicines Initiative Joint Undertaking under grant agreement n° 115191, resources of which are composed of financial contribution from the European Union's Seventh Framework Programme (FP7/2007-2013) and EFPIA companies’ in kind contribution.