Post on 11-Jul-2020
248th ACS National Meeting, San Francisco CA, USA 10th August 2014
Chemistry and reactions from non-US patents
Daniel Lowe and Roger Sayle NextMove Software
Cambridge, UK
Topics
• Coverage of USPTO vs EPO patents
• Extraction of non-chemical-compound data
• Can text-mining provide insights into reaction informatics?
What am I missing by only using the USPTO data?
EPO and USPTO background
• USPTO: 303k grants published in 2013
• EPO: 67k grants published in 2013
• EPO patents must be in one or more of English, French or German
• USPTO documents are all in English
EPO/USPTO compounds (using StdInChI)
3,345,877
2,674,069
445,588
USPTO = 1976-2013 patent grants + 2001-2013 patent applications
EPO = 1978-2013 patent grants and applications
EPO/USPTO reactions (using separately sorted StdInChIs for reactants/agents and products)
706,524
317,533
99,681
USPTO = 1976-2013 patent grants + 2001-2013 patent applications
EPO = 1978-2013 patent grants and applications
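The reaction comparison on this slide relies on sorting the StdInChIs of reactants/agents and of products separately, so that two reactions match regardless of the order in which a patent lists their components. A minimal Python sketch of such a key is given here; the function name `reaction_key` and the `>>` separator are illustrative assumptions rather than the authors' implementation, and the StdInChI strings are assumed to have been generated beforehand.

```python
def reaction_key(reactant_agent_inchis, product_inchis):
    """Build an order-independent key for a reaction (hypothetical helper).

    Reactants/agents and products are sorted separately, so the order in
    which they were written down does not affect the key.
    """
    left = sorted(reactant_agent_inchis)
    right = sorted(product_inchis)
    return "\n".join(left) + "\n>>\n" + "\n".join(right)


# StdInChIs for ethanol, acetic acid and ethyl acetate.
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
ACETIC_ACID = "InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4)"
ETHYL_ACETATE = "InChI=1S/C4H8O2/c1-3-6-4(2)5/h3H2,1-2H3"

# The same esterification written with the reactants in a different order.
key_a = reaction_key([ETHANOL, ACETIC_ACID], [ETHYL_ACETATE])
key_b = reaction_key([ACETIC_ACID, ETHANOL], [ETHYL_ACETATE])
```

Keys built this way can be placed in a set per patent office, and the overlap counted with ordinary set intersection.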
Effect of different Languages
[Chart: number of structures extracted (axis 0 to 4,000,000) against year, 1978-2013, split by language: English, German, French]
Translation performed by Lexichem v2014.Jun
First Disclosure Occurrence by Language
Excluding compounds disclosed by USPTO (1976-2013) patents
All compounds
Timeliness
• Competitive intelligence requires up-to-date information
• Methodology:
– Consider compounds present in both EPO and USPTO data, not disclosed by either prior to 2006
– Compare the earliest publication date for each compound
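The methodology above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function name `first_disclosure_lags`, the dictionary inputs (compound identifier to earliest publication date) and the toy dates are all invented for the example and are not taken from the talk.

```python
from datetime import date


def first_disclosure_lags(epo_first, uspto_first, cutoff=date(2006, 1, 1)):
    """Return {compound: days by which the USPTO date precedes the EPO date}.

    Positive values mean the USPTO publication came first. Only compounds
    present in both sources, and not disclosed by either before the cutoff,
    are considered.
    """
    lags = {}
    for compound in epo_first.keys() & uspto_first.keys():
        epo_date = epo_first[compound]
        us_date = uspto_first[compound]
        if min(epo_date, us_date) < cutoff:
            continue  # already disclosed by one office before the cutoff
        lags[compound] = (epo_date - us_date).days
    return lags


# Invented toy data: compound identifier -> earliest publication date.
epo = {"A": date(2008, 3, 1), "B": date(2007, 1, 10), "C": date(2005, 5, 5)}
us = {"A": date(2007, 3, 2), "B": date(2007, 6, 9), "C": date(2009, 1, 1),
      "D": date(2010, 1, 1)}

lags = first_disclosure_lags(epo, us)
mean_lag = sum(lags.values()) / len(lags)
```

Here compound C is dropped because the EPO disclosed it before 2006, and D is dropped because it appears in only one source.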
Average lag between EPO/USPTO
Compounds first disclosed between 2006-2013
Mean: US 540 days earlier
[Histogram: compound disclosures (axis 0 to 180,000) by lag between USPTO and EPO first publication, in bins from "US more than 5 years earlier" through "Within a week" to "US more than 5 years later"]
Applications vs Applications
Compounds first disclosed between 2006-2013
Mean: US 223 days earlier
[Histogram: compound disclosures (axis 0 to 70,000) by lag between USPTO and EPO first publication, in bins from "US more than 5 years earlier" through "Within a week" to "US more than 5 years later"]
Grants vs Grants
Compounds first disclosed between 2006-2013
Mean: US 287 days earlier
[Histogram: compound disclosures (axis 0 to 140,000) by lag between USPTO and EPO first publication, in bins from "US more than 5 years earlier" through "Within a week" to "US more than 5 years later"]
EPO/US 1976-2013 and ChEMBL 19 overlap
187,486
1,155,034 6,278,048
Average lag between patents/ChEMBL 19
Compounds first disclosed between 2006-2013
Mean: Patents 345 days earlier
[Histogram: compounds (axis 0 to 10,000) by lag between patent and ChEMBL 19 first disclosure, in bins from "Patents 4-5 years earlier" to "Patents 4-5 years later"]
Other data in patents
• Genes/proteins
• Reactions including role of reagents
• Physical quantities
• Spectra
• Diseases
• Organisms
• Companies
• …
Gene/protein identification
• Identify trends in drug target popularity
• Count number of patents (in a year) mentioning a given gene or its gene products and map to the HGNC symbol
• Many genes are referenced by short, ambiguous terms, so great care is required to achieve good precision
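The counting step described above might look like the following Python sketch. Everything here is an assumption made for illustration: the tiny `SYNONYMS` table stands in for a real HGNC-derived dictionary, and matching short symbols case-sensitively is just one simple way to protect precision, not necessarily the approach used in the talk.

```python
import re
from collections import defaultdict

# Hypothetical synonym table mapping surface forms to HGNC symbols.
SYNONYMS = {
    "EGFR": "EGFR",
    "epidermal growth factor receptor": "EGFR",
    "mTOR": "MTOR",
    "mammalian target of rapamycin": "MTOR",
    "MET": "MET",
    "hepatocyte growth factor receptor": "MET",
}
# Short, ambiguous symbols are matched case-sensitively; long names are not.
CASE_SENSITIVE = {"EGFR", "mTOR", "MET"}


def genes_in_text(text):
    """Return the set of HGNC symbols mentioned in a piece of text."""
    found = set()
    for term, symbol in SYNONYMS.items():
        flags = 0 if term in CASE_SENSITIVE else re.IGNORECASE
        if re.search(r"\b" + re.escape(term) + r"\b", text, flags):
            found.add(symbol)
    return found


def patents_per_year(patents):
    """patents: iterable of (year, text). Returns {symbol: {year: n_patents}}."""
    counts = defaultdict(lambda: defaultdict(int))
    for year, text in patents:
        for symbol in genes_in_text(text):  # each patent counts once per gene
            counts[symbol][year] += 1
    return counts


# Invented toy documents.
toy = [
    (2010, "Inhibitors of the epidermal growth factor receptor (EGFR)."),
    (2010, "An mTOR kinase inhibitor."),
    (2011, "Compounds binding EGFR."),
    (2011, "The proposal was met with limited success."),
]
counts = patents_per_year(toy)
```

The last toy document shows why the symbol MET needs care: the ordinary word "met" is not counted because the short symbol is matched case-sensitively.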
Gene/protein identification
[Chart: EGFR (Epidermal growth factor receptor), percentage of protein/gene mentions in that year, ChEMBL and patents on separate axes, 1976-2012]
[Chart: MET (Hepatocyte growth factor receptor), percentage of protein/gene mentions in that year, ChEMBL and patents on separate axes, 1976-2012]
Gene/protein identification
[Chart: MTOR (Mammalian target of rapamycin), percentage of protein/gene mentions in that year, ChEMBL and patents on separate axes, 1976-2012]
[Chart: PLAT (Tissue plasminogen activator), percentage of protein/gene mentions in that year, ChEMBL and patents on separate axes, 1976-2012]
Melting/Boiling Point Extraction
Over 99k so far!
Chemical reactions
Predicting yield
• Features to consider:
– 2D fingerprints (especially around the reaction centers)
– Reaction type
– Temperature
– Time
– Change in complexity e.g. chiral centres
Data extraction
• Can pull out yields, quantities, times and temperatures
• Can sanity check text-mined yield by calculating from amounts (or even masses)
\[
\frac{\text{mass of compound (g)}}{\text{molar mass [calc from structure] (g mol}^{-1}\text{)}} = \text{mols of compound}
\]
\[
\frac{\text{mols of product}}{\text{mols of limiting reactant}} \times 100 = \%\ \text{yield}
\]
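The sanity check above can be sketched in Python as follows. This is a minimal illustration, assuming 1:1 stoichiometry so that the limiting reactant is simply the one present in the smallest molar amount; the function names, the 5-point tolerance and the example quantities are invented for the example.

```python
def moles(mass_g, molar_mass_g_per_mol):
    """Mols of compound = mass (g) / molar mass (g/mol)."""
    return mass_g / molar_mass_g_per_mol


def calculated_yield_percent(product_mass_g, product_molar_mass, reactants):
    """reactants: list of (mass_g, molar_mass) pairs, one per reactant.

    Assumes 1:1 stoichiometry, so the limiting reactant is the one with
    the fewest mols.
    """
    limiting_mols = min(moles(m, mw) for m, mw in reactants)
    return 100.0 * moles(product_mass_g, product_molar_mass) / limiting_mols


def yield_is_plausible(reported_percent, calculated_percent, tolerance=5.0):
    """Flag text-mined yields that disagree with the calculated value."""
    return abs(reported_percent - calculated_percent) <= tolerance


# Invented example: 0.01 mol and 0.02 mol of two reactants, 0.0075 mol product.
calc = calculated_yield_percent(1.5, 200.0, [(1.22, 122.0), (2.0, 100.0)])
ok = yield_is_plausible(75.0, calc)
suspicious = not yield_is_plausible(7.5, calc)  # e.g. a misplaced decimal point
```

A large disagreement between the reported and calculated values points to an extraction error in either the yield or one of the quantities.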
Outliers inevitable
Scale vs yield
2001-2013 US applications, Suzuki couplings
Temperatures
Identify Synthetic Routes
Number of steps   Intermediates   Terminal products
 1                197702          385149
 2                103114          149445
 3                 56611           81837
 4                 31403           47579
 5                 17268           27670
 6                  9230           16619
 7                  5057            9320
 8                  2701            5263
 9                  1256            2511
10                   639            1330
11                   301             678
12                   136             373
13                    58             111
14                    15              63
15                     5               8
16                     2               6
17                                     5
[Bar chart: occurrences (axis 0 to 700,000) against number of steps, for intermediates and terminal products]
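The slides do not say how step counts were assigned, so the following Python sketch shows only one plausible reading: a product's step count is one more than the deepest of its reactants, a terminal product is one that is never used as a reactant, and an intermediate is one that is. The function name `route_depths` and the toy reactions are assumptions, and the reaction graph is assumed to be acyclic with each product made by a single reaction.

```python
from functools import lru_cache


def route_depths(reactions):
    """reactions: list of (reactants, product) with hashable compound IDs.

    Returns {product: (n_steps, is_terminal)}. Assumes an acyclic reaction
    graph in which each product is made by exactly one reaction.
    """
    made_from = {product: tuple(reactants) for reactants, product in reactions}
    used_as_reactant = {r for reactants, _ in reactions for r in reactants}

    @lru_cache(maxsize=None)
    def depth(compound):
        if compound not in made_from:
            return 0  # starting material, not made by any extracted reaction
        return 1 + max((depth(r) for r in made_from[compound]), default=0)

    return {p: (depth(p), p not in used_as_reactant) for p in made_from}


# Invented three-step route: a + b -> c, c + d -> e, e -> f.
toy_reactions = [(["a", "b"], "c"), (["c", "d"], "e"), (["e"], "f")]
depths = route_depths(toy_reactions)
```

In this toy route, c and e are intermediates at steps 1 and 2, and f is the terminal product at step 3.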
Number of steps vs Complexity
ringComplexityScore = log10(nRingBridgeAtoms + 1) + log10(nSpiroAtoms + 1)
stereoComplexityScore = log10(nStereoCenters + 1)
macrocyclePenalty = log10(nMacrocycles + 1)
sizePenalty = natoms^1.005 − natoms
Complexity = ∑ of the terms above (Ertl & Schuffenhauer 2009, doi:10.1186/1758-2946-1-8)
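As a worked illustration, the terms on this slide can be combined in a short Python function. This sketch assumes that the ∑ on the slide denotes a plain sum of the four terms and that the atom, ring, stereo and macrocycle counts have already been obtained from a cheminformatics toolkit; the function name and argument names are invented for the example.

```python
import math


def complexity_score(n_atoms, n_ring_bridge_atoms, n_spiro_atoms,
                     n_stereo_centers, n_macrocycles):
    """Sum of the ring, stereo, macrocycle and size terms from the slide."""
    ring = math.log10(n_ring_bridge_atoms + 1) + math.log10(n_spiro_atoms + 1)
    stereo = math.log10(n_stereo_centers + 1)
    macrocycle = math.log10(n_macrocycles + 1)
    size = n_atoms ** 1.005 - n_atoms
    return ring + stereo + macrocycle + size


# Invented counts for a small flat molecule and a larger, more complex one.
simple = complexity_score(10, 0, 0, 0, 0)
elaborate = complexity_score(40, 2, 1, 5, 0)
```

Because every term is zero when its count is zero (and the size term grows slowly with atom count), the score rises only as rings, stereocentres, macrocycles or size are added.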
Number of steps vs Complexity
[Chart: average number of product heavy atoms (axis 0 to 50) against number of steps (1 to 11)]
Yield vs Change in Complexity
2001-2013 US applications, all reactions
Trends in Reaction Types
[Chart: Suzuki couplings as a percentage of reactions in a year (axis 0% to 8%), 1976-2013]
Classification performed using NameRXN
Trends In Solvent Use
[Chart: percentage of reactions in that year (axis 0% to 20%) for each solvent, 1976-2013. Solvents plotted: Tetrahydrofuran, Dichloromethane, Water, Dimethylformamide, Methanol, Ethyl acetate, Ethanol, 1,4-Dioxane, Toluene, Acetonitrile, Acetic acid, Chloroform, Acetone, Benzene]
Are solvents getting greener?
1976                      2013
Water (21%)               Tetrahydrofuran (15%)
Ethanol (11%)             Dichloromethane (14%)
Benzene (8%)              Water (13%)
Methanol (7%)             Dimethylformamide (10%)
Tetrahydrofuran (5%)      Methanol (8%)
Dichloromethane (4%)      Ethyl acetate (7%)
Dimethylformamide (4%)    Ethanol (5%)
Acetic acid (4%)          1,4-Dioxane (4%)
Chloroform (3%)           Toluene (3%)
Acetone (3%)              Acetonitrile (3%)
Total for top 10: 71%     82%
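Per-year solvent percentages of this kind could be computed along the following lines. This Python sketch is an assumption-laden illustration: it takes the denominator to be the number of reactions in the year (so a reaction using two solvents contributes to both), which may differ from the normalisation behind the figures on the slide, and the function name and toy records are invented.

```python
from collections import Counter


def solvent_shares(reactions):
    """reactions: iterable of (year, solvents). Returns {year: {solvent: %}}.

    Each percentage is the share of that year's reactions that use the
    solvent; a solvent is counted at most once per reaction.
    """
    totals = Counter()
    uses = {}
    for year, solvents in reactions:
        totals[year] += 1
        uses.setdefault(year, Counter()).update(set(solvents))
    return {year: {s: 100.0 * n / totals[year] for s, n in counter.items()}
            for year, counter in uses.items()}


# Invented toy records: (year, solvents used in one reaction).
toy = [
    (1976, ["water"]),
    (2013, ["tetrahydrofuran", "water"]),
    (2013, ["tetrahydrofuran"]),
    (2013, ["dichloromethane"]),
    (2013, ["dichloromethane", "water"]),
]
shares = solvent_shares(toy)
```

Sorting each year's dictionary by value and taking the first ten entries would give a ranking in the style of the table above.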
Conclusions
• A significant amount of the novel chemistry in EPO patents comes from the non-English patents
• Compounds disclosed by both the USPTO and EPO are, on average, published earlier by the USPTO, but for many compounds an EPO patent will be the earlier disclosure
• Gene/protein identification can identify clear changes in patenting behaviour over time
• Text mining provides the tools to answer many reaction informatics questions
Acknowledgements
• Funding provided by:
Thank you for your time!
http://nextmovesoftware.com http://nextmovesoftware.com/blog
daniel@nextmovesoftware.com