From Open text mining solutions to Open Data resources

Jean-Claude Bradley Memorial Symposium, University of Cambridge, 14th July 2014. From Open text mining solutions to Open Data resources. Daniel Lowe, NextMove Software, Cambridge, UK

description

OPSIN (Open Parser for Systematic IUPAC Nomenclature) has developed into a mature solution for chemical name to structure conversion. Together with other Open Source utilities such as OSCAR4, ChemSpot, and ChemicalTagger, it gives us the tools to address many of the problems in chemical text mining. This ecosystem of tools has facilitated the extraction of over a million reactions from the US patent literature, which are now freely available to all under CC-Zero. I will describe advances in OPSIN, explain how reactions can be extracted from text, and present some interesting analyses made possible by the public availability of this dataset.

Transcript of From Open text mining solutions to Open Data resources

Page 1: From Open text mining solutions to Open Data resources

Jean-Claude Bradley Memorial Symposium, University of Cambridge, 14th July 2014

From Open text mining solutions to Open Data resources

Daniel Lowe

NextMove Software

Cambridge, UK

Page 2: From Open text mining solutions to Open Data resources


The idea

Accessible text (e.g. US patents) → Reaction Extraction System → Open Reaction Data resource

Page 3: From Open text mining solutions to Open Data resources


Building on existing projects

Page 4: From Open text mining solutions to Open Data resources


What is chemical name to structure?

(2S)-2-aminobutan-1-ol

(2S)- : stereochemistry
2- : locant
amino : substituent
but : alk
an : unsaturation
1- : locant
ol : suffix

Page 5: From Open text mining solutions to Open Data resources


Supported chain nomenclature

Alkanes

dodectetractkiliane

Heteroatom hydrides

pentaphosphane

Heterogeneous heteroatom hydrides

disilazane

Trivial acids

butyric acid

Page 6: From Open text mining solutions to Open Data resources


Supported ring nomenclature

Monocyclic spiro

dispiro[4.2.4.2]tetradecane

Hantzsch-Widman

1,3,5-triazine

Fused ring

furo[3,2-b]thieno[2,3-e]pyridine

Ring assembly

2,2':6',2''-terpyridyl

Von Baeyer

tricyclo[2.2.1.1²,⁵]octane

Polycyclic spiro

spiro[piperidine-4,9'-xanthene]

Page 7: From Open text mining solutions to Open Data resources


Structural assembly nomenclature

Conjunctive nomenclature

benzeneethanol

Substitutive nomenclature

2,4,6-trinitrotoluene

Additive nomenclature

methylsulfonyl

Multiplicative nomenclature

4,4'-methylenedioxydibenzoic acid

Functional class nomenclature

ethyl alcohol

Page 8: From Open text mining solutions to Open Data resources


Structural modifications

Heteroatom replacement

1-thia-4-aza-2,6-disilacyclohexane

Unsaturation

hexa-1,3-dien-5-yne

Hydro, dehydro, indicated hydrogen and added hydrogen

2,7-dihydro-1H-azepine

Functional replacement; suffixes including infixed suffixes

methanedithioic acid

1-chloro-2,4-diimidotricarbonic acid

Lambda convention

2λ⁶-trisulfane

Page 9: From Open text mining solutions to Open Data resources


Bridges and stereochemistry

Bridges

4a,8a-propanoquinoline

E/Z stereochemistry

(Z)-2-chloro-but-2-ene

Relative cis/trans stereochemistry

trans-2,6-dimethyl-2,6-dihydronaphthalene

R/S stereochemistry

(1R,3S)-3-amino-3-methylcyclohexanol

Page 10: From Open text mining solutions to Open Data resources


Miscellaneous nomenclature

Groups with indeterminately positioned structural features

1,3-xylene

Charge and oxidation numbers

methylmercury(1+) or methylmercury(II)

Subtractive nomenclature

2-deoxy-ᴅ-ribose

“per-nomenclature”

perhydroanthracene

perchlorobenzene

Page 11: From Open text mining solutions to Open Data resources


Polymer nomenclature

Structure-based polymer nomenclature

poly[(benzo[1,2-d:4,5-d']bis[1,3]thiazole-2,6-diyl)-1,4-phenyleneoxy-1,3-phenylene(1,3,5,7-tetraoxo-1,2,3,5,6,7-hexahydrobenzo[1,2-c:4,5-c']dipyrrole-2,6-diyl)-1,3-phenyleneoxy-1,4-phenylene]

Page 12: From Open text mining solutions to Open Data resources


Domain specific nomenclature

Steroid nomenclature

17β-Hydroxy-8α,9β,10α-androst-4-en-3-one

Amino acid

ʟ-leucinamide

Oligopeptide

ʟ-arginyl-O-phosphono-ʟ-seryl-ʟ-alanyl-ʟ-proline

Cyclic peptide

cyclo(ᴅ-alanyl-ʟ-phenylalanyl)

Nucleotide nomenclature

guanylyl(3'-5')uridine 3'-monophosphate

Page 13: From Open text mining solutions to Open Data resources


Carbohydrates

ʟ-ribo-ᴅ-manno-nonose

2,7-anhydro-ᴅ-glycero-β-ᴅ-galacto-oct-2-ulopyranosonic acid

β-ᴅ-Fructofuranosyl α-ᴅ-glucopyranoside

β-ᴅ-glucopyranosyl-(1→3)-β-ᴅ-glucopyranosyl-(1→3)-ᴅ-glucopyranose

Page 14: From Open text mining solutions to Open Data resources


Usage

Batch conversion on the command line

java -jar opsin-1.6.0-jar-with-dependencies.jar -osmi input.txt output.smi

RESTful web service (opsin.ch.cam.ac.uk)

Java API

import uk.ac.cam.ch.wwmm.opsin.NameToStructure;

// Obtain the shared parser instance, then convert a name to SMILES.
NameToStructure nts = NameToStructure.getInstance();
String chemicalName = "acetonitrile";
String smiles = nts.parseToSmiles(chemicalName);

Page 15: From Open text mining solutions to Open Data resources


Who is using OPSIN?

Commercial software

Cinfony (interface to Python)

Many text mining efforts

Workflows

Web services

Page 16: From Open text mining solutions to Open Data resources


Steps involved

• Identifying experimental sections

• Identifying chemical entities

• Chemical name to structure conversion (including anaphora resolution)

• Associating chemical entities with quantities

• Assigning chemical roles

• Atom-atom mapping
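The result of the last step, atom-atom mapping, is visible in a mapped reaction SMILES, where each mapped atom carries a `:n` label inside its brackets. The following is a minimal sketch of reading those labels back out with a regular expression. The class and method names are illustrative assumptions, and this is only a way to inspect an existing mapping, not the method the extraction system uses to compute one. The input is the first reactant of the reaction SMILES shown on the CML output slide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: names are illustrative, not part of any tool in the talk.
class AtomMapSketch {
    // Inside a bracket atom such as [cH:14], the map number follows the colon.
    private static final Pattern ATOM_MAP = Pattern.compile(":(\\d+)\\]");

    /** Returns the atom-map numbers in the order they appear in the SMILES. */
    static List<Integer> atomMaps(String smiles) {
        List<Integer> maps = new ArrayList<>();
        Matcher m = ATOM_MAP.matcher(smiles);
        while (m.find()) {
            maps.add(Integer.parseInt(m.group(1)));
        }
        return maps;
    }

    public static void main(String[] args) {
        // First reactant of the mapped reaction SMILES from the CML output slide.
        String reactant = "Cl[S:2]([c:5]1[cH:14][cH:13][c:8]([C:9]([O:11][CH3:12])"
                + "=[O:10])[cH:7][cH:6]1)(=[O:4])=[O:3]";
        System.out.println(atomMaps(reactant));
    }
}
```

The chlorine atom carries no map number, which is consistent with it not appearing in the product.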

Page 17: From Open text mining solutions to Open Data resources


Example

Methyl 4-[(pentafluorophenoxy)sulfonyl]benzoate

To a solution of methyl 4-(chlorosulfonyl)benzoate (606 mg, 2.1 mmol, 1 eq) in DCM (35 ml) was added pentafluorophenol (412 mg, 2.2 mmol, 1.1 eq) and Et3N (540 mg, 5.4 mmol, 2.5 eq) and the reaction mixture stirred at room temperature until all of the starting material was consumed. The solvent was evaporated in vacuo and the residue redissolved in ethyl acetate (10 ml), washed with water (10 ml), saturated sodium hydrogen carbonate (10 ml), dried over sodium sulphate, filtered and evaporated to yield the title compound as a white solid (690 mg, 1.8 mmol, 85%).
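To make the "associating chemical entities with quantities" step concrete, here is a minimal sketch that pulls value and unit pairs out of the parenthesised groups in the paragraph above. It is a simplification under stated assumptions: the class name, method name and the short unit list are illustrative, and a real system would rely on a proper grammar rather than a single regular expression.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: names and the unit list are assumptions for illustration.
class QuantitySketch {
    // A number followed by one of a few units seen in the example paragraph.
    private static final Pattern QUANTITY =
            Pattern.compile("(\\d+(?:\\.\\d+)?)\\s*(mmol|mg|ml|eq|%)");

    /** Maps each unit found in the text to its value, in order of appearance. */
    static Map<String, Double> parseQuantities(String text) {
        Map<String, Double> quantities = new LinkedHashMap<>();
        Matcher m = QUANTITY.matcher(text);
        while (m.find()) {
            quantities.put(m.group(2), Double.parseDouble(m.group(1)));
        }
        return quantities;
    }

    public static void main(String[] args) {
        // Quantity groups copied from the example paragraph.
        System.out.println(parseQuantities("606 mg, 2.1 mmol, 1 eq"));
        System.out.println(parseQuantities("690 mg, 1.8 mmol, 85%"));
    }
}
```

Each parsed group would then be attached to the chemical name that immediately precedes it in the text, for instance "methyl 4-(chlorosulfonyl)benzoate" for the first group.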

Page 18: From Open text mining solutions to Open Data resources


Graphical Output

Page 19: From Open text mining solutions to Open Data resources


CML output

<reaction xmlns="http://www.xml-cml.org/schema" xmlns:cmlDict="http://www.xml-cml.org/dictionary/cml/" xmlns:nameDict="http://www.xml-..
  <dl:reactionSmiles>Cl[S:2]([c:5]1[cH:14][cH:13][c:8]([C:9]([O:11][CH3:12])=[O:10])[cH:7][cH:6]1)(=[O:4])=[O:3].[F:15][c:16]1[c:21]([OH:22])[c:20]([..
  <productList>
    <product role="product">
      <molecule id="m0">
        <name dictRef="nameDict:unknown">title compound</name>
      </molecule>
      <amount units="unit:mmol">1.8</amount>
      <amount units="unit:mg">690</amount>
      <amount units="unit:percentYield">85.0</amount>
      <identifier dictRef="cml:smiles" value="FC1=C(C(=C(C(=C1OS(=O)(=O)C1=CC=C(C(=O)OC)C=C1)F)F)F)F"/>
      <identifier dictRef="cml:inchi" value="InChI=1/C14H7F5O5S/c1-23-14(20)6-2-4-7(5-3-6)25(21,22)24-13-11(18)9(16)8(15)10(17)12(13)19/h2-5H..
      <dl:entityType>definiteReference</dl:entityType>
      <dl:state>solid</dl:state>
    </product>
  </productList>
  <reactantList>
    <reactant role="reactant" count="1">
      <molecule id="m1">
        <name dictRef="nameDict:unknown">methyl 4-(chlorosulfonyl)benzoate</name>
      </molecule>
      <amount units="unit:mmol">2.1</amount>
      <amount units="unit:mg">606</amount>
      <amount units="unit:eq">1.0</amount>
      <identifier dictRef="cml:smiles" value="ClS(=O)(=O)C1=CC=C(C(=O)OC)C=C1"/>

Quantities including yield are extracted

Entity is classified as an exact compound, definite reference, chemical class or fragment

Reaction SMILES

SMILES and InChIs for every structure-resolvable reagent/product
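Because the output is plain XML, the extracted quantities can be read back with the standard Java DOM parser. The sketch below is illustrative only: the class and method names are assumptions, and the embedded fragment is an abridged copy of the product element from the slide, leaving out the `dl:` elements because their namespace declaration is truncated in the transcript.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: names are illustrative, not part of any tool in the talk.
class CmlAmountsSketch {
    // Abridged product element from the slide (dl: elements omitted).
    static final String PRODUCT_FRAGMENT =
            "<product xmlns=\"http://www.xml-cml.org/schema\" role=\"product\">"
          + "<molecule id=\"m0\"><name dictRef=\"nameDict:unknown\">title compound</name></molecule>"
          + "<amount units=\"unit:mmol\">1.8</amount>"
          + "<amount units=\"unit:mg\">690</amount>"
          + "<amount units=\"unit:percentYield\">85.0</amount>"
          + "</product>";

    /** Maps each amount's units attribute to its numeric value. */
    static Map<String, Double> amounts(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("amount");
            Map<String, Double> result = new LinkedHashMap<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element amount = (Element) nodes.item(i);
                result.put(amount.getAttribute("units"),
                        Double.parseDouble(amount.getTextContent().trim()));
            }
            return result;
        } catch (Exception e) {
            // Wrap checked parser exceptions so callers need no throws clause.
            throw new IllegalArgumentException("Could not parse CML fragment", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(amounts(PRODUCT_FRAGMENT));
    }
}
```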

Page 20: From Open text mining solutions to Open Data resources


Current status

• ~1 million reactions from US patent applications (2001-2013)

• ~1 million reactions from US patent grants (1976-2013)

• Over a million constitutionally distinct reactions, as a lower bound

Page 22: From Open text mining solutions to Open Data resources


Identify Synthetic Routes

Number of steps: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17

Intermediates: 197702 103114 56611 31403 17268 9230 5057 2701 1256 639 301 136 58 15 5 2

Terminal Products: 385149 149445 81837 47579 27670 16619 9320 5263 2511 1330 678 373 111 63 8 6 5

(Bar chart: occurrences, on an axis from 0 to 700,000, against number of steps, with separate series for Intermediates and Terminal Products.)

Page 23: From Open text mining solutions to Open Data resources


Trends in Reaction Types

(Chart: Suzuki couplings as a percentage of reactions in a year, 1976 to 2013; vertical axis from 0.0% to 8.0%.)

Page 24: From Open text mining solutions to Open Data resources


Trends In Solvent Use

(Chart: percentage of reactions in that year using each solvent, 1976 to 2013; vertical axis from 0.0% to 20.0%.)

Solvents plotted: Tetrahydrofuran, Dichloromethane, Water, Dimethylformamide, Methanol, Ethyl acetate, Ethanol, 1,4-Dioxane, Toluene, Acetonitrile, Acetic acid, Chloroform, Acetone, Benzene

Page 25: From Open text mining solutions to Open Data resources


Are solvents getting greener?

1976                     2013
Water (21%)              Tetrahydrofuran (15%)
Ethanol (11%)            Dichloromethane (14%)
Benzene (8%)             Water (13%)
Methanol (7%)            Dimethylformamide (10%)
Tetrahydrofuran (5%)     Methanol (8%)
Dichloromethane (4%)     Ethyl acetate (7%)
Dimethylformamide (4%)   Ethanol (5%)
Acetic acid (4%)         1,4-Dioxane (4%)
Chloroform (3%)          Toluene (3%)
Acetone (3%)             Acetonitrile (3%)

Total for top 10: 71% (1976), 82% (2013)
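As a quick consistency check on the top-10 totals, the rounded percentages from the two columns can simply be summed. This is a minimal sketch; the class and field names are illustrative assumptions. The 2013 entries sum to the reported 82%, while the 1976 entries sum to 70% against the reported 71%, presumably because the slide's total was computed from unrounded values.

```java
// Hypothetical helper: names are illustrative, data copied from the slide.
class SolventTotalsSketch {
    // Rounded top-10 solvent shares, in the order listed on the slide.
    static final int[] TOP10_1976 = {21, 11, 8, 7, 5, 4, 4, 4, 3, 3};
    static final int[] TOP10_2013 = {15, 14, 13, 10, 8, 7, 5, 4, 3, 3};

    /** Sums a list of whole-number percentages. */
    static int total(int[] percentages) {
        int sum = 0;
        for (int p : percentages) {
            sum += p;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println("1976: " + total(TOP10_1976) + "%");
        System.out.println("2013: " + total(TOP10_2013) + "%");
    }
}
```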

Page 26: From Open text mining solutions to Open Data resources


Conclusions

Open Source tools facilitate reuse and remixing of code

Open Data allows reuse in an infinite number of potential applications and analyses

Page 27: From Open text mining solutions to Open Data resources


Acknowledgements

• Albina Asadulina

• Peter Corbett

• Robert Glen

• David Jessop

• Lezan Hawizy

• Peter Murray-Rust

• Roger Sayle

Page 28: From Open text mining solutions to Open Data resources


Thank you for your time!

http://nextmovesoftware.com

http://nextmovesoftware.com/blog
