In grammars we trust: LeadMine, a knowledge driven solution

BioCreative IV workshop, DoubleTree by Hilton Hotel, Washington DC, USA, 8th October 2013

Daniel Lowe and Roger Sayle, NextMove Software, Cambridge, UK

description

We present a system employing large grammars and dictionaries to recognize a broad range of chemical entities. The system utilizes these resources to identify chemical entities without an explicit tokenization step. To allow recognition of terms slightly outside the coverage of these resources, we employ spelling correction, entity extension, and merging of adjacent entities. Recall is enhanced by the use of abbreviation detection, and precision is enhanced by the removal of abbreviations of non-entities. With the use of training data to produce further dictionaries of terms to recognize or ignore, our system achieved 86.2% precision and 85.0% recall on an unused development set.

Transcript of In grammars we trust: LeadMine, a knowledge driven solution

Page 1: In grammars we trust: LeadMine, a knowledge driven solution

BioCreative IV workshop, DoubleTree by Hilton Hotel, Washington DC, USA 8th October 2013

In grammars we trust: LeadMine, a knowledge driven solution

Daniel Lowe and Roger Sayle

NextMove Software

Cambridge, UK

Page 2: In grammars we trust: LeadMine, a knowledge driven solution

Approaches to Entity recognition

• Dictionary based

• Grammar based

• Machine Learning

[Slide labels: LeadMine (shown twice)]

Page 3: In grammars we trust: LeadMine, a knowledge driven solution

[Diagram; the only surviving label is "Optional"]

Page 4: In grammars we trust: LeadMine, a knowledge driven solution

Normalization

Input → Normalized

œstradiol → oestradiol

5` or 5’ or 5′ (backtick/quotation mark/prime) → 5'

<p>H<sub>2</sub>O</p> → H2O
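
The three normalizations above can be sketched in a few lines of Python. This is only an illustration: the replacement tables and the `normalize` function are assumptions made for this example, not LeadMine's actual rules.

```python
import re

# Hypothetical replacement tables; the real system's rules are not given in the slides.
LIGATURES = {"œ": "oe", "Œ": "OE", "æ": "ae", "Æ": "AE"}
# Backtick, right single quotation mark and prime all become an apostrophe.
PRIMES = str.maketrans({"`": "'", "\u2019": "'", "\u2032": "'"})


def normalize(text: str) -> str:
    text = re.sub(r"<[^>]+>", "", text)      # drop HTML tags such as <p> and <sub>
    for ligature, plain in LIGATURES.items():
        text = text.replace(ligature, plain)
    return text.translate(PRIMES)


print(normalize("<p>H<sub>2</sub>O</p>"))
print(normalize("œstradiol"))
```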

Page 5: In grammars we trust: LeadMine, a knowledge driven solution

Legend: blue = grammars; green = traditional dictionaries; orange = blocking dictionaries

Page 6: In grammars we trust: LeadMine, a knowledge driven solution

Advantages of grammars

• Don’t require annotated corpora

• Encode knowledge about the domain

• Very fast recognition

• Allow spelling correction if an entity is a near match to one recognized by the grammar
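
As a rough illustration of the last point, the sketch below accepts a string when exactly one single-character edit turns it into something the grammar recognizes. The toy grammar `TOY_GRAMMAR` and the brute-force search are assumptions made for this example; the slides do not say how LeadMine implements spelling correction.

```python
import re
import string

# Toy stand-in for a chemical grammar (hypothetical, far smaller than a real one).
TOY_GRAMMAR = re.compile(r"(meth|eth|prop|but)an(e|ol)")


def single_edits(word):
    """Yield every string one deletion, substitution or insertion away from word."""
    for i in range(len(word) + 1):
        head, tail = word[:i], word[i:]
        if tail:
            yield head + tail[1:]                   # deletion
            for c in string.ascii_lowercase:
                yield head + c + tail[1:]           # substitution
        for c in string.ascii_lowercase:
            yield head + c + tail                   # insertion


def correct(word, grammar=TOY_GRAMMAR):
    """Return word if recognized, its unique one-edit correction, or None."""
    if grammar.fullmatch(word):
        return word
    matches = sorted({v for v in single_edits(word) if grammar.fullmatch(v)})
    return matches[0] if len(matches) == 1 else None


print(correct("methnol"))
```

Requiring a unique correction is a deliberately cautious choice here: an ambiguous near match is left unrecognized rather than guessed.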

Page 7: In grammars we trust: LeadMine, a knowledge driven solution

Simple grammar Example

Digit1to9 : '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'

Digit : Digit1to9 | '0'

Cid : 'CID:' Digit1to9 Digit*

[State machine diagram: C → I → D → : → 1..9 → 0..9 (repeating)]
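
For readers more familiar with regular expressions, the Cid rule corresponds to the pattern below. This is a sketch in Python, with `find_cids` as a made-up helper name; it scans the raw text directly, in the spirit of recognition without a tokenization step.

```python
import re

DIGIT_1_TO_9 = "[1-9]"
DIGIT = "[0-9]"
# Cid : 'CID:' Digit1to9 Digit*
CID = re.compile(f"CID:{DIGIT_1_TO_9}{DIGIT}*")


def find_cids(text):
    """Return every substring of text that the Cid rule accepts."""
    return [m.group(0) for m in CID.finditer(text)]


print(find_cids("see CID:2244 and CID:0123 and CID:3"))
```

Identifiers with a leading zero are rejected because the first digit must come from Digit1to9.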

Page 8: In grammars we trust: LeadMine, a knowledge driven solution

Grammar for IUPAC names

• Grammar for complete molecules: 485 rules

– trivialRing : 'aceanthren'|'aceanthrylen'|'acenaphthen'...

– ringGroup : trivialRing | hantzschWidmanRing | vonBaeyerSystem ...

• Generally aims to match a superset of the nomenclature covered by IUPAC

• Specifically, this is the superset that can theoretically be converted to structures

Page 9: In grammars we trust: LeadMine, a knowledge driven solution

Grammar inheritance

• Molecule grammar serves as a good starting point for a substituent grammar or generic chemical grammar

– Inherit rules rather than duplicate them

– Allow overriding of rules

pluralizedChemical : chemical 's'

elementaryMetalAtom : 'lanthanide'|'lanthanoid'|'transition metal'|'transuranic element' | _elementaryMetalAtom
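
One way to picture rule inheritance is a child grammar that copies its parent's rules and replaces some of them. The sketch below assumes that `_elementaryMetalAtom` refers to the parent's version of the overridden rule; that reading, the `inherit` function and the placeholder parent rule are all assumptions made for illustration.

```python
def inherit(parent, overrides):
    """Return a child grammar: the parent's rules plus the given overrides.

    Inside an override, the alternative '_name' stands for the parent's
    alternatives of rule 'name' (an assumed convention for this sketch).
    """
    child = dict(parent)
    for name, alternatives in overrides.items():
        expanded = []
        for alt in alternatives:
            if alt == "_" + name:
                expanded.extend(parent[name])   # splice in the inherited rule
            else:
                expanded.append(alt)
        child[name] = expanded
    return child


# Placeholder parent rule; the real molecule grammar is much larger.
molecule = {"elementaryMetalAtom": ["lithium", "sodium"]}
generic = inherit(molecule, {
    "elementaryMetalAtom": ["lanthanide", "lanthanoid", "transition metal",
                            "transuranic element", "_elementaryMetalAtom"],
})
print(generic["elementaryMetalAtom"])
```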

Page 10: In grammars we trust: LeadMine, a knowledge driven solution

Dictionaries… bigger is better

• For high recall of trivial names, dictionaries with high coverage are required.

• The largest publicly available dictionary is PubChem, with over 94 million terms

• However most of these terms are either not useful or actually detrimental to text mining

Page 11: In grammars we trust: LeadMine, a knowledge driven solution

Aggressive filtering

• “what you don't see won't hurt you”

• Hence remove terms that are also English words or that start with an English word

– Accomplished using a large English dictionary with chemistry terms removed

• Remove internal identifiers used by depositors

• Remove terms that are matched by our grammars

• Ultimate result: 94 million → 2.94 million terms
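
The filtering steps listed above could be combined into a single predicate along the following lines. The identifier pattern `DEPOSITOR_ID`, the tiny word list and the toy grammar are assumptions for this sketch; the slides do not specify how depositor identifiers are detected.

```python
import re

# Hypothetical shape of a depositor identifier, e.g. "MLS000123456".
DEPOSITOR_ID = re.compile(r"[A-Z]{2,}[-_ ]?\d{4,}")


def keep_term(term, english_words, grammar):
    """Return True if a dictionary term survives all three filters."""
    lowered = term.lower()
    first_word = re.split(r"[\s-]", lowered, maxsplit=1)[0]
    if lowered in english_words or first_word in english_words:
        return False        # an English word, or starts with one
    if DEPOSITOR_ID.fullmatch(term):
        return False        # internal identifier
    if grammar.fullmatch(term):
        return False        # the grammar already recognizes it
    return True


english = {"red", "lead"}                      # stand-in for a large word list
toy_grammar = re.compile(r"(meth|eth)an(e|ol)")
terms = ["aspirin", "red dye", "ethanol", "MLS000123456", "santalol"]
print([t for t in terms if keep_term(t, english, toy_grammar)])
```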

Page 12: In grammars we trust: LeadMine, a knowledge driven solution

Structure Aware filtering

• “Do not tag proteins, polypeptides (> 15aa), nucleic acid polymers, polysaccharides, oligosaccharides [tetrasaccharide or longer] and other biochemicals.”

• About 40,000 polypeptides and oligosaccharides excluded from PubChem using these criteria

Page 13: In grammars we trust: LeadMine, a knowledge driven solution

Entity Extension

• Even PubChem is far from comprehensive, so it can be useful to extend the start and/or end of entities to avoid partial hits

– α-santalol can be recognized from santalol in the dictionary

• Extension is bracketing aware and blocked by English words

• Entity trimming also performed to comply with the annotation guidelines

– ‘Allura Red AC dye’ → ‘Allura Red AC’
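
A minimal sketch of extension, assuming an entity is given as a character span and is grown to the surrounding whitespace unless that would pull in an English word or unbalance brackets. The function names and the word list are hypothetical, and trailing punctuation is not handled here.

```python
def brackets_balanced(s):
    """True if no bracket closes before it opens and none is left open.
    Bracket types are not distinguished in this sketch."""
    depth = 0
    for ch in s:
        if ch in "([{":
            depth += 1
        elif ch in ")]}":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def extend_entity(text, start, end, english_words):
    """Grow the span [start, end) to the surrounding whitespace if that is safe."""
    new_start, new_end = start, end
    while new_start > 0 and not text[new_start - 1].isspace():
        new_start -= 1
    while new_end < len(text) and not text[new_end].isspace():
        new_end += 1
    added = [text[new_start:start].strip("-").lower(),
             text[end:new_end].strip("-").lower()]
    if any(part in english_words for part in added):
        return start, end       # blocked by an English word
    if not brackets_balanced(text[new_start:new_end]):
        return start, end       # extension would leave brackets unbalanced
    return new_start, new_end


text = "The oil contains α-santalol mainly"
start = text.index("santalol")
new_start, new_end = extend_entity(text, start, start + len("santalol"), {"crude"})
print(text[new_start:new_end])
```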

Page 14: In grammars we trust: LeadMine, a knowledge driven solution

Entity Merging

• Adjacent entities may actually be part of one entity

– Ethyl ester → one entity

– (+)-limonene epoxide → one entity

BUT

– Hexane-benzene → two entities

Page 15: In grammars we trust: LeadMine, a knowledge driven solution

Using an ontology to determine when terms add information

• Genistein isoflavone → two entities

• Glycine ester → one entity

Genistein showing isoflavone core structure
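
The decision can be sketched as a lookup: merge two adjacent terms only when the second adds information that the first does not already imply. The `IS_A` table below is a hypothetical stand-in for a real chemical ontology.

```python
# Hypothetical is-a table: term -> classes the term already belongs to.
IS_A = {
    "genistein": {"isoflavone", "flavonoid"},
    "glycine": {"amino acid"},
}


def should_merge(first, second, is_a=IS_A):
    """Merge 'first second' into one entity only if second is new information."""
    return second.lower() not in is_a.get(first.lower(), set())


print(should_merge("Genistein", "isoflavone"))
print(should_merge("Glycine", "ester"))
```

Genistein is already an isoflavone, so "isoflavone" adds nothing and the two terms stay separate; glycine is not an ester, so "glycine ester" names a different compound and is merged.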

Page 16: In grammars we trust: LeadMine, a knowledge driven solution

Abbreviation detection

• Based on the Hearst and Schwartz algorithm

• Detects abbreviations of the following forms:

– Tetrahydrofuran (THF)

– THF (tetrahydrofuran)

– Tetrahydrofuran (THF;

– Tetrahydrofuran (THF,

– (tetrahydrofuran, THF)

– THF = tetrahydrofuran

Schwartz, A.; Hearst, M. Proceedings of the Pacific Symposium on Biocomputing 2003.
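
The core of the cited algorithm matches the characters of the short form right to left against the text preceding it, with the first character required to start a word. The sketch below covers only the "long form (SHORT)" pattern and omits the other forms listed above and the published validity checks; the function names are chosen for this example.

```python
import re


def best_long_form(short, candidate):
    """Return the shortest tail of candidate that short could abbreviate, or None."""
    s, l = len(short) - 1, len(candidate) - 1
    while s >= 0:
        ch = short[s].lower()
        if not ch.isalnum():
            s -= 1
            continue
        # Move left until ch is found; the first character of the short form
        # must additionally sit at the start of a word.
        while (l >= 0 and candidate[l].lower() != ch) or \
              (s == 0 and l > 0 and candidate[l - 1].isalnum()):
            l -= 1
        if l < 0:
            return None
        l -= 1
        s -= 1
    start = candidate.rfind(" ", 0, l + 1) + 1
    return candidate[start:]


def find_abbreviations(sentence):
    """Find 'long form (SHORT)' pairs in one sentence."""
    pairs = {}
    for m in re.finditer(r"\(([^()]+)\)", sentence):
        short = m.group(1).strip()
        words = sentence[:m.start()].split()
        window = min(len(short) + 5, len(short) * 2)   # search window in words
        long_form = best_long_form(short, " ".join(words[-window:]))
        if long_form:
            pairs[short] = long_form
    return pairs


print(find_abbreviations("The solid was dissolved in tetrahydrofuran (THF) and stirred."))
```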

Page 17: In grammars we trust: LeadMine, a knowledge driven solution

Domain-specific abbreviations

• Some abbreviations are not acronyms

• Can use string replacements to recognize them e.g.

– Sodium → Na

– Estradiol → E2

Hence can recognize: 17α-ethinylestradiol → EE2
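
A small sketch of the idea: rewrite the long form with the replacements, then check whether the characters of the abbreviation appear in order. The in-order check is a simplification chosen for this example and is much looser than a real abbreviation matcher; only the two replacements shown on the slide are used.

```python
REPLACEMENTS = {"sodium": "Na", "estradiol": "E2"}   # the two examples from the slide


def rewrite(long_form):
    """Apply the string replacements to a lower-cased long form."""
    out = long_form.lower()
    for name, symbol in REPLACEMENTS.items():
        out = out.replace(name, symbol)
    return out


def is_subsequence(short, text):
    """True if the characters of short occur in text in order, ignoring case."""
    remaining = iter(text.lower())
    return all(ch in remaining for ch in short.lower())


def could_abbreviate(short, long_form):
    return is_subsequence(short, rewrite(long_form))


print(could_abbreviate("EE2", "17α-ethinylestradiol"))
```

Without the replacement step the check fails, because the unmodified name contains no "2" after its letters.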

Page 18: In grammars we trust: LeadMine, a knowledge driven solution

Non-entity abbreviation removal

• Finds entities detected as abbreviations of unrecognized entities

– Can mean a common chemical abbreviation has been redefined in the scope of the document

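Example: in "current good manufacturing practice (cGMP)" the document defines cGMP as a non-chemical term, so cGMP is not tagged.

Usual chemical meaning: cGMP = cyclic guanosine monophosphate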
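
In code, the removal step might look like the sketch below, which assumes that abbreviation detection has already produced a short-form to long-form mapping for the document. The function and parameter names are hypothetical.

```python
def remove_redefined(entities, abbreviations, is_entity):
    """Drop recognized entities that this document defines as abbreviations
    of something that is not itself a recognized entity.

    entities: recognized entity strings
    abbreviations: short form -> long form found in this document
    is_entity: callable telling whether a long form is a recognized entity
    """
    blocked = {short for short, long_form in abbreviations.items()
               if not is_entity(long_form)}
    return [e for e in entities if e not in blocked]


known = {"ethanol", "cyclic guanosine monophosphate"}
found = ["cGMP", "ethanol"]
print(remove_redefined(found, {"cGMP": "current good manufacturing practice"},
                       lambda term: term in known))
```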

Page 19: In grammars we trust: LeadMine, a knowledge driven solution

Making the most of the knowledge provided

• Use training data to identify:

– Terms that are not currently recognized (whitelist)

– Terms that are often false positives (blacklist)

• Each false positive and false negative is placed into such a list if its inclusion increased F-score (harmonic mean of precision and recall)
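
The selection criterion can be written down directly from counts of true positives, false positives and false negatives. The bookkeeping below is one possible reading of the slide, with hypothetical function names: blacklisting a term removes all of its predictions, and whitelisting a term adds all of its occurrences as predictions.

```python
def f_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def blacklist_helps(tp, fp, fn, term_tp, term_fp):
    """Would suppressing a term (its correct and incorrect hits) raise F-score?"""
    return f_score(tp - term_tp, fp - term_fp, fn + term_tp) > f_score(tp, fp, fn)


def whitelist_helps(tp, fp, fn, term_gold, term_not_gold):
    """Would recognizing a missed term everywhere it occurs raise F-score?"""
    return f_score(tp + term_gold, fp + term_not_gold, fn - term_gold) > f_score(tp, fp, fn)


print(blacklist_helps(80, 20, 20, term_tp=0, term_fp=5))
print(whitelist_helps(80, 20, 20, term_gold=1, term_not_gold=10))
```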

Page 20: In grammars we trust: LeadMine, a knowledge driven solution

CEM Task Results (on development set)

Configuration          Precision  Recall  F-score
Baseline               0.87       0.82    0.84
WhiteList              0.86       0.85    0.86
BlackList              0.88       0.80    0.84
WhiteList + BlackList  0.87       0.83    0.85

Page 21: In grammars we trust: LeadMine, a knowledge driven solution

CDI task ranking

• Uses precision of entities when running against the development set with the results broken down by:

– Title vs abstract?

– Which dictionary matched?

– Were the entity’s bounds modified?

– Did the entity occur more than once in the document?
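
A sketch of this ranking, assuming each entity carries a tuple of the four features above (section, dictionary, bounds modified, repeated) and that development-set predictions are labelled correct or incorrect. The function names and the sample data are invented for illustration.

```python
from collections import defaultdict


def bucket_precision(dev_predictions):
    """dev_predictions: iterable of (features, is_correct) pairs.
    Returns the observed precision for each distinct feature tuple."""
    counts = defaultdict(lambda: [0, 0])        # features -> [correct, total]
    for features, is_correct in dev_predictions:
        counts[features][0] += int(is_correct)
        counts[features][1] += 1
    return {f: correct / total for f, (correct, total) in counts.items()}


def rank(entities, precision, default=0.0):
    """entities: iterable of (text, features); most trusted bucket first."""
    return sorted(entities, key=lambda e: precision.get(e[1], default), reverse=True)


grammar_title = ("title", "grammar", False, True)
dict_abstract = ("abstract", "dictionary", True, False)
dev = [(grammar_title, True), (grammar_title, True),
       (dict_abstract, True), (dict_abstract, False)]
precision = bucket_precision(dev)
print(rank([("x", dict_abstract), ("y", grammar_title)], precision))
```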

Page 22: In grammars we trust: LeadMine, a knowledge driven solution

Conclusions

• Grammars complement dictionaries to allow recognition of novel entities

• Both the coverage and quality of dictionaries are important

• The meaning of novel abbreviations can be determined algorithmically

• Entities can be classified based on the resource that recognized them

Page 23: In grammars we trust: LeadMine, a knowledge driven solution

Thank you for your time!

http://nextmovesoftware.com

http://nextmovesoftware.com/blog