In grammars we trust: LeadMine, a knowledge driven solution

BioCreative IV workshop, DoubleTree by Hilton Hotel, Washington DC, USA 8th October 2013

Daniel Lowe and Roger Sayle

NextMove Software

Cambridge, UK


Approaches to Entity recognition

• Dictionary based

• Grammar based

• Machine Learning



Normalization

Input → Normalized

œstradiol → oestradiol

5` or 5’ or 5′ (backtick/quotation mark/prime) → 5'

<p>H<sub>2</sub>O</p> → H2O
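The normalization step above can be sketched in a few lines. This is a minimal illustration, not LeadMine's implementation: the `CHAR_MAP` table and the `normalize` function are hypothetical names, and the table covers only the three cases listed on the slide.

```python
import html
import re

# Hypothetical replacement table covering only the examples on the slide.
CHAR_MAP = {
    "\u0153": "oe",  # the ligature in "œstradiol"
    "`": "'",        # backtick used as a prime
    "\u2019": "'",   # right single quotation mark
    "\u2032": "'",   # prime
}

# Matches any markup tag such as <p>, </p>, <sub>.
TAG_RE = re.compile(r"<[^>]+>")


def normalize(text: str) -> str:
    """Strip markup, decode character references, then map variant characters."""
    text = TAG_RE.sub("", text)
    text = html.unescape(text)
    return "".join(CHAR_MAP.get(ch, ch) for ch in text)
```

Stripping tags before mapping characters means that subscripts written as markup collapse to plain digits, which is what the H2O row shows.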


Figure legend: blue = grammars; green = traditional dictionaries; orange = blocking dictionaries


Advantages of grammars

• Don’t require annotated corpora

• Encode knowledge about the domain

• Very fast recognition

• Allow spelling correction if an entity is a near match to one recognized by the grammar


Simple grammar Example

Digit1to9 : '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'

Digit : Digit1to9 | '0'

Cid : 'CID:' Digit1to9 Digit*

State machine (figure): C → I → D → ':' → 1..9 → 0..9 (repeated)
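The Cid rule is a regular language, so it can be written either as a regular expression or as an explicit state machine. The sketch below shows both; `CID_RE` and `matches_cid` are illustrative names, and nothing here reflects how LeadMine compiles its grammars.

```python
import re

# 'CID:' followed by one digit 1..9, then any number of digits 0..9.
CID_RE = re.compile(r"CID:[1-9][0-9]*")


def matches_cid(s: str) -> bool:
    """Hand-written state machine that accepts exactly the Cid rule."""
    prefix = "CID:"
    # States 0..3 read the prefix, state 4 expects 1..9,
    # state 5 is accepting and reads further digits 0..9.
    state = 0
    for ch in s:
        if state < 4:
            if ch != prefix[state]:
                return False
            state += 1
        elif state == 4:
            if ch not in "123456789":
                return False
            state = 5
        elif ch not in "0123456789":
            return False
    return state == 5
```

Because the first digit must be 1..9, an identifier with a leading zero such as CID:0123 is rejected by both forms.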


Grammar for IUPAC names

• Grammar for complete molecules: 485 rules

– trivialRing : 'aceanthren'|'aceanthrylen'|'acenaphthen'...

– ringGroup : trivialRing | hantzschWidmanRing | vonBaeyerSystem ...

• Generally aims to match a superset of the nomenclature covered by IUPAC

• Specifically, this is the superset that can theoretically be converted to structures


Grammar inheritance

• Molecule grammar serves as a good starting point for a substituent grammar or generic chemical grammar

– Inherit rules rather than duplicate them

– Allow overriding of rules

pluralizedChemical : chemical 's'

elementaryMetalAtom : 'lanthanide'|'lanthanoid'|'transition metal'|'transuranic element' | _elementaryMetalAtom
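One way to picture rule inheritance is a child grammar that copies the parent's rules and replaces only the ones it overrides. The sketch below assumes that a leading underscore, as in _elementaryMetalAtom, refers to the inherited version of the rule; that reading, the `inherit` function, and the two parent alternatives are assumptions made for illustration.

```python
def inherit(parent, overrides):
    """Build a child grammar from a parent grammar plus overriding rules.

    Grammars are modelled as dicts mapping a rule name to a list of
    alternatives. An alternative named '_' + rule name is expanded to the
    parent's alternatives for that rule (assumed meaning of the underscore).
    """
    child = dict(parent)  # inherit every rule without duplicating its text
    for name, alternatives in overrides.items():
        expanded = []
        for alt in alternatives:
            if alt == "_" + name:
                expanded.extend(parent[name])
            else:
                expanded.append(alt)
        child[name] = expanded
    return child


# Hypothetical parent rule with two placeholder alternatives.
molecule = {"elementaryMetalAtom": ["sodium", "iron"]}

generic = inherit(
    molecule,
    {
        "elementaryMetalAtom": [
            "lanthanide",
            "lanthanoid",
            "transition metal",
            "transuranic element",
            "_elementaryMetalAtom",
        ]
    },
)
```

The parent grammar is left untouched, so the molecule grammar and the generic grammar can be used side by side.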


Dictionaries… bigger is better

• For high recall of trivial names, dictionaries with high coverage are required.

• The largest publicly available dictionary is PubChem with over 94 million terms

• However, most of these terms are either not useful or actually detrimental to text mining


Aggressive filtering

• “what you don't see won't hurt you”

• Hence remove terms that are also English words or that start with an English word

– Accomplished using a large English dictionary with chemistry terms removed

• Remove internal identifiers used by depositors

• Remove terms that are matched by our grammars

• Ultimate result: 94 million → 2.94 million
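The filtering rules above can be sketched as a single pass over the dictionary. Everything in this example is a stand-in: `filter_terms` is a hypothetical name, "starts with an English word" is approximated by the first whitespace-separated token, and the identifier pattern and the toy grammar check are invented for the demonstration.

```python
import re


def filter_terms(terms, english_words, is_internal_id, grammar_matches):
    """Keep only the dictionary terms that survive all three filters."""
    kept = []
    for term in terms:
        tokens = term.lower().split()
        if not tokens:
            continue
        # Drop terms that are English words or start with one.
        if term.lower() in english_words or tokens[0] in english_words:
            continue
        # Drop depositor-internal identifiers.
        if is_internal_id(term):
            continue
        # Drop terms the grammars already recognize.
        if grammar_matches(term):
            continue
        kept.append(term)
    return kept


# Toy inputs, all hypothetical.
english = {"white", "was"}
looks_like_id = lambda t: re.fullmatch(r"[A-Z]{2,}\d{5,}", t) is not None
toy_grammar = lambda t: t.endswith("ethanol")

result = filter_terms(
    ["aspirin", "white spirit", "was", "MLS000028456", "2-chloroethanol"],
    english,
    looks_like_id,
    toy_grammar,
)
```

With these toy inputs only "aspirin" survives, which mirrors the large reduction reported on the slide.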


Structure Aware filtering

• “Do not tag proteins, polypeptides (> 15aa), nucleic acid polymers, polysaccharides, oligosaccharides [tetrasaccharide or longer] and other biochemicals.”

• About 40,000 polypeptides and oligosaccharides excluded from PubChem using these criteria


Entity Extension

• Even PubChem is far from comprehensive, so it can be useful to extend the start and/or end of entities to avoid partial hits

– α-santalol can be recognized from santalol in the dictionary

• Extension is bracketing aware and blocked by English words

• Entity trimming also performed to comply with the annotation guidelines

– ‘Allura Red AC dye’ → ‘Allura Red AC’
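A rough sketch of entity extension follows, under several assumptions: a dictionary hit is grown outwards to the nearest whitespace, an extension that is an English word is blocked, and an extension that leaves brackets unbalanced is rejected. The function names and the exact blocking rules are hypothetical, and trailing sentence punctuation is not handled.

```python
def balanced(s):
    """Return True if brackets never close before opening and all are closed."""
    depth = 0
    for ch in s:
        if ch in "([{":
            depth += 1
        elif ch in ")]}":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def extend_entity(text, start, end, english_words):
    """Return a (start, end) span extended around text[start:end]."""
    # Grow the span outwards until whitespace is reached on each side.
    new_start = start
    while new_start > 0 and not text[new_start - 1].isspace():
        new_start -= 1
    new_end = end
    while new_end < len(text) and not text[new_end].isspace():
        new_end += 1
    # Block an extension that consists of an ordinary English word.
    if text[new_start:start].strip("-").lower() in english_words:
        new_start = start
    if text[end:new_end].strip("-").lower() in english_words:
        new_end = end
    # Reject an extension that would leave brackets unbalanced.
    if not balanced(text[new_start:new_end]):
        return start, end
    return new_start, new_end
```

For the text "contains α-santalol today" and a dictionary hit on "santalol", the span grows leftwards to cover "α-santalol".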


Entity Merging

• Adjacent entities may actually be part of one entity

– Ethyl ester → one entity

– (+)-limonene epoxide → one entity

BUT

– Hexane-benzene → two entities


Using an ontology to determine when terms add information

• Genistein isoflavone → two entities

• Glycine ester → one entity

Genistein showing isoflavone core structure


Abbreviation detection

• Based on the Hearst and Schwartz algorithm

• Detects abbreviations of the following forms:

– Tetrahydrofuran (THF)

– THF (tetrahydrofuran)

– Tetrahydrofuran (THF;

– Tetrahydrofuran (THF,

– (tetrahydrofuran, THF)

– THF = tetrahydrofuran

Schwartz, A.; Hearst, M. Proceedings of the Pacific Symposium on Biocomputing 2003.
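The core of the cited algorithm is a right-to-left character match between the short form and the text preceding it. The sketch below is a Python port of that matching step as described by Schwartz and Hearst, plus a simple wrapper that handles only the "long form (SHORT" patterns closed by ")", ";" or ","; the other forms listed above are not covered. Whether LeadMine uses this exact logic is not stated on the slides, and the function names are illustrative.

```python
import re


def find_best_long_form(short_form, long_form):
    """Match short_form characters right to left inside long_form."""
    l_index = len(long_form) - 1
    for s_index in range(len(short_form) - 1, -1, -1):
        curr = short_form[s_index].lower()
        if not curr.isalnum():
            continue
        # Move left until the character matches; the first character of the
        # short form must additionally sit at the start of a word.
        while (l_index >= 0 and long_form[l_index].lower() != curr) or (
            s_index == 0 and l_index > 0 and long_form[l_index - 1].isalnum()
        ):
            l_index -= 1
        if l_index < 0:
            return None
        l_index -= 1
    start = long_form.rfind(" ", 0, l_index + 1) + 1
    return long_form[start:]


# A parenthetical that is closed by ')' or cut short by ';' or ','.
PAREN_RE = re.compile(r"\(([^();,]+)[);,]")


def find_abbreviations(sentence):
    """Return {short form: long form} for 'long form (SHORT' patterns."""
    pairs = {}
    for m in PAREN_RE.finditer(sentence):
        short = m.group(1).strip()
        words = sentence[: m.start()].split()
        # Search window suggested in the cited paper: min(|A| + 5, 2 * |A|) words.
        window = min(len(short) + 5, len(short) * 2)
        candidate = " ".join(words[-window:])
        long_form = find_best_long_form(short, candidate)
        if long_form:
            pairs[short] = long_form
    return pairs
```

Calling `find_abbreviations("The solvent was tetrahydrofuran (THF) at reflux")` pairs THF with tetrahydrofuran.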


Domain-specific abbreviations

• Some abbreviations are not acronyms

• Can use string replacements to recognize them e.g.

– Sodium → Na

– Estradiol → E2

Hence can recognize: 17α-ethinylestradiol → EE2
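The effect of string replacements can be shown with a small check: rewrite the long form using the replacement table, then test whether the abbreviation's characters occur in order in the result. This is a deliberately loose sketch; the subsequence test is weaker than a real abbreviation matcher, and `REPLACEMENTS`, `apply_replacements` and `could_abbreviate` are hypothetical names.

```python
import re

# Replacement table limited to the two examples on the slide.
REPLACEMENTS = {"sodium": "Na", "estradiol": "E2"}


def apply_replacements(long_form):
    """Rewrite known names inside long_form with their domain symbols."""
    out = long_form
    for name, symbol in REPLACEMENTS.items():
        out = re.sub(re.escape(name), symbol, out, flags=re.IGNORECASE)
    return out


def is_subsequence(short, text):
    """True if the characters of short appear in text in the same order."""
    remaining = iter(text.lower())
    return all(ch in remaining for ch in short.lower())


def could_abbreviate(short, long_form):
    """Check the abbreviation against the long form after replacements."""
    return is_subsequence(short, apply_replacements(long_form))
```

Without the replacement, EE2 cannot be matched to 17α-ethinylestradiol because the long form contains no "2"; after rewriting estradiol as E2, the match succeeds.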


Non-entity abbreviation removal

• Finds entities detected as abbreviations of unrecognized entities

– Can mean a common chemical abbreviation has been redefined in the scope of the document

Example: current good manufacturing practice (cGMP)

Here cGMP ≠ cyclic guanosine monophosphate, so it is not tagged as a chemical
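A minimal sketch of this removal step: once abbreviation pairs have been detected for a document, any abbreviation whose long form is not itself a recognized entity is blocked for that document. The function name, the input shapes, and the toy `is_chemical` check are assumptions.

```python
def remove_redefined(entities, abbreviations, is_chemical):
    """Drop entity hits whose text was defined as a non-chemical abbreviation.

    entities: list of entity strings found in one document
    abbreviations: {short form: long form} detected in the same document
    is_chemical: callable returning True if a long form is a recognized entity
    """
    blocked = {
        short
        for short, long_form in abbreviations.items()
        if not is_chemical(long_form)
    }
    return [entity for entity in entities if entity not in blocked]


# Toy recognizer standing in for the real entity recognition.
known = {"ethanol", "cyclic guanosine monophosphate"}
is_known = lambda name: name.lower() in known

kept = remove_redefined(
    ["cGMP", "ethanol", "cGMP"],
    {"cGMP": "current good manufacturing practice"},
    is_known,
)
```

The block list is built per document, so cGMP is still tagged in documents that do not redefine it.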


Making the most of the knowledge provided

• Use training data to identify:

– Terms that are not currently recognized (whitelist)

– Terms that are often false positives (blacklist)

• Each false negative or false positive term is added to the corresponding list only if its inclusion increases the F-score (harmonic mean of precision and recall)
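The inclusion test can be sketched with plain counts. Here a candidate term is described by how often it occurs in the training data as a true entity (`gold_hits`) and as a non-entity (`spurious_hits`); these names and the per-term evaluation are assumptions, since the slides do not describe the procedure in detail.

```python
def f_score(tp, fp, fn):
    """Harmonic mean of precision and recall from raw counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def whitelist_helps(tp, fp, fn, gold_hits, spurious_hits):
    """Whitelisting turns gold_hits misses into hits and adds spurious_hits errors."""
    before = f_score(tp, fp, fn)
    after = f_score(tp + gold_hits, fp + spurious_hits, fn - gold_hits)
    return after > before


def blacklist_helps(tp, fp, fn, gold_hits, spurious_hits):
    """Blacklisting removes spurious_hits errors but loses gold_hits correct hits."""
    before = f_score(tp, fp, fn)
    after = f_score(tp - gold_hits, fp - spurious_hits, fn + gold_hits)
    return after > before
```

For example, with 80 true positives, 10 false positives and 20 false negatives, whitelisting a term that is correct 5 times and wrong once raises the F-score, while a term that is correct once and wrong 10 times lowers it.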


CEM Task Results (on development set)

Configuration | Precision | Recall | F-score

Baseline | 0.87 | 0.82 | 0.84

WhiteList | 0.86 | 0.85 | 0.86

BlackList | 0.88 | 0.80 | 0.84

WhiteList + BlackList | 0.87 | 0.83 | 0.85


CDI task ranking

• Uses precision of entities when running against the development set with the results broken down by:

– Title vs abstract?

– Which dictionary matched?

– Were the entity’s bounds modified?

– Did the entity occur more than once in the document?
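The ranking idea can be sketched as follows: group development-set entities by the four features above, measure precision per group, and order new entities by the precision of the group they fall into. The feature tuple layout, the function names and the toy data are assumptions.

```python
from collections import defaultdict


def bucket_precision(dev_entities):
    """Precision per feature bucket from (features, is_correct) pairs."""
    counts = defaultdict(lambda: [0, 0])  # bucket -> [correct, total]
    for features, is_correct in dev_entities:
        counts[features][1] += 1
        if is_correct:
            counts[features][0] += 1
    return {bucket: correct / total for bucket, (correct, total) in counts.items()}


def rank(entities, precision, default=0.0):
    """Sort (name, features) pairs by the precision of their bucket, best first."""
    return sorted(entities, key=lambda e: precision.get(e[1], default), reverse=True)


# Features: (section, dictionary, bounds modified?, occurs more than once?)
A = ("title", "grammar", False, True)
B = ("abstract", "pubchem", True, False)

precision = bucket_precision([(A, True), (A, True), (B, True), (B, False)])
ranked = rank([("x", B), ("aspirin", A)], precision)
```

Buckets never seen in the development set fall back to the `default` score, so such entities are ranked last.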


Conclusions

• Grammars complement dictionaries to allow recognition of novel entities

• Both the coverage and quality of dictionaries are important

• The meaning of novel abbreviations can be determined algorithmically

• Entities can be classified based on the resource that recognized them


Thank you for your time!

http://nextmovesoftware.com

http://nextmovesoftware.com/blog

