Sub-language Processing for phenotype curation

Hong Cui, University of Arizona. Phenotype RCN, Feb 25-27, 2013.


Page 1: Sub-language Processing for phenotype  curation

Hong Cui, University of Arizona

Phenotype RCN Feb 25-27, 2013

SUB-LANGUAGE PROCESSING FOR PHENOTYPE CURATION


Agenda
• CharaParser
• Methodology
• Evaluation
• Applications
• CharaParser for Phenoscape
• New modules
• Evaluations
• Challenges


“Fine-Grained Semantic Mark-up”
• To annotate factual information from textual morphological descriptions of biodiversity in such detail that the machine-readable annotation itself provides information equivalent to the original text.


An Example


Previous Research
• Syntactic parsing approach (Taylor, 1995; Abascal & Sánchez, 1999; Vanel, 2004)
• Interactive extraction (Diederich, Fortuner & Milton, 1999)
• Semi-supervised bootstrapping for lexicons (Riloff, 1999)
• Supervised regular expression rule learning (Soderland, 1999; Tang & Heidorn, 2008)
• Ontology-driven and parallel text (Woods et al., 2004)
• Supervised association rule learning (Cui & Heidorn, 2007)


General-Purpose Parsers?


CharaParser Approach
1. Unsupervised machine learning to find anatomy and character terms from descriptions automatically
  • No need to prepare training examples
  • 50%-80% of terms learned
2. General-purpose syntactic parser (e.g., Stanford Parser) to parse the syntactic structure of sentences
  • No need to create a special-purpose, domain-dependent parser
  • The lexicon learned in step 1 is used to adapt the parser for biodiversity domains
3. Intuitive rules to produce annotations from parse trees


Unsupervised lexicon learning

If it is known “roots” is an organ:

• Roots yellow to medium brown or black, thin.
• Petals yellow or white
• Petals absent;
• Subtending bracts absent;
• Abaxial hastula absent;
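The inference chain in these example sentences can be sketched in a few lines of Python. This is only a toy illustration of the bootstrapping idea, not CharaParser's actual learning algorithm: the two adjacency rules, the `STOPWORDS` set, and the function name `bootstrap_lexicon` are assumptions made for this example.

```python
import re

# Toy stopword list (an assumption for this sketch).
STOPWORDS = {"to", "or", "and", "of", "with"}

SENTENCES = [
    "Roots yellow to medium brown or black, thin.",
    "Petals yellow or white",
    "Petals absent;",
    "Subtending bracts absent;",
    "Abaxial hastula absent;",
]

def bootstrap_lexicon(sentences, seed_organs):
    """Grow organ and state word sets from a few seed organs.

    Rule A: a word right after a known organ is taken to be a state.
    Rule B: a word right before a known state is taken to be an organ.
    The rules are applied repeatedly until nothing new is learned.
    """
    organs = {w.lower() for w in seed_organs}
    states = set()
    changed = True
    while changed:
        changed = False
        for sentence in sentences:
            words = re.findall(r"[a-z]+", sentence.lower())
            for left, right in zip(words, words[1:]):
                if left in STOPWORDS or right in STOPWORDS:
                    continue
                if left in organs and right not in organs and right not in states:
                    states.add(right)   # Rule A
                    changed = True
                elif right in states and left not in organs and left not in states:
                    organs.add(left)    # Rule B
                    changed = True
    return organs, states

organs, states = bootstrap_lexicon(SENTENCES, {"roots"})
print(sorted(organs))  # ['bracts', 'hastula', 'petals', 'roots']
print(sorted(states))  # ['absent', 'yellow']
```

Starting from "roots" alone, "yellow" is learned as a state, which makes "petals" an organ, which makes "absent" a state, which in turn makes "bracts" and "hastula" organs.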


CharaParser: Term Reviewer


Ontology Term Organizer


Compared against a Heuristics-Based Method

• Parser performance evaluated on the same data sets.
• CharaParser: unsupervised learning + Stanford Parser
• Heuristics-based: unsupervised learning + regular expression rules


Annotation Problems
• Chunk errors:
  • Leaves oblanceolate to lanceolate, largest 14–20(–40) × 3–4(–5) mm, pliant;
• Attachment errors:
  • on outer cypselae, crowns of bristlelike scales ca. 0.5 mm; on inner, of dusky white or pale yellow, plumose bristles 5–6 mm.
• Semantics:
  • straight posterolateral bounding ridges to subtriangular, bilobed ventral muscle field;


Applications at Various Development Stages
• Convert XML markup to
  • SDD for identification key generation
  • Character matrices for tree of life
  • RDF for the Semantic Web and search
• Use marked-up descriptions to support search
  • FNA Experimental Search
  • Data source is RDF triples
  • Allow character-based search
    • Plants that give yellow flowers at 200-400 meter elevation in April in North Carolina
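A character-based search over triples, such as the yellow-flower query above, might look roughly like the following sketch. It uses plain Python tuples rather than a real RDF store; the predicate names, the taxon placeholders (`taxon_A` etc.), and every value in `TRIPLES` are made up for illustration and do not come from the FNA data.

```python
# Hypothetical (subject, predicate, object) triples; all names and values are invented.
TRIPLES = [
    ("taxon_A", "flower_color", "yellow"),
    ("taxon_A", "elevation_m", (100, 300)),
    ("taxon_A", "flowering_month", "April"),
    ("taxon_A", "distribution", "North Carolina"),
    ("taxon_B", "flower_color", "yellow"),
    ("taxon_B", "elevation_m", (900, 1500)),
    ("taxon_B", "flowering_month", "April"),
    ("taxon_B", "distribution", "North Carolina"),
    ("taxon_C", "flower_color", "white"),
    ("taxon_C", "elevation_m", (200, 400)),
    ("taxon_C", "flowering_month", "April"),
    ("taxon_C", "distribution", "North Carolina"),
]

def values(subject, predicate):
    """All objects recorded for one subject/predicate pair."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]

def ranges_overlap(a, b):
    """True when two closed (low, high) ranges share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

def search(color, elevation, month, region):
    """Return subjects whose triples satisfy every criterion."""
    hits = []
    for subject in sorted({s for s, _, _ in TRIPLES}):
        if (color in values(subject, "flower_color")
                and month in values(subject, "flowering_month")
                and region in values(subject, "distribution")
                and any(ranges_overlap(r, elevation)
                        for r in values(subject, "elevation_m"))):
            hits.append(subject)
    return hits

print(search("yellow", (200, 400), "April", "North Carolina"))  # ['taxon_A']
```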


To-Dos
• Tighter integration of ontologies in annotation process.
  • Currently internal glossaries are used in place of ontologies to link a character state (e.g., “red”) to a character (“color”)
  • Synonyms are not controlled
    • “Petiolate” = “with petiole”
• Continue to reduce annotation errors
• Accommodate various syntactic styles
  • Diagnosis paragraphs
  • Comparison among different taxa
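The glossary lookup and the synonym problem mentioned in the first To-Do item can be illustrated with a minimal sketch. The "red" to "color" pair and the "petiolate" / "with petiole" equivalence come from the slide; the other entries, the dictionary names, and the helper functions are assumptions for illustration only.

```python
# State -> character glossary. Only "red" -> "color" is from the slide;
# the remaining entries are hypothetical.
STATE_TO_CHARACTER = {
    "red": "color",
    "yellow": "color",
    "petiolate": "architecture",
}

# Controlled synonyms: variant phrase -> preferred state term.
SYNONYMS = {"with petiole": "petiolate"}

def normalize(term):
    """Lower-case, collapse whitespace, and map a variant to its preferred term."""
    cleaned = " ".join(term.lower().split())
    return SYNONYMS.get(cleaned, cleaned)

def character_for(state):
    """Return the character a state belongs to, or None if it is not in the glossary."""
    return STATE_TO_CHARACTER.get(normalize(state))

print(character_for("red"))           # color
print(character_for("with petiole"))  # architecture
print(character_for("crinkly"))       # None
```

Without the `SYNONYMS` step, "with petiole" and "Petiolate" would be treated as unrelated states, which is the uncontrolled-synonym problem the slide points to.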


Phenotype Curation
• Convert character and character state information from natural language descriptions to EQ (Entity-Quality) statements


Curator Mental Process

[Diagram: Read description → Identify key phrases (raw EQ) → Ontologized EQ; Ontologies]


Adapted CharaParser

[Diagram: Character Description and State Descriptions → CharaParser → XML to Raw EQs → Raw EQs to Final EQs; Ontologies]
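The last two stages of this pipeline, from raw EQs (key phrases) to final EQs (ontology terms), can be pictured with a small data-structure sketch. The class names `RawEQ` and `FinalEQ`, the lookup tables, and the `EX_...` identifiers are all placeholders invented for this example; they are not real ontology IDs or CharaParser types.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class RawEQ:
    """Entity and quality as phrases taken from the description."""
    entity: str
    quality: str

@dataclass(frozen=True)
class FinalEQ:
    """Entity and quality as ontology IDs; None when no class was found."""
    entity_id: Optional[str]
    quality_id: Optional[str]

# Hypothetical label -> ID tables standing in for the anatomy and quality ontologies.
ENTITY_IDS = {"dorsal fin": "EX_ENTITY:1"}
QUALITY_IDS = {"absent": "EX_QUALITY:1"}

def to_final(raw: RawEQ) -> FinalEQ:
    """Map each phrase of a raw EQ to an ontology ID by exact label lookup."""
    return FinalEQ(
        entity_id=ENTITY_IDS.get(raw.entity.lower()),
        quality_id=QUALITY_IDS.get(raw.quality.lower()),
    )

print(to_final(RawEQ("Dorsal fin", "absent")))
print(to_final(RawEQ("dorsal fin", "crinkly")))  # quality_id is None: no matching class
```

Keeping the raw and final forms as separate types makes it easy to score the two stages independently.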


Evaluations
• Internal evaluation:
  • The development corpus (three publications on fishes and archosaurs) provided 1,200 character descriptions. 100 of them were included in the internal evaluation benchmark.
  • Raw EQ performance: 90%
  • Final EQ performance: 50%
• BioCreative 2012 evaluation:
  • 50 descriptions independently selected by the organizer (>50% of Qs were not in ontologies)
  • Gold standard created by the chief Phenoscape curator (raw and final)
  • Three biocurators worked in two modes (Phenex vs. Phenex+CharaParser)
  • Raw EQ performance: CharaParser better than biocurators
  • Final EQ performance: biocurators better than CharaParser
  • Inter-curator agreements:


Inter-Curator Agreements

                  Precision   Recall
Curator 1 vs 2    39          49
Curator 1 vs 3    47          56
Curator 2 vs 3    77          71
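The slides do not say how these agreement figures were computed. One common way to compare two curators, shown here purely as an assumption, is to treat one curator's EQ set as the reference and score the other against it by exact match. The EQ tuples below are invented placeholders.

```python
def precision_recall(candidate, reference):
    """Score one set of EQ statements against another by exact match.

    precision = shared / size of candidate set
    recall    = shared / size of reference set
    """
    candidate, reference = set(candidate), set(reference)
    shared = candidate & reference
    precision = len(shared) / len(candidate) if candidate else 0.0
    recall = len(shared) / len(reference) if reference else 0.0
    return precision, recall

# Hypothetical (entity, quality) pairs from two curators.
curator_a = {("E1", "Q1"), ("E2", "Q2"), ("E3", "Q3"), ("E4", "Q4")}
curator_b = {("E1", "Q1"), ("E2", "Q2"), ("E5", "Q5")}

p, r = precision_recall(curator_a, curator_b)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.67
```

Under this definition precision and recall differ whenever the two curators produce different numbers of statements, which would be consistent with the asymmetric values in the table.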


Error Analyses
• Various fixable syntactic problems
  • E.g., “digits I-III”
• Curation granularity
  • CharaParser generated more candidate EQs than curators
  • “Preopercular latero-sensory canal leaves preopercle at first exit and enters a plate: yes/no”
• Annotating relations (relational quality)
  • “contact between …”


Ontology Access
• Currently use keyword-based search
  • Class labels and exact, narrower, and related synonyms
• False positives
  • acute (shape) =? acute (process)
  • "margin" is a broad synonym of "marginal zone of embryo" in UBERON
• Pre-composed terms in ontology
  • “ceratobranchial 5 tooth”, “rib of vertebra 5”, “body of humerus”
• Ambiguous term use in descriptions
  • ‘epibranchial 1’ => epibranchial 1 element? bone? cartilage?
• No matching
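A keyword lookup over class labels and synonyms, and the way synonym scope affects false positives, can be sketched as follows. The mini-ontology is hypothetical: only the "margin" / "marginal zone of embryo" relationship is taken from the slide, the `EX:` IDs are placeholders, and the slide does not say which synonym scope actually produced the false match.

```python
from collections import defaultdict

# Hypothetical mini-ontology. Synonyms are grouped by scope.
CLASSES = [
    {"id": "EX:1", "label": "marginal zone of embryo",
     "synonyms": {"broad": ["margin"]}},
    {"id": "EX:2", "label": "epibranchial 1 bone", "synonyms": {}},
    {"id": "EX:3", "label": "epibranchial 1 cartilage", "synonyms": {}},
]

def build_index(classes, scopes):
    """Map each lower-cased label or in-scope synonym to the class labels it names."""
    index = defaultdict(set)
    for cls in classes:
        index[cls["label"].lower()].add(cls["label"])
        for scope in scopes:
            for synonym in cls["synonyms"].get(scope, []):
                index[synonym.lower()].add(cls["label"])
    return index

def search(index, phrase):
    """Exact keyword lookup; returns matching class labels in sorted order."""
    return sorted(index.get(phrase.lower().strip(), set()))

narrow_index = build_index(CLASSES, ("exact", "narrow", "related"))
wide_index = build_index(CLASSES, ("exact", "narrow", "related", "broad"))

print(search(narrow_index, "margin"))        # []
print(search(wide_index, "margin"))          # ['marginal zone of embryo']
print(search(wide_index, "epibranchial 1"))  # [] (no class has exactly this label)
```

The last query shows the other failure mode on the slide: an exact keyword match cannot decide between the bone and the cartilage, so nothing is returned.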


Exploration of Solutions
• Experimented with
  • Word sense disambiguation:
    • “crinkly” not in PATO
    • Candidate matches: [undulate->1.00000000000002] [obovate->1.00000000000001] [flat->1.00000000000001] [flattened->1] [circinate->0.884697579551583]
• Experimenting with
  • Subsets
    • Specify included classes: e.g. classes related to vertebrates
    • Specify excluded classes: e.g. exclude certain developmental stages
• Ideas to try out:
  • Bootstrapping to narrow down the search space
    • starting from known classes
    • evaluating candidate matches based on their distances to the known classes and other sources of evidence.
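The subset idea (included and excluded classes) amounts to filtering scored candidates by their position in the class hierarchy. The sketch below assumes a simple single-parent is_a table; the class names, scores, and function names are invented for illustration and are not taken from any real ontology or from CharaParser.

```python
# Hypothetical child -> parent (is_a) links.
PARENTS = {
    "fin": "vertebrate structure",
    "gastrula fold": "gastrula stage",
    "gastrula stage": "developmental stage",
    "petal": "plant structure",
}

def lineage(cls):
    """The class itself followed by all of its ancestors."""
    chain = [cls]
    while cls in PARENTS:
        cls = PARENTS[cls]
        chain.append(cls)
    return chain

def in_subset(cls, included, excluded):
    """Keep a class unless it falls under an excluded class; if an include
    list is given, the class must also fall under one of its members."""
    chain = lineage(cls)
    if any(c in excluded for c in chain):
        return False
    return not included or any(c in included for c in chain)

def filter_candidates(scored, included=(), excluded=()):
    """Drop out-of-subset candidates and sort the rest by descending score."""
    included, excluded = set(included), set(excluded)
    kept = [(name, score) for name, score in scored
            if in_subset(name, included, excluded)]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

candidates = [("gastrula fold", 0.95), ("fin", 0.90), ("petal", 0.80)]
print(filter_candidates(candidates, excluded={"developmental stage"}))
# [('fin', 0.9), ('petal', 0.8)]
print(filter_candidates(candidates,
                        included={"vertebrate structure"},
                        excluded={"developmental stage"}))
# [('fin', 0.9)]
```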


Annotation consistency
• Instructions given to human curators are helpful to CharaParser
• Restricted relation list:
  • http://phenoscape.org/wiki/Guide_to_Character_Annotation#Relations_used_for_post-compositions


Feed more info to EQ generation module

[Diagram: Character Description and State Descriptions → CharaParser → XML to EQs → Raw EQs to Final EQs; Ontologies]


Recent Improvements
• Explorer of Taxon Concepts project
  • Making it a pure-Java program/web-based application
    • Currently requires MySQL + Perl
  • Making it faster
    • Optimization of the program
    • Removing MySQL and reducing I/O
    • “Parallel” computing using Java threads
• Preliminary evaluation shows
  • 20 times faster: 2 sec/taxon description
  • Memory requirements increased threefold


Acknowledgements
• Fine-Grained Semantic Markup Project (current and past)
  • James Macklin: Agriculture and Agri-Food Canada
  • Robert (Bob) Morris, Alex Dusenbery: UMass-Boston
  • Hariharan Gopalakrishnan, Zilong Chang, Thomas Rodenhausen, Mohan Krishna Gowda, Partha Pratim Sanyal, Chunshui Yu: University of Arizona
• Phenoscape Project
  • Chris Mungall: Lawrence Berkeley National Lab
  • Melissa Haendel: Oregon Health & Science University
  • Paula Mabee, Alex Dececchi: University of South Dakota
  • Jim Balhoff, Wasila Dahdul, Hilmar Lapp, Todd Vision: NESCent
• NSF ABI and EF Programs
• The Flora of North America Project