Semi-Automated Extension of a Specialized Medical Lexicon for French
Bruno Cartoni & Pierre Zweigenbaum
LIMSI-CNRS, France
2
Outline
Context: UMLF for French
  The desired coverage
  The target lexical information
  The organisation of a specialised lexicon
Acquiring lexical information
  Initial coverage
  Obtaining lexical entries from general lexicon
  Guessing technique
Results
  Consensus guessing
  Acquisition of the full paradigm
  General improvement
Conclusion and further work
3
Context: the InterSTIS project
InterSTIS: development of a terminology server for French medical terminologies
Sub-project: improving the lexical coverage of a French medical lexicon (UMLF: Unified Medical Lexicon for French)
Use: support the indexing of medical texts
Issues: What is the desired lexical knowledge? How can it be acquired?
4
The desired coverage
Reference: “Term-Union”
  Union of 10 terminologies (CIM-10, SNOMED, MeSH, CISMeF, …) of French medical domains, organised around concept identifiers (CUI) of the UMLS
  311,518 terms
  203,300 unique concepts (CUI)
  94,964 word forms
5
Term-Union: example
C0000936 MSHFRE … Accommodation de l'oeil
C0000936 MSHFRE … Accommodation des yeux
C0000936 MSHFRE … Accommodation oculaire
C0000936 SNMIGIPFRE … accommodation visuelle
...
C00001558 MSHF … Voie cutanée
C00001558 MSHF … Voie intradermique
C00001558 MSHF … Voie percutanée
C00001558 MSHF … Voie transcutanée
Observation of term variation
6
Target lexical information
Term variation within Term-Union
  Graphemic: équilibre acido-basique - équilibre acidobasique [EN: acid-base balance]
  Morphosyntactic: adaptation de l'oeil - adaptation des yeux [EN: eye adaptation]
  Morphosemantic: intoxication à l’alcool - intoxication alcoolique [EN: alcohol intoxication]
  Others ...
7
Organisation of the specialised lexicon
3 types of relational tables for the 3 levels of representation (graphemic, inflection, derivation)
A full-entry lexicon (LMF compliant) that gathers all lexical information

Graphemic:
  …
  inter-maxillaire | intermaxillaire
  insulino-sécrétantes | insulinosécrétantes
  scléro-cornéenne | sclérocornéenne
  …

Derivation:
  ...
  abdominal | abdomen
  aplasique | aplasie
  arachnoïdien | arachnoïde
  argentique | argent
  …

Inflection:
  …
  sérofibrineux | sérofibrineux | Afpms
  sérofibrineuse | sérofibrineux | Afpfs
  sérofibrineux | sérofibrineux | Afpmp
  sérofibrineuses | sérofibrineux | Afpfp
  …
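As a rough illustration of how the three relational tables could work together, the sketch below chains a graphemic lookup, an inflection lookup and a derivation lookup. The table contents are the sample rows from this slide; the dictionary layout, the `describe` helper and the direction of the graphemic mapping (hyphenated to unhyphenated) are assumptions made for the example, not the project's actual data model.

```python
# Sample rows from the slide, stored as plain dictionaries (assumed layout).
GRAPHEMIC = {
    "inter-maxillaire": "intermaxillaire",
    "insulino-sécrétantes": "insulinosécrétantes",
    "scléro-cornéenne": "sclérocornéenne",
}
DERIVATION = {
    "abdominal": "abdomen",
    "aplasique": "aplasie",
    "arachnoïdien": "arachnoïde",
    "argentique": "argent",
}
# form -> (lemma, tag); the ambiguous form "sérofibrineux" is left out here
# because a plain dict cannot hold its two analyses (Afpms and Afpmp).
INFLECTION = {
    "sérofibrineuse": ("sérofibrineux", "Afpfs"),
    "sérofibrineuses": ("sérofibrineux", "Afpfp"),
}


def describe(form):
    """Hypothetical helper: gather what the three tables say about a form."""
    canonical = GRAPHEMIC.get(form, form)           # graphemic level
    lemma, tag = INFLECTION.get(canonical, (canonical, None))  # inflection level
    base = DERIVATION.get(lemma)                    # derivation level
    return {"form": form, "canonical": canonical,
            "lemma": lemma, "tag": tag, "base": base}


entry = describe("sérofibrineuses")
```

A full-entry lexicon would presumably precompute such records for every form; the sketch only shows the lookups.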
9
Acquiring the lexical information
Initial coverage of UMLF (from the previous UMLF project, based on Baud et al. 1998)
  17,192 lexical units
    5,353 adjectives
    11,799 nouns
  36,211 word forms
10
Acquiring the lexical information
From a general lexicon
  Existing French general lexicon (Morphalou)
With a guessing technique
11
Acquiring the lexical information
From a guessing technique (Tanguy & Hathout 2007)
3 steps:
  Learning phase: calculating the most frequent tag for each ending string in 2 existing lexicons
  Guessing phase: assigning possible tag(s)
  Cross-validation: comparing the 2 guesses obtained from the 2 lexicons
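The three steps can be sketched in a few lines of Python. This is a minimal illustration, not the Tanguy & Hathout implementation: the maximum ending length, the longest-ending-wins rule, the single tag per ending, the toy lexicons and the noun tag `Ncfs` are all assumptions made for the example.

```python
from collections import Counter, defaultdict

MAX_SUFFIX = 5  # assumed cap on ending length; not stated in the slides


def learn(lexicon, max_suffix=MAX_SUFFIX):
    """Learning phase: map each ending string to its most frequent tag."""
    counts = defaultdict(Counter)
    for form, tag in lexicon:
        for n in range(1, min(max_suffix, len(form)) + 1):
            counts[form[-n:]][tag] += 1
    return {ending: tags.most_common(1)[0][0] for ending, tags in counts.items()}


def guess(form, model, max_suffix=MAX_SUFFIX):
    """Guessing phase: return the tag of the longest known ending, or None."""
    for n in range(min(max_suffix, len(form)), 0, -1):
        tag = model.get(form[-n:])
        if tag is not None:
            return tag
    return None


def consensus(form, model_a, model_b):
    """Cross-validation: keep a guess only when both models agree."""
    tag_a, tag_b = guess(form, model_a), guess(form, model_b)
    return tag_a if tag_a is not None and tag_a == tag_b else None


# Toy lexicons, invented for illustration (one "general", one "medical").
general = [("nerveux", "Afpms"), ("nerveuse", "Afpfs"), ("maison", "Ncfs")]
medical = [("fibrineux", "Afpms"), ("fibrineuse", "Afpfs"), ("aplasie", "Ncfs")]

model_general = learn(general)
model_medical = learn(medical)
result = consensus("séreuse", model_general, model_medical)
```

Here both toy models know the ending `euse`, so the unknown form receives a tag; a form whose endings are known to only one model gets no consensus tag.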
12
Acquiring the lexical information
Acquiring the full paradigm
  All the inflectional forms
  Lemma
Based on “productive” inflectional paradigms
  9 for adjectives
  3 for nouns
Algorithm based on lexical tries to cluster forms of the same paradigm
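One way to picture the trie-based clustering: insert every word form into a character trie, then look for a node whose set of remaining endings matches a known paradigm. The sketch below makes several assumptions: the two paradigms are hypothetical stand-ins for the 9 adjective and 3 noun paradigms (which the slides do not list), matching is exact (so incomplete paradigms are not captured), and `min_stem` is an invented guard against very short stems.

```python
def build_trie(forms):
    """Build a character trie; the key "$" marks the end of a word."""
    root = {}
    for form in forms:
        node = root
        for ch in form:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root


def suffixes(node, prefix=""):
    """Yield every ending that completes a word from this trie node."""
    for key, child in node.items():
        if key == "$":
            yield prefix
        else:
            yield from suffixes(child, prefix + key)


# Hypothetical paradigms: name -> set of endings attached to a common stem.
PARADIGMS = {
    "adj -eux/-euse": {"eux", "euse", "euses"},
    "adj -/e/s/es": {"", "e", "s", "es"},
}


def cluster(forms, paradigms=PARADIGMS, min_stem=3):
    """Group forms whose endings under a shared stem match a paradigm exactly."""
    clusters = []

    def walk(node, stem):
        if len(stem) >= min_stem:
            endings = set(suffixes(node))
            for name, expected in paradigms.items():
                if endings == expected:
                    clusters.append((stem, name, sorted(stem + e for e in endings)))
                    return  # stop descending once a paradigm is found
        for key, child in node.items():
            if key != "$":
                walk(child, stem + key)

    walk(build_trie(forms), "")
    return clusters


forms = ["sérofibrineux", "sérofibrineuse", "sérofibrineuses",
         "cutané", "cutanée", "cutanés", "cutanées", "abdomen"]
clusters = cluster(forms)
```

Replacing the equality test with a subset test would be one way to also capture incomplete paradigms, at the cost of more spurious matches.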
14
Acquisition from general lexicon: results
Source          Known words entries   Remaining words to describe
Term-Union                            94,964
Initial UMLF    19,599                81,595
Morphalou       6,617                 74,978
15
Acquisition with guessing techniques: results
74,978 unknown forms
  44,515 analyses from the Morphalou-based program
  35,438 analyses from the UMLF-based program
  Cross-validation: 30,137 in common
16
Acquisition with guessing techniques: evaluation
Errors: 82 out of 1,000 (8.2%)

Error type              Count
Wrong label             12
Proper names            49
Latin words             5
English words           1
Spelling/segmentation   10
Other                   5
Total                   82
17
Acquisition of the full paradigm: Results
4,453 paradigms captured (incomplete or not, grouping 9,352 word forms)
  3,308 adjectives
  514 nouns
Automatic extension for the full paradigms (with canonical forms only)
Manually checked for the others
18
General improvement
Source        Forms added   Still unknown in Term-Union   Coverage
UMLF-v1       36,211        81,595                        14.1%
Morphalou     17,828        74,978                        21.0%
Acquisition   8,088         70,602                        25.7%
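The coverage percentages are consistent with the share of the 94,964 Term-Union word forms that are no longer unknown after each step. The slides do not state the formula, so the short check below is an inference from the reported figures rather than the authors' definition.

```python
TOTAL_FORMS = 94_964  # Term-Union word forms, as reported earlier in the slides

# "Still unknown in Term-Union" after each source is added (from the table).
still_unknown = {"UMLF-v1": 81_595, "Morphalou": 74_978, "Acquisition": 70_602}


def coverage(unknown, total=TOTAL_FORMS):
    """Percentage of Term-Union word forms that have a lexical description."""
    return 100 * (total - unknown) / total


coverages = {source: round(coverage(n), 1) for source, n in still_unknown.items()}
```

Under this reading, the three rounded values match the Coverage column (14.1, 21.0 and 25.7).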
20
Discussion and conclusion
The acquisition and evaluation of specialised lexical resources require a specific reference: Term-Union
  Extract (full) lexical information
  Assess lexical needs and target
Other acquisition techniques (CRF for inflectional information, rule-based techniques for derivational information)
21
Acknowledgment
This work was partially funded by the InterSTIS project (ANR-07-TECSAN-010)
InterSTIS project: www.interstis.org