Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél...

55
Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital, Medical Informatics Department, Germany Jena University, Language & Information Engineering (JULIE), Germany
  • date post

    20-Dec-2015
  • Category

    Documents

  • view

    215
  • download

    1

Transcript of Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél...

Page 1: Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,

Automatic Lexicon Acquisition for a Medical Cross-Language

Information Retrieval System

Kornél Markó, Stefan Schulz, Udo Hahn

Freiburg University Hospital, Medical Informatics Department, GermanyJena University, Language & Information Engineering (JULIE), Germany

Page 2: Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,

„Korrelation von Hypertonie und

Läsion der Weißen Substanz“

Multilingual Textretrieval

Page 3: Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,

„Korrelation von Hypertonie und

Läsion der Weißen Substanz“

Search Engine

„Correlation of high blood pressure and lesion of the white

substance“

Multilingual Textretrieval

Page 4: Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,

„Korrelation von Hypertonie und

Läsion der Weißen Substanz“

Search Engine

„Correlation of high blood pressure and lesion of the white

substance“

Multilingual Textretrieval

Page 5: Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,

„Korrelation von Hypertonie und

Läsion der Weißen Substanz“

Search Engine

„Correlation of high blood pressure and lesion of the white

substance“

Multilingual Textretrieval

Page 6: Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,

Linguistic Phenomena

• Morphological processes:– Inflection: leukocyte <> leukozytes,

appendix <> appendices– Derivation: leukocyte <> leukocytic– Composition: leuk|em|ia, para|sympath|ectomy,

Magen|schleim|haut|entzünd|ung

• Synonymy:– ascorbic acid <> vitamin C, hemorrhage <> bleeding

• Spelling variants:– oesophagus <> esophagus, – Karzinom <> Carcinom <> Carzinom (carcinoma)

Page 7: Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,

Subword Approach

• Subwords are atomic, conceptual or linguistic units:– Stems: stomach, gastr, diaphys– Prefixes: anti-, bi-, hyper- – Suffixes: -ary, -ion, -itis– Infixes: -o-, -s-

• Equivalence classes contain synonymous subwords and their translations in a thesaurus:

#female = { woman, women, female, frau, weib, mulher }

Page 8: Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,

Morphosaurus

• Subword-Lexicon:– Organizes subwords in several

languages (English, German, Portuguese)

• Subword-Thesaurus: – Groups synonymous subwords

(within and between languages)

• Subword-Segmenter:– Extraction of Subwords and

Assignment of Equivalence Classes

Morphosaurus- Identifier (MID)

Morphosaurus

Page 9: Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,

Example

high tsh value s suggest the diagnos is of primar y hypo thyroid ism

er hoeh te tsh wert e erlaub en die diagnos e einer primaer en hypo thyre ose

SegmenterSubword Lexicon

High TSH values suggest the diagnosis of primary hypo-thyroidism ...

Original

Erhöhte TSH-Werte erlauben die Diagnose einer primären Hypothyreose ...

high tsh values suggest the diagnosis of primary hypo-thyroidism ...

erhoehte tsh-werte erlauben die diagnose einer primaeren hypothyreose ...

Orthografic Rules

Orthographic Normalization

#up tsh #value #suggest #diagnost #primar #small #thyre

Interlingua

#up tsh #value #permit #diagnost #primar #small #thyre Subword

Thesaurus

Semantic Normalization

Page 10: Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,

Example

high tsh value s suggest the diagnos is of primar y hypo thyroid ism

er hoeh te tsh wert e erlaub en die diagnos e einer primaer en hypo thyre ose

SegmenterSubword Lexicon

High TSH values suggest the diagnosis of primary hypo-thyroidism ...

Original

Erhöhte TSH-Werte erlauben die Diagnose einer primären Hypothyreose ...

high tsh values suggest the diagnosis of primary hypo-thyroidism ...

erhoehte tsh-werte erlauben die diagnose einer primaeren hypothyreose ...

Orthografic Rules

Orthographic Normalization

#up tsh #value #suggest #diagnost #primar #small #thyre

Interlingua

#up tsh #value #permit #diagnost #primar #small #thyre

SubwordThesaurus

Semantic Normalization

Page 11: Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,

Morphosaurus Search

Page 12: Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,

Morphosaurus Search

Page 13: Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,

Morphosaurus Search

„Korrelation von Hypertonie und

Läsion der Weißen Substanz“

Page 14: Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,

Morphosaurus Search

„Korrelation von Hypertonie und

Läsion der Weißen Substanz“

„#correl #hyper #tens #lesion #whit

#matter“

Page 15: Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,

Morphosaurus Search

„Korrelation von Hypertonie und

Läsion der Weißen Substanz“

„#correl #hyper #tens #lesion #whit

#matter“

Search Engine

Page 16: Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,

Morphosaurus Search

„Korrelation von Hypertonie und

Läsion der Weißen Substanz“

Search Engine

„#correl #hyper #tens #lesion #whit

#matter“

Page 17: Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,

Automatic Language Acquisition

• Automatic Acquisition of Spanish and Swedish subword lexicons

• Step 1: Generation of cognate seed lexicons:– Automatic generation of cognate subword candidates

• Spanish cognates from Portuguese

• Swedish cognates from English and German

– Selection of subword candidates

– Semantic mapping (linkage to equivalence classes)

– Validation of semantic mappings

• Step 2: Use cognate lexicons as a seed for iteratively learning non-cognates

Page 18: Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,

Step 1: Cognate Acquisition

• Resources for cognate acquisition for Spanish (from Portuguese) and

Swedish (from English and German):

– Portuguese (~14,000 stems), German and English (~22,000 stems each) subword lexicons

– Medical corpora for Portuguese, German, English, Spanish and Swedish acquired from the Web

– Word frequency lists generated from these corpora– Manually created list of Spanish and Swedish affixes

Page 19: Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,

Generating Cognate Candidates

...estomagmulher...

List of Portuguese Subwords

(14,004 stems):

Page 20: Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,

Generating Cognate Candidates

...estomagmulher...

List of Portuguese Subwords

(14,004 stems):

Rule

(Port. » Span.)

Portuguese example

Spanish example

English equivalent

qua » cua quadr cuadr frame

eia » ena veia vena vein

ss » s fracass fracas fail

lh » j mulher mujer woman

l » ll lev llev take

i » y ensai ensay trial

f » h formig hormig ant

... ... ... ...

Application of 44 string

substitution rules

Page 21: Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,

Generating Cognate Candidates

...mulher...

List of Portuguese Subwords

(14,004 stems):

Application of 44 string

substitution rules

mulher muller mujer mulhier mulliermujier

Page 22: Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,

Selecting Cognate Candidates

...mulher...

List of Portuguese Subwords

(14,004 stems):

mulher muller mujer mulhier mulliermujier

...mulher 10/mmuller 23/mmujer 50/m...

...mulher 45/n...

Word frequency lists derived from unrelated corpora:

Portuguese (size = n) Spanish (size = m)

m ~ n

Comparison between word frequency lists:

– Elimination of non-matching subwords

– Choose that cognate alternative with the most similar corpus

frequency

Page 23: Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,

Selecting Cognate Candidates

...mulher...

List of Portuguese Subwords

(14,004 stems):

mulher

muller mujer mulhier mulliermujier

...mulher 10/mmuller 23/m

mujer 50/m...

...mulher 45/n...

Word frequency lists derived from unrelated corpora:

Portuguese (size = n) Spanish (size = m)

m ~ n

Comparison between word frequency lists:

– Elimination of non-matching subwords

– Choose that cognate alternative with the most similar corpus

frequency

Page 24: Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,

Semantic Mapping

mulher mujer

#female = { woman, women, female, frau, weib, mulher, mujer }

Page 25: Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,

Semantic Mapping

mulher mujer

#female = { woman, women, female, frau, weib, mulher, mujer }

Language Pair Source Lexicon Selected Cognates

Portuguese-

Spanish14,004 8,644

German-Swedish

English-Swedish

21,705

21,5016,086

Page 26: Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,

Cognate Validation

• Use parallel corpora to identify false friends:- Portuguese crianc (child) Spanish crianz (breed)- Portuguese crianc (child) Spanish nin (child)- Portuguese criac (breed) Spanish crianz (breed)

• UMLS Metathesaurus - Contains over 2M medical terms and phrases, aligned in various

languages- English has the broadest coverage

• English-Spanish: 60,526 alignments• English-Swedish: 10,953 alignmens

• English-Spanish Examples- „Cell Growth“ „Crecimiento Celular“- „Heart transplantation, with or without recipient cardiectomy“

„Transplante cardiaco, con o sin cardiectomia en el receptor.“

Page 27: Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,

Cognate Validation

• Use generated cognate seed lexicons to process the UMLS alignments with the Morphosaurus system

• Whenever a MID co-occurs on both sides of an alignment unit, the lexicon entry that led to that particular MID is taken to be valid

• Candidates that never matched this procedure are discarded

Page 28: Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,

Cognate Validation

„Abdominal wall procedure“:

|abdomin|al| #abdom |wall| #wall|proced|ure| #operat

• Use generated cognate seed lexicons to process the UMLS alignments with the Morphosaurus system

• Whenever a MID co-occurs on both sides of an alignment unit, the lexicon entry that led to that particular MID is taken to be valid

• Candidates that never matched this procedure are discarded

Page 29: Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,

Cognate Validation

„Abdominal wall procedure“:

|abdomin|al| #abdom |wall| #wall|proced|ure| #operat

„Cirugia de la pared abdominal“:

|cirug|ia||pared| #wall (port. pared)|abdomin|al| #abdom (port. abdomin)

• Use generated cognate seed lexicons to process the UMLS alignments with the Morphosaurus system

• Whenever a MID co-occurs on both sides of an alignment unit, the lexicon entry that led to that particular MID is taken to be valid

• Candidates that never matched this procedure are discarded

Page 30: Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,

Cognate Validation

„Abdominal wall procedure“:

|abdomin|al| #abdom |wall| #wall|proced|ure| #operat

„Cirugia de la pared abdominal“:

|cirug|ia||pared| #wall (port. pared)|abdomin|al| #abdom (port. abdomin)

• Use generated cognate seed lexicons to process the UMLS alignments with the Morphosaurus system

• Whenever a MID co-occurs on both sides of an alignment unit, the lexicon entry that led to that particular MID is taken to be valid

• Candidates that never matched this procedure are discarded

Page 31: Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,

Cognate Validation

„Abdominal wall procedure“:

|abdomin|al| #abdom |wall| #wall|proced|ure| #operat

„Cirugia de la pared abdominal“:

|cirug|ia||pared| #wall (port. pared)|abdomin|al| #abdom (port. abdomin)

• Use generated cognate seed lexicons to process the UMLS alignments with the Morphosaurus system

• Whenever a MID co-occurs on both sides of an alignment unit, the lexicon entry that led to that particular MID is taken to be valid

• Candidates that never matched this procedure are discarded

Page 32: Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,

Cognate Validation

„Abdominal wall procedure“:

|abdomin|al| #abdom |wall| #wall|proced|ure| #operat

„Cirugia de la pared abdominal“:

|cirug|ia||pared| #wall (port. pared)|abdomin|al| #abdom (port. abdomin)

Language Pair Hypotheses Valid

English-Spanish 8,644 3,230 (37%)

English-Swedish 6,086 1,565 (26%)

Page 33: Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,

Step 2: Bootstrapping

„Abdominal wall procedure“:

|abdomin|al| #abdom |wall| #wall|proced|ure| #operat

„Cirugia de la pared abdominal“:

|cirug|ia||pared| #wall (port. pared)|abdomin|al| #abdom (port. abdomin)

• Bootstrapping dictionaries using • validated cognate seed lexicons and • parallel corpora • for acquiring non-cognates

Page 34: Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,

For every alignment in the UMLS do

Bootstrapping Algorithm

„Abdominal wall procedure“:

|abdomin|al| #abdom |wall| #wall|proced|ure| #operat

„Cirugia de la pared abdominal“:

|cirug|ia||pared| #wall|abdomin|al| #abdom

Page 35: Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,

Bootstrapping Algorithm

„Abdominal wall procedure“:

|abdomin|al| #abdom |wall| #wall|proced|ure| #operat

„Cirugia de la pared abdominal“:

|cirug|ia||pared| #wall|abdomin|al| #abdom

For every alignment in the UMLS do

If there is exactly one invalid segmentation in target language

Page 36: Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,

Bootstrapping Algorithm

„Abdominal wall procedure“:

|abdomin|al| #abdom |wall| #wall|proced|ure| #operat

„Cirugia de la pared abdominal“:

|cirug|ia||pared| #wall|abdomin|al| #abdom

For every alignment in the UMLS do

If there is exactly one invalid segmentation in target language

If there is exactly one more MID in source language

Page 37: Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,

Bootstrapping Algorithm

„Abdominal wall procedure“:

|abdomin|al| #abdom |wall| #wall|proced|ure| #operat

„Cirugia de la pared abdominal“:

|cirug|ia||pared| #wall|abdomin|al| #abdom

For every alignment in the UMLS do

If there is exactly one invalid segmentation in target language

If there is exactly one more MID in source language

Take supernumerary MID and invalid segmentation from target

Page 38: Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,

Bootstrapping Algorithm

„Abdominal wall procedure“:

|abdomin|al| #abdom |wall| #wall|proced|ure| #operat

„Cirugia de la pared abdominal“:

|cirug|ia| |cirug|ia||pared| #wall|abdomin|al| #abdom

For every alignment in the UMLS do

If there is exactly one invalid segmentation in target language

If there is exactly one more MID in source language

Take supernumerary MID and invalid segmentation from target

Restore invalid segmentation and strip off potential affixes

Page 39: Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,

Bootstrapping Algorithm

„Abdominal wall procedure“:

|abdomin|al| #abdom |wall| #wall|proced|ure| #operat

„Cirugia de la pared abdominal“:

|cirug|ia| |cirug|ia||pared| #wall|abdomin|al| #abdom

#operat = { proced, surgery, operat, prozess, operier, proced, process,

metod, cirug }

For every alignment in the UMLS do

If there is exactly one invalid segmentation in target language

If there is exactly one more MID in source language

Take supernumerary MID and invalid segmentation from target

Restore invalid segmentation and strip off potential affixes

Add new stem into target lexicon. Link it to source MID.

Page 40: Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,

Bootstrapping Algorithm

„Abdominal wall procedure“: „Cirugia de la pared abdominal“:

For every alignment in the UMLS do

If there is exactly one invalid segmentation in target language

If there is exactly one more MID in source language

Take supernumerary MID and invalid segmentation from target

Restore invalid segmentation and strip off potential affixes

Add new stem into target lexicon. Link it to source MID.

Repeat all until quiescence

Page 41: Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,

Bootstrapping Algorithm

„Abdominal wall procedure“:„Skin operations“:

|skin| #derma|operat|ions| #operat

„Cirugia de la pared abdominal“:„Cirugia de piel“:

|cirug|ia| #operat|piel|

For every alignment in the UMLS do

If there is exactly one invalid segmentation in target language

If there is exactly one more MID in source language

Take supernumerary MID and invalid segmentation from target

Restore invalid segmentation and strip off potential affixes

Add new stem into target lexicon. Link it to source MID.

Repeat all until quiescence

Page 42: Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,

Bootstrapping Algorithm

„Abdominal wall procedure“:„Skin operations“:

|skin| #derma|operat|ions| #operat

„Cirugia de la pared abdominal“:„Cirugia de piel“:

|cirug|ia| #operat|piel| |piel|

For every alignment in the UMLS do

If there is exactly one invalid segmentation in target language

If there is exactly one more MID in source language

Take supernumerary MID and invalid segmentation from target

Restore invalid segmentation and strip off potential affixes

Add new stem into target lexicon. Link it to source MID.

Repeat all until quiescence

#derma = { derm, cutis, skin, haut, kutis, pele, cutis, piel }

Page 43: Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,

Bootstrapping Algorithm

„Abdominal wall procedure“:„Skin operations“: „Skin abnormalities“:

|skin| #derma|abnorm|alities| #anomal

„Cirugia de la pared abdominal“:„Cirugia de piel“:„Malformacion de la piel“:

|malformation||piel| #derma

For every alignment in the UMLS do

If there is exactly one invalid segmentation in target language

If there is exactly one more MID in source language

Take supernumerary MID and invalid segmentation from target

Restore invalid segmentation and strip off potential affixes

Add new stem into target lexicon. Link it to source MID.

Repeat all until quiescence

Page 44: Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,

Bootstrapping Algorithm

„Abdominal wall procedure“:„Skin operations“: „Skin abnormalities“:

|skin| #derma|abnorm|alities| #anomal

„Cirugia de la pared abdominal“:„Cirugia de piel“:„Malformacion de la piel“:

|malformation| |malform|ation||piel| #derma

For every alignment in the UMLS do

If there is exactly one invalid segmentation in target language

If there is exactly one more MID in source language

Take supernumerary MID and invalid segmentation from target

Restore invalid segmentation and strip off potential affixes

Add new stem into target lexicon. Link it to source MID.

Repeat all until quiescence

#anormal = { abnorm, anomal, abnorm, anomal, abnorm, anomal, malform }

Page 45: Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,

Bootstrapping Results

Dictionary Growth Steps

0

1000

2000

3000

4000

5000

6000

7000

8000

1 2 3 4 5

Step

Nu

mb

er

of

en

trie

s

English-Spanish (n=60,526)

English-Swedish (n=10,953)

Total: 7,154 Spanish and 4,148 Swedish entries acquired

Page 46: Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,

Evaluation

• Process the English-Spanish and English-Swedish UMLS alignments with the Morphosaurus system

• Additionally process Spanish-Swedish UMLS alignments

• Measures:– Coverage: At least one MID co-occurs on both sides– Consistency

• A: Number of MIDs co-occuring on both sides• N,M: Number of MIDs occuring on only one side

– Identical Indexes

)(

)*100()(

MNA

AC iAU

Page 47: Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,

Results

0

10

20

30

40

50

60

70

80

90

100

English-German(n=34,296)

English-Spanish(n=60,526)

English-Swedish(n=10,953)

Spanish-Swedish(n=8,993)

Per

cent

Coverage

Consistency

Identical

Page 48: Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,

Conclusion

• Cross-Language Document Retrieval based on the matching of search/document terms on a language-independent, interlingual layer.

• Significant amount of English, German, and Portuguese subwords can be mapped to Spanish and Swedish cognates using simple string substitution rules.

• These cognate seed lexicons are further enlarged by subword translations which are not cognates by bootstrapping and using parallel corpora.

• Methodology proved to be useful in a standardized CLIR experimental setting

• Generality of the approach:– Need of large, aligned corpora– Eurodicautum (12 languages, 5M entries), Eurovoc (13 languages), OECD,

UNESCO, AGROVOC, etc.

Page 49: Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,

www.morphosaurus.net

Page 50: Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,

Morphosaurus Search

Page 51: Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,

Evaluation

• OHSUMED-Corpus (Hersh et al., 1994)– Subset of MEDLINE– ~233,000 English documents – 106 English user queries, additionally translated to German,

Portuguese, Spanish and Swedish by medical experts – query-document pairs have been manually judged for

relevance

• Search Engine: Lucene

– http://lucene.apache.org/

Page 52: Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,

Evaluation

• Baseline: monolingual text retrieval– (stemmed) English user queries– (stemmed) English texts

• Query translation (QTR)– Google translator– Multilingual dictionary compiled from UMLS

• Morphosaurus Indexing (MSI)– Interlingual representation of both user

queries and documents

Page 53: Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,

0

0,1

0,2

0,3

0,4

0,5

0 0,1 0,2 0,3 0,4 0,5 0,6 0,7 0,8 0,9 1

Recall

Prec

isio

n

Baseline

Portuguese MSI

Portuguese QTR

0

0,1

0,2

0,3

0,4

0,5

0 0,1 0,2 0,3 0,4 0,5 0,6 0,7 0,8 0,9 1

Recall

Prec

isio

n

Baseline

German MSI

German QTR

Evaluation Results

German (n = 22,385) Portuguese (n = 14,862)Top 200

95% of Baseline

60% of Baseline

78% of Baseline

52% of Baseline

Page 54: Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,

0

0,1

0,2

0,3

0,4

0,5

0 0,1 0,2 0,3 0,4 0,5 0,6 0,7 0,8 0,9 1

Recall

Prec

isio

n

Baseline

Swedish MSI

Swedish QTR

0

0,1

0,2

0,3

0,4

0,5

0 0,1 0,2 0,3 0,4 0,5 0,6 0,7 0,8 0,9 1

Recall

Prec

isio

n

Baseline

Spanish MSI

Spanish QTR

0

0,1

0,2

0,3

0,4

0,5

0 0,1 0,2 0,3 0,4 0,5 0,6 0,7 0,8 0,9 1

Recall

Prec

isio

n

Baseline

Portuguese MSI

Portuguese QTR

0

0,1

0,2

0,3

0,4

0,5

0 0,1 0,2 0,3 0,4 0,5 0,6 0,7 0,8 0,9 1

Recall

Prec

isio

n

Baseline

German MSI

German QTR

Evaluation Results

German (n = 22,385) Portuguese (n = 14,862)Top 200

Spanish (n = 7,154) Swedish (n = 4,148)Top 200

95% of Baseline

60% of Baseline

78% of Baseline

52% of Baseline

69% of Baseline

40% of Baseline

27% of Baseline

2% of Baseline

Page 55: Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,

Semantic Mapping

mulher mujer

#female = { woman, women, female, frau, weib, mulher, mujer }

Language Pair Source Lexicon Selected Cognates

Linked MIDs

Portuguese-

Spanish14,004 8,644 6,036

German-Swedish

English-Swedish

21,705

21,501

4,249

4,140

3,308

3,208

Combined Swedish Evidence

(set union)6,086 4,157