Page 1:

Remove Parenthesis Plural Forms of (s), (es), and (ies)

By: Chris Lu, Guy Divita, Allen Browne

Date: 12.13.2004

Page 2:

Table of Contents

• Background
• Problems
• Objective
• Methods
• Results
• Future work

Page 3:

Background

Norm:
• is the most commonly used program in Lvg
• is used to create the normalized string and word indexes to the UMLS Metathesaurus
• is used to access those indexes in the UMLS Metathesaurus
• includes 10 lvg flows (2004)

Page 4:

Background – Cont.

Norm:
1. Remove genitives
2. Replace punctuation with spaces
3. Remove stop words
4. Strip diacritics
5. Split ligatures
6. Lowercase
7. Uninflect each word
8. Retrieve citation
9. Word sort
10. Retrieve Unicode symbol
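A few of the steps listed above can be illustrated with a toy sketch. This is not the Lvg implementation: the function name `toy_norm` and the tiny stop-word list are assumptions made for illustration, the steps run in a simplified order, and the steps that need a lexicon or mapping tables (uninflection, citation retrieval, ligature splitting, Unicode symbols) are left out.

```python
import string
import unicodedata

# Hypothetical, tiny stop-word list for illustration only.
STOP_WORDS = {"of", "and", "with", "the", "for"}

def toy_norm(term: str) -> str:
    """Toy version of a few Norm steps; not the Lvg implementation."""
    # Remove genitives ('s and s').
    text = term.replace("'s", "").replace("s'", "s")
    # Replace every ASCII punctuation character with a space.
    text = text.translate(str.maketrans({ch: " " for ch in string.punctuation}))
    # Strip diacritics: decompose, then drop the combining marks.
    text = "".join(
        c for c in unicodedata.normalize("NFKD", text)
        if not unicodedata.combining(c)
    )
    # Lowercase, drop stop words, then sort the remaining words.
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    return " ".join(sorted(words))

print(toy_norm("Crohn's Disease"))               # crohn disease
print(toy_norm("Burn of wrist(s) and hand(s)"))  # burn hand s s wrist
```

In this toy version, replacing punctuation with spaces turns each "(s)" into a stray "s" token, which is the kind of side effect the rest of the presentation deals with.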

Page 5:

Background – Cont.

Plural forms with parentheses
• (s):
  • Accessory finger(s)
  • Addiction, drug(s)
  • Burn of wrist(s) and hand(s)
• (es):
  • Abdomen CT Adrenal Mass(es) Bilateral
  • Provide picture of fetus(es), as appropriate
  • sequelae of; injury, nerve, roots and plexus(es), spinal
• (ies):
  • Donor pneumonectomy(ies) with preparation and maintenance of allograft (cadaver)
  • Orthotic(s) fitting and training, upper extremity(ies), lower extremity(ies), and/or trunk, each 15 minutes

Page 6:

Problems

• No flow in lvg handles this issue
• Can we simply remove (s), (es), (ies) to get the uninflected form without changing the word?
  • (es), (ies): no problem
  • (s): ?

Page 7:

Challenge

How about:
• 1-N-(s)-4-amino-2-hydroxybutyryl-3'4'-deoxyneamine
• 9(s)-erythromycylamine
• anatoxin-b(s)
• Ap(s)pCHClpp(s)A
• Bacillus phage rho11(s)
• Cbz-AAPhepsi((s)-CH(OH)CH2)GlyVV-OMe
• EAV G(s) glycoprotein
• G(s), alpha Subunit
• Histone H1(s)
• J(s)(b) ANTIBODY
• N(alpha)-benzoylarginineamide monohydrochloride, (s)-isomer
• natoxin-a(s)
• Salmonella II 6,7:(g),m,(s),t:1,5
• (s)-(+)-citreofuran
• su(s) protein, Drosophila
• XLalpha(s) protein
• [X]O spontn disrptn/lig(s)knee
• O spontn disrptn/lig(s)knee

Page 8:

Challenge – Cont.

• Do not remove (s) in chemical, protein, gene, mathematics, etc. terms
• Sometimes (s) should be replaced by a space instead of being removed
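The challenge can be made concrete with a naive sketch that deletes every "(s)", "(es)", and "(ies)" regardless of context. The function name `naive_remove` is an assumption for illustration; the sample terms are taken from the slides.

```python
import re

# Naive approach: delete every "(s)", "(es)", or "(ies)", ignoring context.
NAIVE_PLURAL = re.compile(r"\((?:s|es|ies)\)")

def naive_remove(term: str) -> str:
    """Remove all parenthesized plural markers without any checks."""
    return NAIVE_PLURAL.sub("", term)

print(naive_remove("Accessory finger(s)"))          # Accessory finger
print(naive_remove("9(s)-erythromycylamine"))       # 9-erythromycylamine
print(naive_remove("O spontn disrptn/lig(s)knee"))  # O spontn disrptn/ligknee
```

The first result is the wanted uninflected form. The second changes a chemical name, and the third fuses two words that should be separated by a space.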

Page 9:

Objective

• Remove parenthesis plural forms of (s), (es), (ies)
• Do not remove (s) in chemical, protein, gene, etc. terms
• Replace (s) with a space where appropriate
• Fast performance
• High precision

Page 10:

Scope

• UMLS Metathesaurus: 2.8 M terms
• Lexicon: 0.8 M inflected terms
• Total: 3.6 M terms
• Terms with (s), (es), (ies) patterns: ~2800

Page 11:

Methods - Pattern Observation

• XLalpha(s) protein

• su(s) protein, Drosophila

• (s)-(+)-citreofuran

• Salmonella II 6,7:(g),m,(s),t:1,5

• natoxin-a(s)

• N(alpha)-benzoylarginineamide monohydrochloride, (s)-isomer

• J(s)(b) ANTIBODY

• Histone H1(s)

• G(s), alpha Subunit

• EAV G(s) glycoprotein

• Cbz-AAPhepsi((s)-CH(OH)CH2)GlyVV-OMe

• Bacillus phage rho11(s)

• Ap(s)pCHClpp(s)A

• anatoxin-b(s)

• 9(s)-erythromycylamine

• 1-N-(s)-4-amino-2-hydroxybutyryl-3'4'-deoxyneamine

Pages 12-13:

Pattern Observation – (1)

Sample Term                                                   Word Size   Distance
9(s)-erythromycylamine                                        1           1
Ap(s)pCHClpp(s)A                                              2           1
EAV G(s) glycoprotein                                         1           1
G(s), alpha Subunit                                           1           1
Histone H1(s)                                                 2           1
J(s)(b) ANTIBODY                                              1           1
N(alpha)-benzoylarginineamide monohydrochloride, (s)-isomer   0           1
(s)-(+)-citreofuran                                           0           1
su(s) protein, Drosophila                                     2           1

• The size of the word in front of (s) is less than or equal to 2
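The word-size observation can be sketched as a small check. The function names are hypothetical, and "word" is assumed here to mean the run of non-space characters directly in front of the first (s); the slides do not spell out the exact definition.

```python
def word_size_before(term: str, marker: str = "(s)") -> int:
    """Length of the run of non-space characters directly before the first marker.

    Raises ValueError if the marker does not occur in the term.
    """
    prefix = term[: term.index(marker)]
    # Everything after the last space is the word in front of (s).
    return len(prefix) - (prefix.rfind(" ") + 1)

def is_protected_by_word_size(term: str) -> bool:
    """Pattern (1): a word of size 0, 1, or 2 in front of (s)."""
    return word_size_before(term) <= 2

print(word_size_before("EAV G(s) glycoprotein"))         # 1
print(word_size_before("su(s) protein, Drosophila"))     # 2
print(word_size_before("(s)-(+)-citreofuran"))           # 0
print(is_protected_by_word_size("Accessory finger(s)"))  # False
```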

Pages 14-15:

Pattern Observation – (2)

Sample Term               Character         Distance
9(s)-erythromycylamine    Arabic numeral 9  1
Bacillus phage rho11(s)   Arabic numeral 1  1
Histone H1(s)             Arabic numeral 1  1

• The character in front of (s) is an Arabic numeral
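A minimal sketch of this observation, with a hypothetical function name; it looks only at the first (s) in the term and accepts only the ASCII digits 0-9.

```python
def digit_before(term: str, marker: str = "(s)") -> bool:
    """Pattern (2): the character directly in front of the first (s) is 0-9."""
    i = term.find(marker)
    # i > 0 guarantees both that the marker exists and that a character precedes it.
    return i > 0 and term[i - 1] in "0123456789"

print(digit_before("9(s)-erythromycylamine"))   # True
print(digit_before("Bacillus phage rho11(s)"))  # True
print(digit_before("Accessory finger(s)"))      # False
```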

Pages 16-17:

Pattern Observation – (3)

Sample Term                                          Character       Distance
1-N-(s)-4-amino-2-hydroxybutyryl-3'4'-deoxyneamine   Punctuation -   1
anatoxin-b(s)                                        Punctuation -   2
Cbz-AAPhepsi((s)-CH(OH)CH2)GlyVV-OMe                 Punctuation (   1
natoxin-a(s)                                         Punctuation -   2
Salmonella II 6,7:(g),m,(s),t:1,5                    Punctuation ,   1

• Punctuation is in front of (s) within distance 1 or 2
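The distance column can be reproduced with a short sketch. The function name is hypothetical, and the punctuation set is assumed to be the three characters used elsewhere in the slides: hyphen, opening parenthesis, and comma.

```python
from typing import Optional

# Assumed punctuation set: hyphen, opening parenthesis, comma.
PUNCTUATION = "-(,"

def punctuation_distance(term: str, marker: str = "(s)") -> Optional[int]:
    """Pattern (3): return 1 or 2 if punctuation sits that far in front of
    the first (s); return None otherwise."""
    i = term.find(marker)
    if i < 0:
        return None
    for distance in (1, 2):
        j = i - distance
        if j >= 0 and term[j] in PUNCTUATION:
            return distance
    return None

print(punctuation_distance("anatoxin-b(s)"))                      # 2
print(punctuation_distance("Salmonella II 6,7:(g),m,(s),t:1,5"))  # 1
print(punctuation_distance("Accessory finger(s)"))                # None
```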

Pages 18-19:

Pattern Observation – (4)

Sample Term          Pattern   Distance
Ap(s)pCHClpp(s)A     pp        1
XLalpha(s) protein   alpha     1

• The word in front of (s) ends with: pp, alpha
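A minimal sketch of the suffix observation. The function name and the `start` parameter (used to look at a later (s) in the same term) are assumptions for illustration; the suffix list contains only the two endings named on the slide.

```python
# The two endings named on the slide.
PROTECTED_SUFFIXES = ("pp", "alpha")

def ends_with_protected_suffix(term: str, start: int = 0, marker: str = "(s)") -> bool:
    """Pattern (4): the text in front of (s) ends with "pp" or "alpha".

    `start` is the index from which to search for the marker.
    """
    i = term.find(marker, start)
    return i >= 0 and term[:i].endswith(PROTECTED_SUFFIXES)

print(ends_with_protected_suffix("XLalpha(s) protein"))         # True
print(ends_with_protected_suffix("Ap(s)pCHClpp(s)A", start=3))  # True (second (s))
print(ends_with_protected_suffix("Accessory finger(s)"))        # False
```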

Page 20:

Pattern Observation – (5)

Sample Term                      Pattern              Distance
[X]O spontn disrptn/lig(s)knee   Followed by a word   1
O spontn disrptn/lig(s)knee      Followed by a word   1

• (s) is followed by an English word
• An English word begins with a letter
• If (s) is followed by a letter, replace (s) with a space
• Exceptions: Ap(s)pCHClpp(s)A, G(s)alpha
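The space-replacement observation can be sketched as below. The function name is hypothetical, and the sketch handles only the first (s) in a term and applies no exception handling, so it also shows why the listed exceptions need a separate check.

```python
def remove_or_space(term: str, marker: str = "(s)") -> str:
    """Pattern (5): replace the first (s) with a space if a letter follows it,
    otherwise remove it. No exception handling is applied here."""
    i = term.find(marker)
    if i < 0:
        return term
    end = i + len(marker)
    following = term[end:end + 1]  # empty string when (s) ends the term
    filler = " " if following.isalpha() else ""
    return term[:i] + filler + term[end:]

print(remove_or_space("O spontn disrptn/lig(s)knee"))  # O spontn disrptn/lig knee
print(remove_or_space("Accessory finger(s)"))          # Accessory finger
print(remove_or_space("Ap(s)pCHClpp(s)A"))             # Ap pCHClpp(s)A (unwanted)
```

The last call shows one of the exceptions from the slide: without a prior check, the chemical name is split.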

Page 21:

Implementation – Wild Cards

Wild Card Definition:
• ^: start, starting mark of the term
• $: end, ending mark of the term right before (s)
• C: any character
• D: any digit, [0-9]
• L: any letter, [a-z]
• P: punctuation, [- ( ,]
• S: space, [ ]
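The character-class wild cards can be sketched as a single predicate. The function name is hypothetical; treating L as case-insensitive is an assumption, since the slide gives only [a-z]. The position marks ^ and $ are not character classes, so they are not handled here.

```python
import string

DIGITS = set(string.digits)
LETTERS = set(string.ascii_lowercase)
PUNCTUATION = set("-(,")

def matches_wild_card(symbol: str, ch: str) -> bool:
    """True if the single character `ch` satisfies the wild card `symbol`."""
    if len(ch) != 1:
        return False
    if symbol == "C":          # any character
        return True
    if symbol == "D":          # any digit
        return ch in DIGITS
    if symbol == "L":          # any letter (assumed case-insensitive)
        return ch.lower() in LETTERS
    if symbol == "P":          # punctuation: hyphen, opening parenthesis, comma
        return ch in PUNCTUATION
    if symbol == "S":          # space
        return ch == " "
    raise ValueError(f"unknown wild card: {symbol!r}")

print(matches_wild_card("D", "9"))  # True
print(matches_wild_card("P", "."))  # False
```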

Page 22:

Implementation – Rule Representations

Pattern   Sample Term                         Rule
1         (s)-(+)-citreofuran                 ^$
1         J(s)(b) ANTIBODY                    ^C$
1         EAV G(s) glycoprotein               SC$
1         su(s) protein, Drosophila           ^CC$
1         Histone H1(s)                       SCC$
2         9(s)-erythromycylamine              D$
3         Salmonella II 6,7:(g),m,(s),t:1,5   P$
3         natoxin-a(s)                        PC$
4         Ap(s)pCHClpp(s)A                    pp$
4         XLalpha(s) protein                  alpha$
...       ...                                 ...
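One way to read these rules is as suffix patterns over the text in front of (s), where `$` marks the position right before (s) and `^` anchors the start of the term. The sketch below follows that reading; the function names are hypothetical, and treating the uppercase letters C, D, L, P, S as wild cards and every other symbol as a literal is an assumption.

```python
# The ten rules shown on the slide (the slide's list continues with "...").
RULES = ["^$", "^C$", "SC$", "^CC$", "SCC$", "D$", "P$", "PC$", "pp$", "alpha$"]

def symbol_matches(symbol: str, ch: str) -> bool:
    """Match one rule symbol against one character."""
    if symbol == "C":
        return True
    if symbol == "D":
        return ch in "0123456789"
    if symbol == "L":
        return "a" <= ch <= "z"
    if symbol == "P":
        return ch in "-(,"
    if symbol == "S":
        return ch == " "
    return symbol == ch  # anything else is a literal character

def rule_matches(rule: str, prefix: str) -> bool:
    """True if `prefix` (the text in front of (s)) satisfies `rule`."""
    body = rule[:-1]                     # drop the trailing "$"
    anchored = body.startswith("^")
    if anchored:
        body = body[1:]
    if len(prefix) < len(body) or (anchored and len(prefix) != len(body)):
        return False
    tail = prefix[len(prefix) - len(body):]
    return all(symbol_matches(s, c) for s, c in zip(body, tail))

def matching_rules(prefix: str) -> list:
    """All rules satisfied by the text in front of (s)."""
    return [rule for rule in RULES if rule_matches(rule, prefix)]

print(matching_rules("natoxin-a"))         # ['PC$']
print(matching_rules("9"))                 # ['^C$', 'D$']
print(matching_rules("Accessory finger"))  # []
```

A term can satisfy more than one rule, as "9" does here; any single match is enough to mark the (s) as something other than a plural marker.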

Page 23:

Implementation – Reversed Trie Tree

Rules: ^$, ^C$, SC$, ^CC$, SCC$, D$, P$, PC$, pp$, alpha$, ...

[Figure: reversed trie tree built from the rules. The tree starts at $ and stores each rule from its last symbol back to its first, with wild card nodes (P, C, D, S, ^) and literal letter nodes (for example p, p), etc.]

Pages 24-26:

Implementation – Reversed Trie Tree

• Example: anatoxin-b(s)

[Figure: the same reversed trie tree, with the example highlighted one character at a time, moving backward from (s): first (s), then b(s), then -b(s).]
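A reversed trie for such rules might look like the sketch below. It is an illustration under assumptions, not the authors' code: the class and function names are hypothetical, the rule list is limited to the ten rules shown, and uppercase C, D, L, P, S are treated as wild cards with all other symbols as literals.

```python
RULES = ["^$", "^C$", "SC$", "^CC$", "SCC$", "D$", "P$", "PC$", "pp$", "alpha$"]

class TrieNode:
    def __init__(self):
        self.children = {}        # rule symbol -> TrieNode
        self.is_rule_end = False  # True when a complete rule ends at this node

def build_reversed_trie(rules):
    """Insert each rule from the symbol next to $ back to its first symbol."""
    root = TrieNode()  # the root stands for $, the position right before (s)
    for rule in rules:
        node = root
        for symbol in reversed(rule[:-1]):
            node = node.children.setdefault(symbol, TrieNode())
        node.is_rule_end = True
    return root

def symbol_matches(symbol, ch):
    """`ch` is a character, or None when the walk has passed the term start."""
    if symbol == "^":
        return ch is None
    if ch is None:
        return False
    if symbol == "C":
        return True
    if symbol == "D":
        return ch in "0123456789"
    if symbol == "L":
        return "a" <= ch <= "z"
    if symbol == "P":
        return ch in "-(,"
    if symbol == "S":
        return ch == " "
    return symbol == ch  # literal character

def trie_matches(root, prefix):
    """Walk backward through `prefix` (the text in front of (s))."""
    def walk(node, pos):
        if node.is_rule_end:
            return True
        ch = prefix[pos] if pos >= 0 else None
        return any(
            symbol_matches(symbol, ch) and walk(child, pos - 1)
            for symbol, child in node.children.items()
        )
    return walk(root, len(prefix) - 1)

trie = build_reversed_trie(RULES)
print(trie_matches(trie, "anatoxin-b"))        # True: b matches C, - matches P (PC$)
print(trie_matches(trie, "Accessory finger"))  # False
```

Storing the rules in reverse lets the match start at (s) and stop as soon as no branch fits, so only the last few characters in front of (s) are inspected.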

Page 27:

Implementation – Algorithm Flow

[Flowchart, written out as steps:]
1. Start
2. Find (s), (es), and (ies)
3. If it is not (s): remove (es) and (ies), then end
4. If it is (s): go through the reversed trie
5. If a pattern matches: end, leaving (s) in place
6. If no pattern matches and the following character is a letter: replace (s) with a space
7. If no pattern matches and the following character is not a letter: remove (s)
8. End
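The flow can be sketched end to end as below. This is one reading of the flowchart, not the authors' implementation: the function names are hypothetical, only the ten rules shown in the slides are included, the rules are checked with a plain list scan rather than a trie to keep the sketch short, and uppercase C, D, L, P, S are assumed to be wild cards.

```python
import re

RULES = ["^$", "^C$", "SC$", "^CC$", "SCC$", "D$", "P$", "PC$", "pp$", "alpha$"]
MARKER = re.compile(r"\((s|es|ies)\)")

def symbol_matches(symbol: str, ch: str) -> bool:
    if symbol == "C":
        return True
    if symbol == "D":
        return ch in "0123456789"
    if symbol == "L":
        return "a" <= ch <= "z"
    if symbol == "P":
        return ch in "-(,"
    if symbol == "S":
        return ch == " "
    return symbol == ch  # literal character

def rule_matches(rule: str, prefix: str) -> bool:
    body = rule[:-1]  # drop the trailing "$"
    anchored = body.startswith("^")
    if anchored:
        body = body[1:]
    if len(prefix) < len(body) or (anchored and len(prefix) != len(body)):
        return False
    tail = prefix[len(prefix) - len(body):]
    return all(symbol_matches(s, c) for s, c in zip(body, tail))

def remove_plural_markers(term: str) -> str:
    """Apply the flow to every (s), (es), and (ies) in `term`, left to right."""
    pieces = []
    last = 0
    for match in MARKER.finditer(term):
        pieces.append(term[last:match.start()])
        if match.group(1) == "s":
            prefix = term[:match.start()]
            if any(rule_matches(rule, prefix) for rule in RULES):
                pieces.append(match.group(0))   # pattern match: keep (s)
            elif term[match.end():match.end() + 1].isalpha():
                pieces.append(" ")              # letter follows: use a space
            # otherwise (s) is simply dropped
        # (es) and (ies) are always dropped
        last = match.end()
    pieces.append(term[last:])
    return "".join(pieces)

print(remove_plural_markers("Burn of wrist(s) and hand(s)"))
# Burn of wrist and hand
print(remove_plural_markers("O spontn disrptn/lig(s)knee"))
# O spontn disrptn/lig knee
print(remove_plural_markers("Ap(s)pCHClpp(s)A"))
# Ap(s)pCHClpp(s)A
```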

Page 28:

Results

• Remove (s) properly
• Remove (es) properly
• Remove (ies) properly
• Replace (s) with space properly

• A fast, precise, and expandable system

Page 29:

Future Work

• More test cases and more rule updates
• Implement this feature in both Norm and LuiNorm
• Apply to (ing), (ed), (en)

Page 30:

Thank you!

• [email protected]
• http://umlslex.nlm.nih.gov/lvg/2005