A Discriminative Alignment Model for Abbreviation Recognition
Naoaki Okazaki (1), Sophia Ananiadou (2, 3), Jun'ichi Tsujii (1, 2, 3)
(1) The University of Tokyo  (2) The University of Manchester  (3) The National Centre for Text Mining
Abbreviation recognition (AR)
To extract abbreviations and their expanded forms appearing in actual text
2008-08-22 The 22nd International Conference on Computational Linguistics (Coling 2008) 2
The PCR-RFLP technique was applied to analyze the distribution of estrogen receptor (ESR) gene, follicle-stimulating hormone beta subunit (FSH beta) gene and prolactin receptor (PRLR) gene in three lines of Jinhua pigs. (PMID: 14986424)
In the example: 'ESR', 'FSH beta', and 'PRLR' are abbreviations (short forms); 'estrogen receptor', 'follicle-stimulating hormone beta subunit', and 'prolactin receptor' are the corresponding expanded forms (long forms; full forms).
Definitions
Term variation
AR for disambiguating abbreviations
Sense inventories (abbreviation dictionaries)
Training corpora for disambiguation (context information of expanded forms)
Local definitions
AR for disambiguating abbreviations
Sense inventories (abbreviation dictionaries)
What can 'CT' stand for?
- Acromine http://www.nactem.ac.uk/software/acromine/
AR for disambiguating abbreviations
Sense inventories (abbreviation dictionaries)
Training corpora for disambiguation (context information of expanded forms)
... evaluated using preoperative computed tomography (CT) scan, ...
... by oral administration with the adjuvant cholera toxin (CT), ...
Sentences (contexts) in which CT is defined
Biopsies from bone metastatic lesions were performed under CT scan, ...
Training
Classifier
CT = computed tomography
AR for disambiguating abbreviations
Sense inventories (abbreviation dictionaries)
Training corpora for disambiguation (context information of expanded forms)
Local definitions
Mice can be sensitized to food proteins by oral administration with the adjuvant cholera toxin (CT), ... BALB/c mice were fed with CT or PBS. The impact of CT on DC subsets ...
(one-sense-per-discourse assumption)
AR for disambiguating abbreviations
Sense inventories (abbreviation dictionaries)
Training corpora for disambiguation (context information of expanded forms)
Local definitions
AR plays a key role in managing abbreviations in text
Outline of this presentation
Introduction (done)
Methodologies
- Abbreviation candidate (common to previous work and this study)
- Region for definitions (common)
- Previous work and unsolved problems
- Abbreviation alignment (this study)
- Computing features (this study)
- Maximum entropy modeling (this study)
Experiments
Conclusion
Step 0: Sample text
The task: extract an abbreviation definition from this text. We do not extract one if no abbreviation definition is found in the text.
We investigate the effect of thyroid transcription factor 1 (TTF-1) in human C cells ...
TTF-1: thyroid transcription factor 1
Step 1: Abbreviation candidates in parentheses
• Parenthetical expressions as clues for abbreviations
• Requirements for abbreviation candidates (Schwartz and Hearst, 03):
– the inner expression consists of at most two words
– the length is between two and ten characters
– the expression contains at least one alphabetic letter
– the first character is alphanumeric
• Abbreviation candidate: y = ‘TTF-1’
We investigate the effect of thyroid transcription factor 1 (TTF-1) in human C cells ...
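The four candidate requirements above translate directly into a short check; a minimal sketch (the function name `is_abbrev_candidate` is mine, not from the slides):

```python
import re

def is_abbrev_candidate(inner: str) -> bool:
    """Check the Schwartz & Hearst (2003) conditions for a
    parenthesized expression to count as an abbreviation candidate."""
    # the inner expression consists of at most two words
    if len(inner.split()) > 2:
        return False
    # the length is between two and ten characters
    if not (2 <= len(inner) <= 10):
        return False
    # the expression contains at least one alphabetic letter
    if not re.search(r"[A-Za-z]", inner):
        return False
    # the first character is alphanumeric
    if not inner[0].isalnum():
        return False
    return True

print(is_abbrev_candidate("TTF-1"))  # the slide's example candidate passes
```

Here "TTF-1" passes all four tests, so y = 'TTF-1' becomes the abbreviation candidate.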
Step 2: Region for extracting abbreviation definitions
Heuristics for the region for finding expanded forms (Schwartz and Hearst, 03): take min(m + 5, 2m) words before the abbreviation, where m is the number of alphanumeric letters in the abbreviation.
Take 8 words before the parentheses (m = 4).
The remaining task is to extract a true expanded form (if any) in this region.
We investigate the effect of thyroid transcription factor 1 (TTF-1) in human C cells ...
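The window size can be computed in one line; a small sketch (function name mine):

```python
def definition_window(abbrev: str) -> int:
    """Number of words to take before the parentheses, following the
    Schwartz & Hearst (2003) heuristic: min(m + 5, 2m), where m is
    the number of alphanumeric letters in the abbreviation."""
    m = sum(ch.isalnum() for ch in abbrev)
    return min(m + 5, 2 * m)

print(definition_window("TTF-1"))  # m = 4 -> min(9, 8) = 8
```

For 'TTF-1' the hyphen is not counted, so m = 4 and the region is 8 words, as on the slide.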
Previous studies: Finding expanded forms
Rule-based
Deterministic algorithm (Schwartz & Hearst, 03): maps all alphanumeric letters in the abbreviation to the expanded form, starting from the end of both the abbreviation and the expanded form, right to left.
We investigate the effect of thyroid transcription factor 1 (TTF-1) in human C cells ...
TTF-1: transcription factor 1
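A simplified rendering of that right-to-left matcher (a sketch in the spirit of Schwartz & Hearst, not their exact code; the function name is mine). On the slide's example it reproduces the reported failure, returning "transcription factor 1" without "thyroid":

```python
from typing import Optional

def find_best_long_form(short: str, long: str) -> Optional[str]:
    """Align the abbreviation's alphanumeric letters against the
    candidate region from right to left; the first abbreviation
    letter must start a word in the long form."""
    s = len(short) - 1
    l = len(long) - 1
    while s >= 0:
        c = short[s].lower()
        if not c.isalnum():          # skip hyphens etc. in the abbreviation
            s -= 1
            continue
        # scan leftward for a matching letter; the first abbreviation
        # letter must additionally sit at the start of a word
        while l >= 0 and (long[l].lower() != c or
                          (s == 0 and l > 0 and long[l - 1].isalnum())):
            l -= 1
        if l < 0:
            return None              # no alignment found
        s -= 1
        l -= 1
    return long[l + 1:]

region = "the effect of thyroid transcription factor 1"
print(find_best_long_form("TTF-1", region))  # -> 'transcription factor 1'
```

Because the final 'T' is anchored to the word-initial 't' of "transcription", the word "thyroid" is never reached, which is exactly the weakness the scoring approaches below try to fix.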
Previous studies: Finding expanded forms
Rule-based
Deterministic algorithm (Schwartz & Hearst, 03)
Four scoring rules for multiple candidates (Adar, 04):
+1 for every abbreviation letter taken from the head of a word
-1 for every extra word between the definition and the parentheses
+1 for definitions immediately followed by the parentheses
-1 for every extra word
transcription factor 1 (TTF-1): +4
thyroid transcription factor 1 (TTF-1): +5
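A rough sketch of how such scoring can rank candidates (my simplified reading of Adar's rules, not his implementation; the function name and exact rule rendering are mine). With the default of zero gap words, it reproduces the +5 vs. +4 ranking from the slide:

```python
def adar_score(definition: str, abbrev: str, words_to_parens: int = 0) -> int:
    """Toy scorer in the spirit of Adar (2004): reward abbreviation
    letters that can start a word of the definition, penalize extra
    words and any gap before the parentheses."""
    pool = [w[0].lower() for w in definition.split()]   # word-initial letters
    letters = [c.lower() for c in abbrev if c.isalnum()]
    score = 0
    for c in letters:          # +1 per abbreviation letter matched
        if c in pool:          # against some word head (each head used once)
            pool.remove(c)
            score += 1
    score -= max(0, len(definition.split()) - len(letters))   # -1 per extra word
    score += 1 if words_to_parens == 0 else -words_to_parens  # adjacency bonus
    return score

print(adar_score("thyroid transcription factor 1", "TTF-1"))  # 5
print(adar_score("transcription factor 1", "TTF-1"))          # 4
```

The longer candidate wins because both 't' letters of TTF-1 find distinct word heads ("thyroid", "transcription"), while the shorter candidate can match only one.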
Previous studies: Finding expanded forms
Rule-based
Deterministic algorithm (Schwartz & Hearst, 03)
Four scoring rules for multiple candidates (Adar, 04)
Detailed rules (Ao & Takagi, 05): two pages long (!) in their paper
Previous studies: Finding expanded forms
Rule-based: deterministic algorithm (Schwartz & Hearst, 03); four scoring rules for multiple candidates (Adar, 04); detailed rules (Ao & Takagi, 05)
Machine-learning based (Nadeau & Turney, 05; Chang & Schutze, 06)
– Aimed at obtaining an optimal set of rules through training
– Uses 10-20 features that roughly correspond to the rules proposed by the former work, e.g.:
  "# of abbreviation letters matching the first letter of a word"
  "# of abbreviation letters that are capitalized in the definition"
Problems in previous studies
– Hard to tweak the extraction rules by hand
  • of blood lactate accumulation (OBLA) vs.
  • onset of blood lactate accumulation (OBLA)
– Hard to handle non-definitions (negatives)
  • of postoperative AF in patients submitted to CABG without cardiopulmonary bypass (off-pump)
– Hard to recognize shuffled abbreviations
  • receptor of estrogen (ER)
– No breakthrough was reported by applying machine learning
  • Previous studies used a few features that merely reproduce the rule-based methods
This study
Predict the origins of abbreviation letters (alignment)
Discriminative training of the abbreviation alignment model
- A large number of features that directly express the events where letters in an expanded form produce or do not produce abbreviation letters
- A corpus with abbreviation alignments annotated
\hat{a} = \underset{a \in C(x, y)}{\operatorname{argmax}} \, P(a \mid x, y)

where y is the abbreviation candidate and x the surrounding expression (both given by Steps 1 and 2), a is an alignment, C(x, y) is the set of possible alignments, and P(a | x, y) is modeled by maximum entropy modeling.
Step 3: C(x, y): Alignment candidates
Mark the letters in x that are contained in the abbreviation
Assign exactly one letter of x to each letter of y
Always include the alignment that assigns no letters at all
The abbreviation letters (or letter sequences) may be reordered up to d times
Associate every alpha-numeric letter in the abbreviation with a letter
Mark positions of letters that also appear in the abbreviations
Distortion = 1(swap ‘thyroid’ and ‘transcription’)
Always include a negative alignment (in case no definition is appropriate)
Step 4: Abbreviation alignment
Letters 't' and 'f' at these points did not produce abbreviation letters
Prefer non-null mappings that are the most adjacent to 't'
Atomic feature functions
• Atomic functions for x
  – letter type: x_ctype
  – letter position: x_position
  – lower-cased letter: x_char
  – lower-cased word: x_word
  – part-of-speech code: x_pos
• Atomic functions for y
  – letter type: y_ctype
  – letter position: y_position
  – lower-cased letter: y_char
• Atomic function for a
  – a_state: SKIP, MATCH, ABBR
• Atomic functions for adjacent x
  – distance in letters: x_diff
  – distance in words: x_diff_wd
• Atomic function for adjacent y
  – distance in letters: y_diff
◦ Offset parameter δ
◦ Features are expressed by combinations of atomic functions
◦ Refer to Table 2 for the complete list of combination rules
Step 5: Computing features
Unigram features (feature/state/weight):
- y_ctype0=U/MATCH/ 0.7
- y_position0=H/MATCH/ 0.9
- y_ctype0=U;y_position0=H/MATCH/ 1.6
- y_char0=t/MATCH/ -0.1
- x_ctype0=L/y_ctype0=U/MATCH/ 0.5
- x_position0=H/y_position0=H/MATCH/ 0.4
- x_ctype0=L/y_position0=H/MATCH/ 0.3
- x_position0=H/y_ctype0=U/MATCH/ 1.1
- x_ctype0=L/MATCH/ 0.1
- x_position0=H/MATCH/ 1.9
- x_ctype0=L;x_position0=H/MATCH/ 1.2
- x_ctype1=L/MATCH/ 0.3
- x_position1=I/MATCH/ 0.2
- ...

Bigram features:
- MATCH/MATCH/ 1.4
- x_diff_wd=1/MATCH/MATCH/ 3.3
- y_diff=1/MATCH/MATCH/ 0.8
- ...
Step 6: Probabilities of alignments
(Figure: per-letter feature-weight sums along the candidate alignments.)
The score of an alignment is the sum of the feature weights on that alignment (e.g., 95.5, 83.2, and 37.5 for the three candidate alignments).
Take exponents of these values and normalize them as probabilities (0.99, 4.55e-6, and 6.47e-26, respectively).
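The exponentiate-and-normalize step is a standard softmax over alignment scores; a minimal sketch using the three scores from the slide (the max-subtraction trick is mine, added for numerical stability):

```python
import math

def alignment_probabilities(scores):
    """Turn summed feature weights into probabilities:
    p_i = exp(s_i) / sum_j exp(s_j).  Subtracting the maximum
    score first keeps the exponentials from overflowing."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

probs = alignment_probabilities([95.5, 83.2, 37.5])
print([f"{p:.3g}" for p in probs])
```

The top-scoring alignment takes essentially all of the probability mass, matching the slide's 0.99 vs. 4.55e-6 vs. 6.47e-26.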
Maximum entropy modeling
Conditional probability modeled by MaxEnt
Parameter estimation (training)
Maximize the log-likelihood of the probability distribution by applying maximum a posteriori (MAP) estimation
P(a \mid x, y) = \frac{1}{Z(x, y)} \exp\left\{ \Lambda \cdot F(a, x, y) \right\}

Z(x, y) = \sum_{a \in C(x, y)} \exp\left\{ \Lambda \cdot F(a, x, y) \right\}

where Λ · F(a, x, y) is the sum of feature weights on the alignment a, F(a, x, y) is the vector of the numbers of occurrences of features on the alignment a, Λ is the vector of feature weights, and C(x, y) is the set of possible alignments.

Regularized log-likelihood:

L_1(\Lambda) = \sum_{n=1}^{N} \log P(a^{(n)} \mid x^{(n)}, y^{(n)}) - \frac{\lVert \Lambda \rVert_1}{\sigma_1}
(L1 regularization; solved by the OW-LQN method (Andrew, 07))

L_2(\Lambda) = \sum_{n=1}^{N} \log P(a^{(n)} \mid x^{(n)}, y^{(n)}) - \frac{\lVert \Lambda \rVert_2^2}{2 \sigma_2^2}
(L2 regularization; solved by the L-BFGS method (Nocedal, 80))
Training corpus
• Corpus for training/evaluation
  – Selected 1,000 abstracts randomly from MEDLINE
  – Annotated 1,420 parenthetical expressions manually, and obtained 864 positive instances (aligned abbreviation definitions) and 556 negative instances (non-definitions)
Positive instance (aligned definition): "Measurement of hepatitis C virus (HCV) RNA may be beneficial in managing the treatment of ..."
Negative instance (non-definition): "The mean reduction of UCCA at month 48 was 5.7% for patients initially on placebo who received treatment at 24 months (RRMS) or ..."
Experiments
• Experimental settings (parameters)
  – L1 regularization or L2 regularization (σ = 3)
  – No distortion (d = 0) or distortion (d = 1)
  – Average number of alignment candidates per instance: 8.46 (d = 0) and 69.1 (d = 1)
  – Total number of features generated (d = 0): 850,009
• Baseline systems
  – Schwartz & Hearst (03), SaRAD (Adar, 04), ALICE (Ao, 05)
  – Chang & Schutze (06), Nadeau & Turney (05)
• Test data
  – Our abbreviation alignment corpus (10-fold cross validation)
  – Medstract corpus: our method is trained on our corpus, and tested on this
Performance on our corpus
• The proposed method achieved the best F1 score of all
System                                 P     R     F1
Rule-based (simple → complex):
  Schwartz & Hearst (2003)            .978  .940  .959
  SaRAD (Adar, 04)                    .891  .919  .905
  ALICE (Ao, 05)                      .961  .920  .940
Machine learning:
  Chang & Schutze (2006)              .942  .900  .921
  Nadeau & Turney (2005)              .954  .871  .910
Proposed, no distortion:
  d = 0; L1 regularization            .973  .969  .971
  d = 0; L2 regularization            .964  .968  .966
Proposed, distortion:
  d = 1; L1 regularization            .960  .981  .971
  d = 1; L2 regularization            .957  .976  .967
• The inclusion of distorted abbreviations (d = 1) gained the highest recall and F1
• Baseline systems with refined heuristics (SaRAD and ALICE) could not outperform the simplest system (S&H)
• The previous approaches with machine learning (C&S and N&T) were roughly comparable to the rule-based methods
• L1 regularization performed better than L2, probably because the number of features is far larger than that of instances
Performance on Medstract corpus
The proposed method was trained on our corpus, and applied to the Medstract corpus
- It still outperformed the baseline systems
- ALICE delivered much better results than S&H; the rules in ALICE might be tuned for this corpus
System                                 P     R     F1
Rule-based (simple → complex):
  Schwartz & Hearst (2003)            .942  .891  .916
  SaRAD (Adar, 04)                    .909  .859  .884
  ALICE (Ao, 05)                      .960  .945  .953
Machine learning:
  Chang & Schutze (2006)              .858  .852  .855
  Nadeau & Turney (2005)              .889  .875  .882
Proposed:
  d = 1; L1 regularization            .976  .945  .960
Alignment examples (1)
Shuffled abbreviations were successfully recognized
Alignment examples (2)
There are some confusing cases: the proposed method failed to choose the third alignment
Top seven features with high weights
#1: “Associate a head letter in a definition with an uppercase head letter of the abbreviation”
#2: “Produce two abbreviation letters from two consecutive letters in the definition”
#3: “Do not produce an abbreviation letter from a lowercase letter whose preceding letter is also lowercase.”
#4: “Produce two abbreviation letters from two lowercase letters in the same word”
Rank  Feature                                           Weight
1     U: x_position0=H;y_ctype0=U;y_position0=H/M       1.7370
2     B: y_position0=I/y_position0=I/x_diff=1/M-M       1.3470
3     U: x_ctype-1=L;x_ctype0=L/S                       0.96342
4     B: x_ctype0=L/x_ctype0=L/x_diff_wd=0/M-M          0.94009
5     U: x_position0=I;x_char1='t'/S                    0.91645
6     U: x_position0=H;x_pos0=NN;y_ctype0=U/M           0.86786
7     U: x_ctype-1=S;x_ctype0=L/M                       0.86474
Conclusion
• Abbreviation recognition successfully formalized as a sequential alignment problem
  – Showed remarkable improvements over previous methods
  – Obtained fine-grained features that express the events wherein an expanded form produces an abbreviation letter
• Future work
  – To handle different patterns (e.g., "aka", "abbreviated as")
  – To combine with the statistical approach (Okazaki, 06)
    • Construct a comprehensive abbreviation dictionary based on the n-best solutions and statistics of occurrences
  – To train the alignment model from a non-aligned corpus, inducing abbreviation alignments simultaneously
More and more abbreviations produced
SaRAD (Adar, 04) extracted 6,574,953 abbreviation definitions from the whole of the MEDLINE database released in 2006