A Discriminative Alignment Model for Abbreviation Recognition


Page 1: A Discriminative Alignment Model for Abbreviation Recognition


Naoaki Okazaki 1, Sophia Ananiadou 2, 3

Jun’ichi Tsujii 1, 2, 3

1 The University of Tokyo   2 The University of Manchester   3 The National Centre for Text Mining

Coling 2008 (The 22nd International Conference on Computational Linguistics), 2008-08-22

Page 2: A Discriminative Alignment Model for Abbreviation Recognition

Abbreviation recognition (AR)

To extract abbreviations and their expanded forms appearing in actual text


The PCR-RFLP technique was applied to analyze the distribution of estrogen receptor (ESR) gene, follicle-stimulating hormone beta subunit (FSH beta) gene and prolactin receptor (PRLR) gene in three lines of Jinhua pigs. (PMID: 14986424)

Page 3: A Discriminative Alignment Model for Abbreviation Recognition

Abbreviation recognition (AR)

To extract abbreviations and their expanded forms appearing in actual text


The PCR-RFLP technique was applied to analyze the distribution of estrogen receptor (ESR) gene, follicle-stimulating hormone beta subunit (FSH beta) gene and prolactin receptor (PRLR) gene in three lines of Jinhua pigs. (PMID: 14986424)

Abbreviations (short forms)

Expanded forms (long forms; full forms)

Definitions

Term variation

Page 4: A Discriminative Alignment Model for Abbreviation Recognition

AR for disambiguating abbreviations

Sense inventories (abbreviation dictionaries)

Training corpora for disambiguation (context information of expanded forms)

Local definitions


Page 5: A Discriminative Alignment Model for Abbreviation Recognition

AR for disambiguating abbreviations

Sense inventories (abbreviation dictionaries): What can ‘CT’ stand for?


- Acromine http://www.nactem.ac.uk/software/acromine/

Page 6: A Discriminative Alignment Model for Abbreviation Recognition

AR for disambiguating abbreviations

Sense inventories (abbreviation dictionaries)

Training corpora for disambiguation (context information of expanded forms)


... evaluated using preoperative computed tomography (CT) scan, ...

... by oral administration with the adjuvant cholera toxin (CT), ...

Sentences (contexts) in which CT is defined

Biopsies from bone metastatic lesions were performed under CT scan, ...

Training

Classifier

CT = computed tomography

Page 7: A Discriminative Alignment Model for Abbreviation Recognition

AR for disambiguating abbreviations

Sense inventories (abbreviation dictionaries)

Training corpora for disambiguation (context information of expanded forms)

Local definitions


Mice can be sensitized to food proteins by oral administration with the adjuvant cholera toxin (CT), ... BALB/c mice were fed with CT or PBS. The impact of CT on DC subsets ...

(one-sense-per-discourse assumption)

Page 8: A Discriminative Alignment Model for Abbreviation Recognition

AR for disambiguating abbreviations

Sense inventories (abbreviation dictionaries)

Training corpora for disambiguation (context information of expanded forms)

Local definitions

AR plays a key role in managing abbreviations in text


Page 9: A Discriminative Alignment Model for Abbreviation Recognition

Outline of this presentation


Introduction (done)

Methodologies
– Abbreviation candidate (common with previous work)
– Region for definitions (common with previous work)
– Abbreviation alignment (this study)
– Computing features (this study)
– Maximum entropy modeling (this study)

The first two steps follow previous work; the remaining steps address its unsolved problems and are the contribution of this study.

Experiments

Conclusion

Page 10: A Discriminative Alignment Model for Abbreviation Recognition

Step 0: Sample text

The task: extract an abbreviation definition from this text. We do not extract one if no abbreviation definition is found in the text.


We investigate the effect of thyroid transcription factor 1 (TTF-1) in human C cells ...

TTF-1: thyroid transcription factor 1

Page 11: A Discriminative Alignment Model for Abbreviation Recognition

Step 1: Abbreviation candidates in parentheses

• Parenthetical expressions as clues for abbreviations

• Requirements for abbreviation candidates (Schwartz and Hearst, 03); a small sketch follows the example below:

– the inner expression consists of at most two words
– the length is between two and ten characters
– the expression contains at least one alphabetic letter
– the first character is alphanumeric

• Abbreviation candidate: y = ‘TTF-1’


We investigate the effect of thyroid transcription factor 1 (TTF-1) in human C cells ...
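A minimal Python sketch of these four conditions, assuming whitespace tokenization of the inner expression (the helper name is hypothetical, not the authors' implementation):

def is_abbreviation_candidate(inner):
    """Schwartz & Hearst (2003) conditions for a parenthesized string."""
    words = inner.split()
    return (len(words) <= 2                       # at most two words
            and 2 <= len(inner) <= 10             # two to ten characters
            and any(c.isalpha() for c in inner)   # at least one alphabetic letter
            and inner[0].isalnum())               # first character is alphanumeric

print(is_abbreviation_candidate('TTF-1'))   # True:  becomes the candidate y
print(is_abbreviation_candidate('2008'))    # False: no alphabetic letter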

Page 12: A Discriminative Alignment Model for Abbreviation Recognition

Step 2: Region for extracting abbreviation definitions

Heuristic for the region in which expanded forms are searched (Schwartz and Hearst, 03): min(m + 5, 2m) words before the abbreviation, where m is the number of alphanumeric letters in the abbreviation.

Take 8 words before the parentheses (m = 4). The remaining task is to extract a true expanded form (if any) in this region; a small sketch follows the example below.


We investigate the effect of thyroid transcription factor 1 (TTF-1) in human C cells ...
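A Python sketch of this window heuristic (the function name and the word-level tokenization are assumptions for illustration):

def definition_search_window(abbreviation, words_before_parenthesis):
    """Take min(m + 5, 2m) words before the parentheses, where m is the
    number of alphanumeric letters in the abbreviation."""
    m = sum(c.isalnum() for c in abbreviation)
    return words_before_parenthesis[-min(m + 5, 2 * m):]

context = "We investigate the effect of thyroid transcription factor 1"
print(definition_search_window("TTF-1", context.split()))
# m = 4, so the last 8 words before '(TTF-1)' form the search region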

Page 13: A Discriminative Alignment Model for Abbreviation Recognition

Previous studies: Finding expanded forms

Rule-based: deterministic algorithm (Schwartz & Hearst, 03)

Maps every alphanumeric letter of the abbreviation to a letter in the expanded form, scanning both the abbreviation and the candidate expanded form from their ends, right to left.


We investigate the effect of thyroid transcription factor 1 (TTF-1) in human C cells ...

TTF-1: transcription factor 1
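A Python sketch of this right-to-left matching, adapted from the published Schwartz & Hearst pseudocode (a simplification, not their distributed implementation); it reproduces the result shown above, which misses ‘thyroid’ because the first abbreviation letter is anchored to the word ‘transcription’:

def find_best_long_form(short_form, long_form):
    """Match the alphanumeric letters of the short form against the long
    form, scanning both from the end; the first letter of the short form
    must begin a word of the long form."""
    s, l = len(short_form) - 1, len(long_form) - 1
    while s >= 0:
        c = short_form[s].lower()
        if not c.isalnum():
            s -= 1
            continue
        # move left until the current letter matches (and, for the first
        # short-form letter, also begins a word)
        while l >= 0 and (long_form[l].lower() != c or
                          (s == 0 and l > 0 and long_form[l - 1].isalnum())):
            l -= 1
        if l < 0:
            return None            # no expanded form found
        s -= 1
        l -= 1
    start = long_form.rfind(' ', 0, l + 1) + 1
    return long_form[start:]

print(find_best_long_form('TTF-1',
      'We investigate the effect of thyroid transcription factor 1'))
# -> 'transcription factor 1'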

Page 14: A Discriminative Alignment Model for Abbreviation Recognition

Previous studies: Finding expanded forms

Rule-based: deterministic algorithm (Schwartz & Hearst, 03)

Four scoring rules for multiple candidates (Adar, 04), sketched in code below:
– +1 for every abbreviation letter matched at the head of a word
– -1 for every extra word between the definition and the parentheses
– +1 for a definition immediately followed by the parentheses
– -1 for every extra word


transcription factor 1 (TTF-1): +4
thyroid transcription factor 1 (TTF-1): +5
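A rough Python sketch of the four rules as summarized on this slide (the real SaRAD rules are more detailed; the treatment of "extra words" here is one possible reading):

def adar_score(candidate_words, abbreviation, words_to_parenthesis=0):
    letters = [c for c in abbreviation.lower() if c.isalnum()]
    heads = [w[0].lower() for w in candidate_words]
    score = 0
    for c in letters:                 # +1 per letter matching a word head
        if c in heads:
            score += 1
            heads.remove(c)
    score -= words_to_parenthesis     # -1 per word before the parentheses
    if words_to_parenthesis == 0:     # +1 if immediately followed by '('
        score += 1
    score -= max(0, len(candidate_words) - len(letters))   # -1 per extra word
    return score

print(adar_score('transcription factor 1'.split(), 'TTF-1'))          # 4
print(adar_score('thyroid transcription factor 1'.split(), 'TTF-1'))  # 5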

Page 15: A Discriminative Alignment Model for Abbreviation Recognition

Previous studies: Finding expanded forms

Rule-based: deterministic algorithm (Schwartz & Hearst, 03); four scoring rules for multiple candidates (Adar, 04)

Detailed rules (Ao & Takagi, 05): two pages long (!) in their paper


Page 16: A Discriminative Alignment Model for Abbreviation Recognition

Previous studies: Finding expanded forms

Rule-based: deterministic algorithm (Schwartz & Hearst, 03); four scoring rules for multiple candidates (Adar, 04); detailed rules (Ao & Takagi, 05)

Machine-learning based (Nadeau & Turney, 05; Chang & Schutze, 06)
– Aimed at obtaining an optimal set of rules through training
– Uses 10-20 features that roughly correspond to the rules proposed by the earlier work, e.g., “# of abbreviation letters matching the first letter of a word” and “# of abbreviation letters that are capitalized in the definition”


Page 17: A Discriminative Alignment Model for Abbreviation Recognition

Problems in previous studies

– Difficult to tweak the extraction rules by hand
• of blood lactate accumulation (OBLA) or
• onset of blood lactate accumulation (OBLA)

– Difficult to handle non-definitions (negatives)
• of postoperative AF in patients submitted to CABG without cardiopulmonary bypass (off-pump)

– Difficult to recognize shuffled abbreviations
• receptor of estrogen (ER)

– No breakthrough was reported from applying machine learning
• Previous studies used only a few features, which were reproductions of the rule-based methods


Page 18: A Discriminative Alignment Model for Abbreviation Recognition

This study

Predict the origins of abbreviation letters (alignment)

Discriminative training of the abbreviation alignment model
– A large number of features that directly express the events in which letters in an expanded form produce or do not produce abbreviation letters
– A corpus annotated with abbreviation alignments


\hat{a} = \operatorname*{argmax}_{a \in C(x, y)} P(a \mid x, y)

where y is the abbreviation candidate and x the surrounding expression (given by steps 1 and 2), a is an alignment, C(x, y) is the set of possible alignments, and P(a | x, y) is modeled by maximum entropy modeling.

Page 19: A Discriminative Alignment Model for Abbreviation Recognition

Step 3: C(x, y): Alignment candidates

Mark the letters in x that also appear in the abbreviation. Assign exactly one letter of x to each letter of y. Always include the alignment that assigns no letters at all. Letters (or letter sequences) of the abbreviation may be reordered at most d times. (A small sketch of candidate generation follows below.)


Associate every alphanumeric letter in the abbreviation with a letter in x

Mark the positions of letters that also appear in the abbreviation

Distortion = 1 (swap ‘thyroid’ and ‘transcription’)

Always include a negative alignment (in case no definition is appropriate)
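A simplified Python sketch of generating C(x, y) with no distortion (d = 0): each alphanumeric abbreviation letter is mapped, in left-to-right order, to one matching letter of x, and the empty (negative) alignment is always included. This is an illustration only, not the authors' exact candidate generator:

def alignment_candidates(x, y):
    letters = [c.lower() for c in y if c.isalnum()]

    def extend(start, i):
        # choose a position in x for the i-th abbreviation letter,
        # to the right of the previous choice (monotone, d = 0)
        if i == len(letters):
            yield []
            return
        for pos in range(start, len(x)):
            if x[pos].lower() == letters[i]:
                for rest in extend(pos + 1, i + 1):
                    yield [pos] + rest

    return [None] + list(extend(0, 0))   # None = the negative alignment

x = "effect of thyroid transcription factor 1"
for a in alignment_candidates(x, "TTF-1"):
    print(a)   # None, then one list of character positions per candidate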

Page 20: A Discriminative Alignment Model for Abbreviation Recognition

Step 4: Abbreviation alignment


The letters ‘t’ and ‘f’ at these points did not produce abbreviation letters

Prefer non-null mappings that are the most adjacent to ‘t’

Page 21: A Discriminative Alignment Model for Abbreviation Recognition

Atomic feature functions

• Atomic functions for x
– letter type: x_ctype
– letter position: x_position
– lower-cased letter: x_char
– lower-cased word: x_word
– part-of-speech code: x_pos

• Atomic functions for y
– letter type: y_ctype
– letter position: y_position
– lower-cased letter: y_char

• Atomic function for a
– a_state: SKIP, MATCH, ABBR

• Atomic functions for adjacent x
– distance in letters: x_diff
– distance in words: x_diff_wd

• Atomic function for adjacent y
– distance in letters: y_diff


◦ Offset parameter δ
◦ Features are expressed by combinations of atomic functions (a small sketch of building such feature strings follows below)
◦ Refer to Table 2 for the complete list of combination rules
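A Python sketch of how a unigram feature string could be assembled from atomic function values and the alignment state; the exact naming and combination rules are those of Table 2 in the paper, so this is illustrative only:

def char_type(c):
    # letter-type atomic function: U = upper, L = lower, D = digit, S = symbol
    if c.isupper(): return 'U'
    if c.islower(): return 'L'
    if c.isdigit(): return 'D'
    return 'S'

def unigram_feature(atoms, state, offset=0):
    # atoms: list of (atomic function name, value) pairs observed at an offset
    body = ';'.join('%s%d=%s' % (name, offset, value) for name, value in atoms)
    return '%s/%s' % (body, state)

# the letter 't' of 'thyroid' aligned to the 'T' of 'TTF-1'
print(unigram_feature([('x_ctype', char_type('t')),
                       ('y_ctype', char_type('T'))], 'MATCH'))
# -> 'x_ctype0=L;y_ctype0=U/MATCH'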

Page 22: A Discriminative Alignment Model for Abbreviation Recognition

Step 5: Computing features


Unigram features (examples, with weights):
x_ctype0=L/MATCH/ 0.1
x_position0=H/MATCH/ 1.9
x_ctype0=L;x_position0=H/MATCH/ 1.2
x_ctype1=L/MATCH/ 0.3
x_position1=I/MATCH/ 0.2
y_ctype0=U/MATCH/ 0.7
y_position0=H/MATCH/ 0.9
y_ctype0=U;y_position0=H/MATCH/ 1.6
y_char0=t/MATCH/ -0.1
x_ctype0=L/y_ctype0=U/MATCH/ 0.5
x_position0=H/y_position0=H/MATCH/ 0.4
x_ctype0=L/y_position0=H/MATCH/ 0.3
x_position0=H/y_ctype0=U/MATCH/ 1.1
…

Bigram features (examples, with weights):
MATCH/MATCH/ 1.4
x_diff_wd=1/MATCH/MATCH/ 3.3
y_diff=1/MATCH/MATCH/ 0.8
…

Sums of feature weights at the two alignment points shown on the slide: 9.1 and 5.5.

Page 23: A Discriminative Alignment Model for Abbreviation Recognition

Step 6: Probabilities of alignments


(The slide shows, for each candidate alignment, the feature weights accumulated at every alignment point.)

The sum of feature weights on each alignment: 95.5, 83.2, and 37.5 for the three candidates.

Take exponentials of these values and normalize them into probabilities: 0.99, 4.55e-6, and 6.47e-26.
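A small Python sketch of that normalization (shifting by the maximum before exponentiating keeps the computation stable; the numbers are those from the slide):

import math

def alignment_probabilities(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

print(alignment_probabilities([95.5, 83.2, 37.5]))
# -> roughly [1.0, 4.55e-06, 6.47e-26], as shown on the slide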

Page 24: A Discriminative Alignment Model for Abbreviation Recognition

Maximum entropy modeling

Conditional probability modeled by MaxEnt

Parameter estimation (training): maximize the log-likelihood of the probability distribution by applying maximum a posteriori (MAP) estimation


P(a \mid x, y) = \frac{1}{Z(x, y)} \exp\bigl(\Lambda \cdot F(a, x, y)\bigr), \qquad Z(x, y) = \sum_{a' \in C(x, y)} \exp\bigl(\Lambda \cdot F(a', x, y)\bigr)

F(a, x, y): vector of the numbers of occurrences of features on the alignment a
\Lambda: vector of feature weights
\Lambda \cdot F(a, x, y): sum of feature weights on the alignment a
C(x, y): possible alignments

Log-likelihood (two regularization variants):

\mathcal{L}_1 = \sum_{n=1}^{N} \log P(a^{(n)} \mid x^{(n)}, y^{(n)}) - \frac{\lVert\Lambda\rVert_1}{\sigma_1}
(L1 regularization; solved by the OW-LQN method (Andrew, 07))

\mathcal{L}_2 = \sum_{n=1}^{N} \log P(a^{(n)} \mid x^{(n)}, y^{(n)}) - \frac{\lVert\Lambda\rVert_2^2}{2\sigma_2^2}
(L2 regularization; solved by the L-BFGS method (Nocedal, 80))
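A Python sketch of evaluating the L2-regularized objective for a toy data layout (each instance is a list of feature-count dictionaries, one per candidate alignment, plus the index of the annotated alignment); the actual training uses OW-LQN / L-BFGS on the gradient, which is omitted here:

import math

def log_likelihood_l2(weights, instances, sigma):
    ll = 0.0
    for candidates, gold in instances:
        # score of each candidate alignment = sum of its feature weights
        scores = [sum(weights.get(f, 0.0) * n for f, n in cand.items())
                  for cand in candidates]
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        ll += scores[gold] - log_z          # log P(gold alignment | x, y)
    ll -= sum(w * w for w in weights.values()) / (2.0 * sigma ** 2)
    return ll

toy = [([{'x_ctype0=L/MATCH': 2}, {'x_ctype0=L/SKIP': 2}], 0)]
print(log_likelihood_l2({'x_ctype0=L/MATCH': 1.5}, toy, sigma=3.0))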

Page 25: A Discriminative Alignment Model for Abbreviation Recognition

Training corpus

• Corpus for training/evaluation
– Selected 1,000 abstracts randomly from MEDLINE
– Annotated 1,420 parenthetical expressions manually, and obtained 864 positive instances (aligned abbreviation definitions) and 556 negative instances (non-definitions)


Positive instance: Measurement of hepatitis C virus (HCV) RNA may be beneficial in managing the treatment of … (HCV aligned to hepatitis C virus)

Negative instance: The mean reduction of UCCA at month 48 was 5.7% for patients initially on placebo who received treatment at 24 months (RRMS) or … (no expanded form for RRMS)

Page 26: A Discriminative Alignment Model for Abbreviation Recognition

Experiments

• Experimental settings (parameters)
– L1 regularization or L2 regularization (σ = 3)
– No distortion (d = 0) or distortion (d = 1)
– Average number of alignment candidates per instance: 8.46 (d = 0) and 69.1 (d = 1)
– Total number of features generated (d = 0): 850,009

• Baseline systems
– Schwartz & Hearst (03), SaRAD (Adar, 04), ALICE (Ao, 05)
– Chang & Schutze (06), Nadeau & Turney (05)

• Test data
– Our abbreviation alignment corpus (10-fold cross validation)
– Medstract corpus: our method is trained on our corpus and tested on Medstract


Page 27: A Discriminative Alignment Model for Abbreviation Recognition

Performance on our corpus

• The proposed method achieved the best F1 score of all


System P R F1
Schwartz & Hearst (2003) .978 .940 .959
SaRAD (Adar, 04) .891 .919 .905
ALICE (Ao, 05) .961 .920 .940
Chang & Schutze (2006) .942 .900 .921
Nadeau & Turney (2005) .954 .871 .910
Proposed (d = 0; L1 regularization) .973 .969 .971
Proposed (d = 0; L2 regularization) .964 .968 .966
Proposed (d = 1; L1 regularization) .960 .981 .971
Proposed (d = 1; L2 regularization) .957 .976 .967

(The first three systems are rule-based, from simple to complex; the next two are machine-learning based; the last four are the proposed method, without distortion (d = 0) and with distortion (d = 1).)

• Including distorted abbreviations (d = 1) gave the highest recall and F1

• The baseline systems with refined heuristics (SaRAD and ALICE) could not outperform the simplest system (S&H)

• The previous machine-learning approaches (C&S and N&T) were roughly comparable to the rule-based methods

• L1 regularization performed better than L2, probably because the number of features is far larger than the number of instances

Page 28: A Discriminative Alignment Model for Abbreviation Recognition

Performance on Medstract corpus

The proposed method was trained on our corpus, and applied to the Medstract corpus

Still outperformed the baseline systems

ALICE delivered much better results than S&H; the rules in ALICE might be tuned for this corpus.


System P R F1
Schwartz & Hearst (2003) .942 .891 .916
SaRAD (Adar, 04) .909 .859 .884
ALICE (Ao, 05) .960 .945 .953
Chang & Schutze (2006) .858 .852 .855
Nadeau & Turney (2005) .889 .875 .882
Proposed (d = 1; L1 regularization) .976 .945 .960

(The first three systems are rule-based, from simple to complex; the next two are machine-learning based; the last row is the proposed method.)

Page 29: A Discriminative Alignment Model for Abbreviation Recognition

Alignment examples (1)

Shuffled abbreviations were successfully recognized


Page 30: A Discriminative Alignment Model for Abbreviation Recognition

Alignment examples (2)

There are some confusing cases: the proposed method failed to choose the third alignment.


Page 31: A Discriminative Alignment Model for Abbreviation Recognition

Top seven features with high weights

#1: “Associate a head letter in a definition with an uppercase head letter of the abbreviation”

#2: “Produce two abbreviation letters from two consecutive letters in the definition”

#3: “Do not produce an abbreviation letter from a lowercase letter whose preceding letter is also lowercase.”

#4: “Produce two abbreviation letters from two lowercase letters in the same word”


Rank Feature Weight
1 U: x_position0=H;y_ctype0=U;y_position0=H/M 1.7370
2 B: y_position0=I/y_position0=I/x_diff=1/M-M 1.3470
3 U: x_ctype-1=L;x_ctype0=L/S 0.96342
4 B: x_ctype0=L/x_ctype0=L/x_diff_wd=0/M-M 0.94009
5 U: x_position0=I;x_char1=‘t’/S 0.91645
6 U: x_position0=H;x_pos0=NN;y_ctype0=U/M 0.86786
7 U: x_ctype-1=S;x_ctype0=L/M 0.86474

Page 32: A Discriminative Alignment Model for Abbreviation Recognition

Conclusion

• Abbreviation recognition was successfully formalized as a sequential alignment problem
– Showed remarkable improvements over previous methods
– Obtained fine-grained features that express the events wherein an expanded form produces an abbreviation letter

• Future work
– Handle different patterns (e.g., ”aka”, “abbreviated as”)
– Combine with the statistical approach (Okazaki, 06): construct a comprehensive abbreviation dictionary based on the n-best solutions and statistics of occurrences
– Train the alignment model from a non-aligned corpus, inducing abbreviation alignments simultaneously


Page 33: A Discriminative Alignment Model for Abbreviation Recognition

More and more abbreviations produced


SaRAD (Adar, 04) extracted 6,574,953 abbreviation definitions from the whole of the MEDLINE database released in 2006