A Discriminative Alignment Model for Abbreviation Recognition
Naoaki Okazaki (1), Sophia Ananiadou (2, 3), Jun'ichi Tsujii (1, 2, 3)
(1) The University of Tokyo  (2) The University of Manchester  (3) The National Centre for Text Mining
Abbreviation recognition (AR)
To extract abbreviations and their expanded forms appearing in actual text
2008-08-22 The 22nd International Conference on Computational Linguistics (Coling 2008) 2
The PCR-RFLP technique was applied to analyze the distribution of estrogen receptor (ESR) gene, follicle-stimulating hormone beta subunit (FSH beta) gene and prolactin receptor (PRLR) gene in three lines of Jinhua pigs. (PMID: 14986424)
In the example: 'ESR', 'FSH beta', and 'PRLR' are abbreviations (short forms); 'estrogen receptor', 'follicle-stimulating hormone beta subunit', and 'prolactin receptor' are the corresponding expanded forms (long forms; full forms).
Definitions
Term variation
AR for disambiguating abbreviations
Sense inventories (abbreviation dictionaries)
Training corpora for disambiguation (context information of expanded forms)
Local definitions
AR for disambiguating abbreviations
Sense inventories (abbreviation dictionaries)
What can 'CT' stand for?
- Acromine http://www.nactem.ac.uk/software/acromine/
AR for disambiguating abbreviations
Sense inventories (abbreviation dictionaries)
Training corpora for disambiguation (context information of expanded forms)
... evaluated using preoperative computed tomography (CT) scan, ...
... by oral administration with the adjuvant cholera toxin (CT), ...
Sentences (contexts) in which CT is defined
Biopsies from bone metastatic lesions were performed under CT scan, ...
Training
Classifier
CT = computed tomography
AR for disambiguating abbreviations
Sense inventories (abbreviation dictionaries)
Training corpora for disambiguation (context information of expanded forms)
Local definitions
Mice can be sensitized to food proteins by oral administration with the adjuvant cholera toxin (CT), ... BALB/c mice were fed with CT or PBS. The impact of CT on DC subsets ...
(one-sense-per-discourse assumption)
AR for disambiguating abbreviations
Sense inventories (abbreviation dictionaries)
Training corpora for disambiguation (context information of expanded forms)
Local definitions
AR plays a key role in managing abbreviations in text
Outline of this presentation
Introduction (done)
Methodologies
- Abbreviation candidate (common to previous work and this study)
- Region for definitions (common)
- Previous work and unsolved problems
- Abbreviation alignment (this study)
- Computing features (this study)
- Maximum entropy modeling (this study)
Experiments
Conclusion
Step 0: Sample text
The task: extract an abbreviation definition from this text. We do not extract one if no abbreviation definition is found in the text.
We investigate the effect of thyroid transcription factor 1 (TTF-1) in human C cells ...
TTF-1: thyroid transcription factor 1
Step 1: Abbreviation candidates in parentheses
• Parenthetical expressions as clues for abbreviations
• Requirements for abbreviation candidates (Schwartz and Hearst, 03):
– the inner expression consists of at most two words
– the length is between two and ten characters
– the expression contains at least one alphabetic letter
– the first character is alphanumeric
• Abbreviation candidate: y = ‘TTF-1’
We investigate the effect of thyroid transcription factor 1 (TTF-1) in human C cells ...
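The four candidate requirements above translate directly into a short check; a minimal sketch (the function name `is_abbrev_candidate` is mine, not from the slides):

```python
import re

def is_abbrev_candidate(inner: str) -> bool:
    """Check the Schwartz & Hearst (2003) conditions for a
    parenthesized expression to count as an abbreviation candidate."""
    # the inner expression consists of at most two words
    if len(inner.split()) > 2:
        return False
    # the length is between two and ten characters
    if not (2 <= len(inner) <= 10):
        return False
    # the expression contains at least one alphabetic letter
    if not re.search(r"[A-Za-z]", inner):
        return False
    # the first character is alphanumeric
    if not inner[0].isalnum():
        return False
    return True

print(is_abbrev_candidate("TTF-1"))  # the slide's example candidate passes
```

Here "TTF-1" passes all four tests, so y = 'TTF-1' becomes the abbreviation candidate.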
Step 2: Region for extracting abbreviation definitions
Heuristics for the region for finding expanded forms (Schwartz and Hearst, 03): take min(m + 5, 2m) words before the abbreviation, where m is the number of alphanumeric letters in the abbreviation.
Take 8 words before the parentheses (m = 4).
The remaining task is to extract a true expanded form (if any) in this region.
We investigate the effect of thyroid transcription factor 1 (TTF-1) in human C cells ...
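The window size can be computed in one line; a small sketch (function name mine):

```python
def definition_window(abbrev: str) -> int:
    """Number of words to take before the parentheses, following the
    Schwartz & Hearst (2003) heuristic: min(m + 5, 2m), where m is
    the number of alphanumeric letters in the abbreviation."""
    m = sum(ch.isalnum() for ch in abbrev)
    return min(m + 5, 2 * m)

print(definition_window("TTF-1"))  # m = 4 -> min(9, 8) = 8
```

For 'TTF-1' the hyphen is not counted, so m = 4 and the region is 8 words, as on the slide.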
Previous studies: Finding expanded forms
Rule-based
Deterministic algorithm (Schwartz & Hearst, 03): maps all alphanumeric letters in the abbreviation to the expanded form, starting from the end of both the abbreviation and the expanded form, right to left.
We investigate the effect of thyroid transcription factor 1 (TTF-1) in human C cells ...
TTF-1: transcription factor 1
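A simplified rendering of that right-to-left matcher (a sketch in the spirit of Schwartz & Hearst, not their exact code; the function name is mine). On the slide's example it reproduces the reported failure, returning "transcription factor 1" without "thyroid":

```python
from typing import Optional

def find_best_long_form(short: str, long: str) -> Optional[str]:
    """Align the abbreviation's alphanumeric letters against the
    candidate region from right to left; the first abbreviation
    letter must start a word in the long form."""
    s = len(short) - 1
    l = len(long) - 1
    while s >= 0:
        c = short[s].lower()
        if not c.isalnum():          # skip hyphens etc. in the abbreviation
            s -= 1
            continue
        # scan leftward for a matching letter; the first abbreviation
        # letter must additionally sit at the start of a word
        while l >= 0 and (long[l].lower() != c or
                          (s == 0 and l > 0 and long[l - 1].isalnum())):
            l -= 1
        if l < 0:
            return None              # no alignment found
        s -= 1
        l -= 1
    return long[l + 1:]

region = "the effect of thyroid transcription factor 1"
print(find_best_long_form("TTF-1", region))  # -> 'transcription factor 1'
```

Because the final 'T' is anchored to the word-initial 't' of "transcription", the word "thyroid" is never reached, which is exactly the weakness the scoring approaches below try to fix.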
Previous studies: Finding expanded forms
Rule-based
Deterministic algorithm (Schwartz & Hearst, 03)
Four scoring rules for multiple candidates (Adar, 04):
+1 for every abbreviation letter taken from the head of a word
-1 for every extra word between the definition and the parentheses
+1 for definitions immediately followed by the parentheses
-1 for every extra word
transcription factor 1 (TTF-1): +4
thyroid transcription factor 1 (TTF-1): +5
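A rough sketch of how such scoring can rank candidates (my simplified reading of Adar's rules, not his implementation; the function name and exact rule rendering are mine). With the default of zero gap words, it reproduces the +5 vs. +4 ranking from the slide:

```python
def adar_score(definition: str, abbrev: str, words_to_parens: int = 0) -> int:
    """Toy scorer in the spirit of Adar (2004): reward abbreviation
    letters that can start a word of the definition, penalize extra
    words and any gap before the parentheses."""
    pool = [w[0].lower() for w in definition.split()]   # word-initial letters
    letters = [c.lower() for c in abbrev if c.isalnum()]
    score = 0
    for c in letters:          # +1 per abbreviation letter matched
        if c in pool:          # against some word head (each head used once)
            pool.remove(c)
            score += 1
    score -= max(0, len(definition.split()) - len(letters))   # -1 per extra word
    score += 1 if words_to_parens == 0 else -words_to_parens  # adjacency bonus
    return score

print(adar_score("thyroid transcription factor 1", "TTF-1"))  # 5
print(adar_score("transcription factor 1", "TTF-1"))          # 4
```

The longer candidate wins because both 't' letters of TTF-1 find distinct word heads ("thyroid", "transcription"), while the shorter candidate can match only one.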
Previous studies: Finding expanded forms
Rule-based
Deterministic algorithm (Schwartz & Hearst, 03)
Four scoring rules for multiple candidates (Adar, 04)
Detailed rules (Ao & Takagi, 05): two pages long (!) in their paper
Previous studies: Finding expanded forms
Rule-based: deterministic algorithm (Schwartz & Hearst, 03); four scoring rules for multiple candidates (Adar, 04); detailed rules (Ao & Takagi, 05)
Machine-learning based (Nadeau & Turney, 05; Chang & Schutze, 06)
– Aimed at obtaining an optimal set of rules through training
– Uses 10-20 features that roughly correspond to the rules proposed by the former work, e.g.:
  "# of abbreviation letters matching the first letter of a word"
  "# of abbreviation letters that are capitalized in the definition"
Problems in previous studies
– Hard to tweak the extraction rules by hand
  • of blood lactate accumulation (OBLA) vs.
  • onset of blood lactate accumulation (OBLA)
– Hard to handle non-definitions (negatives)
  • of postoperative AF in patients submitted to CABG without cardiopulmonary bypass (off-pump)
– Hard to recognize shuffled abbreviations
  • receptor of estrogen (ER)
– No breakthrough was reported by applying machine learning
  • Previous studies used a few features that merely reproduce the rule-based methods
This study
Predict the origins of abbreviation letters (alignment)
Discriminative training of the abbreviation alignment model
- A large number of features that directly express the events where letters in an expanded form produce or do not produce abbreviation letters
- A corpus with abbreviation alignments annotated
\hat{a} = \underset{a \in C(x, y)}{\operatorname{argmax}} \, P(a \mid x, y)

where y is the abbreviation candidate and x the surrounding expression (both given by Steps 1 and 2), a is an alignment, C(x, y) is the set of possible alignments, and P(a | x, y) is modeled by maximum entropy modeling.
Step 3: C(x, y): Alignment candidates
Mark the letters in x that are contained in the abbreviation
Assign exactly one letter of x to each letter of y
Always include the alignment that assigns no letters at all
The abbreviation letters (or letter sequences) may be reordered up to d times
Associate every alpha-numeric letter in the abbreviation with a letter
Mark positions of letters that also appear in the abbreviations
Distortion = 1(swap ‘thyroid’ and ‘transcription’)
Always include a negative alignment (in case no definition is appropriate)
Step 4: Abbreviation alignment
Letters 't' and 'f' at these points did not produce abbreviation letters
Prefer non-null mappings that are the most adjacent to 't'
Atomic feature functions
• Atomic functions for x
  – letter type: x_ctype
  – letter position: x_position
  – lower-cased letter: x_char
  – lower-cased word: x_word
  – part-of-speech code: x_pos
• Atomic functions for y
  – letter type: y_ctype
  – letter position: y_position
  – lower-cased letter: y_char
• Atomic function for a
  – a_state: SKIP, MATCH, ABBR
• Atomic functions for adjacent x
  – distance in letters: x_diff
  – distance in words: x_diff_wd
• Atomic function for adjacent y
  – distance in letters: y_diff
◦ Offset parameter δ
◦ Features are expressed by combinations of atomic functions
◦ Refer to Table 2 for the complete list of combination rules
Step 5: Computing features
Unigram features (feature/state/weight):
- y_ctype0=U/MATCH/ 0.7
- y_position0=H/MATCH/ 0.9
- y_ctype0=U;y_position0=H/MATCH/ 1.6
- y_char0=t/MATCH/ -0.1
- x_ctype0=L/y_ctype0=U/MATCH/ 0.5
- x_position0=H/y_position0=H/MATCH/ 0.4
- x_ctype0=L/y_position0=H/MATCH/ 0.3
- x_position0=H/y_ctype0=U/MATCH/ 1.1
- x_ctype0=L/MATCH/ 0.1
- x_position0=H/MATCH/ 1.9
- x_ctype0=L;x_position0=H/MATCH/ 1.2
- x_ctype1=L/MATCH/ 0.3
- x_position1=I/MATCH/ 0.2
- ...

Bigram features:
- MATCH/MATCH/ 1.4
- x_diff_wd=1/MATCH/MATCH/ 3.3
- y_diff=1/MATCH/MATCH/ 0.8
- ...
Step 6: Probabilities of alignments
(Figure: per-letter feature-weight sums along the candidate alignments.)
The score of an alignment is the sum of the feature weights on that alignment (e.g., 95.5, 83.2, and 37.5 for the three candidate alignments).
Take exponents of these values and normalize them as probabilities (0.99, 4.55e-6, and 6.47e-26, respectively).
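The exponentiate-and-normalize step is a standard softmax over alignment scores; a minimal sketch using the three scores from the slide (the max-subtraction trick is mine, added for numerical stability):

```python
import math

def alignment_probabilities(scores):
    """Turn summed feature weights into probabilities:
    p_i = exp(s_i) / sum_j exp(s_j).  Subtracting the maximum
    score first keeps the exponentials from overflowing."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

probs = alignment_probabilities([95.5, 83.2, 37.5])
print([f"{p:.3g}" for p in probs])
```

The top-scoring alignment takes essentially all of the probability mass, matching the slide's 0.99 vs. 4.55e-6 vs. 6.47e-26.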
Maximum entropy modeling
Conditional probability modeled by MaxEnt
Parameter estimation (training)
Maximize the log-likelihood of the probability distribution by applying maximum a posteriori (MAP) estimation
P(a \mid x, y) = \frac{1}{Z(x, y)} \exp\left\{ \Lambda \cdot F(a, x, y) \right\}

Z(x, y) = \sum_{a \in C(x, y)} \exp\left\{ \Lambda \cdot F(a, x, y) \right\}

where Λ · F(a, x, y) is the sum of feature weights on the alignment a, F(a, x, y) is the vector of the numbers of occurrences of features on the alignment a, Λ is the vector of feature weights, and C(x, y) is the set of possible alignments.

Regularized log-likelihood:

L_1(\Lambda) = \sum_{n=1}^{N} \log P(a^{(n)} \mid x^{(n)}, y^{(n)}) - \frac{\lVert \Lambda \rVert_1}{\sigma_1}
(L1 regularization; solved by the OW-LQN method (Andrew, 07))

L_2(\Lambda) = \sum_{n=1}^{N} \log P(a^{(n)} \mid x^{(n)}, y^{(n)}) - \frac{\lVert \Lambda \rVert_2^2}{2 \sigma_2^2}
(L2 regularization; solved by the L-BFGS method (Nocedal, 80))
Training corpus
• Corpus for training/evaluation
  – Selected 1,000 abstracts randomly from MEDLINE
  – Annotated 1,420 parenthetical expressions manually, and obtained 864 positive instances (aligned abbreviation definitions) and 556 negative instances (non-definitions)
Positive instance (aligned definition): "Measurement of hepatitis C virus (HCV) RNA may be beneficial in managing the treatment of ..."
Negative instance (non-definition): "The mean reduction of UCCA at month 48 was 5.7% for patients initially on placebo who received treatment at 24 months (RRMS) or ..."
Experiments
• Experimental settings (parameters)
  – L1 regularization or L2 regularization (σ = 3)
  – No distortion (d = 0) or distortion (d = 1)
  – Average number of alignment candidates per instance: 8.46 (d = 0) and 69.1 (d = 1)
  – Total number of features generated (d = 0): 850,009
• Baseline systems
  – Schwartz & Hearst (03), SaRAD (Adar, 04), ALICE (Ao, 05)
  – Chang & Schutze (06), Nadeau & Turney (05)
• Test data
  – Our abbreviation alignment corpus (10-fold cross validation)
  – Medstract corpus: our method is trained on our corpus, and tested on this
Performance on our corpus
• The proposed method achieved the best F1 score of all
System                                 P     R     F1
Rule-based (simple → complex):
  Schwartz & Hearst (2003)            .978  .940  .959
  SaRAD (Adar, 04)                    .891  .919  .905
  ALICE (Ao, 05)                      .961  .920  .940
Machine learning:
  Chang & Schutze (2006)              .942  .900  .921
  Nadeau & Turney (2005)              .954  .871  .910
Proposed, no distortion:
  d = 0; L1 regularization            .973  .969  .971
  d = 0; L2 regularization            .964  .968  .966
Proposed, distortion:
  d = 1; L1 regularization            .960  .981  .971
  d = 1; L2 regularization            .957  .976  .967
• The inclusion of distorted abbreviations (d = 1) gained the highest recall and F1
• Baseline systems with refined heuristics (SaRAD and ALICE) could not outperform the simplest system (S&H)
• The previous approaches with machine learning (C&S and N&T) were roughly comparable to the rule-based methods
• L1 regularization performed better than L2, probably because the number of features is far larger than that of instances
Performance on Medstract corpus
The proposed method was trained on our corpus, and applied to the Medstract corpus
- It still outperformed the baseline systems
- ALICE delivered much better results than S&H; the rules in ALICE might be tuned for this corpus
System                                 P     R     F1
Rule-based (simple → complex):
  Schwartz & Hearst (2003)            .942  .891  .916
  SaRAD (Adar, 04)                    .909  .859  .884
  ALICE (Ao, 05)                      .960  .945  .953
Machine learning:
  Chang & Schutze (2006)              .858  .852  .855
  Nadeau & Turney (2005)              .889  .875  .882
Proposed:
  d = 1; L1 regularization            .976  .945  .960
Alignment examples (1)
Shuffled abbreviations were successfully recognized
Alignment examples (2)
There are some confusing cases: the proposed method failed to choose the third alignment
Top seven features with high weights
#1: “Associate a head letter in a definition with an uppercase head letter of the abbreviation”
#2: “Produce two abbreviation letters from two consecutive letters in the definition”
#3: “Do not produce an abbreviation letter from a lowercase letter whose preceding letter is also lowercase.”
#4: “Produce two abbreviation letters from two lowercase letters in the same word”
Rank  Feature                                           Weight
1     U: x_position0=H;y_ctype0=U;y_position0=H/M       1.7370
2     B: y_position0=I/y_position0=I/x_diff=1/M-M       1.3470
3     U: x_ctype-1=L;x_ctype0=L/S                       0.96342
4     B: x_ctype0=L/x_ctype0=L/x_diff_wd=0/M-M          0.94009
5     U: x_position0=I;x_char1='t'/S                    0.91645
6     U: x_position0=H;x_pos0=NN;y_ctype0=U/M           0.86786
7     U: x_ctype-1=S;x_ctype0=L/M                       0.86474
Conclusion
• Abbreviation recognition successfully formalized as a sequential alignment problem
  – Showed remarkable improvements over previous methods
  – Obtained fine-grained features that express the events wherein an expanded form produces an abbreviation letter
• Future work
  – To handle different patterns (e.g., "aka", "abbreviated as")
  – To combine with the statistical approach (Okazaki, 06)
    • Construct a comprehensive abbreviation dictionary based on the n-best solutions and statistics of occurrences
  – To train the alignment model from a non-aligned corpus, inducing abbreviation alignments simultaneously
More and more abbreviations produced
SaRAD (Adar, 04) extracted 6,574,953 abbreviation definitions from the whole of the MEDLINE database released in 2006