Cross-Domain Bootstrapping for Named Entity Recognition

Ang Sun, Ralph Grishman
New York University
July 28, 2011
EOS, SIGIR 2011, Beijing
NYU
Outline
1. Named Entity Recognition (NER)
2. Domain Adaptation Problem for NER
3. Cross-Domain Bootstrapping
   3.1 Feature Generalization with Word Clusters
   3.2 Instance Selection Based on Multiple Criteria
4. Conclusion
1. Named Entity Recognition (NER)
Two missions: identification and classification.

U.S. Defense Secretary Donald H. Rumsfeld discussed the resolution …

Identification:   [U.S.]  [Defense]  [Donald H. Rumsfeld]   (NAME  NAME  NAME)
Classification:    GPE     ORG        PERSON
2. Domain Adaptation Problem for NER
The NER system performs well on in-domain data (F-measure 83.08)
but performs poorly on out-of-domain data (F-measure 65.09).

Source domain (news articles):
  George Bush; Donald H. Rumsfeld; … Department of Defense …
Target domain (reports on terrorism):
  Abdul Sattar al-Rishawi; Fahad bin Abdul Aziz bin Abdul Rahman Al-Saud; … Al-Qaeda in Iraq …
2. Domain Adaptation Problem for NER
1. No annotated data from the target domain.
2. Many words are out-of-vocabulary.
3. Naming conventions are different:
   1. Length: short vs. long
      source: George Bush; Donald H. Rumsfeld
      target: Abdul Sattar al-Rishawi; Fahad bin Abdul Aziz bin Abdul Rahman Al-Saud
   2. Capitalization: weaker in the target domain.
4. Name variation occurs often in the target domain: Shaikh, Shaykh, Sheikh, Sheik, …

We want to automatically adapt the source-domain tagger to the target domain without annotating target-domain data.
3. Cross-Domain Bootstrapping
1. Train a tagger from labeled source data.
2. Tag all unlabeled target data with the current tagger.
3. Select good tagged words and add these to the labeled data.
4. Re-train the tagger.

(Diagram: labeled source data -> trained tagger, with feature generalization -> tags unlabeled target data, e.g. "President Assad" -> instance selection with multiple criteria -> selected instances flow back into the labeled data.)
NYU
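The four-step loop above can be sketched as follows; `train`, `tag`, and `select` are hypothetical stand-ins for the paper's MEMM training, decoding, and multi-criteria instance selection, passed in as functions so the sketch stays self-contained:

```python
def bootstrap(labeled_source, unlabeled_target, train, tag, select, iterations=35):
    """Iteratively adapt a source-trained tagger to the target domain.

    train(labeled)        -> a tagger trained on the labeled instances
    tag(tagger, corpus)   -> tagged target-domain instances
    select(tagged)        -> the subset judged good enough to promote
    """
    labeled = list(labeled_source)
    tagger = train(labeled)                       # 1. train on labeled source data
    for _ in range(iterations):
        tagged = tag(tagger, unlabeled_target)    # 2. tag unlabeled target data
        selected = select(tagged)                 # 3. select good tagged instances
        if not selected:                          # nothing left worth promoting
            break
        labeled.extend(selected)                  # add them to the labeled data
        tagger = train(labeled)                   # 4. re-train the tagger
    return tagger, labeled
```

The loop stops early once selection yields nothing, which is how a real run would avoid promoting ever-noisier instances.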
3.1 Feature Generalization with Word Clusters
The source model:
- A sequential model, assigning name classes to a sequence of tokens.
- Each name type is split into two classes, e.g. B_PER (beginning of PERSON) and I_PER (continuation of PERSON).
- A Maximum Entropy Markov Model (McCallum et al., 2000).
- Customary features.

U.S.    Defense  Secretary  Donald  H.     Rumsfeld
B_GPE   B_ORG    O          B_PER   I_PER  I_PER
3.1 Feature Generalization with Word Clusters
The source/seed model: customary features are extracted from the context window (t_{i-2}, t_{i-1}, t_i, t_{i+1}, t_{i+2}).

U.S.    Defense  Secretary  Donald  H.     Rumsfeld
B_GPE   B_ORG    O          B_PER   I_PER  I_PER

Features for the current token "Donald":
  currentToken            Donald
  wordType_currentToken   initial_capitalized
  previousToken_-1        Secretary
  previousToken_-1_class  O
  previousToken_-2        Defense
  nextToken_+1            H.
  …                       …
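As a sketch, the window features for one token might be extracted like this; the `word_type` shape function is an assumed simplification, and feature names follow the slide:

```python
def word_type(token):
    """Coarse word-shape feature, e.g. 'initial_capitalized' for 'Donald'."""
    if token[:1].isupper() and token[1:].islower():
        return "initial_capitalized"
    if token.isupper():
        return "all_capitalized"
    return "other"

def extract_features(tokens, classes, i):
    """Feature dict for tokens[i]; classes holds the already-assigned tags."""
    feats = {
        "currentToken": tokens[i],
        "wordType_currentToken": word_type(tokens[i]),
    }
    if i >= 1:
        feats["previousToken_-1"] = tokens[i - 1]
        feats["previousToken_-1_class"] = classes[i - 1]
    if i >= 2:
        feats["previousToken_-2"] = tokens[i - 2]
    if i + 1 < len(tokens):
        feats["nextToken_+1"] = tokens[i + 1]
    if i + 2 < len(tokens):
        feats["nextToken_+2"] = tokens[i + 2]
    return feats
```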
3.1 Feature Generalization with Word Clusters
- Build a word hierarchy from a 10M-word corpus (source + target), using the Brown word clustering algorithm.
- Represent each word as a bit string.

Bit string      Examples
110100011       John, James, Mike, Steven
11010011101     Abdul, Mustafa, Abi, Abdel
11010011111     Shaikh, Shaykh, Sheikh, Sheik
111111110       Qaeda, Qaida, qaeda, QAEDA
00011110000     FBI, FDA, NYPD
000111100100    Taliban
3.1 Feature Generalization with Word Clusters
- Add an additional layer of features that include word clusters:
    currentToken = John
    currentPrefix3 = 110  (fires also for target words in the same subtree, such as Abdul)
- To avoid commitment to a single cluster, cut the word hierarchy at different levels (prefixes of different lengths).
NYU
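A minimal sketch of this prefix layer, reusing the slide's example bit strings; the cut lengths (3, 5, 7) are an assumed choice, not necessarily the ones used in the paper:

```python
# Toy Brown-cluster table taken from the slide's examples.
BROWN = {
    "John": "110100011",
    "Abdul": "11010011101",
    "Shaikh": "11010011111",
    "Shaykh": "11010011111",
}

def prefix_features(word, lengths=(3, 5, 7)):
    """Map a word to cluster-prefix features at several hierarchy depths."""
    bits = BROWN.get(word)
    if bits is None:          # out-of-cluster word: contributes no features
        return {}
    return {f"currentPrefix{n}": bits[:n] for n in lengths if len(bits) >= n}
```

With this table, `currentPrefix3` is `110` for both John and Abdul, so a prefix feature learned from source-domain instances of John also fires on target-domain instances of Abdul.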
3.1 Feature Generalization with Word Clusters
Performance on the target domain:
- The test set contains 23K tokens.
- PERSON/ORGANIZATION/GPE: 771/585/559 instances; all other tokens belong to the not-a-name class.
- Word clusters give a 4-point improvement in F-measure:

Model                           P      R      F1
Source_Model                    70.02  61.86  65.69
Source_Model + Word Clusters    72.82  66.61  69.58
3.2 Instance Selection Based on Multiple Criteria
Single-domain bootstrapping uses a confidence measure as the single selection criterion.
In a cross-domain setting, the most confidently labeled instances are highly correlated with the source domain and contain little information about the target domain.

We propose multiple criteria.
Criterion 1: Novelty - prefer target-specific instances (promote Abdul instead of John).
3.2 Instance Selection Based on Multiple Criteria
Criterion 2: Confidence - prefer confidently labeled instances.

Local confidence, based on local features:

    LocalConf(I) = -\sum_{c_i} p(c_i \mid v) \log p(c_i \mid v)

where I is an instance, v is the feature vector for I, and c_i is name class i.

1) Minimum: 0, when one name class is predicted with probability 1, e.g., p(c_i | v) = 1.
2) Maximum: when the predictions are evenly distributed over all the name classes.
3) The lower the value, the more confident the instance is.
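A direct sketch of LocalConf as the entropy of the predicted class distribution; the class names below are illustrative:

```python
import math

def local_conf(p):
    """Entropy -sum_i p_i * log(p_i) of the class distribution p(c_i | v).

    p maps each name class to its predicted probability.
    Lower values mean a more confident prediction.
    """
    return -sum(q * math.log(q) for q in p.values() if q > 0.0)
```

A certain prediction (one class with probability 1) scores 0, the minimum; a uniform distribution over k classes scores log k, the maximum.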
3.2 Instance Selection Based on Multiple Criteria
Criterion 2: Confidence
Global confidence, based on corpus statistics:

 1  Prime Minister Abdul Karim Kabariti    PER
 2  warlord General Abdul Rashid Dostum    PER
 3  President A.P.J. Abdul Kalam will      PER
 4  President A.P.J. Abdul Kalam has       PER
 5  Abdullah bin Abdul Aziz ,              PER
 6  at King Abdul Aziz University          ORG
 7  Nawab Mohammed Abdul Ali ,             PER
 8  Dr Ali Abdul Aziz Al                   PER
 9  Nayef bin Abdul Aziz said              PER
10  leader General Abdul Rashid Dostum     PER

P(Abdul is a PER) = 0.9
3.2 Instance Selection Based on Multiple Criteria
Criterion 2: Confidence

Global confidence:

    GlobalConf(I) = -\sum_{c_i} p(c_i) \log p(c_i)

The lower the entropy, the more confident the instance is.

Combined confidence: the product of local and global confidence.
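A sketch of the global and combined scores: p(c_i) is estimated from how the current tagger labels a word's occurrences across the corpus (as in the "P(Abdul is a PER) = 0.9" example), and the combination by multiplication follows the slide. The helper names are assumptions:

```python
import math
from collections import Counter

def entropy(probs):
    """-sum q * log(q) over a sequence of probabilities."""
    return -sum(q * math.log(q) for q in probs if q > 0.0)

def global_conf(labels):
    """labels: name classes assigned to a word's occurrences in the corpus.

    Estimates p(c_i) by relative frequency, then returns the entropy;
    lower means the corpus labels the word more consistently.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    return entropy(c / total for c in counts.values())

def combined_conf(local, labels):
    """Product of local and global confidence (both entropies; lower = better)."""
    return local * global_conf(labels)
```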
3.2 Instance Selection Based on Multiple Criteria
Criterion 3: Density - prefer representative instances, which can be seen as centroid instances.

    Density(i) = \frac{1}{N - 1} \sum_{j=1, j \neq i}^{N} Sim(i, j)

where Density(i) is the average similarity between i and all other instances j, Sim(i, j) is the Jaccard similarity between the feature vectors of the two instances, and N is the total number of instances in the corpus.
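A sketch of the density computation, treating each instance as a set of feature strings (an assumed simplification of the feature vectors):

```python
def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| between two feature sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def density(i, instances):
    """Average similarity between instances[i] and every other instance."""
    others = [jaccard(instances[i], x)
              for j, x in enumerate(instances) if j != i]
    return sum(others) / len(others) if others else 0.0
```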
3.2 Instance Selection Based on Multiple Criteria
Criterion 4: Diversity - prefer a set of diverse instances instead of similar instances.

Example: ", said * in his"
- a highly confident instance,
- a high-density, representative instance,
- BUT continuing to promote such an instance would not gain additional benefit.

    diff(i, j) = Density(i) - Density(j)

diff(i, j) is the difference between instances i and j. Using a small value for diff(i, j), dense instances still have a higher chance of being selected, while a certain degree of diversity is achieved at the same time.
3.2 Instance Selection Based on Multiple Criteria
Putting all criteria together:
1. Novelty: filter out source-dependent instances.
2. Confidence: rank instances based on confidence; the top-ranked instances form a candidate set.
3. Density: rank instances in the candidate set in descending order of density.
4. Diversity:
   1. Accept the first instance (with the highest density) in the candidate set.
   2. Select the other candidates based on the diff measure.
NYU
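The four steps above might be strung together as follows. This is a sketch under assumptions: instances carry a precomputed `novel` flag, a `conf` score (lower = more confident), and a feature set, and the `top_k` and `min_diff` values are hypothetical:

```python
def select_instances(instances, top_k=50, min_diff=0.01):
    """Apply Novelty, Confidence, Density, and Diversity in order."""
    def jaccard(a, b):
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    # 1. Novelty: keep only target-specific instances.
    pool = [x for x in instances if x["novel"]]
    # 2. Confidence: the top_k most confident form the candidate set.
    pool.sort(key=lambda x: x["conf"])
    pool = pool[:top_k]
    # 3. Density: rank candidates by average similarity to the others.
    for x in pool:
        others = [jaccard(x["feats"], y["feats"]) for y in pool if y is not x]
        x["density"] = sum(others) / len(others) if others else 0.0
    pool.sort(key=lambda x: x["density"], reverse=True)
    # 4. Diversity: accept the densest instance, then only candidates whose
    #    density differs enough from every instance accepted so far.
    selected = []
    for x in pool:
        if not selected or all(abs(x["density"] - s["density"]) >= min_diff
                               for s in selected):
            selected.append(x)
    return selected
```

Because `min_diff` is small, dense instances still dominate the selection while near-duplicates of an already-accepted instance are skipped.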
3.2 Instance Selection Based on Multiple Criteria
Results

(Figure: F1 on the target domain (68 to 74) over bootstrapping iterations (0 to 35) for six configurations:
  + Novelty + CombinedConf + Diversity
  + Novelty + CombinedConf + Density
  + Novelty + CombinedConf
  + Novelty + LocalConf
  Generalized seed model (Source_Model + Word Clusters)
  - Novelty + LocalConf
where +/- := with/without the criterion.)
4. Conclusion
- Proposed a general cross-domain bootstrapping algorithm for adapting a model trained only on a source domain to a target domain.
- Improved the source model's F score by around 7 points.
- This is achieved:
  1. without using any annotated data from the target domain;
  2. without explicitly encoding any target-domain-specific knowledge into our system.
- The improvement is largely due to:
  1. the feature generalization of the source model with word clusters;
  2. the multi-criteria-based instance selection method.