Rapid Training of Information Extraction with Local and Global Data Views
Dissertation Defense
Ang Sun
Computer Science Department, New York University
April 30, 2012
Committee: Prof. Ralph Grishman, Prof. Satoshi Sekine, Prof. Heng Ji, Prof. Ernest Davis, Prof. Lakshminarayanan Subramanian
Outline
I. Introduction
II. Relation Type Extension: Active Learning with Local and Global Data Views
III. Relation Type Extension: Bootstrapping with Local and Global Data Views
IV. Cross-Domain Bootstrapping for Named Entity Recognition
V. Conclusion
Part I
Introduction
Tasks
1. Named Entity Recognition (NER)
2. Relation Extraction (RE)
   i. Relation Extraction between Names
   ii. Relation Mention Extraction
NER
Name         Type
Bill Gates   PERSON
Seattle      LOCATION
Microsoft    ORGANIZATION

Bill Gates, born October 28, 1955 in Seattle, is the former chief executive officer (CEO) and current chairman of Microsoft.
RE: i. Relation Extraction between Names

Adam, a data analyst for ABC Inc.

NER: Adam (PERSON), ABC Inc. (ORGANIZATION)
RE: Employment(Adam, ABC Inc.)
RE: ii. Relation Mention Extraction

Adam, a data analyst for ABC Inc.

Entity Extraction:
Entity Mention    Entity
Adam              {Adam, a data analyst}
a data analyst    {Adam, a data analyst}
ABC Inc.          {ABC Inc.}

RE: Employment(a data analyst, ABC Inc.)
Prior Work – Supervised Learning
• Learn with labeled data (x_i, y_i)
  – < Bill Gates, PERSON >
  – < <Adam, ABC Inc.>, Employment >
Prior Work – Supervised Learning

O. J. Simpson was
P     P  P       O
arrested and charged with
O        O   O       O
murdering his ex-wife ,
O         O   O       O
Nicole Brown Simpson ,
P      P     P       O
and her friend Ronald
O   O   O      P
Goldman in 1994 .
P       O  O    O

Expensive!
• Expensive
• A trained model is typically domain-dependent
  – Porting it to a new domain usually involves annotating data from scratch
Prior Work – Supervised Learning

[Figure: the same tagged example sentence repeated across several domains, with annotation times of 15 minutes, 1 hour, and 2 hours]

Annotation is tedious!
Prior Work – Semi-supervised Learning
• Learn with both
  – labeled data (x_i, y_i): small
  – unlabeled data x_i: large
• The learning is an iterative process (see the sketch below):
  1. Train an initial model with labeled data
  2. Apply the model to tag unlabeled data
  3. Select good tagged examples as additional training examples
  4. Re-train the model
  5. Repeat from Step 2
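A minimal self-training sketch of this loop; the model interface (train, predict, confidence), the confidence threshold, and the iteration cap are illustrative assumptions, not a specific system from the talk:

```python
def self_train(labeled, unlabeled, train, predict, confidence,
               threshold=0.95, max_iters=20):
    """labeled: list of (x, y); unlabeled: list of x.
    train(labeled) -> model; predict(model, x) -> label;
    confidence(model, x) -> score in [0, 1] (all assumed interfaces)."""
    model = train(labeled)                      # Step 1: initial model
    for _ in range(max_iters):
        # Step 2: tag unlabeled data; Step 3: keep only confident examples.
        newly_labeled = [(x, predict(model, x)) for x in unlabeled
                         if confidence(model, x) >= threshold]
        if not newly_labeled:
            break                               # nothing confident left: stop
        labeled.extend(newly_labeled)
        unlabeled = [x for x in unlabeled
                     if confidence(model, x) < threshold]
        model = train(labeled)                  # Step 4: re-train; Step 5: repeat
    return model
```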
Prior Work – Semi-supervised Learning
• Problem 1: Semantic Drift
  – Example 1: a learner for PERSON names ends up learning flower names, because some women's first names coincide with flower names (Rose, …)
  – Example 2: a learner for LocatedIn relation patterns ends up learning patterns for other relations (birthPlace, governorOf, …)
Prior Work – Semi-supervised Learning
• Problem 2: Lacks a good stopping criterion
• Most systems
  – either use a fixed number of iterations
  – or use a labeled development set to detect the right stopping point
Prior Work – Unsupervised Learning
• Learn with only unlabeled data x_i
• Unsupervised Relation Discovery
  – Context-based clustering
  – Group pairs of named entities with similar contexts into the same relation cluster
Prior Work – Unsupervised Learning
• Unsupervised Relation Discovery (Hasegawa et al., 2004)
Prior Work – Unsupervised Learning
• Unsupervised Relation Discovery
  – The semantics of clusters are usually unknown
  – Some clusters are coherent → we can consistently label them
  – Some are mixed, containing different topics → difficult to label
Part II
Relation Type Extension: Active Learning with Local and Global Data Views
Relation Type Extension
• Extend a relation extraction system to new types of relations

ACE 2004 Relations
Type        Example
EMP-ORG     the CEO of Microsoft
PHYS        a military base in Germany
GPE-AFF     U.S. businessman
PER-SOC     his ailing father
ART         US helicopters
OTHER-AFF   Cuban-American people

Multi-class Setting:
  Target relation: one of the ACE relation types
  Labeled data: 1) a few labeled examples of the target relation (possibly by random selection); 2) all labeled auxiliary relation examples
  Unlabeled data: all other examples in the ACE corpus

Binary Setting:
  Target relation: one of the ACE relation types
  Labeled data: a few labeled examples of the target relation (possibly by random selection)
  Unlabeled data: all other examples in the ACE corpus
LGCo-Testing
• LGCo-Testing := co-testing with local and global views
• The general idea:
  1. Train one classifier based on the local view (the sentence that contains the pair of entities)
  2. Train another classifier based on the global view (distributional similarities between relation instances)
  3. Reduce annotation cost by requesting labels only for contention data points
• The local view
<e1>President Clinton</e1> traveled to <e2>the Irish border</e2> for an evening ceremony.
Token Sequence:
Words before entity 1                       {NIL}
Words between                               {travel, to}
Words after entity 2                        {for, an}
# words between                             2
Token pattern coupled with entity types     PERSON_traveled_to_LOCATION

Syntactic Parsing Tree:
Path of phrase labels connecting E1 and E2, augmented with the head word of the top phrase:
NP--S--traveled--VP--PP
• The local view
<e1>President Clinton</e1> traveled to <e2>the Irish border</e2> for an evening ceremony.
Dependency Parsing Tree:
Shortest path connecting the two entities, coupled with entity types:
PER_nsubj'_traveled_prep_to_LOC
• The local view classifier (a feature-extraction sketch follows)
  – Binary Setting: MaxEnt binary classifier
  – Multi-class Setting: MaxEnt multi-class classifier
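To make the local view concrete, here is a small sketch of token-sequence feature extraction; the Instance fields and the feature names are illustrative assumptions, not the dissertation's actual code:

```python
from dataclasses import dataclass

@dataclass
class Instance:
    tokens: list        # sentence tokens
    e1_span: tuple      # (start, end) token indices of entity 1, end exclusive
    e2_span: tuple      # (start, end) token indices of entity 2, end exclusive
    e1_type: str        # e.g., "PERSON"
    e2_type: str        # e.g., "LOCATION"

def local_view_features(inst: Instance) -> dict:
    """Extract token-sequence features like those on the slide."""
    between = inst.tokens[inst.e1_span[1]:inst.e2_span[0]]
    before = inst.tokens[max(0, inst.e1_span[0] - 2):inst.e1_span[0]]
    after = inst.tokens[inst.e2_span[1]:inst.e2_span[1] + 2]
    feats = {}
    for w in before:
        feats[f"before_e1={w}"] = 1.0
    for w in between:
        feats[f"between={w}"] = 1.0
    for w in after:
        feats[f"after_e2={w}"] = 1.0
    feats[f"num_between={len(between)}"] = 1.0
    # Token pattern coupled with entity types, e.g. PERSON_traveled_to_LOCATION
    feats["pattern=" + "_".join([inst.e1_type] + between + [inst.e2_type])] = 1.0
    return feats
```

The resulting feature dictionary would feed a MaxEnt (logistic regression) classifier in either the binary or the multi-class setting.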
• The global view

1. Compile a corpus of 2,000,000,000 tokens into a database of 7-grams (* * * * * * *)
2. Represent each relation instance as a relational phrase
3. Compute distributional similarities between phrases in the 7-grams database
4. Build a relation classifier based on the k-nearest-neighbor idea
The General Idea

Relation Instance                                               Relational Phrase
<e1>Clinton</e1> traveled to <e2>the Irish border</e2> for …    traveled to
… <e2><e1>his</e1> brother</e2> said that …                     his brother
• Compute distributional similarities
<e1>President Clinton</e1> traveled to <e2>the Irish border</e2> for an evening ceremony.
Query the 7-gram database with the phrase in each position:
traveled to * * * * *
* traveled to * * * *
* * traveled to * * *
* * * traveled to * *
* * * * traveled to *
* * * * * traveled to

> * * * traveled to * *
3  's headquarters here traveled to the U.S.
4  laundering after he traveled to the country
3  , before Paracha traveled to the United
3  have never before traveled to the United
3  had in fact traveled to the United
4  two Cuban grandmothers traveled to the United
3  officials who recently traveled to the United
6  President Lee Teng-hui traveled to the United
4  1996 , Clinton traveled to the United
4  commission members have traveled to the United
4  De Tocqueville originally traveled to the United
4  Fernando Henrique Cardoso traveled to the United
3  Elian 's grandmothers traveled to the United
• Compute distributional similarities
<e1>Ang</e1> arrived in <e2>Seattle</e2> on Wednesday.
> * * * arrived in * *
4   Arafat , who arrived in the other
5   of sorts has arrived in the new
5   inflation has not arrived in the U.S.
3   Juan Miguel Gonzalez arrived in the U.S.
6   it almost certainly arrived in the New
7   4 to Brazil , arrived in the country
4   said Annan had arrived in the country
21  he had just arrived in the country
5   had not yet arrived in the country
3   when they first arrived in the country
3   day after he arrived in the country
5   children who recently arrived in the country
4   Iraq Paul Bremer arrived in the country
3   head of counterterrorism arrived in the country
3   election monitors have arrived in the country
• Compute distributional similarities
  – Represent each phrase as a feature vector of contextual tokens
  – Compute cosine similarity between two feature vectors
  – Feature weight?

President Clinton traveled to the Irish border
<L2_President, L1_Clinton, R1_the, R2_Irish, R3_border>
Features for traveled to (sorted by frequency):
1-10       11-20      21-30          31-40
R1_the L1_have R4_to R3_in
L2_, R2_and R1_Washington L2_and
R2_to R2_in R1_New L1_He
L1_who L3_. R4_, R1_a
R2_, L1_and R2_on L1_also
L1_, R3_to R4_the R3_a
L1_had L2_who R1_China R2_with
L1_he R2_for L4_, L3_the
L3_, L4_. L2_the L2_when
L1_has R3_the R3_, L1_then
Features for arrived in (sorted by frequency):
1-10       11-20      21-30          31-40
R1_the R1_Beijing R2_in R3_a
R2_on L1_had R3_, R4_a
L1_who R2_to R2_for R4_the
L2_, R3_on R3_for L3_,
L1_, L4_. R2_from R1_a
L3_. R3_the R4_for L1_they
R2_, R1_New R3_to R4_to
L1_he R2_. L3_the R1_Moscow
L1_has L2_when R3_capital L5_.
L1_have R4_, L2_the L3_The
Feature Weight: use frequency?

Feature Weight: use tf-idf (a sketch follows below the tables)
  tf: the number of corpus instances of phrase P having feature f, divided by the total number of instances of P
  idf: the total number of phrases in the corpus divided by the number of phrases with at least one instance with feature f
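A small sketch of the tf-idf weighting and cosine similarity just described; the data structures and count bookkeeping are assumed for illustration:

```python
import math
from collections import Counter

def tfidf_vector(inst_feature_counts: Counter, n_instances: int,
                 n_phrases: int, phrase_df: dict) -> dict:
    """inst_feature_counts: for phrase P, how many of its instances carry
    each contextual feature. phrase_df[f]: number of phrases with at least
    one instance with feature f (both assumed precomputed)."""
    vec = {}
    for f, count in inst_feature_counts.items():
        tf = count / n_instances          # fraction of P's instances with f
        idf = n_phrases / phrase_df[f]    # rarity of f across all phrases
        vec[f] = tf * idf
    return vec

def cosine(u: dict, v: dict) -> float:
    """Cosine similarity between two sparse feature vectors."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```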
Features for traveled to (sorted by tf-idf):
1-10       11-20      21-30          31-40
L1_had L1_He L1_then R1_Beijing
L1_who R1_New L1_she R1_London
L1_he L2_who L1_also R2_for
L1_has R1_China R2_York R2_in
L1_have R2_, R1_Afghanistan L2_when
R2_to L1_recently L1_Zerhouni R1_Baghdad
L1_, R1_Thenia L1_Clinton R1_Mexico
R1_the L1_and L1_they L2_He
R1_Washington R1_Europe L3_Nouredine R4_to
L2_, R1_Cuba R2_and R2_United
Features for arrived in (sorted by tf-idf):
1-10       11-20      21-30          31-40
L1_who R1_Baghdad L1_they R1_Seoul
R2_on R1_Moscow R2_Sunday L5_.
R1_Beijing L1_delegation R2_Tuesday R1_Damascus
L1_has R3_capital R1_Washington R2_,
L1_he R1_New R3_Monday R3_Wednesday
L1_have L3_. R2_Wednesday R3_Thursday
L1_had L2_, R3_Sunday R2_from
L1_, L1_He R2_Thursday R1_Amman
R1_Cairo L2_when R3_Tuesday L3_Minister
R1_the R2_Monday R2_York R1_Belgrade
• Compute distributional similarities
Phrases similar to "traveled to"      Phrases similar to "his family"
Phrase            Sim.                Phrase            Sim.
visited           0.779               his staff         0.792
arrived in        0.763               his brother       0.789
worked in         0.751               his friends       0.780
lived in          0.719               his children      0.769
served in         0.686               their families    0.753
consulted with    0.672               his teammates     0.746
played for        0.670               his wife          0.725

Sample of similar phrases.
• The global view classifier

k-nearest-neighbor classifier: classify an unlabeled example based on the closest labeled examples, using the phrase similarities above (see the sketch below).

<e1>President Clinton</e1> traveled to <e2>the Irish border</e2>   PHYS-LocatedIn
<e1>Ang Sun</e1> arrived in <e2>Seattle</e2> on Wednesday.         ? → PHYS-LocatedIn
… <e2><e1>his</e1> brother</e2> said that …                        PER-SOC

sim(arrived in, traveled to) = 0.763    sim(arrived in, his brother) = 0.012
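A minimal k-nearest-neighbor sketch of the global-view classifier; phrase_sim stands in for the precomputed distributional similarities, and the similarity-weighted vote is one plausible choice rather than the exact scheme used:

```python
from collections import Counter

def knn_global_view(query_phrase: str, labeled: list, phrase_sim, k: int = 5):
    """labeled: list of (phrase, relation_label) pairs.
    phrase_sim(p, q): distributional similarity between two phrases."""
    neighbors = sorted(labeled,
                       key=lambda pl: phrase_sim(query_phrase, pl[0]),
                       reverse=True)[:k]
    # Weight each neighbor's vote by its similarity to the query phrase.
    votes = Counter()
    for phrase, label in neighbors:
        votes[label] += phrase_sim(query_phrase, phrase)
    return votes.most_common(1)[0][0]
```

With the similarities above, "arrived in" is far closer to "traveled to" (0.763) than to "his brother" (0.012), so it inherits PHYS-LocatedIn.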
• LGCo-Testing Procedure in Detail (a sketch follows)
  – Use KL-divergence to quantify the disagreement between the two classifiers
  – KL-divergence is 0 for identical distributions and maximal when the distributions are peaked and prefer different class labels
  – Rank instances in descending order of KL-divergence
  – Pick the top 5 instances to request human labels in each iteration
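A small sketch of the disagreement score; the slides do not specify the exact variant, so the symmetric form below (KL in both directions) is an assumption:

```python
import math

def kl(p: dict, q: dict, eps: float = 1e-12) -> float:
    """KL(p || q) over a shared label set; eps guards against log(0)."""
    return sum(pv * math.log((pv + eps) / (q.get(c, 0.0) + eps))
               for c, pv in p.items() if pv > 0.0)

def disagreement(local_dist: dict, global_dist: dict) -> float:
    """Symmetric disagreement between the two views' label distributions."""
    return kl(local_dist, global_dist) + kl(global_dist, local_dist)

# Rank unlabeled instances by descending disagreement and request human
# labels for the top 5 per iteration, as described above.
```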
Active Learning Baselines
• RandomAL
• UncertaintyAL
  – Local view classifier
  – Sample selection: entropy h(p) = -Σ_i p(c_i) log p(c_i)
• UncertaintyAL+
  – Local view classifier (with phrase cluster features)
  – Sample selection: entropy h(p)
• SPCo-Testing
  – Co-testing (sequence-view classifier and parsing-view classifier)
  – Sample selection: KL-divergence
Annotation speed: 4 instances per minute → 200 instances per hour (the annotator takes a 10-minute break each hour)

Supervised: 36K instances → 180 hours
LGCo-Testing: 300 instances → 1.5 hours
[Figure: F1 (0-80) vs. # labeled instances (0-1000) for PER-SOC; curves: LGCo-Testing, SPCo-Testing, UncertaintyAL, UncertaintyAL+, RandomAL, Supervised.]
Results for PER-SOC (Multi-class Setting)
Results for other types of relations have similar trends (in both binary and multiclass settings)
Precision-recall Curve of LGCo-Testing (Multi-class setting)
[Figure: precision (0-90) vs. recall (15-80) curves of LGCo-Testing for EMP-ORG, PER-SOC, ART, OTHER-AFF, GPE-AFF, and PHYS.]
Comparing LGCo-Testing with the Two Settings
[Figure: F1 difference (-50 to 0) vs. # labels (0-1000) for GPE-AFF Binary, GPE-AFF Multi-class, OTHER-AFF Binary, and OTHER-AFF Multi-class.]

F1 difference (in percentage) = F1 of active learning minus F1 of supervised learning

The reduction of annotation cost by incorporating auxiliary types is more pronounced in early learning stages (#labels < 200) than in later ones.
Part III
Relation Type Extension: Bootstrapping with Local and Global Data Views
Basic Idea
• Consider a bootstrapping procedure to discover semantic patterns for extracting relations between named entities
Basic Idea
• It starts from some seed patterns, which are used to extract named entity (NE) pairs, which in turn yield more semantic patterns learned from the corpus.
Basic Idea
• Semantic drift occurs because
  1) a pair of names may be connected by patterns belonging to multiple relations
  2) the bootstrapping procedure looks at the patterns in isolation

Named Entity 1: Bill Clinton
Patterns: visit, born in, fly to, governor of, arrive in, campaign in, …
Named Entity 2: Arkansas
Unguided Bootstrapping vs. Guided Bootstrapping

Unguided NE Pair Ranker: uses local evidence; looks at the patterns in isolation.
Guided NE Pair Ranker: uses global evidence; takes into account the clusters (C_i) of patterns.
Unguided Bootstrapping
• Initial Settings:
  – The seed patterns for the target relation R have precision 1, and all other patterns 0
  – All NE pairs have confidence 0
Unguided Bootstrapping
• Step 1: Use seed patterns to match new NE pairs and evaluate NE pairs
  – If many of the k patterns connecting the two names are high-precision patterns, then the name pair should have a high confidence
  – The confidence of NE pairs is estimated as

      Conf(N_i) = 1 - ∏_{j=1}^{k} (1 - Prec(p_j))

  – Problem: this over-rates NE pairs which are connected by patterns belonging to multiple relations
Unguided Bootstrapping
• Step 2: Use NE pairs to search for new patterns and rank patterns
  – Similarly, for a pattern p: if many of the NE pairs it matches are very confident, then p has many supporters and should have a high ranking
  – The confidence of a pattern is estimated as

      Conf(p) = (Sup(p) / |H|) · log Sup(p)

    where |H| is the number of unique NE pairs matched by p, and Sup(p) is the sum of the support from the |H| pairs
Unguided Bootstrapping
• Step 2 (continued; a code sketch of these scores follows)
  – Sup(p) is the sum of the support p gets from the |H| pairs:

      Sup(p) = Σ_{j=1}^{|H|} Conf(N_j)

  – The precision of p is given by the average confidence of the NE pairs matched by p:

      Prec(p) = Sup(p) / |H|

    • This normalizes the precision to range from 0 to 1
    • As a result, the confidence of each NE pair is also normalized to between 0 and 1
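A minimal sketch of these unguided bootstrapping scores; the data structures and the guard for small Sup(p) are illustrative assumptions:

```python
import math

def pair_confidence(pattern_precisions: list) -> float:
    """Conf(N) = 1 - prod_j (1 - Prec(p_j)) over the k patterns matching N:
    a noisy-or, high when any matching pattern has high precision."""
    prod = 1.0
    for prec in pattern_precisions:
        prod *= (1.0 - prec)
    return 1.0 - prod

def pattern_scores(matched_pair_confs: list) -> tuple:
    """For a pattern p matching |H| unique NE pairs with confidences
    Conf(N_j), return (Prec(p), Conf(p))."""
    h = len(matched_pair_confs)
    sup = sum(matched_pair_confs)        # Sup(p) = sum_j Conf(N_j)
    prec = sup / h                       # Prec(p), normalized to [0, 1]
    # Conf(p) = (Sup(p) / |H|) * log Sup(p); guard small Sup (assumption).
    conf = prec * math.log(sup) if sup > 1.0 else 0.0
    return prec, conf
```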
Unguided Bootstrapping
• Step 3: Accept patterns
  – Accept the K top-ranked patterns from Step 2
• Step 4: Loop or stop
  – The procedure now decides whether to repeat from Step 1 or to terminate
  – Most systems simply do NOT know when to stop
Guided Bootstrapping
• Pattern Clusters: clustering steps
  I. Extract features for patterns
  II. Compute the tf-idf value of extracted features
  III. Compute the cosine similarity between patterns
  IV. Build a pattern hierarchy by complete linkage

[Table: sample features for "X visited Y" as in "Jordan visited China"]
Guided Bootstrapping
• Pattern Clusters
  – We use a threshold of 0.005 to cut the pattern hierarchy into clusters
  – This cutoff is decided by
    • trying a series of thresholds
    • searching for the maximal one that is capable of placing the seed patterns for each relation into a single cluster
  – We define the target cluster C_t as the one containing the seeds
Guided Bootstrapping
• Pattern cluster example
  [Table: top 15 patterns in the Located-in cluster]
Guided Bootstrapping
• Step 1: Use seed patterns to match new NE pairs and evaluate NE pairs
  – Global_Conf(N_i) measures the degree of association between N_i and the target cluster C_t: the number of times each pattern p in C_t matches N_i, summed and divided by the total number of pattern instances matching N_i
Guided Bootstrapping
• Step 1: Use seed patterns to match new NE pairs and evaluate NE pairs
  – Why does this give a better confidence estimation? Consider <Clinton, Arkansas> for the Located-in relation:
    • Local_Conf(N_i) is very high
    • Global_Conf(N_i) is very low (less than 0.1)
    • So Conf(N_i) is low: the high Local_Conf(N_i) is discounted by the low Global_Conf(N_i)
Guided Bootstrapping
• Step 2: Use NE pairs to search for new patterns and rank patterns
  – All the measurement functions are the same as those used in unguided bootstrapping
  – However, with better ranking of NE pairs in Step 1, the patterns are also ranked better
• Step 3: Accept patterns
  – We also accept the K top-ranked patterns
Guided Bootstrapping
• Step 4: Loop or stop
  – Since each pattern in our corpus has a cluster membership, we can monitor semantic drift easily and stop naturally (see the sketch below):
    • the procedure drifts when it tries to accept patterns which do not belong to the target cluster
    • we can stop when the procedure tends to accept more patterns outside of the target cluster
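A small sketch of this stopping test; the 0.5 drift threshold and the data structures are illustrative assumptions:

```python
def should_stop(accepted_patterns: list, cluster_of: dict,
                target_cluster: str, max_drift: float = 0.5) -> bool:
    """accepted_patterns: patterns accepted in the current iteration.
    cluster_of[p]: cluster membership of pattern p (assumed precomputed)."""
    if not accepted_patterns:
        return True
    # Stop when most newly accepted patterns fall outside the target cluster.
    outside = sum(1 for p in accepted_patterns
                  if cluster_of.get(p) != target_cluster)
    return outside / len(accepted_patterns) > max_drift
```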
Experiments
• Pattern clusters:
  – Computed from a corpus of 1.3 billion tokens
• Evaluation data:
  – ACE 2004 training data (no relation annotation between each pair of names)
  – We take advantage of entity co-reference information to automatically re-annotate the relations
  – Annotation was reviewed by hand
• Evaluation method:
  – direct evaluation
  – strict pattern match
Experiments

[Figure: precision/recall curves; red = guided bootstrapping, blue = unguided bootstrapping]

drift: the percentage of false positives belonging to ACE relations other than the target relation
Experiments
• Guided bootstrapping terminates when the precision is still high while maintaining a reasonable recall
• It also effectively prevented semantic drift
Part IV
Cross-Domain Bootstrapping for Named Entity Recognition
[Diagram: an NER model trained on the Source Domain is adapted via semi-supervised learning to an NER model for the Target Domain]

NER Model: Maximum Entropy Markov Model (McCallum et al., 2000)
Split each name type into two classes:
  B_PER (beginning of PERSON)
  I_PER (continuation of PERSON)
P(S_1, …, S_n | T_1, …, T_n) = ∏_{i=1}^{n} P(S_i | S_{i-1}, T_i)

U.S.    Defense  Secretary  Donald  H.     Rumsfeld    (tokens T1 … T6)
B_GPE   B_ORG    O          B_PER   I_PER  I_PER       (name classes S1 … S6)
Goal: MEMM = Maximum Entropy Classifier + Viterbi Algorithm (a decoding sketch follows)

δ_t(j) = max_{i=1…N} δ_{t-1}(i) · P(s_j | s_i, o_t)
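A minimal Viterbi sketch for this decoding step; trans_prob stands in for the MaxEnt estimate of P(state | previous state, token) and is assumed to return nonzero probabilities:

```python
import math

def viterbi(tokens, states, trans_prob):
    """tokens: observation sequence; states: the name classes;
    trans_prob(prev_state, state, token) -> probability (assumed interface;
    prev_state is None at position 0)."""
    # delta[s] = best log-probability of any state path ending in s
    delta = {s: math.log(trans_prob(None, s, tokens[0])) for s in states}
    backpointers = []
    for tok in tokens[1:]:
        new_delta, pointers = {}, {}
        for s in states:
            # Pick the best predecessor state for s at this position.
            scores = {p: delta[p] + math.log(trans_prob(p, s, tok))
                      for p in states}
            best = max(scores, key=scores.get)
            pointers[s], new_delta[s] = best, scores[best]
        backpointers.append(pointers)
        delta = new_delta
    # Recover the best path by following back-pointers from the end.
    state = max(delta, key=delta.get)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return list(reversed(path))
```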
NER Model
• Estimate the name class of each individual token t_i
  – Extract a feature vector from the local context window (t_{i-2}, t_{i-1}, t_i, t_{i+1}, t_{i+2})
  – Learn feature weights using a Maximum Entropy model
U.S. Defense Secretary Donald H. Rumsfeld
B_GPE B_ORG O B_PER I_PER I_PER
Feature                    Value
currentToken               Donald
wordType_currentToken      initial_capitalized
previousToken_-1           Secretary
previousToken_-1_class     O
previousToken_-2           Defense
nextToken_+1               H.
…                          …
NER Model
• Estimate the name classes of the whole token sequence
  – Search for the most likely path: argmax P(S_1, …, S_n | T_1, …, T_n)
  – Use dynamic programming (there are N^L possible paths)
    N := number of name classes
    L := length of the token sequence

[Diagram: lattice of states B-PER, I-PER, B-ORG, I-ORG, B-GPE, I-GPE, O over the sentence]

U.S.    Defense  Secretary  Donald  H.     Rumsfeld
B_GPE   B_ORG    O          B_PER   I_PER  I_PER
Domain Adaptation Problems
Source domain (news articles):
  George Bush, Donald H. Rumsfeld, …, Department of Defense, …

Target domain (reports on terrorism):
  Abdul Sattar al-Rishawi, Fahad bin Abdul Aziz bin Abdul Rahman Al-Saud, …, Al-Qaeda in Iraq, …

Q (Target domain): What is the weight of the feature currentToken=Abdul?
A (Source domain): Sorry, I don't know. I've never seen this guy in my training data.
Domain Adaptation Problems
1. Many words are out-of-vocabulary
2. Naming conventions are different:
   – Length: short vs. long
   – Capitalization: weaker in the target domain
3. Name variation occurs often in the target domain:
   Shaikh, Shaykh, Sheikh, Sheik, …

We want to automatically adapt the source-domain tagger to the target domain without annotating target-domain data.
The Benefits of Incorporating Global Data View -- Feature Generalization
Q (Target domain): What is the weight of the feature currentToken=Abdul?
A (Source domain): Sorry, I don't know. I've never seen this guy in my training data.
Bit string Examples
110100011 John, James, Mike, Steven
11010011101 Abdul, Mustafa, Abi, Abdel
11010011111 Shaikh, Shaykh, Sheikh, Sheik
111111110 Qaeda, Qaida, qaeda, QAEDA
00011110000 FBI, FDA, NYPD
000111100100 Taliban
Global Data View Comes to the Rescue!
Build a word hierarchy from a 10M-word corpus (Source + Target), using the Brown word clustering algorithm.
The Benefits of Incorporating Global Data View -- Feature Generalization
• Add an additional layer of features that include word clusters (a sketch follows)
  – currentToken = John
  – currentPrefix3 = 110 (the 3-bit prefix of John's cluster bit string)
  – currentPrefix3 = 110 fires also for target words!
• To avoid commitment to a single cluster: cut the word hierarchy at different levels
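A small sketch of this cluster-prefix feature generalization; the cluster table and the choice of prefix lengths are illustrative assumptions:

```python
# Brown-cluster bit strings, as in the table above; in practice these are
# learned from the combined 10M-word source + target corpus.
BROWN_CLUSTERS = {
    "John": "110100011",
    "Abdul": "11010011101",
}

def cluster_features(token: str, prefix_lengths=(3, 5, 7, 9)) -> list:
    """Emit the lexical feature plus bit-string prefixes at several depths,
    so unseen target words sharing a cluster subtree with source words
    still fire shared features."""
    feats = [f"currentToken={token}"]
    bits = BROWN_CLUSTERS.get(token)
    if bits:
        # Cut the hierarchy at several levels instead of one cluster.
        for n in prefix_lengths:
            feats.append(f"currentPrefix{n}={bits[:n]}")
    return feats
```

Here cluster_features("John") and cluster_features("Abdul") share no lexical feature, but any prefix up to the depth where their bit strings diverge is a shared feature.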
The Benefits of Incorporating Global Data View -- Feature Generalization
Performance on the target domain:
Model                           P       R       F1
Source_Model                    70.02   61.86   65.69
Source_Model + Word Clusters    72.82   66.61   69.58
The Benefits of Incorporating Global Data View -- Instance Selection
Cross-domain Bootstrapping Algorithm:
1. Train a tagger from labeled source data
2. Tag all unlabeled target data with the current tagger
3. Select good tagged words and add them to the labeled data
4. Re-train the tagger

[Diagram: labeled source data → trained tagger (with feature generalization) → unlabeled target data → instance selection with multiple criteria (e.g., "President Assad") → back into the labeled data]
The Benefits of Incorporating Global Data View -- Instance Selection
• Multiple criteria
  – Criterion 1: Novelty - prefer target-specific instances
    • Promote Abdul instead of John
  – Criterion 2: Confidence - prefer confidently labeled instances
The Benefits of Incorporating Global Data View -- Instance Selection
Criterion 2: Confidence - prefer confidently labeled instances

Local confidence, based on local features:

    LocalConf(I) = 1 + Σ_{c_i} p(c_i | v) log p(c_i | v)

    I := instance
    v := feature vector for I
    c_i := name class i

1) Maximum (1) when one name class is predicted with probability 1, e.g., p(c_i | v) = 1
2) Minimum when the predictions are evenly distributed over all the name classes
3) The higher the value, the more confident the instance is
The Benefits of Incorporating Global Data View -- Instance Selection
Criterion 2: Confidence
Global confidence, based on corpus statistics:

 1  Prime Minister Abdul Karim Kabariti     PER
 2  warlord General Abdul Rashid Dostum     PER
 3  President A.P.J. Abdul Kalam will       PER
 4  President A.P.J. Abdul Kalam has        PER
 5  Abdullah bin Abdul Aziz ,               PER
 6  at King Abdul Aziz University           ORG
 7  Nawab Mohammed Abdul Ali ,              PER
 8  Dr Ali Abdul Aziz Al                    PER
 9  Nayef bin Abdul Aziz said               PER
10  leader General Abdul Rashid Dostum      PER

P(Abdul is a PER) = 0.9
The Benefits of Incorporating Global Data View -- Instance Selection
Criterion 2: Confidence
Global confidence (a sketch follows):

    GlobalConf(I) = 1 + Σ_{c_i} p(c_i) log p(c_i)

where p(c_i) is the corpus-level probability of name class c_i for the instance. The higher the value, the more confident the instance is.

Combined confidence: the product of local and global confidence.
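A minimal sketch of these confidence scores; the (1 + Σ p log p) form is "1 minus entropy", and the base-N logarithm (keeping scores in [0, 1], assuming at least two classes) is an assumption for readability:

```python
import math

def one_minus_entropy(dist: dict) -> float:
    """1 + sum_c p(c) log p(c): maximal (1) when one class gets all the
    probability mass, minimal when the mass is spread evenly."""
    n = len(dist)  # assumed >= 2 name classes
    return 1.0 + sum(p * math.log(p, n) for p in dist.values() if p > 0.0)

def combined_confidence(local_dist: dict, global_dist: dict) -> float:
    """local_dist: p(c | v) from the MaxEnt tagger for one instance;
    global_dist: p(c) from corpus statistics (e.g., P(Abdul is a PER))."""
    return one_minus_entropy(local_dist) * one_minus_entropy(global_dist)
```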
The Benefits of Incorporating Global Data View -- Instance Selection
Criterion 3: Density - prefer representative instances, which can be seen as centroid instances:

    Density(i) = (Σ_{j=1, j≠i}^{N} Sim(i, j)) / (N - 1)

the average similarity between i and all other instances j, where Sim(i, j) is the Jaccard similarity between the feature vectors of the two instances and N is the total number of instances in the corpus.
The Benefits of Incorporating Global Data View -- Instance Selection
Criterion 4: Diversity - prefer a set of diverse instances instead of similar instances

"… , said * in his …" is a highly confident instance and a high-density, representative instance. BUT continuing to promote such instances would not gain additional benefit.

    diff(i, j) = Density(i) - Density(j)

diff(i, j) := the difference between instances i and j. Use a small value for diff(i, j): dense instances still have a higher chance to be selected, while a certain degree of diversity is achieved at the same time.
The Benefits of Incorporating Global Data View -- Instance Selection
Putting all criteria together (a code sketch follows):
1. Novelty: filter out source-dependent instances
2. Confidence: rank instances by confidence; the top-ranked instances form a candidate set
3. Density: rank instances in the candidate set in descending order of density
4. Diversity:
   1. accept the first instance (with the highest density) in the candidate set
   2. select other candidates based on the diff measure
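A minimal end-to-end sketch of the four-criteria selection; all helper functions and thresholds are illustrative assumptions:

```python
def select_instances(instances, is_source_specific, confidence, density,
                     diff, top_k=100, min_diff=0.05):
    """is_source_specific, confidence, density, diff are assumed callables
    implementing the four criteria described above."""
    # 1. Novelty: drop instances the source model already knows well.
    novel = [i for i in instances if not is_source_specific(i)]
    # 2. Confidence: keep the top_k most confidently labeled instances.
    candidates = sorted(novel, key=confidence, reverse=True)[:top_k]
    # 3. Density: rank candidates by representativeness.
    candidates.sort(key=density, reverse=True)
    # 4. Diversity: accept the densest instance, then only candidates that
    #    differ enough (by the diff measure) from those already selected.
    selected = []
    for cand in candidates:
        if not selected or all(abs(diff(cand, s)) > min_diff for s in selected):
            selected.append(cand)
    return selected
```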
The Benefits of Incorporating Global Data View -- Instance Selection
Results
[Figure: F1 (68-74) vs. bootstrapping iteration (0-35) on the target domain for six configurations:
  + Novelty + CombinedConf + Diversity
  + Novelty + CombinedConf + Density
  + Novelty + CombinedConf
  + Novelty + LocalConf
  - Novelty + LocalConf
  Generalized seed model (SourceModel + WordCluster)
(+/- := with/without)]
Part V
Conclusion
Contribution
• The main contribution is the use of both local and global evidence for fast system development
• The co-testing procedure reduced annotation cost by 97%
• The use of pattern clusters as the global view in bootstrapping
  – not only greatly improved the quality of learned patterns
  – but also contributed to a natural stopping criterion
• Feature generalization and instance selection in the cross-domain bootstrapping were able to improve the source model's performance on the target domain by 7% F1 without annotating any target domain data
Future Work
• Active Learning for Relation Type Extension
  – conduct real-world active learning
  – combine semi-supervised learning with active learning to further reduce annotation cost
• Semi-supervised Learning for Relation Type Extension– better seed selection strategy
• Cross-domain Bootstrapping for Named Entity Recognition
  – extract dictionary-based features to further generalize lexical features
  – combine with distantly annotated data to further improve performance
Thanks!
• Backup slides
Experimental Setup for Active Learning
• ACE 2004 data
  – 4.4K relation instances
  – 45K non-relation instances
• 5-fold cross validation
  – Roughly 36K unlabeled instances (45K ÷ 5 × 4)
  – Random initialization (repeated 10 times)
  – 50 runs in total
  – Each iteration selects 5 instances for annotation
  – 200 iterations are performed
[Figures: F1 vs. # labeled instances (0-1000) for EMP-ORG, ART, OTHER-AFF, PHYS, and GPE-AFF; curves: LGCo-Testing, SPCo-Testing, UncertaintyAL, UncertaintyAL+, RandomAL, Supervised.]