Experiments in Graph-based Semi-Supervised Learning for ...Partha Pratim Talukdar * (Search Labs,...
Transcript of Experiments in Graph-based Semi-Supervised Learning for ...Partha Pratim Talukdar * (Search Labs,...
Partha Pratim Talukdar* (Search Labs, MSR)
Fernando Pereira (Google)
Experiments in Graph-based Semi-Supervised Learning
for Class-Instance Acquisition
ACL, 2010 (Uppsala, Sweden) *Work done while at the Univ. of Pennsylvania
Class-Instance Acquisition
2
Class-Instance Acquisition
• Given an entity, assign human readable descriptors to it– e.g., Toyota is a car manufacturer, japanese
company, multinational company, ...
2
Class-Instance Acquisition
• Given an entity, assign human readable descriptors to it– e.g., Toyota is a car manufacturer, japanese
company, multinational company, ...
• Large scale, open domain
2
Class-Instance Acquisition
• Given an entity, assign human readable descriptors to it– e.g., Toyota is a car manufacturer, japanese
company, multinational company, ...
• Large scale, open domain
• Applications– Web search, Advertising, etc.
2
Graph-based Class-Instance Acquisition
3
Graph-based Class-Instance Acquisition
Bob Dylan (0.95)Johnny Cash (0.87)
Billy Joel (0.82)
Set 2Set 1Billy Joel (0.75)
Johnny Cash (0.73)
PatternTable Mining
3
Graph-based Class-Instance Acquisition
Bob Dylan (0.95)Johnny Cash (0.87)
Billy Joel (0.82)
Set 2Set 1Billy Joel (0.75)
Johnny Cash (0.73)
PatternTable Mining
3
Musician Singer
Graph-based Class-Instance Acquisition
Bob Dylan (0.95)Johnny Cash (0.87)
Billy Joel (0.82)
Set 2Set 1Billy Joel (0.75)
Johnny Cash (0.73)
PatternTable Mining
Cluster ID
3
Musician Singer
Graph-based Class-Instance Acquisition
Bob Dylan (0.95)Johnny Cash (0.87)
Billy Joel (0.82)
Set 2Set 1Billy Joel (0.75)
Johnny Cash (0.73)
PatternTable Mining
Cluster ID
ExtractionConfidence
3
Musician Singer
Graph-based Class-Instance Acquisition
Bob Dylan (0.95)Johnny Cash (0.87)
Billy Joel (0.82)
Set 2Set 1Billy Joel (0.75)
Johnny Cash (0.73)
Set 2
Set 1
Bob Dylan
Johnny Cash
Billy Joel
0.95
0.87
0.82
0.73
0.75
PatternTable Mining
Cluster ID
ExtractionConfidence
3
Musician Singer
Graph-based Class-Instance Acquisition
Bob Dylan (0.95)Johnny Cash (0.87)
Billy Joel (0.82)
Set 2Set 1Billy Joel (0.75)
Johnny Cash (0.73)
Set 2
Set 1
Bob Dylan
Johnny Cash
Billy Joel
0.95
0.87
0.82
0.73
0.75
Musician 1.0
Musician 1.0
SeedClasses
PatternTable Mining
Cluster ID
ExtractionConfidence
3
Musician Singer
Graph-based Class-Instance Acquisition
Bob Dylan (0.95)Johnny Cash (0.87)
Billy Joel (0.82)
Set 2Set 1Billy Joel (0.75)
Johnny Cash (0.73)
Set 2
Set 1
Bob Dylan
Johnny Cash
Billy Joel
0.95
0.87
0.82
0.73
0.75
Musician 1.0
Musician 1.0
SeedClasses
PatternTable Mining
Cluster ID
ExtractionConfidence
3
Musician Singer
Can we infer that Bob Dylan is also a Musician, as that is missing in current
extractions?
Graph-based Class-Instance Acquisition [Talukdar et al., EMNLP 2008]
Set 2
Set 1
Bob Dylan
Johnny Cash
Billy Joel
0.95
0.87
0.82
0.73
0.75
4
Graph-based Class-Instance Acquisition [Talukdar et al., EMNLP 2008]
InitializationSet 2
Set 1
Bob Dylan
Johnny Cash
Billy Joel
0.95
0.87
0.82
0.73
0.75
Musician 1.0
Musician 1.0
Musician 1.0
Musician 1.0
Musician 1.0
SeedLabels
4
PredictedLabels
Graph-based Class-Instance Acquisition [Talukdar et al., EMNLP 2008]
Iteration 1Set 2
Set 1
Bob Dylan
Johnny Cash
Billy Joel
0.95
0.87
0.82
0.73
0.75
Musician 1.0
Musician 1.0
Musician 1.0
Musician 0.8Musician 1.0
Musician 1.0
SeedLabels
4
PredictedLabels
Graph-based Class-Instance Acquisition [Talukdar et al., EMNLP 2008]
Iteration 2Set 2
Set 1
Bob Dylan
Johnny Cash
Billy Joel
0.95
0.87
0.82
0.73
0.75
Musician 1.0
Musician 1.0
Musician 1.0
Musician 0.8Musician 1.0
Musician 1.0
SeedLabels
Musician 0.6
4
PredictedLabels
• Graph-based Class-Instance Acquisition– Overview & previous work
• Review & comparison of three SSL algorithms– LP-ZGL (Zhu+, 2003)
– Adsorption (Baluja+, 2008)– Modified Adsorption (MAD) (Talukdar and Crammer, 2009)
• Additional Semantic Constraints
• Summary
Outline
5
• Graph-based Class-Instance Acquisition– Overview & previous work
• Review & comparison of three SSL algorithms– LP-ZGL (Zhu+, 2003)
– Adsorption (Baluja+, 2008)– Modified Adsorption (MAD) (Talukdar and Crammer, 2009)
• Additional Semantic Constraints
• Summary
Outline
5
Comments on the Constructed Graph
Set 2
Set 1
Bob Dylan
Johnny Cash
Billy Joel
0.95
0.87
0.82
0.73
0.75
Musician 1.0
Musician 1.0
6
Comments on the Constructed Graph
Set 2
Set 1
Bob Dylan
Johnny Cash
Billy Joel
0.95
0.87
0.82
0.73
0.75
Musician 1.0
Musician 1.0
Smoothness:Nodes connected by an edge should be assigned similar classes, as enforced by edge weight.
6
Comments on the Constructed Graph
Set 2
Set 1
Bob Dylan
Johnny Cash
Billy Joel
0.95
0.87
0.82
0.73
0.75
Musician 1.0
Musician 1.0
Smoothness:Nodes connected by an edge should be assigned similar classes, as enforced by edge weight.
Initial Clusters
6
Comments on the Constructed Graph
Set 2
Set 1
Bob Dylan
Johnny Cash
Billy Joel
0.95
0.87
0.82
0.73
0.75
Musician 1.0
Musician 1.0
Smoothness:Nodes connected by an edge should be assigned similar classes, as enforced by edge weight.
Initial Clusters
6
Coupling Node:Force (softly) all instance nodes connected to it to have similar class labels, exploiting the Smoothness requirement.
Comments on the Constructed Graph
Set 2
Set 1
Bob Dylan
Johnny Cash
Billy Joel
0.95
0.87
0.82
0.73
0.75
Musician 1.0
Musician 1.0
Smoothness:Nodes connected by an edge should be assigned similar classes, as enforced by edge weight.
Initial Clusters
Seed classes can be different from the cluster IDs
6
Coupling Node:Force (softly) all instance nodes connected to it to have similar class labels, exploiting the Smoothness requirement.
LP-ZGL[Zhu et al., ICML 2003]
• m labels
• W : edge weight matrix
• Yul: weight of label l on node u
• Yul: seed weight of label l on node u
• S: diagonal matrix, non-zero for seed nodes
LP-ZGL[Zhu et al., ICML 2003]
• m labels
• W : edge weight matrix
• Yul: weight of label l on node u
• Yul: seed weight of label l on node u
• S: diagonal matrix, non-zero for seed nodes
LP-ZGL[Zhu et al., ICML 2003]
arg minY
m!
l=1
!
u,v
Wuv(Yul ! Yvl)2; s.t. SYl = SYl
• m labels
• W : edge weight matrix
• Yul: weight of label l on node u
• Yul: seed weight of label l on node u
• S: diagonal matrix, non-zero for seed nodes
LP-ZGL[Zhu et al., ICML 2003]
arg minY
m!
l=1
!
u,v
Wuv(Yul ! Yvl)2; s.t. SYl = SYl
Smooth
• m labels
• W : edge weight matrix
• Yul: weight of label l on node u
• Yul: seed weight of label l on node u
• S: diagonal matrix, non-zero for seed nodes
LP-ZGL[Zhu et al., ICML 2003]
arg minY
m!
l=1
!
u,v
Wuv(Yul ! Yvl)2; s.t. SYl = SYl
Match Seeds (hard)Smooth
Modified Adsorption (MAD)[Talukdar and Crammer, ECML 2009]
8
Modified Adsorption (MAD)[Talukdar and Crammer, ECML 2009]
8
argminY
m+1�
l=1
��SY l − SY l�2 + µ1
�
u,v
Muv(Y ul − Y vl)2 + µ2�Y l −Rl�2
�
• m labels, +1 dummy label
• M = W� +W is the symmetrized weight matrix
• Y vl: weight of label l on node v
• Y vl: seed weight for label l on node v
• S: diagonal matrix, nonzero for seed nodes
• Rvl: regularization target for label l on node v
Modified Adsorption (MAD)[Talukdar and Crammer, ECML 2009]
8
argminY
m+1�
l=1
��SY l − SY l�2 + µ1
�
u,v
Muv(Y ul − Y vl)2 + µ2�Y l −Rl�2
�
• m labels, +1 dummy label
• M = W� +W is the symmetrized weight matrix
• Y vl: weight of label l on node v
• Y vl: seed weight for label l on node v
• S: diagonal matrix, nonzero for seed nodes
• Rvl: regularization target for label l on node v
Match Seeds (soft)
Modified Adsorption (MAD)[Talukdar and Crammer, ECML 2009]
8
argminY
m+1�
l=1
��SY l − SY l�2 + µ1
�
u,v
Muv(Y ul − Y vl)2 + µ2�Y l −Rl�2
�
• m labels, +1 dummy label
• M = W� +W is the symmetrized weight matrix
• Y vl: weight of label l on node v
• Y vl: seed weight for label l on node v
• S: diagonal matrix, nonzero for seed nodes
• Rvl: regularization target for label l on node v
Match Seeds (soft) Smooth
Modified Adsorption (MAD)[Talukdar and Crammer, ECML 2009]
8
argminY
m+1�
l=1
��SY l − SY l�2 + µ1
�
u,v
Muv(Y ul − Y vl)2 + µ2�Y l −Rl�2
�
• m labels, +1 dummy label
• M = W� +W is the symmetrized weight matrix
• Y vl: weight of label l on node v
• Y vl: seed weight for label l on node v
• S: diagonal matrix, nonzero for seed nodes
• Rvl: regularization target for label l on node v
Match Seeds (soft) SmoothMatch Priors(Regularizer)
Modified Adsorption (MAD)[Talukdar and Crammer, ECML 2009]
8
argminY
m+1�
l=1
��SY l − SY l�2 + µ1
�
u,v
Muv(Y ul − Y vl)2 + µ2�Y l −Rl�2
�
• m labels, +1 dummy label
• M = W� +W is the symmetrized weight matrix
• Y vl: weight of label l on node v
• Y vl: seed weight for label l on node v
• S: diagonal matrix, nonzero for seed nodes
• Rvl: regularization target for label l on node v
MAD has extra regularization compared
to LP-ZGL
Match Seeds (soft) SmoothMatch Priors(Regularizer)
MAD (contd.)
9
Inputs Y ,R : |V |× (|L|+ 1), W : |V |× |V |, S : |V |× |V | diagonalY ← YM = W +W�
Zv ← Svv + µ1�
u �=v Mvu + µ2 ∀v ∈ Vrepeat
for all v ∈ V doY v ← 1
Zv
�(SY )v + µ1Mv·Y + µ2Rv
�
end foruntil convergence
MAD (contd.)
9
Inputs Y ,R : |V |× (|L|+ 1), W : |V |× |V |, S : |V |× |V | diagonalY ← YM = W +W�
Zv ← Svv + µ1�
u �=v Mvu + µ2 ∀v ∈ Vrepeat
for all v ∈ V doY v ← 1
Zv
�(SY )v + µ1Mv·Y + µ2Rv
�
end foruntil convergence
• Easily Parallelizable: Scalable
MAD (contd.)
9
Inputs Y ,R : |V |× (|L|+ 1), W : |V |× |V |, S : |V |× |V | diagonalY ← YM = W +W�
Zv ← Svv + µ1�
u �=v Mvu + µ2 ∀v ∈ Vrepeat
for all v ∈ V doY v ← 1
Zv
�(SY )v + µ1Mv·Y + µ2Rv
�
end foruntil convergence
• Easily Parallelizable: Scalable• Importance of a node can be discounted
MAD (contd.)
9
Inputs Y ,R : |V |× (|L|+ 1), W : |V |× |V |, S : |V |× |V | diagonalY ← YM = W +W�
Zv ← Svv + µ1�
u �=v Mvu + µ2 ∀v ∈ Vrepeat
for all v ∈ V doY v ← 1
Zv
�(SY )v + µ1Mv·Y + µ2Rv
�
end foruntil convergence
• Easily Parallelizable: Scalable• Importance of a node can be discounted• Convergence guarantee under mild conditions
Experiments with Public Datasets
10
Experiments with Public Datasets
• Previous research used proprietary datasets– difficult to reproduce and extend
10
Experiments with Public Datasets
• Previous research used proprietary datasets– difficult to reproduce and extend
• Public datasets are used in the paper– Freebase: relational tables from multiple sources
– TextRunner (Ritter+, 2009): Open-domain IE System
– YAGO (Suchanek+, 2007): KB curated from Wikipedia and WordNet
– Gold standard hypernyms: [Pantel+, 2009], WordNet
10
Experiments with Public Datasets
• Previous research used proprietary datasets– difficult to reproduce and extend
• Public datasets are used in the paper– Freebase: relational tables from multiple sources
– TextRunner (Ritter+, 2009): Open-domain IE System
– YAGO (Suchanek+, 2007): KB curated from Wikipedia and WordNet
– Gold standard hypernyms: [Pantel+, 2009], WordNet
• Available at: http://www.talukdar.net/datasets/class_inst/10
Graph Stats
11
Statistics of Graphs used in Experiments
Evaluation Metric: Mean Reciprocal Rank (MRR)
12
MRR =1
|test-set|�
v∈test-set
1
rankv(class(v))
Evaluation Metric: Mean Reciprocal Rank (MRR)
12
MRR =1
|test-set|�
v∈test-set
1
rankv(class(v))
Billy JoelLinguist 0.4Musician 0.3
...
Gold Label: MusicianMRR: 0.5 (1/2)
Graph-based SSL Comparisons (1)
13
Graph-based SSL Comparisons (1)
13
0.15
0.2
0.25
0.3
0.35
170 x 2 170 x 10
TextRunner Graph, 170 WordNet Classes
Me
an
Re
cip
roca
l R
an
k (
MR
R)
Amount of Supervision
LP-ZGL Adsorption MAD
Graph with175k nodes,529k edges.
Graph-based SSL Comparisons (2)
14
0.25
0.285
0.32
0.355
0.39
192 x 2 192 x 10
Freebase-2 Graph, 192 WordNet Classes
Me
an
Re
cip
roca
l R
an
k (
MR
R)
Amount of Supervision
LP-ZGL Adsorption MAD
Graph with303k nodes,2.3m edges.
When is MAD most effective?
15
When is MAD most effective?
15
0
0.1
0.2
0.3
0.4
0 7.5 15 22.5 30
Re
lati
ve I
ncr
eas
e i
n M
RR
by M
AD
ove
r L
P-Z
GL
Average Degree
When is MAD most effective?
15
0
0.1
0.2
0.3
0.4
0 7.5 15 22.5 30
Re
lati
ve I
ncr
eas
e i
n M
RR
by M
AD
ove
r L
P-Z
GL
Average Degree
MAD is most effective in dense graphs, where there is greater need for
regularization.
• Graph-based Class-Instance Acquisition– Overview & previous work
• Review & comparison of three SSL algorithms– LP-ZGL (Zhu+, 2003)
– Adsorption (Baluja+, 2008)– Modified Adsorption (MAD) (Talukdar and Crammer, 2009)
• Additional Semantic Constraints
• Summary
Outline
16
• Graph-based Class-Instance Acquisition– Overview & previous work
• Review & comparison of three SSL algorithms– LP-ZGL (Zhu+, 2003)
– Adsorption (Baluja+, 2008)– Modified Adsorption (MAD) (Talukdar and Crammer, 2009)
• Additional Semantic Constraints
• Summary
Outline
16
Can Additional Semantic Constraints Help ?
17
Can Additional Semantic Constraints Help ?
17
Set 2
Set 1
Isaac Newton
Johnny Cash
Bob Dylan
Can Additional Semantic Constraints Help ?
17
Instances with shared
attributes are likely to be
from the same class.
Set 2
Set 1
Isaac Newton
Johnny Cash
Bob Dylan
has_attribute-albums
Can Additional Semantic Constraints Help ?
17
Instances with shared
attributes are likely to be
from the same class.
Set 2
Set 1
Isaac Newton
Johnny Cash
Bob Dylan
has_attribute-albums
Can Additional Semantic Constraints Help ?
17
Instances with shared
attributes are likely to be
from the same class.
Set 2
Set 1
Isaac Newton
Johnny Cash
Bob Dylan
Graph-based representation makes it easy to incorporate such constraints!
Better Classes with YAGO Attributes
18
Better Classes with YAGO Attributes
18 0.3
0.338
0.375
0.413
0.45
170 WordNet Classes, 10 Seeds per Class, using Adsorption
Me
an
Re
cip
roca
l R
an
k (
MR
R)
TextRunner GraphYAGO GraphTextRunner + YAGO Graph
Better Classes with YAGO Attributes
18 0.3
0.338
0.375
0.413
0.45
170 WordNet Classes, 10 Seeds per Class, using Adsorption
Me
an
Re
cip
roca
l R
an
k (
MR
R)
TextRunner GraphYAGO GraphTextRunner + YAGO Graph
Graph constructed from TextRunner (UWash) output, 175k nodes, 529k edges
Better Classes with YAGO Attributes
18 0.3
0.338
0.375
0.413
0.45
170 WordNet Classes, 10 Seeds per Class, using Adsorption
Me
an
Re
cip
roca
l R
an
k (
MR
R)
TextRunner GraphYAGO GraphTextRunner + YAGO Graph
Graph constructed from output of YAGO Knowledge Base, 142k nodes, 777k edges
Better Classes with YAGO Attributes
18 0.3
0.338
0.375
0.413
0.45
170 WordNet Classes, 10 Seeds per Class, using Adsorption
Me
an
Re
cip
roca
l R
an
k (
MR
R)
TextRunner GraphYAGO GraphTextRunner + YAGO Graph
Combined graph, with 237k nodes, 1.3m edges
Better Classes with YAGO Attributes
18 0.3
0.338
0.375
0.413
0.45
170 WordNet Classes, 10 Seeds per Class, using Adsorption
Me
an
Re
cip
roca
l R
an
k (
MR
R)
TextRunner GraphYAGO GraphTextRunner + YAGO Graph
Better Classes with YAGO Attributes
18 0.3
0.338
0.375
0.413
0.45
170 WordNet Classes, 10 Seeds per Class, using Adsorption
Me
an
Re
cip
roca
l R
an
k (
MR
R)
TextRunner GraphYAGO GraphTextRunner + YAGO Graph
Additional semantic constraints help improve performance significantly.
Classes for Attributes
• Classes assigned to attribute nodes
19
has_isbn
wordnet_bookwordnet_maganize
Classes for Attributes
• Classes assigned to attribute nodes
19
has_isbn
wordnet_bookwordnet_maganize
Classes for Attributes
• Classes assigned to attribute nodes
19
has_isbn
wordnet_bookwordnet_maganize
Evidence that attributes propagate right classes.
Summary
20
Summary
• Compared three graph-based SSL algorithms for class-instance acquisition– MAD is consistently most effective, particularly in
dense graphs
20
Summary
• Compared three graph-based SSL algorithms for class-instance acquisition– MAD is consistently most effective, particularly in
dense graphs
• Better class-instance acquisition with additional semantic constraints– easy to add such constraints in graph-based
representation
20
Summary
• Compared three graph-based SSL algorithms for class-instance acquisition– MAD is consistently most effective, particularly in
dense graphs
• Better class-instance acquisition with additional semantic constraints– easy to add such constraints in graph-based
representation
• Datasets available: www.talukdar.net/datasets/class_inst/
– for reproducibility and extension
20
Thank You!http://www.talukdar.net/datasets/class_inst/