Weighted bipartite matching and its application in data ... · Weighted bipartite matching and its...

1
Weighted bipartite matching and its application in data linking Sina Ahmadi [email protected] Supervisors: John McCrae Mihael Arcan Introduction Galway Galway (city) Gaillimh Fairgreen (Galway) Galway, Ireland Galway harbour Galway, Republic of Ireland The weather in Galway Galway city Figure 1: Wikipedia redirected pages to "Galway" Data linking is used to bring together information from different sources in order to create a new richer data set by identifying and combining information from corresponding entities on each of the different source data sets. One of the main applications of data linking is in e-lexicography. In a lexicon, one entry can be linked to one or more entries in another lexicon based on different types of relationship such as synonymy, antonymy or equivalence. In this study, we are interested in different variants of the bipartite matching problem with a new application in linking cross-lingual lexical data based on similarity scores. We are particularly interested in matching labels and textual description of concepts in Wikipedia where different entries may be juxtaposed to one or more other entries with the same content. Such links are known as redirects. Figure 1 shows the redirects to "Galway". Objective Our ongoing research focuses on one of the main challenges in e-lexicography which is to find more efficient matching solutions with respect to complex relationships between two sets of entries. We are aiming at analyzing the extent to which graph matching algorithms can improve cross-lingual data linking and, particularly, redirects linking. Figure 2: Frequency of redirects in Wikipedia (August 2018) Figure 2 illustrates proportional amount of redirection links in Wikipedia. Although there is a vast range of frequencies between 1 and 7073, most of the pages that have been redirected to have a frequency of less than 5. Since the target pages are not interconnected, we can model our problem as matching a bipartite undirected weighted graph in which the two disjoint sets of vertices are fully connected with edges which are weighted based on similarity scores. Methodology There are various methods for matching weighted bipartite graphs. One strategy is to select exhaustively all edges over a specific threshold. On the other hand, maximum-weight matching suggests an optimal solution as a matching where the sum of weights have a maximal value. This method is known as the assignment problem. The Hungarian algorithm is one of the solutions that solves the assignment problem in polynomial time. Although the maximum weight matching is more efficient than the exhaus- tive method, it only yields one-to-one links between the vertices. In more recent studies, the weighted bipartite b-matching (WBM) algo- rithm has been shown to be more efficient in matching vertices based on the capacity of a vertex. The capacity of a vertex, denoted by b is the number of the vertices that can be matched to that vertex. WBM finds a subgraph which maximizes the sum of the edges having every vertex adjacent to at most b edges. Gold-standard GB FM Gaillimh Galway Ireland GBFM Galway Bay FM Galway Great Britain 0.7 0.6 0.9 0.8 0.6 GB FM Bipartite b-matching GB FM Gaillimh Galway Ireland GBFM Galway Bay FM Galway Great Britain Exhaustive GB FM Gaillimh Galway Ireland GBFM Galway Bay FM Galway Great Britain Bipartite matching GB FM Gaillimh Galway Ireland GBFM Galway Bay FM Galway Great Britain GB FM Figure 3: Matching strategies Experiments At this stage of our project, our experiments suggest a considerable improve- ment using the WBM algorithm in terms of precision and recall. Our final results will be reported in an article in the near future. As a part of the ELEXIS project (https://elex.is/), the outcome of our study will be integrated in NAISC, a semi-automated system for link discovery and data linking in lexicography. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 731015.

Transcript of Weighted bipartite matching and its application in data ... · Weighted bipartite matching and its...

Page 1: Weighted bipartite matching and its application in data ... · Weighted bipartite matching and its application in data linking Sina Ahmadi sina.ahmadi@insight-centre.org Supervisors:

Weighted bipartite matching andits application in data linkingSina Ahmadi [email protected]

Supervisors: John McCrae Mihael Arcan

Introduction

Galway

Galway (city)

Gaillimh Fairgreen (Galway)

Galway, Ireland

Galway harbour

Galway, Republic of Ireland

The weather in Galway

Galway city

Figure 1: Page URLs redirected to ”Galway”

1

Figure 1: Wikipedia redirected pages to "Galway"

Data linking is used to bring together information from different sources in order to create a new richerdata set by identifying and combining information from corresponding entities on each of the different sourcedata sets. One of the main applications of data linking is in e-lexicography. In a lexicon, one entry canbe linked to one or more entries in another lexicon based on different types of relationship such as synonymy,antonymy or equivalence.In this study, we are interested in different variants of the bipartite matching problem with a new applicationin linking cross-lingual lexical data based on similarity scores. We are particularly interested in matching labelsand textual description of concepts in Wikipedia where different entries may be juxtaposed to one or moreother entries with the same content. Such links are known as redirects. Figure 1 shows the redirects to"Galway".

ObjectiveOur ongoing research focuses on one of the main challenges in e-lexicographywhich is to find more efficient matching solutions with respect to complexrelationships between two sets of entries. We are aiming at analyzing theextent to which graph matching algorithms can improve cross-lingual datalinking and, particularly, redirects linking.

Figure 2: Frequency of redirects in Wikipedia (August 2018)

Figure 2 illustrates proportional amount of redirection links in Wikipedia.Although there is a vast range of frequencies between 1 and 7073, mostof the pages that have been redirected to have a frequency of less than 5.Since the target pages are not interconnected, we can model our problemas matching a bipartite undirected weighted graph in whichthe two disjoint sets of vertices are fully connected with edges which areweighted based on similarity scores.

MethodologyThere are various methods for matching weighted bipartite graphs. Onestrategy is to select exhaustively all edges over a specific threshold. Onthe other hand, maximum-weight matching suggests an optimal solution asa matching where the sum of weights have a maximal value. This method isknown as the assignment problem. The Hungarian algorithm is oneof the solutions that solves the assignment problem in polynomial time.Although the maximum weight matching is more efficient than the exhaus-tive method, it only yields one-to-one links between the vertices. In morerecent studies, the weighted bipartite b-matching (WBM) algo-rithm has been shown to be more efficient in matching vertices based on thecapacity of a vertex. The capacity of a vertex, denoted by b is the numberof the vertices that can be matched to that vertex. WBM finds a subgraphwhich maximizes the sum of the edges having every vertex adjacent to atmost b edges.

Gold-standard

GB FM

Gaillimh

Galway Ireland

GBFM

Galway Bay FM

Galway

Great Britain

0.7

0.6

0.9

0.8 0.6

GB FM

Gaillimh

Galway Ireland

GBFM

Galway Bay FM

Galway

Great Britain

GB FM

Gaillimh

Galway Ireland

GBFM

Galway Bay FM

Galway

Great Britain

GB FM

Gaillimh

Galway Ireland

GBFM

Galway Bay FM

Galway

Great Britain

3.3 Matching strategies: experiments

Let G1 be a complete weighted graph in which the vertices are unique entities and the edgesare weighted based on the similarity score. The exhaustive methods selects all the edgeswith a weight over a specific threshold.Given G1, the bipartite method states that an optimal solution is the one that has themaximum sum of the weights in a way that there is only one edge between two vertices andeach vertex is of degree one. The result is a bipartite graph G2 for which the vertex set canbe split into two disjoint sets V1 and V2 so that each edge joins a vertex of V1 and a vertexof V2. Alternatively, G2 has the following properties:

• If v 2 V1 then it may only be adjacent to vertices in V2.

• If v 2 V2 then it may only be adjacent to vertices in V1.

• V 1 \ V 2 = ;

9

Bipartite b-matching

GB FM

Gaillimh

Galway Ireland

GBFM

Galway Bay FM

Galway

Great Britain

0.7

0.6

0.9

0.8 0.6

GB FM

Gaillimh

Galway Ireland

GBFM

Galway Bay FM

Galway

Great Britain

GB FM

Gaillimh

Galway Ireland

GBFM

Galway Bay FM

Galway

Great Britain

GB FM

Gaillimh

Galway Ireland

GBFM

Galway Bay FM

Galway

Great Britain

3.3 Matching strategies: experiments

Let G1 be a complete weighted graph in which the vertices are unique entities and the edgesare weighted based on the similarity score. The exhaustive methods selects all the edgeswith a weight over a specific threshold.Given G1, the bipartite method states that an optimal solution is the one that has themaximum sum of the weights in a way that there is only one edge between two vertices andeach vertex is of degree one. The result is a bipartite graph G2 for which the vertex set canbe split into two disjoint sets V1 and V2 so that each edge joins a vertex of V1 and a vertexof V2. Alternatively, G2 has the following properties:

• If v 2 V1 then it may only be adjacent to vertices in V2.

• If v 2 V2 then it may only be adjacent to vertices in V1.

• V 1 \ V 2 = ;

9

Exhaustive

GB FM

Gaillimh

Galway Ireland

GBFM

Galway Bay FM

Galway

Great Britain

0.7

0.6

0.9

0.8 0.6

GB FM

Gaillimh

Galway Ireland

GBFM

Galway Bay FM

Galway

Great Britain

GB FM

Gaillimh

Galway Ireland

GBFM

Galway Bay FM

Galway

Great Britain

GB FM

Gaillimh

Galway Ireland

GBFM

Galway Bay FM

Galway

Great Britain

3.3 Matching strategies: experiments

Let G1 be a complete weighted graph in which the vertices are unique entities and the edgesare weighted based on the similarity score. The exhaustive methods selects all the edgeswith a weight over a specific threshold.Given G1, the bipartite method states that an optimal solution is the one that has themaximum sum of the weights in a way that there is only one edge between two vertices andeach vertex is of degree one. The result is a bipartite graph G2 for which the vertex set canbe split into two disjoint sets V1 and V2 so that each edge joins a vertex of V1 and a vertexof V2. Alternatively, G2 has the following properties:

• If v 2 V1 then it may only be adjacent to vertices in V2.

• If v 2 V2 then it may only be adjacent to vertices in V1.

• V 1 \ V 2 = ;

9

Bipartite matching

GB FM

Gaillimh

Galway Ireland

GBFM

Galway Bay FM

Galway

Great Britain

0.7

0.6

0.9

0.8 0.6

GB FM

Gaillimh

Galway Ireland

GBFM

Galway Bay FM

Galway

Great Britain

GB FM

Gaillimh

Galway Ireland

GBFM

Galway Bay FM

Galway

Great Britain

GB FM

Gaillimh

Galway Ireland

GBFM

Galway Bay FM

Galway

Great Britain

3.3 Matching strategies: experiments

Let G1 be a complete weighted graph in which the vertices are unique entities and the edgesare weighted based on the similarity score. The exhaustive methods selects all the edgeswith a weight over a specific threshold.Given G1, the bipartite method states that an optimal solution is the one that has themaximum sum of the weights in a way that there is only one edge between two vertices andeach vertex is of degree one. The result is a bipartite graph G2 for which the vertex set canbe split into two disjoint sets V1 and V2 so that each edge joins a vertex of V1 and a vertexof V2. Alternatively, G2 has the following properties:

• If v 2 V1 then it may only be adjacent to vertices in V2.

• If v 2 V2 then it may only be adjacent to vertices in V1.

• V 1 \ V 2 = ;

9

Figure 3: Matching strategies

ExperimentsAt this stage of our project, our experiments suggest a considerable improve-ment using the WBM algorithm in terms of precision and recall. Our finalresults will be reported in an article in the near future.As a part of the ELEXIS project (https://elex.is/), the outcome ofour study will be integrated in NAISC, a semi-automated system for linkdiscovery and data linking in lexicography.

This project has received funding from the European Union’s Horizon 2020 researchand innovation programme under grant agreement No 731015.