
Popularity Agnostic Evaluation of Knowledge Graph Embeddings

Aisha Mohamed† Shameem A. Parambath† Zoi Kaoudi‡∗ Ashraf Aboulnaga†

†Qatar Computing Research Institute, HBKU ‡Technische Universität Berlin

∗Work done while at QCRI.

Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI), PMLR volume 124, 2020.

Abstract

In this paper, we show that the distribution of entities and relations in common knowledge graphs is highly skewed, with some entities and relations being much more popular than the rest. We show that while knowledge graph embedding models give state-of-the-art performance in many relational learning tasks such as link prediction, current evaluation metrics like hits@k and mrr are biased towards popular entities and relations. We propose two new evaluation metrics, strat-hits@k and strat-mrr, which are unbiased estimators of the true hits@k and mrr when the items follow a power-law distribution. Our new metrics are generalizations of hits@k and mrr that take into account the popularity of the entities and relations in the data, with a tuning parameter determining how much emphasis the metric places on popular vs. unpopular items. Using our metrics, we run experiments on benchmark datasets to show that the performance of embedding models degrades as the popularity of the entities and relations decreases, and that currently reported results overestimate the performance of these models by magnifying their accuracy on popular items.

1 INTRODUCTION

Recently, a number of large-scale knowledge graphs like DBpedia [1], YAGO [29], WordNet [17], Freebase [3], and Google Knowledge Graph [25] have been created and deployed in many real-world applications including search systems, virtual assistants, and question answering. A knowledge graph is a dataset of general relational facts about entities in the world in the form of (subject, relation, object) triples. Subjects and objects represent the set of real-world entities (nodes) in the graph, and each triple represents an edge of type relation between the subject and the object. While current knowledge graphs contain a huge amount of information with relatively high accuracy, they are still far from complete. Relational learning models reason over the rich semantic structure of knowledge graphs to add missing correct facts to the graph and to detect incorrect facts to remove from it.

Embedding models for knowledge graphs have shown state-of-the-art performance in relational learning tasks such as link prediction (a.k.a. knowledge graph completion) and entity resolution [18, 24]. These models embed the entities and relations in a latent subspace and then use the embeddings to infer unobserved links between the entities. To predict whether a missing triple $(e_i, r_k, e_j)$ is true, a score function $f((e_i, r_k, e_j); \ldots)$ is calculated based on the embeddings of the constituent entities and relation. The score indicates the confidence of the model that the entities $e_i$ and $e_j$ are related by $r_k$ [23]. The main evaluation criterion for these models is: given a triple with a missing subject or object entity, how well does the model rank the correct missing entity compared to other entities in the graph?

A fundamental problem with the current entity ranking evaluation metrics is their bias towards popular entities and relations. There is a popularity bias in knowledge graphs (many facts about a few popular entities and few facts about most of the other, long-tail entities), and the evaluation metrics do not correct for this bias in the data. The popularity bias in knowledge graphs can be attributed to the fact that most of these graphs are automatically constructed in a best-effort way from online sources that suffer from intrinsic biases (e.g., Wikipedia). This bias towards popular entities and relations in the web sources used in knowledge graph construction is reflected and amplified in the created knowledge graphs.


Current evaluation metrics like hits@k and mrr calculate the accuracy of a model as follows:

$$\frac{1}{|\mathit{Test}|} \sum_{(e_i, r_k, e_j) \in \mathit{Test}} \frac{\mathit{metric}(e_i, r_k, ?) + \mathit{metric}(?, r_k, e_j)}{2} \quad (1)$$

That is, for every test triple $(e_i, r_k, e_j)$, how well does the model predict the subject if it is missing and the object if it is missing? This per-triple metric is averaged over the entire test set. In Equation 1, the metric implicitly scales the contribution of each entity or relation to the overall accuracy of the model proportionally to its frequency in the test data (popularity), since the summation ranges over all test triples. Hence, such metrics provide a biased estimation of the accuracy of the model, since they under-emphasize the value of predictions about the unpopular, long-tail entities and relations. Using these biased metrics can lead to the development of models that perform well on popular entities and relations and ignore the rest. This contradicts the principal motivation for using embedding models, which is to propagate information through the structure of the graph to unfamiliar entities and relations and use this information to infer new knowledge that cannot be easily extracted by online text mining or inferred using simple graphical models.
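To make Equation 1 concrete, here is a minimal sketch of the standard micro-averaged computation. The helpers `rank_subject` and `rank_object` are hypothetical stand-ins for a trained model's ranking of the correct entity among all candidates; they are not part of any specific library.

```python
# Minimal sketch of Equation 1: micro-averaged hits@k over a test set.
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)

def hits_at_k(test: List[Triple],
              rank_subject: Callable[[Triple], int],
              rank_object: Callable[[Triple], int],
              k: int = 10) -> float:
    total = 0.0
    for t in test:
        # Predict the object given (subject, relation, ?) and vice versa.
        obj_hit = 1.0 if rank_object(t) <= k else 0.0
        subj_hit = 1.0 if rank_subject(t) <= k else 0.0
        total += (obj_hit + subj_hit) / 2.0
    # Averaging over triples implicitly weights each entity and relation
    # by its frequency in the test set -- the bias discussed above.
    return total / len(test)
```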

Though the problem of long-tail entities and relations is well studied in many machine learning contexts (e.g., minority classes in classification and popularity bias in recommendation systems), the traditional solutions used to de-bias the evaluation metrics, such as micro-averaging, do not directly apply to knowledge graph embeddings. The training instance in knowledge graph embeddings is the triple, which has three components: a subject entity, an object entity, and a relation. In most real knowledge graphs, popular entities do not co-occur with popular relations in the same triple (e.g., it is possible for only the subject to be popular while the object and relation are not popular). Hence, a triple cannot be characterized in its entirety as popular or unpopular. Thus, a simple re-weighting of triples cannot address popularity bias in this context.

Another problem with traditional approaches to de-biasing evaluation metrics is that these approaches quantify the accuracy of a model by one number that shows accuracy under a specific re-weighting (e.g., equal weights for all items). An alternate, and better, approach is to use a series of numbers (together making a trend line) showing the accuracy of the model as the weight is gradually shifted from the most popular items towards the least popular items. The unpopular items in a knowledge graph can be divided into two sub-categories: (1) distant-tail items that occur in so few triples (sometimes one triple) that reliable predictions about them are unexpected, and (2) long-tail items that occur in enough triples that a good model can learn a meaningful embedding for them. Measuring accuracy via a trend line helps discriminate between models that do well on most items and perform poorly only on the distant-tail items, and models that perform well only on the most popular items and perform poorly on the long-tail and distant-tail items, which are the majority of the items in the knowledge graph. A good embedding model should perform well on popular and long-tail items.

Guided by these insights, we propose a new class of evaluation metrics, strat-hits@k and strat-mrr (stratified Hits@k and stratified MRR), and use them to show that state-of-the-art embedding models are biased towards popular entities and relations. These metrics can expose the popularity bias in the embedding models. Our metrics do not assume that popular entities co-occur with popular relations in the same triples, and so can be used with knowledge graphs having any level of correlation between entity and relation popularity. We use our metrics to show that the accuracy of embedding models decreases on triples representing facts about unpopular entities or relations. This can be attributed to the training procedure of these models, since it iterates over all the triples once every epoch regardless of the popularity bias in the data; popular items get more focus during training. Hence, the popularity bias in the input knowledge graphs makes the inference of new facts that include unpopular entities and relations more challenging, even though these facts are more valuable for enriching the graph compared to facts about popular items. Note that, while prior work has criticized the benchmark datasets used for training and evaluation of knowledge graph embeddings [24, 30], no work has investigated the impact of popularity bias on the evaluation metrics.

The contribution of this paper lies in highlighting a common but previously unexplored problem that affects knowledge graph embedding models and presenting a novel metric to quantify the effect of this problem on accuracy. We introduce the knowledge graph embedding models that we use in Section 2. In Section 3, we empirically show that in real knowledge graphs: (1) there is a selectivity bias towards popular entities and relations, and (2) popular entities and relations are not correlated (i.e., they do not necessarily co-occur in the same triple). We then show that current evaluation metrics are mathematically biased towards popular entities and relations and propose our new evaluation metrics to correct this bias in Section 4. We present an experimental evaluation in Section 5, showing that current evaluation of embedding models over-estimates their accuracy; these models are accurate only on popular entities and relations. We discuss related work in Section 6 and conclude in Section 7.


2 PRELIMINARIES: KNOWLEDGE GRAPH EMBEDDING MODELS

Let $E = \{e_1, e_2, \ldots, e_N\}$ be the set of entities and $R = \{r_1, r_2, \ldots, r_K\}$ be the set of all relation types in a knowledge graph. A triple is of the form $(e_i, r_k, e_j)$, where $e_i \in E$ is the subject, $e_j \in E$ is the object, and $r_k \in R$ is the relation between them. Let $T = E \times R \times E$ be the set of all possible triples. A knowledge graph $G \subseteq T$ is a subset of all possible triples with $N = |E| \geq 2$ entities, $K = |R| \geq 1$ relations, and $M = |G|$ triples. $G$ can be represented by a third-order binary tensor $Y \in \{0, 1\}^{N \times N \times K}$, where the random variable $y_{ijk} = 1$ iff $(e_i, r_k, e_j) \in G$, and $i$, $j$, and $k$ are the positions of $e_i$, $e_j$, and $r_k$ in the tensor, respectively.
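As a small illustration of this notation, the sketch below (with toy triples of our own invention) builds the binary tensor $Y$ from a list of triples:

```python
# Sketch: representing a knowledge graph G as a binary tensor Y (N x N x K).
# Entities and relations are mapped to integer indices; Y[i, j, k] = 1 iff
# (e_i, r_k, e_j) is a triple in G. The triples here are toy examples.
import numpy as np

triples = [("london", "capital_of", "uk"), ("paris", "capital_of", "france")]
entities = sorted({s for s, _, _ in triples} | {o for _, _, o in triples})
relations = sorted({p for _, p, _ in triples})
e_idx = {e: i for i, e in enumerate(entities)}
r_idx = {r: k for k, r in enumerate(relations)}

Y = np.zeros((len(entities), len(entities), len(relations)), dtype=np.int8)
for s, p, o in triples:
    Y[e_idx[s], e_idx[o], r_idx[p]] = 1
```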

Given a knowledge graph $G$, link prediction is the problem of finding the probability that any triple $t \in T$ exists in the graph. By setting a threshold on the probability, one can determine whether a triple is true or not (and hence assign it a label from $\{-1, 1\}$). Link prediction is equivalent to estimating the joint probability distribution $P(Y)$ of the correctness of all triples in $T$ given the set of labeled observed triples $D \subseteq E \times R \times E \times \{-1, 1\}$. We denote by $D^+ \subseteq D$ the set of positively labeled triples (true triples) and by $D^- \subseteq D$ the set of negatively labeled triples (false triples). It holds that $D = D^+ \cup D^-$. The positive triples in $D^+$ are a subset of the triples in $G$. Negative triples are constructed by different heuristics, as we discuss later.

Knowledge graph embedding models learn an embedding of all entities and relations in the graph in a low-dimensional space. These models predict the existence of a triple $t = (e_i, r_k, e_j)$ via a scoring function $f(t; \Theta)$, which represents the model's confidence that $t$ exists given the model parameters $\Theta$. $\Theta$ consists of the learned latent representations of the entities $e_i$, $e_j$ and relation $r_k$ of $t$. We denote these representations by $\mathbf{e}_i \in \mathbb{R}^l$, $\mathbf{e}_j \in \mathbb{R}^l$, and $\mathbf{r}_k \in \mathbb{R}^l$, respectively, where $l \in \mathbb{N}$ is the size of the model. Embedding models aim to represent the semantics of each entity and relation in its latent representation and to use this embedding to correctly predict the scores of true and false triples. Below, we outline the scoring functions of the four popular embedding models that we consider in this paper.

TransE: TransE [4] is a translation-based model inspired by the Word2vec algorithm [16]. It represents a relation as a translation operation on the entities and uses a scoring function that measures the distance of the two entities with respect to the relation of the triple: $f^{\text{TransE}}_t = -d(\mathbf{e}_i + \mathbf{r}_k, \mathbf{e}_j)$, where $d(x, y)$ can be any distance measure, e.g., the L1 or L2 norm.

DistMult: DistMult [2] can be seen as a more compact and less expressive variant of RESCAL, the first factorization-based embedding model [20]. It adds a diagonality constraint on the relation matrix and can thus model only symmetric relations. Its scoring function is: $f^{\text{DistMult}}_t = \mathbf{e}_i^T \operatorname{diag}(\mathbf{r}_k)\,\mathbf{e}_j$.

HolE: Leveraging the idea of associative memory,HolE [19] uses a circular correlation operation betweenthe two entities’ vectors in its scoring function: fHolEt =

rTk (ei?ej) where (ei?ej)k =l∑t=1

eitej((k+t−2 mod l)+1).

ComplEx: ComplEx [31] extends DistMult by using complex numbers and the Hermitian dot product. Its scoring function is: $f^{\text{ComplEx}}_t = \operatorname{Re}(\mathbf{e}_i^T \operatorname{diag}(\mathbf{r}_k)\,\mathbf{e}_j)$, where $\operatorname{Re}(\cdot)$ returns the real part of a complex number. ComplEx has been shown to be equivalent to HolE.
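To make the four scoring functions concrete, here is a minimal numpy sketch (illustrative only, not the implementation used in the experiments). HolE's circular correlation is computed via the standard FFT identity, and ComplEx is written in its usual formulation, which conjugates the object embedding in the Hermitian product:

```python
# Sketch of the four scoring functions on plain numpy vectors of size l.
# e_i, e_j, r_k are the embeddings (complex-valued for ComplEx).
import numpy as np

def score_transe(e_i, r_k, e_j, norm_ord=1):
    # Translation distance: -d(e_i + r_k, e_j).
    return -float(np.linalg.norm(e_i + r_k - e_j, ord=norm_ord))

def score_distmult(e_i, r_k, e_j):
    # Trilinear product: e_i^T diag(r_k) e_j.
    return float(np.sum(e_i * r_k * e_j))

def score_hole(e_i, r_k, e_j):
    # Circular correlation via FFT: corr(a, b) = ifft(conj(fft(a)) * fft(b)).
    corr = np.fft.ifft(np.conj(np.fft.fft(e_i)) * np.fft.fft(e_j)).real
    return float(r_k @ corr)

def score_complex(e_i, r_k, e_j):
    # Hermitian product; the standard formulation conjugates the object side.
    return float(np.real(np.sum(e_i * r_k * np.conj(e_j))))
```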

3 POPULARITY BIAS IN KNOWLEDGE GRAPHS

Most of the popular knowledge graphs (e.g., DBpedia, Wikidata, and YAGO) are automatically constructed from web sources like Wikipedia or by mining text documents using information extractors contributed by developers [1, 29]. These web sources are subject to popularity bias. This is attributed to the fact that popular knowledge is readily available in many online sources and is more often completed by users of these platforms, as compared to unpopular or rare knowledge. Hence, facts about popular entities are easier to extract with high accuracy compared to the same types of facts about unpopular entities. Moreover, simple facts that describe simple relations between entities, such as the type relation between each entity and its class, are available for more entities compared to the more complex relations that describe complicated and meaningful interactions between entities in the graph. For example, though Nollywood is the second biggest movie industry in the world in terms of the number of movies produced, second only to Bollywood, Wikipedia contains information (both partial and full) on only 83 Nollywood movies, whereas full information on thousands of Hollywood movies can be found in Wikipedia. Such data bias is inherited and can be detected in almost all the commonly used knowledge graphs. As argued by [6], the problem is amplified by the use of crowdsourcing in knowledge graph construction.


Table 1: Statistics of Benchmark Datasets

Dataset     Number of Triples (train/validation/test)   All Popular   All Unpopular   Mix of Popular and Unpopular
FB15K       483K / 50K / 59K                             70K           45K             477K
WN18        141.4K / 5K / 5K                             7.6K          26.6K           117K
YAGO3-10    1.079M / 5K / 5K                             7K            120K            962K

[Figure 1: Frequency Distribution of Entities and Relations in the Triples of Benchmark Datasets. Six histogram panels, each plotting frequency (in thousands) per item: (a) FB15K entities, (b) WN18 entities, (c) YAGO3-10 entities, (d) FB15K relations, (e) WN18 relations, (f) YAGO3-10 relations.]

3.1 POPULARITY BIAS IN BENCHMARK DATASETS

Researchers evaluate embedding models on small benchmark datasets that are constructed from the most frequently occurring entities and relations in their original knowledge graphs rather than on the full knowledge graphs. This is because representing huge real-world knowledge graphs that contain hundreds of millions of entities and tens of thousands of relations with an embedding dimensionality that can capture all the information in such graphs is still computationally infeasible on commonly available machines [21]. Three of the most popular benchmark datasets are FB15K [4], extracted from Freebase, WN18 [4], extracted from WordNet, and YAGO3-10 [7], extracted from YAGO. Each of these datasets is divided into subsets for training, validation, and testing. The statistics of these datasets are given in Table 1.

Figure 1 presents histogram plots of the frequency of entities and relations for these three benchmark datasets. While these datasets were constructed from the most frequently occurring entities and relations, they clearly still exhibit a skewed, power-law distribution even within these most frequent entities and relations. The frequency distribution of the entities is more skewed than that of the relations, as the number of entities is much larger than the number of relations in all the knowledge graphs.

3.2 ENTITY POPULARITY VS. RELATION POPULARITY

Popularity bias, a.k.a. data imbalance, is a very common problem that has traditionally been tackled in machine learning by one of the following methods: artificially re-sampling the dataset, adjusting the cost associated with errors to be biased towards minority classes, or applying feature selection to reduce the data imbalance [14]. Applying these methods to knowledge graphs is challenging due to the structural differences between tabular and graph data. The training instance for knowledge graph embedding models is the triple. Simply applying these methods assumes that each triple can be characterized as either popular or unpopular. However, this is not the case, since the popularity of the subject and object entities and the relation in the same triple are not necessarily correlated.

To illustrate this on the benchmark datasets, Table 1 splits the triples in the datasets based on the popularity of their subject, object, and relation. We assume a (90/10)% split where the most popular 10% of entities and relations are considered popular and the rest are considered unpopular. The third column of the table shows the number of triples in which the subject, object, and relation are all popular. The fourth column shows the number of triples in which the subject, object, and relation are all unpopular. The last column of the table shows the number of triples containing a mix of popular and unpopular subject, object, and/or relation. The bulk of the triples in all the datasets falls in this last column. This demonstrates that identifying popular or unpopular subjects, objects, and relations is more meaningful than identifying popular or unpopular triples. We tried different thresholds for popularity (e.g., 20% and 25%) and the same phenomenon persists. Another measure of this phenomenon is the correlation between the popularity of the subjects and relations in the triples, and between the popularity of the objects and relations. These correlations are all close to zero. The highest in absolute value is between objects and relations in YAGO3-10, at −0.3.
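The split in Table 1 can be reproduced with a short sketch along these lines (the 10% threshold and the bucket names mirror the table; the code is our illustration, not the authors' released script):

```python
# Sketch of the Table 1 split: the most frequent 10% of entities and of
# relations are "popular"; each triple is then all-popular, all-unpopular,
# or a mix, depending on its subject, object, and relation.
from collections import Counter

def popularity_split(triples, top_fraction=0.10):
    ent_counts, rel_counts = Counter(), Counter()
    for s, p, o in triples:
        ent_counts[s] += 1
        ent_counts[o] += 1
        rel_counts[p] += 1

    def top_items(counts):
        n = max(1, int(top_fraction * len(counts)))
        return {x for x, _ in counts.most_common(n)}

    pop_e, pop_r = top_items(ent_counts), top_items(rel_counts)
    buckets = Counter()
    for s, p, o in triples:
        flags = (s in pop_e, p in pop_r, o in pop_e)
        if all(flags):
            buckets["all popular"] += 1
        elif not any(flags):
            buckets["all unpopular"] += 1
        else:
            buckets["mixed"] += 1
    return buckets
```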

4 POPULARITY STRATIFIED EVALUATION METRICS

The popularity bias in knowledge graphs can affect the training process of knowledge graph embedding models and can lead to models that perform well only on popular entities or relations. Therefore, the evaluation process should take the popularity bias into account and use an evaluation metric that considers both popular and unpopular items. In this section, we define the stratified hits@k and stratified mrr metrics, which generalize the popular hits@k and mrr metrics by introducing popularity weights on entities and relations.

A common approach used in the literature to reduce the popularity bias in recommendation systems is to re-weight the recommendations based on the popularity of the item [27]. In knowledge graphs, as popularity bias can occur in both entities and relations, we need to re-weight both entities and relations. We assume that the probability that a triple $(e_i, r_k, e_j)$ appears in a knowledge graph depends on the popularity of the constituent entities and relation. Furthermore, we assume that the probabilities of entities and relations occurring in the graph are independent of each other. We define the popularity of an entity or relation $x$ as the number of times it appears in the graph, either as a subject, an object, or a relation. Let $\tilde{N}(x)$ be the popularity of $x$ in the complete knowledge graph. In practical settings, $\tilde{N}(x)$ cannot be computed due to the many missing edges in the observed knowledge graph. The empirical probability of observing a triple containing $x$ is $p(x) = N(x)/\tilde{N}(x)$, where $N(x)$ is the number of times $x$ appears in the observed knowledge graph.

From the empirical evidence provided in Figure 1, we can see that the entities and relations follow a power-law distribution. For power-law distributed variables, we can say that $p(x) \propto [\tilde{N}(x)]^{\alpha}$, where $\alpha \in \mathbb{R}^+$ is the tail index. A smaller value of $\alpha$ indicates that the distribution contains more extreme values with a very heavy tail (many unpopular items), and a larger value of $\alpha$ indicates a weak tail with few unpopular items. The probability of entity or relation $x$ being observed is:

$$
\begin{aligned}
p(x) &= c\,[\tilde{N}(x)]^{\alpha} && \text{(scale invariance of the power law)} \\
     &= c\,\left[\frac{N(x)}{p(x)}\right]^{\alpha} && \text{(using the definition of } p\text{)} \\
[p(x)]^{\alpha+1} &= c\,[N(x)]^{\alpha} && \text{(re-arranging the terms)} \\
p(x) &\propto [N(x)]^{\beta} && \text{(scale invariance; } \beta = \alpha/(\alpha+1)\text{)} \quad (2)
\end{aligned}
$$
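To see what Equation 2 implies for the weights used later in this section ($w(x) \propto 1/p(x) \propto [N(x)]^{-\beta}$), here is a tiny sketch with hypothetical counts:

```python
# Sketch: power-law weights w(x) proportional to N(x)^(-beta), normalized
# to sum to 1. The counts below are hypothetical.
import numpy as np

counts = np.array([10000, 500, 20, 3], dtype=float)  # observed N(x)
for beta in (-1.0, 0.0, 1.0):
    w = counts ** (-beta)
    w /= w.sum()
    print(beta, np.round(w, 3))
# beta = -1 weights items by frequency (popular items dominate);
# beta =  0 weights all items equally;
# beta = +1 up-weights rare items by inverse frequency.
```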

The two most common evaluation metrics used in the knowledge graph literature are hits@k and mrr. For a triple $(e_i, r_k, e_j)$,

$$\text{hits@k}(e_i, r_k, e_j) = \frac{\mathbb{1}[e_j \in \text{top-k}(e_i, r_k, *)] + \mathbb{1}[e_i \in \text{top-k}(*, r_k, e_j)]}{2} \quad (3)$$

where $\mathbb{1}$ is the indicator function, $\text{top-k}(e_i, r_k, *)$ is the set of top $k$ ranked entities for the relation $r_k$ with $e_i$ as the subject entity, and $\text{top-k}(*, r_k, e_j)$ is the similar set with $e_j$ as the object entity. The mean reciprocal rank for the same triple is

$$\text{mrr}(e_i, r_k, e_j) = \frac{1}{2}\left(\frac{1}{\text{rank}(e_j \mid e_i, r_k, *)} + \frac{1}{\text{rank}(e_i \mid *, r_k, e_j)}\right) \quad (4)$$

where $\text{rank}(e_j \mid e_i, r_k, *)$ is the rank of $e_j$ in the set of ranked predictions for the relation $r_k$ with $e_i$ as the subject entity, and conversely for $\text{rank}(e_i \mid *, r_k, e_j)$.

For a given relation $r$, we define hits$_r$@k and mrr$_r$, which give equal weight to all pairs of entities, as:

$$\text{hits}_r\text{@k} = \frac{1}{|E(r)|} \sum_{(e_i, e_j) \in E(r)} \text{hits@k}(e_i, r, e_j), \qquad \text{mrr}_r = \frac{1}{|E(r)|} \sum_{(e_i, e_j) \in E(r)} \text{mrr}(e_i, r, e_j) \quad (5)$$

where $E(r)$ is the set of corresponding entity pairs that occur with $r$ in the test data. The overall hits@k or mrr score for the dataset is defined as:

$$\text{metric} = \frac{\sum_{r \in R} \text{metric}_r \times |E(r)|}{\sum_{r \in R} |E(r)|} \quad (6)$$

where metric is hits@k or mrr.

As pointed out earlier, a common and well-established approach for de-biasing the popularity in the estimated quantity is to re-weight the predictions [11, 27]. Modifying Equation 5 to take into account the weights of the entities, we define the stratified metrics strat-hits@k$_r$ and strat-mrr$_r$ for relation $r$ as follows:

$$\text{strat-metric}_r = \frac{1}{|E(r)|} \sum_{(e_i, e_j) \in E(r)} \text{strat-metric}(e_i, r, e_j) \quad (7)$$

where

$$\text{strat-hits@k}(e_i, r_k, e_j) = \frac{w_{e_j}\,\mathbb{1}[e_j \in \text{top-k}(e_i, r_k, *)] + w_{e_i}\,\mathbb{1}[e_i \in \text{top-k}(*, r_k, e_j)]}{w_{e_i} + w_{e_j}} \quad (8)$$

and

$$\text{strat-mrr}(e_i, r_k, e_j) = \frac{1}{w_{e_i} + w_{e_j}}\left(\frac{w_{e_j}}{\text{rank}(e_j \mid e_i, r_k, *)} + \frac{w_{e_i}}{\text{rank}(e_i \mid *, r_k, e_j)}\right) \quad (9)$$

where $w_{e_i}$ and $w_{e_j}$ are the weights of entities $e_i$ and $e_j$, respectively.

Finally, we define strat-hits and strat-mrr for the entire dataset as:

$$\text{strat-metric} = a \sum_{r} w_r \times \text{strat-metric}_r \quad (10)$$

where $w_r$ is the weight of relation $r$ and $a$ is a normalization factor. Following [26], we assume $w_r \propto s(r)$ and $w_{e_i} \propto s(e_i)$, where $s(x) = \frac{1}{p(x)} \propto \frac{1}{[N(x)]^{\beta}}$, and the weights are normalized such that the final strat-metric score lies in $[0, 1]$. Note that there are separate $\beta$ parameters for entities and relations, $\beta_e$ and $\beta_r$, respectively. With the normalization constant, our final equation takes the form:

$$\text{strat-metric} = \frac{\sum_{r \in R} w_r \times \text{strat-metric}_r}{\sum_{r \in R} w_r} \quad (11)$$
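Putting Equations 7–11 together, the following is a minimal sketch of strat-hits@k. The counts come from the observed graph; `ranks` is an illustrative stand-in that returns the model's (filtered) ranks of the true object and subject for a test triple:

```python
# Sketch of strat-hits@k (Equations 7-11). ent_count and rel_count map
# items to their observed frequencies N(x); ranks(t) is an illustrative
# stand-in returning (object_rank, subject_rank) for test triple t.
from collections import defaultdict

def weight(count, beta):
    # w(x) proportional to 1 / N(x)^beta; beta = -1 recovers frequency
    # weighting, beta = 0 gives uniform weights.
    return count ** (-beta)

def strat_hits_at_k(test, ent_count, rel_count, ranks, k=10,
                    beta_e=0.0, beta_r=0.0):
    per_rel = defaultdict(list)
    for (s, p, o) in test:
        obj_rank, subj_rank = ranks((s, p, o))
        w_s = weight(ent_count[s], beta_e)
        w_o = weight(ent_count[o], beta_e)
        # Equation 8: weighted per-triple hit.
        hit = (w_o * (obj_rank <= k) + w_s * (subj_rank <= k)) / (w_s + w_o)
        per_rel[p].append(hit)
    num = den = 0.0
    for p, hits in per_rel.items():
        w_r = weight(rel_count[p], beta_r)
        num += w_r * sum(hits) / len(hits)   # Equations 7 and 11
        den += w_r
    return num / den
```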

We note that hits@k applied to a single triple is ex-actly the same as the recall metric used in informa-tion retrieval. Following the expected utility maximiza-tion based definition of recall [22], one can see that

strat-recall [26] and strat-hits are proportionalto the true positive rate and thus equivalent. Sincestrat-recall on the observed data provides a nearlyunbiased estimate of the true recall [26], we can arguethat strat-hits@k is an unbiased estimate of hits@k.

Macro-averaging vs. Micro-averaging: In the classification literature, the macro-average evaluation metric gives equal weight to each class when computing the classification accuracy, whereas the micro-average gives equal weight to each instance or classification decision. Thus, the micro-average measures the classifier's overall performance on the most frequently occurring classes, while the macro-average gives a sense of the model's performance by giving equal weight to each class. The commonly used hits@k and mrr metrics can be viewed as micro-average metrics of the performance of the model with respect to relations and are dominated by the most popular relations. A macro-average would average hits$_r$@k or mrr$_r$ over all the relations, giving equal weight to the relations regardless of their popularity. A large difference between the two metrics indicates a large popularity bias.
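The difference is easy to see on toy numbers (hypothetical, for illustration only):

```python
# Sketch: micro- vs. macro-averaging of per-relation scores. `scores` maps
# each relation to its hits_r@k; `n_pairs` maps it to |E(r)|.
def micro_average(scores, n_pairs):
    total = sum(n_pairs.values())
    return sum(scores[r] * n_pairs[r] for r in scores) / total

def macro_average(scores):
    return sum(scores.values()) / len(scores)

scores = {"bornIn": 0.90, "composedBy": 0.40}   # hypothetical
n_pairs = {"bornIn": 9000, "composedBy": 100}
print(micro_average(scores, n_pairs))  # ~0.894, dominated by the popular relation
print(macro_average(scores))           # 0.65, equal weight per relation
```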

Effect of the Hyperparameters $\beta_e$ and $\beta_r$: The hyperparameter $\beta$ we have defined in strat-hits@k controls the weighting factor for entities and relations. When $\beta = -1$, the weight of each item is proportional to the frequency of that item; thus, the metric is biased towards popular items. As we increase $\beta$, we shift the focus more towards unpopular items. For $\beta \in [-1, 0)$, the weight of each item is still proportional to its frequency, but higher $\beta$ values downgrade all the weights and hence decrease the variance between popular and unpopular items. A value of $\beta = 0$ gives equal weight to all items. For $\beta \in (0, 1]$, the weight of each item is inversely proportional to its frequency, and higher $\beta$ values increase the variance between popular and unpopular weights. Plotting the curve of strat-hits@k vs. $\beta$ over an equally spaced range of values of $\beta$ gives valuable insights into the performance of a model. For example, models that perform well on popular and moderately unpopular items will be differentiated from models that perform well only on very popular items.
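Reusing the `strat_hits_at_k` sketch above, producing such a trend line takes only a few lines:

```python
# Usage sketch: trend line of strat-hits@10 as beta_e sweeps from the
# popular end (-1) to the unpopular end (+1), with beta_r fixed at 0.
# strat_hits_at_k and the data arguments are as in the earlier sketch.
def beta_trend(test, ent_count, rel_count, ranks,
               betas=(-1.0, -0.5, 0.0, 0.5, 1.0)):
    return [strat_hits_at_k(test, ent_count, rel_count, ranks, k=10,
                            beta_e=b, beta_r=0.0) for b in betas]
# A curve that falls steeply as beta_e grows indicates a model whose
# accuracy comes mostly from popular entities.
```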

A very useful feature of these evaluation metrics is that they can be used to focus on popularity bias in either entities or relations by fixing one of the $\beta$ values at 0 and varying the other one. To evaluate the susceptibility of a model to popularity bias in, say, entities, we fix $\beta_r$ at 0 and vary $\beta_e$ from $-1$ to $1$.

Notice that some specific values of $\beta$ result in commonly used biased performance metrics, as stated in the following two lemmas.


Lemma 1 For $\beta_r = -1$ and $\beta_e = 0$, strat-hits@k calculates the commonly used micro-averaged hits@k, and strat-mrr calculates the commonly used micro-averaged mrr.

Lemma 2 For $\beta_e = \beta_r = 0$, strat-hits@k calculates the macro-averaged hits@k, and strat-mrr calculates the macro-averaged mrr.

5 EXPERIMENTAL EVALUATION

In this section, we use our strat-hits@k and strat-mrr metrics to experimentally evaluate the impact of popularity bias on the performance of state-of-the-art knowledge graph embedding models.

Datasets: We use two popular link prediction benchmarks: FB15K and YAGO3-10 (Table 1). FB15K is a subset of Freebase [3], a knowledge graph that contains general information about the world, and YAGO3-10 is a subset of YAGO3 [15], a knowledge graph constructed from Wikipedia. YAGO3-10 was constructed from entities that have at least 10 relations each. We choose these two datasets because they are big enough and contain enough relations and entities to adequately evaluate the effect of popularity bias on model accuracy.

Experimental Setup: We evaluate the four embedding models described in Section 2: TransE, DistMult, HolE, and ComplEx. These embedding models include translational and factorization-based models and have shown competitive performance on most relational learning tasks. We use the implementation provided by AmpliGraph [5], an open-source Python library that uses TensorFlow to implement knowledge graph embeddings, the loss functions, and the optimizers commonly used to train them. We minimize the regularized logistic loss as described in [24], which has been shown to improve results significantly compared to the pair-wise ranking loss [13, 31]. We use the Adam optimizer as it requires substantially fewer iterations to converge compared to Adagrad [10]. We performed a grid search over the hyperparameters to maximize the filtered mean reciprocal rank (MRR) on the validation data. The grid search was done over the following ranges: embedding size $d \in \{150, 200, 300, 350, 400\}$, batch size $\in \{50, 64, 100\}$, learning rate $\in \{0.001, 0.0005, 0.0001, 0.00005\}$, and number of negative examples $\eta \in \{10, 20, 30, 50\}$. We trained all models for up to 2000 epochs.
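The search itself is a plain exhaustive loop over this grid; a hedged sketch follows, in which `train_and_score` is a placeholder for training a model with given hyperparameters and returning its validation filtered MRR, not a specific AmpliGraph API:

```python
# Hedged sketch of the hyperparameter grid search described above.
import itertools

GRID = {
    "k": [150, 200, 300, 350, 400],           # embedding size d
    "batch_size": [50, 64, 100],
    "lr": [0.001, 0.0005, 0.0001, 0.00005],
    "eta": [10, 20, 30, 50],                   # negatives per positive
}

def grid_search(train_and_score, grid=GRID):
    # train_and_score(params) -> validation filtered MRR (placeholder).
    best_params, best_mrr = None, -1.0
    for values in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        mrr = train_and_score(params)
        if mrr > best_mrr:
            best_params, best_mrr = params, mrr
    return best_params, best_mrr
```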

Generation of Negative Examples: We use the standard approach for generating negative examples proposed in [4]. Let $D^+$ be the set of positive training examples in the dataset. For each triple $(s, p, o)$ in $D^+$, we generate $\eta$ negative triples by replacing the subject $s$ or the object $o$ with a random entity and filter the correct triples from these generated examples. Thus, the set of negative examples $D^-$ contains the triples:

$$\{(s', p, o) \mid s' \in E \wedge s' \neq s \wedge (s, p, o) \in D^+\} \;\cup\; \{(s, p, o') \mid o' \in E \wedge o' \neq o \wedge (s, p, o) \in D^+\}$$

The training set is $D = D^+ \cup D^-$. Following the link prediction literature, we randomly sample a different set of negative examples of size $\eta \times N$ from $D^-$ in each epoch during training.
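A minimal sketch of this corruption procedure (our illustration of the standard protocol from [4], including the filtering of accidentally true triples):

```python
# Sketch of corruption-based negative sampling: replace the subject or the
# object with a random entity, filtering out triples that are actually true.
import random

def sample_negatives(triple, all_entities, known_true, eta=10):
    s, p, o = triple
    negatives = []
    while len(negatives) < eta:
        corrupt = random.choice(all_entities)
        cand = (corrupt, p, o) if random.random() < 0.5 else (s, p, corrupt)
        if cand != triple and cand not in known_true:
            negatives.append(cand)
    return negatives
```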

Evaluation Protocol: Following previous work on knowledge graph embedding models, we use the link prediction task for evaluation. For each true triple $(e_i, r_k, e_j)$ in the test dataset, we replace $e_i$ and $e_j$ with each entity $e \in E$ and score both the true and the corrupted triples using the model. Then we sort the triples by their scores, with higher scores ranked first. We compute the stratified mrr and stratified hits@k for $k \in \{1, 3, 10\}$ according to the equations described in Section 4. To show the effect of popularity bias on the accuracy of the model, we report the stratified metrics for gradually increasing values of $\beta \in [-1.0, 1.0]$ for both the entities and the relations.
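For one side of a test triple, the (filtered) rank computation looks roughly as follows; `score` stands for the model's scoring function and `known_true` for the set of all true triples used for filtering, and the subject side is computed analogously:

```python
# Sketch of the ranking protocol: score the true triple against all object
# corruptions and take the filtered rank (corruptions that are themselves
# true triples are skipped).
def object_rank(triple, all_entities, score, known_true):
    s, p, o = triple
    true_score = score((s, p, o))
    rank = 1
    for e in all_entities:
        cand = (s, p, e)
        if e != o and cand not in known_true and score(cand) > true_score:
            rank += 1
    return rank
```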

Results: Figures 2 and 3 show the strat-hits@k of the four knowledge graph embedding models on FB15K and YAGO3-10, respectively, while varying the $\beta$ hyperparameters that control the bias towards popular entities and relations. In sub-figures (a)–(c), we fix the weight of all relations to 1 by setting $\beta_r = 0$. We then plot strat-hits@k while varying $\beta_e$. The leftmost point in each figure, with $\beta_e = -1$, focuses on popular entities. As we move towards the right and increase $\beta_e$, we shift the focus of the metric towards less popular entities. We observe from the plots that the performance of all embedding models degrades as $\beta_e$ increases. This means that knowledge graph embedding models derive their accuracy from doing well on popular entities while ignoring less popular ones. However, the less popular entities are often the more interesting ones in a link prediction task. Our strat-hits@k metric exposes this deficiency and provides a simple way to decide how much to focus on popular entities vs. less popular ones.

In sub-figures (d)–(f), we fix the weight of all entities to 1 by setting $\beta_e = 0$. We then plot strat-hits@k while varying $\beta_r$. Recall from Lemma 1 that the leftmost point on each figure, with $\beta_r = -1$, corresponds to the micro-averaged hits@k, which is the metric used in all knowledge graph embedding papers. The accuracy at this point is derived from performance on the popular relations.

[Figure 2: strat-hits@k of Knowledge Graph Embedding Models on FB15K. Six panels plot strat-hits@10, strat-hits@3, and strat-hits@1 for TransE, DistMult, HolE, and ComplEx: panels (a)–(c) sweep βe ∈ [−1, 1] with βr = 0; panels (d)–(f) sweep βr ∈ [−1, 1] with βe = 0.]

As we increase $\beta_r$, all knowledge graph embedding models except ComplEx on FB15K suffer a degradation in performance; ComplEx performs better on unpopular relations. As before, we see that these state-of-the-art knowledge graph embedding models derive their accuracy on FB15K from doing well only on popular items. On YAGO3-10, the performance of the models degrades as the focus on popularity decreases, then improves again. This means that the accuracy of the models is high on the very popular and the very unpopular relations and lower in the middle. This is rarely the desired performance of knowledge graph embedding models, since the most valuable predictions are the ones in the middle. However, we suspect that this behavior is specific to YAGO3-10 because it contains a small number of relations: YAGO3-10 contains only 37 relations, hardly representative of large real-world knowledge graphs. With more relations, the trend would likely be similar to FB15K.

Figure 4 shows the strat-mrr on the same two datasets. Sub-figures (a) and (c) fix $\beta_r = 0$ and vary $\beta_e$ on FB15K and YAGO3-10, respectively. Sub-figures (b) and (d) fix $\beta_e = 0$ and vary $\beta_r$ on FB15K and YAGO3-10, respectively. The results show the same trends as the strat-hits@k figures, which confirms the popularity bias in knowledge graph embeddings and the ability of our metrics to consistently capture it.

This clear bias in embedding models can be explained by their training procedure. During training, the popular entities and relations have more background and context information available in the training data and get more focus during optimization. Also, the popular entities and relations occur in more triples and get updated more frequently. Hence, the model infers new facts about them with higher accuracy, while unpopular entities and relations get overlooked.

6 RELATED WORK

Recently, multiple works have discussed the inherent biases in knowledge graphs and their impact on relational learning tasks. Janowicz et al. [9] argue that cultural, geographical, and statistical biases in knowledge graphs can arise from the source of the data, its ontology, or the reasoning and rule learning systems used to enrich the graphs. Demartini [6] argues that the collaborative crowdsourcing methods adopted in knowledge graph creation introduce the implicit biases of the contributors into the knowledge graph. Stepanova et al. [28] state that in rule mining systems, the popularity bias might lead to the extraction of erroneous biased rules. Li et al. [12] propose new evidence-collecting techniques to improve knowledge verification for long-tail entities that are not supported by enough information in the graph. Guo et al. [8] argue that triple-level learning in knowledge graph embeddings gives limited attention to long-tail entities, and thus their embeddings suffer from low expressiveness.


[Figure 3: strat-hits@k of Knowledge Graph Embedding Models on YAGO3-10. Six panels plot strat-hits@10, strat-hits@3, and strat-hits@1 for TransE, DistMult, HolE, and ComplEx: panels (a)–(c) sweep βe ∈ [−1, 1] with βr = 0; panels (d)–(f) sweep βr ∈ [−1, 1] with βe = 0.]

[Figure 4: strat-mrr of Knowledge Graph Embedding Models on FB15K and YAGO3-10, for TransE, DistMult, HolE, and ComplEx. Panels (a) FB15K and (c) YAGO3-10 fix βr = 0 and vary βe; panels (b) FB15K and (d) YAGO3-10 fix βe = 0 and vary βr.]

They propose using path-level embeddings to improve performance on entity alignment and knowledge graph completion, especially on long-tail entities, but they do not quantify this improvement. Our work is the first to discuss bias in the evaluation metrics. Our proposed metrics offer unbiased estimates of accuracy and encourage the development of unbiased embedding models.

7 CONCLUSION

In this paper, we study the effect of popularity bias on the performance of knowledge graph embedding models. We observe that knowledge graphs have a skewed popularity distribution for entities and relations, and that the popularity of entities and relations is not necessarily correlated. We note that this popularity bias can have a detrimental effect on the training of knowledge graph embedding models and is not captured by current evaluation metrics. We propose the stratified Hits@k and stratified MRR metrics, which evaluate the accuracy of models on both popular and unpopular items and have tuning parameters $\beta$ that control the focus on popular vs. unpopular items. Using these metrics, we demonstrate that recent knowledge graph embedding models are indeed biased towards popular items, and we quantify this bias. Thus, we provide useful evaluation metrics that subsume the currently used ones and enable a better understanding of the accuracy of embedding models on the long tail of the frequency distribution. Our future work is to use our novel evaluation metrics as a starting point for developing knowledge graph embedding models that do not suffer from popularity bias.


References

[1] Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. DBpedia: A nucleus for a web of open data. In The Semantic Web, pages 722–735, 2007.

[2] Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. Embedding entities and relations for learning and inference in knowledge bases. In ICLR, 2015.

[3] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. Freebase: A collaboratively created graph database for structuring human knowledge. In SIGMOD, pages 1247–1250, 2008.

[4] Antoine Bordes, Nicolas Usunier, Alberto García-Durán, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In NeurIPS, pages 2787–2795, 2013.

[5] Luca Costabello, Sumit Pai, Chan Le Van, Rory McGrath, Nick McCarthy, and Pedro Tabacof. AmpliGraph: A library for representation learning on knowledge graphs, 2019.

[6] Gianluca Demartini. Implicit bias in crowdsourced knowledge graphs. In WWW, pages 624–630, 2019.

[7] Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. Convolutional 2D knowledge graph embeddings. In AAAI, 2018.

[8] Lingbing Guo, Zequn Sun, and Wei Hu. Learning to exploit long-term relational dependencies in knowledge graphs. In ICML, pages 2505–2514, 2019.

[9] Krzysztof Janowicz, Bo Yan, Blake Regalia, Rui Zhu, and Gengchen Mai. Debiasing knowledge graphs: Why female presidents are not like female popes. In ISWC, 2018.

[10] Rudolf Kadlec, Ondřej Bajgar, and Jan Kleindienst. Knowledge base completion: Baselines strike back. In Proc. Workshop on Representation Learning for NLP (RepL4NLP), pages 69–74, 2017.

[11] Jae Kwang Kim and Jay J. Kim. Nonresponse weighting adjustment using estimated response probability. Canadian Journal of Statistics, 35:501–514, 2007.

[12] Furong Li, Xin Luna Dong, Anno Langen, and Yang Li. Knowledge verification for long-tail verticals. PVLDB, 10:1370–1381, 2017.

[13] Hanxiao Liu, Yuexin Wu, and Yiming Yang. Analogical inference for multi-relational embeddings. In ICML, pages 2168–2178, 2017.

[14] Rushi Longadge and Snehlata Dongre. Class imbalance problem in data mining: Review. International Journal of Computer Science and Network, 2, 2013.

[15] Farzaneh Mahdisoltani, Joanna Biega, and Fabian M. Suchanek. YAGO3: A knowledge base from multilingual Wikipedias. In CIDR, 2015.

[16] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In NeurIPS, pages 3111–3119, 2013.

[17] George A. Miller. WordNet: A lexical database for English. Comm. ACM, 38:39–41, 1995.

[18] Maximilian Nickel, Kevin Murphy, Volker Tresp, and Evgeniy Gabrilovich. A review of relational machine learning for knowledge graphs. Proc. IEEE, 2016.

[19] Maximilian Nickel, Lorenzo Rosasco, and Tomaso A. Poggio. Holographic embeddings of knowledge graphs. In AAAI, 2016.

[20] Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. A three-way model for collective learning on multi-relational data. In ICML, 2011.

[21] Maximilian Nickel and Douwe Kiela. Poincaré embeddings for learning hierarchical representations. In NeurIPS, pages 6338–6347, 2017.

[22] Shameem P. Parambath, Nicolas Usunier, and Yves Grandvalet. Optimizing F-measures by cost-sensitive classification. In NeurIPS, pages 2123–2131, 2014.

[23] Heiko Paulheim. Knowledge graph refinement: A survey of approaches and evaluation methods. Semantic Web, 8:489–508, 2017.

[24] Jay Pujara, Eriq Augustine, and Lise Getoor. Sparsity and noise: Where knowledge graph embeddings fall short. In EMNLP, 2017.

[25] Amit Singhal. Introducing the Knowledge Graph: Things, not strings, May 2012.

[26] Harald Steck. Training and testing of recommender systems on data missing not at random. In SIGKDD, pages 713–722, 2010.

[27] Harald Steck. Item popularity and recommendation accuracy. In RecSys, pages 125–132, 2011.

[28] Daria Stepanova, Mohamed H. Gad-Elrab, and Vinh Thinh Ho. Rule induction and reasoning over knowledge graphs. In Reasoning Web International Summer School, pages 142–172, 2018.

[29] Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. YAGO: A core of semantic knowledge. In WWW, pages 697–706, 2007.

[30] Kristina Toutanova and Danqi Chen. Observed vs. latent features for knowledge base and text inference. In Proc. Workshop on Continuous Vector Space Models and their Compositionality, 2015.

[31] Théo Trouillon and Maximilian Nickel. Complex and holographic embeddings of knowledge graphs: A comparison. arXiv preprint arXiv:1707.01475, 2017.