Statistical Relational Learning for Knowledge Graphs

Sargur N. Srihari

Topics
• Knowledge Graphs (KGs)
• Statistical Relational Learning (SRL) for KGs
• Latent Feature Models
  – RESCAL, ER-MLP, Latent distance
  – Training SRL
• Markov Random Fields from KGs
• References

Statistical Relational Learning (SRL)
• Creation of statistical models for relational data
• Triples are assumed to be incomplete and noisy
• Entities and relation types may contain duplicates

Probabilistic Knowledge Graphs
• Let E = {e_1, e_2, …, e_Ne} be the set of all entities and R = {r_1, r_2, …, r_Nr} the set of all relations in a KG
  – Each possible triple over the entities E and relations R is defined as x_ijk = (e_i, r_k, e_j)
  – Its existence is indicated by a binary random variable y_ijk ∈ {0, 1}
• All possible triples in E × R × E can be grouped in an adjacency tensor (a three-way array) Y ∈ {0,1}^(Ne × Ne × Nr), whose entries are set such that

  y_ijk = 1 if the triple (e_i, r_k, e_j) exists, and y_ijk = 0 otherwise

• Each possible realization of Y can be interpreted as a possible world
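As a concrete illustration, here is a minimal NumPy sketch (the entities, relations, and triples are made up for the example) that builds the adjacency tensor Y from a list of observed triples:

```python
import numpy as np

# Hypothetical toy KG (names made up for the example)
entities = ["LeonardNimoy", "Boston", "USA"]
relations = ["bornIn", "locatedIn"]
e_idx = {e: i for i, e in enumerate(entities)}
r_idx = {r: k for k, r in enumerate(relations)}

triples = [("LeonardNimoy", "bornIn", "Boston"),
           ("Boston", "locatedIn", "USA")]

# Adjacency tensor Y of shape (Ne, Ne, Nr):
# Y[i, j, k] = 1 iff the triple (e_i, r_k, e_j) is observed
Y = np.zeros((len(entities), len(entities), len(relations)), dtype=np.int8)
for s, r, o in triples:
    Y[e_idx[s], e_idx[o], r_idx[r]] = 1

print(Y[:, :, r_idx["bornIn"]])  # frontal slice for the bornIn relation
```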

Tensor Representation
• Binary relational data is represented as a third-order tensor of size N_e × N_e × N_r
[Figure: fibers and slices of a third-order tensor]

Model for a Knowledge Graph
• We are interested in estimating the joint distribution P(Y) from a subset D ⊆ E × R × E × {0,1} of observed triples
• In doing so, we are estimating a probability distribution over possible worlds, which allows us to predict the probability of triples based on the state of the entire knowledge graph

Size of the Adjacency Tensor
• Y can be enormous for large knowledge graphs
  – E.g., in Freebase, with over 40 million entities and 35,000 relations, the number of possible triples in E × R × E exceeds 10^19
• Of course, type constraints reduce this number considerably
• Even among syntactically valid triples, only a tiny fraction are likely to be true
  – For example, there are over 450,000 actors and over 250,000 movies stored in Freebase
  – But each actor stars in only a small number of movies

Statistical Properties of KGs
• KGs typically adhere to some deterministic rules, such as type constraints and transitivity
  – Ex: if Leonard Nimoy was born in Boston, and Boston is located in the USA, then we can infer that Leonard Nimoy was born in the USA
• Also various “softer” statistical patterns or regularities, which are not universally true but nevertheless have useful predictive power
  – Homophily: tendency of entities to be related to other entities with similar characteristics
    • US-born actors are more likely to star in US-made movies

Types of SRL Models
• Triples are correlated with certain other triples
  – i.e., the random variables y_ijk are correlated with each other
• Three main ways to model these correlations:
  1. M1: Latent feature models
     • The y_ijk are conditionally independent given latent features associated with the subject, predicate, and object (SPO), and additional parameters
  2. M2: Graph feature models
     • The y_ijk are conditionally independent given observed graph features and additional parameters
  3. M3: Markov random fields
     • Assume all y_ijk have local interactions

Probability Models M1 and M2
• Model classes M1 and M2 predict the existence of a triple x_ijk via a score function f(x_ijk; θ)
  – which represents the model’s confidence that a triple exists given the parameters θ
• The conditional independence assumptions of M1 and M2 allow the probability model to be written as follows:

  P(Y | D, θ) = ∏_{i,j,k} Ber(y_ijk | σ(f(x_ijk; θ)))

  where σ(·) is the logistic (sigmoid) function and Ber denotes the Bernoulli distribution
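A minimal sketch of this probability model: each y_ijk is treated as a Bernoulli variable whose success probability is the sigmoid of its score (the scores below are placeholder values, not from a trained model):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Placeholder scores f(x_ijk; theta) and observed labels y_ijk
scores = np.array([2.1, -0.7, 0.3])
labels = np.array([1, 0, 1])

probs = sigmoid(scores)  # P(y_ijk = 1 | f)
# Bernoulli log-likelihood of the observed labels
loglik = np.sum(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))
print(probs, loglik)
```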

Latent Feature Models
• The variables y_ijk are conditionally independent given global latent features and parameters:

  P(Y | θ) = ∏_{i,j,k} Ber(y_ijk | σ(f(x_ijk; θ)))

• We discuss forms for the score function f(x_ijk; θ) below
  – All these models explain triples via latent features

Example of Latent Features
• Fact: Alec Guinness received the Academy Award
• Possible explanation: he is a good actor
  – This uses latent features of entities (good actor)
    • “Latent” because they are not directly observed in the data
• Task: infer these features from the data
  – Denote the latent features of entity e_i by the vector e_i ∈ R^He, where H_e is the number of latent features in the model
  – Such a model can capture that:
    • Alec Guinness is a good actor, and
    • the Academy Award is prestigious

Types of Latent Feature Models
• Intuition behind relational latent feature models:
  – Relationships between entities can be derived from interactions of their latent features
• Methods to model these interactions, and to derive the existence of a relationship from them:
  1. Bilinear models
  2. Other tensor factorization models
  3. Matrix factorization methods
  4. Multi-layer perceptrons
     • Based on entities (E-MLP) or on entities and relationships (ER-MLP), where h^a denotes an additive hidden layer

RESCAL
• RESCAL is an embedding method for learning from knowledge graphs
  – For tasks like link prediction and entity resolution
  – Scalable to KGs with millions of entities and billions of facts
  – Provides access to relational information for deep learning methods
• RESCAL = Relational + Scalable

RESCAL (Bilinear Model)
• Explains triples via pairwise interactions of latent features
  – The score of a triple x_ijk is modeled as

    f_ijk^RESCAL := e_iᵀ W_k e_j = Σ_{a=1}^{He} Σ_{b=1}^{He} w_abk e_ia e_jb

  – where W_k ∈ R^(He × He) is a weight matrix whose entries w_abk specify how latent features a and b interact in the k-th relation
• It is bilinear, since interactions between entity vectors are multiplicative
  – Ex: it can model the pattern that good actors receive prestigious awards via a large weight in W_k connecting the subject’s “good actor” feature with the object’s “prestigious award” feature
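A minimal sketch of the RESCAL score for a single triple, with randomly initialized latent features (the dimension H_e = 4 is arbitrary):

```python
import numpy as np

He = 4  # number of latent features per entity (arbitrary here)
rng = np.random.default_rng(0)

e_i = rng.normal(size=He)        # latent features of the subject entity
e_j = rng.normal(size=He)        # latent features of the object entity
W_k = rng.normal(size=(He, He))  # feature interactions for relation k

# Bilinear RESCAL score: f_ijk = e_i^T W_k e_j
f_ijk = e_i @ W_k @ e_j
print(f_ijk)
```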

RESCAL as Tensor Factorization
• The score equation can be written compactly as

  F_k = E W_k Eᵀ

  where F_k ∈ R^(Ne × Ne) is the matrix holding all scores for the k-th relation, and the i-th row of E ∈ R^(Ne × He) holds the latent representation of e_i
• RESCAL is similar to methods used in recommendation systems, and to traditional tensor factorization methods

Multi-Layer Perceptrons
• We can interpret RESCAL as creating composite representations of triples and predicting their existence from this representation
• In particular, we can rewrite RESCAL as

  f_ijk^RESCAL = w_kᵀ φ_ij^RESCAL,   φ_ij^RESCAL := e_j ⊗ e_i

  where w_k = vec(W_k) and ⊗ denotes the Kronecker product
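This equivalence can be checked numerically; note that vec(W_k) stacks the columns of W_k, which corresponds to Fortran-order flattening in NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
He = 4
e_i, e_j = rng.normal(size=He), rng.normal(size=He)
W_k = rng.normal(size=(He, He))

phi_ij = np.kron(e_j, e_i)    # composite representation e_j (x) e_i
w_k = W_k.flatten(order="F")  # vec(W_k): stack the columns of W_k
# Both expressions yield the same RESCAL score
assert np.isclose(w_k @ phi_ij, e_i @ W_k @ e_j)
```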

RESCAL as an MLP
• MLP based on entities (E-MLP):

  f_ijk^E-MLP := w_kᵀ g(h_ijk^a),   h_ijk^a := A_kᵀ [e_i; e_j]

  where g(·) is a nonlinearity applied element-wise
  – One disadvantage of the E-MLP is that it has to define a vector w_k and a matrix A_k for every possible relation, which requires H_a + (H_a × 2H_e) parameters per relation
  – An alternative is to embed the relation itself, using an H_r-dimensional vector r_k
• MLP based on entities and relationships (ER-MLP):

  f_ijk^ER-MLP := wᵀ g(h_ijk^c),   h_ijk^c := Cᵀ [e_i; e_j; r_k]

  – The ER-MLP uses a global weight vector for all relations
  – This model was used in the Knowledge Vault (KV) project since it has many fewer parameters than the E-MLP; the reason is that C is independent of the relation k
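A minimal sketch of both score functions, assuming tanh for the nonlinearity g and random weights (all dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
He, Hr, Ha, Hc = 4, 4, 3, 3  # illustrative dimensions

e_i, e_j = rng.normal(size=He), rng.normal(size=He)
r_k = rng.normal(size=Hr)  # relation embedding (used by the ER-MLP)

# E-MLP: per-relation parameters w_k and A_k
A_k = rng.normal(size=(2 * He, Ha))
w_k = rng.normal(size=Ha)
h_a = A_k.T @ np.concatenate([e_i, e_j])  # additive hidden layer
f_e_mlp = w_k @ np.tanh(h_a)

# ER-MLP: global parameters w and C shared across all relations
C = rng.normal(size=(2 * He + Hr, Hc))
w = rng.normal(size=Hc)
h_c = C.T @ np.concatenate([e_i, e_j, r_k])
f_er_mlp = w @ np.tanh(h_c)
print(f_e_mlp, f_er_mlp)
```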

RESCAL and ER-MLP as Neural Nets
[Figure: RESCAL and ER-MLP visualized as neural networks, with H_e = H_r = 3 and H_a = 3. Note that the inputs are latent features. The symbol g denotes application of the function g(·).]

Embedding of Relations in ER-MLP
• Nearest neighbors of the latent representations of selected relations, computed with a 60-dimensional model on Freebase
  – Numbers represent squared Euclidean distances
• Semantically related relations are near each other
  – The closest relations to the children relation are: parents, spouse, and birth-place

Neural Tensor Networks (NTNs)
• An NTN is a combination of a traditional neural network with a bilinear model
  – The NTN model:

    f_ijk^NTN := w_kᵀ g([h_ijk^a; h_ijk^b]),   h_ijk^b := [e_iᵀ B_k^1 e_j, …, e_iᵀ B_k^Hb e_j]ᵀ

  • Here B_k is a tensor, where the l-th slice B_k^l has size H_e × H_e, and there are H_b slices
  • h_ijk^b is a bilinear hidden layer, since it is derived from a weighted combination of multiplicative terms
  – With more parameters than the E-MLP or RESCAL models, the NTN tends to overfit
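A minimal sketch of the NTN score, again assuming tanh for g and random weights (dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
He, Ha, Hb = 4, 3, 2  # illustrative dimensions

e_i, e_j = rng.normal(size=He), rng.normal(size=He)
A_k = rng.normal(size=(2 * He, Ha))  # additive weights (as in the E-MLP)
B_k = rng.normal(size=(Hb, He, He))  # Hb bilinear slices, each He x He
w_k = rng.normal(size=Ha + Hb)

h_a = A_k.T @ np.concatenate([e_i, e_j])                 # additive layer
h_b = np.array([e_i @ B_k[l] @ e_j for l in range(Hb)])  # bilinear layer
f_ntn = w_k @ np.tanh(np.concatenate([h_a, h_b]))
print(f_ntn)
```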

Latent Distance Models
• Derive the probability of relationships from the distance between latent representations of entities
  – Entities are likely to be related if their representations are near each other
  – Model the probability of x_ij via the score f(e_i, e_j) = -d(e_i, e_j)
1. Structured Embedding (SE) model
   • A_k^s and A_k^o transform the global latent features of the entities to model relationships specifically for the k-th relation:

     f_ijk^SE := -‖A_k^s e_i - A_k^o e_j‖_1

2. TransE model
   – Translates the latent features via a relation-specific offset instead of a matrix multiplication
   • The score of a triple x_ijk is defined as:

     f_ijk^TransE := -d(e_i + r_k, e_j)
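A minimal sketch of both distance-based scores, with random latent vectors; the SE model uses the L1 distance, and the Euclidean distance is assumed for d in TransE:

```python
import numpy as np

rng = np.random.default_rng(0)
He = 4

e_i, e_j = rng.normal(size=He), rng.normal(size=He)
r_k = rng.normal(size=He)        # relation-specific offset (TransE)
A_s = rng.normal(size=(He, He))  # subject transform A_k^s (SE)
A_o = rng.normal(size=(He, He))  # object transform A_k^o (SE)

# Structured Embedding: negative L1 distance of the transformed entities
f_se = -np.sum(np.abs(A_s @ e_i - A_o @ e_j))

# TransE: negative distance after translating by the relation vector
f_transe = -np.linalg.norm(e_i + r_k - e_j)
print(f_se, f_transe)
```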

Summary of Latent Feature Models
• The best model is dataset dependent
  – The ER-MLP model outperformed the NTN model on a particular dataset
  – RESCAL worked best on two link prediction tasks

Graph Feature Models
• The existence of an edge is predicted by extracting features from the observed edges in the graph
  – Consider the existence of the path John –parentOf→ Anne ←parentOf– Mary, representing a common child
  – From this we could predict the triple (John, marriedTo, Mary), as sketched below
• Such models explain triples directly from the observed triples in the KG
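A minimal sketch of extracting this kind of path feature from a set of hypothetical observed triples: it collects pairs of entities that share a parentOf target, which is evidence for marriedTo:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical observed triples
triples = [("John", "parentOf", "Anne"),
           ("Mary", "parentOf", "Anne"),
           ("Mary", "parentOf", "Tom")]

# Group parents by child, then emit each co-parent pair as a feature
parents = defaultdict(set)
for s, r, o in triples:
    if r == "parentOf":
        parents[o].add(s)

common_child = {pair for ps in parents.values() if len(ps) > 1
                for pair in combinations(sorted(ps), 2)}
print(common_child)  # {('John', 'Mary')} -> evidence for marriedTo
```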

Training Statistical Relational Learning Models
• We have a set of N_d observed triples; let the n-th triple be denoted by x_n
• Each observed triple is either true (denoted y_n = 1) or false (denoted y_n = 0)
• The labeled dataset is D = {(x_n, y_n) | n = 1, …, N_d}
• Given this, a natural way to estimate the parameters θ is to compute the maximum a posteriori (MAP) estimate:

  max_θ Σ_{n=1}^{Nd} log Ber(y_n | σ(f(x_n; θ))) + log p(θ | λ)

  where λ controls the strength of the prior

Loss Function
• We can equivalently state this as a regularized loss minimization problem:

  min_θ Σ_{n=1}^{Nd} L(σ(f(x_n; θ)), y_n) + λ reg(θ)

  where L(p, y) = -log Ber(y | p) is the log loss
• Another loss is the squared loss, L(p, y) = (p - y)²
  – Using the squared loss can be especially efficient in combination with a closed-world assumption (CWA)
  – The minimization for RESCAL then becomes:

    min_{E, {W_k}} Σ_k ‖Y_k - E W_k Eᵀ‖²_F + λ₁ ‖E‖²_F + λ₂ Σ_k ‖W_k‖²_F
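RESCAL itself solves this objective with alternating least-squares updates (RESCAL-ALS); purely as an illustration, the sketch below minimizes the same regularized squared-loss objective by plain gradient descent on a random toy tensor (all sizes and hyperparameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
Ne, Nr, He = 20, 3, 4         # toy sizes
lam1, lam2, lr = 0.1, 0.1, 0.01

# Toy adjacency tensor; under the CWA, unobserved entries are 0 (false).
# Frontal slices Y[k] are stored first for convenience.
Y = (rng.random((Nr, Ne, Ne)) < 0.05).astype(float)

E = rng.normal(scale=0.1, size=(Ne, He))
W = rng.normal(scale=0.1, size=(Nr, He, He))

for step in range(200):
    grad_E = 2 * lam1 * E
    for k in range(Nr):
        R = Y[k] - E @ W[k] @ E.T  # residual for relation k
        grad_E += -2 * (R @ E @ W[k].T + R.T @ E @ W[k])
        W[k] -= lr * (-2 * (E.T @ R @ E) + 2 * lam2 * W[k])
    E -= lr * grad_E

loss = sum(np.sum((Y[k] - E @ W[k] @ E.T) ** 2) for k in range(Nr))
print(loss)
```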