
Incorporating Chains of Reasoning over Knowledge Graph for Distantly
Supervised Biomedical Knowledge Acquisition

Qin Dai¹, Naoya Inoue¹,², Paul Reisert², Ryo Takahashi¹ and Kentaro Inui¹,²

¹Tohoku University, Japan  ²RIKEN Center for Advanced Intelligence Project, Japan

{daiqin, naoya-i, preisert, ryo.t, inui}@ecei.tohoku.ac.jp

Abstract

The increased demand for structured scientific knowledge has attracted considerable attention to extracting scientific relations from the ever-growing body of scientific publications. Distant supervision is a widely applied approach to automatically generating large amounts of labelled sentences for scientific Relation Extraction (RE). However, the brevity of the labelled sentences can hinder the performance of distantly supervised RE (DS-RE). Specifically, authors often omit the Background Knowledge (BK) that they assume is well known to readers, but that would be essential for a machine to identify relationships. To address this issue, in this work we assume that the reasoning paths between entity pairs over a knowledge graph can be utilized as BK to fill the “gaps” in text and thus facilitate DS-RE. Experimental results demonstrate the effectiveness of the reasoning paths for DS-RE: the proposed model that incorporates the reasoning paths achieves significant and consistent improvements over a state-of-the-art DS-RE model.

1 Introduction

Scientific Knowledge Graphs (KGs), such as the Unified Medical Language System (UMLS)¹, are extremely important for many scientific Natural Language Processing (NLP) tasks such as Question Answering (QA), Information Retrieval (IR) and Relation Extraction (RE). A scientific KG provides large collections of relations between entities, typically stored as (h, r, t) triplets, where h = head entity, r =

¹ https://www.nlm.nih.gov/research/umls/

relation and t = tail entity, e.g., (acetaminophen, may treat, pain). However, KGs are often highly incomplete (Min et al., 2013). Scientific KGs, like general KGs such as Freebase (Bollacker et al., 2008) and DBpedia (Lehmann et al., 2015), are far from complete, and this impedes their usefulness in real-world applications. On the one hand, scientific KGs face this data sparsity problem; on the other hand, scientific publications have become the largest repository ever of scientific knowledge and continue to grow at an unprecedented rate (Munroe, 2013). Therefore, turning unstructured scientific publications into a well-organized KG is an essential and fundamental task, and it belongs to the task of RE.

One obstacle encountered when building an RE system is the generation of training instances. To cope with this difficulty, (Mintz et al., 2009) proposes distant supervision to automatically generate training samples by leveraging the alignment between KGs and texts. They assume that if two entities are connected by a relation in a KG, then all sentences that contain that entity pair will express the relation. For instance, (ketorolac tromethamine, may treat, pain) is a fact triplet in UMLS. Distant supervision will automatically label all sentences containing this pair, such as Examples 1, 2 and 3, as positive instances for the relation may treat. Although distant supervision can provide a large amount of training data at low cost, it suffers from the wrong labelling problem. For instance, compared to Example 1, Examples 2 and 3 should not be seen as convincing evidence for the may treat relationship between ketorolac tromethamine and pain, but they will still be annotated as positive instances by distant supervision.

33rd Pacific Asia Conference on Language, Information and Computation (PACLIC 33), pages 19-28, Hakodate, Japan, September 13-15, 2019

Copyright © 2019 Qin Dai, Naoya Inoue, Paul Reisert, Ryo Takahashi and Kentaro Inui


(1) The analgesic effectiveness of ketorolac tromethamine was compared with hydrocodone and acetaminophen for pain from an arthroscopically assisted patellar-tendon autograft anterior cruciate ligament reconstruction.

(2) This double-blind, split-mouth, and randomized study was aimed to compare the efficacy of dexamethasone and ketorolac tromethamine, through the evaluation of pain, edema, and limitation of mouth opening.

(3) A loading dose of parenteral ketorolac tromethamine was administered and subjects were later given two staged doses of the same “unknown” drug with pain evaluations conducted after each dose.

To automatically alleviate the wrong labelling problem, (Riedel et al., 2010; Hoffmann et al., 2011) apply multi-instance learning. In order to avoid handcrafted features and errors propagated from NLP tools, (Zeng et al., 2015) proposes a Convolutional Neural Network (CNN) model, which incorporates multi-instance learning into a neural network and achieves significant improvement in distantly supervised RE (DS-RE). Recently, attention mechanisms have been applied to effectively extract features from all collected sentences, rather than from only the most informative one that previous work had focused on. (Lin et al., 2016) proposes a relation-vector-based attention mechanism for DS-RE. (Han et al., 2018) proposes a novel joint model that leverages a KG-based attention mechanism and achieves significant improvement over (Lin et al., 2016).

Although the KG-based model outperforms several state-of-the-art DS-RE models, the brevity of textual information inevitably hinders its performance. Specifically, authors often leave out information that they assume is known to their readers. For instance, Example 2 omits the background connection between ketorolac tromethamine and pain and only implicitly conveys that the former may treat the latter. Human readers can easily make this inference based on their Background Knowledge (BK) about the target entity pair. However, for a machine,

Figure 1: An example of a reasoning path: ketorolac_tromethamine —may_treat→ photophobia —has_nichd_parent→ Sign_or_Symptom ←has_nichd_parent— pain.

it would be extremely difficult to identify the relationship from the given sentence alone, without this important BK.

To address the issue of textual brevity, in this work we assume that the paths (or reasoning paths) between an entity pair over a KG can be applied as BK to fill the “gaps” and thereby improve the performance of DS-RE. For instance, one reasoning path between ketorolac tromethamine and pain over UMLS is shown in Figure 1. By observing the path, we may infer with some likelihood that (ketorolac tromethamine, may treat, pain), because ketorolac tromethamine can be prescribed to treat some Sign or Symptom such as photophobia, and pain is a Sign or Symptom; therefore ketorolac tromethamine might be used to treat pain. By jointly considering the path in Figure 1 and the sentence in Example 2, we can further support this inference. To this end, we propose a DS-RE model that encodes not only the sentences containing target entity pairs, but also the reasoning paths between them over a KG.

We conduct an evaluation on biomedical datasets in which the KG is collected from UMLS and the textual data is extracted from the Medline corpus. The experimental results demonstrate the effectiveness of incorporating reasoning paths for improving DS-RE on biomedical datasets.

2 Related Work

RE is a fundamental task in the NLP community. In recent years, Neural Network (NN)-based models have been the dominant approaches for non-scientific RE, including Convolutional Neural Network (CNN)-based frameworks (Zeng et al., 2014; Xu et al., 2015; Santos et al., 2015) and Recurrent Neural Network (RNN)-based frameworks (Zhang and Wang, 2015; Miwa and Bansal, 2016; Zhou et al., 2016). NN-based approaches are also used in scientific RE. For instance, (Gu et al., 2017) utilizes a CNN-based model for identifying chemical-disease relations from the Medline corpus. (Hahn-



Powell et al., 2016) proposes an LSTM-based model for identifying causal precedence relationships between two event mentions in biomedical papers. (Ammar et al., 2017) applies (Miwa and Bansal, 2016)’s model to scientific RE.

Although remarkably good performance is achieved by the models mentioned above, they still train and extract relations at the sentence level and thus need a large amount of annotated data, which is expensive and time-consuming to obtain. To address this issue, distant supervision was proposed by (Mintz et al., 2009). To alleviate the noisy data from distant supervision, many studies model DS-RE as a Multiple Instance Learning (MIL) problem (Riedel et al., 2010; Hoffmann et al., 2011; Zeng et al., 2015), in which all sentences containing a target entity pair (e.g., ketorolac tromethamine and pain) are seen as a bag to be classified. To make full use of all the sentences in the bag, rather than just the most informative one, researchers apply attention mechanisms in deep NN-based models for DS-RE. (Lin et al., 2016) proposes a relation-vector-based attention mechanism to extract features from the entire bag and outperforms prior approaches. (Du et al., 2018) proposes a multi-level structured self-attention mechanism. (Han et al., 2018) proposes a joint model that adopts a KG-based attention mechanism and achieves significant improvement over (Lin et al., 2016) on DS-RE.

The attention mechanisms in deep NN-based models have achieved significant progress on DS-RE. However, the brevity of input sentences can still negatively affect performance. To address this issue, we assume that the reasoning paths between target entity pairs over a KG can be applied as BK to fill the “gaps” in input sentences and thus improve the performance of DS-RE. (Roller et al., 2015) uses inference patterns learned from UMLS to eliminate potentially related entity pairs from negative training data for DS-RE. (Ji et al., 2017) applies entity descriptions generated from Freebase and Wikipedia as BK, (Lin et al., 2017) utilizes multilingual text as BK, and (Vashishth et al., 2018) uses relation alias information (e.g., founded and co-founded are aliases for the relation founderOfCompany) as BK for DS-RE. However, none of these existing approaches comprehensively considers both the multiple sentences containing entity pairs and the multiple reasoning paths between them for DS-RE.

Figure 2: Overview of the base model (Sentence Encoding Part and KG Encoding Part; KGS = Knowledge Graph Scoring, ATS = Attention Scoring, RC = Relation Classification, C&P = Convolution & Pooling).

3 Base Model

The success of the joint model proposed by (Han et al., 2018) inspires us to choose this strong model as our base model for biomedical DS-RE. The architecture of the base model is illustrated in Figure 2. In this section, we introduce the base model in two main parts: the KG Encoding Part and the Sentence Encoding Part.

3.1 KG Encoding Part

Suppose we have a KG containing a set of fact triplets O = {(e1, r, e2)}, where each fact triplet consists of two entities e1, e2 ∈ E and their relation r ∈ R. Here E and R stand for the sets of entities and relations respectively. The KG Encoding Part then encodes e1, e2 ∈ E and their relation r ∈ R into low-dimensional vectors h, t ∈ R^d and r ∈ R^d respectively, where d is the dimensionality of the embedding space. The base model adopts two Knowledge Graph Completion (KGC) models, Prob-TransE and Prob-TransD, which are based on TransE (Bordes et al., 2013) and TransD (Ji et al., 2015) respectively, to score a given fact triplet. Specifically, given an entity pair (e1, e2), Prob-TransE defines its latent relation embedding r_ht via Equation 1.

    r_ht = t − h    (1)

Prob-TransD is an extension of Prob-TransE and introduces additional mapping vectors h_p, t_p ∈ R^d and r_p ∈ R^d for e1, e2 and r respectively. Prob-TransD encodes the latent relation embedding via Equation 2, where M_rh and M_rt are projection matrices that map entity embeddings into relation spaces.

    r_ht = t_r − h_r,    (2)
    h_r = M_rh h,
    t_r = M_rt t,
    M_rh = r_p h_p^T + I_{d×d},
    M_rt = r_p t_p^T + I_{d×d}

The conditional probability can be formalized over all fact triplets O via Equations 3 and 4, where f_r(e1, e2) is the KG scoring function, which evaluates the plausibility of a given fact triplet. For instance, the score for (aspirin, may treat, pain) would be higher than that for (aspirin, has ingredient, pain), because the former is more plausible than the latter. θ_E and θ_R are the parameters for entities and relations respectively, and b is a bias constant.

    P(r|(e1, e2), θ_E, θ_R) = exp(f_r(e1, e2)) / Σ_{r′∈R} exp(f_{r′}(e1, e2))    (3)

    f_r(e1, e2) = b − ||r_ht − r||    (4)
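As a concrete illustration, Equations 1, 3 and 4 with Prob-TransE can be sketched in a few lines of NumPy. This is a minimal sketch with illustrative embeddings; the function names and the bias value b = 7.0 are our own assumptions, not taken from the paper's implementation.

```python
import numpy as np

def prob_transe_score(h, t, r, b=7.0):
    # Equation 4 with Prob-TransE (Equation 1): f_r(e1, e2) = b - ||(t - h) - r||
    r_ht = t - h
    return b - np.linalg.norm(r_ht - r)

def relation_probs(h, t, relations, b=7.0):
    # Equation 3: softmax of f_r over all candidate relation embeddings r' in R
    scores = np.array([prob_transe_score(h, t, r, b) for r in relations])
    e = np.exp(scores - scores.max())   # subtract the max for numerical stability
    return e / e.sum()
```

Under the translation assumption h + r ≈ t, the relation embedding closest to t − h receives the highest score, and hence the highest probability after the softmax.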

3.2 Sentence Encoding Part

Sentence Representation Learning. Given a sentence s with n words s = {w1, ..., wn} including a target entity pair (e1, e2), a CNN is used to generate a distributed representation s for the sentence. Specifically, the vector representation v_t for each word w_t is calculated via Equation 5, where W^w_emb is a word embedding projection matrix (Mikolov et al., 2013), W^wp_emb is a word position embedding projection matrix, x^w_t is a one-hot word representation and x^wp_t is a one-hot word position representation. The word position describes the relative distance between the current word and the target entity pair (Zeng et al., 2014). For instance, in the sentence “Patients recorded pain_{e2} and aspirin_{e1} consumption in a daily diary”, the relative distance of the word “and” is [1, -1].

    v_t = [v^w_t; v^{wp1}_t; v^{wp2}_t],    (5)
    v^w_t = W^w_emb x^w_t,
    v^{wp1}_t = W^wp_emb x^{wp1}_t,
    v^{wp2}_t = W^wp_emb x^{wp2}_t

The distributed representation s is formulated via Equation 6, where [s]_i and [h_t]_i are the i-th values of s and h_t, M is the dimensionality of s, W is the convolution kernel, b is a bias vector, and k is the convolutional window size.

    [s]_i = max_t {[h_t]_i}, ∀i = 1, ..., M    (6)
    h_t = tanh(W z_t + b),
    z_t = [v_{t−(k−1)/2}; ...; v_{t+(k−1)/2}]
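The position features and the convolution-plus-max-pooling of Equations 5 and 6 can be sketched as follows. This is a simplified NumPy version: the zero padding at sentence boundaries, the window size k = 3, and all variable names are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np

def position_feature(n_words, e1_idx, e2_idx):
    # Relative distance of each word to the two target entities (Zeng et al., 2014).
    return [(t - e1_idx, t - e2_idx) for t in range(n_words)]

def cnn_sentence_rep(V, W, b, k=3):
    # V: (n, dv) per-word vectors v_t (word + two position embeddings, Equation 5);
    # W: (M, k*dv) convolution kernel; b: (M,) bias vector.
    n, dv = V.shape
    pad = k // 2
    Vp = np.vstack([np.zeros((pad, dv)), V, np.zeros((pad, dv))])  # zero-pad ends
    H = [np.tanh(W @ Vp[t:t + k].reshape(-1) + b)   # h_t = tanh(W z_t + b)
         for t in range(n)]
    return np.stack(H).max(axis=0)                  # [s]_i = max_t [h_t]_i  (Eq. 6)
```

For the paper's example sentence with aspirin (e1) at index 4 and pain (e2) at index 2, the word "and" at index 3 gets distances (-1, 1), matching the [1, -1] pair up to the ordering of the two entities.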

KG-based Attention. Suppose for each fact triplet (e1, r, e2) there might be multiple sentences S_r = {s1, ..., sm}, in which each sentence contains the entity pair (e1, e2) and is assumed to imply the relation r; m is the size of S_r. As discussed before, distant supervision inevitably collects noisy sentences, so the base model adopts a KG-based attention mechanism to discriminate the informative sentences from the noisy ones. Specifically, the base model uses the latent relation embedding r_ht from Equation 1 (or Equation 2) as the attention over S_r to generate its final representation s_final. s_final is calculated via Equation 7, where W_s is the weight matrix, b_s is the bias vector, and a_i is the weight for s_i, the distributed representation of the i-th sentence in S_r.

    s_final = Σ_{i=1}^{m} a_i s_i,    (7)
    a_i = exp(⟨r_ht, x_i⟩) / Σ_{k=1}^{m} exp(⟨r_ht, x_k⟩),
    x_i = tanh(W_s s_i + b_s)

Finally, the conditional probability P(r|S_r, θ_S) is formulated via Equations 8 and 9, where θ_S is the parameters of the Sentence Encoding Part, M is the representation matrix of relations, d is a bias vector, o is the output vector containing the prediction probabilities of all target relations for the input sentence set S_r, and n_r is the total number of relations.

    P(r|S_r, θ_S) = exp(o_r) / Σ_{c=1}^{n_r} exp(o_c)    (8)

    o = M s_final + d    (9)
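A minimal NumPy sketch of the KG-based attention in Equation 7; the shapes and names are our own assumptions:

```python
import numpy as np

def kg_attention(S, r_ht, W_s, b_s):
    # S: (m, ds) matrix of sentence representations s_i;
    # r_ht: (d,) latent relation embedding; W_s: (d, ds); b_s: (d,).
    X = np.tanh(S @ W_s.T + b_s)          # x_i = tanh(W_s s_i + b_s)
    logits = X @ r_ht                     # <r_ht, x_i> for each sentence
    a = np.exp(logits - logits.max())
    a = a / a.sum()                       # softmax attention weights a_i
    return a @ S, a                       # s_final = sum_i a_i s_i
```

Sentences whose transformed representation aligns with the latent relation embedding r_ht receive higher weights, so noisy bag members contribute less to s_final.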



3.3 Optimization

The base model defines the optimization function as the log-likelihood of the objective function in Equation 10:

    P(G, D|θ) = P(G|θ_E, θ_R) + P(D|θ_S)    (10)

where G and D are the KG and textual data respectively. The base model applies Stochastic Gradient Descent (SGD) and L2 regularization. In practice, the base model optimizes the KG Encoding Part and Sentence Encoding Part in parallel.

4 Proposed Model

As discussed before, sentences containing the entity pairs of interest tend to omit the BK that the authors assume is known to the readers. However, the omitted BK can be extremely important for a machine to identify the relation between the entity pairs. To fill the “gaps” and improve the efficacy of DS-RE, we assume that the reasoning paths between the entity pairs over a KG can be utilized as BK to compensate for the brevity of the sentences. Motivated by this, we propose a DS-RE model that integrates both reasoning paths and sentences.

4.1 Architecture

The proposed model consists of three parts: the KG Encoding Part, the Sentence Encoding Part and the Path Encoding Part, as shown in Figure 3. The KG Encoding Part and Sentence Encoding Part are identical to those of the base model, except for the final input to the relation classification layer. The Path Encoding Part takes as input a set of reasoning paths, P_r = {p_1, ..., p_m}, between two entities of interest (e1, e2), and encodes them into the final representation of paths, p_final. Specifically, let p = {e1, r_1, e_{r1}, r_2, e_{r2}, ..., r_i, e_{ri}, ..., e2} denote a path between (e1, e2). To express the semantic meaning of a relation in a path, we represent r_i by its component words, rather than treating it as a unit. Therefore, a path will be represented as p = {e1, w^{r1}_1, w^{r1}_2, ..., e_{r1}, w^{r2}_1, w^{r2}_2, ..., e_{r2}, ..., e2}, where w^{r1}_2 denotes the second word of r_1 (e.g., treat in the may_treat relation).

Since a path is represented as a sequence of words, or a special kind of sentence, we apply the similar CNN model used in the Sentence Encoding Part

Figure 3: Overview of the proposed model (Sentence Encoding Part, KG Encoding Part and Path Encoding Part; s_final and p_final are concatenated as the input to relation classification; abbreviations as in Figure 2).

to encode each path into a vector representation p_i. The Path Encoding Part and Sentence Encoding Part share the word embedding projection matrix W^w_emb and the word position projection matrix W^wp_emb in Equation 5, but not the convolutional kernel W and its corresponding bias vector b in Equation 6. To utilize evidence from all the paths between a target entity pair, we also adopt the KG-based attention mechanism applied in the Sentence Encoding Part to calculate the final representation of paths p_final. We calculate p_final via Equation 11, where W_s is the weight matrix, b_s is the bias vector, and a′_i is the weight for p_i, the distributed representation of the i-th path in P_r.

    p_final = Σ_{i=1}^{m} a′_i p_i,    (11)
    a′_i = exp(⟨r_ht, x′_i⟩) / Σ_{k=1}^{m} exp(⟨r_ht, x′_k⟩),
    x′_i = tanh(W_s p_i + b_s)

Finally, we concatenate the resulting representations s_final and p_final for S_r (the set of input sentences) and P_r (the set of reasoning paths) respectively as the input to the relation classification layer. The conditional probability P(r|S_r, P_r, θ_S, θ_P) is formulated via Equations 12 and 13, where θ_P is the parameters of the Path Encoding Part, M is the representation matrix of relations, d is a bias vector, and o is the output vector containing the prediction probabilities of all target relations for both the input sentence set S_r and the input path set P_r; n_r is the total number of relations.

    P(r|S_r, P_r, θ_S, θ_P) = exp(o_r) / Σ_{c=1}^{n_r} exp(o_c)    (12)



Figure 4: Multiple reasoning paths between ketorolac_tromethamine (C0064326) and pain (C0030193), e.g., through photophobia (C0085636) and Sign_or_Symptom (C3540840) via may_treat and has_nichd_parent edges; through atopic_conjunctivitis (C0009766) and promethazine_hcl (C0546876) via may_be_treated_by edges; and through renal_failure (C0035078) and metaxalone (C0163055) via has_contraindicated_drug and may_be_treated_by edges.

    o = M [s_final; p_final] + d    (13)
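The concatenation and classification of Equations 12 and 13 amount to the following sketch (the function name and dimensions are illustrative assumptions):

```python
import numpy as np

def relation_distribution(s_final, p_final, M, d):
    # o = M [s_final; p_final] + d  (Equation 13),
    # then a softmax over the n_r relations (Equation 12).
    o = M @ np.concatenate([s_final, p_final]) + d
    e = np.exp(o - o.max())
    return e / e.sum()
```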

Similar to the base model, we define the optimization function as the log-likelihood of the objective function in Equation 14.

    P(G, D|θ) = P(G|θ_E, θ_R) + P(D|θ_S, θ_P)    (14)

4.2 Reasoning Paths Generation

Let (e1, e2) be an entity pair of interest. The set of reasoning paths P_r is obtained by computing all shortest paths in a KG from e1 to e2. To simulate the situation where the direct relation between a target entity pair is unavailable in a sparse KG, we remove the triplet that directly connects the target entity pair from the KG. Each reasoning path is thus at least a two-hop path, namely p = {e1, r_1, e_{r1}, r_2, e2}. However, if no shortest path is found due to the sparsity of the KG, we use a padding path p = {r_padding} to represent the missing path. Figure 4 shows the generated paths between ketorolac tromethamine and pain.
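The path generation described above can be sketched as a breadth-first search that records every shortest path. This is pure Python; the toy triplets and relation names in the test below are illustrative, and we treat the KG as a directed graph for simplicity, which is our own assumption.

```python
from collections import deque

def reasoning_paths(triplets, e1, e2):
    # Build an adjacency list, dropping the direct (e1, r, e2) triplet to
    # simulate a sparse KG where the direct relation is unavailable.
    adj = {}
    for h, r, t in triplets:
        if h == e1 and t == e2:
            continue
        adj.setdefault(h, []).append((r, t))
    # BFS from e1, recording for each node all shortest-path predecessors.
    dist, parents = {e1: 0}, {e1: []}
    queue = deque([e1])
    while queue:
        u = queue.popleft()
        for r, v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                parents[v] = [(u, r)]
                queue.append(v)
            elif dist[v] == dist[u] + 1:
                parents[v].append((u, r))
    if e2 not in dist:
        return [["r_padding"]]   # padding path when no path exists

    def unwind(v):
        # Expand the predecessor records into explicit entity/relation sequences.
        if v == e1:
            return [[e1]]
        return [p + [r, v] for u, r in parents[v] for p in unwind(u)]

    return unwind(e2)
```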

5 Experiments

Our experiments aim to demonstrate the effectiveness of the proposed model, described in Section 4, for DS-RE on biomedical datasets.

5.1 Data

The biomedical datasets used for evaluation consist of a knowledge graph, textual data and reasoning paths, which are detailed as follows.

Knowledge Graph. We choose UMLS as the KG. UMLS is a large biomedical knowledge base developed at the U.S. National Library of Medicine, containing millions of biomedical concepts and relations between them. We follow (Wang et al., 2014) and only collect fact triplets of the RO relation category (RO stands for “has Relationship Other than synonymous, narrower, or broader”), which covers relations of interest such as may_treat and may_prevent. From the UMLS 2018 release, we extract about 50 thousand such RO fact triplets (i.e., (e1, r, e2)) under the restriction that their entity pairs (i.e., e1 and e2) coexist within a sentence in the Medline corpus. They are then randomly divided into training and testing sets for KGC. Following (Weston et al., 2013), we keep a high entity overlap between the training and testing sets, but zero fact triplet overlap. The statistics of the extracted KG are shown in Table 1.

    #Entity    #Relation    #Train (triplet)    #Test (triplet)
    16,049     295          34,378              12,502

Table 1: Statistics of the KG in this work.

Textual Data. The Medline corpus is a collection of biomedical abstracts maintained by the National Library of Medicine. From the Medline corpus, by applying the UMLS entity recognizer QuickUMLS (Soldaini and Goharian, 2016), we extract 682,093 sentences that contain UMLS entity pairs as our textual data, of which 485,498 sentences are for training and 196,595 sentences for testing. For identifying the NA relation, besides the “related” sentences, we also extract “unrelated” sentences based on a closed world assumption: pairs of entities not listed in the KG are regarded as having the NA relation, and sentences containing them are considered to be “unrelated” sentences. In this way, we extract 1,394,025 “unrelated” sentences for the training data and 598,154 “unrelated” sentences for the testing data. Table 2 presents some sample sentences from the training data.
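The closed-world labeling described above can be sketched as follows; the data structures are illustrative assumptions, not the authors' preprocessing code.

```python
def distant_label(sentence_mentions, kg):
    """sentence_mentions: list of (sentence, e1, e2) triples produced by the
    entity recognizer; kg: dict mapping (e1, e2) -> relation name.
    Pairs absent from the KG receive the NA relation (closed world assumption)."""
    return [(sent, e1, e2, kg.get((e1, e2), "NA"))
            for sent, e1, e2 in sentence_mentions]
```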

Reasoning Path. Following Section 4.2, we extract 197,396 paths for non-NA triplets (139,224 / 58,172 for training / testing) and 679,408 paths for NA triplets (474,263 / 205,145 for training / testing), under the restriction that each entity in a path is observed in the Medline corpus.

5.2 Parameter Settings

We base our work on (Han et al., 2018) and its implementation available at https://github.com/thunlp/JointNRE, and thus adopt an identical optimization process. We use the default settings



Fact Triplet: (insulin, gene_product_plays_role_in_biological_process, energy_expenditure)
  s1: These results indicate that hyperglucagonemia during insulin[e1] deficiency results in an increase in energy expenditure[e2], which may contribute to the catabolic state in many conditions.
  s2: It was hypothesized that the waxy maize treatment would result in a blunted and more sustained glucose and insulin[e1] response, as well as energy expenditure[e2] and appetitive responses.
  s3: ...

Fact Triplet: (IRI, NA, insulin)
  s1: Plasma insulin immunoreactivity (IRI[e1]) results from high molecular weight substances with insulin immunoreactivity (HWIRI), proinsulin (PI) and insulin[e2] (I).
  s2: The beads method demonstrated high IRI[e1] values in both insulin[e2] fractions and the fractions containing serum proteins bigger than 40,000 molecular weight.
  s3: ...

Table 2: Examples of textual data extracted from the Medline corpus.

of parameters² provided by the base model. Since we address DS-RE in the biomedical domain, we use the Medline corpus to train the domain-specific word embedding projection matrix W^w_emb in Equation 5.

5.3 Result and Discussion

We investigate the effectiveness of our proposed model with respect to enhancing DS-RE on biomedical datasets. We follow (Mintz et al., 2009; Weston et al., 2013; Lin et al., 2016; Han et al., 2018) and conduct a held-out evaluation, in which the DS-RE model is evaluated by comparing the fact triplets identified from the textual data (i.e., the set of sentences containing the target entity pairs) with those in the KG. Following the evaluation of previous work, we draw Precision-Recall curves and report the micro average precision (AP) score, which measures the area under the Precision-Recall curve (higher is better), as well as Precision@N (P@N) metrics, which give the percentage of correct triplets among the top N ranked candidates.

Precision-Recall Curves. The Precision-Recall (PR) curves are shown in Figure 5, where “CNN+AVE” denotes a DS-RE model that uses the average vector of the sentences as s_final in Equation 7. “JointE+KATT” (or “JointD+KATT”) denotes a DS-RE model that applies Prob-TransE (or Prob-TransD) as its KG Encoding Part for attention calculation. “(TEXT)” indicates that the model only takes the textual data as input (i.e., the set of sentences containing target entity pairs). “(PATH)” indicates that the DS-RE model only takes the reasoning paths between entity pairs as its input. “(TEXT+PATH)” indicates that the DS-RE model takes both the textual data and the reasoning paths as its input.

² As a preliminary study, we only adopt the default hyperparameters, but we will tune them for our task in the future.

Figure 5: Precision-Recall curves.

The results show that: (1) The proposed model (i.e., “JointE+KATT(TEXT+PATH)”) significantly outperforms the base model (i.e., “JointE+KATT(TEXT)”), proving that reasoning paths are useful BK for biomedical DS-RE. This result inspires us to explore other reasoning strategies, such as reasoning across multiple documents. (2) “JointE+KATT(TEXT+PATH)” achieves better overall performance than “JointE+KATT(PATH)”, demonstrating the mutually complementary relationship between the sentences containing entity pairs and the reasoning paths between them. Specifically, on the one hand, as discussed in Section 1, reasoning paths can provide BK for interpreting relations expressed only implicitly in sentences. On the other hand, due to the sparsity of the KG, it is by no means certain that all entity pairs are connected by plausible reasoning paths in the KG; in that case, the sentences can provide the informative evidence needed to identify the relation between them.



AP and P@N Evaluation. The results in terms of P@1k, P@2k, P@3k, P@4k, P@5k, their mean, and AP are shown in Table 3. From the table, we make observations similar to those from the PR curves: (1) The proposed model (i.e., "JointE+KATT(TEXT+PATH)") significantly outperforms the base model on all measures. (2) "JointE+KATT(TEXT+PATH)" outperforms "JointE+KATT(PATH)" on most of the metrics.

Model                     P@1k   P@2k   P@3k   P@4k   P@5k   Mean   AP
CNN+AVE                   0.852  0.751  0.685  0.640  0.602  0.706  0.098
JointD+KATT(TEXT)         0.628  0.614  0.552  0.495  0.446  0.547  0.186
JointE+KATT(TEXT)         0.835  0.759  0.692  0.629  0.564  0.696  0.272
JointE+KATT(PATH)         0.945  0.911  0.881  0.842  0.796  0.875  0.432
JointE+KATT(TEXT+PATH)    0.941  0.922  0.897  0.865  0.818  0.889  0.496

Table 3: P@N and AP for different RE models, where k = 1000.

Case Study. Table 4 compares the attention distributions of "JointE+KATT(TEXT)" (Base) and "JointE+KATT(TEXT+PATH)" (Proposed). The first and second columns show the attention distribution (the highest and the lowest) over the input sentences. From Table 4, we can see that the proposed model, which incorporates reasoning paths, is more capable of selecting informative sentences than the base model: it "focuses" on the second sentence, which explicitly describes the may treat relation via the word "reduction", whereas the base model "ignores" this informative sentence. Table 5 shows the attention allocated by our proposed model to given reasoning paths. The first path roughly means that if two chemicals should not be used in the case of (i.e., are contraindicated with) drug allergy, they will treat lung tumor. In contrast, the second path roughly means that if two chemicals treat Histiocytoses (an excessive number of cells), they will also treat lung tumor. Apparently the second path, which our proposed model focuses on, is more plausible. This indicates that our proposed model has the capacity to identify plausible reasoning paths.

Base   Proposed   Sentences for (Mitomycin C (MCC), may treat, stomach/gastric tumor)
High   Low        The additive effect in the combination of TNF and Mitomycin C was observed against two Mitomycin C resistant gastric tumors.
Low    High       One-quarter or one-half maximum tolerated doses (MTDs) of 5-FU or MMC resulted in a significant reduction of stomach tumor growth, ...

Table 4: Comparison of attention between the base model and the proposed model, where High (or Low) represents the highest (or lowest) attention.

Attention   Paths for (etoposide, may treat, lung tumor)
Low         etoposide <-[has contraindicated drug]- drug allergy -[has contraindicated drug]-> S-Liposomal Doxorubicin -[may treat]-> lung tumor
High        etoposide <-[may be treated by]- Histiocytoses -[may be treated by]-> Vinblastine -[may treat]-> lung tumor

Table 5: Examples of the attention distribution over reasoning paths from "JointE+KATT(TEXT+PATH)".

6 Conclusion and Future Work

In this work, we tackle the task of DS-RE on biomedical datasets. However, biomedical DS-RE can be negatively affected by the brevity of text. Specifically, authors often omit the BK that would be important for a machine to identify relationships between entities. To address this issue, we assume that reasoning paths over a KG can be utilized as BK to fill the "gaps" in text and thus facilitate DS-RE. Experimental results prove the effectiveness of this combination, because our proposed model achieves significant and consistent improvements as compared with a state-of-the-art DS-RE model. Although reasoning paths over a KG are useful for DS-RE, the sparsity of the KG hinders their effectiveness. Therefore, in the future, besides the reasoning paths over the KG, we will also utilize reasoning paths across multiple documents for our task. For instance, reasoning across Document 1 and Document 2, shown below, would facilitate the identification of the relation between "Aspirin" and "inflammation".
Document 1: "Aspirin and other nonsteroidal anti-inflammatory drugs (NSAID) show ..."
Document 2: "Nonsteroidal anti-inflammatory drugs reduce inflammation by ..."

Acknowledgement

This work was supported by JST CREST Grant Number JPMJCR1513, Japan, and KAKENHI Grant Number 16H06614.



References

Waleed Ammar, Matthew Peters, Chandra Bhagavatula, and Russell Power. 2017. The AI2 system at SemEval-2017 Task 10 (ScienceIE): Semi-supervised end-to-end entity and relation extraction. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 592–596.

Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: A collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 1247–1250. ACM.

Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, pages 2787–2795.

Jinhua Du, Jingguang Han, Andy Way, and Dadong Wan. 2018. Multi-level structured self-attentions for distantly supervised relation extraction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2216–2225.

Jinghang Gu, Fuqing Sun, Longhua Qian, and Guodong Zhou. 2017. Chemical-induced disease relation extraction via convolutional neural network. Database, 2017.

Gus Hahn-Powell, Dane Bell, Marco A. Valenzuela-Escarcega, and Mihai Surdeanu. 2016. This before that: Causal precedence in the biomedical domain. arXiv preprint arXiv:1606.08089.

Xu Han, Zhiyuan Liu, and Maosong Sun. 2018. Neural knowledge acquisition via mutual attention between knowledge graph and text. In Thirty-Second AAAI Conference on Artificial Intelligence.

Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, and Daniel S. Weld. 2011. Knowledge-based weak supervision for information extraction of overlapping relations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1, pages 541–550. Association for Computational Linguistics.

Guoliang Ji, Shizhu He, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Knowledge graph embedding via dynamic mapping matrix. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 687–696.

Guoliang Ji, Kang Liu, Shizhu He, and Jun Zhao. 2017. Distant supervision for relation extraction with sentence-level attention and entity descriptions. In Thirty-First AAAI Conference on Artificial Intelligence.

Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N. Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick Van Kleef, Soren Auer, et al. 2015. DBpedia: A large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web, 6(2):167–195.

Yankai Lin, Shiqi Shen, Zhiyuan Liu, Huanbo Luan, and Maosong Sun. 2016. Neural relation extraction with selective attention over instances. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2124–2133.

Yankai Lin, Zhiyuan Liu, and Maosong Sun. 2017. Neural relation extraction with multi-lingual attention. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 34–43.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Bonan Min, Ralph Grishman, Li Wan, Chang Wang, and David Gondek. 2013. Distant supervision for relation extraction with an incomplete knowledge base. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 777–782.

Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, Volume 2, pages 1003–1011. Association for Computational Linguistics.

Makoto Miwa and Mohit Bansal. 2016. End-to-end relation extraction using LSTMs on sequences and tree structures. arXiv preprint arXiv:1601.00770.

Randall Munroe. 2013. The rise of open access. Science, 342(6154):58–59.

Sebastian Riedel, Limin Yao, and Andrew McCallum. 2010. Modeling relations and their mentions without labeled text. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 148–163. Springer.

R. A. Roller, E. Agirre, A. Sorora, and M. Stevenson. 2015. Improving distant supervision using inference learning. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing. Association for Computational Linguistics.

Cicero Nogueira dos Santos, Bing Xiang, and Bowen Zhou. 2015. Classifying relations by ranking with convolutional neural networks. arXiv preprint arXiv:1504.06580.

Luca Soldaini and Nazli Goharian. 2016. QuickUMLS: A fast, unsupervised approach for medical concept extraction. In MedIR Workshop, SIGIR.

Shikhar Vashishth, Rishabh Joshi, Sai Suman Prayaga, Chiranjib Bhattacharyya, and Partha Talukdar. 2018. RESIDE: Improving distantly-supervised neural relation extraction using side information. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1257–1266.

Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. 2014. Knowledge graph embedding by translating on hyperplanes. In AAAI, volume 14, pages 1112–1119.

Jason Weston, Antoine Bordes, Oksana Yakhnenko, and Nicolas Usunier. 2013. Connecting language and knowledge bases with embedding models for relation extraction. arXiv preprint arXiv:1307.7973.

Kun Xu, Yansong Feng, Songfang Huang, and Dongyan Zhao. 2015. Semantic relation classification via convolutional neural networks with simple negative sampling. arXiv preprint arXiv:1506.07650.

Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, Jun Zhao, et al. 2014. Relation classification via convolutional deep neural network. In COLING, pages 2335–2344.

Daojian Zeng, Kang Liu, Yubo Chen, and Jun Zhao. 2015. Distant supervision for relation extraction via piecewise convolutional neural networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1753–1762.

Dongxu Zhang and Dong Wang. 2015. Relation classification via recurrent neural network. arXiv preprint arXiv:1508.01006.

Peng Zhou, Wei Shi, Jun Tian, Zhenyu Qi, Bingchen Li, Hongwei Hao, and Bo Xu. 2016. Attention-based bidirectional long short-term memory networks for relation classification. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 207–212.
