Learning with Noise: Enhance Distantly Supervised Relation Extraction with … · 2017. 4. 24. ·...

Learning with Noise: Enhance Distantly Supervised Relation Extraction with Dynamic Transition Matrix

Bingfeng Luo1, Yansong Feng∗1, Zheng Wang2, Zhanxing Zhu3, Songfang Huang4, Rui Yan1 and Dongyan Zhao1

1ICST, Peking University, China
2School of Computing and Communications, Lancaster University, UK
3Peking University, China
4IBM China Research Lab, China

{bf luo,fengyansong,zhanxing.zhu,ruiyan,zhaody}@[email protected]@cn.ibm.com

Abstract

Distant supervision significantly reduces human efforts in building training data for many classification tasks. While promising, this technique often introduces noise to the generated training data, which can severely affect the model performance. In this paper, we take a deep look at the application of distant supervision in relation extraction. We show that the dynamic transition matrix can effectively characterize the noise in the training data built by distant supervision. The transition matrix can be effectively trained using a novel curriculum learning based method without any direct supervision about the noise. We thoroughly evaluate our approach under a wide range of extraction scenarios. Experimental results show that our approach consistently improves the extraction results and outperforms the state-of-the-art in various evaluation scenarios.

1 Introduction

Distant supervision (DS) is rapidly emerging as a viable means for supporting various classification tasks, from relation extraction (Mintz et al., 2009) and sentiment classification (Go et al., 2009) to cross-lingual semantic analysis (Fang and Cohn, 2016). By using knowledge learned from seed examples to label data, DS automatically prepares large scale training data for these tasks.

While promising, DS does not guarantee perfect results and often introduces noise to the generated data. In the context of relation extraction, DS works by considering sentences containing both the subject and object of a triple as its supports. However, the generated data are not always perfect. For instance, DS could match a knowledge base (KB) triple in false positive contexts like Donald Trump worked in New York City. Prior works (Takamatsu et al., 2012; Ritter et al., 2013) show that DS often mistakenly labels real positive instances as negative (false negative) or vice versa (false positive), and there could be confusions among positive labels as well. Such noise can severely affect training and lead to poorly-performing models.

Tackling the noisy data problem of DS is non-trivial, since explicit supervision to capture the noise is usually lacking. Previous works have tried to remove sentences containing unreliable syntactic patterns (Takamatsu et al., 2012), design new models to capture certain types of noise, or aggregate multiple predictions under the at-least-one assumption that at least one of the aligned sentences supports the triple in KB (Riedel et al., 2010; Surdeanu et al., 2012; Ritter et al., 2013; Min et al., 2013). These approaches represent a substantial leap forward towards making DS more practical. However, they are either tightly coupled to certain types of noise or have to rely on manual rules to filter noise, and are thus unable to scale. Recent breakthroughs in neural networks provide a new way to reduce the influence of incorrectly labeled data by aggregating multiple training instances attentively for relation classification, without explicitly characterizing the inherent noise (Lin et al., 2016; Zeng et al., 2015). Although promising, modeling noise within neural network architectures is still in its early stage and much remains to be done.

In this paper, we aim to enhance DS noise modeling by providing the capability to explicitly characterize the noise in the DS-style training data within neural network architectures. We show that while noise is inevitable, it is possible to characterize the noise pattern in a unified framework along with its original classification objective. Our key insight is that the DS-style training data typically contain useful clues about the noise pattern. For example, we can infer that since some people work in their birthplaces, DS could wrongly label a training sentence describing a working place as a born-in relation. Our novel approach to noise modeling is to use a dynamically-generated transition matrix for each training instance to (1) characterize the possibility that the DS labeled relation is confused and (2) indicate its noise pattern. To tackle the challenge of no direct guidance over the noise pattern, we employ a curriculum learning based training method to gradually model the noise pattern over time, and utilize trace regularization to control the behavior of the transition matrix during training. Our approach is flexible: while it does not make any assumptions about the data quality, the algorithm can make effective use of prior knowledge about data quality to guide the learning procedure when such clues are available.

We apply our method to the relation extraction task and evaluate under various scenarios on two benchmark datasets. Experimental results show that our approach consistently improves both extraction settings, outperforming the state-of-the-art models in different settings.

Our work offers an effective way of tackling the noisy data problem of DS, making DS more practical at scale. Our main contributions are to (1) design a dynamic transition matrix structure to characterize the noise introduced by DS, and (2) design a curriculum learning based framework to adaptively guide the training procedure to learn with noise.

2 Problem Definition

[Figure: sentences pass through an encoder to produce embeddings; a prediction branch yields the predicted distribution, a noise modeling branch yields the transition matrix, and a transformation step combines them into the observed distribution.]

Figure 1: Overview of our approach

The task of distantly supervised relation extraction is to extract knowledge triples, ⟨subj, rel, obj⟩, from free text with the training data constructed by aligning existing KB triples with a large corpus. Specifically, given a triple in KB, DS works by first retrieving all the sentences containing both subj and obj of the triple, and then constructing the training data by considering these sentences as support to the existence of the triple. This task can be conducted at both the sentence and the bag levels. The former takes a sentence s containing both subj and obj as input, and outputs the relation expressed by the sentence between subj and obj. The latter setting alleviates the noisy data problem by using the at-least-one assumption that at least one of the retrieved sentences containing both subj and obj supports the triple. It takes a bag of sentences S as input, where each sentence s ∈ S contains both subj and obj, and outputs the relation between subj and obj expressed by this bag.
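The alignment step described in this section can be illustrated with a minimal sketch. It assumes plain substring matching for entity mentions, and the function name `build_bags`, the toy corpus, and the example triple are all hypothetical illustrations, not the authors' pipeline.

```python
from collections import defaultdict

def build_bags(kb_triples, sentences):
    """Group sentences into bags keyed by (subj, obj) and label each bag by DS.

    kb_triples: iterable of (subj, rel, obj) strings.
    sentences:  iterable of raw sentence strings.
    Entity matching is plain substring search, a simplifying assumption.
    """
    bags = defaultdict(list)
    labels = {}
    for subj, rel, obj in kb_triples:
        for sent in sentences:
            if subj in sent and obj in sent:
                bags[(subj, obj)].append(sent)
                labels[(subj, obj)] = rel
    return dict(bags), labels

# Hypothetical toy data for illustration only.
kb = [("Donald Trump", "born-in", "New York")]
corpus = [
    "Donald Trump was born in New York.",
    "Donald Trump worked in New York City.",  # matched by DS, yet does not express born-in
    "Barack Obama visited Chicago.",
]
bags, labels = build_bags(kb, corpus)
```

In the sentence-level setting each matched sentence would be a separate training instance carrying the bag's label; in the bag-level setting the whole list is one instance.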

3 Our approach

In order to deal with the noisy training data obtained through DS, our approach follows four steps as depicted in Figure 1. First, each input sentence is fed to a sentence encoder to generate an embedding vector. Our model then takes the sentence embeddings as input and produces a predicted relation distribution, p, for the input sentence (or the input sentence bag). At the same time, our model dynamically produces a transition matrix, T, which is used to characterize the noise pattern of the sentence (or the bag). Finally, the predicted distribution is multiplied by the transition matrix to produce the observed relation distribution, o, which is used to match the noisy relation labels assigned by DS, while the predicted relation distribution p serves as the output of our model during testing. One of the key challenges of our approach is determining the element values of the transition matrix, which will be described in Section 4.

3.1 Sentence-level Modeling

Sentence Embedding and Prediction. In this work, we use a piecewise convolutional neural network (Zeng et al., 2015) for sentence encoding, but other sentence embedding models can also be used. We feed the sentence embedding to a fully connected layer, and use softmax to generate the predicted relation distribution, p.

Noise Modeling. First, each sentence embedding x, generated by the sentence encoder, is passed to a fully connected layer as a non-linearity to obtain the sentence embedding $x_n$ used specifically for noise modeling. We then use softmax to calculate the transition matrix T for each sentence:

$$ T_{ij} = \frac{\exp(w_{ij}^{\top} x_n + b)}{\sum_{j=1}^{|C|} \exp(w_{ij}^{\top} x_n + b)} \tag{1} $$

where $T_{ij}$ is the conditional probability for the input sentence to be labeled as relation j by DS, given i as the true relation, b is a scalar bias, |C| is the number of relations, and $w_{ij}$ is the weight vector characterizing the confusion between i and j.
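A minimal NumPy sketch of this row-wise softmax may help make the shapes concrete. The function name, the packing of all weight vectors into one array `W` of shape (|C|, |C|, d), and the random inputs are assumptions made for illustration; the paper does not specify an implementation.

```python
import numpy as np

def sentence_transition_matrix(x_n, W, b):
    """Row-wise softmax over the scores w_ij . x_n + b.

    x_n: (d,) noise-modeling embedding of one sentence.
    W:   (C, C, d) array where W[i, j] plays the role of w_ij (assumed layout).
    b:   scalar bias.
    Returns T of shape (C, C), where T[i, j] approximates
    P(DS label = j | true relation = i).
    """
    logits = W @ x_n + b                                  # shape (C, C)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    expd = np.exp(logits)
    return expd / expd.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
C, d = 4, 8
T = sentence_transition_matrix(rng.normal(size=d), rng.normal(size=(C, C, d)), 0.0)
```

Because the weights are shared and only `x_n` changes, each sentence receives its own matrix without adding per-sentence parameters.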

Here, we dynamically produce a transition matrix, T, specifically for each sentence, but with the parameters ($w_{ij}$) shared across the dataset. By doing so, we are able to adaptively characterize the noise pattern for each sentence with only a few parameters. In contrast, one could also produce a global transition matrix for all sentences, with much less computation, where one need not compute T on the fly (see Section 6.1).

Observed Distribution. When we characterize the noise in a sentence with a transition matrix T, if its true relation is i, we can assume that i might be erroneously labeled as relation j by DS with probability $T_{ij}$. We can therefore capture the observed relation distribution, o, by multiplying T and the predicted relation distribution, p:

$$ o = T^{\top} \cdot p \tag{2} $$

where o is then normalized to ensure $\sum_i o_i = 1$.

Rather than using the predicted distribution p to directly match the relation labeled by DS (Zeng et al., 2015; Lin et al., 2016), here we utilize o to match the noisy labels during training and still use p as output during testing, which actually captures the procedure of how the noisy label is produced and thus protects p from the noise.
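As a small worked example of Equation 2, the sketch below multiplies a hand-picked transition matrix by a predicted distribution. The numbers and the function name are illustrative assumptions only.

```python
import numpy as np

def observed_distribution(T, p):
    """Compute o = T^T p and renormalize so that the entries sum to 1."""
    o = T.T @ p
    return o / o.sum()

# Illustrative values: 3 relations, rows of T sum to 1.
p = np.array([0.7, 0.2, 0.1])
T = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.8, 0.0],
              [0.0, 0.3, 0.7]])
o = observed_distribution(T, p)   # approximately [0.67, 0.26, 0.07]
```

When every row of T sums to 1 and p sums to 1, the product already sums to 1, so the normalization here mainly guards against numerical drift.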

3.2 Bag Level Modeling

Bag Embedding and Prediction. One of the key challenges for the bag level model is how to aggregate the embeddings of individual sentences into the bag level. In this work, we experiment with two methods, namely average and attention aggregation (Lin et al., 2016). The former calculates the bag embedding, s, by averaging the embeddings of each sentence, and then feeds it to a softmax classifier for relation classification.

The attention aggregation calculates an attention value, $a_{ij}$, for each sentence i in the bag with respect to each relation j, and aggregates to the bag level as $s_j$, by the following equations¹:

$$ s_j = \sum_{i}^{n} a_{ij} x_i; \qquad a_{ij} = \frac{\exp(x_i^{\top} r_j)}{\sum_{i'}^{n} \exp(x_{i'}^{\top} r_j)} \tag{3} $$

where $x_i$ is the embedding of sentence i, n is the number of sentences in the bag, and $r_j$ is the randomly initialized embedding for relation j. In similar spirit to Lin et al. (2016), the resulting bag embedding $s_j$ is fed to a softmax classifier to predict the probability of relation j for the given bag.
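The dot-product attention of Equation 3 can be sketched in NumPy as follows. The function names and the random inputs are assumptions for illustration; the softmax is taken over the sentences in the bag, separately for each relation.

```python
import numpy as np

def softmax(z, axis):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_bag_embeddings(X, R):
    """Relation-specific bag embeddings via dot-product attention.

    X: (n, d) sentence embeddings x_i of one bag.
    R: (C, d) relation embeddings r_j.
    Returns S of shape (C, d) with S[j] = sum_i a_ij x_i,
    and the attention weights A of shape (n, C).
    """
    A = softmax(X @ R.T, axis=0)   # normalize over sentences i for each relation j
    S = A.T @ X
    return S, A

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))    # a bag of 5 sentences
R = rng.normal(size=(3, 8))    # 3 relations
S, A = attention_bag_embeddings(X, R)
```

With a single sentence in the bag every attention weight is 1, so each relation-specific bag embedding reduces to that sentence's embedding.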

Noise Modeling. Since the transition matrix addresses the transition probability with respect to each true relation, the attention mechanism appears to be a natural fit for calculating the transition matrix at the bag level. Similar to the attention aggregation above, we calculate the bag embedding with respect to each relation using Equation 3, but with a separate set of relation embeddings $r'_j$. We then calculate the transition matrix, T, by:

$$ T_{ij} = \frac{\exp(s_i^{\top} r'_j + b_i)}{\sum_{j=1}^{|C|} \exp(s_i^{\top} r'_j + b_i)} \tag{4} $$

where $s_i$ is the bag embedding regarding relation i, and $r'_j$ is the embedding for relation j.
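A sketch of the bag-level transition matrix of Equation 4, given relation-specific bag embeddings, might look as follows. The function name, array layout, and random inputs are assumptions for illustration.

```python
import numpy as np

def bag_transition_matrix(S, R_prime, b):
    """Row-wise softmax over the scores s_i . r'_j + b_i.

    S:       (C, d) bag embeddings s_i, one per assumed true relation i.
    R_prime: (C, d) separate relation embeddings r'_j.
    b:       (C,) biases b_i, one per row.
    Returns T of shape (C, C) whose rows each sum to 1.
    """
    logits = S @ R_prime.T + b[:, None]                   # logits[i, j]
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
C, d = 3, 6
S = rng.normal(size=(C, d))
R_prime = rng.normal(size=(C, d))
T = bag_transition_matrix(S, R_prime, np.zeros(C))
```

Row i of T is driven by the bag embedding computed for relation i, which is why the attention-style aggregation is described as a natural fit here.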

4 Curriculum Learning based Training

One of the key challenges of this work is how to train and produce the transition matrix to model the noise in the training data without any direct guidance or human involvement. A straightforward solution is to directly align the observed distribution, o, with the noisy labels by minimizing the sum of two terms: CrossEntropy(o) + Regularization. However, doing so does not guarantee that the prediction distribution, p, will match the true relation distribution. The problem is that at the beginning of training we have no prior knowledge about the noise pattern; thus, both T and p are less reliable, making the training procedure likely to get trapped in a poor local optimum. Therefore, we require a technique to guide our model to gradually adapt to the noisy training data, e.g., learning something simple first, and then trying to deal with noise.

¹ While Lin et al. (2016) use a bilinear function to calculate $a_{ij}$, we simply use the dot product since we find these two functions perform similarly in our experiments.

Fortunately, this is exactly what curriculum learning can do. The idea of curriculum learning (Bengio et al., 2009) is simple: start with the easiest aspect of a task and level up the difficulty gradually, which fits our problem well. We thus employ a curriculum learning framework to guide our model to gradually learn how to characterize the noise. Another advantage is that it helps avoid falling into a poor local optimum.

With curriculum learning, our approach provides the flexibility to combine prior knowledge of noise, e.g., splitting a dataset into reliable and less reliable subsets, to improve the effectiveness of the transition matrix and better model the noise.

4.1 Trace Regularization

Before proceeding to training details, we first discuss how we characterize the noise level of the data by controlling the trace of its transition matrix. Intuitively, if the noise is small, the transition matrix T will tend to become an identity matrix, i.e., given a set of annotated training sentences, the observed relations and their true relations are almost identical. Since each row of T sums to 1, the similarity between the transition matrix and the identity matrix can be represented by its trace, trace(T). The larger trace(T) is, the larger the diagonal elements are, and the more similar the transition matrix T is to the identity matrix, indicating a lower level of noise. Therefore, we can characterize the noise pattern by controlling the expected value of trace(T) in the form of regularization. For example, we will expect a larger trace(T) for reliable data, but a smaller trace(T) for less reliable data. Another advantage of employing trace regularization is that it could help reduce the model complexity and avoid overfitting.
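To make the link between the trace and the noise level concrete, the sketch below compares an identity matrix with a diffuse row-stochastic matrix. Dividing the trace by the number of relations is an assumption added here for readability (it maps the value into [0, 1]); the text itself uses trace(T) directly.

```python
import numpy as np

def trace_similarity(T):
    """Trace of a row-stochastic matrix divided by its size.

    Returns 1.0 for the identity matrix (no label confusion) and smaller
    values as probability mass moves off the diagonal.
    """
    return np.trace(T) / T.shape[0]

clean = np.eye(3)
noisy = np.array([[0.6, 0.3, 0.1],
                  [0.2, 0.5, 0.3],
                  [0.3, 0.3, 0.4]])

clean_score = trace_similarity(clean)   # 1.0
noisy_score = trace_similarity(noisy)   # 0.5
```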

4.2 Training

To tackle the challenge of no direct guidance over the noise patterns, we implement a curriculum learning based training method that first trains the model without considering noise. In other words, we first focus on the loss from the prediction distribution p, and then take the noise modeling into account gradually along the training process, i.e., gradually increasing the importance of the loss from the observed distribution o while decreasing the importance of p. In this way, the prediction branch is roughly trained before the model manages to characterize the noise, and thus avoids getting stuck in a poor local optimum. We thus design the training to minimize the following loss function:

$$ L = \sum_{i=1}^{N} -\big((1-\alpha)\log(o_{iy_i}) + \alpha\log(p_{iy_i})\big) - \beta\,\mathrm{trace}(T^{i}) \tag{5} $$

where 0 < α ≤ 1 and β > 0 are two weighting parameters, $y_i$ is the relation assigned by DS for the i-th instance, N is the total number of training instances, $o_{iy_i}$ is the probability that the observed relation for the i-th instance is $y_i$, and $p_{iy_i}$ is the probability to predict relation $y_i$ for the i-th instance.
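The loss of Equation 5 can be sketched over a batch in NumPy as below. The function name, argument layout, and toy values are assumptions for illustration; a real training loop would compute this inside an automatic differentiation framework.

```python
import numpy as np

def curriculum_loss(O, P, Ts, y, alpha, beta):
    """Interpolated negative log-likelihood with a trace term.

    O, P:  (N, C) observed and predicted distributions.
    Ts:    (N, C, C) one transition matrix per instance.
    y:     (N,) integer relation labels assigned by DS.
    alpha: weight on the prediction branch; (1 - alpha) goes to the observed branch.
    beta:  weight of the trace term (subtracted, so a positive beta rewards a large trace).
    """
    idx = np.arange(len(y))
    nll_o = -np.log(O[idx, y])
    nll_p = -np.log(P[idx, y])
    traces = np.trace(Ts, axis1=1, axis2=2)
    return np.sum((1 - alpha) * nll_o + alpha * nll_p - beta * traces)

# Toy batch with one instance and two relations.
P = np.array([[0.5, 0.5]])
O = np.array([[0.25, 0.75]])
Ts = np.eye(2)[None, :, :]
y = np.array([0])

loss_start = curriculum_loss(O, P, Ts, y, alpha=1.0, beta=0.0)   # prediction branch only
loss_later = curriculum_loss(O, P, Ts, y, alpha=0.0, beta=0.1)   # observed branch plus trace
```

With α = 1 the observed branch contributes nothing to the data term, which matches the starting point of the curriculum.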

Initially, we set α = 1, and train our model completely by minimizing the loss from the prediction distribution p. That is, we do not expect to model the noise, but focus on the prediction branch at this time. As the training progresses, the prediction branch gradually learns the basic prediction ability. We then decrease α and β by 0

where $\beta_m$ is the regularization weight for the m-th data subset, M is the total number of subsets, $N_m$ is the number of instances in the m-th subset, and $T^{mi}$, $y_{mi}$ and $o_{mi,y_{mi}}$ are the transition matrix, the relation labeled by DS and the observed probability of this relation for the i-th training instance in the m-th subset, respectively. Note that, different from Equation 5, this loss function does not need to initiate training by minimizing the loss regarding the prediction distribution p, since one can easily start by learning from the most reliable split first.
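The loss function that these symbols belong to does not appear in the text as given. One form consistent with the definitions in this paragraph and with the structure of Equation 5, offered only as a reconstruction under those assumptions and not as the authors' exact formula, drops the prediction-branch term and uses a separate trace weight per subset:

$$ L = \sum_{m=1}^{M} \sum_{i=1}^{N_m} -\log(o_{mi,y_{mi}}) - \beta_m\,\mathrm{trace}(T^{mi}) $$

Under this reading, a positive $\beta_m$ pushes the transition matrices of subset m towards the identity, while a negative $\beta_m$ pushes them away from it.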

We also use trace regularization for the most reliable subset, since some noisy annotations inevitably appear in this split. Specifically, we expect its trace(T) to be large (using a positive β) so that the elements of T will be concentrated on the diagonal and T will be more similar to the identity matrix. As for the less reliable subsets, we expect trace(T) to be small (using a negative β) so that the elements of the transition matrix will be diffuse and T will be less similar to the identity matrix. In other words, the transition matrix is encouraged to characterize the noise.

Note that this loss function only works for sentence level models. For bag level models, since reliable and less reliable sentences are all aggregated into a sentence bag, we cannot determine which bag is reliable and which is not. However, bag level models can still build a curriculum by changing the content of a bag, e.g., keeping reliable sentences in the bag first, then gradually adding less reliable ones, and training with Equation 5, which could benefit from the prior knowledge of data quality as well.

5 Evaluation Methodology

Our experiments aim to answer two main questions: (1) is it possible to model the noise in the training data generated through DS, even when there is no prior knowledge to guide us? and (2) can prior knowledge of data quality help our approach better handle the noise?

We apply our approach to both sentence level and bag level extraction models, and evaluate in situations where we do not have prior knowledge of the data quality as well as where such prior knowledge is available.

5.1 Datasets

We evaluate our approach on two datasets.

TIMERE. We build TIMERE by using DS to align time-related Wikidata (Vrandečić and Krötzsch, 2014) KB triples to Wikipedia text. It contains 278,141 sentences with 12 types of relations between an entity mention and a time expression. We choose to use time-related relations because time expressions speak for themselves in terms of reliability. That is, given a KB triple and its aligned sentences, the finer-grained the time expression t that appears in the sentence, the more likely the sentence supports the existence of this triple. For example, a sentence containing both Alphabet and October-2-2015 is very likely to express the inception-time of Alphabet, while a sentence containing both Alphabet and 2015 could instead talk about many events, e.g., releasing the financial report of 2015, hiring a new CEO, etc. Using this heuristic, we can split the dataset into 3 subsets according to different granularities of the time expressions involved, indicating different levels of reliability. Our criteria for determining the reliability are as follows. Instances with full date expressions, i.e., Year-Month-Day, can be seen as the most reliable data, while those with partial date expressions, e.g., Month-Year and Year-Only, are considered less reliable. Negative data are constructed heuristically: any entity-time pairs in a sentence without corresponding triples in Wikidata are treated as negative data. During training, we can access 184,579 negative and 77,777 positive sentences, including 22,214 reliable, 2,094 and 53,469 less reliable ones. The validation set and test set are randomly sampled from the reliable (full-date) data for relatively fair evaluations and contain 2,776 and 2,771 positive sentences and 5,143 and 5,095 negative sentences, respectively.
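The granularity-based split can be sketched as a small helper. It assumes the time expression has already been parsed into optional year, month and day fields, and that the subsets are numbered from most to least reliable; the function name and the numbering are hypothetical.

```python
def reliability_subset(year, month=None, day=None):
    """Map a parsed time expression to a reliability subset.

    Returns 1 for Year-Month-Day (most reliable), 2 for Month-Year,
    and 3 for Year-Only. The numbering is an assumption for illustration.
    """
    if month is not None and day is not None:
        return 1
    if month is not None:
        return 2
    return 3

full_date = reliability_subset(2015, 10, 2)   # 1
month_year = reliability_subset(2015, 10)     # 2
year_only = reliability_subset(2015)          # 3
```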

ENTITYRE is a widely-used entity relation extraction dataset, built by aligning triples in Freebase to the New York Times (NYT) corpus (Riedel et al., 2010). It contains 52 relations, 136,947 positive and 385,664 negative sentences for training, and 6,444 positive and 166,004 negative sentences for testing. Unlike TIMERE, this dataset does not contain any prior knowledge about the data quality. Since the sentence level annotations in ENTITYRE are too noisy to serve as a gold standard, we only evaluate bag-level models on ENTITYRE, a standard practice in previous works (Surdeanu et al., 2012; Zeng et al., 2015; Lin et al., 2016).

5.2 Experimental Setup

Hyper-parameters. We use 200 convolution kernels with window size 3. During training, we use stochastic gradient descent (SGD) with batch size 20. The learning rates for sentence-level and bag-level models are 0.1 and 0.01, respectively.

Sentence level experiments are performed on TIMERE, using 100-d word embeddings pre-trained using GloVe (Pennington et al., 2014) on Wikipedia and Gigaword (Parker et al., 2011), and 20-d vectors for distance embeddings. Each of the three subsets of TIMERE is added after the previous phase has run for 15 epochs. The trace regularization weights are β1 = 0.01, β2 = −0.01 and β3 = −0.1, respectively, from the reliable to the most unreliable, with the ratio of β3 and β2 fixed to 10 or 5 when tuning.

Bag level experiments are performed on both TIMERE and ENTITYRE. For TIMERE, we use the same parameters as above. For ENTITYRE, we use 50-d word embeddings pre-trained on the NYT corpus using word2vec (Mikolov et al., 2013), and 5-d vectors for distance embedding. For both datasets, α and β in Eq. 5 are initialized to 1 and 0.1, respectively. We tried various decay rates, {0.95, 0.9, 0.8}, and steps, {3, 5, 8}. We found that using a decay rate of 0.9 with a step of 5 gives the best performance in most cases.
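One plausible reading of the decay rate and step settings is a multiplicative decay applied once every fixed number of epochs. The sketch below implements that reading; the function name and the interpretation of "step" as a number of epochs are assumptions, not details confirmed by the text.

```python
def decayed(value, epoch, rate=0.9, step=5):
    """Multiply `value` by `rate` once for every `step` completed epochs.

    Assumed schedule: the weight stays constant within a block of `step`
    epochs and shrinks by the factor `rate` at each block boundary.
    """
    return value * rate ** (epoch // step)

alpha0, beta0 = 1.0, 0.1                     # initial values reported for Eq. 5
alpha_epoch_10 = decayed(alpha0, epoch=10)   # two decays applied
beta_epoch_4 = decayed(beta0, epoch=4)       # still in the first block, unchanged
```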

Evaluation Metric. The performance is reported using the precision-recall (PR) curve, which is a standard evaluation metric in relation extraction. Specifically, the extraction results are first ranked in decreasing order of their confidence scores, then the precision and recall are calculated by setting the threshold to be the score of each extraction result one by one.
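The ranking-based procedure just described can be sketched in a few lines. The function name, argument names, and toy scores are assumptions for illustration; `num_gold` stands for the total number of correct extractions in the gold standard.

```python
def pr_curve(scores, is_correct, num_gold):
    """Precision-recall points obtained by thresholding at each ranked result.

    scores:     confidence score of each extraction result.
    is_correct: whether each result matches the gold standard.
    num_gold:   total number of gold positive facts.
    Returns a list of (precision, recall) pairs, one per threshold.
    """
    ranked = sorted(zip(scores, is_correct), key=lambda t: -t[0])
    points, hits = [], 0
    for k, (_, ok) in enumerate(ranked, start=1):
        hits += ok                      # True counts as 1
        points.append((hits / k, hits / num_gold))
    return points

points = pr_curve([0.9, 0.8, 0.4], [True, False, True], num_gold=2)
```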

Naming Conventions. We evaluate our approach under a wide range of settings for sentence level (sent_) and bag level (bag_) models: (1) mix: trained on all three subsets of TIMERE mixed together; (2) reliable: trained using the reliable subset of TIMERE only; (3) PR: trained with prior knowledge of annotation quality, i.e., starting from the reliable data and then adding the unreliable data; (4) TM: trained with the dynamic transition matrix; (5) GTM: trained with a global transition matrix. At the bag level, we also investigate the performance of average aggregation (_avg) and attention aggregation (_att).

[Plot: precision (y-axis) against recall (x-axis) for sent_mix, sent_reliable, sent_PR, sent_mix_TM, sent_PR_seg2_TM and sent_PR_TM.]

Figure 2: Sentence Level Results on TIMERE

6 Experimental Results

6.1 Performance on TIMERE

Sentence Level Models. The results of sentence level models on TIMERE are shown in Figure 2. We can see that mixing all subsets together (sent_mix) gives the worst performance, significantly worse than using the reliable subset only (sent_reliable). This suggests that the training data obtained through DS are noisy and that properly dealing with the noise is the key to applying DS to a wider range of applications. With the help of our dynamic transition matrix, the model (sent_mix_TM) significantly improves over sent_mix, delivering the same level of performance as sent_reliable in most cases. This suggests that our transition matrix can help to mitigate the bad influence of noisy training instances.

Now let us consider the PR scenario where one can build a curriculum by first training on the reliable subset, then gradually moving to both reliable and less reliable data. We can see that this simple curriculum learning based model (sent_PR) further outperforms sent_reliable significantly, indicating that the curriculum learning framework not only reduces the effect of noise, but also helps the model learn from noisy data. When applying the transition matrix approach in this curriculum learning framework using one reliable subset and one unreliable subset generated by mixing our two less reliable subsets, our model (sent_PR_seg2_TM) further improves over sent_PR by utilizing the dynamic transition matrix to model the noise. It is not surprising that when we use all three subsets separately, our model (sent_PR_TM) outperforms all other models by a large margin.

[Plots: precision (y-axis) against recall (x-axis). (a) Attention Aggregation: bag_att_mix, bag_att_reliable, bag_att_PR, bag_att_mix_TM and bag_att_PR_TM. (b) Average Aggregation: bag_avg_mix, bag_avg_reliable, bag_avg_PR, bag_avg_mix_TM and bag_avg_PR_TM.]

Figure 3: Bag Level Results on TIMERE

Bag Level Models. In this setting, we first look at the performance of the bag level models with attention aggregation. The results are shown in Figure 3(a). Consider the comparison between the model trained on the reliable subset only (bag_att_reliable) and the one trained on the mixed dataset (bag_att_mix). In contrast to the sentence level, bag_att_mix outperforms bag_att_reliable by a large margin, because bag_att_mix has taken the at-least-one assumption into consideration through the attention aggregation mechanism (Eq. 3), which can be seen as a denoising step within the bag. This may also be the reason that when we introduce either our dynamic transition matrix (bag_att_mix_TM) or the curriculum of using prior knowledge of data quality (bag_att_PR) into the bag level models, the improvement over bag_att_mix is not as significant as at the sentence level.

However, when we apply our dynamic transition matrix to the curriculum built upon prior knowledge of data quality (bag_att_PR_TM), the performance is further improved. This happens especially in the high precision part compared to bag_att_PR. We also note that the bag level's at-least-one assumption does not always hold, and there are still false negative and false positive problems. Therefore, using our transition matrix approach with or without prior knowledge of data quality, i.e., bag_att_mix_TM and bag_att_PR_TM, both improve the performance, and bag_att_PR_TM performs slightly better.

The results of bag level models with average aggregation are shown in Figure 3(b), where the relative ranking of various settings is similar to that with attention aggregation. A notable difference is that both bag_avg_PR and bag_avg_mix_TM improve over bag_avg_mix by a larger margin compared to that in the attention aggregation setting. The reason may be that the average aggregation mechanism is not as good as the attention aggregation at denoising within the bag, which leaves more space for our transition matrix approach or curriculum learning with prior knowledge to improve. Also note that bag_avg_reliable performs best in the very-low-recall region but worst in general. This is because it ranks higher the sentences expressing either birth-date or death-date, the simplest but the most common relations in the dataset, but fails to learn other relations with limited or noisy training instances, given its relatively simple aggregation strategy.

[Plot: precision (y-axis) against recall (x-axis) for sent_PR, sent_PR_GTM, sent_PR_TM, bag_att_PR, bag_att_PR_GTM and bag_att_PR_TM.]

Figure 4: Global TM vs. Dynamic TM

Global vs. Dynamic Transition Matrix. We also compare our dynamic transition matrix method with the global transition matrix method, which maintains only one transition matrix for all training instances. Specifically, instead of dynamically generating a transition matrix for each datum, we first initialize an identity matrix $T' \in \mathbb{R}^{|C| \times |C|}$, where |C| is the number of relations (including no-relation). Then the global transition matrix T is built by applying softmax to each row of T′ so that $\sum_j T_{ij} = 1$:

$$ T_{ij} = \frac{e^{T'_{ij}}}{\sum_{j=1}^{|C|} e^{T'_{ij}}} \tag{7} $$

where $T_{ij}$ and $T'_{ij}$ are the elements in the i-th row and j-th column of T and T′. The element values of matrix T′ are also updated via backpropagation during training. As shown in Figure 4, using one global transition matrix (_GTM) is also beneficial and improves both the sentence level (sent_PR) and bag level (bag_att_PR) models. However, since the global transition matrix only captures the global noise pattern, it fails to characterize individuals with subtle differences, resulting in a performance drop compared to the dynamic one (_TM).
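The global variant of Equation 7 can be sketched as a single learnable matrix passed through a row-wise softmax. The function name and the choice of four relations are assumptions for illustration; in training, `T_prime` would be a parameter updated by backpropagation.

```python
import numpy as np

def global_transition_matrix(T_prime):
    """Row-wise softmax of the parameter matrix T', shared by all instances."""
    z = T_prime - T_prime.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

C = 4
T = global_transition_matrix(np.eye(C))   # identity initialization as described
diag_value = T[0, 0]                      # e / (e + C - 1)
```

Note that an identity initialization of T′ does not make T itself the identity: after the softmax, each diagonal entry is e / (e + |C| − 1), which is only somewhat larger than the off-diagonal entries.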

Case Study. We find our transition matrix method tends to obtain more significant improvement on noisier relations. For example, time of spacecraft landing is noisier than time of spacecraft launch: compared to the launching of a spacecraft, there are fewer sentences containing the landing time of a spacecraft that talk directly about the landing. Instead, many of these sentences tend to talk about the activities of the crew. Our sent_PR_TM model improves the F1 of time of spacecraft landing and time of spacecraft launch over sent_PR by 9.09% and 2.78%, respectively. The transition matrix makes a more significant improvement on time of spacecraft landing since there are more noisy sentences for our method to handle, which results in a more significant improvement in the quality of the training data.

6.2 Performance on ENTITYRE

We evaluate our bag level models on ENTITYRE. As shown in Figure 5, it is not surprising that the basic model with attention aggregation (att) significantly outperforms the average one (avg), where att in our bag embedding is similar in spirit to Lin et al. (2016), which has reported state-of-the-art performance on ENTITYRE. When injected with our transition matrix approach, both att_TM and avg_TM clearly outperform their basic versions.

[Plot: precision (y-axis) against recall (x-axis) for avg, att, avg_TM and att_TM.]

Figure 5: Results on ENTITYRE

Method    P@R 10    P@R 20    P@R 30
Mintz     39.88     28.55     16.81
MultiR    60.94     36.41     -
MIML      60.75     33.82     -
avg       58.04     51.25     42.45
avg_TM    58.56     52.35     43.59
att       61.51     56.36     45.63
att_TM    67.24     57.61     44.90

Table 1: Comparison with feature-based methods. P@R 10/20/30 refers to the precision when recall equals 10%, 20% and 30%.

Similar to the situations in TIMERE, since att has taken the at-least-one assumption into account through its attention-based bag embedding mechanism, the improvement made by att_TM is not as large as that made by avg_TM.

We also include the comparison with three feature-based methods: Mintz (Mintz et al., 2009) is a multiclass logistic regression model; MultiR (Hoffmann et al., 2011) is a probabilistic graphical model that can handle overlapping relations; MIML (Surdeanu et al., 2012) is also a probabilistic graphical model but operates in the multi-instance multi-label paradigm. As shown in Table 1, although traditional feature-based methods have reasonable results in the low recall region, their performance drops quickly as the recall goes up, and MultiR and MIML did not even reach 30% recall. This indicates that, while human-designed features can effectively capture certain relation patterns, their coverage is relatively low. On the other hand, neural network models have more stable performance across different recalls, and att_TM performs generally better than other models, indicating again the effectiveness of our transition matrix method.

7 Related Work

In addition to relation extraction, distant supervision (DS) has been shown to be effective in generating training data for various NLP tasks, e.g., tweet sentiment classification (Go et al., 2009), tweet named entity classification (Ritter et al., 2011), etc. However, these early applications of DS do not address the issue of data noise well.

In relation extraction (RE), recent works have been proposed to reduce the influence of wrongly labeled data. The work presented by (Takamatsu et al., 2012) removes potentially noisy sentences by identifying bad syntactic patterns at the pre-processing stage. (Xu et al., 2013) use pseudo-relevance feedback to find possible false negative data. (Riedel et al., 2010) make the at-least-one assumption and propose to alleviate the noise problem by considering RE as a multi-instance classification problem. Following this assumption, subsequent work further improves the original paradigm using probabilistic graphical models (Hoffmann et al., 2011; Surdeanu et al., 2012) and neural network methods (Zeng et al., 2015). Recently, (Lin et al., 2016) propose to use an attention mechanism to reduce the noise within a sentence bag. Instead of characterizing the noise, these approaches only aim to alleviate its effect.

The at-least-one assumption is often too strong in practice, and there is still a chance that a sentence bag is a false positive or a false negative. Thus it is important to model the noise pattern to guide the learning procedure. (Ritter et al., 2013) and (Min et al., 2013) try to employ a set of latent variables to represent the true relation. Our approach differs from theirs in two aspects. First, we target noise modeling in neural networks, while they target probabilistic graphical models. Second, we advance their models by providing the capability to model the fine-grained transition from the true relation to the observed one, and the flexibility to incorporate indirect guidance.

Outside of NLP, various methods have been proposed in computer vision to model data noise using neural networks. (Sukhbaatar et al., 2015) utilize a global transition matrix with weight decay to transform the true label distribution to the observed one. (Reed et al., 2014) use a hidden layer to represent the true label distribution but try to force it to predict both the noisy label and the input. (Chen and Gupta, 2015; Xiao et al., 2015) first estimate the transition matrix on a clean dataset and then apply it to the noisy data. Our model shares a similar spirit with (Misra et al., 2016) in that both dynamically generate a transition matrix for each training instance, but, instead of using vanilla SGD, we train our model with a novel curriculum learning based training framework with trace regularization to control the behavior of the transition matrix. In NLP, the only work on neural-network-based noise modeling uses one single global transition matrix to model the noise introduced by cross-lingual projection of training data (Fang and Cohn, 2016). Our work advances this line by generating a transition matrix dynamically for each instance, which avoids using one single component to characterize both reliable and unreliable data.
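The role of a transition matrix in this line of work can be sketched in a few lines. The sketch assumes that entry T[i][j] is the probability that true relation i is observed as label j, that each row is a softmax over raw scores, and that the trace measures how close T is to the identity, i.e. how little noise it models. The raw scores would normally be produced by a network per instance; here they are hypothetical constants, and all function names are illustrative.

```python
import math


def row_softmax(logits):
    """Turn a square matrix of raw scores into a row-stochastic transition matrix."""
    rows = []
    for row in logits:
        top = max(row)  # subtract the max for numerical stability
        exps = [math.exp(x - top) for x in row]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows


def observed_distribution(p_true, T):
    """o_j = sum_i p_true[i] * T[i][j]: map the true distribution to the observed one."""
    n = len(p_true)
    return [sum(p_true[i] * T[i][j] for i in range(n)) for j in range(n)]


def trace(T):
    """Sum of diagonal entries; larger values mean T is closer to the identity."""
    return sum(T[i][i] for i in range(len(T)))


# Hypothetical predicted distribution over three relations.
p = [0.7, 0.2, 0.1]

# A noise-free instance: the identity matrix leaves the prediction unchanged.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
clean_observed = observed_distribution(p, identity)

# A noisier instance: some probability mass leaks to other labels.
noisy_T = row_softmax([[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]])
noisy_observed = observed_distribution(p, noisy_T)
```

A global approach would share one such matrix across all instances, whereas a dynamic approach produces a different matrix per instance. A trace-based regularizer can then push the matrix toward or away from the identity depending on how reliable the data is believed to be.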

8 Conclusions

In this paper, we investigate the noise problem inherent in DS-style training data. We argue that the data speak for themselves by providing useful clues that reveal their noise patterns. We thus propose a novel transition matrix based method to dynamically characterize the noise underlying such training data in a unified framework along with the original prediction objective. One of our key innovations is to exploit a curriculum learning based training method to gradually learn to model the underlying noise pattern without direct guidance, and to provide the flexibility to exploit any prior knowledge of the data quality to further improve the effectiveness of the transition matrix. We evaluate our approach in two learning settings of distantly supervised relation extraction. The experimental results show that the proposed method can better characterize the underlying noise and consistently outperforms state-of-the-art extraction models under various scenarios.

Acknowledgement

This work is supported by the National High Technology R&D Program of China (2015AA015403); the National Natural Science Foundation of China (61672057, 61672058); KLSTSPI Key Lab. of Intelligent Press Media Technology; the UK Engineering and Physical Sciences Research Council under grants EP/M01567X/1 (SANDeRs) and EP/M015793/1 (DIVIDEND); and the Royal Society International Collaboration Grant (IE161012).

References

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In ICML. ACM, pages 41–48.

Xinlei Chen and Abhinav Gupta. 2015. Webly supervised learning of convolutional networks. In ICCV. pages 1431–1439.

Meng Fang and Trevor Cohn. 2016. Learning when to trust distant supervision: An application to low-resource POS tagging using cross-lingual projection. In CoNLL. pages 178–186.

Alec Go, Richa Bhayani, and Lei Huang. 2009. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford 1(12).

Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, and Daniel S Weld. 2011. Knowledge-based weak supervision for information extraction of overlapping relations. In Proceedings of ACL. pages 541–550.

Yankai Lin, Shiqi Shen, Zhiyuan Liu, Huanbo Luan, and Maosong Sun. 2016. Neural relation extraction with selective attention over instances. In ACL. volume 1, pages 2124–2133.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS. pages 3111–3119.

Bonan Min, Ralph Grishman, Li Wan, Chang Wang, and David Gondek. 2013. Distant supervision for relation extraction with an incomplete knowledge base. In HLT-NAACL. pages 777–782.

Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In ACL. pages 1003–1011.

Ishan Misra, C Lawrence Zitnick, Margaret Mitchell, and Ross Girshick. 2016. Seeing through the human reporting bias: Visual classifiers from noisy human-centric labels. In CVPR. pages 2930–2939.

Robert Parker, David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2011. English gigaword fifth edition, linguistic data consortium. Technical report, Linguistic Data Consortium, Philadelphia.

Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. Glove: Global vectors for word representation. In EMNLP. volume 14, pages 1532–1543.

Scott Reed, Honglak Lee, Dragomir Anguelov, Christian Szegedy, Dumitru Erhan, and Andrew Rabinovich. 2014. Training deep neural networks on noisy labels with bootstrapping. arXiv preprint arXiv:1412.6596.

Sebastian Riedel, Limin Yao, and Andrew McCallum. 2010. Modeling relations and their mentions without labeled text. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, pages 148–163.

Alan Ritter, Sam Clark, Oren Etzioni, et al. 2011. Named entity recognition in tweets: an experimental study. In EMNLP. Association for Computational Linguistics, pages 1524–1534.

Alan Ritter, Luke Zettlemoyer, Mausam, and Oren Etzioni. 2013. Modeling missing data in distant supervision for information extraction. TACL 1:367–378.

Sainbayar Sukhbaatar, Joan Bruna, Manohar Paluri, Lubomir Bourdev, and Rob Fergus. 2015. Training convolutional networks with noisy labels. In ICLR.

Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, and Christopher D Manning. 2012. Multi-instance multi-label learning for relation extraction. In EMNLP-CoNLL. pages 455–465.

Shingo Takamatsu, Issei Sato, and Hiroshi Nakagawa. 2012. Reducing wrong labels in distant supervision for relation extraction. In ACL. pages 721–729.

Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: a free collaborative knowledgebase. Communications of the ACM 57(10):78–85.

Tong Xiao, Tian Xia, Yi Yang, Chang Huang, and Xiaogang Wang. 2015. Learning from massive noisy labeled data for image classification. In CVPR. pages 2691–2699.

Wei Xu, Raphael Hoffmann, Le Zhao, and Ralph Grishman. 2013. Filling knowledge base gaps for distant supervision of relation extraction. In ACL. pages 665–670.

Daojian Zeng, Kang Liu, Yubo Chen, and Jun Zhao. 2015. Distant supervision for relation extraction via piecewise convolutional neural networks. In EMNLP. pages 1753–1762.
