A Learning Corpus for Implicatives - Stanford …laurik/presentations/SemPrag...CSLI Language and...

CSLI Language and Natural Reasoning

A Learning Corpusfor Implicatives

Lauri Karttunen, Ignacio Cases and George SupaniratisaiStanford

December 5, 2016


Plan for the talkPart 1 (30 minutes by Lauri)

MotivationImplicative constructionsInvited inferencesThe corpusCollection of seed premisesFrom seed premises to entails, contradicts, permits tripletsResearch goals

Part 2 (15 minutes by George and Ignacio)Recurrent neural networkExperiments

The URL for this presentation ishttp://web.stanford.edu/~laurik/presentations/SemPrag-2016.pdf


MotivationNatural language inference is core to natural language understanding, and an essential component in the task of recognizing textual entailment.

In NLP this field has a long history from RTE competitions from 2005 onward to the present day.

Many data sets have been created for the RTE tasks, the latest one in this line we are aware of is the 2014 SemEval SICK corpus and our own Stanford Natural Languang Inference SNLI corpus of 2015.General shortcomings:

Mismash of various sources of information:WordNet, Anaphor resolution, General world knowledge, Reasoning,..

Not enough data:For machine learning 10K training data (SICK) is not enough. SNLI is big but hampered by the way it was constructed (contradicts = not the right caption for a picture).

Our contribution:We hope that the field can be advanced by data sets focusing on particular phenomena such as implicatives (our work), monotonicity reasoning (for which our work is a prerequisite for), presupposition (a stretch goal) etc.


Implicative constructionsImplicative constructions yield an entailment about the truth of their complement clause, at least under one polarity.

Some are simple verbs like manage (to) and fail (to), some are phrasal constructions like take the trouble (to), waste a chance (to).

There are seven different kinds of implicative constructions. Each of them has one the seven possible implicative signatures: +|- -|+ +|+ +|o -|o o|- o|+to be explained shortly.


Two-wayimplica-vesyieldanentailmentunderbothposi-veandnega-vepolarity.Verbslikemanage,bother,dare,deign,remember(to),happen,andturnoutetc.arepolarity-preserving;fail,neglect,andforget(to)reversethepolarity.

StanfailedtoproposetoCarole. fail:−|+

⊨Standidn’tproposetoCarole.

Standidn’tfailtoproposetoCarole.

⊨StanproposedtoCarole.

StanfailedtomanagetoproposetoCarole. manage:+|−

⊨Standidn’tproposetoCarole.

Simple two-way implicatives


Phrasal two-way implicatives+|- Julie had the chutzpah to ask the meter maid for a quarter. Randy didn’t take the opportunity to toot his own horn.

-|+Mr. Spitzer wasted the opportunity to drive a harder bargain.

She didn't waste the chance to smile back at him.

Two-way implicatives also trigger a presupposition about the protagonist and the VP. Stan failed to propose to Carole suggests that Stan tried or was expected to propose to Carol.We are not dealing with this aspect of meaning in this project.


New discovery+|+ take no time, waste no time ‘do something quickly’

She took no time to answer him back.The Senators didn't take no time to display their folly.

They were all hungry and took no time to get plates, put food on them and start eating the food.The ambulance crew didn’t take no time to get there.

Wright wasted no time to get points on the board.Hanna Baumel didn't waste no time to get her Football chair!

I wasted no time to delete you from Facebook. I didn’t waste no time to jump on board.


Thefourtypesofone-wayimplica-vesyieldanentailmentunderonepolarityand,inmanycases,aninvitedimplicatureundertheother.Theentailmentsofableandforcearepolarity-preserving,refuseandhesitatereversethepolarity.

Sallywasnotabletospeakup. ⊨Sallydidn’tspeakup. o|−

Sallywasforcedtospeakup. ⊨Sallyspokeup. +|o

Sallyrefusedtospeakup. ⊨Sallydidn’tspeakup. −|o

Sallydidn’thesitatetospeakup. ⊨Sallyspokeup. o|+

Invitedinference:

OnlySallywasabletospeakup(butshedidn’t).

One-way implicatives


Inaneutralcontextwhereithasnotbeenalreadymen-onedorotherwiseknownwhatactuallyhappened,alloftheone-wayimplica-vesarepushedtowardsbeingtwo-wayimplica-vesunlesstheauthorexplicitlyindicatesotherwise.

Sallywasabletospeakup. ↝Sallyspokeup. (+)|−

Sallywasnotforcedtospeakup. ↝Sallydidn’tspeakup. +|(−)

Sallydidnotrefusetospeakup. ↝Sallyspokeup. −|(+)

Sallyhesitatedtospeakup. ↝Sallydidn'tspeakupup. (−)|+

Thisisasystema-ceffectalthoughthestrengthoftheinvita-onvariesfromonelexicalitemtoanother:verystrongonable,weakonhesitate.

Forapragma-caccountofhowthiseffectarises,seemySALT26paper.

Invited inferences


Local Context, World KnowledgeInvited inferences do not arise where the local context indicates that the author does not intend them or where they would go against the known facts.

Jimmy Fallon had the chance to date Nicole Kidman and totally screwed up.Novak Djokovic had a chance to take a stand against sexism in tennis and somehow made it worse.I was able to speak but didn’t know what to say.

In the following cases, the non-veridicality of the VP is common knowledge.

FAA had chance to ground suicidal Germanwings pilot.Bill Clinton had a chance to kill Osama Bin Laden before September 11 attacks.


To better match the actual usage we assign to most of our simple one-way implicatives not the semantically correct signature, but one that is is deterministically + or - under the polarity that yields an entailment and a probability under the polarity that may generate an invited inference.

be able .9|-be forced +|.5be prevented -|.7get chance .9|-have chance .4|-have time .9|-

hesitate o|+

Probabilistic signatures


Phrasal one-way implicatives-|o She lost the chance to qualify for the final.o|-

The defendant had no ability to pay the fine. I have made no effort to check the accuracy of this blog.o|+ She did not have any hesitation to don the role of a seductress.

Fonseka displayed no reluctance to carry out his orders.


The CorpusThe data set consists of 66K triplets of a premise, a hypothesis and one of three labels:entails, contradicts, or permits (= neither entails nor contradicts)

she took the time to smile at us PREMISEentails RELATIONshe smiled at us HYPOTHESIS

she took the time to smile at uscontradictsshe did not smile at us

she took the time to smile at uspermitsshe listened to my needs same NP, different VP

she took the time to smile at uspermitshe did not smile at us different NP, same VP


Data collectionFor each of the 42 constructions we collected at least three dozen `seed’ premises from the Web and Google books. We did not make up any premises ourselves.

The seed premises we looked for were all positive sentences in the simple past tense, e.g.

George Osborne broke his pledge to protect the NHS budget.A negative counterpart was generated by a Python script:

George Osborne did not break his pledge to protect the NHS budget.The hypotheses, positive and negative, were generated from the seed premise:

George Osborne protected the NHS budget.George Osborne did not protect the NHS budget.

In this way each seed premise generates two entailments and contradictions.


George Osborne broke his pledge to protect the NHS budget.entailsGeorge Osborne did not protect the NHS budget.

George Osborne did not break his pledge to protect the NHS budget.entailsGeorge Osborne protected the NHS budget.

George Osborne broke his pledge to protect the NHS budget.contradictsGeorge Osborne protected the NHS budget.

George Osborne did not break his pledge to protect the NHS budget.contradictsGeorge Osborne did not protect the NHS budget.


Generating ‘permits’We two methods to generate hypotheses that were not entailed or contradicted by the premise.

1. Keep the original subject NP and swap in another unrelated VP found in the corpus:

George Osborne broke his pledge to protect the NHS budget.permitsGeorge Osborne did not stand down over hospital cuts.

2. Keep the original VP and swap in another subject NP that does not refer to or include the original subject.

George Osborne did not break his pledge to protect the NHS budget.permitsObama protected the NHS budget.

Since the polarity of the hypothesis make no difference for permits we randomly picked one.


Annotating premisesBecause switching subject NPs and VPs would easily produce nonsensical or ungrammatical sentences, we carefully annotated the seed premises with the necessary morphological information.

Subject NPs were marked for gender, number and person:Hillary ⇢ Hillary[F,S3], we ⇢ we[P1]

Reflexive, possessive, and other pronouns referring to the subject NP were replaced by place holders:

myself, yourself, themselves, etc. ⇢ [ReflPro]my, your, our, their, its ⇢ [PossPro]I, you, he, she, we, they ⇢ [SubjPro] when coreferential with the subject NP

Verbs were marked to indicate that their form depends on the tense and polarity of the sentence

broke ⇢ broke[V], protect ⇢ protect[V], etc.Conjunctions in places where and alternates with or depending on polarity were coded as such.

and ⇢ [AND,OR]


Annotation examples

you managed to get to into the university you wanted to.you[S2] managed[V] to get[V] to into the university [SubjPro] wanted[V] to.

University of Richmond alumna took time to cultivate her skills and find her passion.University of Richmond alumna[F,S3] took[V] time to cultivate[V] [PossPro] skills [AND,OR] find[V] [PossPro] passion

From the annotated form, we generated all simple past, perfect, and past perfect tenses of the sentence both positive and negative forms for the premise and the corresponding forms for the hypothesis for all three relations.


Data generationUniversity of Richmond alumna[F,S3] took[V] time to cultivate[V] [PossPro] skills [AND,OR] find[V] [PossPro] passion

University of Richmond alumna did not take time to cultivate her skills or find her passion.contradictsUniversity of Richmond alumna cultivated her skills and found her passion.

University of Richmond alumna took time to cultivate her skills and find her passion.permitsthey did not cultivate their skills or find their passion.


Current limitationsWe assume that it was best at this point to avoid phenomena whose inclusion might make the learning task significantly harder.

• We only include subject NPs that are either proper names, pronouns, or definite descriptions.

• No examples with quantifiers like some, every, few, many, etc. No subject NPs with the indefinite article.a man had a chance to make a decent profit in his business back thenSome man? A particular man? A generic man?

• Not and no are the only form of negation we currently include.• The data we now generate does not include any monotonicity reasoning.

There are no examples demonstrating the validity of dropping modifiers in monotone environments or adding them in antitone environments.

• We don’t include any examples of nested implicative constructions such asI did not manage to have time to eat during lunch hour.


Research GoalsWhen we started the project half-a-year ago I wasn’t at all sure that we could train a model that would reasonably accurately predict the relationship between a premise with an implicative construction and a hypothesis, hence the limitations that now appear to have been unnecessary.

If that would work out, the next question would be whether the model has learned the semantics of implicatives in a way that would come close to the way humans learn them. Has the model learned not just individual patterns but the general principles that they manifest? Has the model made the right generalizations?

There are hundreds of common phrasal implicatives in English but it seems evident that people don’t need to learn their implicative signatures one-by-one. Once you understand a construction such as take a chance you probably also understand what seize the opportunity means because take and seize are similar and so are chance and opportunity.


Question

how well does it do when presented with constructions

lose opportunitymiss opportunitytake opportunitywaste opportunity

If the model is trained on these chance constructions without ever seeing the word opportunity,

get chance .9|-have chance .4|-miss chance -|+take chance +|-waste chance -|+


Recurrent Neural Networks

2

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

119

120

121

122

123

124

125

126

127

128

129

130

131

132

133

134

135

136

137

138

139

140

141

142

143

144

145

146

147

148

149

150

151

152

153

154

155

156

157

158

159

160

161

162

163

164

165

166

167

168

169

170

171

172

173

174

175

176

177

178

179

180

181

182

183

184

185

186

187

188

189

190

191

192

193

194

195

196

197

198

199

200

201

202

203

204

205

206

207

ACL 2016 Submission ***. Confidential review copy. DO NOT DISTRIBUTE.

1 Introduction

2 Dynamics of Recurrent NeuralNetworks

In general terms, the dynamic evolution of a Re-current Neural Network (RNN) in discrete time isgoverned by a set of difference equations that canbe expressed in terms of a map parameterized byθ as follows

hθt+1 = F (ht,xt,θ) (1)

where ht and xt represent, at time t, the state ofthe network and external input respectively, and Fis a nonlinear vector field. We will make use of thecommon parametrization that uses a recurrent con-nection matrix R, an input matrix W , and biasesb together with a nonlinear function σ

ht+1 = σ (Rht +Wxt + b) (2)

First, let us consider the single-hidden unit modelRNN with no external input originally introducedby (Doya, 1993) and analyzed by (Nakahara andDoya., 1998; ?):

ht+1 = σ (aht + b) (3)

The parameters (a, b) govern the qualitative be-haviour of the dynamical system, including thenumber of fixed points and the stability (Nakaharaand Doya., 1998). When the network is trainedusing an updated rule such as stochastic gradi-ent descent, these parameters experience updatesthat may eventually result in sudden changes inthe qualitative behavior of the system. This phe-nomenon, known as bifurcation, is an importantcharacteristic of many non-linear dynamical sys-tems, and its importance in the context of RNNshas long been recognized (Doya and Yoshizawa,1991; ?; ?; ?).

2.1 Multidimensional Recurrent NeuralNetwork

Following the model introduced in the previoussection, the equations of evolution of a multidi-mensional RNN with external inputs expressed in?? are

hµ(t+1) = σ

!"

µν

Rµνhν(t) +"

ν

Wµνxν(t) + bµ

#

(4)

It is always possible to distinguish betweenthe self-recurrent connection and the rest of theweights:

hµ(t+ 1) = σ(Rµµhµ + bµ

+"

ν,ν =µ

Rµνhν(t) +"

ν

Wµνxν(t))

where the parameter a in the previous section isnow the scalar Rµµ, and b is the scalar bµ. Theelements Rµν with ν = µ represent the non-self-recurrent, lateral connections, and Wµν are theweights for the external input connections. As(Nakahara and Doya., 1998) shown, the effect ofthe sum of lateral connections and external inputscan be expressed as

hµ(t+ 1) = σ(Rµµhµ + bµ + uµ(t))

with

uµ(t) ="

ν,ν =µ

Rµνhν(t) +"

ν

Wµνxν(t) (5)

This last term enters the equations of evolution onequal foot to the bias, and thus the combined ef-fect of lateral connections and external inputs canbe interpreted as a change in the bias of a self-recurrent unit. As argued by (Nakahara and Doya.,1998), this reinterpretation allows us to analyze amultidimensional system subjected to external in-puts in terms of the dynamics of a single neuronwith a time dependent bias. The set of equations?? suggest that the coupling between lateral con-nections and external inputs can drive the dynam-ical evolution of a system. If a network is initial-ized with a state vector such that all the elementsare zero, the structure of the external inputs willdrive the changes of the bias, and therefore deeplyaffect the learning dynamics during the first stagesof the training process. From these observationsit follows that external inputs can be consideredas bifurcation parameters, and as such we can ex-pect that their structure will have deep impact inthe learning dynamics of the recurrent network.

2.1.1 Structure of External Inputs

uµ(t) = Rµνhν(t) +Wµνxν(t) (6)

2

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

119

120

121

122

123

124

125

126

127

128

129

130

131

132

133

134

135

136

137

138

139

140

141

142

143

144

145

146

147

148

149

150

151

152

153

154

155

156

157

158

159

160

161

162

163

164

165

166

167

168

169

170

171

172

173

174

175

176

177

178

179

180

181

182

183

184

185

186

187

188

189

190

191

192

193

194

195

196

197

198

199

200

201

202

203

204

205

206

207


1 Introduction



hθt+1 = F (ht,xt,θ) (1)


ht+1 = σ (Rht +Wxt + b) (2)


ht+1 = σ (aht + b) (3)




hµ(t+1) = σ

!"

µν

Rµνhν(t) +"

ν

Wµνxν(t) + bµ

#

(4)



+"

ν,ν =µ

Rµνhν(t) +"

ν

Wµνxν(t))



with

uµ(t) ="

ν,ν =µ

Rµνhν(t) +"

ν

Wµνxν(t) (5)




2

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

119

120

121

122

123

124

125

126

127

128

129

130

131

132

133

134

135

136

137

138

139

140

141

142

143

144

145

146

147

148

149

150

151

152

153

154

155

156

157

158

159

160

161

162

163

164

165

166

167

168

169

170

171

172

173

174

175

176

177

178

179

180

181

182

183

184

185

186

187

188

189

190

191

192

193

194

195

196

197

198

199

200

201

202

203

204

205

206

207


1 Introduction



hθt+1 = F (ht,xt,θ) (1)


ht+1 = σ (Rht +Wxt + b) (2)


ht+1 = σ (aht + b) (3)




hµ(t+1) = σ

!"

µν

Rµνhν(t) +"

ν

Wµνxν(t) + bµ

#

(4)



+"

ν,ν =µ

Rµνhν(t) +"

ν

Wµνxν(t))



with

uµ(t) ="

ν,ν =µ

Rµνhν(t) +"

ν

Wµνxν(t) (5)



uµ(t) = Rµνhν(t) +Wµνxν(t) (6)2

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

119

120

121

122

123

124

125

126

127

128

129

130

131

132

133

134

135

136

137

138

139

140

141

142

143

144

145

146

147

148

149

150

151

152

153

154

155

156

157

158

159

160

161

162

163

164

165

166

167

168

169

170

171

172

173

174

175

176

177

178

179

180

181

182

183

184

185

186

187

188

189

190

191

192

193

194

195

196

197

198

199

200

201

202

203

204

205

206

207


1 Introduction



hθt+1 = F (ht,xt,θ) (1)


ht+1 = σ (Rht +Wxt + b) (2)


ht+1 = σ (aht + b) (3)




hµ(t+1) = σ

!"

µν

Rµνhν(t) +"

ν

Wµνxν(t) + bµ

#

(4)



+"

ν,ν =µ

Rµνhν(t) +"

ν

Wµνxν(t))



with

uµ(t) ="

ν,ν =µ

Rµνhν(t) +"

ν

Wµνxν(t) (5)




AuthorDepartmentStanford UniversityStanford, CA 94300, USA



The abstract paragraph should be indented 1/2inch (3picas) on both left and right-hand margins. Use 10point type, with a vertical spacing of 11points. The wordABSTRACT must be centered, in small caps, and in point size 12. Two line spacesprecede the abstract. The abstract must be limited to one paragraph.

ht+1 = F(Rht + W x+b) (1)

hµ(t+ 1) = σ(!

µν

Rµνhν(t) +!

ν

Wµνxν(t) + bµ) (2)

hµ(t+ 1) = σ(Rµµhµ + bµ +!

ν,ν =µ

Rµνhν(t) +!

ν

Wµνxν(t)) (3)

uµ(t) =!

ν,ν =µ

Rµνhν(t) +!

ν

Wµνxν(t) (4)


hµ(t+ 1) = σ(!

µν

Rµνhν(t) +!

ν

Wµνxν(t) + bµ) (6)

hµ(t+ 1) = σ(Rµµhµ + bµ +!

ν,ν =µ

Rµνhν(t) +!

ν

Wµνxν(t)) (7)

uµ(t) =!

ν,ν =µ

Rµνhν(t) +!

ν

Wµνxν(t) (8)


yt (10)

1

She took time to respond


Sentence Encoding

She took time to respond

entails

SM

:

She responded


Sequence to Sequence

She took time to respond She responded

entails

SM


Sequence to Sequence

She took time to respond She responded

entails

SM

LSTM illustration from Graf et al (2015)


ModelsTwo baseline models

- Lexicalized classifier- 6 features: 3 lexicalized, 3 unlexicalized

- LSTM sentence encoder

Experiments with a - sequence to sequence- two-layer LSTM


Models

Model F1 (dev)

Lexicalized classifier (b) 81

Sentence Encoder LSTM (b) ?

Seq2Seq LSTM 93


Experiments

Hyperparametersearch landscape

- Pre-trained GloVe embedding

- Searching for best learning rate and initialization scale factor


Experiment 1: training- Training using optimal hyperparameter pairs- Split data to 80% train, 10% validation, 10% test


Experiment 1: training- Training using optimal hyperparameter pairs- Split data to 80% train, 10% validation, 10% test

without probabilistic imp. with probabilistic imp.


Experiment 2 : Smaller data

- Training close to optimal hyperpara-meter pairs

- Randomly selectsome of data fromthe 80% train


Experiment 3 : Generalization- Hold offmiss opportunitywaste opportunitytake opportunitylose opportunity

- Test onmiss opportunitywaste opportunitytake opportunity- However, our model does not learn lose opportunity, not having seen any examples of lose chance.


Thanks!

We would welcome questions, comments, and suggestions from you.

We’d like to acknowledge Chris Manning, Chris Potts, Thang Luong, Dan Jurafsky, and Arun Chaganty for their comments and help.


One way implicatives: ○|−The join of negation and cover is entailment: ^ ⟗ ᴗ = ⊏

Sally spoke up

Sally was able to speak up⊨

Sally wasn’t able to speak up⊨

Sally didn’t speak up ¬ speak up

able⋀

Sally was able to speak upSally was not able to speak up

Sally didn’t speak up Sally spoke up

Universe

Sally was able to speak up ᴗ Sally didn’t speak up


One way implicatives: −|o

The join of negation and cover is entailment: ^ ⟗ ᴗ = ⊏

Sally spoke up

Sally didn’t refuse to speak up⊨

Sally refused to speak up⊨

Sally didn’t speak up ¬ refuse

¬ speak up⋀

Sally did not refuse to speak upSally refused to speak up

Sally didn’t speak up Sally spoke up

Universe

Sally didn’t refuse to speak up ᴗ Sally didn’t speak up

A Learning Corpus for Implicatives - Stanford …laurik/presentations/SemPrag...CSLI Language and...

Documents

Transcript of A Learning Corpus for Implicatives - Stanford …laurik/presentations/SemPrag...CSLI Language and...