—Supplementary Material—
Detecting Human-Object Interactions with Action Co-occurrence Priors
Anonymous ECCV submission
Paper ID 3850
This supplementary material provides additional experimental results, implementation details of our model, and further qualitative results and analysis that could not be included in the main paper due to space limitations.
1 Additional Experimental Results
In this section, we provide additional qualitative results. We also analyze the effect of using different numbers of the anchor actions introduced in Section 3.2 of the main paper, examine performance across different classes, and analyze false predictions made by our model.
1.1 Qualitative Results
Fig. 1 shows examples of HOI detection results that our model predicts correctly with high probability. We show each image with the predicted HOI class followed by the probability computed by our model.
1.2 Number of Anchor Actions
In addition, we investigate the effect of using different numbers of anchor actions |D| in Fig. 2. We measure the relative performance improvement of the Hierarchical model over the Modified Baseline model while changing the number of anchor actions |D| at intervals of five.
In principle, using more anchor actions should attain better performance. On one hand, the selected anchor actions become more distinguishable from one another with more anchor action categories and training samples. On the other hand, the remaining regular actions can also benefit from stronger co-occurrence priors. However, more anchor actions also result in more sub-networks to optimize, which causes over-fitting to a certain extent. In practice, we found that an increase in the number of parameters of the HOI detector often causes a severe performance decrease. Thus, there is a trade-off between a large and a small number of anchors, which requires us to empirically select the best anchor number. As depicted in Fig. 2, the hierarchical architecture shows the best overall mAP score (Full) with 15 anchors and the best mAP score on rare classes with 10 anchors. We finally use the experimentally best-performing overall choice of 15 anchor actions (the maximum number of anchor actions is 54).
[Figure: sit_on-couch (80.48), sit_at-dining_table (76.02), fly-airplane (61.35), straddle-bicycle (61.44), ride-boat (74.93), hold-bird (96.17), carry-bottle (64.25), drive-bus (67.74), lie_on-couch (74.00), wield-baseball_bat (56.88), read-book (84.80), lift-fork (61.91), catch-frisbee (56.60), operate-hair_drier (52.94), type_on-keyboard (67.20), pull-kite (73.59), cut_with-knife (65.17), control-mouse (93.40)]
Fig. 1. Examples of bounding boxes and HOI detection scores from our model. Bounding boxes for humans are colored red, and bounding boxes for objects are colored blue. Each image is displayed with the predicted action+object class followed by the probability computed by our model.
1.3 Performance across Different Classes
Fig. 3 shows the improvements in mAP scores across different object and action classes, comparing our model with the baseline model. To visualize on which classes our method improves performance, the figure is organized in order of the score improvement of each class. Our method improves mAP scores in 64 out of the 80 object classes (80%) and 90 out of the 117 actions (77%).
1.4 Analysis of False Predictions
Fig. 4 shows examples of false predictions by our model. We found three common reasons that a prediction is evaluated as false. First, the object detector predicts a wrong object label (e.g., 'couch' as 'chair' or 'sheep' as 'cow'), as shown in the left column of Fig. 4. Second, the prediction is correct but the ground truth label for it is missing in the test image (e.g., 'ride-bus' or 'hold-horse'), as shown in the middle column of Fig. 4. Last, an HOI could have been predicted correctly if there were a sophisticated way to take context (a third object or the background) into account, e.g., the background for predicting 'repair-bicycle' or an object that a person carries for predicting 'load-bus'; such cases are shown in the right column of Fig. 4.
[Plot: Relative mAP Improvement (%) versus Number of Anchors (0 to 55), with curves for Full, Rare, and Non-rare.]
Fig. 2. Performance of the hierarchical architecture with different numbers of anchors at intervals of five. The models with 15 and 10 anchors show the best performance overall and on rare classes, respectively.
The last issue could be addressed by devising a better network architecture for effectively encoding context, which is a direction orthogonal to our work. All of these issues (errors in the object detector, missing labels in datasets, and encoding context) are fundamental issues in HOI detection and can be interesting topics for future work.
2 Implementation Details
In this section, we provide more implementation details of our method that are not discussed in the main paper due to space limitations.
2.1 Knowledge Distillation at the Object-agnostic Level
In Knowledge Distillation via ACP Projection (Section 3.4 in the main paper), the action co-occurrence matrices $C_o$ and $C'_o$ used for ACP Projection are object-aware (described in Eq. (17) and Eq. (18) in the main paper). We also use the object-agnostic action co-occurrence matrices $C$ and $C'$. We can thus generate two more object-agnostic teacher labels when constructing the teacher objectives:

$$\hat{Y}_{proj} = \mathrm{joint}(\hat{H}, \hat{O}, \mathrm{project}(\hat{A}, C, C')) \in \mathbb{R}^M, \tag{1}$$
$$Y^{gt}_{proj} = \mathrm{joint}(H^{gt}, O^{gt}, \mathrm{project}(A^{gt}, C, C')) \in \mathbb{R}^M. \tag{2}$$

Finally, by introducing two more loss-balancing weights $\lambda_4$ and $\lambda_5$, the total loss function used to train the ACP model is expressed as

$$L_{tot} = \lambda_1 L(\hat{Y}, Y^{gt}) + \lambda_2 L(\hat{Y}, \hat{Y}_{projO}) + \lambda_3 L(\hat{Y}, Y^{gt}_{projO}) + \lambda_4 L(\hat{Y}, \hat{Y}_{proj}) + \lambda_5 L(\hat{Y}, Y^{gt}_{proj}). \tag{3}$$
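For illustration, below is a minimal sketch of how the five terms of Eq. (3) could be combined, assuming the predictions and teacher labels are already computed as tensors of HOI probabilities. The function and argument names are ours, binary cross-entropy stands in for the loss $L$, and detaching the projected model predictions (treating them as fixed soft targets) is our assumption.

```python
import torch.nn.functional as F

def acp_total_loss(y_hat, y_gt, y_hat_projO, y_gt_projO, y_hat_proj, y_gt_proj,
                   lambdas=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the five terms in Eq. (3).

    y_hat                  : model prediction Y_hat, shape (M,), values in [0, 1]
    y_gt                   : ground-truth HOI labels Y_gt, shape (M,)
    y_hat_projO/y_gt_projO : object-aware ACP-projected teacher labels
    y_hat_proj/y_gt_proj   : object-agnostic teacher labels of Eqs. (1)-(2)
    lambdas                : loss-balancing weights lambda_1 ... lambda_5
    """
    l1, l2, l3, l4, l5 = lambdas
    bce = F.binary_cross_entropy
    # Projected model predictions serve as fixed soft targets (our assumption).
    return (l1 * bce(y_hat, y_gt)
            + l2 * bce(y_hat, y_hat_projO.detach())
            + l3 * bce(y_hat, y_gt_projO)
            + l4 * bce(y_hat, y_hat_proj.detach())
            + l5 * bce(y_hat, y_gt_proj))
```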
[Bar charts: AP Improvement versus Object Class (80 classes, from giraffe down to hair drier) and versus Action Class (117 classes, from light down to paint), sorted by improvement.]
Fig. 3. The AP score improvements across different object and action classes, sorted by the amount of improvement compared to the baseline model [3]. Our method improves mAP scores in 64 out of the 80 object classes (80%) and 90 out of the 117 actions (77%).
[Figure: lie_on-chair (58.87), ride-bus (87.03), hop_on-bicycle (1.48); feed-cow (13.85), hold-horse (43.39), inspect-bus (7.36)]
Fig. 4. Examples of false predictions from our model. The false cases include wrong prediction of the object label by the object detector (left column), correct predictions with missing ground truth labels (middle column), and predictions that could have been correct if the context were taken into account (right column).
2.2 Details of Multiple Information and Fusion Networks
In Section 3.3 of the main paper, we introduced the baseline network "no-frills" [3]. We now give detailed definitions of the multiple information X and the fusion network F(·) used in this baseline and in our modified baseline.
The CNN features used for HOI classification are denoted as $\hat{x} = \{\hat{x}_h, \hat{x}_o\}$, where $\hat{x}_h$ and $\hat{x}_o$ represent human and object appearance, respectively. These CNN features are extracted directly from the final fully connected (FC) layer of the detector. The multiple information consists of human appearance $\hat{x}_h$ and object appearance $\hat{x}_o$, as well as human pose ($\hat{k}$), object category ($\hat{o}$), and bounding boxes ($\hat{b}$, including both the human and object bounding boxes). All of these input cues $X = \{\hat{x}, \hat{k}, \hat{o}, \hat{b}\}$ are fed to our target model. There are four separate network streams: human appearance ($f_h(\cdot)$), object appearance ($f_o(\cdot)$), bounding box and object category ($f_b(\cdot)$), and human pose and object category ($f_k(\cdot)$).
[Diagram: the inputs human appearance $\hat{x}_h$, object appearance $\hat{x}_o$, human pose $\hat{k}$, object category $\hat{o}$, and box configuration $\hat{b}$ feed four streams: human appearance $f_h$ (1024 → 512), object appearance $f_o$ (1024 → 512), pose and label $f_k$ (588 → 512), and box and label $f_b$ (342 → 512). The stream outputs are averaged and passed to an action prediction module (512 → 117) to produce the action score.]
Fig. 5. Overall illustration of the basic components of our network architecture (more symbols added). The design choices for the hierarchical action prediction module are detailed in Fig. 7.
Each individual type of information is first fed through a separate network of two FC layers to generate a feature of fixed dimension (the number of actions $N$). Then, all the features are added together and sent through a sigmoid activation to obtain the probability prediction for action $a$:

$$\hat{A}(a) = p(a|X) = \mathrm{sigmoid}(F(X)). \tag{4}$$

The multi-information fusion procedure $F(X)$ is expressed as

$$F(X) = f_h(\hat{x}_h) + f_o(\hat{x}_o) + f_k(\hat{k} \,\|\, \hat{o}) + f_b(\hat{b} \,\|\, \hat{o}), \tag{5}$$

where the operation $\|$ denotes concatenation. However, instead of directly adding up the multiple information, we average the streams and forward the result through another action prediction module $f_{sub}$ of a few FC layers to obtain the final action probability prediction, as illustrated in Fig. 5. For a naive approach (denoted as the Modified Baseline), we simply use a sub-network $f_{sub}$ of one FC layer as the action prediction module. Then Eq. (4) and Eq. (5) are modified to

$$\hat{A} = p(a|X) = \mathrm{sigmoid}(f_{sub}(F(X))), \tag{6}$$

$$F(X) = \frac{f_h(\hat{x}_h) + f_o(\hat{x}_o) + f_k(\hat{k} \,\|\, \hat{o}) + f_b(\hat{b} \,\|\, \hat{o})}{n_{stream}}. \tag{7}$$

In this case, $n_{stream} = 4$.
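To make Eqs. (6) and (7) concrete, here is a minimal PyTorch sketch of the Modified Baseline, with input sizes taken from Fig. 5 (1024-d appearance features, 588-d pose-plus-label, 342-d box-plus-label) and the one-FC/two-FC stream split from Section 2.3. The class name, the ReLU between FC layers, and the exact input packing are our assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class ModifiedBaseline(nn.Module):
    """Four-stream fusion (Eq. (7)) followed by a one-FC action
    prediction module f_sub (Eq. (6))."""
    def __init__(self, num_actions=117):
        super().__init__()
        # Appearance streams: one FC layer each (Sec. 2.3).
        self.f_h = nn.Linear(1024, 512)                    # human appearance
        self.f_o = nn.Linear(1024, 512)                    # object appearance
        # Pose and spatial streams: two FC layers each (Sec. 2.3);
        # the ReLU in between is our assumption.
        self.f_k = nn.Sequential(nn.Linear(588, 512), nn.ReLU(),
                                 nn.Linear(512, 512))      # pose || object label
        self.f_b = nn.Sequential(nn.Linear(342, 512), nn.ReLU(),
                                 nn.Linear(512, 512))      # boxes || object label
        self.f_sub = nn.Linear(512, num_actions)           # action prediction module

    def forward(self, x_h, x_o, k, b, o):
        # '||' in Eqs. (5) and (7) concatenates the object category vector.
        streams = torch.stack([self.f_h(x_h),
                               self.f_o(x_o),
                               self.f_k(torch.cat([k, o], dim=-1)),
                               self.f_b(torch.cat([b, o], dim=-1))])
        fused = streams.mean(dim=0)               # Eq. (7): average, n_stream = 4
        return torch.sigmoid(self.f_sub(fused))   # Eq. (6): A_hat = p(a|X)
```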
2.3 Training Details
We employ Faster R-CNN [7] with a Feature Pyramid Network (FPN) [5] and a ResNet-101 [4] backbone as the object detector and freeze it while training the HOI classifier. This detector was originally trained on the MS COCO dataset [6], which has the same 80 object categories as the HICO-Det dataset. The HOI classifier consists of four streams similar to [1,2], extracting features from instance appearance, spatial location, and human pose.
Table 1. The list of 54 anchors selected from all 117 actions of the HICO-Det dataset.
flush | tag | stab | wave | brush with | hunt
toast | pay | move | teach | no interaction | eat at
milk | squeeze | greet | stop at | stir | install
point | sign | paint | shear | release | control
break | light | lose | zip | lift | pack
cut with | type on | talk on | set | slide | operate
spin | assemble | pour | lie on | turn | chase
herd | flip | buy | hose | kick | row
peel | dry | hop on | direct | adjust | kiss
The pose and spatial streams are composed of two FC layers, whereas the human and object appearance streams consist of one FC layer of size 512. The dimension of the two hidden layers in the pose and spatial streams is 512. The outputs of the four streams are consolidated via average pooling and passed through an FC layer of size 512 to perform HOI classification. We consider all detections whose detection confidence is greater than 0.01 and create human-object pairs for each image. The image-centric training strategy [7] is also applied; in other words, pair candidates from one image make up each mini-batch. We adopt SGD and set the initial learning rate to 1e-3, the weight decay to 1e-4, and the momentum to 0.9. For the ratio of positive to negative samples in training, [3] uses 1:1000, but we suspect 1:600 is more reasonable because the ratio of positive to negative samples is intuitively likely to be 1:(number of classes). We empirically found that 1:600 gives better performance than 1:1000. We train the framework for 100,000 iterations. All experiments are conducted on a single Nvidia K40 GPU.
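As a concrete reference, below is a sketch of the optimizer settings and the 1:600 pair subsampling described above; `make_optimizer`, `sample_pairs`, and the per-positive negative budget are our own illustrative names and simplifications.

```python
import random
import torch

def make_optimizer(model):
    # SGD settings quoted above: lr 1e-3, weight decay 1e-4, momentum 0.9.
    return torch.optim.SGD(model.parameters(), lr=1e-3,
                           momentum=0.9, weight_decay=1e-4)

def sample_pairs(positives, negatives, neg_per_pos=600):
    """Image-centric mini-batch: keep all positive pairs of one image and
    subsample negatives toward the 1:600 positive-to-negative ratio."""
    budget = min(len(negatives), neg_per_pos * max(len(positives), 1))
    return positives + random.sample(negatives, budget)
```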
3 More Illustrations for Our Methods
3.1 Illustration for NES
Fig. 6 shows a step-by-step example of the anchor selection process among eight action classes ('A' to 'H'). The color of each node denotes whether or not its exclusiveness value $e_i$ is included in the exclusiveness pool $E$: the i-th node is blue if $e_i \in E$, orange if selected as an anchor action, and gray if $e_i \notin E$. The color of each edge denotes whether or not the co-occurrence value $c_{ij}$ is included in the co-occurrence pool $C$.
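As a compact summary of the procedure in Fig. 6, the sketch below greedily picks the most exclusive remaining action as the next anchor, discards candidates that can co-occur with it, and finally assigns each regular action to the groups of all anchors it can co-occur with. The data structures (an exclusiveness score per action and a co-occurrence set derived from $C$) are assumed to be given, as defined in Section 3.2 of the main paper.

```python
def select_anchors(exclusiveness, cooccurs, max_anchors):
    """Greedy anchor selection as illustrated in Fig. 6.

    exclusiveness : dict mapping action -> exclusiveness value e_i
    cooccurs      : dict mapping action -> set of actions it can co-occur with
    max_anchors   : |D|, the maximum number of anchors to pick
    """
    candidates = set(exclusiveness)
    anchors = []
    while candidates and len(anchors) < max_anchors:
        # STEPs 1/2/4/5/6: pick the most exclusive remaining action ...
        a = max(candidates, key=lambda i: exclusiveness[i])
        anchors.append(a)
        # STEP 3: ... and drop every candidate that can co-occur with it.
        candidates -= cooccurs[a] | {a}
    # STEP 7: assign each regular action to the group of every anchor it
    # can co-occur with (an action may belong to several groups).
    groups = {a: [i for i in exclusiveness
                  if i not in anchors and i in cooccurs[a]]
              for a in anchors}
    return anchors, groups
```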
3.2 Design Choices for Hierarchical Architecture
Fig. 7 illustrates the design choices for our hierarchical architecture (Modified Baseline, MultiTask, TwoStream, and Hierarchical).
4 Extras
4.1 List of Anchors
Table 1 lists the maximum number (54) of anchor actions from the HICO-Det dataset in the order in which they are selected. For example, when setting the number of anchor actions to 10, we take from the first action ('flush') to the tenth action ('teach') as the set of anchors, and the rest are associated with 'other.'
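In code form, choosing a set of $|D|$ anchors reduces to slicing the ordered list; `anchors_54` below stands for the ordered list of Table 1.

```python
def split_anchor_set(anchors_54, num_anchors):
    """anchors_54: the 54 actions of Table 1 in selection order ('flush' first)."""
    anchors = anchors_54[:num_anchors]   # e.g. 'flush' ... 'teach' for num_anchors=10
    other = anchors_54[num_anchors:]     # remaining candidates fold into 'other'
    return anchors, other
```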
4.2 Example of Co-occurrence Matrices for Other Objects
Fig. 8 shows more examples of the co-occurrence matrices $C_o$ for an object $o$, constructed from the HICO-Det dataset.
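For reference, here is a sketch of how such a per-object matrix can be counted from annotations, assuming (as defined in the main paper) that entry $(i, j)$ approximates $p(\text{action } j \mid \text{action } i)$ over human-object pairs of object class $o$; the annotation format is illustrative.

```python
import numpy as np

def build_cooccurrence(pair_action_labels, num_actions):
    """Count C_o(i, j) ~ p(action j | action i) for one object class o.

    pair_action_labels : iterable of lists; each inner list holds the action
                         indices annotated on one human-object pair of class o
    """
    counts = np.zeros((num_actions, num_actions))
    for actions in pair_action_labels:
        for i in actions:
            for j in actions:
                counts[i, j] += 1           # diagonal counts occurrences of i
    occ = np.maximum(counts.diagonal(), 1)  # guard against division by zero
    return counts / occ[:, None]            # row-normalize: diagonal becomes 1
```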
[Figure panels:
STEP 0. Start. (Result: )
STEP 1. 'A' is the most exclusive; 'A' is the first anchor. (Result: [A])
STEP 2. Select 'C' as the next anchor. (Result: [A] [C])
STEP 3. 'B' can co-occur with 'C'; remove 'B' from the candidates. (Result: [A] [C])
STEP 4. Select 'E' as the next anchor and remove 'D'. (Result: [A] [C] [E])
STEP 5. Select 'F' and remove 'G'. (Result: [A] [C] [E] [F])
STEP 6. Select 'H' as the last anchor. (Result: [A] [C] [E] [F] [H])
STEP 7. Assign regular actions to action groups. (Result: [A] [C:B] [E:D] [F:B,G] [H:D,G])]
Fig. 6. Illustration of the proposed anchor action selection process.
[Diagram panels, each taking the fused 512-d vector and producing the action score (N):
(a) Modified Baseline: $f(\cdot)$ (512 → N, sigmoid) gives the probability for all actions.
(b) MultiTask: $f_{anchor}(\cdot)$ (512 → |D|+1, softmax) gives the anchor probabilities as an auxiliary label; $f_{\mathcal{G}}(\cdot)$ (512 → N, sigmoid) gives the probability for all actions.
(c) TwoStream: $f_{anchor}(\cdot)$ (512 → |D|+1, softmax) gives the anchor probabilities; $f_{\mathcal{G}}(\cdot)$ (512 → N−|D|, sigmoid) gives the regular-action probabilities.
(d) Hierarchical: $f_{anchor}(\cdot)$ (512 → |D|+1, softmax) plus |D|+1 sub-networks $f_{\mathcal{G}_1}, \ldots, f_{\mathcal{G}_{|D|+1}}(\cdot)$ (each 512 → N−|D|, sigmoid); the (N−|D|) × (|D|+1) sub-network outputs are combined with the anchor probabilities by matrix multiplication (MatMul) to give the regular-action probabilities.]
Fig. 7. Design choices for the action prediction module of our network. (a) Modified Baseline architecture without leveraging the prior of anchor actions. (b) MultiTask, which utilizes the anchor actions in a multi-task learning manner. (c) TwoStream, which separately predicts the anchor and regular actions but without modeling the hierarchy between them. (d) The proposed hierarchical design (Hierarchical). The anchor probability is directly used as a final score, and we exploit multiple conditional sub-networks to further compute the probabilities of the regular actions.
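Below is a minimal sketch of design choice (d), assuming the fused 512-d vector of Fig. 5 as input. The single-FC sub-networks, the module name, and placing the anchor scores before the regular scores in the output are our simplifications; a real implementation would map scores back to their original action indices.

```python
import torch
import torch.nn as nn

class HierarchicalHead(nn.Module):
    """Sketch of design choice (d): a softmax over anchors gates |D|+1
    conditional sub-networks that score the regular actions."""
    def __init__(self, num_actions=117, num_anchors=15, dim=512):
        super().__init__()
        self.num_anchors = num_anchors
        self.f_anchor = nn.Linear(dim, num_anchors + 1)       # anchors + 'other'
        self.f_groups = nn.ModuleList([
            nn.Linear(dim, num_actions - num_anchors)         # one per anchor group
            for _ in range(num_anchors + 1)])

    def forward(self, fused):                                 # fused: (B, 512)
        p_anchor = torch.softmax(self.f_anchor(fused), dim=-1)     # (B, |D|+1)
        # Conditional regular-action probabilities, one slab per anchor group.
        p_group = torch.stack([torch.sigmoid(g(fused))
                               for g in self.f_groups], dim=1)     # (B, |D|+1, N-|D|)
        # The 'MatMul' of Fig. 7(d): marginalize group scores over anchors.
        p_regular = (p_anchor.unsqueeze(1) @ p_group).squeeze(1)   # (B, N-|D|)
        # Anchor probabilities (without 'other') are used directly as final scores.
        return torch.cat([p_anchor[:, :self.num_anchors], p_regular], dim=-1)
```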
[Heatmaps: per-object action co-occurrence matrices (values 0 to 1) for boat, bus, car, cow, dog, elephant, horse, motorcycle, pizza, sheep, skis, and sports ball, with the applicable actions on both axes.]
Fig. 8. Examples of co-occurrence matrices $C_o$ for each object $o$, constructed from the HICO-Det dataset.
References
1. Chao, Y.W., Liu, Y., Liu, X., Zeng, H., Deng, J.: Learning to detect human-object interactions. In: IEEE Winter Conference on Applications of Computer Vision (WACV) (2018)
2. Gao, C., Zou, Y., Huang, J.B.: iCAN: Instance-centric attention network for human-object interaction detection. In: British Machine Vision Conference (BMVC) (2018)
3. Gupta, T., Schwing, A., Hoiem, D.: No-frills human-object interaction detection: Factorization, layout encodings, and training techniques. In: IEEE International Conference on Computer Vision (ICCV) (2019)
4. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
5. Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
6. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: European Conference on Computer Vision (ECCV) (2014)
7. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems (NIPS) (2015)