—Supplementary Material—
Detecting Human-Object Interactions with Action Co-occurrence Priors
Anonymous ECCV submission
Paper ID 3850
This supplementary material provides additional experimental results, implementation details of our model, and further qualitative results and analysis that could not be included in the main paper due to space limitations.
1 Additional Experimental Results
In this section, we provide additional qualitative results. We also analyze the effect of using different numbers of the anchor actions introduced in Section 3.2 of the main paper, examine performance across different classes, and analyze false predictions made by our model.
1.1 Qualitative Results
Fig. 1 shows examples of HOI detection results that our model predicts correctly with high probability. We show each image with the predicted HOI class followed by the probability computed by our model.
1.2 Number of Anchor Actions
In addition, we investigate the effect of using different numbers of anchor actions |D| in Fig. 2. We measure the relative performance improvement of the Hierarchical model over the Modified Baseline model while changing the number of anchor actions |D| at intervals of five.
In principle, using more anchor actions should attain better performance. On one hand, the selected anchor actions become more distinguishable from one another with more anchor action categories and training samples. On the other hand, the remaining regular actions can also benefit from stronger co-occurrence priors. However, more anchor actions also result in more sub-networks to optimize, which causes over-fitting to a certain extent. In practice, we found that an increase in the number of parameters of the HOI detector often causes a severe performance decrease. Thus, there is a trade-off between a large and a small number of anchors, which requires us to empirically select the best anchor number. As depicted in Fig. 2, the hierarchical architecture shows the best overall mAP score (Full) with 15 anchors and the best mAP score on rare classes with 10 anchors. We finally use the experimentally best-performing overall choice of 15 anchor actions (the maximum number of anchor actions is 54).
[Figure: sit_on-couch (80.48), sit_at-dining_table (76.02), fly-airplane (61.35), straddle-bicycle (61.44), ride-boat (74.93), hold-bird (96.17), carry-bottle (64.25), drive-bus (67.74), lie_on-couch (74.00), wield-baseball_bat (56.88), read-book (84.80), lift-fork (61.91), catch-frisbee (56.60), operate-hair_drier (52.94), type_on-keyboard (67.20), pull-kite (73.59), cut_with-knife (65.17), control-mouse (93.40)]
Fig. 1. Examples of bounding boxes and HOI detection scores from our model. Bounding boxes for humans are colored red, and bounding boxes for objects are colored blue. Each image is displayed with the predicted action+object class followed by the probability computed by our model.
1.3 Performance across Different Classes
Fig. 3 shows the improvements in mAP scores across different object and action classes, comparing our model with the baseline model. To visualize on which classes our method improves performance, the figure is organized in order of the score improvement of each class. Our method improves mAP scores in 64 out of the 80 object classes (80%) and 90 out of the 117 actions (77%).
1.4 Analysis of False Predictions
Fig. 4 shows examples of false predictions by our model. We found three common reasons that a prediction is evaluated as false. First, the object detector predicts a wrong object label (e.g., 'couch' as 'chair' or 'sheep' as 'cow'), as shown in the left column of Fig. 4. Second, the prediction is correct but the ground truth label for it is missing in the test image (e.g., 'ride-bus' or 'hold-horse'), as shown in the middle column of Fig. 4. Last, an HOI could have been predicted correctly if there were a sophisticated way to take context (a third object or the background) into account, e.g., the background for predicting 'repair-bicycle' or an object that a person carries for predicting 'load-bus'; such cases are shown in the right column of Fig. 4.
[Plot: Relative mAP Improvement (%) versus Number of Anchors (0 to 55), with curves for Full, Rare, and Non-rare.]
Fig. 2. Performance of the hierarchical architecture with different numbers of anchors at intervals of five. The models with 15 and 10 anchors show the best performance overall and on rare classes, respectively.
The last issue could be addressed by devising a better network architecture for effectively encoding context, which is a direction orthogonal to our work. All of these issues (errors in the object detector, missing labels in datasets, and encoding context) are fundamental issues in HOI detection and can be interesting topics for future work.
2 Implementation Details
In this section, we provide more implementation details of our method that are not discussed in the main paper due to space limitations.
2.1 Knowledge Distillation at the Object-agnostic Level
In Knowledge Distillation via ACP Projection (Section 3.4 in the main paper), the action co-occurrence matrices $C_o$ and $C'_o$ used for ACP Projection are object-aware (described in Eq. (17) and Eq. (18) in the main paper). We also use the object-agnostic action co-occurrence matrices $C$ and $C'$. We can thus generate two more object-agnostic teacher labels when constructing the teacher objectives:

$$\hat{Y}_{proj} = \mathrm{joint}(\hat{H}, \hat{O}, \mathrm{project}(\hat{A}, C, C')) \in \mathbb{R}^M, \tag{1}$$
$$Y^{gt}_{proj} = \mathrm{joint}(H^{gt}, O^{gt}, \mathrm{project}(A^{gt}, C, C')) \in \mathbb{R}^M. \tag{2}$$

Finally, by introducing two more loss-balancing weights $\lambda_4$ and $\lambda_5$, the total loss function used to train the ACP model is expressed as

$$L_{tot} = \lambda_1 L(\hat{Y}, Y^{gt}) + \lambda_2 L(\hat{Y}, \hat{Y}_{projO}) + \lambda_3 L(\hat{Y}, Y^{gt}_{projO}) + \lambda_4 L(\hat{Y}, \hat{Y}_{proj}) + \lambda_5 L(\hat{Y}, Y^{gt}_{proj}). \tag{3}$$
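For illustration, below is a minimal sketch of how the five terms of Eq. (3) could be combined, assuming the predictions and teacher labels are already computed as tensors of HOI probabilities. The function and argument names are ours, binary cross-entropy stands in for the loss $L$, and detaching the projected model predictions (treating them as fixed soft targets) is our assumption.

```python
import torch.nn.functional as F

def acp_total_loss(y_hat, y_gt, y_hat_projO, y_gt_projO, y_hat_proj, y_gt_proj,
                   lambdas=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the five terms in Eq. (3).

    y_hat                  : model prediction Y_hat, shape (M,), values in [0, 1]
    y_gt                   : ground-truth HOI labels Y_gt, shape (M,)
    y_hat_projO/y_gt_projO : object-aware ACP-projected teacher labels
    y_hat_proj/y_gt_proj   : object-agnostic teacher labels of Eqs. (1)-(2)
    lambdas                : loss-balancing weights lambda_1 ... lambda_5
    """
    l1, l2, l3, l4, l5 = lambdas
    bce = F.binary_cross_entropy
    # Projected model predictions serve as fixed soft targets (our assumption).
    return (l1 * bce(y_hat, y_gt)
            + l2 * bce(y_hat, y_hat_projO.detach())
            + l3 * bce(y_hat, y_gt_projO)
            + l4 * bce(y_hat, y_hat_proj.detach())
            + l5 * bce(y_hat, y_gt_proj))
```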
[Bar charts: AP Improvement versus Object Class (80 classes, from giraffe down to hair drier) and versus Action Class (117 classes, from light down to paint), sorted by improvement.]
Fig. 3. The AP score improvements across different object and action classes, sorted by the amount of improvement compared to the baseline model [3]. Our method improves mAP scores in 64 out of the 80 object classes (80%) and 90 out of the 117 actions (77%).
[Figure: lie_on-chair (58.87), ride-bus (87.03), hop_on-bicycle (1.48); feed-cow (13.85), hold-horse (43.39), inspect-bus (7.36)]
Fig. 4. Examples of false predictions from our model. The false cases include wrong prediction of the object label by the object detector (left column), correct predictions with missing ground truth labels (middle column), and predictions that could have been correct if the context were taken into account (right column).
2.2 Details of Multiple Information and Fusion Networks
In Section 3.3 of the main paper, we introduced the baseline network "no-frills" [3]. We now give detailed definitions of the multiple information X and the fusion network F(·) used in this baseline and in our modified baseline.
The CNN features used for HOI classification are denoted as $\hat{x} = \{\hat{x}_h, \hat{x}_o\}$, where $\hat{x}_h$ and $\hat{x}_o$ represent human and object appearance, respectively. These CNN features are extracted directly from the final fully connected (FC) layer of the detector. The multiple information consists of human appearance $\hat{x}_h$ and object appearance $\hat{x}_o$, as well as human pose ($\hat{k}$), object category ($\hat{o}$), and bounding boxes ($\hat{b}$, including both the human and object bounding boxes). All of these input cues $X = \{\hat{x}, \hat{k}, \hat{o}, \hat{b}\}$ are fed to our target model. There are four separate network streams: human appearance ($f_h(\cdot)$), object appearance ($f_o(\cdot)$), bounding box and object category ($f_b(\cdot)$), and human pose and object category ($f_k(\cdot)$).
[Diagram: the inputs human appearance $\hat{x}_h$, object appearance $\hat{x}_o$, human pose $\hat{k}$, object category $\hat{o}$, and box configuration $\hat{b}$ feed four streams: human appearance $f_h$ (1024 → 512), object appearance $f_o$ (1024 → 512), pose and label $f_k$ (588 → 512), and box and label $f_b$ (342 → 512). The stream outputs are averaged and passed to an action prediction module (512 → 117) to produce the action score.]
Fig. 5. Overall illustration of the basic components of our network architecture (more symbols added). The design choices for the hierarchical action prediction module are detailed in Fig. 7.
Each individual type of information is first fed through a separate network of two FC layers to generate a feature of fixed dimension (the number of actions $N$). Then, all the features are added together and sent through a sigmoid activation to obtain the probability prediction for action $a$:

$$\hat{A}(a) = p(a|X) = \mathrm{sigmoid}(F(X)). \tag{4}$$

The multi-information fusion procedure $F(X)$ is expressed as

$$F(X) = f_h(\hat{x}_h) + f_o(\hat{x}_o) + f_k(\hat{k} \,\|\, \hat{o}) + f_b(\hat{b} \,\|\, \hat{o}), \tag{5}$$

where the operation $\|$ denotes concatenation. However, instead of directly adding up the multiple information, we average the streams and forward the result through another action prediction module $f_{sub}$ of a few FC layers to obtain the final action probability prediction, as illustrated in Fig. 5. For a naive approach (denoted as the Modified Baseline), we simply use a sub-network $f_{sub}$ of one FC layer as the action prediction module. Then Eq. (4) and Eq. (5) are modified to

$$\hat{A} = p(a|X) = \mathrm{sigmoid}(f_{sub}(F(X))), \tag{6}$$

$$F(X) = \frac{f_h(\hat{x}_h) + f_o(\hat{x}_o) + f_k(\hat{k} \,\|\, \hat{o}) + f_b(\hat{b} \,\|\, \hat{o})}{n_{stream}}. \tag{7}$$

In this case, $n_{stream} = 4$.
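To make Eqs. (6) and (7) concrete, here is a minimal PyTorch sketch of the Modified Baseline, with input sizes taken from Fig. 5 (1024-d appearance features, 588-d pose-plus-label, 342-d box-plus-label) and the one-FC/two-FC stream split from Section 2.3. The class name, the ReLU between FC layers, and the exact input packing are our assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class ModifiedBaseline(nn.Module):
    """Four-stream fusion (Eq. (7)) followed by a one-FC action
    prediction module f_sub (Eq. (6))."""
    def __init__(self, num_actions=117):
        super().__init__()
        # Appearance streams: one FC layer each (Sec. 2.3).
        self.f_h = nn.Linear(1024, 512)                    # human appearance
        self.f_o = nn.Linear(1024, 512)                    # object appearance
        # Pose and spatial streams: two FC layers each (Sec. 2.3);
        # the ReLU in between is our assumption.
        self.f_k = nn.Sequential(nn.Linear(588, 512), nn.ReLU(),
                                 nn.Linear(512, 512))      # pose || object label
        self.f_b = nn.Sequential(nn.Linear(342, 512), nn.ReLU(),
                                 nn.Linear(512, 512))      # boxes || object label
        self.f_sub = nn.Linear(512, num_actions)           # action prediction module

    def forward(self, x_h, x_o, k, b, o):
        # '||' in Eqs. (5) and (7) concatenates the object category vector.
        streams = torch.stack([self.f_h(x_h),
                               self.f_o(x_o),
                               self.f_k(torch.cat([k, o], dim=-1)),
                               self.f_b(torch.cat([b, o], dim=-1))])
        fused = streams.mean(dim=0)               # Eq. (7): average, n_stream = 4
        return torch.sigmoid(self.f_sub(fused))   # Eq. (6): A_hat = p(a|X)
```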
2.3 Training Details
We employ Faster R-CNN [7] with a Feature Pyramid Network (FPN) [5] and a ResNet-101 [4] backbone as the object detector and freeze it while training the HOI classifier. This detector was originally trained on the MS COCO dataset [6], which has the same 80 object categories as the HICO-Det dataset. The HOI classifier consists of four streams similar to [1,2], extracting features from instance appearance, spatial location, and human pose.
Table 1. The list of 54 anchors selected from all 117 actions of the HICO-Det dataset.
flush | tag | stab | wave | brush with | hunt
toast | pay | move | teach | no interaction | eat at
milk | squeeze | greet | stop at | stir | install
point | sign | paint | shear | release | control
break | light | lose | zip | lift | pack
cut with | type on | talk on | set | slide | operate
spin | assemble | pour | lie on | turn | chase
herd | flip | buy | hose | kick | row
peel | dry | hop on | direct | adjust | kiss
The pose and spatial streams are composed of two FC layers, whereas the human and object appearance streams consist of one FC layer of size 512. The dimension of the two hidden layers in the pose and spatial streams is 512. The outputs of the four streams are consolidated via average pooling and passed through an FC layer of size 512 to perform HOI classification. We consider all detections whose detection confidence is greater than 0.01 and create human-object pairs for each image. The image-centric training strategy [7] is also applied; in other words, pair candidates from one image make up each mini-batch. We adopt SGD and set the initial learning rate to 1e-3, the weight decay to 1e-4, and the momentum to 0.9. For the ratio of positive to negative samples in training, [3] uses 1:1000, but we suspect 1:600 is more reasonable because the ratio of positive to negative samples is intuitively likely to be 1:(number of classes). We empirically found that 1:600 gives better performance than 1:1000. We train the framework for 100,000 iterations. All experiments are conducted on a single Nvidia K40 GPU.
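As a concrete reference, below is a sketch of the optimizer settings and the 1:600 pair subsampling described above; `make_optimizer`, `sample_pairs`, and the per-positive negative budget are our own illustrative names and simplifications.

```python
import random
import torch

def make_optimizer(model):
    # SGD settings quoted above: lr 1e-3, weight decay 1e-4, momentum 0.9.
    return torch.optim.SGD(model.parameters(), lr=1e-3,
                           momentum=0.9, weight_decay=1e-4)

def sample_pairs(positives, negatives, neg_per_pos=600):
    """Image-centric mini-batch: keep all positive pairs of one image and
    subsample negatives toward the 1:600 positive-to-negative ratio."""
    budget = min(len(negatives), neg_per_pos * max(len(positives), 1))
    return positives + random.sample(negatives, budget)
```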
3 More Illustrations for Our Methods
3.1 Illustration for NES
Fig. 6 shows a step-by-step example of the anchor selection process among eight action classes ('A' to 'H'). The color of each node denotes whether or not its exclusiveness value $e_i$ is included in the exclusiveness pool $E$: the i-th node is blue if $e_i \in E$, orange if selected as an anchor action, and gray if $e_i \notin E$. The color of each edge denotes whether or not the co-occurrence value $c_{ij}$ is included in the co-occurrence pool $C$.
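As a compact summary of the procedure in Fig. 6, the sketch below greedily picks the most exclusive remaining action as the next anchor, discards candidates that can co-occur with it, and finally assigns each regular action to the groups of all anchors it can co-occur with. The data structures (an exclusiveness score per action and a co-occurrence set derived from $C$) are assumed to be given, as defined in Section 3.2 of the main paper.

```python
def select_anchors(exclusiveness, cooccurs, max_anchors):
    """Greedy anchor selection as illustrated in Fig. 6.

    exclusiveness : dict mapping action -> exclusiveness value e_i
    cooccurs      : dict mapping action -> set of actions it can co-occur with
    max_anchors   : |D|, the maximum number of anchors to pick
    """
    candidates = set(exclusiveness)
    anchors = []
    while candidates and len(anchors) < max_anchors:
        # STEPs 1/2/4/5/6: pick the most exclusive remaining action ...
        a = max(candidates, key=lambda i: exclusiveness[i])
        anchors.append(a)
        # STEP 3: ... and drop every candidate that can co-occur with it.
        candidates -= cooccurs[a] | {a}
    # STEP 7: assign each regular action to the group of every anchor it
    # can co-occur with (an action may belong to several groups).
    groups = {a: [i for i in exclusiveness
                  if i not in anchors and i in cooccurs[a]]
              for a in anchors}
    return anchors, groups
```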
3.2 Design Choices for Hierarchical Architecture
Fig. 7 illustrates the design choices for our hierarchical architecture (Modified Baseline, MultiTask, TwoStream, and Hierarchical).
4 Extras
4.1 List of Anchors
Table 1 lists the maximum number (54) of anchor actions from the HICO-Det dataset in the order in which they are selected. For example, when setting the number of anchor actions to 10, we take from the first action ('flush') to the tenth action ('teach') as the set of anchors, and the rest are associated with 'other.'
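In code form, choosing a set of $|D|$ anchors reduces to slicing the ordered list; `anchors_54` below stands for the ordered list of Table 1.

```python
def split_anchor_set(anchors_54, num_anchors):
    """anchors_54: the 54 actions of Table 1 in selection order ('flush' first)."""
    anchors = anchors_54[:num_anchors]   # e.g. 'flush' ... 'teach' for num_anchors=10
    other = anchors_54[num_anchors:]     # remaining candidates fold into 'other'
    return anchors, other
```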
4.2 Example of Co-occurrence Matrices for Other Objects
Fig. 8 shows more examples of the co-occurrence matrices $C_o$ for an object $o$, constructed from the HICO-Det dataset.
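For reference, here is a sketch of how such a per-object matrix can be counted from annotations, assuming (as defined in the main paper) that entry $(i, j)$ approximates $p(\text{action } j \mid \text{action } i)$ over human-object pairs of object class $o$; the annotation format is illustrative.

```python
import numpy as np

def build_cooccurrence(pair_action_labels, num_actions):
    """Count C_o(i, j) ~ p(action j | action i) for one object class o.

    pair_action_labels : iterable of lists; each inner list holds the action
                         indices annotated on one human-object pair of class o
    """
    counts = np.zeros((num_actions, num_actions))
    for actions in pair_action_labels:
        for i in actions:
            for j in actions:
                counts[i, j] += 1           # diagonal counts occurrences of i
    occ = np.maximum(counts.diagonal(), 1)  # guard against division by zero
    return counts / occ[:, None]            # row-normalize: diagonal becomes 1
```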
[Figure panels:
STEP 0. Start. (Result: )
STEP 1. 'A' is the most exclusive; 'A' is the first anchor. (Result: [A])
STEP 2. Select 'C' as the next anchor. (Result: [A] [C])
STEP 3. 'B' can co-occur with 'C'; remove 'B' from the candidates. (Result: [A] [C])
STEP 4. Select 'E' as the next anchor and remove 'D'. (Result: [A] [C] [E])
STEP 5. Select 'F' and remove 'G'. (Result: [A] [C] [E] [F])
STEP 6. Select 'H' as the last anchor. (Result: [A] [C] [E] [F] [H])
STEP 7. Assign regular actions to action groups. (Result: [A] [C:B] [E:D] [F:B,G] [H:D,G])]
Fig. 6. Illustration of the proposed anchor action selection process.
[Diagram panels, each taking the fused 512-d vector and producing the action score (N):
(a) Modified Baseline: $f(\cdot)$ (512 → N, sigmoid) gives the probability for all actions.
(b) MultiTask: $f_{anchor}(\cdot)$ (512 → |D|+1, softmax) gives the anchor probabilities as an auxiliary label; $f_{\mathcal{G}}(\cdot)$ (512 → N, sigmoid) gives the probability for all actions.
(c) TwoStream: $f_{anchor}(\cdot)$ (512 → |D|+1, softmax) gives the anchor probabilities; $f_{\mathcal{G}}(\cdot)$ (512 → N−|D|, sigmoid) gives the regular-action probabilities.
(d) Hierarchical: $f_{anchor}(\cdot)$ (512 → |D|+1, softmax) plus |D|+1 sub-networks $f_{\mathcal{G}_1}, \ldots, f_{\mathcal{G}_{|D|+1}}(\cdot)$ (each 512 → N−|D|, sigmoid); the (N−|D|) × (|D|+1) sub-network outputs are combined with the anchor probabilities by matrix multiplication (MatMul) to give the regular-action probabilities.]
Fig. 7. Design choices for the action prediction module of our network. (a) Modified Baseline architecture without leveraging the prior of anchor actions. (b) MultiTask, which utilizes the anchor actions in a multi-task learning manner. (c) TwoStream, which separately predicts the anchor and regular actions but without modeling the hierarchy between them. (d) The proposed hierarchical design (Hierarchical). The anchor probability is directly used as a final score, and we exploit multiple conditional sub-networks to further compute the probabilities of the regular actions.
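Below is a minimal sketch of design choice (d), assuming the fused 512-d vector of Fig. 5 as input. The single-FC sub-networks, the module name, and placing the anchor scores before the regular scores in the output are our simplifications; a real implementation would map scores back to their original action indices.

```python
import torch
import torch.nn as nn

class HierarchicalHead(nn.Module):
    """Sketch of design choice (d): a softmax over anchors gates |D|+1
    conditional sub-networks that score the regular actions."""
    def __init__(self, num_actions=117, num_anchors=15, dim=512):
        super().__init__()
        self.num_anchors = num_anchors
        self.f_anchor = nn.Linear(dim, num_anchors + 1)       # anchors + 'other'
        self.f_groups = nn.ModuleList([
            nn.Linear(dim, num_actions - num_anchors)         # one per anchor group
            for _ in range(num_anchors + 1)])

    def forward(self, fused):                                 # fused: (B, 512)
        p_anchor = torch.softmax(self.f_anchor(fused), dim=-1)     # (B, |D|+1)
        # Conditional regular-action probabilities, one slab per anchor group.
        p_group = torch.stack([torch.sigmoid(g(fused))
                               for g in self.f_groups], dim=1)     # (B, |D|+1, N-|D|)
        # The 'MatMul' of Fig. 7(d): marginalize group scores over anchors.
        p_regular = (p_anchor.unsqueeze(1) @ p_group).squeeze(1)   # (B, N-|D|)
        # Anchor probabilities (without 'other') are used directly as final scores.
        return torch.cat([p_anchor[:, :self.num_anchors], p_regular], dim=-1)
```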
[Heatmaps: per-object action co-occurrence matrices (values 0 to 1) for boat, bus, car, cow, dog, elephant, horse, motorcycle, pizza, sheep, skis, and sports ball, with the applicable actions on both axes.]
Fig. 8. Examples of co-occurrence matrices $C_o$ for each object $o$, constructed from the HICO-Det dataset.
References
1. Chao, Y.W., Liu, Y., Liu, X., Zeng, H., Deng, J.: Learning to detect human-object interactions. In: IEEE Winter Conference on Applications of Computer Vision (WACV) (2018)
2. Gao, C., Zou, Y., Huang, J.B.: iCAN: Instance-centric attention network for human-object interaction detection. In: British Machine Vision Conference (BMVC) (2018)
3. Gupta, T., Schwing, A., Hoiem, D.: No-frills human-object interaction detection: Factorization, layout encodings, and training techniques. In: IEEE International Conference on Computer Vision (ICCV) (2019)
4. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
5. Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
6. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: European Conference on Computer Vision (ECCV) (2014)
7. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems (NIPS) (2015)