
—Supplementary Material—
Detecting Human-Object Interactions with Action Co-occurrence Priors

    Anonymous ECCV submission

    Paper ID 3850

This supplementary material contains additional experimental results, implementation details of our model, and further qualitative results and analysis that could not be included in the main paper due to space limitations.

    1 Additional Experimental Results

In this section, we provide additional qualitative results. We also analyze the effect of using different numbers of anchor actions (introduced in Section 3.2 of the main paper), examine performance across different classes, and analyze false predictions made by our model.

    1.1 Qualitative Results

Fig. 1 shows examples of HOI detection results that our model predicts correctly with high probability. We show each image with the predicted HOI class followed by the probability computed by our model.

    1.2 Number of Anchor Actions

We investigate the effect of using different numbers of anchor actions |D| in Fig. 2. We measure the relative performance improvement of the +Hierarchical model over the Modified Baseline model while varying the number of anchor actions |D| in intervals of five.

In principle, using more anchor actions should yield better performance. On one hand, the selected anchor actions become more distinguishable from one another when there are more anchor action categories and training samples. On the other hand, the remaining regular actions can also benefit from stronger co-occurrence priors. However, more anchor actions also result in more sub-networks to optimize, which causes over-fitting to a certain extent. Through our observations, we found that an increase in the number of parameters of the HOI detector often causes a severe performance decrease. Thus, there is a trade-off between a large and a small number of anchors, which requires us to select the best anchor number empirically. As depicted in Fig. 2, the hierarchical architecture shows the best overall mAP score (Full) with 15 anchors and the best mAP score on rare classes with 10 anchors. We finally use the experimentally best-performing overall choice of 15 anchor actions (the maximum possible number of anchor actions is 54).
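As a small illustration of this sweep, the relative improvement plotted in Fig. 2 can be computed as sketched below; the mAP values used here are placeholders for illustration only, not the actual experimental numbers behind the figure.

```python
# Sketch of computing the relative mAP improvement reported in Fig. 2.
# All mAP values below are placeholders, not our actual results.
baseline_map = {"Full": 17.0, "Rare": 12.0, "Non-rare": 18.5}   # Modified Baseline (assumed)
hierarchical_map = {                                             # +Hierarchical per anchor count (assumed)
    10: {"Full": 18.5, "Rare": 13.6, "Non-rare": 19.9},
    15: {"Full": 18.9, "Rare": 13.3, "Non-rare": 20.4},
}

def relative_improvement(hier, base):
    """Relative mAP improvement (%) of +Hierarchical over the Modified Baseline."""
    return 100.0 * (hier - base) / base

for num_anchors, scores in sorted(hierarchical_map.items()):
    gains = {k: relative_improvement(scores[k], baseline_map[k]) for k in scores}
    print(num_anchors, {k: round(v, 2) for k, v in gains.items()})
```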


[Fig. 1 panels: sit_on-couch (80.48), sit_at-dining_table (76.02), fly-airplane (61.35), straddle-bicycle (61.44), ride-boat (74.93), hold-bird (96.17), carry-bottle (64.25), drive-bus (67.74), lie_on-couch (74.00), wield-baseball_bat (56.88), read-book (84.80), lift-fork (61.91), catch-frisbee (56.60), operate-hair_drier (52.935), type_on-keyboard (67.2), pull-kite (73.59), cut_with-knife (65.17), control-mouse (93.4)]

Fig. 1. Examples of bounding boxes and HOI detection scores from our model. Bounding boxes for humans are colored red, and bounding boxes for objects are colored blue. Each image is displayed with the predicted action+object class followed by the probability computed by our model.

    1.3 Performance across Different Classes

Fig. 3 shows the improvements in mAP scores across different object and action classes, comparing our model with the baseline model. To visualize the classes on which our method improves performance, the figure is sorted by the improvement in the score of each class. Our method improves the mAP scores of 64 out of the 80 object classes (80%) and 90 out of the 117 actions (77%).

    1.4 Analysis of False Predictions

Fig. 4 shows examples of false predictions by our model. We found three common reasons why a prediction is evaluated as false. First, the object detector predicts a wrong object label (e.g., 'couch' as 'chair' or 'sheep' as 'cow'), as shown in the left column of Fig. 4. Second, the prediction is correct but the ground-truth label for that prediction is missing in the test image (e.g., 'ride-bus' or 'hold-horse'), as shown in the middle column of Fig. 4. Last, the HOI detector could have predicted correctly if there were a sophisticated way to take context (a third object or the background) into account.


Fig. 2. Performance of the hierarchical architecture with different numbers of anchors, varied in intervals of five. The vertical axis shows the relative mAP improvement (%) for the Full, Rare, and Non-rare settings. The models with 15 and 10 anchors show the best performance overall and on rare classes, respectively.

Examples are the background for predicting 'repair-bicycle', or an object that the person carries for predicting 'load-bus', as shown in the right column of Fig. 4. This last issue could be addressed by devising a better network architecture for effectively encoding context, which is a direction orthogonal to our work. All of these issues (errors in the object detector, missing labels in datasets, and encoding context) are fundamental problems in HOI detection and are interesting topics for future work.

    2 Implementation Details

In this section, we provide additional implementation details of our method that are not discussed in the main paper due to space limitations.

2.1 Knowledge Distillation at the Object-agnostic Level

In Knowledge Distillation via ACP Projection (Section 3.4 of the main paper), the action co-occurrence matrices $C^o$ and $C'^o$ used for ACP Projection are object-aware (described in Eq. (17) and Eq. (18) of the main paper). We also use the object-agnostic action co-occurrence matrices $C$ and $C'$.

We can thus generate two more object-agnostic teacher labels when constructing the teacher objectives:

$\hat{Y}_{proj} = \mathrm{joint}(\hat{H}, \hat{O}, \mathrm{project}(\hat{A}, C, C')) \in \mathbb{R}^M,$  (1)

$Y^{gt}_{proj} = \mathrm{joint}(H^{gt}, O^{gt}, \mathrm{project}(A^{gt}, C, C')) \in \mathbb{R}^M.$  (2)

    Finally, the total loss function we used to train the ACP model is expressed as:

$L_{tot} = \lambda_1 L(\hat{Y}, Y^{gt}) + \lambda_2 L(\hat{Y}, \hat{Y}_{projO}) + \lambda_3 L(\hat{Y}, Y^{gt}_{projO}) + \lambda_4 L(\hat{Y}, \hat{Y}_{proj}) + \lambda_5 L(\hat{Y}, Y^{gt}_{proj}).$  (3)
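For concreteness, a minimal PyTorch-style sketch of the five-term objective in Eq. (3) follows. The joint(·) and project(·) operations are defined in the main paper and are not reproduced here; using a detached binary cross-entropy term for each teacher label and uniform weights is our assumption for illustration.

```python
# Sketch of the total ACP training loss in Eq. (3). Binary cross-entropy is
# assumed for L(.,.); the five targets are the ground truth and the four
# teacher labels built with the joint()/project() operations of the main paper.
import torch
import torch.nn.functional as F

def acp_total_loss(Y_hat, Y_gt, Y_projO_hat, Y_projO_gt, Y_proj_hat, Y_proj_gt,
                   lambdas=(1.0, 1.0, 1.0, 1.0, 1.0)):
    targets = (Y_gt, Y_projO_hat, Y_projO_gt, Y_proj_hat, Y_proj_gt)
    # Teacher labels are detached so that gradients only flow through Y_hat.
    return sum(lam * F.binary_cross_entropy(Y_hat, t.detach())
               for lam, t in zip(lambdas, targets))

# Toy usage with random probabilities over M HOI classes.
M = 600
Y_hat = torch.rand(4, M, requires_grad=True)
loss = acp_total_loss(Y_hat, *(torch.rand(4, M) for _ in range(5)))
loss.backward()
```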



Fig. 3. AP score improvements across different object and action classes, sorted by the amount of improvement compared to the baseline model [3]. Our method improves the mAP scores of 64 out of the 80 object classes (80%) and 90 out of the 117 actions (77%).

[Fig. 4 panels, left to right: lie_on-chair (58.87) / feed-cow (13.85); ride-bus (87.03) / hold-horse (43.39); hop_on-bicycle (1.48) / inspect-bus (7.36)]

Fig. 4. Examples of false predictions from our model. The false cases include a wrong object label predicted by the object detector (left column), correct predictions with missing ground-truth labels (middle column), and predictions that could have been correct if the context had been taken into account (right column).

This total loss introduces two additional balancing weights $\lambda_4$ and $\lambda_5$ for the object-agnostic teacher terms.

    2.2 Details of Multiple Information and Fusion Networks

In Section 3.3 of the main paper, we introduced the baseline network, "no-frills" [3]. We now give detailed definitions of the multiple information $X$ and the fusion networks $F(\cdot)$ used in this baseline and in our modified baseline.

The CNN features used for HOI classification are denoted as $\hat{x} = \{\hat{x}_h, \hat{x}_o\}$, where $\hat{x}_h$ and $\hat{x}_o$ represent human and object appearance, respectively. These CNN features are directly extracted from the final fully connected (FC) layer of the detector. The multiple information consists of the human appearance $\hat{x}_h$ and object appearance $\hat{x}_o$, as well as the human pose $\hat{k}$, the object category $\hat{o}$, and the bounding boxes $\hat{b}$ (including both object and human bounding boxes). All of these input cues $X = \{\hat{x}, \hat{k}, \hat{o}, \hat{b}\}$ are fed to our target model. There are four separate network streams: human appearance ($f_h(\cdot)$), object appearance ($f_o(\cdot)$), bounding box and object category ($f_b(\cdot)$), and human pose and object category ($f_k(\cdot)$).


[Fig. 5 diagram: human appearance $\hat{x}_h$ → $f_h$ (1024 → 512); object appearance $\hat{x}_o$ → $f_o$ (1024 → 512); human pose $\hat{k}$ and object category $\hat{o}$ → $f_k$ (588 → 512); box configuration $\hat{b}$ and object category → $f_b$ (342 → 512); the four stream outputs are averaged and passed to the action prediction module (512 → 117) to produce the action score.]

Fig. 5. Overall illustration of the basic components of our network architecture (more symbols added). The design choices for the hierarchical action prediction module are detailed in Fig. 7.

Each individual type of information is first fed through a separate network of two FC layers to generate a feature of fixed dimension (the number of actions $N$). Then, all the features are added together and sent through a sigmoid activation to obtain the probability prediction for the action $a$:

$\hat{A}(a) = p(a|X) = \mathrm{sigmoid}(F(X)).$  (4)

The multi-information fusion procedure $F(X)$ is expressed as

$F(X) = f_h(\hat{x}_h) + f_o(\hat{x}_o) + f_k(\hat{k} \,||\, \hat{o}) + f_b(\hat{b} \,||\, \hat{o}),$  (5)

where the operation $||$ denotes concatenation. However, instead of directly adding up the multiple information, we average the stream outputs and forward the result through another action prediction module $f_{sub}$ of a few FC layers to obtain the final action probability prediction, as illustrated in Fig. 5. For a naive approach (denoted as the Modified Baseline), we simply use a sub-network $f_{sub}$ of one FC layer as the action prediction module. Then Eq. (4) and Eq. (5) are modified to

$\hat{A} = p(a|X) = \mathrm{sigmoid}(f_{sub}(F(X))),$  (6)

$F(X) = \frac{f_h(\hat{x}_h) + f_o(\hat{x}_o) + f_k(\hat{k} \,||\, \hat{o}) + f_b(\hat{b} \,||\, \hat{o})}{n_{stream}}.$  (7)

In this case, $n_{stream} = 4$.
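A minimal PyTorch sketch of the Modified Baseline fusion in Eq. (6) and Eq. (7) is shown below. The input dimensions follow Fig. 5 (1024 for each appearance feature, 588 for pose plus object label, 342 for box configuration plus object label); the hidden-layer width of 512 follows the training details in Section 2.3, while the ReLU nonlinearity and the pre-concatenated inputs are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ModifiedBaseline(nn.Module):
    """Four-stream fusion of Eq. (6)-(7); layer sizes follow Fig. 5."""
    def __init__(self, num_actions=117, hidden=512):
        super().__init__()
        self.f_h = nn.Linear(1024, hidden)                    # human appearance stream
        self.f_o = nn.Linear(1024, hidden)                    # object appearance stream
        self.f_k = nn.Sequential(nn.Linear(588, hidden),      # human pose || object label
                                 nn.ReLU(), nn.Linear(hidden, hidden))
        self.f_b = nn.Sequential(nn.Linear(342, hidden),      # box configuration || object label
                                 nn.ReLU(), nn.Linear(hidden, hidden))
        self.f_sub = nn.Linear(hidden, num_actions)           # action prediction module

    def forward(self, x_h, x_o, k_o, b_o):
        # Eq. (7): average the four stream outputs (n_stream = 4).
        fused = (self.f_h(x_h) + self.f_o(x_o) + self.f_k(k_o) + self.f_b(b_o)) / 4.0
        # Eq. (6): sigmoid over the action prediction module output.
        return torch.sigmoid(self.f_sub(fused))

# Example forward pass with random inputs for a batch of two human-object pairs.
model = ModifiedBaseline()
probs = model(torch.rand(2, 1024), torch.rand(2, 1024),
              torch.rand(2, 588), torch.rand(2, 342))   # (2, 117) action probabilities
```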

    2.3 Training Details

We employ Faster R-CNN [7] with a Feature Pyramid Network (FPN) [5] and a ResNet-101 [4] backbone as the object detector and freeze it while training the HOI classifier. This detector was originally trained on the MS COCO dataset [6], which has the same 80 object categories as the HICO-Det dataset. The HOI classifier consists of four streams similar to [1,2], extracting features from instance appearance, spatial location, and human pose.


Table 1. The list of the 54 anchors selected from all 117 actions of the HICO-Det dataset, in selection order.

flush, tag, stab, wave, brush with, hunt, toast, pay, move, teach, no interaction, eat at, milk, squeeze, greet, stop at, stir, install, point, sign, paint, shear, release, control, break, light, lose, zip, lift, pack, cut with, type on, talk on, set, slide, operate, spin, assemble, pour, lie on, turn, chase, herd, flip, buy, hose, kick, row, peel, dry, hop on, direct, adjust, kiss

The pose and spatial streams are composed of two FC layers, whereas the human and object appearance streams consist of one FC layer of size 512. The dimension of the two hidden layers in the pose and spatial streams is 512. The outputs of the four streams are consolidated via average pooling and passed through an FC layer of size 512 to perform HOI classification. We consider all detections whose detection confidence is greater than 0.01 and create human-object pairs for each image. The image-centric training strategy [7] is also applied; in other words, pair candidates from one image make up each mini-batch. We adopt SGD and set the initial learning rate to 1e-3, the weight decay to 1e-4, and the momentum to 0.9. For the ratio of positive to negative samples in training, [3] uses 1:1000, but we suspect 1:600 is more reasonable because the ratio of positive to negative samples is intuitively likely to be about 1:(number of classes). We empirically found that 1:600 gives better performance than 1:1000. We train the framework for 100,000 iterations. All experiments are conducted on a single Nvidia K40 GPU.
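The optimization settings above can be summarized in a short configuration sketch; the classifier below is a stand-in, and only the listed hyperparameters are taken from the text.

```python
# Sketch of the training configuration described above; the model is a placeholder.
import torch
import torch.nn as nn

model = nn.Linear(512, 117)          # placeholder for the four-stream HOI classifier
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                            momentum=0.9, weight_decay=1e-4)

DETECTION_CONF_THRESH = 0.01         # keep detections with confidence > 0.01
POS_NEG_RATIO = (1, 600)             # positive : negative pair sampling ratio
NUM_ITERATIONS = 100_000             # total training iterations
# Image-centric batching: each mini-batch is built from the pairs of one image.
```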

    3 More Illustrations for Our Methods

    3.1 Illustration for NES

Fig. 6 shows a step-by-step example of the anchor selection process among eight action classes ('A' to 'H'). The color of each node denotes whether or not its exclusiveness value $e_i$ is included in the exclusiveness pool $\mathcal{E}$: the $i$-th node is blue if $e_i \in \mathcal{E}$, orange if selected as an anchor action, and gray if $e_i \notin \mathcal{E}$. The color of each edge denotes whether or not the co-occurrence value $c_{ij}$ is included in the co-occurrence pool $\mathcal{C}$.
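To make the procedure in Fig. 6 concrete, a small greedy-selection sketch is given below. The exclusiveness scores and the exact co-occurrence test are defined in Section 3.2 of the main paper; the threshold and the toy values here are assumptions chosen only so that the walk-through reproduces the figure's result.

```python
# Greedy anchor selection sketch following the Fig. 6 walk-through.
def select_anchors(exclusiveness, cooccurrence, threshold=0.1):
    """exclusiveness: {action: score}; cooccurrence: {(a, b): value}."""
    candidates = dict(exclusiveness)   # the exclusiveness pool E
    anchors = []
    while candidates:
        # Pick the most exclusive remaining action as the next anchor.
        anchor = max(candidates, key=candidates.get)
        anchors.append(anchor)
        del candidates[anchor]
        # Drop candidates that can co-occur with the new anchor.
        for action in list(candidates):
            if cooccurrence.get((anchor, action), 0.0) > threshold or \
               cooccurrence.get((action, anchor), 0.0) > threshold:
                del candidates[action]
    return anchors

# Toy values mirroring Fig. 6 (made-up exclusiveness and co-occurrence scores).
ex = {"A": 0.9, "B": 0.3, "C": 0.8, "D": 0.2, "E": 0.7, "F": 0.6, "G": 0.1, "H": 0.5}
co = {("C", "B"): 0.5, ("E", "D"): 0.4, ("F", "G"): 0.6, ("F", "B"): 0.3,
      ("H", "D"): 0.2, ("H", "G"): 0.3}
print(select_anchors(ex, co))   # -> ['A', 'C', 'E', 'F', 'H'] with these values
```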

    3.2 Design Choices for Hierarchical Architecture

Fig. 7 illustrates the design choices for our hierarchical architecture (Modified Baseline, MultiTask, TwoStream, and Hierarchical).


    4 Extras

    4.1 List of Anchors

Table 1 shows the maximum number of anchor actions from the HICO-Det dataset, listed in the order in which they are selected. For example, when the number of anchor actions is set to 10, we take the first action ('flush') through the tenth action ('teach') as the set of anchors, and the rest are associated with 'other.'

    4.2 Example of Co-occurrence Matrices for Other Objects

Fig. 8 shows more examples of the co-occurrence matrices $C^o$ for an object $o$, constructed from the HICO-Det dataset.
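As a rough illustration of how such a matrix can be assembled from annotations, a sketch is given below. The precise definition of $C^o$ is given in the main paper; the counting scheme (a conditional probability estimated over human-object pairs of one object category) and the toy labels here are simplifying assumptions.

```python
# Sketch of building an action co-occurrence matrix for one object from
# HICO-Det-style annotations. Each entry is assumed to approximate
# p(action j | action i) over human-object pairs of that object category.
from collections import defaultdict

def build_cooccurrence(pair_labels, actions):
    """pair_labels: one set of annotated actions per human-object pair."""
    pair_count = defaultdict(float)    # co-occurrence counts of (i, j)
    action_count = defaultdict(float)  # occurrence counts of i
    for labels in pair_labels:
        for i in labels:
            action_count[i] += 1
            for j in labels:
                pair_count[(i, j)] += 1
    C = [[0.0] * len(actions) for _ in actions]
    for r, i in enumerate(actions):
        for c, j in enumerate(actions):
            if action_count[i] > 0:
                C[r][c] = pair_count[(i, j)] / action_count[i]
    return C

# Toy example for the object 'boat' with three annotated human-boat pairs.
pairs = [{"ride", "sit on"}, {"row", "sit on"}, {"wash"}]
C_boat = build_cooccurrence(pairs, ["ride", "row", "sit on", "wash"])
```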


STEP 0. Start. (Result: )
STEP 1. 'A' is the most exclusive, so 'A' becomes the first anchor. (Result: [A])
STEP 2. Select 'C' as the next anchor. (Result: [A] [C])
STEP 3. 'B' can co-occur with 'C', so remove 'B' from the candidates. (Result: [A] [C])
STEP 4. Select 'E' as the next anchor and remove 'D'. (Result: [A] [C] [E])
STEP 5. Select 'F' and remove 'G'. (Result: [A] [C] [E] [F])
STEP 6. Select 'H' as the last anchor. (Result: [A] [C] [E] [F] [H])
STEP 7. Assign regular actions to action groups. (Result: [A] [C:B] [E:D] [F:B,G] [H:D,G])

    Fig. 6. Illustration of the proposed anchor action selection process.


[Fig. 7 diagram: (a) Modified Baseline: fused vector (512) → $f(\cdot)$ (512 → $N$, sigmoid) → probability for all actions. (b) MultiTask: $f_{anchor}(\cdot)$ (512 → $|\mathcal{D}|+1$, softmax) gives probabilities for the anchor actions as an auxiliary label, and $f_{\mathcal{G}}(\cdot)$ (512 → $N$, sigmoid) gives probabilities for all actions. (c) TwoStream: $f_{anchor}(\cdot)$ (512 → $|\mathcal{D}|+1$, softmax) for the anchor actions and $f_{\mathcal{G}}(\cdot)$ (512 → $N-|\mathcal{D}|$, sigmoid) for the regular actions. (d) Hierarchical: $f_{anchor}(\cdot)$ (512 → $|\mathcal{D}|+1$, softmax) plus $|\mathcal{D}|+1$ sub-networks $f_{\mathcal{G}_1}(\cdot), \ldots, f_{\mathcal{G}_{|\mathcal{D}|+1}}(\cdot)$ (each 512 → $N-|\mathcal{D}|$, sigmoid), combined by matrix multiplication to give the regular-action probabilities.]

Fig. 7. Design choices for the action prediction module of our network. (a) Modified Baseline architecture, which does not leverage the anchor-action prior. (b) MultiTask, which utilizes the anchor actions in a multi-task learning manner. (c) TwoStream, which separately predicts the anchor and the regular actions but without the hierarchical modeling between them. (d) The proposed hierarchical design (Hierarchical). The anchor probability is directly used as a final score, and we exploit multiple conditional sub-networks to further compute the probabilities of the regular actions.
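A minimal PyTorch sketch of the hierarchical action prediction module in Fig. 7(d) is shown below. The layer widths follow the figure; how the anchor and regular scores are ordered within the final $N$-dimensional output is our assumption for illustration.

```python
import torch
import torch.nn as nn

class HierarchicalHead(nn.Module):
    """Hierarchical action prediction module sketched after Fig. 7(d)."""
    def __init__(self, num_actions=117, num_anchors=15, hidden=512):
        super().__init__()
        self.num_anchors = num_anchors
        num_regular = num_actions - num_anchors
        # Anchor branch: softmax over |D| anchors plus one 'other' slot.
        self.f_anchor = nn.Linear(hidden, num_anchors + 1)
        # One conditional sub-network per anchor group (|D|+1 of them),
        # each predicting the regular actions.
        self.f_groups = nn.ModuleList(
            nn.Linear(hidden, num_regular) for _ in range(num_anchors + 1))

    def forward(self, fused):                                             # fused: (B, 512)
        anchor_prob = torch.softmax(self.f_anchor(fused), dim=-1)         # (B, |D|+1)
        group_prob = torch.stack(
            [torch.sigmoid(g(fused)) for g in self.f_groups], dim=-1)     # (B, N-|D|, |D|+1)
        # Regular-action probabilities: combine group predictions with the
        # anchor probabilities (the MatMul in Fig. 7(d)).
        regular_prob = torch.einsum("bkd,bd->bk", group_prob, anchor_prob)  # (B, N-|D|)
        # Anchor probabilities (without the 'other' slot) are used directly as final scores.
        return torch.cat([anchor_prob[:, :self.num_anchors], regular_prob], dim=-1)

head = HierarchicalHead()
scores = head(torch.rand(2, 512))   # (2, 117) action scores
```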


Fig. 8. Examples of the co-occurrence matrices $C^o$ constructed from the HICO-Det dataset, shown for the objects boat, bus, car, cow, dog, elephant, horse, motorcycle, pizza, sheep, skis, and sports ball.


    References

1. Chao, Y.W., Liu, Y., Liu, X., Zeng, H., Deng, J.: Learning to detect human-object interactions. In: IEEE Winter Conference on Applications of Computer Vision (WACV) (2018)
2. Gao, C., Zou, Y., Huang, J.B.: iCAN: Instance-centric attention network for human-object interaction detection. In: British Machine Vision Conference (BMVC) (2018)
3. Gupta, T., Schwing, A., Hoiem, D.: No-frills human-object interaction detection: Factorization, layout encodings, and training techniques. In: IEEE International Conference on Computer Vision (ICCV) (2019)
4. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
5. Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
6. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: European Conference on Computer Vision (ECCV) (2014)
7. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems (NIPS) (2015)
