
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 2012–2023, Hong Kong, China, November 3–7, 2019. ©2019 Association for Computational Linguistics


Image Captioning with Very Scarce Supervised Data: Adversarial Semi-Supervised Learning Approach

Dong-Jin Kim^1  Jinsoo Choi^1  Tae-Hyun Oh^2  In So Kweon^1

^1 KAIST, South Korea.  ^2 MIT CSAIL, Cambridge, MA.
^1 {djnjusa,jinsc37,iskweon77}@kaist.ac.kr  ^2 [email protected]

Abstract

Constructing an organized dataset comprised of a large number of images and several captions for each image is a laborious task, which requires vast human effort. On the other hand, collecting a large number of images and sentences separately may be immensely easier. In this paper, we develop a novel data-efficient semi-supervised framework for training an image captioning model. We leverage massive unpaired image and caption data by learning to associate them. To this end, our proposed semi-supervised learning method assigns pseudo-labels to unpaired samples via Generative Adversarial Networks to learn the joint distribution of image and caption. To evaluate, we construct the scarcely-paired COCO dataset, a modified version of the MS COCO caption dataset. The empirical results show the effectiveness of our method compared to several strong baselines, especially when the amount of paired samples is scarce.

1 Introduction

Image captioning is the task of automatically generating a natural language description of a given image. It is a highly useful task, in that 1) it extracts the essence of an image into a self-descriptive form of representation, and 2) the output format is natural language, which exhibits free-form and manageable characteristics useful to applications such as language-based image or region retrieval [Johnson et al., 2016; Karpathy and Fei-Fei, 2015; Kim et al., 2019], video summarization [Choi et al., 2018], navigation [Wang et al., 2018a], and vehicle control [Kim et al., 2018b]. Image captioning exhibits free-form characteristics, since it is not confined to a small number of predefined classes. This enables descriptive analysis of a given image.

Recent research on image captioning has made impressive progress [Anderson et al., 2018; Vinyals et al., 2015]. Despite this progress, the

Figure 1: The proposed data setup utilizes both "paired" and "unpaired" image-caption data. We denote paired data as $D_p$, and unpaired image and caption datasets as $D_u^x$ and $D_u^y$, respectively.

majority of works are trained via supervised learning that requires a large corpus of caption-labeled images such as the MS COCO caption dataset [Lin et al., 2014]. Specifically, the dataset was constructed by asking annotators to provide five grammatically plausible sentences for each of its 1.2M images. Constructing such human-labeled datasets is an immensely laborious and time-consuming task. This is a challenge for image captioning, because the task heavily relies on large amounts of data, i.e., it is a data-hungry task.

In this work, we present a novel way of leveraging unpaired image and caption data to train data-hungry image captioning neural networks. We consider a scenario in which we have small-scale paired image-caption data of a specific domain. We are motivated by the fact that images can be easily obtained from the web and captions can be easily augmented and synthesized by replacing or adding different words in given sentences, as done in [Zhang et al., 2015]. Moreover, given a sufficient amount of descriptive captions, it is easy to crawl corresponding but noisy images through Google or Flickr image databases [Thomee et al., 2016] to build an image corpus. In this way, we can easily construct a scalable unpaired dataset of images and captions, which requires no (or at least minimal) human effort.

Due to the unpaired nature of images (input) and captions (output supervision), the


conventional supervision loss can no longer be used directly. We propose to algorithmically assign supervision labels, termed pseudo-labels, to make unpaired data paired. The pseudo-label is used as a learned supervision label. To develop the mechanism of pseudo-labeling, we use a generative adversarial network (GAN) [Goodfellow et al., 2014] to search for pseudo-labels for unpaired data. That is, in order to find the appropriate pseudo-labels for unpaired samples, we utilize an adversarial training method for training a discriminator model. Thereby, the discriminator learns to distinguish between real and fake image-caption pairs, to retrieve pseudo-labels, and to enrich the captioner training.

Our main contributions are summarized as follows. (1) We propose a novel framework for training an image captioner with unpaired image-caption data and a small amount of paired data. (2) In order to facilitate training with unpaired data, we devise a new semi-supervised learning approach through a novel usage of the GAN discriminator. (3) We construct our scarcely-paired COCO dataset, which is a modified version of the MS COCO dataset without pairing information. On this dataset, we show the effectiveness of our method in various challenging setups, compared to strong competing methods.

2 Related Work

The goal of our work is to deal with unpaired image-caption data for image captioning. Therefore, we mainly focus on the image captioning and unpaired data handling literature.

Data Issues in Image Captioning. Since the introduction of large-scale datasets such as the MS COCO data [Lin et al., 2014], image captioning has been extensively studied in the vision and language community [Anderson et al., 2018; Rennie et al., 2017; Vinyals et al., 2015; Xu et al., 2015] by virtue of the advancement of deep convolutional neural networks [Krizhevsky et al., 2012]. As neural network architectures become more advanced, they require a much larger dataset scale for generalization [Shalev-Shwartz and Ben-David, 2014]. Despite the extensive study on better network architectures, the data issues for image captioning, such as noisy data, partially missing data, and unpaired data, have received little attention.

Unpaired image-caption data has only recently been discussed. Gu et al. [2018] introduce a third modality, Chinese captions, for language pivoting [Utiyama and Isahara, 2007; Wu and Wang, 2007]. Feng et al. [2019] propose an unpaired captioning framework which trains a model without image or sentence labels by learning a visual concept detector with external data, the OpenImages dataset [Krasin et al., 2017]. Chen et al. [2017] approach image captioning as domain adaptation by utilizing the large paired MS COCO data as the source domain and adapting on a separate unpaired image or caption dataset as the target domain. Liu et al. [2018] use self-retrieval to facilitate training a captioning model with partially labeled data, where the self-retrieval module tries to retrieve corresponding images based on the generated captions. As a separate line of work, there are novel object captioning methods [Anne Hendricks et al., 2016; Venugopalan et al., 2017] that additionally exploit unpaired image and caption data corresponding to a novel word.

Most of the aforementioned works [Gu et al., 2018; Anne Hendricks et al., 2016; Venugopalan et al., 2017; Feng et al., 2019] exploit large auxiliary supervised data, often beyond image-caption data. To the best of our knowledge, we are the first to study how to handle unpaired image and caption data for image captioning without any auxiliary information, by leveraging scarce paired image-caption data only. Although [Chen et al., 2017] does not use auxiliary information either, it requires large amounts of paired source data, a data regime different from ours. The same holds for [Liu et al., 2018], which uses the full paired MS COCO caption dataset with an additional large unlabeled image set. Our method operates in a very scarce paired source data regime, whose scale is only 1% of the COCO dataset.

Multi-modality in Unpaired Data Handling. By virtue of the advancement of generative modeling techniques, e.g., GAN [Goodfellow et al., 2014], multi-modal translation has recently emerged as a popular field. Among many possible modalities, image-to-image translation between two different (and unpaired) domains has been the most explored. To tackle this issue, the cycle-consistency constraint between unpaired data is exploited in CycleGAN [Zhu et al., 2017] and DiscoGAN [Kim et al., 2017], and it is further improved in UNIT [Liu et al., 2017].

In this work, we regard image captioning as a multi-modal translation. Our work has similar


motivation to the unpaired image-to-image translation [Zhu et al., 2017] and unsupervised machine translation [Artetxe et al., 2018; Lample et al., 2018a,b] works, or machine translation with monolingual data [Zhang et al., 2018]. However, we show that the cycle-consistency does not work in our problem setup due to a significant modality gap. Instead, our results suggest that the traditional label propagation based semi-supervised framework [Zhou et al., 2004] is more effective for our task.

Semi-supervised Learning. Our method is motivated by generative model based semi-supervised learning [Chongxuan et al., 2017; Gan et al., 2017; Kingma et al., 2014], which mostly deals with classification labels. In contrast, we leverage caption data and devise a novel model to train the captioning model.

3 Proposed Framework

In this section, we first briefly review standard image caption learning and describe how we can leverage the unpaired dataset. Then, we introduce an adversarial learning method for obtaining a GAN model that is used for assigning pseudo-labels and for encouraging the latent feature distributions of images and captions to match. Lastly, we describe a connection to CycleGAN.

3.1 Model

Let us denote a dataset with $N_p$ image-caption pairs as $D_p = \{(x_i, y_i)\}_{i=1}^{N_p}$. A typical image captioning task is defined as follows: given an image $x_i$, we want a model to generate a caption $y_i$ that best describes the image. Traditionally, a captioning model is trained on a large paired dataset $(x, y) \in D_p$, e.g., the MS COCO dataset, by minimizing the negative log-likelihood against the ground-truth caption as follows:

$$\min_\theta \sum_{(x,y)\in D_p} L_{CE}(y, x; \theta), \qquad (1)$$

where $L_{CE} = -\log q_\theta(y|x)$ denotes the cross-entropy loss, $\theta$ the set of learnable parameters, and $q(\cdot)$ the probability output of the model. Motivated by the encoder-decoder based neural machine translation literature [Cho et al., 2014], traditional captioning frameworks are typically implemented as an encoder-decoder architecture [Vinyals et al., 2015], i.e., CNN-RNN. The CNN encoder $F(x; \theta_{enc})$ produces a latent feature vector $z^x$ from a given input image $x$, and the RNN decoder $H(z^x; \theta_{dec})$ follows to generate a caption $y$ in natural language form from $z^x$, as depicted in Fig. 2. The term $q(\cdot)$ in Eq. (1) is typically implemented as the sum of the cross entropy of each word token. For simplicity, we omit $\theta$ from here on when deducible, e.g., $H(z^x)$.
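As a concrete illustration of this encoder-decoder formulation and the cross-entropy objective of Eq. (1), the following is a minimal PyTorch sketch; it is not the authors' released code, and the module names, layer sizes, and teacher-forcing input convention are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CaptioningModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=1024, hidden_dim=1024, feat_dim=2048):
        super().__init__()
        cnn = models.resnet101(pretrained=True)             # ImageNet-initialized backbone (cf. Sec. 4.1)
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])  # F(x; theta_enc)
        self.feat_proj = nn.Linear(feat_dim, hidden_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)  # H(z^x; theta_dec)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        z_x = self.encoder(images).flatten(1)               # latent image feature z^x, (B, 2048)
        h0 = torch.tanh(self.feat_proj(z_x)).unsqueeze(0)   # initialize the LSTM state from z^x
        c0 = torch.zeros_like(h0)
        emb = self.embed(captions[:, :-1])                  # teacher forcing: <bos> w_1 ... w_{T-1}
        hidden, _ = self.decoder(emb, (h0, c0))
        return self.out(hidden)                             # (B, T-1, vocab) word logits

def caption_ce_loss(model, images, captions, pad_idx=0):
    # L_CE(y, x; theta) = -log q_theta(y | x), summed over word tokens as in Eq. (1)
    logits = model(images, captions)
    targets = captions[:, 1:]
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1),
        ignore_index=pad_idx, reduction="sum") / images.size(0)
```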

Learning with Unpaired Data. Our problem deals with unpaired data samples, where the image and caption sets $D_u^x = \{x_i\}_{i=0}^{N_x}$ and $D_u^y = \{y_i\}_{i=0}^{N_y}$ are not paired. Given the unpaired datasets, due to the missing supervision, the loss in Eq. (1) cannot be directly computed. Motivated by the nearest-neighbor-aware semi-supervised framework [Shi et al., 2018], we artificially generate pseudo-labels for the respective unpaired datasets.

Specifically, we retrieve the best-matching caption $y_i$ in $D_u^y$ given a query image $x_i$ and assign it as a pseudo-label, and vice versa ($x_i$ for $y_i$). We abuse the pseudo-label as a function for simplicity, e.g., $y_i = y(x_i)$. To retrieve a semantically meaningful match, we need a measure to assess matches. We use a discriminator network, trained as part of a GAN, to determine real or fake pairs, which we will describe later. With the retrieved pseudo-labels, we can now compute Eq. (1) by modifying it as:

$$\min_\theta \; \lambda_x \sum_{x\in D_u^x} L_{CE}(y(x), x; \theta) + \lambda_y \sum_{y\in D_u^y} L_{CE}(y, x(y); \theta), \qquad (2)$$

where $\lambda_{\{\cdot\}}$ denote the balance parameters.

Discriminator Learning by Unpaired Feature Matching. We train the criterion to find a semantically meaningful match, so that pseudo-labels for each modality are effectively retrieved. We learn a discriminator for this purpose by using the small amount of paired supervised data.

We adopt a caption encoder, $G(y; \theta_{cap})$, which embeds the caption $y$ into a feature $z^y$. This is implemented with a single-layer LSTM, and we take the output of the last time step as the caption representation $z^y$. Likewise, given an image $x$, we obtain $z^x$ by the image encoder $F(x; \theta_{enc})$. Now, we have a comparable feature space of $z^x$ and $z^y$. We utilize the discriminator to distinguish whether the pair $(z^x, z^y)$ comes from true paired data $(x, y) \in D_p$, i.e., whether the pair belongs to the real distribution $p(x, y)$ or not. We could use random pairs of $(x, y)$ independently sampled from the respective unpaired datasets, but we found this detrimental due to uninformative pairs. Instead, we conditionally


Figure 2: Description of the proposed method. Dotted arrows denote the path of the gradients via back-propagation. Given any image and caption pair, CNN and RNN (LSTM) encoders encode the input image and caption into the respective feature spaces. A discriminator ($D$) is trained to discriminate whether the given feature pairs are real or fake, while the encoders are trained to fool the discriminator. The learned discriminator is also used to assign the most likely pseudo-labels to unpaired samples through the pseudo-label search module.

synthesize $z^x$ or $z^y$ to form a synthesized pair that appears to be as realistic as possible. We use the feature transformer networks $\tilde{z}^y = T_{v\to c}(z^x)$ and $\tilde{z}^x = T_{c\to v}(z^y)$, where $v\to c$ denotes the mapping from visual data to caption data and vice versa, and $\tilde{z}$ denotes the conditionally synthesized feature. $\{T\}$ are implemented as multi-layer perceptrons with four FC layers with ReLU nonlinearity.
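A minimal sketch of how the feature transformers $\{T\}$ (four FC layers with ReLU, as stated above) and the pair discriminator $D(z^x, z^y)$ could be instantiated is given below; the layer widths and class names are assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

def make_transformer(in_dim=1024, out_dim=1024, hidden=1024):
    # T_{v->c} or T_{c->v}: a 4-layer MLP mapping one modality's feature into the other's space
    dims = [in_dim, hidden, hidden, hidden, out_dim]
    layers = []
    for i in range(4):
        layers.append(nn.Linear(dims[i], dims[i + 1]))
        if i < 3:
            layers.append(nn.ReLU(inplace=True))
    return nn.Sequential(*layers)

class PairDiscriminator(nn.Module):
    # D(z^x, z^y): scores whether an (image feature, caption feature) pair looks real
    def __init__(self, feat_dim=1024, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 1), nn.Sigmoid())   # output in [0, 1]; reused later as a confidence score

    def forward(self, z_x, z_y):
        return self.net(torch.cat([z_x, z_y], dim=1)).squeeze(1)
```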

The discriminator $D(\cdot, \cdot)$ learns to distinguish features as real or not. At the same time, the other associated networks $F$, $G$, $T_{\{\cdot\}}$ are learned to fool the discriminator by matching the distribution of paired and unpaired data. Motivated by [Chongxuan et al., 2017], the formulation of this adversarial training is as follows:

$$\min_{\theta,\{T\}} \max_D \; \mathbb{E}_{(z^x,z^y)\sim (F,G)\circ D_p}\big[\log(D(z^x, z^y)) + L_{reg}(z^x, z^y, \{T\})\big] + \tfrac{1}{2}\Big(\mathbb{E}_{x\sim p(x)}\big[\log(1 - D(F(x), T_{v\to c}(F(x))))\big] + \mathbb{E}_{y\sim p(y)}\big[\log(1 - D(T_{c\to v}(G(y)), G(y)))\big]\Big), \qquad (3)$$

where $L_{reg}(z^x, z^y, \{T\}) = \lambda_{reg}\big(\| T_{v\to c}(z^x) - z^y \|_F^2 + \| z^x - T_{c\to v}(z^y) \|_F^2\big)$,

and we use the distribution notation $p(\cdot)$ to flexibly refer to any type of the dataset regardless of $D_p$ and $D_u$. Note that the first log term is not used for updating any learnable parameters related to $\theta, \{T\}$, but only for updating $D$. Through alternating training of the discriminator ($D$) and the generator ($\theta, \{T\}$), similar to [Kim et al., 2018a], the latent feature distributions of paired and unpaired data should become close to each other, i.e., $p(z^x, z^y) = p(z^x, T_{v\to c}(z^x)) = p(T_{c\to v}(z^y), z^y)$ (refer to the proof in [Chongxuan et al., 2017]). In addition, as the generator is trained, the decision boundary of the discriminator tightens. If the unpaired datasets are sufficiently large such that semantically meaningful matches exist between the different modality datasets, and if the discriminator $D$ is plausibly learned, we can use $D$ to retrieve a proper pseudo-label. Our architecture is illustrated in Fig. 2.

Pseudo-label Assignment. Given an image $x \in D_u^x$, we retrieve the caption in the unpaired dataset, i.e., $y \in D_u^y$, that has the highest score obtained by the discriminator, i.e., the most likely caption to be paired with the given image:

$$y_i = y(x_i) = \arg\max_{y\in D_u^y} D\big(F(x_i), G(y)\big), \qquad (4)$$

and vice versa for unpaired captions:

$$x_i = x(y_i) = \arg\max_{x\in D_u^x} D\big(F(x), G(y_i)\big). \qquad (5)$$

By this retrieval process over all the unpaired data, we have image-caption pairs $\{(x_i, y_i)\}$ from the paired data and the pairs with pseudo-labels $\{(x_j, y_j)\}$ and $\{(x_k, y_k)\}$ from the unpaired data.
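A hedged sketch of the pseudo-label search of Eqs. (4)-(5): every candidate caption feature in a pool is scored by the discriminator and the argmax is taken (the symmetric image retrieval is analogous). Tensor shapes and variable names are assumptions.

```python
import torch

@torch.no_grad()
def retrieve_pseudo_captions(D, z_x_batch, z_y_pool):
    # z_x_batch: (B, d) unpaired image features F(x); z_y_pool: (M, d) candidate caption features G(y)
    B, M = z_x_batch.size(0), z_y_pool.size(0)
    zx = z_x_batch.unsqueeze(1).expand(B, M, -1).reshape(B * M, -1)
    zy = z_y_pool.unsqueeze(0).expand(B, M, -1).reshape(B * M, -1)
    scores = D(zx, zy).view(B, M)      # D(F(x), G(y)) for every candidate pairing
    conf, idx = scores.max(dim=1)      # Eq. (4): best-matching caption index per image
    return idx, conf                   # conf is later reused as the confidence score alpha
```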


However, these pseudo-labels are not noise-free, so treating them equally with the paired ones is detrimental. Motivated by learning with noisy labels [Lee et al., 2018; Wang et al., 2018b], we define a confidence score for each of the assigned pseudo-labels. We use the output score from the discriminator as the confidence score, i.e., $\alpha_i^x = D(x_i, y(x_i))$ and $\alpha_i^y = D(x(y_i), y_i)$, where we denote $D(x, y) = D(F(x), G(y))$, and $\alpha \in [0, 1]$ due to the sigmoid at the final layer of the discriminator. We thus utilize the confidence scores to weight the unpaired samples, and compute the weighted loss as follows:

$$\min_\theta \sum_{(x,y)\in D_p} L_{CE}(y, x; \theta) + \lambda_x \sum_{x_i\in D_u^x} \alpha_i^x\, L_{CE}(y(x_i), x_i; \theta) + \lambda_y \sum_{y_i\in D_u^y} \alpha_i^y\, L_{CE}(y_i, x(y_i); \theta). \qquad (6)$$
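The confidence-weighted objective of Eq. (6) could be sketched as follows, assuming the per-sample cross-entropy values and the discriminator confidences are already computed; the helper name is hypothetical.

```python
def weighted_caption_loss(ce_paired, ce_pseudo_img, alpha_x, ce_pseudo_cap, alpha_y,
                          lambda_x=0.1, lambda_y=0.1):
    # ce_*: per-sample cross-entropy values; alpha_*: discriminator confidences in [0, 1].
    # Paired samples keep full weight; pseudo-labeled samples are down-weighted by alpha (Eq. (6)).
    return (ce_paired.sum()
            + lambda_x * (alpha_x * ce_pseudo_img).sum()
            + lambda_y * (alpha_y * ce_pseudo_cap).sum())
```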

We jointly train the model on both paired and unpaired data. To further ease the training, we add an additional triplet loss function:

$$\sum_{(x_p,y_p)\in D_p,\; x_u\in D_u^x,\; y_u\in D_u^y} \log\frac{p_\theta(y_p|x_p)}{p_\theta(y_p|x_u)} + \log\frac{p_\theta(y_p|x_p)}{p_\theta(y_u|x_p)}, \qquad (7)$$

by regarding random unpaired samples as negatives. This slightly improves the performance.
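One possible reading of the triplet term of Eq. (7) is sketched below, contrasting the paired log-likelihood against mismatched pairs. The sign convention (returning the negative log-ratios so the term can be minimized together with the other losses) is an assumption of this sketch rather than a statement of the authors' exact formulation.

```python
def triplet_term(logp_pp, logp_pu, logp_up):
    # logp_pp = log p_theta(y_p | x_p): paired caption given its paired image
    # logp_pu = log p_theta(y_p | x_u): paired caption given a random unpaired image (negative)
    # logp_up = log p_theta(y_u | x_p): random unpaired caption given the paired image (negative)
    return -((logp_pp - logp_pu) + (logp_pp - logp_up)).mean()
```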

3.2 Connection with CycleGAN

As a representative fully unpaired method, CycleGAN [Zhu et al., 2017] is a strong baseline at this point, as it has been popularly used for unpaired distribution matching. Since it is designed for image-to-image translation, we describe the modifications made to fit our task, so as to understand the relative performance of our method over the CycleGAN baseline. When applying the cycle-consistency to matching between images and captions, since the input modalities are totally different, we modify it into a translation problem over feature spaces as follows:

$$\min_{\theta,\{T\}} \max_{D_{\{x,y\}}} \; L_{cycle}(\{T\}) + \mathbb{E}_{x\sim p(x)}\big[\log(D_x(F(x))) + \log(1 - D_x(T_{v\to c}(F(x))))\big] + \mathbb{E}_{y\sim p(y)}\big[\log(D_y(G(y))) + \log(1 - D_y(T_{c\to v}(G(y))))\big], \qquad (8)$$

where $\{T\}$ and $D_{\{x,y\}}$ denote the feature translators and the discriminators for the image and caption domains, and

$$L_{cycle}(\{T\}) = \mathbb{E}_{x\sim p(x)}\big[\| T_{c\to v}(T_{v\to c}(F(x))) - F(x) \|^2\big] + \mathbb{E}_{y\sim p(y)}\big[\| T_{v\to c}(T_{c\to v}(G(y))) - G(y) \|^2\big]. \qquad (9)$$
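For reference, the feature-space cycle-consistency term of Eq. (9) used by the CycleGAN baseline could be written as the short sketch below, reusing the encoders and translators assumed in the earlier sketches.

```python
def cycle_consistency_loss(F, G, T_v2c, T_c2v, images, captions):
    # Eq. (9): translate a feature into the other modality's space and back, then penalize the drift
    zx, zy = F(images), G(captions)
    loss_x = (T_c2v(T_v2c(zx)) - zx).pow(2).sum(1).mean()   # image -> caption -> image feature
    loss_y = (T_v2c(T_c2v(zy)) - zy).pow(2).sum(1).mean()   # caption -> image -> caption feature
    return loss_x + loss_y
```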

The discriminators are learned to distinguish whether the latent features come from the image or the caption distribution. This differs from our method, in that we distinguish the correct semantic match of pairs. We experimented with CycleGAN purely on unpaired datasets but were not able to train it; hence, we add the supervised loss (Eq. (1)) with the paired data, which is a fair setting with ours.

4 Experiments

In this section, we describe the experimental setups and competing methods, and provide performance evaluations of unpaired captioning with both quantitative and qualitative results.

4.1 Implementation Details

We implement our neural networks using the PyTorch library [Paszke et al., 2017]. We use the ResNet101 [He et al., 2016] model as a backbone CNN encoder for image input, and initialize it with weights pre-trained on ImageNet [Russakovsky et al., 2015]. Each region is represented as a 2,048-dimensional output from the last residual block of this network. For the NIC model [Vinyals et al., 2015] and the Att2in model [Rennie et al., 2017] in Table 3, we use visual features of size 2048 × 7 × 7. For Up-Down [Anderson et al., 2018] and the model in Table 3, we use Faster R-CNN [Ren et al., 2015] pre-trained on the Visual Genome [Krishna et al., 2017] object dataset, from which we extract visual features for 10 to 100 regions.

We set the channel size of the hidden layers of the LSTMs to 1024, 512 for the attention layer, and 1024 for the word embeddings. For the inference stage, we empirically choose a beam size of 3 when generating a description, which shows the best performance.

We use a minibatch size of 100 and the Adam [Ba and Kingma, 2015] optimizer for training (learning rate $lr = 5\mathrm{e}{-4}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$). For the hyper-parameters, we empirically set $\lambda_x$ and $\lambda_y$ equal to 0.1. The total loss function for training our model is as follows:

$$L = L_{cap} + 0.1\, L_{GAN} + 0.1\, L_{triplet}, \qquad (10)$$

where $L_{cap}$ denotes the captioning loss defined in Eq. (6), $L_{GAN}$ the loss for adversarial training defined in Eq. (3), and $L_{triplet}$ the triplet loss defined in Eq. (7).
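A small sketch of how Eq. (10) and the reported optimizer settings might be wired together; the loss values are assumed to come from the earlier sketches, and the helper names are hypothetical.

```python
import torch

def total_loss(loss_cap, loss_gan, loss_triplet):
    # L = L_cap + 0.1 * L_GAN + 0.1 * L_triplet  (Eq. (10))
    return loss_cap + 0.1 * loss_gan + 0.1 * loss_triplet

def make_optimizer(parameters):
    # Adam with lr = 5e-4, betas = (0.9, 0.999); minibatch size 100 (Sec. 4.1)
    return torch.optim.Adam(parameters, lr=5e-4, betas=(0.9, 0.999))
```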


                             (A)  (B)  (C)  BLEU1  BLEU4  METEOR  CIDEr
Fully paired (100%)                          72.5   29.4   25.0    95.7
Paired only (1%)                             58.1   13.4   15.9    36.0
CycleGAN [Zhu et al., 2017]                  58.7   14.1   16.2    37.7
Ours ver1                     X              60.1   15.7   16.5    45.3
Ours ver2                     X    X         61.3   17.1   19.1    51.7
Ours (final)                  X    X    X    63.0   18.7   20.7    55.2
[Gu et al., 2018]                            46.2    5.4   13.2    17.7
[Feng et al., 2019]                          58.9   18.6   17.9    54.9

Table 1: Captioning performance comparison on the MS COCO test set. The "Paired only" baseline is trained only with 1% of paired data from our scarcely-paired COCO dataset, while CycleGAN and Ours {ver1, ver2, final} are trained with our scarcely-paired COCO dataset (1% of paired data plus the unpaired image and caption datasets). We indicate the ablation study by: (A) the usage of the proposed GAN that distinguishes real or fake image-caption pairs, (B) pseudo-labeling, and (C) noise handling by sample re-weighting. We also compare with [Gu et al., 2018] and [Feng et al., 2019], which are trained with unpaired datasets.

4.2 Experimental Setups

We utilize MS COCO caption [Lin et al., 2014] as our target dataset, which contains 123k images with 5 caption labels per image. To validate our model, we follow the Karpathy splits [Karpathy and Fei-Fei, 2015], which have been broadly used in the image captioning literature. The Karpathy splits contain 113k training, 5k validation, and 5k test images in total. In our experiments, in order to simulate the scenario where both paired and unpaired data exist, we use two different data setups: 1) the proposed scarcely-paired COCO, and 2) the partially labeled COCO [Liu et al., 2018] setup.

For the scarcely-paired COCO setup, we remove the pairing information of the MS COCO dataset while leaving a small fraction of pairs unaltered. We randomly select only one percent of the total data, i.e., 1,133 training images, as the paired data, and remove the pairing information of the rest to obtain unpaired data. We refer to the small fraction of samples given as pairs as paired data ($D_p$) and to the samples without pairing information as unpaired data ($D_u$). We study the effects of other ratios of paired data used for training (in Figs. 3 and 4). This dataset allows evaluating whether the proposed framework can leverage such small paired data to learn a plausible pseudo-label assignment, and what performance can be achieved compared to the fully supervised case.
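A purely illustrative sketch of constructing such a scarcely-paired split (keep a random ~1% of pairs as $D_p$, strip the pairing from the rest); the function and variable names are assumptions.

```python
import random

def make_scarcely_paired_split(pairs, paired_ratio=0.01, seed=0):
    # pairs: list of (image_id, caption) tuples from the COCO training split
    rng = random.Random(seed)
    shuffled = pairs[:]
    rng.shuffle(shuffled)
    n_paired = max(1, int(len(shuffled) * paired_ratio))
    D_p = shuffled[:n_paired]                 # small paired set (about 1%)
    rest = shuffled[n_paired:]
    D_u_x = [img for img, _ in rest]          # unpaired images
    D_u_y = [cap for _, cap in rest]          # unpaired captions (pairing information discarded)
    rng.shuffle(D_u_y)                        # remove residual alignment between the two lists
    return D_p, D_u_x, D_u_y
```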

For the partially labeled COCO setup, we follow [Liu et al., 2018] and use the whole MS COCO data (paired) and add the "Unlabeled-COCO"

[Figure 3 plots CIDEr against the ratio of paired data used for training.]
Figure 3: Performance w.r.t. the amount of paired data used for training. Baseline denotes our Paired Only baseline, Ours is our final model, and Reference is Paired Only trained with the full paired data.

split of the official MS COCO [Lin et al., 2014] as unpaired images, which involves 123k images without any caption label. To compute the cross-entropy loss, we use the pseudo-label assignment for the Unlabeled-COCO images.

For evaluation, we use the following metrics conventionally used in image captioning: BLEU [Papineni et al., 2002], ROUGE-L [Lin, 2004], SPICE [Anderson et al., 2016], METEOR [Denkowski and Lavie, 2014], and CIDEr [Vedantam et al., 2015]. All evaluations are done on the MS COCO test set.

In order to avoid the high time complexity of the pseudo-labeling process, we do not search for pseudo-labels over the whole unpaired dataset. Pseudo-label retrieval is done on a subset of one-hundredth of the unpaired data (yielding about 1,000 samples). Thus, the complexity of each minibatch becomes $O(B \times N \times 0.01)$, where $B$ denotes the minibatch size and $N$ the size of the unpaired dataset. Since we also apply label-noise handling, a plausible pseudo-label assignment is sufficient to help the learning process. Performance could be further improved by using a larger subset or a fast approximate nearest neighbor search.
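The per-minibatch retrieval described above might be sketched as follows, sampling a 1% candidate subset before calling the pseudo-label search from the earlier sketch; encode_fn and the other names are hypothetical helpers.

```python
import torch

@torch.no_grad()
def retrieve_on_subset(D, G, z_x_batch, unpaired_captions, encode_fn, subset_ratio=0.01):
    # Sample roughly 1% of the unpaired captions as the candidate pool for this minibatch,
    # giving O(B x N x 0.01) discriminator evaluations instead of O(B x N).
    M = max(1, int(len(unpaired_captions) * subset_ratio))
    idx = torch.randperm(len(unpaired_captions))[:M]
    candidates = [unpaired_captions[i] for i in idx.tolist()]
    z_y_pool = encode_fn(G, candidates)       # (M, d) caption features from the caption encoder
    best_idx, conf = retrieve_pseudo_captions(D, z_x_batch, z_y_pool)
    return [candidates[i] for i in best_idx.tolist()], conf
```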

4.3 Evaluation on Scarcely-paired COCO

We follow the same setting as [Vinyals et al., 2015] unless mentioned otherwise. Table 1 shows the results on the scarcely-paired COCO dataset. We compare with several baselines: Paired Only, where we train our model only on the small fraction (1%) of the paired data, and CycleGAN, where we train our model with the cycle-consistency loss [Zhu et al., 2017] (as in Eq. (8)). Additionally, we train variants of our model denoted as Ours (ver1, ver2, and final). The model denoted as Ours ver1 is trained


Figure 4: Sampled qualitative results of our model. We compare with the baseline models trained only on N% of paired samples out of the full MS COCO. Despite using only 1% paired data, our model generates plausible captions similar to those of the baseline models trained with more data (10% and above).

via our GAN model (Eq. (3)) that distinguishes real or fake image-caption pairs. For Ours ver2, we add training with pseudo-labels for unpaired data using Eq. (6), while setting the confidence scores $\alpha^x = \alpha^y = 1$ for all training samples. For our final model, denoted as Ours (final), we apply the noise handling technique with the confidence scores $\alpha^x$ and $\alpha^y$ of Eq. (6), re-weighting the loss for each sample. We present the performance of the fully supervised (Fully paired) model using 100% of the COCO training data for reference.

As shown in Table 1, in a scarce data regime, utilizing the unpaired data improves the captioning performance in terms of all metrics by noticeable margins. Also, our models show favorable performance compared to the CycleGAN model in all metrics. Our final model with pseudo-labels and noise handling achieves the best performance in all metrics across all competitors; hereafter, we refer to this model as our model.

We also compare with the recent unpaired image captioning methods [Gu et al., 2018; Feng et al., 2019] in Table 1. Both methods are evaluated on the MS COCO test set. In the case of Gu et al., the AIC-ICC image-to-Chinese dataset [Wu et al., 2017] is used as the unpaired images $D_u^x$ and captions from MS COCO are used as the unpaired captions $D_u^y$. Note that this is not a fair comparison to our method, as they have more advantages: Gu et al. use a large amount of additional labeled data (10M Chinese-English parallel sentences of the AIC-MT dataset [Wu et al., 2017]) and Feng et al. use 36M samples of the additional OpenImages dataset, whereas our model only uses a small amount of paired samples (1k) and 122k unpaired data. Despite far lower reliance on paired data, our model shows favorable performance against recent unpaired captioning works.

We compare our final model against our Paired Only baseline over varying amounts of paired training data in Fig. 3, so that we can see how much information can be gained from the unpaired data. From 100% down to 10%, as the amount of paired samples decreases, the fluency and accuracy of the descriptions get worse. In particular, we observe that most of the captions generated by the Paired Only baseline trained with 10% of paired data (11,329 pairs) show erroneous grammatical structure. In contrast, by leveraging unpaired data, our method can generate more fluent and accurate captions compared to Paired Only trained on the same amount of paired data. It is worthwhile to note that our model trained with 60% of paired data (67,972 pairs) achieves similar performance to the Paired Only baseline trained with fully paired data (113,287 pairs), which


                                                    BLEU1  BLEU2  BLEU3  BLEU4  ROUGE-L  SPICE  METEOR  CIDEr
Self-Retrieval [Liu et al., 2018] (w/o unlabeled)    79.8   62.3   47.1   34.9    56.6    20.5   27.5   114.6
Self-Retrieval [Liu et al., 2018] (with unlabeled)   80.1   63.1   48.0   35.8    57.0    21.0   27.4   117.1
Baseline (w/o unlabeled)                             77.7   61.6   46.9   36.2    56.8    20.0   26.7   114.2
Ours (w/o unlabeled)                                 80.8   65.3   49.9   37.6    58.7    22.7   28.4   122.6
Ours (with unlabeled)                                81.2   66.0   50.9   39.1    59.4    23.8   29.4   125.5

Table 2: Comparison with the semi-supervised image captioning method "Self-Retrieval" [Liu et al., 2018]. Our method shows improved performance even without Unlabeled-COCO data (denoted as w/o unlabeled) as well as with Unlabeled-COCO (with unlabeled), although our model was not originally proposed for this scenario.

[Figure examples: "a large bear standing on a grass." / "a bus is parked on a street." / "a motorcycle parked on a street."]
Figure 5: Examples of the pseudo-labels (captions) assigned to the unpaired images. Our model is able to sufficiently assign image-caption pairs through the proposed adversarial training.

signifies that our method is able to save nearly half of the human labeling effort of constructing a dataset.

We also show qualitative samples of our results in Fig. 4. The Paired Only baseline trained with 1% paired data produces erroneous captions, and the baseline with 10% paired data starts to produce plausible captions, based on ten times more paired samples than our model, which uses only 1% of them. We highlight that, in the two examples on the top row of Fig. 4, our model generates more accurate captions than the Paired Only baseline trained on the 100% paired data ("baseball" vs. "Frisbee" on the top-left, and "man" vs. "woman" on the top-right). This suggests that unpaired data combined with our method effectively boosts the performance, especially when paired data is scarce.

In order to demonstrate the meaningfulness of our pseudo-label assignment, we show the pseudo-labels assigned to the unpaired samples. We show the pseudo-labels (captions) assigned to unlabeled images from the Unlabeled-COCO images in Fig. 5. As shown in the figure, despite not knowing the real pairs of these images, plausible pseudo-labels are assigned by the model. Note that even though there is no ground-truth caption for the unlabeled images in the search pool, the model is able to find the most likely (semantically correlated) image-caption pair for the given images.

Fig. 6 highlights interesting samples of our caption generation, where the results contain words that do not exist in the paired data of the scarcely-paired COCO dataset. The semantic meaning of words such as "growling,"

[Figure examples: "Two bears growling in the water." / "A broccoli grows on a green plant." / "Cows being herded by a fence in a field."]
Figure 6: Example captions containing words that do not exist in the paired dataset $D_p$. The novel words that are not in $D_p$ but in $D_u^y$ are highlighted in bold.

"growing," and "herded," which appear in the unpaired caption data but not in the paired data, may have been learned properly via pseudo-label assignment during training. These examples suggest that our method is capable of inferring the semantic meaning of unpaired words to some extent, which could not have been learned with only the little paired data. This is evidence that our framework is capable of aligning abstract semantic spaces between the two modalities, i.e., visual and caption.

4.4 Evaluation on Partially Labeled COCO

For a more realistic setup, we compare with the recent semi-supervised image captioning method called Self-Retrieval [Liu et al., 2018] in their problem regime, i.e., the "partially labeled" COCO setup, where the full paired MS COCO (113k) plus the 123k uncaptioned images of the Unlabeled-COCO set are used (no additional unpaired captions are used). While their regime is not our direct scope, we show the effectiveness of our method in this regime. Then, we extend our framework by replacing our backbone architecture with recent advanced image captioning architectures. In this setup, a separate unpaired caption dataset $D_u^y$ does not exist, so we use captions from the paired COCO data $D_p$ as pseudo-labels.

Table 2 shows the comparison with Self-Retrieval. For a fair comparison with them, we replace the cross-entropy loss in our objective with the policy gradient method [Rennie et al., 2017] to directly optimize our model with the CIDEr score, as in [Liu et al., 2018]. As our baseline


                                  BLEU1  BLEU2  BLEU3  BLEU4  ROUGE-L  SPICE  METEOR  CIDEr
NIC [Vinyals et al., 2015]         72.5   55.1   40.4   29.4    52.7    18.2   25.0    95.7
NIC + Ours                         74.2   56.9   42.0   31.0    54.1    19.4   28.8    97.8
Att2in [Rennie et al., 2017]       74.5   58.1   44.0   33.1    54.6    19.4   26.2   104.3
Att2in + Ours                      75.5   58.8   45.0   34.3    55.4    20.0   26.8   105.3
Up-Down [Anderson et al., 2018]    76.7   60.7   47.1   36.4    56.6    20.6   27.6   113.6
Up-Down + Ours                     77.3   64.1   47.6   37.0    57.1    21.0   27.8   114.6

Table 3: Evaluation of our method with different backbone architectures. All models are reproduced and trained with the cross-entropy loss. By adding Unlabeled-COCO images, our training method is applied in a semi-supervised way, which shows consistent improvement in all metrics.

model (denoted as Baseline), we train a model only with the policy gradient method, without the proposed GAN model. When only using the 100% paired MS COCO dataset (denoted as w/o unlabeled), our model already shows improved performance over Self-Retrieval. Moreover, when adding Unlabeled-COCO images (denoted as with unlabeled), our model performs favorably against Self-Retrieval in all metrics. The results suggest that our method is also advantageous in the semi-supervised setup.

To further validate our method in the semi-supervised setup, we compare different backbone architectures [Anderson et al., 2018; Rennie et al., 2017; Vinyals et al., 2015] in our framework, where these methods were developed for fully supervised learning. We use the same data setup as above, but we replace the CNN ($F$) and the LSTM ($H$) in our framework with the image encoder and the caption decoder from their image captioning models. Then, these models can be trained in our framework as is by alternating between the discriminator update and pseudo-labeling. Table 3 shows the comparison. Training with the additional Unlabeled-COCO data via our training scheme consistently improves all baselines in all metrics.

4.5 Captioning with Web-Crawled Data

To simulate a scenario involving data crawled from the web, we use the setup suggested by Feng et al. [2019]. They collect a sentence corpus by crawling image descriptions from Shutterstock^1 as the unpaired caption data $D_u^y$, whereby 2.2M sentences are collected. For the unpaired image data $D_u^x$, they use only the images from the MS COCO data, while the captions are not used for training. For training our method, we leverage 0.5% to 1% of paired MS COCO data as our paired data $D_p$. The results are shown in Table 4, along with the performance obtained by Feng et

^1 https://www.shutterstock.com

                            BLEU4  ROUGE-L  SPICE  METEOR  CIDEr
Paired only (0.5% paired)     4.4    33.7     3.6   10.8     8.6
Ours (0.5% paired)            5.4    34.6     4.2   12.0    10.5
Paired only (0.7% paired)     3.5    36.1     3.7   11.4     8.9
Ours (0.7% paired)            8.5    39.0     5.2   13.6    20.2
Paired only (0.8% paired)     8.8    39.1     5.9   13.2    21.9
Ours (0.8% paired)           12.2    41.6     7.6   15.1    29.0
Paired only (1% paired)      13.4    41.9     8.3   15.9    36.0
Ours (1% paired)             15.2    43.3     9.4   16.9    39.7
[Feng et al., 2019]           4.8    27.2     7.0   11.4    23.3

Table 4: Performance comparison with web-crawled data (Shutterstock). On top of the unpaired image and caption data, our method is trained with 0.5–1% of paired data, while Feng et al. use 36M additional images of the OpenImages dataset.

al. Note again that Feng et al. exploit external large-scale data, i.e., 36M images in the OpenImages dataset. With only 0.5% of paired data (566 pairs), our baseline shows lower scores in terms of BLEU4 and METEOR than Feng et al., while our proposed model shows comparable or favorable performance in BLEU4, ROUGE-L, and METEOR. Our method starts to achieve higher scores in all metrics from 0.8% of paired data (906 pairs), even without the additional 36M images.

5 Conclusion

We introduce a method to train an image captioning model with large-scale unpaired image and caption data, given a small amount of paired data. Our framework achieves favorable performance compared to various methods and setups. Unpaired captions and images are data that can easily be collected from the web. This can facilitate application-specific captioning models, where labeled data is scarce.

Acknowledgements. This work was supported by the Institute for Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2017-0-01780, The technology development for event recognition/relational reasoning and learning knowledge based system for video understanding).


References

Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. Spice: Semantic propositional image caption evaluation. In European Conference on Computer Vision (ECCV). Springer, 2016.

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and vqa. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

Lisa Anne Hendricks, Subhashini Venugopalan, Marcus Rohrbach, Raymond Mooney, Kate Saenko, and Trevor Darrell. Deep compositional captioning: Describing novel object categories without paired training data. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. Unsupervised neural machine translation. In International Conference on Learning Representations (ICLR), 2018.

Jimmy Ba and Diederik Kingma. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.

Tseng-Hung Chen, Yuan-Hong Liao, Ching-Yao Chuang, Wan-Ting Hsu, Jianlong Fu, and Min Sun. Show, adapt and tell: Adversarial training of cross-domain image captioner. In IEEE International Conference on Computer Vision (ICCV), 2017.

Kyunghyun Cho, Bart Van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.

Jinsoo Choi, Tae-Hyun Oh, and In So Kweon. Contextually customized video summaries via natural language. In IEEE Winter Conference on Applications of Computer Vision (WACV), 2018.

LI Chongxuan, Taufik Xu, Jun Zhu, and Bo Zhang. Triple generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS), 2017.

Michael Denkowski and Alon Lavie. Meteor universal: Language specific translation evaluation for any target language. In The Workshop on Statistical Machine Translation, 2014.

Yang Feng, Lin Ma, Wei Liu, and Jiebo Luo. Unsupervised image captioning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

Zhe Gan, Liqun Chen, Weiyao Wang, Yuchen Pu, Yizhe Zhang, Hao Liu, Chunyuan Li, and Lawrence Carin. Triangle generative adversarial networks. In Advances in Neural Information Processing Systems (NIPS), 2017.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS), 2014.

Jiuxiang Gu, Shafiq Joty, Jianfei Cai, and Gang Wang. Unpaired image captioning by language pivoting. In European Conference on Computer Vision (ECCV), 2018.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

Justin Johnson, Andrej Karpathy, and Li Fei-Fei. Densecap: Fully convolutional localization networks for dense captioning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, and Jiwon Kim. Learning to discover cross-domain relations with generative adversarial networks. In International Conference on Machine Learning (ICML), 2017.

Dong-Jin Kim, Jinsoo Choi, Tae-Hyun Oh, Youngjin Yoon, and In So Kweon. Disjoint multi-task learning between heterogeneous human-centric tasks. In IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2018.

Jinkyu Kim, Anna Rohrbach, Trevor Darrell, John Canny, and Zeynep Akata. Textual explanations for self-driving vehicles. In European Conference on Computer Vision (ECCV), 2018.

Dong-Jin Kim, Jinsoo Choi, Tae-Hyun Oh, and In So Kweon. Dense relational captioning: Triple-stream networks for relationship-based captioning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

Durk P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems (NIPS), pages 3581–3589, 2014.

Ivan Krasin, Tom Duerig, Neil Alldrin, Vittorio Ferrari, Sami Abu-El-Haija, Alina Kuznetsova, Hassan Rom, Jasper Uijlings, Stefan Popov, Andreas Veit, et al. Openimages: A public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://github.com/openimages, 2:3, 2017.


Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision (IJCV), 123(1):32–73, 2017.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), 2012.

Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. Unsupervised machine translation using monolingual corpora only. In International Conference on Learning Representations (ICLR), 2018.

Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, et al. Phrase-based & neural unsupervised machine translation. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018.

Kuang-Huei Lee, Xiaodong He, Lei Zhang, and Linjun Yang. Cleannet: Transfer learning for scalable image classifier training with label noise. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision (ECCV). Springer, 2014.

Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. Text Summarization Branches Out, 2004.

Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems (NIPS), 2017.

Xihui Liu, Hongsheng Li, Jing Shao, Dapeng Chen, and Xiaogang Wang. Show, tell and discriminate: Image captioning by self-retrieval with partially labeled data. In European Conference on Computer Vision (ECCV), 2018.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Association for Computational Linguistics (ACL). Association for Computational Linguistics, 2002.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017.

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), 2015.

Steven J Rennie, Etienne Marcheret, Youssef Mroueh, Jarret Ross, and Vaibhava Goel. Self-critical sequence training for image captioning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.

Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014.

Weiwei Shi, Yihong Gong, Chris Ding, Zhiheng Ma, Xiaoyu Tao, and Nanning Zheng. Transductive semi-supervised deep learning using min-max features. In European Conference on Computer Vision (ECCV), 2018.

Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. Yfcc100m: The new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016.

Masao Utiyama and Hitoshi Isahara. A comparison of pivot methods for phrase-based statistical machine translation. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2007.

Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

Subhashini Venugopalan, Lisa Anne Hendricks, Marcus Rohrbach, Raymond Mooney, Trevor Darrell, and Kate Saenko. Captioning images with diverse objects. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

Xin Wang, Wenhan Xiong, Hongmin Wang, and William Yang Wang. Look before you leap: Bridging model-free and model-based reinforcement learning for planned-ahead vision-and-language navigation. In European Conference on Computer Vision (ECCV), 2018.


Yisen Wang, Weiyang Liu, Xingjun Ma, James Bailey, Hongyuan Zha, Le Song, and Shu-Tao Xia. Iterative learning with open-set noisy labels. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

Hua Wu and Haifeng Wang. Pivot language approach for phrase-based statistical machine translation. Machine Translation, 21(3):165–181, 2007.

Jiahong Wu, He Zheng, Bo Zhao, Yixin Li, Baoming Yan, Rui Liang, Wenjia Wang, Shipei Zhou, Guosen Lin, Yanwei Fu, et al. Ai challenger: A large-scale dataset for going deeper in image understanding. arXiv preprint arXiv:1711.06475, 2017.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning (ICML), 2015.

Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems (NIPS), 2015.

Zhirui Zhang, Shujie Liu, Mu Li, Ming Zhou, and Enhong Chen. Joint training for neural machine translation models with monolingual data. In AAAI Conference on Artificial Intelligence (AAAI), 2018.

Dengyong Zhou, Olivier Bousquet, Thomas N Lal, Jason Weston, and Bernhard Scholkopf. Learning with local and global consistency. In Advances in Neural Information Processing Systems (NIPS), 2004.

Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE International Conference on Computer Vision (ICCV), 2017.