
Learning to Describe E-Commerce Images from Noisy Online Data

Takuya Yashima, Naoaki Okazaki, Kentaro Inui, Kota Yamaguchi, and Takayuki Okatani

Tohoku University, Sendai, Japan

Abstract. Recent studies show successful results in generating a proper language description for a given image, where the focus is on detecting and describing the contextual relationships in the image, such as the kind of object, the relationship between two objects, or the action. In this paper, we turn our attention to more subjective components of descriptions that contain rich expressions to modify objects, namely attribute expressions. We start by collecting a large amount of product images from the online market site Etsy, and consider learning a language generation model using a popular combination of a convolutional neural network (CNN) and a recurrent neural network (RNN). Our Etsy dataset contains unique noise characteristics that often arise in online markets. We first apply natural language processing techniques to extract high-quality, learnable examples from the real-world noisy data. We then learn a generation model from product images with associated title descriptions, and examine how e-commerce specific meta-data and fine-tuning improve the generated expressions. The experimental results suggest that we are able to learn from the noisy online data and produce product descriptions that are closer to man-made descriptions, including possibly subjective attribute expressions.

1 Introduction

Imagine you are a shop owner trying to sell a handmade miniature doll. How would you advertise your product? Giving a good description is probably one of the most effective strategies. For example, stating "Enchanting and unique fairy doll with walking stick, the perfect gift for children" would sound more appealing to customers than just stating "miniature doll for sale". In this paper, we consider automatically generating good natural language descriptions with rich and appealing expressions for product images.

Natural language generation has become a popular topic as the vision community makes significant progress in deep models that generate a word sequence given an image [27,12]. Existing generation attempts focus mostly on detecting and describing the contextual relationships in the image [18], such as the kind of object in the scene (e.g., a man on the beach) or the action of the subject in a scene (e.g., a man is holding a surfboard). In this paper, we turn our attention to generating proper descriptions for product images with rich attribute expressions.


Attributes have been extensively studied in the community [3,23,14]. The typical assumption is that there are visually recognizable attributes for which we can build a supervised dataset for recognition. However, as we deal with an open-world vocabulary on the web, we often face much more complex concepts consisting of phrases rather than single words. A plausible approach is to model attributes in terms of a language sequence instead of individual words. The challenge is that attribute expressions can be subjective and ambiguous. Attribute-rich expressions, such as antique copper flower decoration or enchanting and unique fairy doll, require higher-level judgement about the concept beyond lower-level appearance cues. Even humans do not always agree on the meaning of abstract concepts, such as coolness or cuteness. This concept ambiguity is a major challenge in building a large-scale corpus of conceptually obscure attributes [29,20].

We attempt to learn attribute expressions using large-scale e-commerce data. Product images on e-commerce websites typically depict a single object without much consideration of the contextual relationships within an image, in contrast to natural images [21,18,25]. Product descriptions must convey specific color, shape, pattern, material, or even subjective and abstract concepts from a single image with a short title, or with a longer description in the item detail for those interested in buying the product, e.g., Beautiful hand felted and heathered purple & fuschia wool bowl. Although e-commerce data look appealing in terms of data availability and scale, descriptions and meta-data such as tags contain web-specific noise due to the nature of online markets, such as fragmented texts for search optimization or distribution imbalance arising from shop owners. Naively learning a generation model results in poor product descriptions, e.g., made to order. In this paper, we apply natural language processing to extract images and texts suitable for learning a generation model.

Our language generation model is based on the popular image-to-sequence architecture consisting of a convolutional neural network (CNN) and a recurrent neural network (RNN) [27,12]. We learn a generation model using the product images and associated titles from the pre-processed dataset, and show that we are able to generate a reasonable description for a given product image. We also examine how e-commerce meta-data (product category) and optimization for the dataset (fine-tuning) affect the generation process. We annotate a handful of images using crowdsourcing and compare the quality of the generated attribute expressions using machine translation metrics. The results suggest that e-commerce meta-data together with fine-tuning generate product descriptions closer to human ones.

We summarize our contributions as follows.

– We propose to learn attributes in the form of natural language expressions, to deal with the exponentially many combinations of an open-world modifier vocabulary.

– We collect a large-scale dataset of product images from the online market Etsy, as well as human annotations of product descriptions via crowdsourcing for evaluation purposes. We release the data to the academic community1.

1 http://vision.is.tohoku.ac.jp/~kyamagu/research/etsy-dataset


– We propose a simple yet effective data cleansing approach to transform e-commerce data into a corpus suitable for learning.

– Our empirical study shows that our model can generate descriptions with attribute expressions using noisy e-commerce data. The study also suggests that utilizing e-commerce meta-data can further improve the description quality.

2 Related Work

Language generation Generating a natural language description from a given image has been an active topic of research in the vision community [21,11,27,12,6,28]. Early attempts used a retrieval-based approach to generate a sentence [21,11], and recently deep-learning approaches have become the popular choice. For example, Vinyals et al. [27] show that they can generate a high-quality language description of a scene image using a combination of a CNN and an RNN. Karpathy et al. [12] also show that their model can generate partial descriptions of given image regions as well as of a whole image. Antol et al. [1] study a model that is able to generate sentences in answer to various questions about given images.

In this paper, we are interested in generating natural language expressions that are rich in attributes. Previous work mostly focuses on natural images, where the major goal is to understand the scene semantics and spatial arrangement and produce an objective description. The closest to our motivation is perhaps the work by Mathews et al. [20], which studies a model that generates more expressive descriptions with sentiment. They build a new dataset by asking crowd workers to re-write descriptions of images in MS-COCO, and report successful generation with sentiment, for instance, beautiful, happy, or great. We take a different approach of utilizing e-commerce data to build an attribute-rich corpus of descriptions.

Attribute recognition Semantic or expressive attributes have been actively studied in the community as ontological entities [16] or localizable visual elements [3], with the expectation that such semantic information can be useful for many applications. In this work, we consider attribute expressions as natural language descriptions that modify an object (specifically, a product) and convey details, possibly with abstract words. The attribute expressions come from an open-world vocabulary in real-world e-commerce data. In that sense, we have a similar spirit to weakly supervised learning [5,9,24]. We propose to use a sequence generation model rather than attempting to learn a classifier over exponentially many combinations of attribute expressions.

Vision in e-commerce Several attempts have been made to apply computer vision to e-commerce applications [19,14,8,7,13], perhaps because of its usefulness in specific scenarios such as a better user experience in retrieval or product recommendation. The earlier work by Berg et al. [3] proposes a method for automatic visual attribute discovery from web data, specifically product images from shopping websites. Our work has the same motivation in that we wish to learn language descriptions of attributes from e-commerce data, though we use a variety of products and try to capture abstract attributes using a language generation model. Very recently, Zakrewsky et al. [30] report an attempt at popularity prediction for products offered on Etsy. The results suggest the potential usefulness of image features for selling strategies, such as advertisement.

3 Language generation model

Our language generation model is based on the combination of a convolutional neural network (CNN) to obtain an image representation and a recurrent neural network (RNN) with LSTM cells to translate the representation into a sequence of words [27,12,32]. In addition to the image input, we also consider feeding e-commerce meta-data into the RNN. In this paper, we utilize the product category as extra information available in the e-commerce scenario, and feed it into the RNN as a one-hot vector. Note that each product can have more than one category, such as a main category and sub-categories, but in this paper we use only the main category for simplicity. Figure 1 illustrates our generation model.

Fig. 1: Our language generation model combining a CNN and an RNN.

Let us denote the feature of the input product image I by z_v = CNN(I), the one-hot vector of the product category in the meta-data by z_c, and the one-hot vector of the word generated at description position t by x_t. Our sequence generator is then expressed by:

    H_in = [1; W_hi [z_v; z_c]; 0]   (t = 1)
    H_in = [1; W_hx x_t; h_{t-1}]    (otherwise)        (1)
    (i, f, o, g) = W_LSTM H_in,                         (2)
    c_t = f ⊙ c_{t-1} + i ⊙ g,                          (3)
    h_t = o ⊙ tanh(c_t),                                (4)
    y_t = softmax(W_oh h_t + b_o),                      (5)

where W_hi, W_hx, W_LSTM, W_oh, and b_o are the weights and biases of the network, which we learn from the dataset. The gates i, f, and o control how much of the current input, the previous memory, and the output are used, and g is the candidate cell input; this gating allows the model to cope with the vanishing gradient problem. We feed the image and category input only at the beginning. The output y_t represents the probability of each word and has length equal to the vocabulary size + 1, where the extra entry represents a special END token indicating the end of a description.

To learn the network, we use the product image, title, and category information. The learning procedure starts by setting h_0 = 0, x_1 to a special START symbol indicating the beginning of the sequence, and the target y_1 to the first word of the ground-truth description. We feed the rest of the ground-truth words until we reach the special END token at the end, and learn the model to maximize the log probability of the descriptions in the dataset. At test time, we first set h_0 = 0 and x_1 to the START token, and feed the image representation z_v together with the category vector z_c. Once we obtain an output, we draw a word according to y_t and use it as the next input, so that x_t is the word predicted at the previous step (e.g., x_2 is drawn from y_1). We repeat the process until we observe the END token.

For training, we use stochastic gradient descent with an initial learning rate of 1.0e-3, which we lower as training proceeds. In this paper, we do not back-propagate the gradient into the CNN; we train the CNN and the RNN separately. We evaluate how different CNN models perform in Section 5.
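To make the test-time procedure concrete, below is a minimal NumPy sketch of generation following Eqs. (1)-(5). The weight shapes, the START/END token ids, and the standard sigmoid/tanh LSTM nonlinearities are assumptions for illustration; this is a sketch rather than the authors' implementation.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def generate(z_v, z_c, W_hi, W_hx, W_lstm, W_oh, b_o, start_id, end_id,
             max_len=20, rng=np.random.default_rng(0)):
    """Sample a title word-by-word until the END token, following Eqs. (1)-(5)."""
    d = W_oh.shape[1]                       # hidden size; W_oh maps h to vocab size + 1
    h, c = np.zeros(d), np.zeros(d)         # h_0 = 0 and an empty memory cell
    x_id, words = start_id, []
    for t in range(1, max_len + 1):
        if t == 1:
            # Eq. (1), t = 1: image and category are fed only at the first step.
            h_in = np.concatenate(([1.0], W_hi @ np.concatenate([z_v, z_c]), np.zeros(d)))
        else:
            # Eq. (1), t > 1: embedding of the previous word (column of W_hx) and h_{t-1}.
            h_in = np.concatenate(([1.0], W_hx[:, x_id], h))
        gates = W_lstm @ h_in                                    # Eq. (2): stacked (i, f, o, g)
        i, f, o = sigmoid(gates[:d]), sigmoid(gates[d:2*d]), sigmoid(gates[2*d:3*d])
        g = np.tanh(gates[3*d:])
        c = f * c + i * g                                        # Eq. (3)
        h = o * np.tanh(c)                                       # Eq. (4)
        y = softmax(W_oh @ h + b_o)                              # Eq. (5)
        x_id = int(rng.choice(len(y), p=y))                      # draw a word according to y_t
        if x_id == end_id:
            break
        words.append(x_id)
    return words
```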

4 Building a corpus from e-commerce data

We collect and build the image-text corpus from the online market site Etsy. We prepare pairs of a product image and title as well as product meta-data suitable for learning attribute expressions. The challenge here is how to choose good descriptions for learning. In this section, we briefly describe the e-commerce data and our approach to extract useful data using syntactic analysis and clustering.


4.1 E-commerce data

Etsy is an online market for hand-made products [31]. Figure 2 shows some examples of product images from the website. Each listing contains various information, such as image, title, detailed description, tags, materials, shop owner, price, etc. We crawled product listings from the website and downloaded over two million product images.

Fig. 2: Product data in the Etsy dataset.

We apply various pre-processing steps to transform the crawled raw data into a corpus useful for learning attribute expressions. Note that this semi-automated approach to building a corpus is distinct from previous language generation efforts, which start from a supervised dataset with clean annotations [18]. Our corpus comes from a real-world market, and as is common for any web data [21,25], the raw listing data from Etsy contain a lot of data that are useless for learning, due to a huge number of near-duplicates and inappropriate language use for search engine optimization. For example, we observed the following titles:

– Army of two Airsoft Paintball BB Softair Gun Prop Helmet Salem Costume Cosplay Goggle Mask Maske Masque jason MA102 et

– teacher notepad set - bird on apple / teacher personalized stationary / personalized stationery / teacher notepad / teacher gift / notepad

Using such raw data to learn a generation model results in poor language quality in the output.

4.2 Syntactic analysis

One common observation in Etsy is that there are fragments of noun phrases, often considered a list of keywords targeting search engine optimization. Although generating search-optimized descriptions could be useful in some applications, we are not aiming at learning keyword fragments in this paper. We attempt to identify such fragmented descriptions by syntactic analysis.

We first apply the Stanford Parser [4] and estimate part-of-speech (POS) tags, such as noun or verb, for each word in the title. In this paper, we define malformed descriptions by the following criteria:

– more than 5 noun phrases in a row, or

– more than 5 special symbols such as slash, dash, and comma.

Figure 3 shows a few accepted and rejected examples from the Etsy data. Note that due to the discrepancy between our Etsy titles and the corpus the Stanford Parser is trained on, we found that even the syntactic parser frequently failed to assign correct POS tags to each word. We did not apply any special pre-processing for such cases, since most of the failed POS tagging resulted in sequences of nouns, which in turn lead to rejection.
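As an illustration of the rejection criteria above, here is a minimal sketch of the filter. It assumes the title has already been tokenized and POS-tagged (for example, by the Stanford Parser); approximating "noun phrases in a row" by a run of noun-tagged tokens, the symbol set, and the function name are simplifications of ours, not the authors' exact implementation.

```python
SPECIAL_SYMBOLS = {"/", "-", ","}  # slash, dash, comma

def is_malformed(tagged_title):
    """tagged_title: list of (word, pos) pairs, e.g. [("teacher", "NN"), ("notepad", "NN"), ...].
    Reject titles with more than 5 noun tokens in a row or more than 5 special symbols."""
    # Criterion 2: count special symbols in the title.
    if sum(1 for word, _ in tagged_title if word in SPECIAL_SYMBOLS) > 5:
        return True
    # Criterion 1: longest run of consecutive noun tags (NN, NNS, NNP, NNPS).
    run = longest = 0
    for _, pos in tagged_title:
        run = run + 1 if pos.startswith("NN") else 0
        longest = max(longest, run)
    return longest > 5
```

A title is kept for the corpus only if is_malformed returns False for its tagged words.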


Fig. 3: Accepted and rejected examples after syntactic analysis. Some products have grammatically invalid titles due to optimization for search engines.

(a) Syntactically accepted:
– peppermint, tangerine, and hemp lip balms
– coffee bag backpack
– green jadeite jade beaded, natural green white color beads size 8mm charm necklace

(b) Syntactically rejected:
– chunky pink and purple butterfly cuddle critter cape set newborn to 3 months photo prop
– travel journal diary notebook sketch book - keep calm and go to canada - ivory
– texas license plate bird house

4.3 Near-duplicate removal

Another e-commerce specific issue is that there is a huge amount of near-duplicates. Near-duplicates are a commonly occurring phenomenon in web data. Here, we define near-duplicate items as products whose titles are similar and differ only in a small part of the description, as shown in Table 1. Those near-duplicates add a strong bias towards specific phrasing and affect the quality of the generated description. Without precautions, the trained model always generates a similar phrase for any kind of image. In Etsy, we observe near-duplicates among the products offered by the same shop and listed in the same kind of categories, such as a series of products under the same category, as shown in Table 1. We find that such textual near-duplicates also exhibit visually similar appearance. Note that near-duplicates can also occur for visually similar items with different descriptions, such as items under the same category but from a different shop. However, we find that such cases are considerably rarer than the textual near-duplicates in Etsy, perhaps due to the nature of a hand-made market where the majority of products are one-of-a-kind.


Table 1: Examples of near-duplicate products.

– CUSTOM iPad SLEEVE 1, 2, New, 3 Black Chevron Lime Green Personalized Monogram
– CUSTOM iPad SLEEVE 1, 2, New, 3 Light Pink Chevron Gray Fancy Script PERSONALIZED Monogram
– CUSTOM iPad SLEEVE 1, 2, New, 3 Black Damask Hot Pink Personalized Monogram
– CUSTOM iPad SLEEVE 1, 2, New, 3 Dark Blue Lattice Lime Green PERSONALIZED Monogram
– CUSTOM iPad SLEEVE 1, 2, New, 3 Blue Diagonal Green PERSONALIZED Monogram
– CUSTOM iPad SLEEVE 1, 2, New, 3 Light Blue Orange Floral Pattern Teal PERSONALIZED Monogram

We automatically identify near-duplicates using the shop identity and the title description. We apply the following procedure to sub-sample product images from the raw online data.

1. Group products if they are sold by the same shop and belong to the same category.

2. For each group, measure the similarity of the descriptions between all pairs within the group.

3. Depending on the pairwise similarity, divide the group into sub-groups by DBSCAN clustering.

4. Randomly sub-sample a pre-defined number of product images from each sub-group.

Our approach is based on the observation that the same shop tends to sell near-duplicates. We divide products into sub-groups to diversify the variation of descriptions, based on thresholding the Jaccard similarity:

    J_G = |S_1 ∩ S_2 ∩ ... ∩ S_n| / |S_1 ∪ S_2 ∪ ... ∪ S_n|,    (6)

where S_i represents the set of words in the i-th title description. Low similarity within a cluster indicates that the group contains a variety of descriptions and consists of subtly different products. We apply Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [10], as implemented in sklearn, to obtain sub-clusters. Figure 4 shows an example of the resulting groups. The purpose of this sub-sampling approach is to trim the excessive amount of similar products while keeping variety.
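A rough sketch of steps 1-3, assuming each product is represented as a dict with hypothetical shop, category, and title fields, could use scikit-learn's DBSCAN on a precomputed pairwise Jaccard distance matrix; the eps and min_samples values below are arbitrary placeholders, not the paper's settings.

```python
from collections import defaultdict
import numpy as np
from sklearn.cluster import DBSCAN

def jaccard(a, b):
    """Jaccard similarity between two sets of title words."""
    return len(a & b) / max(len(a | b), 1)

def near_duplicate_subgroups(products, eps=0.5, min_samples=2):
    """Group by (shop, category), then split each group into sub-groups
    of near-duplicate titles with DBSCAN on 1 - Jaccard similarity."""
    groups = defaultdict(list)
    for p in products:
        groups[(p["shop"], p["category"])].append(p)

    subgroups = []
    for members in groups.values():
        word_sets = [set(p["title"].lower().split()) for p in members]
        n = len(members)
        dist = np.array([[1.0 - jaccard(word_sets[i], word_sets[j])
                          for j in range(n)] for i in range(n)])
        labels = DBSCAN(eps=eps, min_samples=min_samples,
                        metric="precomputed").fit_predict(dist)
        for m, l in zip(members, labels):
            if l == -1:                      # DBSCAN noise points become singletons
                subgroups.append([m])
        for label in set(labels) - {-1}:
            subgroups.append([m for m, l in zip(members, labels) if l == label])
    return subgroups
```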

Fig. 4: Near-duplicate clustering. Our clustering only uses meta-data and textual descriptions to identify near-duplicates, but the resulting clusters tend to be visually coherent.

After clustering, we randomly pick a certain number of products from each cluster. We determine the number of samples per cluster K_G using the following threshold:

    K_G = (1/N) Σ_{k=1}^{N} n_k + σ,    (7)

where N is the total number of groups, n_k is the number of products in group k, and σ is the standard deviation of the distribution of group sizes over all groups. We leave out some of the products if the number of products in a group or a sub-group is far above the average. After the sub-sampling process, all kinds of products should be roughly equally distributed in the corpus.
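Continuing the sketch above, the budget K_G of Eq. (7) and the final sub-sampling step might look as follows; the function operates on the hypothetical sub-groups produced by near_duplicate_subgroups.

```python
import random
import numpy as np

def subsample(subgroups, seed=0):
    """Keep at most K_G = mean group size + one standard deviation products
    per sub-group (Eq. 7), chosen uniformly at random."""
    sizes = np.array([len(g) for g in subgroups])
    k_g = max(1, int(sizes.mean() + sizes.std()))
    rng = random.Random(seed)
    kept = []
    for group in subgroups:
        kept.extend(rng.sample(group, min(k_g, len(group))))
    return kept
```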

Out of the more than 2 million raw Etsy products, we first selected 400k image-text pairs by syntactic analysis, and then applied near-duplicate removal, obtaining approximately 340k product images for our corpus. We take 75% of the images for training and reserve the rest (25%) for testing in our experiment. We picked 100 images from the testing set for human annotation; the remaining test images were not used for quantitative evaluation due to the difficulty of obtaining ground truth.

5 Experiment

We evaluate our learned generation model by measuring how close the generated expressions are to human descriptions. For that purpose, we collect human annotations for a small number of testing images and measure the performance using machine-translation metrics.

5.1 Human annotation

We use crowdsourcing to collect human descriptions of the product images. We designed a crowdsourcing task to describe a given product image; Figure 5 depicts our annotation interface. We asked workers to come up with a descriptive and appealing title to sell the given product. During the annotation task, we provide workers with the original title and the category information to make sure they understand what kind of product they are trying to sell. We used Amazon Mechanical Turk and asked 5 workers per image to collect reference descriptions for 100 test images. As seen in Figure 5, our images are quite different from natural scene images. We often observed phrases rather than complete sentences in the human annotations. This observation suggests that the essence of attribute expression is indeed in the modifiers of the object rather than in the recognition of subject-verb relationships.

Fig. 5: Crowdsourcing task to collect human descriptions.

5.2 Evaluation metrics

We use seven metrics to evaluate the quality of our generated descriptions: BLEU-1 [22], BLEU-2, BLEU-3, BLEU-4, ROUGE [17], METEOR [2], and CIDEr [26]. These metrics have been widely used in natural language processing, such as machine translation and automatic text summarization, and recently in language generation. Although our goal is to produce attribute-aware phrases rather than full sentences, these metrics can be directly used to evaluate our model against the reference human descriptions. BLEU-N evaluates descriptions based on N-gram precision, ROUGE is also N-gram based but oriented towards recall, METEOR originates from machine translation evaluation and is commonly applied to image descriptions, and CIDEr is an N-gram based metric proposed specifically for image descriptions. We use the coco-caption implementation [26] to calculate the above evaluation metrics.
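For reference, a minimal sketch of the scoring step, assuming the coco-caption code is installed as the pycocoevalcap package and that both the generated title and the five crowdsourced references are keyed by image id:

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.cider.cider import Cider

def evaluate(references, candidates):
    """references: {image_id: [five human titles]}, candidates: {image_id: [generated title]}."""
    scorers = [(Bleu(4), ["Bleu_1", "Bleu_2", "Bleu_3", "Bleu_4"]),
               (Rouge(), "ROUGE_L"),
               (Meteor(), "METEOR"),
               (Cider(), "CIDEr")]
    results = {}
    for scorer, name in scorers:
        score, _ = scorer.compute_score(references, candidates)
        if isinstance(name, list):          # Bleu returns one score per n-gram order
            results.update(dict(zip(name, score)))
        else:
            results[name] = score
    return results
```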

5.3 Experimental conditions

Table 2: Evaluation results.

Method              | Bleu1 | Bleu2 | Bleu3        | Bleu4        | Rouge | Meteor | CIDEr
Category+Fine-tune  | 15.1  | 6.55  | 2.58 × 10^-5 | 5.56 × 10^-8 | 12.9  | 4.69   | 11.2
Category+Pre-train  | 9.43  | 3.72  | 1.65 × 10^-5 | 3.74 × 10^-8 | 9.74  | 3.70   | 8.01
Fine-tune           | 8.95  | 3.94  | 2.03 × 10^0  | 3.06 × 10^-4 | 4.98  | 2.24   | 1.88
Pre-train           | 8.77  | 3.26  | 1.50 × 10^-5 | 3.50 × 10^-8 | 9.32  | 3.36   | 6.87
MS-COCO             | 1.01  | 2.13  | 8.30 × 10^-6 | 1.70 × 10^-8 | 8.31  | 2.40   | 2.79

We use AlexNet [15] for the CNN architecture of our generation model, and extract a 4,096-dimensional representation from the fc7 layer given an input image and its product category. In order to see how domain knowledge affects the quality of language generation, we compare a CNN model trained on ImageNet and a model fine-tuned to predict the 32 product categories in the Etsy dataset. Our recurrent neural network consists of a single hidden layer and takes the 4,096-dimensional image feature and a 32-dimensional one-hot category indicator as input. We use the LSTM implementation of [12]. We compare the following models in the experiment.

– Category+Fine-tune: Our proposed model with a fine-tuned CNN and a category vector for the RNN input.

– Category+Pre-train: Our proposed model with a pre-trained CNN and a category vector for the RNN input.

– Fine-tune: A fine-tuned CNN with the RNN, without a category vector.

– Pre-train: A pre-trained CNN with the RNN, without a category vector.

– MS-COCO: A reference CNN+RNN model trained on the MS-COCO dataset [18] without any training on our corpus.

We include the MS-COCO model to evaluate how domain transfer affects the quality of the generated descriptions. Note that the MS-COCO dataset contains more objective descriptions explaining objects, actions, and scenes in the given image.
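As an illustration of the image-side input, here is a hedged sketch of extracting the 4,096-dimensional fc7 activation and building the 32-dimensional category one-hot vector, using torchvision's AlexNet; the paper does not specify a particular framework, so the exact layers and preprocessing below are assumptions.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

alexnet = models.alexnet(pretrained=True).eval()
# fc7 is the second 4096-unit fully connected layer; keep classifier modules up to its ReLU.
fc7 = torch.nn.Sequential(alexnet.features, alexnet.avgpool, torch.nn.Flatten(),
                          *list(alexnet.classifier.children())[:6])

preprocess = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor(),
                        T.Normalize(mean=[0.485, 0.456, 0.406],
                                    std=[0.229, 0.224, 0.225])])

def rnn_inputs(image_path, category_id, num_categories=32):
    """Return the 4096-d image feature z_v and the 32-d one-hot category vector z_c."""
    img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        z_v = fc7(img).squeeze(0)        # 4096-d fc7 activation
    z_c = torch.zeros(num_categories)
    z_c[category_id] = 1.0               # one-hot main category
    return z_v, z_c
```

Fine-tuning would amount to replacing the final 1,000-way ImageNet classifier with a 32-way category classifier and re-training on the Etsy images before extracting fc7 features.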

5.4 Quantitative results

We summarize the performance evaluation in Table 2. Note that all scores are in percentage. Our Category+Fine-tune model achieves the best performance, except for BLEU-3 and BLEU-4, where the Fine-tune model is best. We suspect overfitting occurred in the Fine-tune only case: the model learned to predict certain 3- or 4-word phrases such as Made to order, some of which happened to be unexpectedly correct, resulting in the sudden increase in BLEU. However, we did not observe a similar improvement in the other scores, such as ROUGE or CIDEr. We conjecture this possibly-outlier result can be attributed to BLEU's evaluation method.

From the results, we observe that introducing the category vector has the largest impact on the description quality. We assume that the category information supplements semantic knowledge in the image feature even when the category is not apparent from the product appearance, which results in more stable language generation for difficult images. Note that in the e-commerce scenario, meta-data are often available for free, without expensive manual annotation.

Table 3: Comparison of generated descriptions for three example products.

Example 1
  Original:            Zombie Chicken - Needle Felted Wool Sculpture - MADE TO ORDER
  Category+Fine-tune:  needle felted animal
  Category+Pre-train:  needle felted cat
  Fine-tune:           custom made to order
  Pre-train:           needle felted penguin
  MS-COCO:             a white teddy bear sitting on top of a green field
  Human:               Made-to-Order Felted Wool Zombie Chicken

Example 2
  Original:            Rooted
  Category+Fine-tune:  art print
  Category+Pre-train:  custom made to order
  Fine-tune:           hand made hand painted ric rac trims 11mm wide
  Pre-train:           wedding invitation
  MS-COCO:             a close up of a pair of scissors
  Human:               A beautiful abstract illustration of a star, rooted to the Earth, surrounded by rabbits.

Example 3
  Original:            Fairy Walking Sticks
  Category+Fine-tune:  primitive doll
  Category+Pre-train:  hand knit scarf
  Fine-tune:           custom made to order
  Pre-train:           black and white striped leggings
  MS-COCO:             a man sitting on a bench with a bunch of bananas
  Human:               Enchanting and unique fairy doll with walking stick, the perfect gift for children

The difference between the Pre-train and Fine-tune models shows how domain-specific image features contribute to better learning and help the model generate high-quality descriptions. The result indicates that a pre-trained CNN is not sufficient to capture the visual patterns in the Etsy dataset. The MS-COCO baseline performs significantly worse than the other models, indicating that the general descriptions generated by the MS-COCO model are far from the attribute-centric descriptions of product images. There is a significant difference between descriptions in the MS-COCO dataset and our Etsy corpus: the former tend to be complete, grammatically correct descriptions focusing on the relationships between entities in the scene, whereas Etsy product titles tend to omit a verb and often do not require recognizing the spatial arrangement of multiple entities. A product description can be a fragment of phrases, as seen in the actual data, and a long description can look rather unnatural.

5.5 Qualitative results

Table 3 shows examples of descriptions generated by the different models, together with the original product titles. The Category+Fine-tune model seems to generate better expressions, while the other methods sometimes fail to generate a meaningful description (e.g., custom made to order). The MS-COCO model generates significantly different descriptions and always tries to produce a description explaining the types of objects and the relationships among them.


Fig. 6: Examples of generated descriptions. Our model correctly identifies attribute expressions (left two columns); the rightmost column shows failure cases due to corpus issues. The generated descriptions are: "reclaimed wood coffee table", "handmade journal notebook", "crochet barefoot sandals with flower", and "hand painted ceramic mug" (correct attribute expressions), and "watercolor painting of moai statues at sunset" and "I'm going to be a big brother t-shirt" (failure cases).

Our model generates somewhat attribute-centric expressions such as needle felted or primitive; the latter expression in particular is relatively abstract. These examples confirm that we are able to automatically learn attribute expressions from noisy online data. The generated descriptions tend to be noun phrases. This tendency is probably due to the characteristics of e-commerce data, which contain phrases rather than long, grammatically complete sentences, and our generation results correctly reflect this characteristic. Figure 6 shows some examples of descriptions generated by our model (Category+Fine-tune). Our model predicts attribute expressions such as reclaimed wood or hand-painted ceramic. We observe failures due to corpus quality in some categories: for example, the painting or the printed t-shirt in Figure 6 suffer from a bias towards specific types of products in the category. Sometimes our model gives a better description than the original title. For example, the middle example in Table 3 shows a product entitled Rooted, but it is almost impossible to guess the kind of product from that name, or maybe even from the appearance. Our model produces art print for this example, which makes it much easier to understand the kind of product and is closer to our intuition, even if the result is not accurate.

6 Discussion

In this paper, we used a product image, a title, and category information to generate a description. However, there is other useful information in the e-commerce data, such as tags, materials, or popularity metrics [31]. In particular, a product description is likely to have more detailed information about the product, with many attribute-like expressions containing plenty of abstract or subjective words. The e-commerce dataset looks promising in this respect, since sellers try to attract more customers with "good" phrases that have a ring to them.

If we are able to find attribute expressions in the longer product descriptions, we can expand our image-text corpus to a considerably larger scale. The challenge here is that we then need to identify which part of the description is relevant to the given image, because product descriptions also contain irrelevant information. For example, in Etsy, a product listing often contains text about shipping information. As a preliminary study, we applied a syntactic parser to Etsy product descriptions, but often observed errors in the parse trees due to grammatically broken descriptions in the item listings. Identifying which description is relevant or irrelevant seems like an interesting vision-language problem in the e-commerce scenario.

Finally, in this paper we left the tags and material information in the Etsy item listings unused. These meta-data could be useful to learn a conventional attribute or material classifier given an image, or to identify attribute-specific expressions in the long product description.

7 Conclusion and future work

We studied natural language generation from product images. In order to learn a generation model, we collected product images from the online market Etsy and built a corpus by applying a dataset cleansing procedure based on syntactic analysis and near-duplicate removal. We also collected human descriptions for the evaluation of the generated descriptions. The empirical results suggest that our generation model fine-tuned on Etsy data with categorical input successfully learns from noisy online data and produces the best language expressions for a given product image. The results also indicate a huge gap between the task nature of attribute-centric language generation and general scene description.

In the future, we wish to improve our automatic corpus construction from noisy online data. We have left potentially useful product meta-data unused in this study. We hope to incorporate additional information, such as product descriptions or tags, to improve the language learning process, as well as to realize new applications such as automatic title and keyword suggestion for shop owners. We are also interested in improving the deep learning architecture to generate attribute expressions.

Acknowledgement

This work was supported by JSPS KAKENHI Grant Numbers JP15H05919 and JP15H05318.


References

1. Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In International Conference on Computer Vision (ICCV), 2015.

2. Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, volume 29, pages 65–72, 2005.

3. Tamara L. Berg, Alexander C. Berg, and Jonathan Shih. Automatic attribute discovery and characterization from noisy web data. In ECCV, pages 663–676. Springer, 2010.

4. Danqi Chen and Christopher D. Manning. A fast and accurate dependency parser using neural networks. In EMNLP, pages 740–750, 2014.

5. Xinlei Chen, A. Shrivastava, and A. Gupta. NEIL: Extracting visual knowledge from web data. In ICCV, pages 1409–1416, Dec 2013.

6. Jacob Devlin, Hao Cheng, Hao Fang, Saurabh Gupta, Li Deng, Xiaodong He, Geoffrey Zweig, and Margaret Mitchell. Language models for image captioning: The quirks and what works. Association for Computational Linguistics (ACL), pages 100–105, 2015.

7. Wei Di, Anurag Bhardwaj, Vignesh Jagadeesh, Robinson Piramuthu, and Elizabeth Churchill. When relevance is not enough: Promoting visual attractiveness for fashion e-commerce. arXiv preprint arXiv:1406.3561, 2014.

8. Wei Di, Neel Sundaresan, Robinson Piramuthu, and Anurag Bhardwaj. Is a picture really worth a thousand words? On the role of images in e-commerce. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining, pages 633–642. ACM, 2014.

9. Santosh Divvala, Ali Farhadi, and Carlos Guestrin. Learning everything about anything: Webly-supervised visual concept learning. In CVPR, 2014.

10. Martin Ester, Hans-Peter Kriegel, Jorg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, 1996.

11. Micah Hodosh, Peter Young, and Julia Hockenmaier. Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research, pages 853–899, 2013.

12. Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137, 2015.

13. M. Hadi Kiapour, Xufeng Han, Svetlana Lazebnik, Alexander C. Berg, and Tamara L. Berg. Where to buy it: Matching street clothing photos in online shops. In ICCV, 2015.

14. Adriana Kovashka, Devi Parikh, and Kristen Grauman. WhittleSearch: Image search with relative attribute feedback. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2973–2980. IEEE, 2012.

15. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

16. Christoph H. Lampert, Hannes Nickisch, and Stefan Harmeling. Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(3):453–465, 2014.

17. Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, volume 8, 2004.

18. Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In Computer Vision – ECCV 2014, pages 740–755. Springer, 2014.

19. Si Liu, Zheng Song, Guangcan Liu, Changsheng Xu, Hanqing Lu, and Shuicheng Yan. Street-to-shop: Cross-scenario clothing retrieval via parts alignment and auxiliary set. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3330–3337. IEEE, 2012.

20. Alexander Patrick Mathews, Lexing Xie, and Xuming He. SentiCap: Generating image descriptions with sentiments. CoRR, abs/1510.01431, 2015.

21. Vicente Ordonez, Girish Kulkarni, and Tamara L. Berg. Im2Text: Describing images using 1 million captioned photographs. In Advances in Neural Information Processing Systems, pages 1143–1151, 2011.

22. Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics, 2002.

23. Devi Parikh and Kristen Grauman. Relative attributes. In Dimitris N. Metaxas, Long Quan, Alberto Sanfeliu, and Luc J. Van Gool, editors, ICCV, pages 503–510. IEEE Computer Society, 2011.

24. Chen Sun, Chuang Gan, and Ram Nevatia. Automatic concept discovery from parallel text and visual corpora. In ICCV, pages 2596–2604, 2015.

25. Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. The new data and new challenges in multimedia research. arXiv preprint arXiv:1503.01817, 2015.

26. Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566–4575, 2015.

27. Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164, 2015.

28. Kelvin Xu, Jimmy Ba, Ryan Kiros, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044, 2015.

29. Quanzeng You, Jiebo Luo, Hailin Jin, and Jianchao Yang. Robust image sentiment analysis using progressively trained and domain transferred deep networks. arXiv preprint arXiv:1509.06041, 2015.

30. S. Zakrewsky, K. Aryafar, and A. Shokoufandeh. Item popularity prediction in e-commerce using image quality feature vectors. arXiv e-prints, May 2016.

31. Stephen Zakrewsky, Kamelia Aryafar, and Ali Shokoufandeh. Item popularity prediction in e-commerce using image quality feature vectors. arXiv preprint arXiv:1605.03663, 2016.

32. Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization. CoRR, abs/1409.2329, 2014.